
Introduction

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.
Objectives

• What is data mining?


• Why data mining?
• What applications?
• What techniques?
• What process?
• What software?

27 November 2008 ©GKGupta 2


Definition
Data mining may be defined as follows:
Data mining is a collection of techniques for the efficient,
automated discovery of previously unknown, valid, novel,
useful and understandable patterns in large databases. The
patterns must be actionable so that they may be used in an
enterprise’s decision making.



What is Data Mining?

• Efficient automated discovery of previously unknown
patterns in large volumes of data.
• Patterns must be valid, novel, useful and
understandable.
• Businesses are mostly interested in discovering past
patterns to predict future behaviour.
• A data warehouse, to be discussed later, can be an
enterprise’s memory. Data mining can provide
intelligence using that memory.



Examples

• amazon.com uses associations. Recommendations to
customers are based on past purchases and what other
customers are purchasing.
• The US chain “Just for Feet” has about 200 stores, each
carrying up to 6,000 shoe styles, each style in several
sizes. Data mining is used to find the right shoes to
stock in the right store.
• More examples in case studies to be discussed later.



Data Mining
• We assume we are dealing with large data, perhaps
gigabytes, perhaps terabytes.
• Although data mining is possible with smaller amounts
of data, the bigger the data, the higher the confidence in
any unknown pattern that is discovered.
• There is considerable hype about data mining at
present, and the Gartner Group has listed data mining
as one of the top ten technologies to watch.

Question: How many books could one store in one Terabyte of memory?
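A back-of-envelope answer to the question above, under loudly stated assumptions (a plain-text book of roughly 500 pages is taken as about 1 MB, and 1 TB as 10**12 bytes):

```python
# Rough estimate only: assumes ~1 MB per plain-text book and
# 1 TB = 10**12 bytes; both figures are illustrative assumptions.
bytes_per_book = 1_000_000   # ~1 MB per book (assumption)
terabyte = 10 ** 12          # 1 TB in bytes
books = terabyte // bytes_per_book
print(books)  # 1000000, i.e. about a million books
```

So one terabyte could hold on the order of a million plain-text books.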



Why Data Mining Now?
• Growth in generation and storage of corporate data –
information explosion
• Need for sophisticated decision making – current
database systems are Online Transaction Processing
(OLTP) systems. The OLTP data is difficult to use for
such applications. Why?
• Evolution of technology – much cheaper storage,
easier data collection, better database management, and
better tools for data analysis and understanding.



Information explosion
• Database systems have been in use since the 1960s in
Western countries (perhaps since the 1980s in India).
These systems have generated mountains of data.
• Point of sale terminals and bar codes on many
products, railway bookings, educational institutions,
huge number of mobile phones, electronic commerce,
all generate data.
• Governments are now collecting a lot of information.



Information explosion

• Internet banking via networked computers and
ATMs.
• Credit and debit cards.
• Medical data, doctors, hospitals.
• Transportation: Indian railways, automatic toll
collection on toll roads, growing air travel.
• Passports, NRI visas, other visas, NRI money
transfers.

Question: Can you think of other examples of data collection?



Information explosion
Many adults in India generate:
• Mobile phone transactions. More than 300 million
phones in India, reportedly growing at a rate of
10,000 new connections every hour! Mobile companies
must retain information about calls.
• Growing middle class with growing number of credit
and debit card transactions. About 25m credit cards
and 70m debit cards in 2007. Annual growth rate about
30% and 40% respectively. Could be 55m credit cards
and 200m debit cards in 2010 resulting in perhaps
500m transactions annually.



Information explosion
• India has some huge enterprises, for example Indian
railways, perhaps the busiest network in the world
with 2.5m employees, 10,000 locomotives, 10,000
passenger trains daily, 10,000 freight trains daily and
20m passengers daily.
• Growing airline traffic with more than ten airlines.
Perhaps 30m passengers annually.
• Growing number of motor vehicles – registration,
insurance, driver license
• Internet surfing records



OLTP
As noted earlier, most enterprise database systems were
designed in the 1970s or 1980s, mainly to automate office
procedures, e.g. order entry, student enrolment, patient
registration, airline reservations. These are well-structured,
repetitive operations that are easily automated.



Decision Making
• Need for business memory and intelligence.
• Need to serve customers better by learning from past
interactions.
• OLTP data is not a good basis for maintaining an
enterprise memory.
• The intelligence hidden in data could be the secret
weapon in a competitive business world, but given the
information explosion not even a small fraction of the
data can be examined by the human eye.

Question: Why is OLTP data not good for maintaining an enterprise memory?



OLTP vs Decision Making
The clerical view of data focuses on the details required for
the day-to-day running of an enterprise.

The management view of data focuses on summary data to
identify trends, challenges and opportunities.

The detailed data view is the operational view, while the
management view is the decision-support view.
A comparison of the two views:



Operational vs Management View
Operational              Decision-Support
Users – Admin staff      Users – Management
Day-to-day work          Decision support
Application oriented     Subject oriented
Current data             Historical data
Detailed                 Overall view – summaries
Simple queries           Complex queries
Predetermined queries    Ad hoc queries
Update/Select            Only Select
Real-time                Not real-time



Evolution of Technology
• Corporate data growth accompanied by decline in the
cost of storage and processing.
• PC motherboard performance, measured in MHz/$, is
currently doubling every 27 ± 2 months.
• The next slide, using a logarithmic scale, shows that disk
storage is now about 10 GB per US dollar, and the
following slide shows that sales of disk storage are
growing exponentially.
• Look at computing trends at
http://www.zoology.ubc.ca/~rikblok/ComputingTrends/

Question: What is the cost of a 100 GB disk? What is the cost of a PC and what is
its CPU performance?



Decline in Hard Drive cost



Growth in Worldwide Disk
Capacity
[Chart: worldwide disk storage capacity in petabytes by year, 1996-2003; y-axis 0 to 18,000 PB.]



Evolution of Technology

Question: What do the graphs in the last two slides tell us? What scales are used in
them? What was the pink line in the first graph?



Evolution of Technology
• Database technology has improved over the years.
• Data collection is often much better and cheaper now.
• The need for analyzing and synthesizing information
is growing in today's fiercely competitive business
environment.



New applications
Sophisticated applications of modern enterprises include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling

OLTP is not designed for such applications. Also, large
enterprises operate a number of database systems, so it is
necessary to integrate information for decision-making
applications.

Question: Why can OLTP not be used for sales forecasting and analysis?



Why Data Mining Now?
As noted earlier, the reasons may be summarized as:
• Accumulation of large amounts of data
• Increased affordable computing power enabling data
mining processing
• Statistical and learning algorithms
• Availability of software
• Strong business competition



Large amount of data
Already discussed that many enterprises have large
amounts of data accumulated over 30+ years.

Noted earlier that some enterprises collect information
for analysis; for example, supermarkets in the USA offer
loyalty cards in exchange for shopper information.
Loyalty cards in Australia also collect information
using a reward system.



Growth of cards
A recent survey in the USA found that the percentages of
US adults using the following types of cards were:

• Credit cards - 88%
• ATM cards - 60%
• Membership cards - 58%
• Debit cards - 35%
• Prepaid cards - 35%
• Loyalty cards - 29%
Question: What kind of data do these cards generate?



Affordable computing power
Data mining is usually computationally intensive.
Dramatic reduction in the price of computer systems,
as noted earlier, is making it possible to carry out
data mining without investing huge amounts of
resources in hardware and software.

In spite of affordable computing power, using data
mining can be resource intensive.



Algorithms
A variety of statistical and learning algorithms from
fields like statistics and artificial intelligence have been
adapted for data mining.

With the new focus on data mining, new algorithms are
being developed.



Availability of Software
A large variety of DM software is now available. Some of
the more widely used packages are:
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Darwin (originally from Thinking Machines)
• Angoss - KnowledgeSEEKER



Strong Business Competition
Growth in service economies. Almost every business
is a service business. Service economies are
information rich and very competitive.

Consider the telecommunications environment in
Australia. About 20 years ago, Telstra was a monopoly;
the field is now very competitive. The mobile phone
market in India is also very competitive.



Applications
In finance, telecom, insurance and retail:
– Loan/credit card approval
– Market segmentation
– Fraud detection
– Better marketing
– Trend analysis
– Market basket analysis
– Customer churn
– Web site design and promotion



Loan/Credit card approvals
In a modern society, a bank does not know its
customers personally. The only knowledge a bank has is
the information stored in its computers.

Credit agencies and banks collect a lot of customer
behavioural data from many sources. This information
is used to predict the chances of a customer paying
back a loan.



Market Segmentation

• Large amounts of data about customers contain
valuable information
• The market may be segmented into many subgroups
according to variables that are good discriminators
• Not always easy to find variables that will help in
market segmentation



Fraud Detection
• Very challenging since it is difficult to define
characteristics of fraud. Often based on detecting
changes from the norm.
• In statistics, it is common to throw out the outliers
but in data mining it may be useful to identify them
since they could either be due to errors or perhaps
fraud.
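As a hedged illustration of "detecting changes from the norm", the sketch below flags transaction amounts far from the mean; the amounts and the two-standard-deviation threshold are invented for this toy example, and a real system would use far more robust statistics:

```python
# Toy norm-based outlier flagging: amounts more than two standard
# deviations from the mean are flagged for review (invented data).
amounts = [40, 55, 38, 62, 47, 51, 900]
n = len(amounts)
mean = sum(amounts) / n
std = (sum((a - mean) ** 2 for a in amounts) / n) ** 0.5
outliers = [a for a in amounts if abs(a - mean) > 2 * std]
print(outliers)  # [900]
```

The flagged amount could be an error or could be fraud; either way it deserves a look rather than being thrown away.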



Better Marketing
When customers buy new products, other products may
be suggested to them when they are ready to buy.
As noted earlier, in mail order marketing for example,
one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer return purchase?
- will the customer pay for the purchase?



Better Marketing
It has been reported that more than 1000 variable
values on each customer are held by some mail order
marketing companies.

The aim is to “lift” the response rate.



Trend analysis
In a large company, not all trends are always visible to
the management. It is then useful to use data mining
software that will identify trends.

Trends may be long-term, cyclic or seasonal.



Market Basket Analysis

• Aims to find what the customers buy and what they
buy together
• This may be useful in designing store layouts or in
deciding which items to put on sale
• Basket analysis can also be used for applications
other than just analysing what items customers buy
together



Customer Churn
• In businesses like telecommunications, companies are
trying very hard to keep their good customers and to
perhaps persuade good customers of their competitors
to switch to them.
• In such an environment, businesses want to find
which customers are good, why customers switch and
what makes customers loyal.
• It is cheaper to develop a retention plan and retain an
old customer than to bring in a new customer.



Customer Churn
• The aim is to get to know the customers better so you
will be able to keep them longer.
• Given the competitive nature of businesses,
customers will move if not looked after.
• Also, some businesses may wish to get rid of
customers that cost more than they are worth, e.g.
credit card holders who don't use the card, or bank
customers with very small balances in their
accounts.



Web site design
• A Web site is effective only if the visitors easily find
what they are looking for.
• Data mining can help discover affinity of visitors to
pages and the site layout may be modified based on
this information.



Data Mining Process
Successful data mining involves carefully determining
the aims and selecting appropriate data.
The following steps should normally be followed:
1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation



Requirements Analysis
The enterprise decision makers need to formulate
goals that the data mining process is expected to
achieve. The business problem must be clearly
defined. One cannot use data mining without a good
idea of what kind of outcomes the enterprise is looking
for.
If objectives have been clearly defined, it is easier to
evaluate the results of the project.



Data Selection and Collection
Find the best source databases for the data that is
required. If the enterprise has implemented a data
warehouse, then most of the data could be available
there. Otherwise source OLTP systems need to be
identified and required information extracted and
stored in some temporary system.
In some cases, only a sample of the data available
may be required.



Cleaning and Preparing Data
This may not be an onerous task if a data warehouse
containing the required data exists, since most of this
work will already have been done when data was loaded
into the warehouse.
Otherwise this task can be very resource intensive;
perhaps more than 50% of the effort in a data mining
project is spent on this step. Essentially a data store
that integrates data from a number of databases may
need to be created. When integrating data, one often
encounters problems like identifying data, dealing
with missing data, data conflicts and ambiguity. An
ETL (extraction, transformation and loading) tool may
be used to overcome these problems.
Exploration and Validation
Assuming that the user has access to one or more
data mining tools, a data mining model may be
constructed based on the enterprise’s needs. It may
be possible to take a sample of data and apply a
number of relevant techniques. For each technique
the results should be evaluated and their significance
interpreted.
This is likely to be an iterative process which should
lead to selection of one or more techniques that are
suitable for further exploration, testing and
validation.



Implementing, Evaluating and
Monitoring
Once a model has been selected and validated, the
model can be implemented for use by the decision
makers. This may involve software development for
generating reports or for results visualisation and
explanation for managers.
If more than one technique is available for the given
data mining task, it is necessary to evaluate the
results and choose the best. This may involve checking
the accuracy and effectiveness of each technique.



Implementing, Evaluating and
Monitoring
Regular monitoring of the performance of the
techniques that have been implemented is required.
Every enterprise evolves with time and so must the
data mining system. Monitoring may from time to
time lead to the refinement of the tools and techniques
that have been implemented.



Results Visualisation
Explaining the results of data mining to the decision
makers is an important step. Most DM software
includes data visualisation modules which should be
used in communicating data mining results to the
managers.
Clever data visualisation tools are being developed to
display results that deal with more than two
dimensions. The visualisation tools available should
be tried and used if found effective for the given
problem.



Data Mining Process – Another
Approach
The last few slides presented one approach.
Another approach, which also includes six steps, has
been proposed by CRISP–DM (Cross–Industry
Standard Process for Data Mining), developed by an
industry consortium.
The six steps are:



CRISP–DM Steps
The six CRISP–DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment



CRISP–DM Steps
The six steps proposed in CRISP–DM are similar to
the six steps proposed earlier.
The CRISP–DM steps are shown in the following
figure.

Question: Compare the two sets of steps, one given in previous few slides and the
CRISP-DM approach. Which approach is better?



CRISP Data Mining Model



Data Mining Techniques

• Although data mining is a new field, it uses many
techniques developed years ago in other fields
• Machine learning, statistics, artificial intelligence, etc
• These techniques are in some cases modified to deal
with large amounts of data



Data Mining Techniques
Data mining includes a large number of techniques
including concept/class description, association analysis,
classification and prediction, cluster analysis, outlier
analysis etc.

Expression and visualization of data mining results is a
challenging task.

Privacy issues also need to be considered.



Data Mining Tasks
• Association analysis
• Classification and prediction
• Cluster analysis
• Web data mining
• Search Engines
• Data warehouse and OLAP
• Others, for example, Sequential patterns and Time-
series analysis, not covered in this book



Association Analysis
• Association analysis involves discovery of
relationships or correlations among a set of items.
• For example, discovering that personal loans are repaid
with 80% confidence when the borrower owns their home.
• The classical example is the one where a store
discovered that people buying nappies tend also to
buy beer.



Associations
Association rules are often written as X → Y,
meaning that whenever X appears, Y also tends to
appear. X and Y may be collections of attributes.

A supermarket like Woolworths may have several
thousand items and many millions of transactions a
week (i.e. gigabytes of data each week). Note that the
quantities of the items bought are ignored.
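A minimal sketch of how the support and confidence of a rule X → Y could be computed; the handful of transactions below is invented for illustration, and a real miner would work over millions of baskets:

```python
# Support and confidence for the rule {nappies} -> {beer}
# over a few invented transactions (each a set of items bought).
transactions = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"beer", "bread"},
]
X, Y = {"nappies"}, {"beer"}
n = len(transactions)
support_x = sum(1 for t in transactions if X <= t) / n
support_xy = sum(1 for t in transactions if (X | Y) <= t) / n
confidence = support_xy / support_x   # fraction of X-baskets that also contain Y
print(support_xy, round(confidence, 2))  # 0.5 0.67
```

Here the rule {nappies} → {beer} holds in half of all baskets, and in two thirds of the baskets that contain nappies.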



Classification and Prediction
A set of training objects, each with a number of
attribute values, is given to the classifier. The
classifier formulates rules for each class in the training
set so that the rules may be used to classify new
objects. Some techniques do not require training data.

Classification may be used to predict the class label
of data objects. A number of techniques exist, including
decision trees and neural networks.
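The rule-formulating idea can be sketched with a one-attribute rule learner; the loan-repayment training data below is invented, and real classifiers consider many attributes at once:

```python
# Minimal one-attribute rule learner: for each value of the chosen
# attribute ("home owner?"), predict the majority class seen in the
# class-labelled training data (invented examples).
from collections import Counter

train = [("yes", "repaid"), ("yes", "repaid"),
         ("no", "defaulted"), ("no", "defaulted"), ("no", "repaid")]

by_value = {}
for home_owner, label in train:
    by_value.setdefault(home_owner, []).append(label)

# The rule for each attribute value is the majority class in training.
rule = {v: Counter(labels).most_common(1)[0][0]
        for v, labels in by_value.items()}
print(rule)  # {'yes': 'repaid', 'no': 'defaulted'}
```

The learned rule can then classify a new object by looking up its attribute value.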



Cluster Analysis
Similar to classification in that the aim is to build
clusters such that each of them is similar within itself
but dissimilar to the others. Clustering, however, does
not rely on class-labeled data objects.

Clustering is based on the principle of maximizing the
intracluster similarity and minimizing the intercluster
similarity.
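The principle can be sketched with a bare-bones one-dimensional k-means (k = 2); the points, the naive initialisation and the fixed iteration count are illustrative assumptions:

```python
# 1-D k-means sketch: assign each point to its nearest centre, then
# recompute each centre as the mean of its cluster (invented points).
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
centres = [points[0], points[3]]  # naive initialisation
for _ in range(10):
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters]
print(sorted(round(c, 2) for c in centres))  # [1.0, 8.13]
```

The two centres settle on the two natural groups in the data, with high similarity inside each cluster and low similarity between them.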



Web data mining
The Web revolution has had a profound impact on the
way we search and find information at home and at
work. From its beginning in the early 1990s, the web
has grown to more than ten billion pages in 2008
(estimates vary), perhaps even more by the time you
are looking at this slide. Web usage, Web content and
Web structure are discussed in Chapter 5.



Search engines
Normally the search engine databases of Web pages
are built and updated automatically by Web crawlers.
When one searches the Web using one of the search
engines, one is not searching the entire Web. Instead
one is only searching the database that has been
compiled by the search engine. There are a number of
challenging problems related to search engines that
are discussed in Chapter 6 including how to assign a
ranking to each Web page that is retrieved in response
to a user query.
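One well-known ranking idea can be sketched as link-based power iteration, in the spirit of PageRank; the three-page link graph and the 0.85 damping factor below are illustrative assumptions, not an example from the book:

```python
# Toy link-based ranking by power iteration: a page is important if
# important pages link to it (invented 3-page web, damping d = 0.85).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85
for _ in range(50):
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new
print(max(rank, key=rank.get))  # C
```

Page C ranks highest because both A and B link to it; real search engines combine such link scores with many other signals.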



Data Warehousing and OLAP
Data warehousing is a process by which an enterprise
collects data from the whole enterprise to build a
single version of the truth. This information is useful
for decision makers and may also be used for data
mining. A data warehouse can be of real help in data
mining since data cleaning and other problems of
collecting data would have already been overcome.

OLAP (Online Analytical Processing) tools are
decision-support tools that are often built on top of a
data warehouse or another database. OLAP goes
further than traditional query and report tools in that
the decision maker already has a hypothesis which
he/she is trying to test.



Data Warehousing and OLAP
Data mining is somewhat different from OLAP since
in data mining a hypothesis is not being tested.
Instead data mining is used to uncover novel patterns
in the data.



Before Data Mining
To define a data mining task, one needs to answer the
following:
• What data set do I want to mine?
• What kind of knowledge do I want to mine?
• What background knowledge could be useful?
• How do I measure if the results are interesting?
• How do I display what I have discovered?



Task-relevant Data
The whole database may not be required since it may
be that we only want to study something specific e.g.
trends in postgraduate students
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
May need to build a database subset before data
mining can be done.



Task-relevant Data

Data collection is non-trivial.

OLTP data is not directly useful since it is changing all
the time. In some cases, data from more than one
database may be needed.



Preprocessing
• A data mining process would normally involve
preprocessing
• Often data mining applications use data warehousing
• One approach is to pre-mine the data, warehouse it,
then carry out data mining
• The process is usually iterative and can take years of
effort for a large project



Data Preprocessing
• Preprocessing is very important although often
considered too mundane to be taken seriously
• Preprocessing may also be needed after the data
warehouse phase
• Data reduction may be needed to transform very
high dimensional data into lower dimensional data



Data Preprocessing

• Feature Selection
• Use sampling?
• Normalization
• Smoothing
• Dealing with duplicates, missing data
• Dealing with time-dependent data
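Two of the steps above, handling missing data and normalization, might look like this minimal sketch; the ages are invented, and mean-filling plus min-max scaling are just one possible choice for each step:

```python
# Fill missing values with the column mean, then min-max
# normalise to [0, 1] (invented data; one choice among many).
ages = [23, 45, None, 31, 60]
known = [a for a in ages if a is not None]
mean = sum(known) / len(known)
filled = [mean if a is None else a for a in ages]
lo, hi = min(filled), max(filled)
normalised = [(a - lo) / (hi - lo) for a in filled]
print(normalised[0], normalised[-1])  # 0.0 1.0
```

After normalisation all values lie in [0, 1], so no single attribute dominates distance-based mining techniques.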



Background knowledge

Background information may be useful in the discovery
process.

For example, concept hierarchies or relationships
between data may be useful in data mining. For
postgraduate degrees, we may wish to look at all
Masters degrees and all doctorate degrees separately.



Measuring interest

The data mining process may generate many patterns.
We cannot look at all of them, so we need some way to
separate uninteresting results from the interesting
ones.

This may be based on the simplicity of a pattern, rule
length, or level of confidence.



Visualization

We must be able to display results so that they are easy
to understand.

The display may be a graph, pie chart, tables etc. Some
displays are better than others for a given kind of
knowledge.



Guidelines for Successful Data
Mining
• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of
ordinary query or OLAP tools
• The results must be actionable



Guidelines for Successful Data
Mining
1. Use a small team with a strong internal integration
and a loose management style.
2. Carry out a small pilot project before a major data
mining project.
3. Identify a clear problem owner responsible for the
project. This could be someone in sales or marketing,
which will benefit the external integration.

Question: Why is each of the above guidelines important for success?



Guidelines for Successful Data
Mining
4. Try to realise a positive return on investment within
6 to 12 months.
5. The whole data mining project should have the
support of the top management of the company.

Question: Why is each of the above guidelines important for success?



Data Mining Software
As noted earlier, a large variety of DM software is now
available. Some of the more widely used packages are:
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Darwin (originally from Thinking Machines)
• Angoss - KnowledgeSEEKER



Choosing Data Mining Software

Many factors need to be considered when purchasing significant software:
• Product and vendor information
• Total cost of ownership
• Performance
• Functionality and modularity
• Training and support
• Reporting facilities and visualization
• Usability

Question: Which one of the above is the most important? Why?



References
• D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, MIT
Press, 2001.

• J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann, 2001. The Web site for this book is
http://www.cs.sfu.ca/~han/DM_Book.

• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations, Morgan Kaufmann,
2000. The Web site for this book is www.mkp.com/datamining.

• V. Dhar and R. Stein, Seven Methods for Transforming Corporate
Data into Business Intelligence, Prentice Hall, 1997.



References
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy
(eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT
Press, 1996.

• M. S. Chen, J. Han and P. S. Yu, Data Mining: An Overview from a
Database Perspective, IEEE Transactions on Knowledge and Data
Engineering, 8(6), pp 866-883, 1996.

• M. Berry and G. Linoff, Data Mining Techniques for Marketing,
Sales and Support, John Wiley & Sons, 1997.

• M. Berry and G. Linoff, Mastering Data Mining, John Wiley &
Sons, 1999.

