
Introduction

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.
Objectives

• What is data mining?


• Why data mining?
• What applications?
• What techniques?
• What process?
• What software?

27 November 2008 ©GKGupta 2


Definition
Data mining may be defined as follows:
Data mining is a collection of techniques for the efficient,
automated discovery of previously unknown, valid, novel,
useful and understandable patterns in large databases. The
patterns must be actionable so that they may be used in an
enterprise’s decision making.



What is Data Mining?

• Efficient automated discovery of previously unknown
patterns in large volumes of data.
• Patterns must be valid, novel, useful and
understandable.
• Businesses are mostly interested in discovering past
patterns to predict future behaviour.
• A data warehouse, to be discussed later, can be an
enterprise’s memory. Data mining can provide
intelligence using that memory.



Examples

• amazon.com uses associations. Recommendations to
customers are based on past purchases and what other
customers are purchasing.
• The US chain “Just for Feet” has about 200 stores, each
carrying up to 6,000 shoe styles, each style in several
sizes. Data mining is used to find the right shoes to
stock in the right store.
• More examples in case studies to be discussed later.



Data Mining
• We assume we are dealing with large data, perhaps
gigabytes, perhaps terabytes.
• Although data mining is possible with smaller amounts
of data, the bigger the data, the higher the confidence in
any unknown pattern that is discovered.
• There is considerable hype about data mining at
present, and the Gartner Group has listed data mining
as one of the top ten technologies to watch.

Question: How many books could one store in one Terabyte of memory?
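A back-of-envelope answer to the question above, under loudly stated assumptions (a plain-text book of roughly 500 pages is taken as about 1 MB, and 1 TB as 10**12 bytes):

```python
# Rough estimate only: assumes ~1 MB per plain-text book and
# 1 TB = 10**12 bytes; both figures are illustrative assumptions.
bytes_per_book = 1_000_000   # ~1 MB per book (assumption)
terabyte = 10 ** 12          # 1 TB in bytes
books = terabyte // bytes_per_book
print(books)  # 1000000, i.e. about a million books
```

So one terabyte could hold on the order of a million plain-text books.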



Why Data Mining Now?
• Growth in generation and storage of corporate data –
information explosion
• Need for sophisticated decision making – current
database systems are Online Transaction Processing
(OLTP) systems. The OLTP data is difficult to use for
such applications. Why?
• Evolution of technology – much cheaper storage,
easier data collection, better database management, and
better tools for data analysis and understanding.



Information explosion
• Database systems have been in use since the 1960s in
Western countries (perhaps since the 1980s in India).
These systems have generated mountains of data.
• Point of sale terminals and bar codes on many
products, railway bookings, educational institutions,
huge number of mobile phones, electronic commerce,
all generate data.
• Governments are now collecting a lot of information.



Information explosion

• Internet banking via networked computers and
ATMs.
• Credit and debit cards.
• Medical data, doctors, hospitals.
• Transportation: Indian railways, automatic toll
collection on toll roads, growing air travel.
• Passports, NRI visas, other visas, NRI money
transfers.

Question: Can you think of other examples of data collection?



Information explosion
Many adults in India generate:
• Mobile phone transactions. More than 300 million
phones in India, reportedly growing at a rate of
10,000 new connections every hour! Mobile companies
must retain information about calls.
• Growing middle class with growing number of credit
and debit card transactions. About 25m credit cards
and 70m debit cards in 2007. Annual growth rate about
30% and 40% respectively. Could be 55m credit cards
and 200m debit cards in 2010 resulting in perhaps
500m transactions annually.



Information explosion
• India has some huge enterprises, for example Indian
railways, perhaps the busiest network in the world
with 2.5m employees, 10,000 locomotives, 10,000
passenger trains daily, 10,000 freight trains daily and
20m passengers daily.
• Growing airline traffic with more than ten airlines.
Perhaps 30m passengers annually.
• Growing number of motor vehicles – registration,
insurance, driver license
• Internet surfing records



OLTP
As noted earlier, most enterprise database systems were
designed in the 1970s or 1980s, mainly to automate office
procedures, e.g. order entry, student enrolment, patient
registration, airline reservations. These are well-structured,
repetitive operations that are easily automated.



Decision Making
• Need for business memory and intelligence.
• Need to serve customers better by learning from past
interactions.
• OLTP data is not a good basis for maintaining an
enterprise memory.
• The intelligence hidden in data could be the secret
weapon in a competitive business world, but given the
information explosion not even a small fraction of the
data can be examined by the human eye.

Question: Why is OLTP data not good for maintaining an enterprise memory?



OLTP vs Decision Making
The clerical view of data focuses on the details required for
the day-to-day running of an enterprise.

The management view of data focuses on summary data to
identify trends, challenges and opportunities.

The detailed data view is the operational view, while the
management view is the decision-support view.
A comparison of the two views:



Operational vs Management View
Operational              Decision-Support
Users – Admin staff      Users – Management
Day-to-day work          Decision support
Application oriented     Subject oriented
Current data             Historical data
Detailed                 Overall view – summaries
Simple queries           Complex queries
Predetermined queries    Ad hoc queries
Update/Select            Only Select
Real-time                Not real-time



Evolution of Technology
• Corporate data growth accompanied by decline in the
cost of storage and processing.
• PC motherboard performance, measured in MHz/$, is
currently doubling every 27 ± 2 months.
• The next slide, using a logarithmic scale, shows that disk
storage is now about 10 GB per US dollar, and the
following slide shows that sales of disk storage are
growing exponentially.
• Look at computing trends at
http://www.zoology.ubc.ca/~rikblok/ComputingTrends/

Question: What is the cost of a 100 GB disk? What is the cost of a PC and what is
its CPU performance?



Decline in Hard Drive cost



Growth in Worldwide Disk
Capacity
[Chart: worldwide disk storage capacity in petabytes by year, 1996-2003; y-axis 0 to 18,000 PB.]



Evolution of Technology

Question: What do the graphs in the last two slides tell us? What scales are used in
them? What was the pink line in the first graph?



Evolution of Technology
• Database technology has improved over the years.
• Data collection is often much better and cheaper now.
• The need for analyzing and synthesizing information
is growing in today's fiercely competitive business
environment.



New applications
Sophisticated applications of modern enterprises include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling

OLTP is not designed for such applications. Also, large
enterprises operate a number of database systems, so it is
necessary to integrate information for decision-making
applications.

Question: Why can OLTP not be used for sales forecasting and analysis?



Why Data Mining Now?
As noted earlier, the reasons may be summarized as:
• Accumulation of large amounts of data
• Increased affordable computing power enabling data
mining processing
• Statistical and learning algorithms
• Availability of software
• Strong business competition



Large amount of data
Already discussed that many enterprises have large
amounts of data accumulated over 30+ years.

Noted earlier that some enterprises collect information
for analysis; for example, supermarkets in the USA offer
loyalty cards in exchange for shopper information.
Loyalty cards in Australia also collect information
using a reward system.



Growth of cards
A recent survey in the USA found that the percentages of
US adults using the following types of cards were:

• Credit cards - 88%
• ATM cards - 60%
• Membership cards - 58%
• Debit cards - 35%
• Prepaid cards - 35%
• Loyalty cards - 29%
Question: What kind of data do these cards generate?



Affordable computing power
Data mining is usually computationally intensive.
Dramatic reduction in the price of computer systems,
as noted earlier, is making it possible to carry out
data mining without investing huge amounts of
resources in hardware and software.

In spite of affordable computing power, using data
mining can be resource intensive.



Algorithms
A variety of statistical and learning algorithms from
fields like statistics and artificial intelligence have been
adapted for data mining.

With the new focus on data mining, new algorithms are
being developed.



Availability of Software
A large variety of DM software is now available. Some of
the more widely used packages are:
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Darwin (originally from Thinking Machines)
• Angoss - KnowledgeSEEKER



Strong Business Competition
Growth in service economies. Almost every business
is a service business. Service economies are
information rich and very competitive.

Consider the telecommunications environment in
Australia. About 20 years ago, Telstra was a monopoly;
the field is now very competitive. The mobile phone
market in India is also very competitive.



Applications
In finance, telecom, insurance and retail:
– Loan/credit card approval
– Market segmentation
– Fraud detection
– Better marketing
– Trend analysis
– Market basket analysis
– Customer churn
– Web site design and promotion



Loan/Credit card approvals
In a modern society, a bank does not know its
customers personally. The only knowledge a bank has is
the information stored in its computers.

Credit agencies and banks collect a lot of customer
behavioural data from many sources. This information
is used to predict the chances of a customer paying
back a loan.



Market Segmentation

• Large amounts of data about customers contain
valuable information
• The market may be segmented into many subgroups
according to variables that are good discriminators
• Not always easy to find variables that will help in
market segmentation



Fraud Detection
• Very challenging since it is difficult to define
characteristics of fraud. Often based on detecting
changes from the norm.
• In statistics, it is common to throw out the outliers
but in data mining it may be useful to identify them
since they could either be due to errors or perhaps
fraud.
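As a hedged illustration of "detecting changes from the norm", the sketch below flags transaction amounts far from the mean; the amounts and the two-standard-deviation threshold are invented for this toy example, and a real system would use far more robust statistics:

```python
# Toy norm-based outlier flagging: amounts more than two standard
# deviations from the mean are flagged for review (invented data).
amounts = [40, 55, 38, 62, 47, 51, 900]
n = len(amounts)
mean = sum(amounts) / n
std = (sum((a - mean) ** 2 for a in amounts) / n) ** 0.5
outliers = [a for a in amounts if abs(a - mean) > 2 * std]
print(outliers)  # [900]
```

The flagged amount could be an error or could be fraud; either way it deserves a look rather than being thrown away.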



Better Marketing
When customers buy new products, other products may
be suggested to them when they are ready to buy.
As noted earlier, in mail order marketing for example,
one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer return purchase?
- will the customer pay for the purchase?



Better Marketing
It has been reported that more than 1000 variable
values on each customer are held by some mail order
marketing companies.

The aim is to “lift” the response rate.



Trend analysis
In a large company, not all trends are always visible to
the management. It is then useful to use data mining
software that will identify trends.

Trends may be long-term, cyclic or seasonal.



Market Basket Analysis

• Aims to find what the customers buy and what they
buy together
• This may be useful in designing store layouts or in
deciding which items to put on sale
• Basket analysis can also be used for applications
other than just analysing what items customers buy
together



Customer Churn
• In businesses like telecommunications, companies are
trying very hard to keep their good customers and to
perhaps persuade good customers of their competitors
to switch to them.
• In such an environment, businesses want to find
which customers are good, why customers switch and
what makes customers loyal.
• It is cheaper to develop a retention plan and retain an
old customer than to bring in a new customer.



Customer Churn
• The aim is to get to know the customers better so you
will be able to keep them longer.
• Given the competitive nature of businesses,
customers will move if not looked after.
• Also, some businesses may wish to get rid of
customers that cost more than they are worth, e.g.
credit card holders who don't use the card, or bank
customers with very small balances in their
accounts.



Web site design
• A Web site is effective only if the visitors easily find
what they are looking for.
• Data mining can help discover affinity of visitors to
pages and the site layout may be modified based on
this information.



Data Mining Process
Successful data mining involves carefully determining
the aims and selecting appropriate data.
The following steps should normally be followed:
1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation



Requirements Analysis
The enterprise decision makers need to formulate
goals that the data mining process is expected to
achieve. The business problem must be clearly
defined. One cannot use data mining without a good
idea of what kind of outcomes the enterprise is looking
for.
If objectives have been clearly defined, it is easier to
evaluate the results of the project.



Data Selection and Collection
Find the best source databases for the data that is
required. If the enterprise has implemented a data
warehouse, then most of the data could be available
there. Otherwise source OLTP systems need to be
identified and required information extracted and
stored in some temporary system.
In some cases, only a sample of the data available
may be required.



Cleaning and Preparing Data
This may not be an onerous task if a data warehouse
containing the required data exists, since most of this
work will already have been done when data was loaded
into the warehouse.
Otherwise this task can be very resource intensive;
perhaps more than 50% of the effort in a data mining
project is spent on this step. Essentially a data store
that integrates data from a number of databases may
need to be created. When integrating data, one often
encounters problems like identifying data, dealing
with missing data, data conflicts and ambiguity. An
ETL (extraction, transformation and loading) tool may
be used to overcome these problems.
Exploration and Validation
Assuming that the user has access to one or more
data mining tools, a data mining model may be
constructed based on the enterprise’s needs. It may
be possible to take a sample of data and apply a
number of relevant techniques. For each technique
the results should be evaluated and their significance
interpreted.
This is likely to be an iterative process which should
lead to selection of one or more techniques that are
suitable for further exploration, testing and
validation.



Implementing, Evaluating and
Monitoring
Once a model has been selected and validated, the
model can be implemented for use by the decision
makers. This may involve software development for
generating reports or for results visualisation and
explanation for managers.
If more than one technique is available for the given
data mining task, it is necessary to evaluate the
results and choose the best. This may involve checking
the accuracy and effectiveness of each technique.



Implementing, Evaluating and
Monitoring
Regular monitoring of the performance of the
techniques that have been implemented is required.
Every enterprise evolves with time and so must the
data mining system. Monitoring may from time to
time lead to the refinement of the tools and techniques
that have been implemented.



Results Visualisation
Explaining the results of data mining to the decision
makers is an important step. Most DM software
includes data visualisation modules which should be
used in communicating data mining results to the
managers.
Clever data visualisation tools are being developed to
display results that deal with more than two
dimensions. The visualisation tools available should
be tried and used if found effective for the given
problem.



Data Mining Process – Another
Approach
The last few slides presented one approach.
Another approach, which also includes six steps, has
been proposed by CRISP–DM (Cross–Industry
Standard Process for Data Mining), developed by an
industry consortium.
The six steps are:



CRISP–DM Steps
The six CRISP–DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment



CRISP–DM Steps
The six steps proposed in CRISP–DM are similar to
the six steps proposed earlier.
The CRISP–DM steps are shown in the following
figure.

Question: Compare the two sets of steps, one given in previous few slides and the
CRISP-DM approach. Which approach is better?



CRISP Data Mining Model



Data Mining Techniques

• Although data mining is a new field, it uses many
techniques developed years ago in other fields
• Machine learning, statistics, artificial intelligence, etc
• These techniques are in some cases modified to deal
with large amounts of data



Data Mining Techniques
Data mining includes a large number of techniques
including concept/class description, association analysis,
classification and prediction, cluster analysis, outlier
analysis etc.

Expression and visualization of data mining results is a
challenging task.

Privacy issues also need to be considered.



Data Mining Tasks
• Association analysis
• Classification and prediction
• Cluster analysis
• Web data mining
• Search Engines
• Data warehouse and OLAP
• Others, for example, Sequential patterns and Time-
series analysis, not covered in this book



Association Analysis
• Association analysis involves discovery of
relationships or correlations among a set of items.
• For example, discovering that personal loans are repaid
with 80% confidence when the borrower owns their home.
• The classical example is the one where a store
discovered that people buying nappies tend also to
buy beer.



Associations
Association rules are often written as X → Y,
meaning that whenever X appears, Y also tends to
appear. X and Y may be collections of attributes.

A supermarket like Woolworths may have several
thousand items and many millions of transactions a
week (i.e. gigabytes of data each week). Note that the
quantities of the items bought are ignored.
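A minimal sketch of how the support and confidence of a rule X → Y could be computed; the handful of transactions below is invented for illustration, and a real miner would work over millions of baskets:

```python
# Support and confidence for the rule {nappies} -> {beer}
# over a few invented transactions (each a set of items bought).
transactions = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"beer", "bread"},
]
X, Y = {"nappies"}, {"beer"}
n = len(transactions)
support_x = sum(1 for t in transactions if X <= t) / n
support_xy = sum(1 for t in transactions if (X | Y) <= t) / n
confidence = support_xy / support_x   # fraction of X-baskets that also contain Y
print(support_xy, round(confidence, 2))  # 0.5 0.67
```

Here the rule {nappies} → {beer} holds in half of all baskets, and in two thirds of the baskets that contain nappies.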



Classification and Prediction
A set of training objects, each with a number of
attribute values, is given to the classifier. The
classifier formulates rules for each class in the training
set so that the rules may be used to classify new
objects. Some techniques do not require training data.

Classification may be used to predict the class label
of data objects. A number of techniques exist, including
decision trees and neural networks.
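The rule-formulating idea can be sketched with a one-attribute rule learner; the loan-repayment training data below is invented, and real classifiers consider many attributes at once:

```python
# Minimal one-attribute rule learner: for each value of the chosen
# attribute ("home owner?"), predict the majority class seen in the
# class-labelled training data (invented examples).
from collections import Counter

train = [("yes", "repaid"), ("yes", "repaid"),
         ("no", "defaulted"), ("no", "defaulted"), ("no", "repaid")]

by_value = {}
for home_owner, label in train:
    by_value.setdefault(home_owner, []).append(label)

# The rule for each attribute value is the majority class in training.
rule = {v: Counter(labels).most_common(1)[0][0]
        for v, labels in by_value.items()}
print(rule)  # {'yes': 'repaid', 'no': 'defaulted'}
```

The learned rule can then classify a new object by looking up its attribute value.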



Cluster Analysis
Similar to classification in that the aim is to build
clusters such that each of them is similar within itself
but dissimilar to the others. Clustering, however, does
not rely on class-labeled data objects.

Clustering is based on the principle of maximizing the
intracluster similarity and minimizing the intercluster
similarity.
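The principle can be sketched with a bare-bones one-dimensional k-means (k = 2); the points, the naive initialisation and the fixed iteration count are illustrative assumptions:

```python
# 1-D k-means sketch: assign each point to its nearest centre, then
# recompute each centre as the mean of its cluster (invented points).
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
centres = [points[0], points[3]]  # naive initialisation
for _ in range(10):
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters]
print(sorted(round(c, 2) for c in centres))  # [1.0, 8.13]
```

The two centres settle on the two natural groups in the data, with high similarity inside each cluster and low similarity between them.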



Web data mining
The Web revolution has had a profound impact on the
way we search and find information at home and at
work. From its beginning in the early 1990s, the web
has grown to more than ten billion pages in 2008
(estimates vary), perhaps even more by the time you
are looking at this slide. Web usage, Web content and
Web structure are discussed in Chapter 5.



Search engines
Normally the search engine databases of Web pages
are built and updated automatically by Web crawlers.
When one searches the Web using one of the search
engines, one is not searching the entire Web. Instead
one is only searching the database that has been
compiled by the search engine. There are a number of
challenging problems related to search engines that
are discussed in Chapter 6 including how to assign a
ranking to each Web page that is retrieved in response
to a user query.
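One well-known ranking idea can be sketched as link-based power iteration, in the spirit of PageRank; the three-page link graph and the 0.85 damping factor below are illustrative assumptions, not an example from the book:

```python
# Toy link-based ranking by power iteration: a page is important if
# important pages link to it (invented 3-page web, damping d = 0.85).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85
for _ in range(50):
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new
print(max(rank, key=rank.get))  # C
```

Page C ranks highest because both A and B link to it; real search engines combine such link scores with many other signals.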



Data Warehousing and OLAP
Data warehousing is a process by which an enterprise
collects data from the whole enterprise to build a
single version of the truth. This information is useful
for decision makers and may also be used for data
mining. A data warehouse can be of real help in data
mining since data cleaning and other problems of
collecting data would have already been overcome.

OLAP (Online Analytical Processing) tools are
decision-support tools that are often built on top of a
data warehouse or another database. OLAP goes
further than traditional query and report tools in that
the decision maker already has a hypothesis which
he/she is trying to test.



Data Warehousing and OLAP
Data mining is somewhat different from OLAP since
in data mining a hypothesis is not being tested.
Instead data mining is used to uncover novel patterns
in the data.



Before Data Mining
To define a data mining task, one needs to answer the
following:
• What data set do I want to mine?
• What kind of knowledge do I want to mine?
• What background knowledge could be useful?
• How do I measure if the results are interesting?
• How do I display what I have discovered?



Task-relevant Data
The whole database may not be required since it may
be that we only want to study something specific e.g.
trends in postgraduate students
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
May need to build a database subset before data
mining can be done.



Task-relevant Data

Data collection is non-trivial.

OLTP data is not directly useful since it is changing all
the time. In some cases, data from more than one
database may be needed.



Preprocessing
• A data mining process would normally involve
preprocessing
• Often data mining applications use data warehousing
• One approach is to pre-mine the data, warehouse it,
then carry out data mining
• The process is usually iterative and can take years of
effort for a large project



Data Preprocessing
• Preprocessing is very important although often
considered too mundane to be taken seriously
• Preprocessing may also be needed after the data
warehouse phase
• Data reduction may be needed to transform very
high dimensional data into lower dimensional data



Data Preprocessing

• Feature Selection
• Use sampling?
• Normalization
• Smoothing
• Dealing with duplicates, missing data
• Dealing with time-dependent data
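Two of the steps above, handling missing data and normalization, might look like this minimal sketch; the ages are invented, and mean-filling plus min-max scaling are just one possible choice for each step:

```python
# Fill missing values with the column mean, then min-max
# normalise to [0, 1] (invented data; one choice among many).
ages = [23, 45, None, 31, 60]
known = [a for a in ages if a is not None]
mean = sum(known) / len(known)
filled = [mean if a is None else a for a in ages]
lo, hi = min(filled), max(filled)
normalised = [(a - lo) / (hi - lo) for a in filled]
print(normalised[0], normalised[-1])  # 0.0 1.0
```

After normalisation all values lie in [0, 1], so no single attribute dominates distance-based mining techniques.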



Background knowledge

Background information may be useful in the discovery
process.

For example, concept hierarchies or relationships
between data may be useful in data mining. For
postgraduate degrees, we may wish to look at all
Masters degrees and all doctorate degrees separately.



Measuring interest

The data mining process may generate many patterns.
We cannot look at all of them, so we need some way to
separate uninteresting results from the interesting
ones.

This may be based on the simplicity of a pattern, rule
length, or level of confidence.



Visualization

We must be able to display results so that they are easy
to understand.

The display may be a graph, pie chart, tables etc. Some
displays are better than others for a given kind of
knowledge.



Guidelines for Successful Data
Mining
• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of
ordinary query or OLAP tools
• The results must be actionable



Guidelines for Successful Data
Mining
1. Use a small team with a strong internal integration
and a loose management style.
2. Carry out a small pilot project before a major data
mining project.
3. Identify a clear problem owner responsible for the
project. This could be someone in sales or marketing,
which will benefit the external integration.

Question: Why is each of the above guidelines important for success?



Guidelines for Successful Data
Mining
4. Try to realise a positive return on investment within
6 to 12 months.
5. The whole data mining project should have the
support of the top management of the company.

Question: Why is each of the above guidelines important for success?



Data Mining Software
As noted earlier, a large variety of DM software is now
available. Some of the more widely used packages are:
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Darwin (originally from Thinking Machines)
• Angoss - KnowledgeSEEKER



Choosing Data Mining Software

Many factors need to be considered when purchasing significant software:
• Product and vendor information
• Total cost of ownership
• Performance
• Functionality and modularity
• Training and support
• Reporting facilities and visualization
• Usability

Question: Which one of the above is the most important? Why?



References
• D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, MIT
Press, 2001.

• J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann, 2001. The Web site for this book is
http://www.cs.sfu.ca/~han/DM_Book.

• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations, Morgan Kaufmann,
2000. The Web site for this book is www.mkp.com/datamining.

• V. Dhar and R. Stein, Seven Methods for Transforming Corporate
Data into Business Intelligence, Prentice Hall, 1997.



References
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy
(eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT
Press, 1996.

• M. S. Chen, J. Han and P. S. Yu, Data Mining: An Overview from a
Database Perspective, IEEE Transactions on Knowledge and Data
Engineering, 8(6), pp 866-883, 1996.

• M. Berry and G. Linoff, Data Mining Techniques for Marketing,
Sales and Support, John Wiley & Sons, 1997.

• M. Berry and G. Linoff, Mastering Data Mining, John Wiley &
Sons, 1999.

