
Data Mining - Beer and Nappies

On Thursday nights people who buy diapers also tend to buy beer

Introduction
Data is growing at a phenomenal rate.
Users expect more sophisticated information.
How? Uncover hidden information: data mining.

Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

Data mining
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.

Data Mining Algorithm


Objective: Fit data to a model
    Descriptive
    Predictive
Preference: Technique to choose the best model
Search: Technique to search the data
    Query

Data Mining
Descriptive
Identify and describe groups of customers with common buying behavior.

Data Mining
Predictive
Given a customer's characteristics, a model predicts how much the customer will spend on the next catalog order.
Predicting the likelihood (probability) that a customer would respond to an offer.



Data Mining Models and Tasks

Data Mining
Association (purchasing a pen and purchasing paper)
Sequence or path analysis (birth of a child and purchasing diapers)
Classification (duct tape purchases and plastic sheeting purchases)
Clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences)
Forecasting (discovering patterns from which one can make reasonable predictions regarding future activities)


Data Mining
Data mining is knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence and computer graphics.
Data mining is the process of semi-automatically analyzing large databases to find interesting and useful patterns.
Data mining overlaps with machine learning, statistics, artificial intelligence and databases.

Goals of Data Mining


Explanatory: To explain some observed event or condition. (Why sales of the Maruti Swift have increased in Chennai.)
Confirmatory: To confirm a hypothesis. (Whether two-income families are more likely to buy family medical coverage than single-income families.)
Exploratory: To analyze data for new or unexpected relationships. (What spending patterns are likely to accompany credit card fraud.)



Issues in data mining


Data quality: the accuracy and completeness of the data being analyzed.
Interoperability: of the data mining software and databases being used by different agencies.
Mission creep: the use of data for purposes other than those for which the data were originally collected.
Privacy.

Advanced forms of Data Mining


Web mining
Spatial mining
Temporal mining


Web mining
Crawlers
    Robot (spider)
    Focused crawler
PageRank
    Backlinks
Personalization
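PageRank scores a page by the backlinks pointing at it. The slide gives no formula, so what follows is only a minimal sketch of the commonly described damped power iteration. The `pagerank` function, the damping value of 0.85 and the four-page `web` graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web: C has three backlinks, D has none.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

In this toy graph the page with the most backlinks (C) ends up with the highest rank, and the page with none (D) with the lowest.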


Spatial mining
Goal: data mining on spatial data
Spatial selection may involve specialized selection comparison operations:
    Near
    North, South, East, West
    Contained in
    Overlap/intersect
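The comparison operations listed above can be sketched for the simplest case, axis-aligned rectangles. This is an illustrative sketch only: the `Rect` class, the function names and the sample regions are assumptions, and real spatial databases work with far richer geometry.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; y grows toward north, x toward east."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)

def near(a, b, max_dist):
    """True if the centers of a and b are within max_dist of each other."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return hypot(ax - bx, ay - by) <= max_dist

def north_of(a, b):
    """True if a lies entirely north of b."""
    return a.ymin >= b.ymax

def contained_in(a, b):
    """True if a lies completely inside b."""
    return (b.xmin <= a.xmin and b.ymin <= a.ymin
            and a.xmax <= b.xmax and a.ymax <= b.ymax)

def overlaps(a, b):
    """True if a and b share any interior area."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)

# Hypothetical regions, invented for the example.
city = Rect(0, 0, 10, 10)
park = Rect(2, 2, 4, 4)
suburb = Rect(8, 12, 14, 18)
```

With these definitions, `contained_in(park, city)` and `north_of(suburb, city)` both hold, while `overlaps(suburb, city)` does not.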


Temporal mining
Goal: data mining for temporal data
Time series
Pattern detection
Sequences
Temporal association rules
HR database


Temporal Database
Snapshot: Traditional database
Temporal: Multiple time points


Types of Database (Temporal)


Snapshot: No temporal support
Transaction Time: Supports time when transaction inserted data
    Timestamp
    Range
Valid Time: Supports time range when data values are valid
Bitemporal: Supports both transaction and valid time
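To make the difference between valid time and transaction time concrete, here is a minimal sketch of a bitemporal record and an "as of" lookup. The `BitemporalRow` class, the `salary_as_of` helper and the HR-style sample rows are hypothetical and not taken from any particular database product.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BitemporalRow:
    employee: str
    salary: int
    valid_from: date   # valid time: start of the range in which the fact is true
    valid_to: date     # valid time: exclusive end of that range
    recorded_on: date  # transaction time: when the row was inserted

# Invented sample data: a raise valid from 1 Jan 2021 but only recorded on 1 Feb 2021.
rows = [
    BitemporalRow("Ann", 50000, date(2020, 1, 1), date(2021, 1, 1), date(2020, 1, 5)),
    BitemporalRow("Ann", 55000, date(2021, 1, 1), date(9999, 12, 31), date(2021, 2, 1)),
]

def salary_as_of(rows, employee, valid_on, known_by):
    """Salary valid on `valid_on`, using only rows recorded on or before `known_by`."""
    matches = [
        r for r in rows
        if r.employee == employee
        and r.valid_from <= valid_on < r.valid_to
        and r.recorded_on <= known_by
    ]
    # Prefer the most recently recorded matching row.
    return max(matches, key=lambda r: r.recorded_on).salary if matches else None
```

A snapshot database would keep only the latest row. The bitemporal version can answer both "what was true on 15 January 2021?" and "what did the database know on 20 January 2021?", and in this sample the second question returns nothing because the raise had not yet been recorded.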

Database Searching vs. Data Mining


Database searching
    Query: Well defined; SQL
    Data: Operational data
    Output: Precise; subset of database

Data mining
    Query: Poorly defined; no precise query language
    Data: Not operational data
    Output: Fuzzy; not a subset of database


Query Examples
Database
    Find all credit applicants with last name of Smith.
    Identify customers who have purchased more than $10,000 in the last month.
    Find all customers who have purchased milk.

Data Mining
    Find all credit applicants who are poor credit risks. (classification)
    Identify customers with similar buying habits. (clustering)
    Find all items which are frequently purchased with milk. (association rules)
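The association-rule query ("items frequently purchased with milk") is usually scored with support and confidence. The following is a minimal sketch under stated assumptions: the baskets are invented toy data, and the function names and the 0.4 support threshold are illustrative choices, not part of the slides.

```python
# Toy market baskets (invented for illustration only).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def frequently_bought_with(item, baskets, min_support=0.4):
    """Items that appear together with `item` in at least `min_support` of all baskets."""
    others = set().union(*baskets) - {item}
    return sorted(o for o in others if support({item, o}, baskets) >= min_support)
```

Here `frequently_bought_with("milk", baskets)` keeps bread and cereal and drops butter, which shares only one basket with milk. Unlike the database query, the answer depends on a threshold rather than on an exact predicate.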


Data Mining vs. KDD


Knowledge Discovery in Databases (KDD): Process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.


KDD Process

Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
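The five KDD steps can be read as a pipeline in which each stage feeds the next. The sketch below is a toy illustration only: the two sources, the record layout and the stand-in "mining" step (a per-customer total) are all assumptions.

```python
# Hypothetical raw records from two sources; all values are made up.
source_a = [{"customer": "Ann", "spend": "120.50"}, {"customer": "Bob", "spend": None}]
source_b = [{"customer": "Cy", "spend": "80"}, {"customer": "Ann", "spend": "30.5"}]

def select(*sources):
    """Selection: obtain data from various sources."""
    return [row for src in sources for row in src]

def preprocess(rows):
    """Preprocessing: cleanse data by dropping rows with missing values."""
    return [r for r in rows if r["spend"] is not None]

def transform(rows):
    """Transformation: convert to a common format (spend as a float)."""
    return [{"customer": r["customer"], "spend": float(r["spend"])} for r in rows]

def mine(rows):
    """Data mining: a per-customer total stands in for a real algorithm."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["spend"]
    return totals

def interpret(totals):
    """Interpretation/evaluation: present results in a readable form."""
    return [f"{name}: {total:.2f}" for name, total in sorted(totals.items())]

report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
```

Data mining is only one stage of the chain. Bob's incomplete record, for example, is removed during preprocessing and never reaches the mining step.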

Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data.
Subject-oriented: Contains information regarding objects of interest for decision support: sales by region, by product, etc.
Integrated: Data are typically extracted from multiple, heterogeneous data sources (e.g., from sales, inventory, billing databases, etc.).
Time-variant: Contains historical data, with a longer horizon than the operational system.
Nonvolatile: Data is not (or rarely) directly updated.

Why build a data warehouse


Access to data from multiple sources, giving a comprehensive data collection
Separate transactional and analysis systems: Improve query response time (without slowing down transaction processing)
Easy formulation of complex queries
Access to historical data (not in operational systems)
Improved data quality (fewer errors and missing values)


Data Warehouse Back-End Tools


Data extraction: Extract data from multiple, heterogeneous, and external sources
Data cleaning (scrubbing): Detect errors in the data and rectify them when possible
Data converting: Convert data from legacy or host format to warehouse format
Transforming: Sort, summarize, compute views, check integrity, and build indices
Refresh: Propagate the updates from the data sources to the warehouse


Database
Application oriented (OLTP)
Used to run business
Clerical user
Detailed data
Current, up to date
Operational data
Repetitive access by small transactions
Fast response time (seconds)
Read/update access
Relational schema

Data Warehouse
Subject oriented (OLAP)
Used to analyze business
Manager/analyst
Summarized and refined
Historical data
Integrated data
Ad-hoc access using large queries
Slow response time (minutes)
Mostly read access (batch update)
Star / snowflake schema

On-Line Analytical Processing OLAP


Front-end to the data warehouse, allowing easy data manipulation
Allows conducting inquiries over the data at various levels of abstraction
Fast and easy because some aggregations are computed in advance
No need to formulate the entire query
OLAP uses data in multidimensional format (e.g., data cubes) to facilitate query and response time
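The idea of aggregations "computed in advance" can be sketched with a tiny data cube in which every roll-up is precomputed and queries become dictionary lookups. The fact rows, dimension names and the `build_cube` and `query` helpers are hypothetical. A real OLAP engine would typically materialize only some of these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, month, sales). Values are invented.
facts = [
    ("North", "Beer", "Jan", 100),
    ("North", "Nappies", "Jan", 40),
    ("South", "Beer", "Jan", 70),
    ("North", "Beer", "Feb", 120),
    ("South", "Nappies", "Feb", 30),
]

def build_cube(facts):
    """Precompute sales totals for every combination of kept and rolled-up dimensions."""
    cube = defaultdict(int)
    for *dims, sales in facts:
        # Each bit of `mask` decides whether a dimension is kept or replaced by "*" (all).
        for mask in range(2 ** len(dims)):
            key = tuple(d if (mask >> i) & 1 else "*" for i, d in enumerate(dims))
            cube[key] += sales
    return dict(cube)

cube = build_cube(facts)

def query(region="*", product="*", month="*"):
    """Answer an aggregate query by lookup instead of rescanning the facts."""
    return cube.get((region, product, month), 0)
```

For example, `query(product="Beer")` rolls up over region and month, and `query(region="North", month="Jan")` rolls up over product only. Neither touches the fact rows at query time.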

Data Mining Vs. Data Warehouse


Data Mining: Application of methods (algorithms) to discover patterns in data. Includes some OLAP operations.
OLAP: Deductive process, testing the existence of hypothetical patterns in data
    Good to explore the data and test hypotheses
Data Mining mostly refers to modeling underlying data
    Uncovering patterns in data
    Potentially surprising patterns may arise
Data Mining methods may use data from a data warehouse (when available)


Data Mining + Data Warehouse


Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence


MDDBMS
The multidimensional data model emerged over the past 10-15 years
MDDBMS is the Rubik's Cube of database management systems
Focuses on analyzing the data, not recording transactions
Data is categorized as either facts with numerical measures, or as dimensions that characterize the facts


MDDBMS
Takes data from many sources, such as RDBMS, legacy systems, etc.
Data is physically stored on disk in a data structure that is highly optimized for multidimensional processing and fast retrieval
Storage is between 2 and 10 times more efficient than RDBMS due to better indexing, compression and representation of sparse data


Benefits
Queries are simply a request to see pre-existing data organized in a specific fashion
Already highly organized, so the requested data is removed and reorganized
Stores information in the same way that it is viewed (less data management and maintenance)


The drawbacks
Not the best solution for every problem
Works only on information with interrelations
Database explosion with large amounts of sparse data (calculating all relationships can increase the database size dramatically)
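A rough piece of arithmetic shows why sparse data leads to database explosion. This is a back-of-the-envelope sketch: the dimension sizes and the count of non-empty cells are invented, and the helper names are illustrative.

```python
from math import prod

def dense_cells(cardinalities):
    """Cells in a fully materialized base cube: one per combination of dimension values."""
    return prod(cardinalities)

def cells_with_aggregates(cardinalities):
    """Cells when every roll-up is also stored: each dimension gains one 'all' member."""
    return prod(c + 1 for c in cardinalities)

def density(non_empty_cells, cardinalities):
    """Fraction of base cells that actually hold data."""
    return non_empty_cells / dense_cells(cardinalities)

# Hypothetical cube: 1,000 products x 100 stores x 365 days, 2 million non-empty cells.
cards = [1000, 100, 365]
non_empty = 2_000_000
```

With these assumed sizes the dense cube has 36,500,000 base cells, of which only about 5% hold data, and storing every roll-up raises the total to 37,002,966 cells. A multidimensional store that does not represent the empty cells compactly would therefore spend most of its space on nothing.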

Example
MDDBMS are an important tool in KM,


Thank You
