On Thursday nights people who buy diapers also tend to buy beer
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information. How?
Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
Data mining
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.
Data Mining
Descriptive
Identify and describe groups of customers with
Data Mining
Predictive
Given a customer's characteristics, a model predicts how much the customer will spend on the next catalog order.
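A minimal sketch of a predictive model, assuming a single customer characteristic (number of past orders) and a least-squares line as the model. The data and the names `history`, `fit_line`, and `predict_spend` are hypothetical and only illustrate the idea; real models typically use many characteristics.

```python
# Hypothetical history: (number of past orders, amount spent on the next order).
history = [(1, 20.0), (2, 40.0), (3, 60.0), (4, 80.0)]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(history)


def predict_spend(past_orders):
    """Predict spending on the next order from the fitted line."""
    return slope * past_orders + intercept


print(predict_spend(5))
```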
Data Mining
Association (purchasing a pen and purchasing paper)
Sequence or path analysis (birth of a child and purchasing diapers)
Classification (duct tape purchases and plastic sheeting purchases)
Clustering: finding and visually documenting groups of previously unknown facts
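The pen-and-paper association can be quantified with the standard support and confidence measures. The sketch below assumes a tiny hypothetical set of baskets; the function names are illustrative, not taken from any particular library.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"pen", "paper"},
    {"pen", "paper", "ink"},
    {"pen"},
    {"paper", "stapler"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of consequent given antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


pen_paper_support = support({"pen", "paper"}, baskets)
pen_to_paper_confidence = confidence({"pen"}, {"paper"}, baskets)
print(pen_paper_support, pen_to_paper_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with a pen also contain paper.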
Data Mining
Data mining is knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics.
Data mining is the process of semi-automatically analyzing large databases to find interesting and useful patterns.
Data mining overlaps with machine learning, statistics, artificial intelligence, and databases.
condition.
Confirmatory: To confirm a hypothesis (for example, whether two-income families are more likely to buy family medical coverage than single-income families).
Exploratory: To analyze data for new or unexpected relationships.
Web mining
Crawlers
Robot (spider)
Focused crawler
PageRank
Backlinks
Personalization
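To make the PageRank and backlinks items concrete, here is a minimal power-iteration sketch. It assumes a tiny hypothetical link graph in which every link target is also a key of `links`; the damping factor of 0.85 is the commonly quoted default, and the iteration count is an arbitrary choice.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank along links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: C has two backlinks (from A and B), B has none.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The page with the most backlinks ends up with the highest rank, and the page with none ends up with the lowest.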
Spatial mining
Goal: data mining on spatial data
Spatial selection may involve specialized selection comparison operations:
Near
North, South, East, West
Contained in
Overlap/intersect
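The comparison operations above can be sketched on axis-aligned rectangles. This is an illustration under simplifying assumptions: the `Rect` type, the choice of center-to-center distance for "near", and the convention that y grows toward the north are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def center(self):
        return ((self.min_x + self.max_x) / 2, (self.min_y + self.max_y) / 2)


def contained_in(inner, outer):
    """True if inner lies entirely within outer."""
    return (outer.min_x <= inner.min_x and inner.max_x <= outer.max_x
            and outer.min_y <= inner.min_y and inner.max_y <= outer.max_y)


def overlaps(a, b):
    """True if the two rectangles share at least one point."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def near(a, b, max_distance):
    """True if the centers are within max_distance of each other."""
    return math.dist(a.center(), b.center()) <= max_distance


def north_of(a, b):
    """True if a lies entirely north of b (y is assumed to grow northward)."""
    return a.min_y >= b.max_y


city = Rect(0, 0, 10, 10)
park = Rect(2, 2, 4, 4)
lake = Rect(8, 8, 12, 12)
suburb = Rect(0, 12, 10, 20)
print(contained_in(park, city), overlaps(lake, city), north_of(suburb, city))
```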
Temporal mining
Goal: data mining for temporal data
Time series
Pattern detection
Sequences
Temporal association rules
HR database
Temporal Database
Snapshot: Traditional database
Temporal: Multiple time points
data values are valid
Bitemporal: Supports both transaction and valid time
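A rough sketch of what supporting both times can look like, assuming a hypothetical salary table in which each row carries a valid-time interval (when the value holds in the real world) and a transaction time (when the row was recorded). The `SalaryFact` type and `salary_as_of` function are illustrative only; a real bitemporal database would also close out superseded rows.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalaryFact:
    employee: str
    salary: int
    valid_from: date    # valid time: start of the period the value holds
    valid_to: date      # valid time: end of the period (exclusive)
    recorded_on: date   # transaction time: when the row was entered


FOREVER = date(9999, 12, 31)

rows = [
    SalaryFact("Smith", 50000, date(2020, 1, 1), FOREVER, date(2020, 1, 5)),
    # A raise effective 2021-01-01 that was only entered on 2021-02-01.
    SalaryFact("Smith", 55000, date(2021, 1, 1), FOREVER, date(2021, 2, 1)),
]


def salary_as_of(rows, employee, valid_on, known_on):
    """Salary valid on valid_on, according to what was recorded by known_on."""
    candidates = [
        r for r in rows
        if r.employee == employee
        and r.valid_from <= valid_on < r.valid_to
        and r.recorded_on <= known_on
    ]
    if not candidates:
        return None
    # The most recently recorded matching row wins.
    return max(candidates, key=lambda r: r.recorded_on).salary


print(salary_as_of(rows, "Smith", date(2021, 1, 15), date(2021, 1, 20)))
print(salary_as_of(rows, "Smith", date(2021, 1, 15), date(2021, 3, 1)))
```

The two calls ask about the same real-world date but differ in what the database knew at the time of asking.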
Query Examples
Database
Find all credit applicants with last name of Smith.
Identify customers who have purchased more than $10,000 in the last month.
Find all customers who have purchased milk.
Data Mining
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (clustering)
Find all items which are frequently purchased with milk. (association rules)
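The contrast between the two kinds of query can be sketched on the milk example. The purchase records and the `min_count` threshold below are hypothetical; the point is only that the database query applies an exact predicate, while the mining query has to count co-occurrences and apply a notion of "frequent".

```python
from collections import Counter

# Hypothetical purchase records.
purchases = [
    {"customer": "Smith", "items": {"milk", "bread"}},
    {"customer": "Jones", "items": {"milk", "bread", "eggs"}},
    {"customer": "Lee", "items": {"milk", "eggs"}},
    {"customer": "Smith", "items": {"beer"}},
]

# Database-style query: an exact predicate with a well-defined answer.
milk_buyers = sorted({p["customer"] for p in purchases if "milk" in p["items"]})


def bought_with(item, purchases, min_count=2):
    """Mining-style query: items that co-occur with item at least min_count times."""
    counts = Counter()
    for p in purchases:
        if item in p["items"]:
            counts.update(p["items"] - {item})
    return {other: c for other, c in counts.items() if c >= min_count}


print(milk_buyers)
print(bought_with("milk", purchases))
```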
Knowledge Discovery in Databases (KDD): Process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
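The five steps above can be sketched as a chain of small functions. Everything here is a toy assumption: the two hypothetical sources, the rule that a missing quantity counts as dirty data, and the use of a simple unit total as the "mining" step.

```python
from collections import Counter

# Hypothetical raw sources with inconsistent formatting and a missing value.
store_sales = [{"item": "Milk ", "qty": "2"}, {"item": "bread", "qty": "1"}]
web_sales = [{"item": "milk", "qty": None}, {"item": "MILK", "qty": "3"}]


def select(*sources):
    """Selection: gather rows from all sources."""
    return [row for source in sources for row in source]


def preprocess(rows):
    """Preprocessing: drop rows with missing quantities."""
    return [r for r in rows if r["qty"] is not None]


def transform(rows):
    """Transformation: normalize item names and convert quantities to int."""
    return [{"item": r["item"].strip().lower(), "qty": int(r["qty"])} for r in rows]


def mine(rows):
    """Data mining (toy stand-in): total units sold per item."""
    totals = Counter()
    for r in rows:
        totals[r["item"]] += r["qty"]
    return totals


def interpret(totals):
    """Interpretation: present results in a readable form, best sellers first."""
    return [f"{item}: {qty} units" for item, qty in totals.most_common()]


report = interpret(mine(transform(preprocess(select(store_sales, web_sales)))))
print(report)
```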
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data.
Subject-oriented: Contains information regarding objects of interest for decision support: sales by region, by product, etc.
Integrated: Data are typically extracted from multiple, heterogeneous data sources (e.g., from sales, inventory, and billing databases).
Time-variant: Contains historical data, with a longer horizon than the operational system.
Nonvolatile: Data is not (or rarely) directly updated.
comprehensive data collection.
Separate transactional and analysis systems:
Improve query response time (without slowing down transaction processing)
Easy formulation of complex queries
Access to historical data (not in operational systems)
Improved data quality (fewer errors and missing values)
heterogeneous, and external sources
Data cleaning (scrubbing): Detect errors in the data and rectify them when possible
Data converting: Convert data from legacy or host format to warehouse format
Transforming: Sort, summarize, compute views, check integrity, and build indices
Refresh: Propagate the updates from the data sources to the warehouse
Database
Application oriented (OLTP)
Used to run business
Clerical user
Detailed data
Current, up to date
Operational data
Repetitive access by small transactions
Fast response time (seconds)
Read/update access
Relational schema
Data Warehouse
Subject oriented (OLAP)
Used to analyze business
Manager/analyst
Summarized and refined
Historical data
Integrated data
Ad-hoc access using large queries
Slow response time (minutes)
Mostly read access (batch update)
Star / snowflake schema
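The difference in access patterns can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: a single in-memory table stands in for both systems, and the `sales` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("North", 2022, 100.0), ("North", 2023, 150.0), ("South", 2023, 200.0)],
)

# OLTP-style access: a small transaction that reads and updates one detailed row.
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (120.0, 1))
one_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (1,)).fetchone()

# OLAP-style access: a read-only query that summarizes many historical rows.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(one_row, by_region)
conn.close()
```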
data
MDDBMS
Multidimensional data model emerged over the past 10-15 years
MDDBMS is the Rubik's Cube of database management systems
Focuses on analyzing the data, not recording transactions
Data is categorized as either facts with numerical measures, or as dimensions that characterize the facts
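A small sketch of the fact/dimension split, assuming hypothetical sales facts with three dimensions (region, product, quarter) and one numerical measure (units sold). The `roll_up` helper is illustrative: it sums the measure over every dimension that is not kept.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "quarter")

# Each fact: coordinates along the dimensions, plus a numerical measure.
facts = [
    (("East", "pen", "Q1"), 10),
    (("East", "paper", "Q1"), 20),
    (("West", "pen", "Q1"), 5),
    (("West", "pen", "Q2"), 15),
]


def roll_up(facts, keep):
    """Aggregate the measure over all dimensions not listed in keep."""
    indexes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for coords, measure in facts:
        totals[tuple(coords[i] for i in indexes)] += measure
    return dict(totals)


by_region = roll_up(facts, ("region",))
by_product_quarter = roll_up(facts, ("product", "quarter"))
print(by_region)
print(by_product_quarter)
```

Viewing the same facts by region or by product and quarter is a matter of choosing which dimensions to keep, much like turning the cube to look at a different face.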
MDDBMS
Takes data from many sources, such as RDBMS, legacy systems, etc.
Data is physically stored on disk in a data structure that is highly optimized for multidimensional processing and fast retrieval
Storage is between 2 and 10 times more efficient than RDBMS due to better indexing, compression, and representation of sparse data
Benefits
Queries are simply a request to see pre-existing data organized in a specific fashion.
Already highly organized, so the requested data is removed and reorganized
Stores information in the same way that it is viewed (less data management and maintenance)
The drawbacks
Not the best solution for every problem
Works only on information with interrelations
Database explosion with large amounts of sparse data (calculating all relationships can increase the database size dramatically).
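A back-of-the-envelope sketch of why sparse data leads to database explosion: the number of possible cells is the product of the dimension sizes, which can dwarf the number of facts actually recorded. The dimension sizes and fact count below are hypothetical.

```python
from math import prod


def cube_cells(cardinalities):
    """Number of cells in a cube with the given dimension sizes."""
    return prod(cardinalities)


def density(fact_count, cardinalities):
    """Fraction of cells that actually hold a fact."""
    return fact_count / cube_cells(cardinalities)


# Hypothetical dimensions: 1000 products, 500 stores, 365 days.
cardinalities = [1000, 500, 365]
recorded_facts = 1_000_000

print(cube_cells(cardinalities))
print(density(recorded_facts, cardinalities))
```

Under these assumptions, fewer than one cell in a hundred holds data, so materializing every combination would store far more than the facts themselves.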
Example
MDDBMS are an important tool in KM,
Thank You