
Data Warehousing and Data Mining

Unit 2: Data Preprocessing

By
Varsha Gaikwad
M-Tech (IT)

Contents

Need for Preprocessing the Data
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and Concept Hierarchy Generation
Online Data Storage
Data Mining Primitives
Data Mining Query Languages
Designing Graphical User Interfaces Based on a Data Mining Query Language
Architectures of Data Mining Systems

Need for Preprocessing the Data

Quality
Three elements defining quality: accuracy, completeness, and consistency.

Factors affecting quality:
Faulty instruments
Human or computer errors (disguised missing data)
Errors in data transmission
Technology limitations
Duplicate tuples
Incomplete information

Other dimensions of data quality:
Timeliness
Believability
Interpretability

Forms of Data Preprocessing

Data Cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Where cleaning is needed:
Missing values
Noisy data

How to handle missing values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
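As a rough illustration of strategies 3-5, the pandas sketch below fills a missing numeric attribute with a global constant, with the attribute mean, and with the class-wise mean; the DataFrame, column names, and values are assumptions made only for the example.

```python
import pandas as pd

# Hypothetical tuples with a missing 'income' attribute
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "A", "B"],
    "income": [30000.0, None, 52000.0, None, 28000.0, 50000.0],
})

# Strategy 3: fill with a global constant (a sentinel value)
df["income_const"] = df["income"].fillna(-1)

# Strategy 4: fill with a measure of central tendency (here, the attribute mean)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)

print(df)
```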

Noisy data:
Data smoothing techniques
Binning
Regression
Outlier analysis
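A minimal sketch of smoothing by binning, using equal-frequency (equal-depth) bins and replacing each value by its bin mean; the sample price values are assumptions for illustration.

```python
import pandas as pd

# Sorted sample of a noisy numeric attribute (e.g., price)
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with its bin's mean
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```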

Data Cleaning Process

Step 1: Discrepancy detection
Use any knowledge you may already have regarding properties of the data (metadata)
Field overloading
Data should also be examined regarding unique rules, consecutive rules, and null rules

Discrepancy detection tools:
Data scrubbing tools
Data auditing tools
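As a rough sketch of what a home-grown audit might check, the snippet below applies a unique rule, a null rule, and a simple domain-range check with pandas; the table, column names, and the 0-120 age range are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table to audit for discrepancies
df = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "age":     [25, -3, 40, None],
})

# Unique rule: cust_id should not repeat
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: age should not be missing
missing_age = df[df["age"].isna()]

# Domain check: age must lie in a plausible range
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

print(duplicate_ids, missing_age, out_of_range, sep="\n\n")
```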

Data Cleaning Process

Step 2: Data transformation for discrepant data
e.g., replacing an erroneous string value with the correct one

Transformation tools:
Data migration tools
ETL (extraction/transformation/loading) tools

Data Integration and Transformation

Merging of data from multiple data stores.

There are four issues at the integration level, and their solutions:

1. Entity identification problem
Schema integration problem: use metadata for each attribute
Functional dependency problem: pay attention to the structure of the data

2. Redundancy
Correlation analysis
Nominal data: chi-square test
Numeric attributes: correlation coefficient and covariance
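A minimal sketch of both redundancy checks: a chi-square test for two nominal attributes and the correlation coefficient and covariance for two numeric attributes. The contingency counts and numeric values are illustrative assumptions, and scipy/numpy are used only as convenient implementations.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Nominal attributes: chi-square test on a contingency table
# (rows: one attribute's categories, columns: the other's -- illustrative counts)
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", chi2, "p-value:", p_value)

# Numeric attributes: correlation coefficient and covariance
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
print("correlation:", np.corrcoef(a, b)[0, 1])
print("covariance:", np.cov(a, b)[0, 1])
```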

Issues at the integration level and their solutions (continued)

3. Tuple duplication
Check for duplicate tuples

4. Data value conflicts
Detect and resolve differences in representation, scaling, or encoding for the same real-world entity

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Data reduction strategies

Dimensionality reduction:
Reducing the number of random variables or attributes under consideration
Wavelet transforms
Principal components analysis (PCA)
Attribute subset selection
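As one concrete instance, the sketch below applies principal components analysis with scikit-learn to project a toy data set onto two components; the matrix and the choice of two components are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 tuples described by 4 numeric attributes
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.1, 3.2],
    [2.2, 2.9, 0.4, 1.1],
    [1.9, 2.2, 0.6, 0.9],
    [0.3, 0.5, 2.5, 3.0],
    [2.0, 1.6, 0.7, 1.2],
])

# Project onto the 2 principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): fewer attributes, same tuples
print(pca.explained_variance_ratio_)  # fraction of variance retained per component
```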

Numerosity reduction:
Replace the original data volume by alternative, smaller forms of data representation
Parametric methods (e.g., regression models)
Nonparametric methods (e.g., histograms, clustering, sampling)
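A small sketch of two nonparametric numerosity-reduction ideas, simple random sampling without replacement and an equal-width histogram; the data distribution, sample size, and bucket count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Original data: 10,000 numeric values
data = rng.normal(loc=50, scale=10, size=10_000)

# Simple random sample without replacement (keep 1% of the tuples)
sample = rng.choice(data, size=100, replace=False)

# Equal-width histogram: 20 bucket counts stand in for 10,000 raw values
counts, bin_edges = np.histogram(data, bins=20)

print(sample.shape)   # (100,)
print(counts.sum())   # 10000 values summarized by 20 buckets
```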

Data reduction strategies

Data compression:
To obtain a reduced or compressed representation of the original data
Lossless
Lossy

Discretization and Concept Hierarchy Generation

Data discretization is a form of data transformation where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior).
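A minimal pandas sketch of both styles of discretization for an age attribute; the cut points and the youth/adult/senior boundaries are assumptions chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical ages to be discretized
ages = pd.Series([6, 15, 23, 37, 45, 52, 70])

# Interval labels: equal-width bins such as (0, 10], (10, 20], ...
interval_labels = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 100])

# Conceptual labels: youth, adult, senior
concept_labels = pd.cut(ages, bins=[0, 20, 60, 120],
                        labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages,
                    "interval": interval_labels,
                    "concept": concept_labels}))
```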

Discretization process

Discretization methods can be categorized by how the discretization is performed: whether class information is used (supervised vs. unsupervised) and in which direction it proceeds (top-down splitting vs. bottom-up merging).

Concept Hierarchy Generation

Raw data are replaced by a smaller number of interval or concept labels.

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
2. Specification of a portion of a hierarchy by explicit data grouping
3. Specification of a set of attributes, but not of their partial ordering
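A tiny sketch of method 2 (explicit data grouping), where low-level city values are mapped to a higher-level concept such as a province; the mapping itself is an illustrative assumption.

```python
# Explicit data grouping: map low-level city values to a higher-level region
city_to_region = {
    "Vancouver": "British Columbia",
    "Victoria":  "British Columbia",
    "Toronto":   "Ontario",
    "Ottawa":    "Ontario",
}

cities = ["Toronto", "Victoria", "Ottawa"]
regions = [city_to_region[c] for c in cities]
print(regions)  # ['Ontario', 'British Columbia', 'British Columbia']
```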

Online Data Storage

OLTP [On-line Transaction Processing]
OLAP [On-line Analytical Processing]

Data Mining Primitives

What defines a data mining task?
Task-relevant data
The kinds of knowledge to be mined
Background knowledge
Interestingness measures

Data Mining Query Languages

Syntax of DMQL for mining different kinds of rules:
1. Data generalization
2. Mining characteristic rules
3. Mining discriminant rules
4. Data classification and mining classification rules

Designing Graphical User Interfaces

From DMQL to flexible GUIs

Based on our experience, a data mining GUI may consist of the following functional components:
1. Data collection
2. Presentation
3. Manipulation
4. Interactive multi-level mining
5. Other miscellaneous information

THANK YOU!
