
Data Warehousing and Data Mining

Unit 2: Data Preprocessing

By
Varsha Gaikwad
M-Tech (IT)

Contents

Need for Preprocessing the Data
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and Concept Hierarchy Generation
Online Data Storage
Data Mining Primitives
Data Mining Query Languages
Designing Graphical User Interfaces Based on a Data Mining Query Language
Architectures of Data Mining Systems

Need for Preprocessing the Data

Quality
Three elements defining quality: accuracy, completeness, and consistency.

Factors affecting quality:
Faulty instruments
Human or computer errors (disguised missing data)
Errors in data transmission
Technology limitations
Duplicate tuples
Incomplete information

Other dimensions of data quality:
Timeliness
Believability
Interpretability

Forms of Data Preprocessing

Data Cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Where cleaning is needed:
Missing values
Noisy data

How to handle missing values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
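As a rough illustration of strategies 3-5, the pandas sketch below fills a missing numeric attribute with a global constant, with the attribute mean, and with the class-wise mean; the DataFrame, column names, and values are assumptions made only for the example.

```python
import pandas as pd

# Hypothetical tuples with a missing 'income' attribute
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "A", "B"],
    "income": [30000.0, None, 52000.0, None, 28000.0, 50000.0],
})

# Strategy 3: fill with a global constant (a sentinel value)
df["income_const"] = df["income"].fillna(-1)

# Strategy 4: fill with a measure of central tendency (here, the attribute mean)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)

print(df)
```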

Noisy data:
Data smoothing techniques
Binning
Regression
Outlier analysis
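A minimal sketch of smoothing by binning, using equal-frequency (equal-depth) bins and replacing each value by its bin mean; the sample price values are assumptions for illustration.

```python
import pandas as pd

# Sorted sample of a noisy numeric attribute (e.g., price)
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with its bin's mean
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```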

Data Cleaning Process

Step 1: Discrepancy detection
Use any knowledge you may already have regarding properties of the data (metadata)
Field overloading
Data should also be examined regarding unique rules, consecutive rules, and null rules

Discrepancy detection tools:
Data scrubbing tools
Data auditing tools
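As a rough sketch of what a home-grown audit might check, the snippet below applies a unique rule, a null rule, and a simple domain-range check with pandas; the table, column names, and the 0-120 age range are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table to audit for discrepancies
df = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "age":     [25, -3, 40, None],
})

# Unique rule: cust_id should not repeat
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: age should not be missing
missing_age = df[df["age"].isna()]

# Domain check: age must lie in a plausible range
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

print(duplicate_ids, missing_age, out_of_range, sep="\n\n")
```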

Data Cleaning Process

Step 2: Data transformation for discrepant data
e.g., replacing an erroneous string value with the correct one

Transformation tools:
Data migration tools
ETL (extraction/transformation/loading) tools

Data Integration and Transformation

Merging of data from multiple data stores.

There are four issues at the integration level, and their solutions:

1. Entity identification problem
Schema integration problem: use metadata for each attribute
Functional dependency problem: pay attention to the structure of the data

2. Redundancy
Correlation analysis
Nominal data: chi-square test
Numeric attributes: correlation coefficient and covariance
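A minimal sketch of both redundancy checks: a chi-square test for two nominal attributes and the correlation coefficient and covariance for two numeric attributes. The contingency counts and numeric values are illustrative assumptions, and scipy/numpy are used only as convenient implementations.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Nominal attributes: chi-square test on a contingency table
# (rows: one attribute's categories, columns: the other's -- illustrative counts)
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", chi2, "p-value:", p_value)

# Numeric attributes: correlation coefficient and covariance
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
print("correlation:", np.corrcoef(a, b)[0, 1])
print("covariance:", np.cov(a, b)[0, 1])
```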

Issues at the integration level and their solutions (continued)

3. Tuple duplication
Check for duplicate tuples

4. Data value conflicts
Detect and resolve differences in representation, scaling, or encoding for the same real-world entity

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Data reduction strategies

Dimensionality reduction:
Reducing the number of random variables or attributes under consideration
Wavelet transforms
Principal components analysis (PCA)
Attribute subset selection
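As one concrete instance, the sketch below applies principal components analysis with scikit-learn to project a toy data set onto two components; the matrix and the choice of two components are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 tuples described by 4 numeric attributes
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.1, 3.2],
    [2.2, 2.9, 0.4, 1.1],
    [1.9, 2.2, 0.6, 0.9],
    [0.3, 0.5, 2.5, 3.0],
    [2.0, 1.6, 0.7, 1.2],
])

# Project onto the 2 principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): fewer attributes, same tuples
print(pca.explained_variance_ratio_)  # fraction of variance retained per component
```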

Numerosity reduction:
Replace the original data volume by alternative, smaller forms of data representation
Parametric methods (e.g., regression models)
Nonparametric methods (e.g., histograms, clustering, sampling)
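A small sketch of two nonparametric numerosity-reduction ideas, simple random sampling without replacement and an equal-width histogram; the data distribution, sample size, and bucket count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Original data: 10,000 numeric values
data = rng.normal(loc=50, scale=10, size=10_000)

# Simple random sample without replacement (keep 1% of the tuples)
sample = rng.choice(data, size=100, replace=False)

# Equal-width histogram: 20 bucket counts stand in for 10,000 raw values
counts, bin_edges = np.histogram(data, bins=20)

print(sample.shape)   # (100,)
print(counts.sum())   # 10000 values summarized by 20 buckets
```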

Data reduction strategies

Data compression:
To obtain a reduced or compressed representation of the original data
Lossless
Lossy

Discretization and Concept Hierarchy Generation

Data discretization is a form of data transformation where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior).
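A minimal pandas sketch of both styles of discretization for an age attribute; the cut points and the youth/adult/senior boundaries are assumptions chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical ages to be discretized
ages = pd.Series([6, 15, 23, 37, 45, 52, 70])

# Interval labels: equal-width bins such as (0, 10], (10, 20], ...
interval_labels = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 100])

# Conceptual labels: youth, adult, senior
concept_labels = pd.cut(ages, bins=[0, 20, 60, 120],
                        labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages,
                    "interval": interval_labels,
                    "concept": concept_labels}))
```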

Discretization process

Discretization methods can be categorized by how the discretization is performed: whether class information is used (supervised vs. unsupervised) and in which direction it proceeds (top-down splitting vs. bottom-up merging).

Concept Hierarchy Generation

Raw data are replaced by a smaller number of interval or concept labels.

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
2. Specification of a portion of a hierarchy by explicit data grouping
3. Specification of a set of attributes, but not of their partial ordering
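A tiny sketch of method 2 (explicit data grouping), where low-level city values are mapped to a higher-level concept such as a province; the mapping itself is an illustrative assumption.

```python
# Explicit data grouping: map low-level city values to a higher-level region
city_to_region = {
    "Vancouver": "British Columbia",
    "Victoria":  "British Columbia",
    "Toronto":   "Ontario",
    "Ottawa":    "Ontario",
}

cities = ["Toronto", "Victoria", "Ottawa"]
regions = [city_to_region[c] for c in cities]
print(regions)  # ['Ontario', 'British Columbia', 'British Columbia']
```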

Online Data Storage

OLTP [On-line Transaction Processing]
OLAP [On-line Analytical Processing]

Data Mining Primitives

What defines a data mining task?
Task-relevant data
The kinds of knowledge to be mined
Background knowledge
Interestingness measures

Data Mining Query Languages

Syntax of DMQL for mining different kinds of rules:
1. Data generalization
2. Mining characteristic rules
3. Mining discriminant rules
4. Data classification and mining classification rules

Designing Graphical User Interfaces

From DMQL to flexible GUIs

Based on our experience, a data mining GUI may consist of the following functional components:
1. Data collection
2. Presentation
3. Manipulation
4. Interactive multi-level mining
5. Other miscellaneous information

THANK YOU!
