
Data Mining: Concepts and Techniques (3rd ed.)

Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Quality: Why Preprocess the Data?

There are many factors that determine data quality.

Measures of data quality include:

  Accuracy: correct or wrong, accurate or not

  Completeness: not recorded, unavailable

  Consistency: some data modified but some not

  Timeliness: updated in a timely way?

  Believability: how much the data are trusted by users

  Interpretability: how easily the data can be understood

Major Tasks in Data Preprocessing

Data cleaning

  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration

  Integration of multiple databases or files

Data reduction

  Dimensionality reduction

  Numerosity reduction

  Data compression

Data transformation and data discretization

  Normalization

  Aggregation

Forms of Data Preprocessing


Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Cleaning

Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error

  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

    e.g., Occupation = "" (missing data)

  noisy: containing noise, errors, or outliers

    e.g., Salary = "−10" (an error)

  inconsistent: containing discrepancies in codes or names, e.g.,

    Age = "42", Birthday = "03/07/2010"

    Was rating "1, 2, 3", now rating "A, B, C"

  intentional (e.g., disguised missing data)

    Jan. 1 as everyone's birthday?

Incomplete (Missing) Data

Data is not always available

  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

  equipment malfunction

  data inconsistent with other recorded data and thus deleted

  data not entered due to misunderstanding

  certain data not considered important at the time of entry

  history or changes of the data not registered

Missing data may need to be inferred

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

Fill in the missing value manually: tedious + infeasible

Fill it in automatically with

  a global constant: e.g., "unknown", a new class?!

  the attribute mean

  the attribute mean for all samples belonging to the same class: smarter

  the most probable value: inference-based, such as a Bayesian formula or decision tree
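A minimal sketch of the automatic fill-in strategies above, assuming pandas is available; the column names and values are hypothetical illustration data.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [42000.0, None, 58000.0, None, 61000.0],
    "class":  ["A", "A", "B", "B", "B"],
})

# 1. Fill with a global constant (a sentinel value)
df["income_const"] = df["income"].fillna(-1)

# 2. Fill with the attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```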

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

  faulty data collection instruments

  data entry problems

  data transmission problems

  technology limitation

  inconsistency in naming convention

How to Handle Noisy Data?

Binning

  first sort data and partition into (equal-frequency) bins

  then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

Regression

  smooth by fitting the data to regression functions

Clustering

  detect and remove outliers

Combined computer and human inspection

  detect suspicious values and check by human (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
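A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sorted values are hypothetical illustration data.

```python
import numpy as np

data = np.sort(np.array([4.0, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)   # three equal-frequency (equal-depth) bins

# Smoothing by bin means: every value in a bin becomes the bin mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer bin boundary
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]

print([b.tolist() for b in bins])
print([m.tolist() for m in by_means])
print([s.tolist() for s in by_bounds])
```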

Data Cleaning as a Process

Data discrepancy detection

  Use metadata (e.g., domain, range, dependency, distribution)

  Check field overloading

  Check uniqueness rule, consecutive rule, and null rule

  Use commercial tools

    Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections

    Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

Data migration and integration

  Data migration tools: allow transformations to be specified

  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Integration of the two processes

  Iterative and interactive (e.g., Potter's Wheel)

Exercise


Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Integration

Data integration:

  Combines data from multiple sources into a coherent store

Entity identification problem:

  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton, Cust-id = Cust-#

Data value conflicts:

  For the same real-world entity, attribute values from different sources are different

  Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases

  The same attribute or object may have different names in different databases

  One attribute may be a derived attribute in another table (e.g., age derived from a birth-date attribute)

Redundant attributes may be detected by correlation analysis and covariance analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

χ² (chi-square) test:

  \chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}

  \mathrm{Expected}_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}

The χ² statistic tests the hypothesis that A and B are independent, i.e., that there is no correlation between them

The test is based on a significance level with (r − 1)(c − 1) degrees of freedom

If the hypothesis can be rejected, then we say that A and B are statistically correlated

The larger the χ² value, the more likely the variables are related

Correlation does not imply causality

Chi-Square Calculation: An Example

               male        female       Sum (row)
  fiction      250 (90)    200 (360)    450
  non-fiction   50 (210)   1000 (840)   1050
  Sum (col.)   300         1200         1500

χ² (chi-square) calculation (numbers in parentheses are the expected counts, calculated based on the data distribution in the two categories):

  \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

Chi-Square Calculation: An Example (cont.)

               male        female       Sum (row)
  fiction      250 (90)    200 (360)    450
  non-fiction   50 (210)   1000 (840)   1050
  Sum (col.)   300         1200         1500

For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (from the χ² distribution table)

Since the computed value (507.93) is above this, we can reject the hypothesis that gender and preferred reading are independent

We can conclude that the two attributes are strongly correlated for the given group of people
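A minimal sketch of the same test using SciPy's chi2_contingency (assuming SciPy is available); the observed counts are taken from the table above.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # fiction:     male, female
                     [50, 1000]])   # non-fiction: male, female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)   # 507.93 with 1 degree of freedom
print(expected)              # the expected counts 90, 360, 210, 840
```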

Correlation Analysis (Numeric Data)

Correlation coefficient (also called Pearson's product-moment coefficient):

  r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

  where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product.

If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.

r_{A,B} = 0: no linear correlation; r_{A,B} < 0: negatively correlated
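A minimal sketch of computing r_{A,B} with NumPy; the two series are hypothetical illustration data, and the direct formula agrees with numpy.corrcoef.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(a)
# Pearson's r from the formula above, using sample standard deviations
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(r)
print(np.corrcoef(a, b)[0, 1])   # the same value computed by numpy
```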

Covariance (Numeric Data)

Covariance is similar to correlation

  Covariance: \mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}

  Correlation coefficient: r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}

  where n is the number of tuples, \bar{A} and \bar{B} are the respective mean or expected values of A and B, and \sigma_A and \sigma_B are the respective standard deviations of A and B.

Positive covariance: if Cov(A,B) > 0, then when A is larger than its expected value, B is also likely to be larger than its expected value.

Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.

Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true: some pairs of random variables may have a covariance of 0 yet not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Covariance: An Example

It can be simplified in computation as \mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}

Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?

  E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4

  E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6

  Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4

Thus, A and B rise together since Cov(A, B) > 0.
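A minimal sketch of the stock example above, using the computational shortcut Cov(A, B) = E(A·B) − E(A)E(B); NumPy is assumed available.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov = (A * B).mean() - A.mean() * B.mean()    # E(A*B) - E(A)E(B)
print(cov)                                    # 4.0 -> the stocks tend to rise together
print(np.cov(A, B, bias=True)[0, 1])          # population covariance, same value
```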

Exercise


Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Transformation

Mapping the entire set of values of a given attribute to a new set of replacement values so that each old value can be identified with one of the new values

Strategies for data transformation include the following:

  Smoothing: remove noise from the data; techniques include binning, regression, and clustering

  Attribute/feature construction: new attributes constructed from the given ones

  Aggregation: summary or aggregation operations are applied to the data

  Normalization: data scaled to fall within a smaller, specified range, such as −1.0 to 1.0

    min-max normalization

    z-score normalization

    normalization by decimal scaling

  Discretization: raw values of numeric attributes (e.g., age) replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior)

  Concept hierarchy generation: attributes such as street can be generalized to higher-level concepts, like city or country

Normalization

Min-max normalization: performs a linear transformation on the original data into the range [new_min_A, new_max_A]:

  v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A

  Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

Z-score normalization: the values of attribute A are normalized based on the mean and standard deviation (\mu_A: mean, \sigma_A: standard deviation):

  v' = \frac{v - \mu_A}{\sigma_A}

  Ex. Let \mu = 54,000 and \sigma = 16,000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

Normalization by decimal scaling: normalizes by moving the decimal point of the values of attribute A. A value v of A is normalized to v' by

  v' = \frac{v}{10^{\,j}}

  where j is the smallest integer such that \max(|v'|) < 1
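A minimal sketch of the three normalization methods above, applied to a hypothetical income column with NumPy.

```python
import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization (mean and std computed from the data here)
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: j is the smallest integer such that max(|v / 10**j|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10**j

print(np.round(v_minmax, 3))   # 73,600 maps to about 0.716
print(np.round(v_decimal, 3))  # j = 5, so 73,600 maps to 0.736
```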

Data Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data values

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute

Why Is Discretization Used?

Reduce data size

Transform quantitative data to qualitative data

Data Discretization Methods

Typical methods (all of the methods can be applied recursively):

  Binning: top-down split, unsupervised

  Histogram analysis: top-down split, unsupervised

  Clustering analysis: unsupervised, top-down split or bottom-up merge

  Decision-tree analysis: supervised, top-down split

  Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge

Concept Hierarchy Generation

A concept hierarchy organizes concepts (i.e., attribute values) hierarchically

Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities

Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)

Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers

Concept hierarchies can be automatically formed for both numeric and nominal data

Concept Hierarchy Generation for Nominal Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

  street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping

  {Urbana, Champaign, Chicago} < Illinois

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, year

  country:            15 distinct values
  province_or_state:  365 distinct values
  city:               3,567 distinct values
  street:             674,339 distinct values

Exercise

For the following group of data: 200, 300, 400, 600, 1000, use the following methods to normalize the values.

  min-max normalization

  z-score normalization

  normalization by decimal scaling

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results

Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

Data reduction strategies

  Dimensionality reduction: data encoding schemes are applied so as to obtain a reduced or compressed representation of the original data

    Wavelet transforms

    Principal Components Analysis (PCA)

    Attribute subset selection, attribute creation

  Numerosity reduction: the data are replaced by alternative, smaller representations using parametric or non-parametric models

    Regression and log-linear models

    Histograms, clustering, sampling

    Data cube aggregation

  Data compression: transformations are applied so as to obtain a reduced or compressed representation of the original data

    Lossless

    Lossy

Attribute Subset Selection


Attribute subset selection reduces the data size by removing:

  Redundant attributes

  Irrelevant attributes

    Contain no information that is useful for the data mining task

    E.g., students' IDs are often irrelevant to the task of predicting students' GPA

Attribute Subset Selection


Greedy (heuristic) methods for attribute subset selection include stepwise forward selection, stepwise backward elimination, a combination of both, and decision-tree induction; a sketch of forward selection follows below.
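A minimal sketch of greedy stepwise forward selection. The scoring criterion (correlation of a least-squares fit with the target), the helper name forward_selection, and the data are illustrative assumptions, not a prescribed method; any model-based evaluation could be plugged in.

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedily pick k feature indices, adding the best-scoring one at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        def score(j):
            cols = X[:, selected + [j]]
            # score = correlation of the least-squares prediction with y
            pred = cols @ np.linalg.lstsq(cols, y, rcond=None)[0]
            return np.corrcoef(pred, y)[0, 1]
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 2] - 2 * X[:, 4] + rng.normal(scale=0.1, size=100)
print(forward_selection(X, y, 2))   # expected to pick features 2 and 4
```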

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Attribute construction can help to improve accuracy and understanding of structure in high-dimensional data

Wavelet Transforms

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.

The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.

Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.

Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes.
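A minimal sketch of DWT-based reduction, assuming the PyWavelets package (pywt) is installed; only the largest coefficients are kept, and the vector is reconstructed approximately from them. The data values are hypothetical.

```python
import numpy as np
import pywt

x = np.array([2.0, 2.5, 3.0, 10.0, 9.5, 9.0, 2.0, 2.5])

coeffs = pywt.wavedec(x, "haar")               # multi-level Haar DWT
flat = np.concatenate(coeffs)
threshold = np.sort(np.abs(flat))[-4]          # keep the 4 largest coefficients
kept = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

x_approx = pywt.waverec(kept, "haar")          # reconstruct an approximation of x
print(np.round(x_approx, 2))
```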

Principal components analysis

Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.

The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.

In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
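A minimal sketch of PCA via NumPy's SVD on mean-centered data; the matrix is hypothetical, and keeping k components projects the tuples into a k-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # 50 tuples, 5 attributes
k = 2

Xc = X - X.mean(axis=0)               # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                   # k orthogonal principal directions
X_reduced = Xc @ components.T         # projected data: 50 x k

print(X_reduced.shape)                # (50, 2)
```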

Data Reduction 2: Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation

Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line.

  Linear regression: Y = wX + b

  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand

Log-linear models:

  Approximate discrete multidimensional probability distributions

  Estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations

  Useful for dimensionality reduction and data smoothing
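A minimal sketch of numerosity reduction by linear regression: the raw (x, y) points are replaced by just the two coefficients w and b. The data are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])

w, b = np.polyfit(x, y, deg=1)     # least-squares fit of y = w*x + b
print(f"y ~= {w:.2f} * x + {b:.2f}")

y_approx = w * x + b               # the two coefficients regenerate
print(np.round(y_approx, 2))       # an approximation of the original values
```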

Histogram Analysis

Divide data into buckets

Partitioning rules:

  Equal-width: equal bucket range

  Equal-frequency (or equal-depth): each bucket holds roughly the same number of tuples

[Figure: example histogram of prices, x-axis from 10,000 to 100,000 in equal-width buckets, y-axis showing bucket counts]
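A minimal sketch of equal-width vs. equal-frequency bucketing with NumPy; the price values are hypothetical.

```python
import numpy as np

prices = np.array([12, 15, 18, 22, 25, 25, 31, 34, 48, 55, 62, 70, 88, 95], dtype=float)

# Equal-width: every bucket spans the same range of values
counts, edges = np.histogram(prices, bins=4)
print("equal-width edges:", edges, "counts:", counts)

# Equal-frequency (equal-depth): bucket edges at the quantiles of the data
edges_eq = np.quantile(prices, np.linspace(0, 1, 5))
counts_eq, _ = np.histogram(prices, bins=edges_eq)
print("equal-depth edges:", np.round(edges_eq, 1), "counts:", counts_eq)
```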

Clustering

Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

Sampling

Sampling: obtaining a small sample s to represent the whole data set N

Key principle: choose a representative subset of the data

Common ways of sampling:

  Simple random sample without replacement of size s (SRSWOR)

  Simple random sample with replacement of size s (SRSWR)

  Cluster sample

  Stratified sample

Types of Sampling

Simple random sampling

  There is an equal probability of selecting any particular item

Sampling without replacement

  Once an object is selected, it is removed from the population

Sampling with replacement

  A selected object is not removed from the population

Stratified sampling

  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)

  Used in conjunction with skewed data
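A minimal sketch of the sampling schemes above with NumPy; the population, sample sizes, and stratum labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)                                   # the full data set N
strata = np.repeat(["youth", "middle_aged", "senior"], [50, 30, 20])

srswor = rng.choice(data, size=5, replace=False)        # SRSWOR
srswr = rng.choice(data, size=5, replace=True)          # SRSWR

# Stratified sampling: draw roughly 10% from every stratum
stratified = np.concatenate([
    rng.choice(data[strata == s], size=max(1, (strata == s).sum() // 10), replace=False)
    for s in np.unique(strata)
])
print(srswor, srswr, stratified, sep="\n")
```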

Sampling: With or Without Replacement

Data Cube Aggregation

Summarize (aggregate) data based on dimensions

The resulting data set is smaller in volume, without loss of the information necessary for the analysis task

Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction

Data Reduction 3: Data Compression

String compression

  There are extensive theories and well-tuned algorithms

  Typically lossless, but only limited manipulation is possible without expansion

Audio/video compression

  Typically lossy compression, with progressive refinement

  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Dimensionality and numerosity reduction may also be considered as forms of data compression

Exercise
Using the data for age below:

  13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70

Plot an equal-width histogram of width 10.

Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sampling, and stratified sampling. Use samples of size 5 and the strata youth, middle-aged, and senior.
