ITS665dm Topic2-DataUnderstanding

ITS 665 Data Mining
Topic 2
Understanding your Data
Shuzlina Abdul Rahman

(shuzlina@fskm.uitm.edu.my)
Centre of Information Systems Studies

Faculty of Computer and Mathematical Sciences, UiTM
Source: Adapted from Jiawei Han and Micheline Kamber (2012); Tan et al. (2012)
Objectives
To differentiate Data Objects and Attribute Types
To understand Basic Statistical Descriptions of Data
To explain several types of Data Visualization

Types of Data Sets
Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents: term-frequency vector
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
  Molecular Structures
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential Data: transaction sequences
  Genetic sequence data
Spatial, image and multimedia

Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
Each document becomes a 'term' vector:
each term is a component (attribute) of the vector
the value of each component is the number of times the corresponding term occurs in the document.
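The term-vector idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the two toy documents, the whitespace tokenization, and the helper name `term_vector` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus; the documents are made up for illustration.
documents = [
    "data mining finds patterns in data",
    "graph mining finds patterns in graphs",
]

# The sorted vocabulary fixes the order of the vector components (one per term).
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc, vocabulary):
    """Return the term-frequency vector of `doc` over `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc, vocabulary) for doc in documents]

# Print one row per term: the term, then its count in each document.
for term, *column in zip(vocabulary, *vectors):
    print(f"{term:10s}", column)
```

Each document ends up as a vector of the same length, so the collection as a whole can be treated as a data matrix with one row per document and one column per term.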
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
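As a rough sketch, the transactions in the table can be held as sets of items, which makes it easy to count how often items and pairs of items occur. The variable names below are illustrative assumptions; only the five transactions come from the table.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each single item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Count co-occurring pairs; sorting makes each pair a canonical tuple.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

print(item_counts["Milk"])              # transactions containing Milk
print(pair_counts[("Diaper", "Milk")])  # transactions containing both Diaper and Milk
```

Using a set per transaction reflects that a transaction records which items were bought, not their order.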
Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Other Types of Data
Ordered Data
Sequences of transactions
[Figure: a sequence of transactions made up of items/events; each transaction is an element of the sequence]
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Attributes
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or names of things
Hair_color = {black, brown, blond, red, auburn, grey, white}
We can assign a code of 0 for black, 1 for brown, and so on
marital status, occupation, ID numbers, zip codes
Nominal attribute values do not have any meaningful order about them and are not quantitative
It makes no sense to find the mean (average) value or median (middle) value for such an attribute, given a set of objects.
The exception is the mode: the attribute's most commonly occurring value.
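The point about nominal attributes can be illustrated with a short sketch. The observations and the integer coding below are hypothetical; they only show that the mode is meaningful while the mean of the codes is not.

```python
from collections import Counter

# Hypothetical observations of the nominal attribute Hair_color.
hair_color = ["black", "brown", "black", "blond", "red", "black", "red"]

# Integer codes are only labels: 0 for black, 1 for brown, and so on.
codes = {"black": 0, "brown": 1, "blond": 2, "red": 3}
encoded = [codes[value] for value in hair_color]

# The mode is meaningful: it is the most frequently occurring category.
mode_value, mode_count = Counter(hair_color).most_common(1)[0]
print(mode_value, mode_count)

# The mean of the codes can be computed, but it depends entirely on the
# arbitrary coding and does not correspond to any hair color.
mean_of_codes = sum(encoded) / len(encoded)
print(mean_of_codes)
```

Changing the coding (for example, swapping the codes of black and red) would change the mean but leave the mode unchanged, which is why only the mode is a valid summary here.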
Attribute Types
Binary (Boolean true or false)
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Attribute Types
Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Grade (e.g., A+, A, A-, B+, B, B-, C+, C, C-, D+, D, E, F)
Note that nominal, binary, and ordinal attributes are qualitative.
They describe a feature of an object, without giving an actual size or quantity.
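Because ordinal values have an order, a median can be found by ranking, even though differences between grades are not defined. The sketch below assumes the grade order listed above; the function name `ordinal_median` and the five sample grades are hypothetical.

```python
# Rank order of the grades listed above, from best (A+) to worst (F).
GRADE_ORDER = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "E", "F"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

def ordinal_median(values):
    """Return the middle grade of an odd-length list, using rank order only."""
    if len(values) % 2 == 0:
        # Two middle grades cannot be averaged: the gap between grades is unknown.
        raise ValueError("even-length list has two middle grades")
    ordered = sorted(values, key=RANK.__getitem__)
    return ordered[len(ordered) // 2]

# Hypothetical grades for five students.
grades = ["B+", "A", "C", "A-", "B"]
median_grade = ordinal_median(grades)
print(median_grade)
```

The function deliberately refuses even-length input rather than averaging, since averaging would treat the ranks as if they were interval-scaled.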
Numeric Attribute Types
Numeric attribute: a measurable quantity (represented as integer or real values)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point: neither 0 °C nor 0 °F indicates no temperature.
We can compute their mean value, in addition to the median and mode measures of central tendency.
Numeric Attribute Types
Ratio
Inherent zero-point
We can speak of values as being a multiple of another value, or an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts, monetary quantities (e.g., you are 100 times richer with $100 than with $1).
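The difference between interval and ratio scales can be checked numerically. The following sketch uses two made-up temperatures; the helper `celsius_to_kelvin` is an illustrative name, and the only outside fact it relies on is the 273.15 offset between the two scales.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (offset of 273.15)."""
    return celsius + 273.15

low_c, high_c = 5.0, 10.0
low_k, high_k = celsius_to_kelvin(low_c), celsius_to_kelvin(high_c)

# Differences are meaningful on both scales and agree with each other.
print(high_c - low_c, high_k - low_k)

# Ratios are only meaningful on the ratio scale (Kelvin).
ratio_c = high_c / low_c   # 2.0, but 10 °C is not "twice as hot" as 5 °C
ratio_k = high_k / low_k   # close to 1.018, the physically meaningful ratio
print(ratio_c, ratio_k)
```

The Celsius ratio changes if the zero point is moved (for example, by converting to Fahrenheit), which is exactly why ratios are not meaningful for interval-scaled attributes.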
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes, or the values 0 to 110 for the attribute Age.
Discrete vs. Continuous Attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous values are real numbers, whereas numeric values can be either integers or real numbers.
Continuous attributes are typically represented as floating-point variables
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
Examples
Source: http://www.perceptualedge.com/articles/dmreview/quant_vs_cat_data.pdf
BASIC STATISTICAL DESCRIPTIONS OF DATA
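Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
Note: n is sample size and N is population size
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Trimmed mean: chopping extreme values
Median: A holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$$
where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is the interval width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula for unimodal frequency curves that are moderately skewed:
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$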
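The measures above can be tried out with Python's standard `statistics` module. This is a sketch under stated assumptions: the twelve data values are illustrative and do not come from the slides, the helper functions are hypothetical names, and the trimmed mean is assumed to drop the same number `k` of values from each end.

```python
import statistics

# Illustrative sample (for example, salaries in thousands).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)
median = statistics.median(values)    # even count: average of the two middle values
modes = statistics.multimode(values)  # every value tied for the highest frequency

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(xs)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median for grouped data: L1 + ((n/2 - freq_below) / freq_median) * c."""
    return lower + (n / 2 - freq_below) / freq_median * width

print(mean, median, modes)
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(trimmed_mean(values, 1))
print(grouped_median(lower=20, width=10, n=100, freq_below=40, freq_median=25))
```

In this sample the single large value pulls the mean above the median, and two values share the highest frequency, so the data are bimodal.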
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves, labeled symmetric, positively skewed, and negatively skewed]
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$
Standard deviation s (or σ) is the square root of variance s² (or σ²)
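A short sketch can make the dispersion measures concrete. The data values are illustrative, not from the slides. Quartile conventions differ between textbooks and tools; the `method="inclusive"` option of `statistics.quantiles` is just one choice assumed here, so other software may report slightly different Q1 and Q3.

```python
import statistics

# Illustrative sample; not taken from the slides.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(values), q1, median, q3, max(values))

# Usual rule of thumb: outliers lie more than 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

# "Scalable" variance: needs only the running sums of x and x^2 (one pass).
n = len(values)
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)
sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
population_variance = sum_x2 / n - (sum_x / n) ** 2

print(five_number_summary, iqr, outliers)
print(sample_variance, population_variance)
```

The one-pass form is convenient for large data sets, although it can lose precision when the values are large relative to their spread; the two-pass definition is numerically safer.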
Properties of Normal Distribution Curve
The normal (distribution) curve
From μ - σ to μ + σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ - 2σ to μ + 2σ: contains about 95% of it
From μ - 3σ to μ + 3σ: contains about 99.7% of it
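These coverage figures can be reproduced with `statistics.NormalDist` from the Python standard library. The sketch below uses the standard normal distribution (mean 0, standard deviation 1) as an assumption; the function name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k, dist=standard_normal):
    """Probability mass within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

The same proportions hold for any mean and standard deviation, since the intervals are expressed in units of σ around μ.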
Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
Visualization of Data Dispersion: Boxplot Analysis
Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
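The tabulation behind a histogram can be sketched as follows. This minimal version assumes equal-width bins, so bar height and bar area carry the same information; the function name `histogram`, the data values, and the choice of four bins are all illustrative assumptions.

```python
def histogram(values, low, high, bins):
    """Count values in `bins` equal-width, adjacent, non-overlapping intervals on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if not low <= value <= high:
            continue  # ignore values outside the covered range
        # Values equal to `high` are placed in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Illustrative sample; not taken from the slides.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
counts = histogram(values, low=30, high=110, bins=4)

# Text rendering: one bar per interval, bar length equals the frequency.
for i, count in enumerate(counts):
    left = 30 + i * 20
    print(f"{left:3d}-{left + 20:3d} {'#' * count}")
```

With bins of unequal width, each count would need to be divided by its bin width to obtain the bar height, so that area rather than height represents frequency.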
Histograms Often Tell More than Boxplots
The two histograms shown on the left may have the same boxplot representation
The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
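The points of a quantile plot can be computed with a few lines of code. The slide does not give a formula for f_i; the sketch below assumes the common convention f_i = (i - 0.5) / n, and the unit prices are made-up values.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, assuming the convention f_i = (i - 0.5) / n for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unit prices.
unit_prices = [74, 40, 47, 75, 43]
points = quantile_plot_points(unit_prices)

for f, x in points:
    print(f"{f:.1f}  {x}")
```

Plotting x_i against f_i gives the quantile plot; under this convention the point with f_i = 0.5 is the median of an odd-sized sample.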
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
View: Is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
If the pattern of plotted points slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, which suggests a positive correlation.
If the pattern of plotted points slopes from upper left to lower right, then the values of X increase as the values of Y decrease, suggesting a negative correlation.
The left half fragment is positively correlated
The right half is negatively correlated
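The direction of the slope in a scatter plot corresponds to the sign of Pearson's correlation coefficient. The following sketch computes it directly from the definition; the function name `pearson` and the three small data series are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # slopes from lower left to upper right
falling = [6, 4, 5, 4, 2]   # slopes from upper left to lower right

print(round(pearson(x, rising), 3))   # positive value
print(round(pearson(x, falling), 3))  # negative value
```

A value near zero would correspond to the uncorrelated case, where the points show no overall upward or downward trend.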
Uncorrelated Data
Exercise 1
What is the median value?
What are the lower and upper values?
What are the outlier values?
Interpret the box-and-whisker plot.
SEVERAL TYPES OF DATA VISUALIZATION
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships among data
Help find interesting regions and suitable parameters for further quantitative analysis
Provide a visual proof of computer representations derived
Typical visualization methods:
Geometric techniques
Icon-based techniques
Hierarchical techniques
Geometric Techniques
Visualization of geometric transformations and projections of the data
Methods
Direct data visualization
Scatterplot matrices
Landscapes
Projection pursuit technique: finding meaningful projections of multidimensional data
Prosection views
Hyperslice
Parallel coordinates
Scatterplot Matrices
Used by permission of M. Ward, Worcester Polytechnic Institute
Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k² - k)/2 distinct scatterplots]
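The number of distinct scatterplots can be checked by enumerating attribute pairs: k attributes give k(k - 1)/2 unordered pairs. The attribute names in this sketch are hypothetical.

```python
from itertools import combinations

# Hypothetical attribute names of a 4-dimensional data set.
attributes = ["age", "income", "height", "weight"]
k = len(attributes)

# Each unordered pair of distinct attributes yields one distinct scatterplot.
pairs = list(combinations(attributes, 2))

for x_attr, y_attr in pairs:
    print(f"{x_attr} vs. {y_attr}")
print(len(pairs), k * (k - 1) // 2)
```

A full k-by-k scatterplot matrix usually shows each pair twice (once above and once below the diagonal), which is why the number of panels on screen is larger than the number of distinct pairs.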
Landscapes
[Figure: news articles visualized as a landscape]
Used by permission of B. Wright, Visible Decisions Inc.
Visualization of the data as perspective landscape
The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
Icon-based Techniques
Visualization of the data values as features of icons
Typical visualization methods:
Chernoff Faces
Stick Figures
General techniques
Shape Coding: Use shape to represent certain information encoding
Color Icons: Using color icons to encode more information
TileBars: The use of small icons representing the relevance feature vectors in document retrieval
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
Hierarchical Techniques
Visualization of the data using a hierarchical partitioning into subspaces.
Methods
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
Dimensional Stacking
[Figure: stacked 2-D grids with axes labeled attribute 1, attribute 2, attribute 3, and attribute 4]
Partitioning of the n-dimensional attribute space in 2-D subspaces, which are stacked into each other
Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
Adequate for data with ordinal attributes of low cardinality
But, difficult to display more than nine dimensions
Important to map dimensions appropriately
Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Tree-Map
Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
MSR Netscan Image
Tree-Map of a File System (Shneiderman)
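The alternating partitioning can be sketched as a small recursive layout, often called a slice-and-dice layout. This is one simple way to realize the idea, not necessarily the method behind the images on the slides. The tree structure, the file names, and the sizes below are all hypothetical.

```python
def node_size(node):
    """Size of a leaf, or the total size of all leaves below an internal node."""
    if "children" in node:
        return sum(node_size(child) for child in node["children"])
    return node["size"]

def treemap(node, x, y, width, height, split_x=True, out=None):
    """Slice-and-dice layout: return {leaf name: (x, y, width, height)}."""
    if out is None:
        out = {}
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, width, height)
        return out
    total = node_size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = node_size(child) / total
        if split_x:  # partition along x at this level, along y at the next
            treemap(child, x + offset * width, y, share * width, height, False, out)
        else:        # partition along y at this level, along x at the next
            treemap(child, x, y + offset * height, width, share * height, True, out)
        offset += share
    return out

# Hypothetical file system; sizes are made-up numbers.
root = {"name": "/", "children": [
    {"name": "docs", "children": [
        {"name": "a.txt", "size": 10},
        {"name": "b.txt", "size": 30},
    ]},
    {"name": "video.mp4", "size": 60},
]}

layout = treemap(root, 0.0, 0.0, 100.0, 100.0)
for name, rect in layout.items():
    print(name, rect)
```

Each leaf receives a rectangle whose area is proportional to its size, and the rectangles of siblings tile their parent's rectangle completely, which is what makes the method screen-filling.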
Three-D Cone Trees
3D cone tree visualization technique works well for up to a thousand nodes or so
First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
Cannot avoid overlaps when projected to 2D
G. Robertson, J. Mackinlay, S. Card. Cone Trees: Animated 3D Visualizations of Hierarchical Information, ACM SIGCHI'91
Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next
InfoCube
A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on
Source of Public Datasets
UC Irvine Machine Learning Repository
http://archive.ics.uci.edu/ml/
Datasets for Data Mining, The University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
Google Public Data
http://www.google.com/publicdata/directory
Kent Ridge Bio-medical Dataset
http://datam.i2r.a-star.edu.sg/datasets/krbd/
Frequent Itemset Mining Dataset Repository
http://fimi.ua.ac.be/data/
Bioinformatics Datasets
http://www.kent.ac.uk/library/subjects/biosciences/bioinformatics.html?tab=genomes
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion, graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
