Data Mining:
Concepts and Techniques

— Chapter 11 —
—Additional Theme: RFID Data Warehousing and
Mining and High-Performance Computing—
Jiawei Han and Micheline Kamber

Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Hector Gonzalez and Shengnan Cong
12/08/21 Data Mining: Concepts and Techniques 1
Outline
 Introduction to RFID Technology

 Motivation: Why RFID-Warehousing?
 RFID-Warehouse Architecture
 Performance Study
 Linking RFID Data Analysis with HPC
 Conclusions

What is RFID?
 Radio Frequency Identification (RFID)
 Technology that allows a sensor (reader)
to read, from a distance, and without line
of sight, a unique electronic product
code (EPC) associated with a tag
[Figure: a tag and a reader]

RFID System
Source: www.belgravium.com

Applications
 Supply Chain Management:
real-time inventory tracking
 Retail: Active shelves monitor
product availability
 Access control: toll collection,
credit cards, building access
 Airline luggage management:
implemented by British Airways to
reduce lost/misplaced luggage (20
million bags a year)
 Medical: Implant patients with a
tag that contains their medical
history
 Pet identification: Implant RFID
tag with pet owner information
(www.pet-id.net)

RFID Warehouse Architecture

Challenges of RFID Data Sets
 Data generated by RFID systems is enormous due to redundancy
and low level of abstraction
 Walmart is expected to generate 7 terabytes of RFID
data per day
 Solution Requirements
 Highly compact summary of the data
 OLAP operations on multi-dimensional view of the data
 Summary should preserve the path structure of RFID
data
 It should be possible to efficiently drill down to individual
tags when an interesting pattern is discovered

Why RFID-Warehousing? (1)
 Lossless compression
 Significantly reduce the size of the RFID data set by
redundancy removal and grouping objects that
move and stay together
 Data cleaning: reasoning based on more complete info
 Multiple readings, missed readings, erroneous readings,
bulky movement, …
 Multi-dimensional summary: product, location, time, …
 Store manager: Check item movements from the
backroom to different shelves in his store
 Region manager: Collapse intra-store movements and
look at distribution centers, warehouses, and stores

Why RFID-Warehousing? (2)
 Query Processing
 Support for OLAP: roll-up, drill-down, slice, and dice
 Path query: New to RFID-Warehouses, about the
structure of paths
 What products that go through quality control have
shorter paths?
 What locations are common to the paths of a set of
defective auto-parts?
 Identify containers at a port that have deviated from their
historic paths
 Data mining
 Find trends, outliers, frequent, sequential, flow
patterns, …

Example: A Supply Chain Store
 A retailer with 3,000 stores, selling 10,000 items a day per store
 Each item moves 10 times on average before being sold
 Movement recorded as (EPC, location, second)
 Data volume: 300 million tuples per day (after redundancy
removal)
 OLAP Query
 Avg time for outerwear items to move from warehouse
to checkout counter in March 2006?
 Costly to answer if scanning 1 billion tuples for March
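
The daily volume figure above follows from simple multiplication. The sketch below only restates the slide's numbers; the variable names are illustrative and not part of the original material.

```python
# Back-of-the-envelope volume estimate using the figures from the slide.
stores = 3_000
items_sold_per_store_per_day = 10_000
moves_per_item = 10  # average number of recorded movements before sale

# One (EPC, location, second) tuple per movement, after redundancy removal.
tuples_per_day = stores * items_sold_per_store_per_day * moves_per_item
print(f"{tuples_per_day:,} tuples per day")
```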

Cleaning of RFID Data Records
 Raw Data
 (EPC, location, time)
 Duplicate records due to multiple readings of a
product at the same location
 (r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10)
 Cleansed Data: Minimal information to store; the raw data is
then removed
 (EPC, location, time_in, time_out)
 (r1,l1,t1,t10)
 Warehousing can help fill in missing records and correct
wrongly registered information
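
The duplicate-removal step can be sketched as follows. This is a minimal illustration, assuming timestamps are integers and that a stay is a run of consecutive readings of one tag at one location; the function name `cleanse` is hypothetical, and filling in missing or wrong readings is not covered.

```python
from itertools import groupby

def cleanse(raw_readings):
    """Collapse consecutive (EPC, location, time) readings into stay records."""
    stays = []
    # Order readings by tag and then by time so each tag's history is contiguous.
    ordered = sorted(raw_readings, key=lambda r: (r[0], r[2]))
    # groupby only merges adjacent rows, so a later return to the same
    # location produces a new stay rather than extending the old one.
    for (epc, loc), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in run]
        stays.append((epc, loc, times[0], times[-1]))
    return stays

# Ten readings of r1 at l1 (t = 1..10), then two readings at l2.
raw = [("r1", "l1", t) for t in range(1, 11)] + [("r1", "l2", 20), ("r1", "l2", 30)]
stays = cleanse(raw)
print(stays)
```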

Key Compression Ideas (I)
 Bulky object movements
 Objects often move and stay together through the supply chain
 If 1000 packs of soda stay together at the distribution center,
register a single record

 (GID, distribution center, time_in, time_out)
 GID is a generalized identifier that represents the 1000 packs that
stayed together at the distribution center
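
The single-record idea can be illustrated with a small grouping sketch. It is a simplification: items are grouped only by one shared stay, not by their whole path, and the names `group_stays` and the `g1`, `g2`, … GID format are assumptions made for this example.

```python
from collections import defaultdict

def group_stays(stays):
    """Merge items with identical (location, time_in, time_out) into one record."""
    groups = defaultdict(list)
    for epc, loc, t_in, t_out in stays:
        groups[(loc, t_in, t_out)].append(epc)

    grouped = []      # one (GID, location, time_in, time_out) record per group
    gid_members = {}  # GID -> EPCs it stands for
    for n, ((loc, t_in, t_out), epcs) in enumerate(sorted(groups.items()), start=1):
        gid = f"g{n}"
        gid_members[gid] = epcs
        grouped.append((gid, loc, t_in, t_out))
    return grouped, gid_members

# 1000 packs that stay together at the distribution center.
stays = [(f"pack{i}", "dist_center", 1, 10) for i in range(1000)]
grouped, gid_members = group_stays(stays)
print(grouped, len(gid_members["g1"]))
```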

[Figure: items travel in progressively smaller groups: 10 pallets (1000 cases) from the factory to the distribution centers, 20 cases (1000 packs) to the stores, and 10 packs (12 sodas) to the shelves]

Key Compression Ideas (II)
 Data generalization
 Analysis usually takes place at a much higher level of
abstraction than the one present in raw RFID data
 Aggregate object movements into fewer records
 If interested in time at the hour level, merge records
at the minute level into records at the hour level
 Merge and/or collapse path segments
 Uninteresting path segments can be ignored or merged
 Multiple item movements within the same store may
be uninteresting to a regional manager and thus can
be merged

Path-Independent Generalization
Category level:  Clothing
Type level:      Outerwear, Shoes (interesting level)
SKU level:       Shirt, Jacket, …
EPC level:       Shirt 1, …, Shirt n (cleansed RFID database level)

Path Generalization
Original path:
dist. center → truck → backroom → shelf → checkout
Store View:
transportation → backroom → shelf → checkout
Transportation View:
dist. center → truck → store
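
Collapsing a path to a given view can be sketched as a relabel-and-merge pass. The dictionaries and the function name `generalize_path` below are illustrative assumptions, not part of the original design.

```python
def generalize_path(path, view):
    """Relabel each location using `view` and merge adjacent repeats."""
    result = []
    for loc in path:
        label = view.get(loc, loc)  # locations missing from the view are kept as-is
        if not result or result[-1] != label:
            result.append(label)
    return result

path = ["dist. center", "truck", "backroom", "shelf", "checkout"]

# Store view: transportation stages are uninteresting and collapse into one.
store_view = {"dist. center": "transportation", "truck": "transportation"}
# Transportation view: movements inside the store collapse into one.
transportation_view = {"backroom": "store", "shelf": "store", "checkout": "store"}

store_path = generalize_path(path, store_view)
transport_path = generalize_path(path, transportation_view)
print(store_path)
print(transport_path)
```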

Why Not Use a Traditional Data Cube?
 Fact Table: (EPC, location, time_in, time_out)

 Aggregate: A measure at a single location
 e.g., what is the average time that milk stays in the
refrigerator in Illinois stores?

 What is missing?
 Measures computed on items that travel through a
series of locations
 e.g., what is the average time that milk stays at the
refrigerator in Champaign when coming from farm A
and warehouse B?
 Traditional cubes miss the path structure of the data

RFID-Cube Architecture

RFID-Cuboid Architecture (II)
 Stay Table: (GIDs, location, time_in, time_out: measures)
 Records information on items that stay together at a given
location
 If using record transitions: difficult to answer queries, lots of
intersections needed
 Map Table: (GID, <GID1,..,GIDn>)
 Links together stages that belong to the same path. Provides
additional compression and query processing efficiency
 High level GID points to lower level GIDs
 If saving complete EPC Lists: high costs of IO to retrieve long
lists, costly query processing

 Information Table: (EPC list, attribute 1,...,attribute n)
 Records path-independent attributes of the items, e.g., color,
manufacturer, price

RFID-Cuboid Example
Cleansed RFID Database
epc  loc  t_in  t_out
r1   l1   t1    t10
r1   l2   t20   t30
r2   l1   t1    t10
r2   l3   t20   t30
r3   l1   t1    t10
r3   l4   t15   t20
Stay Table
epcs      gids  loc  t_in  t_out
r1,r2,r3  g1    l1   t1    t10
r1,r2     g1.1  l2   t20   t30
r3        g1.2  l4   t15   t20
Map Table
gid   gids
g1    g1.1,g1.2
g1.1  r1,r2
g1.2  r3
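
Drilling down from a high-level GID to individual tags amounts to following map-table pointers. The sketch below uses the map table from this example; the dictionary representation and the function name `expand` are assumptions for illustration.

```python
# Map table from the example: each GID points to lower-level GIDs or EPCs.
map_table = {
    "g1": ["g1.1", "g1.2"],
    "g1.1": ["r1", "r2"],
    "g1.2": ["r3"],
}

def expand(gid, map_table):
    """Return the EPCs reachable from `gid` by following map-table pointers."""
    children = map_table.get(gid)
    if children is None:  # not a key in the map table, so treat it as an EPC
        return [gid]
    epcs = []
    for child in children:
        epcs.extend(expand(child, map_table))
    return epcs

epcs_g1 = expand("g1", map_table)
print(epcs_g1)
```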

Benefits of the Stay Table (I)
Query: What is the average time that items stay at location l?
[Figure: n locations l1, …, ln lead into location l, which leads to m locations ln+1, …, ln+m]
Transition Grouping
 Retrieve all transitions with destination = l
 Retrieve all transitions with origin = l
 Intersect results and compute average time
 IO Cost: n + m retrievals
Prefix Tree
 Retrieve n records
Stay Grouping
 Retrieve stay record with location = l
 IO Cost: 1

Benefits of the Stay Table (II)
Query: How many boxes of milk traveled through the locations l1, l7, l13?
With Cleansed Database
Records: (r1,l1,t1,t2), (r1,l2,t3,t4), (r2,l1,t1,t2), (r2,l2,t3,t4), …, (rk,l1,t1,t2), (rk,l2,t3,t4)
Strategy:
 Retrieve itemsets for locations l1, l7, l13
 Intersect itemsets
IO Cost:
 One IO per item in locations l1 or l7 or l13
Observation:
 Very costly, we retrieve records at the individual item level
With Stay Table
Records: (g1,l1,t1,t2), (g2,l2,t3,t4), …
Strategy:
 Retrieve the gids for l1, l7, l13
 Intersect the gids
IO Cost:
 One IO per GID in locations l1, l7, and l13
Observation:
 Retrieve records at the group level and thus greatly reduce IO costs

Benefits of the Map Table
Locations                  #EPCs  #GIDs
l1                         n      1
l2, l3, l4                 n      3
l5, l6, l7, l8, l9, l10    n      6
Leaf EPC lists: {r1,..,ri} {ri+1,..,rj} {rj+1,..,rk} {rk+1,..,rl} {rl+1,..,rm} {rm+1,..,rn}
Total                      3n     10+n
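
The totals on this slide (3n identifiers with full EPC lists versus 10 + n with the map) can be checked with a short calculation. The function names are illustrative, and n = 1,000,000 is an arbitrary value chosen for the example.

```python
def ids_with_epc_lists(n, levels=3):
    # Each of the three path levels repeats the full list of n EPCs.
    return levels * n

def ids_with_map(n, gids_per_level=(1, 3, 6)):
    # Stages store only GIDs; the n EPCs are listed once, at the leaves.
    return sum(gids_per_level) + n

n = 1_000_000
print(ids_with_epc_lists(n), ids_with_map(n))
```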

Path-Dependent Naming of GIDs
 Assign to each GID a unique identifier that encodes the path
traversed by the items that it points to
 Path-dependent name: Makes it easy to detect if locations
form a path
[Figure: path tree over locations l1 to l6 with GIDs 0.0, 0.1, 0.0.0, 0.1.0, 0.1.1, 0.0.0.0, and 0.1.0.1]

RFID-Cuboid Construction Algorithm
1. Build a prefix tree for the paths in the cleansed
database
2. For each node, record a separate measure for each
group of items that share the same leaf and information
record
3. Assign GIDs to each node:
GID = parent GID + unique id
4. Each node generates a stay record for each distinct
measure
5. If multiple nodes share the same location, time, and
measure, generate a single record with multiple GIDs
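
The steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes each prefix-tree node is keyed by a whole stage (location, time_in, time_out), uses a count as the only measure, and ignores information records. The input data and the name `build_stay_records` are made up for the example.

```python
def build_stay_records(paths):
    """paths: dict mapping EPC -> list of (loc, t_in, t_out) stages."""
    root = {"gid": "0", "children": {}}
    # Steps 1-3: build the prefix tree, naming each node parent GID + unique id.
    for epc, stages in sorted(paths.items()):
        node = root
        for stage in stages:
            child = node["children"].get(stage)
            if child is None:
                gid = f'{node["gid"]}.{len(node["children"])}'
                child = {"gid": gid, "stage": stage, "children": {}, "count": 0}
                node["children"][stage] = child
            child["count"] += 1
            node = child
    # Steps 4-5: one stay record per stage; nodes that share the same
    # location and times are merged into a single record with several GIDs.
    merged = {}
    stack = list(root["children"].values())
    while stack:
        node = stack.pop()
        gids, count = merged.get(node["stage"], ([], 0))
        merged[node["stage"]] = (gids + [node["gid"]], count + node["count"])
        stack.extend(node["children"].values())
    return sorted((sorted(gids), loc, t_in, t_out, count)
                  for (loc, t_in, t_out), (gids, count) in merged.items())

paths = {
    "r1": [("l1", 1, 10), ("l3", 20, 30)],
    "r2": [("l1", 1, 10), ("l3", 20, 30)],
    "r3": [("l2", 1, 8), ("l3", 20, 30)],
}
records = build_stay_records(paths)
for record in records:
    print(record)
```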

RFID-Cube Construction
Path Tree
GID      loc  (t_in, t_out): count  leaf EPCs
0.0      l1   (t1, t10): 3
0.0.0    l3   (t20, t30): 3
0.0.0.0  l5   (t40, t60): 3         {r1,r2,r3}
0.1      l2   (t1, t8): 3
0.1.0    l3   (t20, t30): 3
0.1.0.0  l5   (t40, t60): 2         {r5,r6}
0.1.0.1  l6   (t35, t50): 1         {r7}
0.1.1    l4   (t10, t20): 2         {r8,r9}
Stay Table
GIDs              loc  t_in  t_out  count
0.0               l1   t1    t10    3
0.0.0, 0.1.0      l3   t20   t30    6
0.0.0.0, 0.1.0.0  l5   t40   t60    5
0.1               l2   t1    t8     3
0.1.0.1           l6   t35   t50    1
0.1.1             l4   t10   t20    2

RFID-Cube Properties
 The RFID-cuboid can be constructed on a single scan of
the cleansed RFID database
 The RFID-cuboid provides lossless compression at its level
of abstraction
 The size of the RFID-cuboid is smaller than the cleansed
data
 In our experiments we get 80% lossless compression
at the level of abstraction of the raw data

Query Processing
 Traditional OLAP operations
 Roll up, drill down, slice, and dice
 Can be implemented efficiently with traditional
optimization techniques, e.g., what is the average time
spent by milk at the shelf
$\sigma_{\text{stay.location = 'shelf'},\ \text{info.product = 'milk'}}(\text{stay} \bowtie_{\text{gid}} \text{info})$
 Path selection (New operation)
 Compute an aggregate measure on the tags that
travel through a set of locations and that match a
selection criteria on path independent dimensions
$q \leftarrow \langle \sigma_{c}\,\text{info},\ (\sigma_{c_1}\,\text{stage}_1, \ldots, \sigma_{c_k}\,\text{stage}_k) \rangle$

Query Processing (II)
 Query: What is the average time spent from l3 to l5?

 GIDs for l3 <0.0.0>, <0.1.0>
 GIDs for l5 <0.0.0.0>, <0.1.0.1>
 Prefix pairs: p1: (<0.0.0>,<0.0.0.0>)
p2: (<0.1.0>,<0.1.0.1>)
 Retrieve stay records for each pair (including
intermediate steps) and compute measure
 Savings: No EPC list intersection. Each EPC list may
contain millions of different tags, and
retrieving them is a significant IO cost
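
Finding the prefix pairs needs only string comparisons on the path-dependent GIDs. The sketch below uses the GID lists from this slide; the helper names `is_prefix` and `prefix_pairs` are assumptions for illustration.

```python
def is_prefix(ancestor, descendant):
    """True if `ancestor` is a proper prefix of `descendant`, component-wise."""
    a, d = ancestor.split("."), descendant.split(".")
    return len(a) < len(d) and d[:len(a)] == a

def prefix_pairs(start_gids, end_gids):
    """Pair each start GID with every end GID that lies further along its path."""
    return [(s, e) for s in start_gids for e in end_gids if is_prefix(s, e)]

gids_l3 = ["0.0.0", "0.1.0"]
gids_l5 = ["0.0.0.0", "0.1.0.1"]
pairs = prefix_pairs(gids_l3, gids_l5)
print(pairs)
```

Comparing dot-separated components rather than raw strings avoids treating 0.1 as a prefix of 0.10.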

From RFID-Cuboids to RFID-Warehouse
 Materialize the lowest RFID-cuboid at the minimum level
of abstraction of interest to a user
 Materialize frequently requested RFID-cuboids
 Materialization is done from the smallest materialized
RFID-cuboid that is at a lower level of abstraction

RFID-Cube Compression (I)
Compression vs. Cleansed Data Size
P=1000, B=(500,150,40,8,1), k = 5
Lossless compression, cuboid is at the same level of abstraction as
cleansed RFID database
[Figure: size (MBytes) vs. input stay records (millions) for clean, nomap, and map]
Compression vs. Data Bulkiness
P=1000, N = 1,000,000, k = 5
Map gives significant benefits for bulky data
For data where items move individually we are better off using tag lists
[Figure: size (MBytes) vs. path bulkiness (a to e) for clean, nomap, and map]

RFID-Cube Compression (II)
Compression vs. Abstraction Level
P=1000, B=(500,150,40,8,1), k = 5, N=1,000,000
The map provides significant savings over using tag lists
At very high levels of abstraction the stay table is very small; most of
the space is used in recording RFID tags
[Figure: size (MBytes) vs. abstraction level (1 to 4) for nomap and map]

RFID-Cube Construction Time
Construction Time
P=1000, B=(500,150,40,8,1), k = 5, N=1,000,000
Savings by constructing from lower level cuboid: 50% to 80%
[Figure: time (seconds) vs. aggregation level (level 0 to level 3) for 100K, 200K, 400K, 800K, and 1600K tags]

Query Processing
Time vs. DB Size
P=1000, B=(500,150,40,8,1), k = 5
Speed up due to stay table: 1 order of magnitude
Speed up due to stay table and map table: 2 orders of magnitude
[Figure: I/O operations (log scale) vs. input stay records (millions) for clean, nomap, and map]
Time vs. Bulkiness
P=1000, k = 5
Speed up is most significant for bulky paths
For non-bulky paths performance is not worse than using the clean table
[Figure: I/O operations (log scale) vs. path bulkiness for clean, nomap, and map]

Discussion
 Our RFID cube model works well for bulky object
movements
 But there are many applications where this assumption is
not true and other models are needed
 We have only focused on warehousing RFID data, a
variety of other problems remain open:
 Path classification and clustering
 Workflow analysis
 Trend analysis
 Sophisticated RFID data cleaning

Linking RFID Data Analysis with HPC
 High performance computing will play an
important role in RFID data warehousing and
data analysis
 Most of the data cleaning process can be done in a
parallel and distributed manner
 Stay and map tables can be constructed in parallel
 Parallel computation and consolidation of
multi-layer and multi-path data cubes
 Query and mining can be processed in parallel

Parallel RFID Data Mining: Promising
 Parallel computing has been successfully applied to rather
sophisticated data mining algorithms
 Parallelizing frequent pattern mining (based on FPgrowth)
 Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David Padua, “A
Sampling-based Framework for Parallel Data Mining,” PPOPP’05
(ACM SIGPLAN Symp. on Principles & Practice of Parallel
Programming)
 Parallelizing sequential-pattern mining algorithm (based on
PrefixSpan)
 Shengnan Cong, Jiawei Han, and David Padua, “Parallel Mining of
Closed Sequential Patterns”, KDD'05
 Parallel RFID data analysis is highly promising

Mining Frequent Patterns
 Breadth-first search vs. depth-first search
[Figure: the itemset lattice over items A to E, from null down to ABCDE, shown twice: once for breadth-first and once for depth-first traversal]
 Depth-first mining has been shown to be more efficient
 Depth-first mining is easier to parallelize

Parallel Frequent-Pattern Mining
 Target platform ─ distributed memory system

 Framework for parallelization
 Step 1:
 Each processor scans its local portion of the dataset and
accumulates the number of occurrences of each item
 Reduction to obtain the global numbers of occurrences
 Step 2:
 Partition the frequent items and assign a subset to each
processor
 Each processor makes projections for the assigned items
 Step 3:
 Each processor mines the local projections independently
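
The three steps can be sketched in a single process, with each dataset partition standing in for one processor. This is only an illustration of the data flow, not the MPI implementation described later: the round-robin assignment, the alphabetical item order used for projections, and the function names are all assumptions, and Step 3 is left as a comment.

```python
from collections import Counter

def local_counts(partition):
    """Step 1 (per processor): count item occurrences in the local records."""
    counts = Counter()
    for record in partition:
        counts.update(set(record))
    return counts

def parallel_framework(partitions, min_support):
    # Step 1: local counting followed by a reduction into global counts.
    global_counts = sum((local_counts(p) for p in partitions), Counter())
    frequent = sorted(i for i, c in global_counts.items() if c >= min_support)

    # Step 2: assign frequent items to workers (round-robin, an assumption).
    n_workers = len(partitions)
    assignment = {w: frequent[w::n_workers] for w in range(n_workers)}

    # Projection of an item: records containing it, restricted to the
    # frequent items that come after it in alphabetical order (an assumption).
    projections = {}
    for items in assignment.values():
        for item in items:
            projections[item] = [
                sorted(x for x in set(rec) if x in frequent and x > item)
                for part in partitions for rec in part if item in rec
            ]
    # Step 3 (not shown): each worker mines its own projections independently.
    return global_counts, assignment, projections

partitions = [
    [["a", "c", "m"], ["b", "c", "m"]],   # records held by worker 0
    [["b"], ["b", "c"], ["a", "m"]],      # records held by worker 1
]
global_counts, assignment, projections = parallel_framework(partitions, min_support=2)
print(assignment)
print(projections["a"])
```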

Parallel Frequent-Pattern Mining (2)
 Load balancing problem
 Some projection mining time is too large relative to the
overall mining time

Dataset               Maximal/Overall
mushroom              7.66%
connect               12.4%
pumsb                 14.7%
pumsb_star            47.6%
T30I0.2D1K            42.1%
T40I10D100K           4.15%
T50I5D500K            3.07%
(a) datasets for frequent-itemset mining

Dataset               Maximal/Overall
C10N0.1T8S8I8         21.4%
C50N10T8S20I2.5       13.8%
C100N5T2.5S10I1.25    14.2%
C200N10T2.5S10I1.25   10.1%
C100N20T2.5S10I1.25   11.6%
(b) datasets for sequential-pattern mining

Dataset               Maximal/Overall
C100S50N10            15.3%
C100S100N5            5.02%
C200S25N9             4.53%
gazelle               25.9%
(c) datasets for closed-sequential-pattern mining

 Solution: The large projections must be partitioned
 Challenge: How to identify the large projections?

How to Identify the Large Projections?
 To identify the large projections, we need an estimation of
the relative mining time of the projections
 Static estimation
 Study the correlation with the dataset parameters
 Number of items, number of records, width of
records, …
 Study the correlation with the characteristics of the
projection
 Depth, bushiness, tree size, number of leaves, fan-
out/in, …
Result ─ No rule found with the above parameters for the
projection mining time

Dynamic Estimation
 Runtime sampling
 Use the relative mining time of a sample to estimate the relative
mining time of the whole dataset.

 Accuracy vs. overhead
 Random sampling: randomly select a subset of records.
 Not accurate with a small sample size,
e.g., dataset pumsb with 1% random sampling
 Becomes accurate when sample size > 30%, but
the sampling overhead is then over 50%

Selective Sampling
 Selective sampling: for each record, some items are
removed
 In frequent-itemset mining:
 Discard the infrequent items
 Discard a fraction t of the most frequent items
Example: support threshold = 2, t = 20%
Item counts: f: 4, b: 3, c: 3, m: 3, a: 2, d: 1
Record  Dataset         Selective sample
1       a, c, d, f, m   a, c, m
2       b, c, f, m      b, c, m
3       b, f            b
4       b, c            b, c
5       a, f, m         a, m

Accuracy of Selective Sampling

Overhead of Selective Sampling
(a) datasets for frequent-itemset mining
(b) datasets for sequential-pattern mining
(c) datasets for closed-sequential-pattern mining

Experimental Setups
 Two Linux clusters using up to 64 processors
 Cluster A – 1GHz Pentium III processor, 1GB
memory
 Cluster B – 1.3GHz Intel Itanium2 processor, 2GB
memory
 Implemented in C++ using MPI
 Dataset generator from IBM
 Datasets

Speedups with One-Level Task Partitioning
 Parallel frequent-itemset mining
[Figure: speedup (log scale) vs. number of processors (1 to 64), Par-FP against optimal, one panel each for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, and T50I5D500K]
Effectiveness of Selective Sampling
 Multi-level task partitioning
[Figure: speedup (log scale) vs. number of processors (1 to 64), one-level and multi-level partitioning against optimal, one panel each for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, and T50I5D500K]

Speedups with One-Level Task Partitioning (Sequential Patterns)
 Parallel sequential-pattern mining
[Figure: speedup (log scale) vs. number of processors (1 to 64), Par-Span against optimal, one panel each for C10N0.1T8S8I8, C50N10T8S20I2.5, C100N5T2.5S10I1.25, C100N20T2.5S10I1.25, and C200N10T2.5S10I1.25]

Speedups with One-Level Task Partitioning (Closed Sequential Pattern)
 Parallel closed-sequential-pattern mining
[Figure: speedup (log scale) vs. number of processors (1 to 64), Par-CSP against optimal, one panel each for C100S50N10, C100S100N5, C200S25N9, and Gazelle]

Effectiveness of Selective Sampling
 One-level task partitioning with 64 processors

 The speedups are improved by more than 50% on
average.
[Figure: speedups without sampling and with sampling]

Conclusions
 A new RFID warehouse model
 allows efficient and flexible analysis of RFID data in
multidimensional space
 preserves the structure of the data
 compresses data by exploiting bulky movements,
concept hierarchies, and path collapsing
 High-performance computing will benefit RFID data
warehousing and data mining tremendously
 Efficient and highly parallel algorithms can be developed
for RFID data analysis

