
Data Mining: Concepts and Techniques
Chapter 11
Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing

Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Hector Gonzalez and Shengnan Cong

Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



What is RFID?
Radio Frequency Identification (RFID): technology that allows a sensor (reader) to read, from a distance and without line of sight, a unique Electronic Product Code (EPC) associated with a tag
(Figure: an RFID tag and reader)



RFID System
(Figure: components of an RFID system. Source: www.belgravium.com)



Applications
Supply chain management: real-time inventory tracking
Retail: active shelves monitor product availability
Access control: toll collection, credit cards, building access
Airline luggage management: implemented by British Airways to reduce lost/misplaced luggage (20 million bags a year)
Medical: implant patients with a tag that contains their medical history
Pet identification: implant an RFID tag with pet-owner information (www.pet-id.net)



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



RFID Warehouse Architecture
(Figure omitted: overall architecture of the RFID warehouse)


Challenges of RFID Data Sets
Data generated by RFID systems is enormous due to redundancy and a low level of abstraction
Walmart is expected to generate 7 terabytes of RFID data per day
Solution requirements:
A highly compact summary of the data
OLAP operations on a multi-dimensional view of the data
The summary should preserve the path structure of RFID data
It should be possible to efficiently drill down to individual tags when an interesting pattern is discovered



Why RFID-Warehousing? (1)
Lossless compression
Significantly reduces the size of the RFID data set by removing redundancy and grouping objects that move and stay together
Data cleaning: reasoning based on more complete information
Multiple readings, missed readings, erroneous readings, bulky movements, etc.
Multi-dimensional summary: product, location, time, etc.
Store manager: check item movements from the backroom to different shelves in the store
Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores



Why RFID-Warehousing? (2)
Query processing
Support for OLAP: roll-up, drill-down, slice, and dice
Path queries: new to RFID-warehouses; queries about the structure of paths
Which products that go through quality control have shorter paths?
Which locations are common to the paths of a set of defective auto parts?
Identify containers at a port that have deviated from their historic paths
Data mining
Find trends, outliers, and frequent, sequential, and flow patterns, etc.
Example: A Supply Chain Store
A retailer with 3,000 stores, each selling 10,000 items a day
Each item moves 10 times on average before being sold
Each movement is recorded as (EPC, location, second)
Data volume: 300 million tuples per day (after redundancy removal)
OLAP query
What is the average time for outerwear items to move from the warehouse to the checkout counter in March 2006?
Costly to answer if it requires scanning the billions of tuples accumulated for March



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



Cleaning of RFID Data Records
Raw data: (EPC, location, time)
Duplicate records arise from multiple readings of a product at the same location:
(r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10)
Cleansed data: the minimal information to store; the raw data can then be removed
(EPC, location, time_in, time_out)
(r1,l1,t1,t10)
Warehousing can help fill in missing records and correct wrongly registered information (see the sketch below)
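As a minimal sketch of the collapsing step only (not the full cleaning described above), the following code merges consecutive raw readings of the same EPC at the same location into single stay records; the record layout follows the slide, while the function and variable names are my own.

```python
# Minimal sketch: collapse duplicate (EPC, location, time) readings into
# (EPC, location, time_in, time_out) stay records. Assumes the readings
# are sorted by EPC and then by time. Names are illustrative.
from itertools import groupby

def clean(readings):
    """readings: list of (epc, loc, t) tuples."""
    cleansed = []
    # Group consecutive readings of the same item at the same location.
    for (epc, loc), group in groupby(readings, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in group]
        cleansed.append((epc, loc, min(times), max(times)))
    return cleansed

raw = [("r1", "l1", 1), ("r1", "l1", 2), ("r1", "l1", 10)]
print(clean(raw))  # [('r1', 'l1', 1, 10)]
```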



Key Compression Ideas (I)
Bulky object movements
Objects often move and stay together through the supply chain
If 1,000 packs of soda stay together at the distribution center, register a single record:
(GID, distribution center, time_in, time_out)
GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center
(Figure: bulky movements through a supply chain, factory → distribution centers → stores → shelves, with edge labels such as 10 pallets (1000 cases), 20 cases (1000 packs), 10 packs (12 sodas))



Key Compression Ideas (II)
Data generalization
Analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data
Aggregate object movements into fewer records
If time is only of interest at the day level, merge records at the minute level into records at the day level (see the sketch below)
Merging and/or collapsing of path segments
Uninteresting path segments can be ignored or merged
Multiple item movements within the same store may be uninteresting to a regional manager and can thus be merged
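Purely as an illustration of the time-generalization idea (in the actual system this happens during cuboid construction; all names here are my own):

```python
# Sketch: generalize stay records from second granularity to day
# granularity, then drop consecutive records that become identical.
from datetime import datetime

def to_day(ts: int) -> str:
    return datetime.fromtimestamp(ts).strftime("%Y-%m-%d")

def generalize(stays):
    """stays: list of (epc, loc, t_in, t_out) with epoch-second times."""
    out = []
    for epc, loc, t_in, t_out in stays:
        rec = (epc, loc, to_day(t_in), to_day(t_out))
        if out and out[-1] == rec:   # records collapse together at day level
            continue
        out.append(rec)
    return out
```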



Path-Independent Generalization
Category level: Clothing
Type level: Outerwear, Shoes   (the interesting level)
SKU level: Shirt, Jacket
EPC level: Shirt 1, ..., Shirt n   (the cleansed RFID database level)



Path Generalization
Complete path: dist. center → truck → backroom → shelf → checkout
Store view: transportation → backroom → shelf → checkout (the transportation stages are collapsed)
Transportation view: dist. center → truck → store (the intra-store stages are collapsed)


Why Not Use a Traditional Data Cube?
Fact table: (EPC, location, time_in, time_out)
Aggregate: a measure at a single location
e.g., what is the average time that milk stays in the refrigerator in Illinois stores?
What is missing?
Measures computed on items that travel through a series of locations
e.g., what is the average time that milk stays in the refrigerator in Champaign when coming from farm A and warehouse B?
Traditional cubes miss the path structure of the data



RFID-Cube Architecture
(Figure omitted: the RFID-cube structure)


RFID-Cuboid Architecture (II)
Stay Table: (GIDs, location, time_in, time_out : measures)
Records information on items that stay together at a given location
If we recorded transitions instead, queries would be difficult to answer: many intersections would be needed
Map Table: (GID, <GID1, ..., GIDn>)
Links together stages that belong to the same path; provides additional compression and query-processing efficiency
A high-level GID points to lower-level GIDs
If we saved complete EPC lists instead: high IO cost to retrieve long lists, costly query processing
Information Table: (EPC list, attribute_1, ..., attribute_n)
Records path-independent attributes of the items, e.g., color, manufacturer, price
A sketch of these three structures follows.
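A minimal sketch of the three tables as plain Python structures; the field names follow the slide, while the container choices and example values are my own.

```python
# Sketch of the RFID-cuboid tables. Field names follow the slide;
# everything else is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class StayRecord:
    gids: list[str]          # groups of items that stayed together
    location: str
    time_in: int
    time_out: int
    measures: dict = field(default_factory=dict)  # e.g., {"count": 3}

# Map table: a GID points to lower-level GIDs (or to EPCs at the leaves).
map_table: dict[str, list[str]] = {"g1": ["g1.1", "g1.2"]}

# Information table: EPC lists with path-independent attributes.
info_table: list[tuple[list[str], dict]] = [
    (["r1", "r2"], {"color": "red", "manufacturer": "M1"}),
]
```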



RFID-Cuboid Example

Cleansed RFID Database (EPC, loc, t_in, t_out):
(r1, l1, t1, t10)
(r1, l2, t20, t30)
(r2, l1, t1, t10)
(r2, l2, t20, t30)
(r3, l1, t1, t10)
(r3, l4, t15, t20)

Stay Table (GIDs, loc, t_in, t_out):
(g1, l1, t1, t10)      covers EPCs r1, r2, r3
(g1.1, l2, t20, t30)   covers r1, r2
(g1.2, l4, t15, t20)   covers r3

Map Table (GID → GIDs):
g1 → g1.1, g1.2
g1.1 → r1, r2
g1.2 → r3



Benefits of the Stay Table (I)
Query: what is the average time that items stay at location l?
Transition grouping:
Retrieve the n transition records with destination = l
Retrieve the m transition records with origin = l
Intersect the results and compute the average time
IO cost: n + m retrievals
Stay grouping (prefix tree):
Retrieve the stay records with location = l
IO cost: 1
(Figure: location l with incoming transitions from l1, ..., ln and outgoing transitions to ln+1, ..., ln+m)


Benefits of the Stay Table (II)
Query: how many boxes of milk traveled through the locations l1, l7, l13?
With the cleansed database:
Strategy: retrieve the item sets for locations l1, l7, l13 and intersect them
IO cost: one IO per item in locations l1, l7, or l13
Observation: very costly, since records are retrieved at the individual-item level, e.g., (r1,l1,t1,t2), (r1,l2,t3,t4), (r2,l1,t1,t2), (r2,l2,t3,t4), ..., (rk,l1,t1,t2), (rk,l2,t3,t4)
With the stay table:
Strategy: retrieve the GIDs for l1, l7, l13 and intersect them
IO cost: one IO per GID in locations l1, l7, and l13
Observation: records are retrieved at the group level, e.g., (g1,l1,t1,t2), (g2,l2,t3,t4), which greatly reduces IO cost



Benefits of the Map Table
(Figure: a three-level location tree, l1 at the root; children l2, l3, l4; leaves l5, ..., l10 holding the EPC sets {r1,..,ri}, ..., {rm+1,..,rn})
Storage cost per tree level, EPC lists vs. GIDs:
Level          #EPCs  #GIDs
l1             n      1
l2, l3, l4     n      3
l5, ..., l10   n      6
Total          3n     10 + n



Path-Dependent Naming of GIDs
Assign to each GID a unique identifier that encodes the path traversed by the items that it points to
A path-dependent name makes it easy to detect whether locations form a path (see the sketch below)
(Figure: location trees labeled with path-dependent GIDs, e.g., 0.0 and 0.1 at the roots (l1, l2), 0.0.0 and 0.1.0/0.1.1 below (l3, l4), and 0.0.0.0 and 0.1.0.1 at the leaves (l5, l6); each GID extends its parent's GID by one component)
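To make the prefix property concrete, here is a small sketch (function name is my own; the GID values follow the slide) of how path-dependent GIDs reduce the ancestor test to pure string operations, with no EPC lists touched:

```python
# Sketch: path-dependent GIDs encode ancestry in their components, so
# "is location A on the path leading to location B?" becomes a prefix test.
def is_on_path(ancestor_gid: str, descendant_gid: str) -> bool:
    a = ancestor_gid.split(".")
    d = descendant_gid.split(".")
    return len(a) < len(d) and d[:len(a)] == a

print(is_on_path("0.1", "0.1.0.1"))    # True: l2 precedes l6
print(is_on_path("0.0.0", "0.1.0.1"))  # False: different subtrees
```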



RFID-Cuboid Construction Algorithm
1. Build a prefix tree for the paths in the cleansed database
2. For each node, record a separate measure for each group of items that share the same leaf and information record
3. Assign GIDs to each node: GID = parent GID + unique id
4. Each node generates a stay record for each distinct measure
5. If multiple nodes share the same location, time, and measure, generate a single stay record with multiple GIDs
(A sketch of steps 1 and 3 follows.)
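As a rough illustration of steps 1 and 3 only (building the prefix tree and assigning path-dependent GIDs), under the assumption that each path is a list of locations; all names are hypothetical and steps 2, 4, and 5 are omitted:

```python
# Sketch of steps 1 and 3: build a prefix tree over paths and assign
# GIDs of the form parent_gid + "." + child_index. Illustrative only.
class Node:
    def __init__(self, gid, loc):
        self.gid, self.loc = gid, loc
        self.children = {}   # location -> Node
        self.count = 0       # a simple count measure per node

def build_path_tree(paths, root_gid="0"):
    root = Node(root_gid, None)
    for path in paths:                      # path: list of locations
        node = root
        for loc in path:
            if loc not in node.children:    # step 3: extend the parent GID
                child_gid = f"{node.gid}.{len(node.children)}"
                node.children[loc] = Node(child_gid, loc)
            node = node.children[loc]
            node.count += 1                 # aggregate the count measure
        # step 2 (not shown): attach the group of EPCs to the leaf
    return root

# Reproduces the slides' naming: l1 -> 0.0, l2 -> 0.1, l3 -> 0.0.0/0.1.0, ...
tree = build_path_tree([["l1", "l3", "l5"], ["l2", "l3", "l5"]])
```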



RFID-Cube Construction
(Figure: path tree with per-node (t_in, t_out): count annotations: 0.0 = l1 (t1,t10): 3 over leaf {r1,r2,r3}; 0.1 = l2 (t1,t8): 3; 0.0.0 = l3 (t20,t30): 3 and 0.1.0 = l3 (t20,t30): 3; 0.1.1 = l4 (t10,t20): 2 over leaf {r8,r9}; 0.0.0.0 = l5 (t40,t60): 3; 0.1.0.0 = l5 (t40,t60): 2 over leaf {r5,r6}; 0.1.0.1 = l6 (t35,t50): 1 over leaf {r7})

Stay Table:
GIDs               loc  t_in  t_out  count
0.0                l1   t1    t10    3
0.0.0, 0.1.0       l3   t20   t30    6
0.0.0.0, 0.1.0.0   l5   t40   t60    5
0.1                l2   t1    t8     3
0.1.0.1            l6   t35   t50    1
0.1.1              l4   t10   t20    2


RFID-Cube Properties
The RFID-cuboid can be constructed in a single scan of the cleansed RFID database
The RFID-cuboid provides lossless compression at its level of abstraction
The size of the RFID-cuboid is smaller than the cleansed data
In our experiments we obtain 80% lossless compression at the level of abstraction of the raw data



Query Processing
Traditional OLAP operations
Roll-up, drill-down, slice, and dice
Can be implemented efficiently with traditional optimization techniques, e.g., what is the average time spent by milk at the shelf:
σ_{stay.location = 'shelf' ∧ info.product = 'milk'} (Stay ⋈_gid Info)
Path selection (a new operation)
Compute an aggregate measure on the tags that travel through a set of locations and match a selection criterion on path-independent dimensions:
q ⟨ σ_{c_info}, (σ_{c_1} stage_1, ..., σ_{c_k} stage_k) ⟩


Query Processing (II)
Query: what is the average time spent from l3 to l5?
GIDs for l3: ⟨0.0.0⟩, ⟨0.1.0⟩
GIDs for l5: ⟨0.0.0.0⟩, ⟨0.1.0.1⟩
Prefix pairs: p1 = (⟨0.0.0⟩, ⟨0.0.0.0⟩), p2 = (⟨0.1.0⟩, ⟨0.1.0.1⟩)
Retrieve the stay records for each pair (including intermediate steps) and compute the measure
Savings: no EPC-list intersection; each EPC list may contain millions of distinct tags, and retrieving them is a significant IO cost
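A small sketch of the pairing step (function name is my own; the GID lists come from the slide): forming the prefix pairs is a pure string-prefix test over the two GID lists.

```python
# Sketch: form prefix pairs (g_start, g_end) where g_start's GID is a
# proper prefix of g_end's, i.e., the two locations lie on one path.
def prefix_pairs(start_gids, end_gids):
    return [(s, e) for s in start_gids for e in end_gids
            if e.startswith(s + ".")]

print(prefix_pairs(["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# [('0.0.0', '0.0.0.0'), ('0.1.0', '0.1.0.1')]  -> the slide's p1 and p2
```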



From RFID-Cuboids to RFID-Warehouse
Materialize the lowest RFID-cuboid at the minimum level of abstraction of interest to users
Materialize frequently requested RFID-cuboids
Materialization is done from the smallest materialized RFID-cuboid that is at a lower level of abstraction



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



RFID-Cube Compression (I)
Compression vs. cleansed data size (P = 1000, B = (500, 150, 40, 8, 1), k = 5)
Lossless compression; the cuboid is at the same level of abstraction as the cleansed RFID database
Compression vs. data bulkiness (P = 1000, N = 1,000,000, k = 5)
The map gives significant benefits for bulky data
For data where items move individually, we are better off using tag lists
(Charts omitted.)



RFID-Cube Compression (II)
Compression vs. abstraction level (P = 1000, B = (500, 150, 40, 8, 1), k = 5, N = 1,000,000)
The map provides significant savings over using tag lists
At very high levels of abstraction the stay table is very small; most of the space is used in recording RFID tags
(Chart omitted.)



RFID-Cube Construction Time
Construction time (P = 1000, B = (500, 150, 40, 8, 1), k = 5, N = 1,000,000)
Constructing from a lower-level cuboid saves 50% to 80% of the construction time
(Chart omitted.)



Query Processing
Time vs. database size (P = 1000, B = (500, 150, 40, 8, 1), k = 5)
Speedup due to the stay table: one order of magnitude
Speedup due to the stay table plus the map table: two orders of magnitude
Time vs. bulkiness (P = 1000, k = 5)
The speedup is most significant for bulky paths
For non-bulky paths, performance is no worse than using the cleansed table
(Charts omitted.)



Discussion
Our RFID-cube model works well for bulky object movements
But there are many applications where this assumption does not hold and other models are needed
We have focused only on warehousing RFID data; a variety of other problems remain open:
Path classification and clustering
Workflow analysis
Trend analysis
Sophisticated RFID data cleaning



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



Linking RFID Data Analysis with HPC
High-performance computing will play an important role in RFID data warehousing and data analysis
Most of the data-cleaning process can be done in a parallel and distributed manner
Stay and map tables can be constructed in parallel
Parallel computation and consolidation of multi-layer and multi-path data cubes
Queries and mining can be processed in parallel
Parallel RFID Data Mining: Promising
Parallel computing has been successfully applied to rather sophisticated data mining algorithms
Parallelizing frequent-pattern mining (based on FP-growth):
Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David Padua, A Sampling-Based Framework for Parallel Data Mining, PPoPP'05 (ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming)
Parallelizing sequential-pattern mining (based on PrefixSpan):
Shengnan Cong, Jiawei Han, and David Padua, Parallel Mining of Closed Sequential Patterns, KDD'05
Parallel RFID data analysis is highly promising



Mining Frequent Patterns
Breadth-first search vs. depth-first search
(Figure: the itemset lattice over {A, B, C, D, E}, from the empty set down to ABCDE, shown once under breadth-first and once under depth-first traversal)
Depth-first mining algorithms have proved to be more efficient
Depth-first mining algorithms are also more convenient to parallelize
Parallel Frequent-Pattern Mining
Target platform: distributed-memory systems
Framework for parallelization (see the sketch below):
Step 1:
Each processor scans its local portion of the dataset and accumulates the number of occurrences of each item
A reduction obtains the global occurrence counts
Step 2:
Partition the frequent items and assign a subset to each processor
Each processor builds projections for its assigned items
Step 3:
Each processor mines its local projections independently
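The papers' implementations are in C++ with MPI; purely as an illustration of the three-step skeleton, here is a minimal Python sketch assuming mpi4py (the data, the support threshold, and the round-robin assignment are stand-ins, not the papers' actual policy):

```python
# Sketch of the parallel framework, assuming mpi4py.
# Run with, e.g., `mpiexec -n 4 python parfp.py`.
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Hypothetical local partition of the transaction database.
local_records = [["a", "c", "m"], ["b", "c", "m"], ["b", "f"]]

# Step 1: local occurrence counts, then a reduction for the global counts
# (mpi4py applies MPI.SUM as '+' on Python objects; Counters add up).
local_counts = Counter(i for rec in local_records for i in rec)
global_counts = comm.allreduce(local_counts, op=MPI.SUM)

# Step 2: partition the frequent items among the processors
# (round-robin here; the papers use smarter, sample-guided assignment).
min_sup = 2  # assumed threshold
frequent = sorted(i for i, c in global_counts.items() if c >= min_sup)
my_items = frequent[comm.rank::comm.size]

# Step 3: each processor builds the projected databases for its items
# and mines them independently with a depth-first miner (not shown).
```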



Parallel Frequent-Pattern Mining (2)
Load-balancing problem
The mining time of some projections is too large relative to the overall mining time (Maximal/Overall = mining time of the largest projection as a fraction of the total):

(a) Datasets for frequent-itemset mining
Dataset:          mushroom  connect  pumsb  pumsb_star  T30I0.2D1K  T40I10D100K  T50I5D500K
Maximal/Overall:  7.66%     12.4%    14.7%  47.6%       42.1%       4.15%        3.07%

(b) Datasets for sequential-pattern mining
Dataset:          C10N0.1T8S8I8  C50N10T8S20I2.5  C100N5T2.5S10I1.25  C200N10T2.5S10I1.25  C100N20T2.5S10I1.25
Maximal/Overall:  21.4%          13.8%            14.2%               10.1%                11.6%

(c) Datasets for closed-sequential-pattern mining
Dataset:          C100S50N10  C100S100N5  C200S25N9  gazelle
Maximal/Overall:  15.3%       5.02%       4.53%      25.9%

Solution: the large projections must be partitioned
Challenge: how to identify the large projections?
How to Identify the Large Projections?
To identify the large projections, we need an estimate of the relative mining time of each projection
Static estimation
Study the correlation with dataset parameters: number of items, number of records, width of records, etc.
Study the correlation with characteristics of the projection: depth, bushiness, tree size, number of leaves, fan-out/fan-in, etc.
Result: no rule relating the above parameters to projection mining time was found
Dynamic Estimation
Runtime sampling
Use the relative mining time of a sample to estimate the relative mining time of the whole dataset
Trade-off: accuracy vs. overhead
Random sampling: randomly select a subset of records
Not accurate with small sample sizes (e.g., 1% random sampling on the pumsb dataset)
Becomes accurate when the sample size exceeds 30%, but the sampling overhead is then over 50%



Selective Sampling
Selective sampling: from each record, some items are removed
In frequent-itemset mining:
Discard the infrequent items
Discard a fraction t of the most frequent items
Example (support threshold = 2, t = 20%): the item counts are f:4, b:3, c:3, m:3, a:2, d:1; d is infrequent and f is the top 20% of the frequent items, so both are discarded (see the sketch below)

Dataset           Selective sample
1: a, c, d, f, m  1: a, c, m
2: b, c, f, m     2: b, c, m
3: b, f           3: b
4: b, c           4: b, c
5: a, f, m        5: a, m
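A runnable sketch that reproduces the slide's example (the function and variable names are my own; ties among equally frequent items are broken arbitrarily, which does not affect this example):

```python
# Sketch of selective sampling for frequent-itemset mining: drop the
# infrequent items and the top fraction t of the most frequent items.
from collections import Counter

def selective_sample(records, min_sup=2, t=0.2):
    counts = Counter(item for rec in records for item in rec)
    frequent = [i for i, c in counts.most_common() if c >= min_sup]
    top_k = int(t * len(frequent))          # most-frequent items to drop
    infrequent = {i for i, c in counts.items() if c < min_sup}
    removed = set(frequent[:top_k]) | infrequent
    return [[i for i in rec if i not in removed] for rec in records]

records = [list("acdfm"), list("bcfm"), list("bf"), list("bc"), list("afm")]
print(selective_sample(records))
# [['a','c','m'], ['b','c','m'], ['b'], ['b','c'], ['a','m']]
```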



Accuracy of Selective Sampling
(Charts omitted: accuracy results of selective sampling.)


Overhead of Selective Sampling
(Charts omitted: sampling overhead on (a) datasets for frequent-itemset mining, (b) datasets for sequential-pattern mining, (c) datasets for closed-sequential-pattern mining.)



Experimental Setups
Two Linux clusters, using up to 64 processors
Cluster A: 1 GHz Pentium III processors, 1 GB memory
Cluster B: 1.3 GHz Intel Itanium 2 processors, 2 GB memory
Implemented in C++ using MPI
Dataset generator from IBM
Datasets: summarized on the next slide



Experimental Setups
(Table omitted: characteristics of the datasets used.)


Speedups with One-Level Task Partitioning
Parallel frequent-itemset mining: Par-FP speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, T50I5D500K.)



Effectiveness of Selective Sampling
Multi-level task partitioning compared with one-level partitioning and the optimal speedup
(Charts omitted: speedup curves for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, T50I5D500K, each showing optimal, one-level, and multi-level.)


Speedups with One-Level Task Partitioning (Sequential Patterns)
Parallel sequential-pattern mining: Par-Span speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for C10N0.1T8S8I8, C50N10T8S20I2.5, C100N5T2.5S10I1.25, C100N20T2.5S10I1.25, C200N10T2.5S10I1.25.)



Speedups with One-Level Task Partitioning (Closed Sequential Patterns)
Parallel closed-sequential-pattern mining: Par-CSP speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for C100S50N10, C100S100N5, C200S25N9, gazelle.)



Effectiveness of Selective Sampling
One-level task partitioning with 64 processors: the speedups are improved by more than 50% on average
(Chart omitted: speedups with vs. without sampling on 64 processors.)



Conclusions
A new RFID warehouse model:
Allows efficient and flexible analysis of RFID data in multidimensional space
Preserves the path structure of the data
Compresses data by exploiting bulky movements, concept hierarchies, and path collapsing
High-performance computing will benefit RFID data warehousing and data mining tremendously
Efficient and highly parallel algorithms can be developed for RFID data analysis
