
Data Mining: Concepts and Techniques
Chapter 11
Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing

Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Hector Gonzalez and Shengnan Cong

Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



What is RFID?
Radio Frequency Identification (RFID): technology that allows a sensor (reader) to read, from a distance and without line of sight, a unique Electronic Product Code (EPC) associated with a tag
(Figure: an RFID tag and reader)



RFID System
(Figure: components of an RFID system. Source: www.belgravium.com)



Applications
Supply chain management: real-time inventory tracking
Retail: active shelves monitor product availability
Access control: toll collection, credit cards, building access
Airline luggage management: implemented by British Airways to reduce lost/misplaced luggage (20 million bags a year)
Medical: implant patients with a tag that contains their medical history
Pet identification: implant an RFID tag with pet-owner information (www.pet-id.net)



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



RFID Warehouse Architecture
(Figure omitted: overall architecture of the RFID warehouse)


Challenges of RFID Data Sets
Data generated by RFID systems is enormous due to redundancy and a low level of abstraction
Walmart is expected to generate 7 terabytes of RFID data per day
Solution requirements:
A highly compact summary of the data
OLAP operations on a multi-dimensional view of the data
The summary should preserve the path structure of RFID data
It should be possible to efficiently drill down to individual tags when an interesting pattern is discovered



Why RFID-Warehousing? (1)
Lossless compression
Significantly reduces the size of the RFID data set by removing redundancy and grouping objects that move and stay together
Data cleaning: reasoning based on more complete information
Multiple readings, missed readings, erroneous readings, bulky movements, etc.
Multi-dimensional summary: product, location, time, etc.
Store manager: check item movements from the backroom to different shelves in the store
Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores



Why RFID-Warehousing? (2)
Query processing
Support for OLAP: roll-up, drill-down, slice, and dice
Path queries: new to RFID-warehouses; queries about the structure of paths
Which products that go through quality control have shorter paths?
Which locations are common to the paths of a set of defective auto parts?
Identify containers at a port that have deviated from their historic paths
Data mining
Find trends, outliers, and frequent, sequential, and flow patterns, etc.
Example: A Supply Chain Store
A retailer with 3,000 stores, each selling 10,000 items a day
Each item moves 10 times on average before being sold
Each movement is recorded as (EPC, location, second)
Data volume: 300 million tuples per day (after redundancy removal)
OLAP query
What is the average time for outerwear items to move from the warehouse to the checkout counter in March 2006?
Costly to answer if it requires scanning the billions of tuples accumulated for March



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



Cleaning of RFID Data Records
Raw data: (EPC, location, time)
Duplicate records arise from multiple readings of a product at the same location:
(r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10)
Cleansed data: the minimal information to store; the raw data can then be removed
(EPC, location, time_in, time_out)
(r1,l1,t1,t10)
Warehousing can help fill in missing records and correct wrongly registered information (see the sketch below)
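As a minimal sketch of the collapsing step only (not the full cleaning described above), the following code merges consecutive raw readings of the same EPC at the same location into single stay records; the record layout follows the slide, while the function and variable names are my own.

```python
# Minimal sketch: collapse duplicate (EPC, location, time) readings into
# (EPC, location, time_in, time_out) stay records. Assumes the readings
# are sorted by EPC and then by time. Names are illustrative.
from itertools import groupby

def clean(readings):
    """readings: list of (epc, loc, t) tuples."""
    cleansed = []
    # Group consecutive readings of the same item at the same location.
    for (epc, loc), group in groupby(readings, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in group]
        cleansed.append((epc, loc, min(times), max(times)))
    return cleansed

raw = [("r1", "l1", 1), ("r1", "l1", 2), ("r1", "l1", 10)]
print(clean(raw))  # [('r1', 'l1', 1, 10)]
```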



Key Compression Ideas (I)
Bulky object movements
Objects often move and stay together through the supply chain
If 1,000 packs of soda stay together at the distribution center, register a single record:
(GID, distribution center, time_in, time_out)
GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center
(Figure: bulky movements through a supply chain, factory → distribution centers → stores → shelves, with edge labels such as 10 pallets (1000 cases), 20 cases (1000 packs), 10 packs (12 sodas))



Key Compression Ideas (II)
Data generalization
Analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data
Aggregate object movements into fewer records
If time is only of interest at the day level, merge records at the minute level into records at the day level (see the sketch below)
Merging and/or collapsing of path segments
Uninteresting path segments can be ignored or merged
Multiple item movements within the same store may be uninteresting to a regional manager and can thus be merged
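Purely as an illustration of the time-generalization idea (in the actual system this happens during cuboid construction; all names here are my own):

```python
# Sketch: generalize stay records from second granularity to day
# granularity, then drop consecutive records that become identical.
from datetime import datetime

def to_day(ts: int) -> str:
    return datetime.fromtimestamp(ts).strftime("%Y-%m-%d")

def generalize(stays):
    """stays: list of (epc, loc, t_in, t_out) with epoch-second times."""
    out = []
    for epc, loc, t_in, t_out in stays:
        rec = (epc, loc, to_day(t_in), to_day(t_out))
        if out and out[-1] == rec:   # records collapse together at day level
            continue
        out.append(rec)
    return out
```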



Path-Independent Generalization
Category level: Clothing
Type level: Outerwear, Shoes   (the interesting level)
SKU level: Shirt, Jacket
EPC level: Shirt 1, ..., Shirt n   (the cleansed RFID database level)



Path Generalization
Complete path: dist. center → truck → backroom → shelf → checkout
Store view: transportation → backroom → shelf → checkout (the transportation stages are collapsed)
Transportation view: dist. center → truck → store (the intra-store stages are collapsed)


Why Not Use a Traditional Data Cube?
Fact table: (EPC, location, time_in, time_out)
Aggregate: a measure at a single location
e.g., what is the average time that milk stays in the refrigerator in Illinois stores?
What is missing?
Measures computed on items that travel through a series of locations
e.g., what is the average time that milk stays in the refrigerator in Champaign when coming from farm A and warehouse B?
Traditional cubes miss the path structure of the data



RFID-Cube Architecture
(Figure omitted: the RFID-cube structure)


RFID-Cuboid Architecture (II)
Stay Table: (GIDs, location, time_in, time_out : measures)
Records information on items that stay together at a given location
If we recorded transitions instead, queries would be difficult to answer: many intersections would be needed
Map Table: (GID, <GID1, ..., GIDn>)
Links together stages that belong to the same path; provides additional compression and query-processing efficiency
A high-level GID points to lower-level GIDs
If we saved complete EPC lists instead: high IO cost to retrieve long lists, costly query processing
Information Table: (EPC list, attribute_1, ..., attribute_n)
Records path-independent attributes of the items, e.g., color, manufacturer, price
A sketch of these three structures follows.
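A minimal sketch of the three tables as plain Python structures; the field names follow the slide, while the container choices and example values are my own.

```python
# Sketch of the RFID-cuboid tables. Field names follow the slide;
# everything else is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class StayRecord:
    gids: list[str]          # groups of items that stayed together
    location: str
    time_in: int
    time_out: int
    measures: dict = field(default_factory=dict)  # e.g., {"count": 3}

# Map table: a GID points to lower-level GIDs (or to EPCs at the leaves).
map_table: dict[str, list[str]] = {"g1": ["g1.1", "g1.2"]}

# Information table: EPC lists with path-independent attributes.
info_table: list[tuple[list[str], dict]] = [
    (["r1", "r2"], {"color": "red", "manufacturer": "M1"}),
]
```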



RFID-Cuboid Example

Cleansed RFID Database (EPC, loc, t_in, t_out):
(r1, l1, t1, t10)
(r1, l2, t20, t30)
(r2, l1, t1, t10)
(r2, l2, t20, t30)
(r3, l1, t1, t10)
(r3, l4, t15, t20)

Stay Table (GIDs, loc, t_in, t_out):
(g1, l1, t1, t10)      covers EPCs r1, r2, r3
(g1.1, l2, t20, t30)   covers r1, r2
(g1.2, l4, t15, t20)   covers r3

Map Table (GID → GIDs):
g1 → g1.1, g1.2
g1.1 → r1, r2
g1.2 → r3



Benefits of the Stay Table (I)
Query: what is the average time that items stay at location l?
Transition grouping:
Retrieve the n transition records with destination = l
Retrieve the m transition records with origin = l
Intersect the results and compute the average time
IO cost: n + m retrievals
Stay grouping (prefix tree):
Retrieve the stay records with location = l
IO cost: 1
(Figure: location l with incoming transitions from l1, ..., ln and outgoing transitions to ln+1, ..., ln+m)


Benefits of the Stay Table (II)
Query: how many boxes of milk traveled through the locations l1, l7, l13?
With the cleansed database:
Strategy: retrieve the item sets for locations l1, l7, l13 and intersect them
IO cost: one IO per item in locations l1, l7, or l13
Observation: very costly, since records are retrieved at the individual-item level, e.g., (r1,l1,t1,t2), (r1,l2,t3,t4), (r2,l1,t1,t2), (r2,l2,t3,t4), ..., (rk,l1,t1,t2), (rk,l2,t3,t4)
With the stay table:
Strategy: retrieve the GIDs for l1, l7, l13 and intersect them
IO cost: one IO per GID in locations l1, l7, and l13
Observation: records are retrieved at the group level, e.g., (g1,l1,t1,t2), (g2,l2,t3,t4), which greatly reduces IO cost



Benefits of the Map Table
(Figure: a three-level location tree, l1 at the root; children l2, l3, l4; leaves l5, ..., l10 holding the EPC sets {r1,..,ri}, ..., {rm+1,..,rn})
Storage cost per tree level, EPC lists vs. GIDs:
Level          #EPCs  #GIDs
l1             n      1
l2, l3, l4     n      3
l5, ..., l10   n      6
Total          3n     10 + n



Path-Dependent Naming of GIDs
Assign to each GID a unique identifier that encodes the path traversed by the items that it points to
A path-dependent name makes it easy to detect whether locations form a path (see the sketch below)
(Figure: location trees labeled with path-dependent GIDs, e.g., 0.0 and 0.1 at the roots (l1, l2), 0.0.0 and 0.1.0/0.1.1 below (l3, l4), and 0.0.0.0 and 0.1.0.1 at the leaves (l5, l6); each GID extends its parent's GID by one component)
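To make the prefix property concrete, here is a small sketch (function name is my own; the GID values follow the slide) of how path-dependent GIDs reduce the ancestor test to pure string operations, with no EPC lists touched:

```python
# Sketch: path-dependent GIDs encode ancestry in their components, so
# "is location A on the path leading to location B?" becomes a prefix test.
def is_on_path(ancestor_gid: str, descendant_gid: str) -> bool:
    a = ancestor_gid.split(".")
    d = descendant_gid.split(".")
    return len(a) < len(d) and d[:len(a)] == a

print(is_on_path("0.1", "0.1.0.1"))    # True: l2 precedes l6
print(is_on_path("0.0.0", "0.1.0.1"))  # False: different subtrees
```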



RFID-Cuboid Construction Algorithm
1. Build a prefix tree for the paths in the cleansed database
2. For each node, record a separate measure for each group of items that share the same leaf and information record
3. Assign GIDs to each node: GID = parent GID + unique id
4. Each node generates a stay record for each distinct measure
5. If multiple nodes share the same location, time, and measure, generate a single stay record with multiple GIDs
(A sketch of steps 1 and 3 follows.)
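As a rough illustration of steps 1 and 3 only (building the prefix tree and assigning path-dependent GIDs), under the assumption that each path is a list of locations; all names are hypothetical and steps 2, 4, and 5 are omitted:

```python
# Sketch of steps 1 and 3: build a prefix tree over paths and assign
# GIDs of the form parent_gid + "." + child_index. Illustrative only.
class Node:
    def __init__(self, gid, loc):
        self.gid, self.loc = gid, loc
        self.children = {}   # location -> Node
        self.count = 0       # a simple count measure per node

def build_path_tree(paths, root_gid="0"):
    root = Node(root_gid, None)
    for path in paths:                      # path: list of locations
        node = root
        for loc in path:
            if loc not in node.children:    # step 3: extend the parent GID
                child_gid = f"{node.gid}.{len(node.children)}"
                node.children[loc] = Node(child_gid, loc)
            node = node.children[loc]
            node.count += 1                 # aggregate the count measure
        # step 2 (not shown): attach the group of EPCs to the leaf
    return root

# Reproduces the slides' naming: l1 -> 0.0, l2 -> 0.1, l3 -> 0.0.0/0.1.0, ...
tree = build_path_tree([["l1", "l3", "l5"], ["l2", "l3", "l5"]])
```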



RFID-Cube Construction
(Figure: path tree with per-node (t_in, t_out): count annotations: 0.0 = l1 (t1,t10): 3 over leaf {r1,r2,r3}; 0.1 = l2 (t1,t8): 3; 0.0.0 = l3 (t20,t30): 3 and 0.1.0 = l3 (t20,t30): 3; 0.1.1 = l4 (t10,t20): 2 over leaf {r8,r9}; 0.0.0.0 = l5 (t40,t60): 3; 0.1.0.0 = l5 (t40,t60): 2 over leaf {r5,r6}; 0.1.0.1 = l6 (t35,t50): 1 over leaf {r7})

Stay Table:
GIDs               loc  t_in  t_out  count
0.0                l1   t1    t10    3
0.0.0, 0.1.0       l3   t20   t30    6
0.0.0.0, 0.1.0.0   l5   t40   t60    5
0.1                l2   t1    t8     3
0.1.0.1            l6   t35   t50    1
0.1.1              l4   t10   t20    2


RFID-Cube Properties
The RFID-cuboid can be constructed in a single scan of the cleansed RFID database
The RFID-cuboid provides lossless compression at its level of abstraction
The size of the RFID-cuboid is smaller than the cleansed data
In our experiments we obtain 80% lossless compression at the level of abstraction of the raw data



Query Processing
Traditional OLAP operations
Roll-up, drill-down, slice, and dice
Can be implemented efficiently with traditional optimization techniques, e.g., what is the average time spent by milk at the shelf:
σ_{stay.location = 'shelf' ∧ info.product = 'milk'} (Stay ⋈_gid Info)
Path selection (a new operation)
Compute an aggregate measure on the tags that travel through a set of locations and match a selection criterion on path-independent dimensions:
q ⟨ σ_{c_info}, (σ_{c_1} stage_1, ..., σ_{c_k} stage_k) ⟩


Query Processing (II)
Query: what is the average time spent from l3 to l5?
GIDs for l3: ⟨0.0.0⟩, ⟨0.1.0⟩
GIDs for l5: ⟨0.0.0.0⟩, ⟨0.1.0.1⟩
Prefix pairs: p1 = (⟨0.0.0⟩, ⟨0.0.0.0⟩), p2 = (⟨0.1.0⟩, ⟨0.1.0.1⟩)
Retrieve the stay records for each pair (including intermediate steps) and compute the measure
Savings: no EPC-list intersection; each EPC list may contain millions of distinct tags, and retrieving them is a significant IO cost
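A small sketch of the pairing step (function name is my own; the GID lists come from the slide): forming the prefix pairs is a pure string-prefix test over the two GID lists.

```python
# Sketch: form prefix pairs (g_start, g_end) where g_start's GID is a
# proper prefix of g_end's, i.e., the two locations lie on one path.
def prefix_pairs(start_gids, end_gids):
    return [(s, e) for s in start_gids for e in end_gids
            if e.startswith(s + ".")]

print(prefix_pairs(["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# [('0.0.0', '0.0.0.0'), ('0.1.0', '0.1.0.1')]  -> the slide's p1 and p2
```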



From RFID-Cuboids to RFID-Warehouse
Materialize the lowest RFID-cuboid at the minimum level of abstraction of interest to users
Materialize frequently requested RFID-cuboids
Materialization is done from the smallest materialized RFID-cuboid that is at a lower level of abstraction



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



RFID-Cube Compression (I)
Compression vs. cleansed data size (P = 1000, B = (500, 150, 40, 8, 1), k = 5)
Lossless compression; the cuboid is at the same level of abstraction as the cleansed RFID database
Compression vs. data bulkiness (P = 1000, N = 1,000,000, k = 5)
The map gives significant benefits for bulky data
For data where items move individually, we are better off using tag lists
(Charts omitted.)



RFID-Cube Compression (II)
Compression vs. abstraction level (P = 1000, B = (500, 150, 40, 8, 1), k = 5, N = 1,000,000)
The map provides significant savings over using tag lists
At very high levels of abstraction the stay table is very small; most of the space is used in recording RFID tags
(Chart omitted.)



RFID-Cube Construction Time
Construction time (P = 1000, B = (500, 150, 40, 8, 1), k = 5, N = 1,000,000)
Constructing from a lower-level cuboid saves 50% to 80% of the construction time
(Chart omitted.)



Query Processing
Time vs. database size (P = 1000, B = (500, 150, 40, 8, 1), k = 5)
Speedup due to the stay table: one order of magnitude
Speedup due to the stay table plus the map table: two orders of magnitude
Time vs. bulkiness (P = 1000, k = 5)
The speedup is most significant for bulky paths
For non-bulky paths, performance is no worse than using the cleansed table
(Charts omitted.)



Discussion
Our RFID-cube model works well for bulky object movements
But there are many applications where this assumption does not hold and other models are needed
We have focused only on warehousing RFID data; a variety of other problems remain open:
Path classification and clustering
Workflow analysis
Trend analysis
Sophisticated RFID data cleaning



Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions



Linking RFID Data Analysis with HPC
High-performance computing will play an important role in RFID data warehousing and data analysis
Most of the data-cleaning process can be done in a parallel and distributed manner
Stay and map tables can be constructed in parallel
Parallel computation and consolidation of multi-layer and multi-path data cubes
Queries and mining can be processed in parallel
Parallel RFID Data Mining: Promising
Parallel computing has been successfully applied to rather sophisticated data mining algorithms
Parallelizing frequent-pattern mining (based on FP-growth):
Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David Padua, A Sampling-Based Framework for Parallel Data Mining, PPoPP'05 (ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming)
Parallelizing sequential-pattern mining (based on PrefixSpan):
Shengnan Cong, Jiawei Han, and David Padua, Parallel Mining of Closed Sequential Patterns, KDD'05
Parallel RFID data analysis is highly promising



Mining Frequent Patterns
Breadth-first search vs. depth-first search
(Figure: the itemset lattice over {A, B, C, D, E}, from the empty set down to ABCDE, shown once under breadth-first and once under depth-first traversal)
Depth-first mining algorithms have proved to be more efficient
Depth-first mining algorithms are also more convenient to parallelize
Parallel Frequent-Pattern Mining
Target platform: distributed-memory systems
Framework for parallelization (see the sketch below):
Step 1:
Each processor scans its local portion of the dataset and accumulates the number of occurrences of each item
A reduction obtains the global occurrence counts
Step 2:
Partition the frequent items and assign a subset to each processor
Each processor builds projections for its assigned items
Step 3:
Each processor mines its local projections independently
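The papers' implementations are in C++ with MPI; purely as an illustration of the three-step skeleton, here is a minimal Python sketch assuming mpi4py (the data, the support threshold, and the round-robin assignment are stand-ins, not the papers' actual policy):

```python
# Sketch of the parallel framework, assuming mpi4py.
# Run with, e.g., `mpiexec -n 4 python parfp.py`.
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Hypothetical local partition of the transaction database.
local_records = [["a", "c", "m"], ["b", "c", "m"], ["b", "f"]]

# Step 1: local occurrence counts, then a reduction for the global counts
# (mpi4py applies MPI.SUM as '+' on Python objects; Counters add up).
local_counts = Counter(i for rec in local_records for i in rec)
global_counts = comm.allreduce(local_counts, op=MPI.SUM)

# Step 2: partition the frequent items among the processors
# (round-robin here; the papers use smarter, sample-guided assignment).
min_sup = 2  # assumed threshold
frequent = sorted(i for i, c in global_counts.items() if c >= min_sup)
my_items = frequent[comm.rank::comm.size]

# Step 3: each processor builds the projected databases for its items
# and mines them independently with a depth-first miner (not shown).
```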



Parallel Frequent-Pattern Mining (2)
Load-balancing problem
The mining time of some projections is too large relative to the overall mining time (Maximal/Overall = mining time of the largest projection as a fraction of the total):

(a) Datasets for frequent-itemset mining
Dataset:          mushroom  connect  pumsb  pumsb_star  T30I0.2D1K  T40I10D100K  T50I5D500K
Maximal/Overall:  7.66%     12.4%    14.7%  47.6%       42.1%       4.15%        3.07%

(b) Datasets for sequential-pattern mining
Dataset:          C10N0.1T8S8I8  C50N10T8S20I2.5  C100N5T2.5S10I1.25  C200N10T2.5S10I1.25  C100N20T2.5S10I1.25
Maximal/Overall:  21.4%          13.8%            14.2%               10.1%                11.6%

(c) Datasets for closed-sequential-pattern mining
Dataset:          C100S50N10  C100S100N5  C200S25N9  gazelle
Maximal/Overall:  15.3%       5.02%       4.53%      25.9%

Solution: the large projections must be partitioned
Challenge: how to identify the large projections?
How to Identify the Large Projections?
To identify the large projections, we need an estimate of the relative mining time of each projection
Static estimation
Study the correlation with dataset parameters: number of items, number of records, width of records, etc.
Study the correlation with characteristics of the projection: depth, bushiness, tree size, number of leaves, fan-out/fan-in, etc.
Result: no rule relating the above parameters to projection mining time was found
Dynamic Estimation
Runtime sampling
Use the relative mining time of a sample to estimate the relative mining time of the whole dataset
Trade-off: accuracy vs. overhead
Random sampling: randomly select a subset of records
Not accurate with small sample sizes (e.g., 1% random sampling on the pumsb dataset)
Becomes accurate when the sample size exceeds 30%, but the sampling overhead is then over 50%



Selective Sampling
Selective sampling: from each record, some items are removed
In frequent-itemset mining:
Discard the infrequent items
Discard a fraction t of the most frequent items
Example (support threshold = 2, t = 20%): the item counts are f:4, b:3, c:3, m:3, a:2, d:1; d is infrequent and f is the top 20% of the frequent items, so both are discarded (see the sketch below)

Dataset           Selective sample
1: a, c, d, f, m  1: a, c, m
2: b, c, f, m     2: b, c, m
3: b, f           3: b
4: b, c           4: b, c
5: a, f, m        5: a, m
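A runnable sketch that reproduces the slide's example (the function and variable names are my own; ties among equally frequent items are broken arbitrarily, which does not affect this example):

```python
# Sketch of selective sampling for frequent-itemset mining: drop the
# infrequent items and the top fraction t of the most frequent items.
from collections import Counter

def selective_sample(records, min_sup=2, t=0.2):
    counts = Counter(item for rec in records for item in rec)
    frequent = [i for i, c in counts.most_common() if c >= min_sup]
    top_k = int(t * len(frequent))          # most-frequent items to drop
    infrequent = {i for i, c in counts.items() if c < min_sup}
    removed = set(frequent[:top_k]) | infrequent
    return [[i for i in rec if i not in removed] for rec in records]

records = [list("acdfm"), list("bcfm"), list("bf"), list("bc"), list("afm")]
print(selective_sample(records))
# [['a','c','m'], ['b','c','m'], ['b'], ['b','c'], ['a','m']]
```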



Accuracy of Selective Sampling
(Charts omitted: accuracy results of selective sampling.)


Overhead of Selective Sampling
(Charts omitted: sampling overhead on (a) datasets for frequent-itemset mining, (b) datasets for sequential-pattern mining, (c) datasets for closed-sequential-pattern mining.)



Experimental Setups
Two Linux clusters, using up to 64 processors
Cluster A: 1 GHz Pentium III processors, 1 GB memory
Cluster B: 1.3 GHz Intel Itanium 2 processors, 2 GB memory
Implemented in C++ using MPI
Dataset generator from IBM
Datasets: summarized on the next slide



Experimental Setups
(Table omitted: characteristics of the datasets used.)


Speedups with One-Level Task Partitioning
Parallel frequent-itemset mining: Par-FP speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, T50I5D500K.)



Effectiveness of Selective Sampling
Multi-level task partitioning compared with one-level partitioning and the optimal speedup
(Charts omitted: speedup curves for mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, T50I5D500K, each showing optimal, one-level, and multi-level.)


Speedups with One-Level Task Partitioning (Sequential Patterns)
Parallel sequential-pattern mining: Par-Span speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for C10N0.1T8S8I8, C50N10T8S20I2.5, C100N5T2.5S10I1.25, C100N20T2.5S10I1.25, C200N10T2.5S10I1.25.)



Speedups with One-Level Task Partitioning (Closed Sequential Patterns)
Parallel closed-sequential-pattern mining: Par-CSP speedup vs. optimal, on 1 to 64 processors
(Charts omitted: speedup curves for C100S50N10, C100S100N5, C200S25N9, gazelle.)



Effectiveness of Selective Sampling
One-level task partitioning with 64 processors: the speedups are improved by more than 50% on average
(Chart omitted: speedups with vs. without sampling on 64 processors.)



Conclusions
A new RFID warehouse model:
Allows efficient and flexible analysis of RFID data in multidimensional space
Preserves the path structure of the data
Compresses data by exploiting bulky movements, concept hierarchies, and path collapsing
High-performance computing will benefit RFID data warehousing and data mining tremendously
Efficient and highly parallel algorithms can be developed for RFID data analysis
