DW Stanford

Data Warehousing Overview
CS245 Notes 11
Hector Garcia-Molina
Stanford University
CS 245
Notes11
Outline
What is a data warehouse?
Why a warehouse?
Models & operations
Implementing a warehouse
CS 245
Notes11
What is a Warehouse?
Collection of diverse data

subject
oriented
aimed at executive, decision maker
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
more
CS 245
Notes11
What is a Warehouse?
Collection of tools
gathering
data
cleansing, integrating, ...
querying, reporting, analysis
data mining
monitoring, administering warehouse
CS 245
Notes11
Warehouse Architecture
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 245
Source
Notes11
Source
5
Motivating Examples
Forecasting
Comparing performance of units
Monitoring, detecting fraud
Visualization
CS 245
Notes11
Alternative to Warehousing
Two Approaches:
Query-Driven
(Lazy)
Warehouse (Eager)
?
Source
CS 245
Notes11
Source
Query-Driven Approach
Client
Client
Mediator
Wrapper
Source
CS 245
Wrapper
Wrapper
Source
Source
Notes11
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify,
summarize (store aggregates)

Add historical information
CS 245
Notes11
Advantages of Query-Driven
No need to copy data

less
storage
no need to purchase data
More up-to-date data

Query needs can be unknown
Only query interface needed at sources
May be less draining on sources
CS 245
Notes11
10
Warehouse Models & Operators
Data Models
relational
cubes
Operators
CS 245
Notes11
11
Star
product
prodId
p1
p2
name price
bolt
10
nut
5
sale oderId date

o100 1/7/97
o102 2/7/97
105 3/8/97
customer
CS 245
custId
53
81
111
store
custId
53
53
111
prodId
p1
p2
p1
name
joe
fred
sally
Notes11
storeId
c1
c1
c3
address
10 main
12 main
80 willow
qty
1
2
5
storeId
c1
c2
c3
city
nyc
sfo
la
amt
12
11
50
city
sfo
sfo
la
12
Star Schema
product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
CS 245
Notes11
13
Terms
Fact table
Dimension tables
Measures
product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
CS 245
Notes11
14
Dimension Hierarchies
sType
store
store storeId
s5
s7
s9
city
cityId
sfo
sfo
la
tId
t1
t2
t1
mgr
joe
fred
nancy
region
sType tId
t1
t2
city
snowflake schema
constellations
CS 245
size
small
large
cityId pop
sfo
1M
la
5M
location
downtown
suburbs
regId
north
south
region regId
name
north cold region
south warm region
Notes11
15
Cube
Fact table view:
sale
prodId
p1
p2
p1
p2
storeId
c1
c1
c3
c2
Multi-dimensional cube:
amt
12
11
50
8
p1
p2
c1
12
11
c2
c3
50
dimensions = 2
CS 245
Notes11
16
3-D Cube
Fact table view:
sale
prodId
p1
p2
p1
p2
p1
p1
storeId
c1
c1
c3
c2
c1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
dimensions = 3
CS 245
Notes11
17
Operators
Traditional
Relational
Cube
selection
aggregation
...
Analysis
clean
data
find trends
...
CS 245
Notes11
18
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale
CS 245
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
Notes11
81
19
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
CS 245
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
Notes11
ans
date
1
2
sum
81
48
20
Another Example
Add up amounts by day, product
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
sale
prodId
p1
p2
p1
date
1
1
2
amt
62
19
48
rollup
drill-down
CS 245
Notes11
21
Aggregates
Operators: sum, count, max, min,
median, ave
Having clause
Using dimension hierarchy
average
by region (within store)

maximum by month (within date)
CS 245
Notes11
22
Cube Aggregation
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
c3
50
c2
4
8
rollup
drill-down
CS 245
Example: computing sums

...
c3
50
sum
c1
67
c2
12
c3
50
129
p1
p2
Notes11
sum
110
19
23
Cube Operators
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
56
11
p1
p2
c1
44
c2
4
c2
...
c3
50
sale(c1,*,*)
c2
4
8
c3
50
sale(c2,p2,*)
CS 245
c3
sum
c1
67
c2
12
c3
50
129
p1
p2
Notes11
sum
110
19
sale(*,*,*)
24
Extended Cube
c2
4
8
c312
p1
p2
c1
*
12
p1
p2
c1*
44
c1
56
11
c267
4
c2
44
c3
4
50
11
23
8
8
50
*
62
19
81
day 2
day 1
CS 245
p1
p2
*
Notes11
c3
50
* 50
48
48
*
110
19
129
sale(*,p2,*)
25
Aggregation Using Hierarchies
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
customer
region
country
p1
p2
CS 245
region A region B
56
54
11
8
(customer c1 in Region A;
customers c2, c3 in Region B)
Notes11
26
Data Analysis
Decision Trees
Clustering
Association Rules
CS 245
Notes11
27
Decision Trees
Example:
Conducted survey to see what customers were
interested in new model car
Want to select customers for advertising campaign
sale
CS 245
custId
c1
c2
c3
c4
c5
c6
car
taurus
van
van
taurus
merc
taurus
age
27
35
40
22
50
25
Notes11
city newCar
sf
yes
la
yes
sf
yes
sf
yes
la
no
la
no
training
set
28
One Possibility
sale
custId
c1
c2
c3
c4
c5
c6
age<30
Y
city=sf
Y
CS 245
age
27
35
40
22
50
25
city newCar
sf
yes
la
yes
sf
yes
sf
yes
la
no
la
no
car=van
N
likely
car
taurus
van
van
taurus
merc
taurus
unlikely
likely
unlikely
Notes11
29
Another Possibility
sale
custId
c1
c2
c3
c4
c5
c6
car=taurus
Y
city=sf
Y
CS 245
age
27
35
40
22
50
25
city newCar
sf
yes
la
yes
sf
yes
sf
yes
la
no
la
no
age<45
N
likely
car
taurus
van
van
taurus
merc
taurus
unlikely
likely
unlikely
Notes11
30
Issues
Decision tree cannot be too deep
would not have statistically significant amounts of

data for lower decisions
Need to select tree that most reliably

predicts outcomes
CS 245
Notes11
31
Clustering
income
education
age
CS 245
Notes11
32
Clustering
income
education
age
CS 245
Notes11
33
Another Example: Text
Each document is a vector

e.g.,
<100110...> contains words 1,4,5,...
Clusters contain similar documents

Useful for understanding, searching
documents
sports
international
news
business
CS 245
Notes11
34
Issues
Given desired number of clusters?
Finding best clusters
Are clusters semantically meaningful?
e.g.,
yuppies cluster?
Using clusters for disk storage
CS 245
Notes11
35
Association Rule Mining

on
i
t
er
c
a
m
s
to
n
s
a
d
r
i
u
t
c
id
sales
records:
tran1
tran2
tran3
tran4
tran5
tran6
cust33
cust45
cust12
cust40
cust12
cust12
p2,
p5,
p1,
p5,
p2,
p9
ts
c
du ht
o
pr oug
b
p5, p8
p8, p11
p9
p8, p11
p9
market-basket
data
Trend: Products p5, p8 often bough together

Trend: Customer 12 likes product p9
CS 245
Notes11
36
Association Rule
Rule: {p1, p3, p8}
Support: number of baskets where these

products appear
High-support set: support threshold s
Problem: find all high support sets
CS 245
Notes11
37
Implementation Issues
ETL (Extraction, transformation, loading)

Getting
data to the warehouse

Entity Resolution
What to materialize?
Efficient Analysis
Association
rule mining
...
CS 245
Notes11
38
Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring
CS 245
Notes11
Advantages & Disadvantages!!
ETL: Monitoring Techniques
39
ETL: Data Cleaning
Migration (e.g., yen dollars)

Scrubbing: use domain-specific knowledge (e.g.,
social security numbers)
Fusion (e.g., mail list, customer merging)
billing DB
service DB
customer1(Joe)
merged_customer(Joe)
customer2(Joe)
Auditing: discover rules & relationships

(like data mining)
CS 245
Notes11
40
More details: Entity Resolution

e2
e1
N: a
41
A: b
CC#: c
Ph: e
N: a
Exp: d
Ph: e
Applications
comparison shopping
mailing lists
classified ads
N: a
customer files
counter-terrorism
e1
A: b
Ph: e
e2
N: a
42
CC#: c
Exp: d
Ph: e
Why is ER Challenging?
Huge data sets
No unique identifiers
Lots of uncertainty
Many ways to skin the cat
43
Taxonomy: Pairwise vs Global

Decide if r, s match only by looking at r, s?
Or need to consider more (all) records?
Nm: Pat Smith

Ad: 123 Main St
Ph: (650) 555-1212
Nm: Patrick Smith

Ad: 132 Main St
Ph: (650) 555-1212
or
Nm: Patricia Smith
Ad: 123 Main St
Ph: (650) 777-1111
44
Taxonomy: Pairwise vs Global
Global matching complicates things a lot!

e.g.,
change decision as new records arrive
Nm: Pat Smith

Ad: 123 Main St
Ph: (650) 555-1212
Nm: Patrick Smith

Ad: 132 Main St
Ph: (650) 555-1212
or
Nm: Patricia Smith
Ad: 123 Main St
Ph: (650) 777-1111
45
Taxonomy: Outcome
Partition of records
e.g.,
comparison shopping
Merged records
Nm: Pat Smith
Ad: 123 Main St
Ph: (650) 555-1212
46
Nm: Patricia Smith

Ad: 132 Main St
Ph: (650) 777-1111
Hair: Black
Nm: Patricia Smith

Ad: 123 Main St
Ph: (650) 555-1212
(650) 777-1111
Hair: Black
Taxonomy: Outcome
Iterate
after merging
Nm: Tom
Ad: 123 Main
BD: Jan 1, 85
Wk: IBM
Nm: Tom
Ad: 123 Main
BD: Jan 1, 85
Wk: IBM
Oc: lawyer
47
Nm: Thomas
Ad: 123 Maim
Oc: lawyer
Nm: Tom
Wk: IBM
Oc: laywer
Sal: 500K
Nm: Tom
Ad: 123 Main
BD: Jan 1, 85
Wk: IBM
Oc: lawyer
Sal: 500K
Taxonomy: Record Reuse
One record related to multiple entities?

Nm: Pat Smith Sr.
Ph: (650) 555-1212
Ph: (650) 555-1212
Ad: 123 Main St
Nm: Pat Smith Jr.

Ph: (650) 555-1212
48
Nm: Pat Smith Sr.

Ph: (650) 555-1212
Ad: 123 Main St
Nm: Pat Smith Jr.
Ph: (650) 555-1212
Ad: 123 Main St
Partitions
Merges
t
rs
s
st
t
49
Partitions
Merges
t
rs
s
st
t
Record reuse
50
complex and expensive!
Taxonomy: Multiple Entity Types

person 1
person 2
brother
Organization A
member
member
business
Organization B
51
Taxonomy: Multiple Entity Types

authors
same??
a1
p1
a2
p2
a3
p5
a4
p7
a5
52
papers
Taxonomy: Exact vs Approximate

cameras
CDs
ER
resolved
cameras
ER
resolved
CDs
ER
resolved
books
products
books
...
53
...
Taxonomy: Exact vs Approximate
terrorists
sort
by age terrorists
B Cooper 30
54
match against
ages 25-35

Getting

Entity Resolution
Efficient Analysis
Association
rule mining
...
CS 245
Notes11
55
What to Materialize?
Store in warehouse results useful for
common queries
Example:
total sales
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
p1
p2
c1
56
11
c2
4
8
c3
50
materialize
CS 245
...
p1
c2
12
c3
50
129
p1
p2
Notes11
c1
67
c1
110
19
56
Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost
CS 245
Notes11
57
Cube Aggregates Lattice

129
c1
67
p1
c2
12
c3
50
city
city, product
p1
p2
c1
56
11
c2
4
8
all
product
city, date
date
product, date
c3
50
day 2
day 1
CS 245
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
city, product, date

Notes11
use greedy
algorithm to
decide what
to materialize
58
all
cities
state
city
c1
c2
state
CA
NY
city
CS 245
Notes11
59
all
city
city, product
product
city, date
date
product, date
state
city, product, date
state, date
state, product
state, product, date
not all arcs shown...

CS 245
Notes11
60
Interesting Hierarchy
time
all
years
weeks
quarters
day
1
2
3
4
5
6
7
8
week
1
1
1
1
1
1
1
2
month
1
1
1
1
1
1
1
1
quarter
1
1
1
1
1
1
1
1
year
2000
2000
2000
2000
2000
2000
2000
2000
conceptual
dimension table
months
days
CS 245
Notes11
61

Getting

Entity Resolution
Efficient Analysis
Association
rule mining
...
CS 245
Notes11
62
Finding High-Support Pairs

Baskets(basket, item)
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND
I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;
CS 245
Notes11
63
Finding High-Support Pairs

Baskets(basket, item)
FROM Baskets I, Baskets J
I.item < J.item
WHY?
CS 245
Notes11
64
Example
basket item
t1
p2
t1
p5
t1
p8
t2
p5
t2
p8
t2
p11
...
...
CS 245
basket item1 item2

t1
p2
p5
t1
p2
p8
t1
p5
p8
t2
p5
p8
t2
p5
p11
t2
p8
p11
...
...
...
Notes11
65
Example
basket item
t1
p2
t1
p5
t1
p8
t2
p5
t2
p8
t2
p11
...
...
CS 245
basket item1 item2

t1
p2
p5
t1
p2
p8
t1
p5
p8
t2
p5
p8
t2
p5
p11
t2
p8
p11
...
...
...
Notes11
check if
count s
66
Issues
Performance for size 2 rules

big
basket
t1
t1
t1
t2
t2
t2
...
item
p2
p5
p8
p5
p8
p11
...
basket
t1
t1
t1
t2
t2
t2
...
item1
p2
p2
p5
p5
p5
p8
...
item2
p5
p8
p8
p8
p11
p11
...
even
bigger!
Performance for size k rules
CS 245
Notes11
67
Association Rules
How do we perform rule mining efficiently?
CS 245
Notes11
68
Association Rules
Observation: If set X has support t, then
each X subset must have at least support t
CS 245
Notes11
69
Association Rules
Observation: If set X has support t, then
each X subset must have at least support t
For 2-sets:
if
we need support s for {i, j}

then each i, j must appear in at least s
baskets
CS 245
Notes11
70
Algorithm for 2-Sets

(1) Find OK products
those
appearing in s or more baskets
(2) Find high-support pairs

using only OK products
CS 245
Notes11
71
INSERT INTO okBaskets(basket, item)

SELECT basket, item
FROM Baskets
GROUP BY item
HAVING COUNT(basket) >= s;
CS 245
Notes11
72
INSERT INTO okBaskets(basket, item)

SELECT basket, item
FROM Baskets
GROUP BY item
HAVING COUNT(basket) >= s;
Perform mining on okBaskets
FROM okBaskets I, okBaskets J
I.item < J.item
CS 245
Notes11
73
Counting Efficiently
One way:
threshold = 3
basket I.item J.item

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
CS 245
Notes11
74
One way:

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
CS 245
sort
threshold = 3
t3
p2
p3
t3
p2
p8
t1
p5
p8
t2
p5
p8
t3
p5
p8
t2
p8
p11
...
...
...
Notes11
75
One way:

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
CS 245
sort
threshold = 3
t3
p2
p3
t3
p2
p8
t1
p5
p8
t2
p5
p8
t3
p5
p8
t2
p8
p11
...
...
...
Notes11
count &
remove
count I.item J.item

3
p5
p8
5
p12 p18
...
...
...
76
Another way:
threshold = 3

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
CS 245
Notes11
77
Another way:

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
scan &
count
threshold = 3
count I.item J.item
1
p2
p3
2
p2
p8
3
p5
p8
5
p12
p18
1
p21
p22
2
p21
p23
...
...
...
keep counter
array in memory
CS 245
Notes11
78
Another way:

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
scan &
count
threshold = 3
count I.item J.item
1
p2
p3
2
p2
p8
3
p5
p8
5
p12
p18
1
p21
p22
2
p21
p23
...
...
...
remove
count I.item J.item

3
p5
p8
5
p12 p18
...
...
...
keep counter
array in memory
CS 245
Notes11
79
Another way:

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
scan &
count
threshold = 3
count I.item J.item
1
p2
p3
2
p2
p8
3
p5
p8
5
p12
p18
1
p21
p22
2
p21
p23
...
...
...
remove
count I.item J.item

3
p5
p8
5
p12 p18
...
...
...
keep counter
array in memory
CS 245
Notes11
80
Yet Another Way

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
CS 245
(1)
scan &
hash &
count
count
1
5
2
1
8
1
...
bucket
A
B
C
D
E
F
...
Notes11
in-memory
hash table
threshold = 3
81
Yet Another Way

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
(2) scan &

remove
CS 245
(1)
scan &
hash &
count
count
1
5
2
1
8
1
...
bucket
A
B
C
D
E
F
...
in-memory
hash table
threshold = 3

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p5
p8
t5
p12 p18
t8
p12 p18
...
...
...
Notes11
82
Yet Another Way

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
(1)
scan &
hash &
count
(2) scan &

remove
false positive
CS 245
count
1
5
2
1
8
1
...
bucket
A
B
C
D
E
F
...
in-memory
hash table
threshold = 3

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p5
p8
t5
p12 p18
t8
p12 p18
...
...
...
Notes11
83
Yet Another Way

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
(1)
scan &
hash &
count
(2) scan &

remove
false positive
CS 245
count
1
5
2
1
8
1
...
bucket
A
B
C
D
E
F
...
in-memory
hash table
threshold = 3
in-memory
counters
t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p5
p8
t5
p12 p18
t8
p12 p18
...
...
...
Notes11
count I.item J.item

3
p5
p8
1
p8
p11
5
p12
p18
...
...
...
(3) scan& count

84
Yet Another Way

t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p2
p3
t3
p5
p8
t3
p2
p8
...
...
...
(1)
scan &
hash &
count
(2) scan &

remove
false positive
CS 245
count
1
5
2
1
8
1
...
bucket
A
B
C
D
E
F
...
in-memory
hash table
in-memory
counters
t1
p5
p8
t2
p5
p8
t2
p8
p11
t3
p5
p8
t5
p12 p18
t8
p12 p18
...
...
...
Notes11
threshold = 3
count I.item J.item

3
p5
p8
5
p12 p18
...
...
...
(4) remove
count I.item J.item

3
p5
p8
1
p8
p11
5
p12
p18
...
...
...
(3) scan& count

85
Discussion
Hashing scheme: 2 (or 3) scans of data

Sorting scheme: requires a sort!
Hashing works well if few high-support pairs
and many low-support ones
CS 245
Notes11
86
Discussion
frequency
Hashing scheme: 2 (or 3) scans of data

Sorting scheme: requires a sort!
Hashing works well if few high-support pairs
and many low-support ones
iceberg queries
threshold
item-pairs ranked by frequency
CS 245
Notes11
87

Getting

Entity Resolution
Efficient Analysis
Association
rule mining
...
CS 245
Notes11
88
Extra: Data Mining in the InfoLab

Recommendations in CourseRank
quarters
user
u1
u2
u3
u4
u
CS 245
q1
a: 5
a: 1
g: 4
b: 2
a: 5
q2
b: 5
e: 2
h: 2
g: 4
g: 4
Notes11
q3
d: 5
d: 4
e: 3
h: 4
e: 4
q4
f: 3
f: 3
e: 4
89

quarters
user
u1
u2
u3
u4
u
q1
a: 5
a: 1
g: 4
b: 2
a: 5
q2
b: 5
e: 2
h: 2
g: 4
g: 4
u3 and u4 are similar to u
CS 245
q3
d: 5
d: 4
e: 3
h: 4
e: 4
q4
f: 3
f: 3
e: 4
Recommend h
Notes11
90

quarters
user
u1
u2
u3
u4
u
q1
a: 5
a: 1
g: 4
b: 2
a: 5
q2
b: 5
e: 2
h: 2
g: 4
g: 4
q3
d: 5
d: 4
e: 3
h: 4
e: 4
q4
f: 3
f: 3
e: 4
Recommend d (and f, h)
CS 245
Notes11
91
Sequence Mining
Given a set of transcripts, use Pr[x|a]
to predict if x is a good recommendation
given user has taken a.
Two issues...
CS 245
Notes11
92
Pr[x|a] Not Quite Right

transcript containing
1
2
a
3
x
4
a -> x
5
x -> a
target users transcript:

[ ... a .... || unknown ]
recommend x?
Pr[x|a] = 2/3
Pr[x|a~x] = 1/2
CS 245
Notes11
93
User Has Taken >= 1 Course

User has taken T= {a, b, c}
Need Pr[x|T~x]
Approximate as Pr[x|a~x b~x c~x ]
Expensive to compute, so...
CS 245
Notes11
94
percentage of ratings
CourseRank User Study
CS 245
good, expected
good, unexpected
Notes11
95

DW Stanford

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DW Stanford

Uploaded by

Copyright:

Available Formats

Data Warehousing Overview

Collection of diverse data

summarize (store aggregates)

No need to copy data

More up-to-date data

Warehouse Models & Operators

sale oderId date

by region (within store)

Example: computing sums

Aggregation Using Hierarchies

Decision tree cannot be too deep

would not have statistically significant amounts of

Need to select tree that most reliably

Another Example: Text

Each document is a vector

<100110...> contains words 1,4,5,...

Clusters contain similar documents

Using clusters for disk storage

Association Rule Mining

Trend: Products p5, p8 often bough together

Rule: {p1, p3, p8}

Support: number of baskets where these

ETL (Extraction, transformation, loading)

data to the warehouse

Advantages & Disadvantages!!

ETL: Monitoring Techniques

ETL: Data Cleaning

Migration (e.g., yen dollars)

Auditing: discover rules & relationships

More details: Entity Resolution

Taxonomy: Pairwise vs Global

Nm: Pat Smith

Nm: Patrick Smith

Taxonomy: Pairwise vs Global

Global matching complicates things a lot!

change decision as new records arrive

Nm: Pat Smith

Nm: Patrick Smith

Nm: Patricia Smith

Nm: Patricia Smith

Taxonomy: Record Reuse

One record related to multiple entities?

Nm: Pat Smith Jr.

Nm: Pat Smith Sr.

Taxonomy: Record Reuse

Taxonomy: Record Reuse

complex and expensive!

Taxonomy: Multiple Entity Types

Taxonomy: Multiple Entity Types

Taxonomy: Exact vs Approximate

Taxonomy: Exact vs Approximate

ETL (Extraction, transformation, loading)

data to the warehouse

Cube Aggregates Lattice

city, product, date

city, product, date

not all arcs shown...

ETL (Extraction, transformation, loading)

data to the warehouse

Finding High-Support Pairs

Finding High-Support Pairs

basket item1 item2

basket item1 item2

Performance for size 2 rules