
Data Mining and Knowledge Discovery
Assignment 2

Student: Angelos Ikonomakis (s161216)
Instructor: Jae-Gil Lee

Technical University of Denmark
Korea Advanced Institute of Science and Technology

April 14, 2017

Contents

1 Question 1
  1.1 Description
  1.2 Answer

2 Question 2
  2.1 Description
  2.2 Answer

3 Question 3
  3.1 Description
  3.2 Answer

1 Question 1
1.1 Description
The Apriori algorithm uses a generate-and-count strategy for deriving frequent
itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent
itemsets of size k (this is known as the candidate generation step). A candidate is
discarded if any one of its subsets is found to be infrequent during the candidate pruning
step. Suppose the Apriori algorithm is applied to the dataset shown below with
minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered to
be infrequent.

Transaction ID   Items Bought
1                {a,b,d,e}
2                {b,c,d}
3                {a,b,d,e}
4                {a,c,d,e}
5                {b,c,d,e}
6                {b,d,e}
7                {c,d}
8                {a,b,c}
9                {a,d,e}
10               {b,d}

Figure 1: Example of market basket transactions

1. Draw an itemset lattice representing the dataset given in the above table. Label
each node in the lattice with the following letter(s):

• N: if the itemset is not considered to be a candidate itemset by the Apriori
algorithm. There are two reasons for an itemset not to be considered a
candidate itemset: (1) it is not generated at all during the candidate
generation step, or (2) it is generated during the candidate generation step
but is subsequently removed during the candidate pruning step because one
of its subsets is found to be infrequent.
• F: if the candidate itemset is found to be frequent by the Apriori algorithm.
• I: if the candidate itemset is found to be infrequent after support counting.

2. What is the percentage of frequent itemsets (with respect to all itemsets in the
lattice)?

3. What is the pruning ratio of the Apriori algorithm on this data set? (Pruning
ratio is defined as the percentage of itemsets not considered to be a candidate
because (1) they are not generated during candidate generation or (2) they are
pruned during the candidate pruning step.)

4. What is the false alarm rate (i.e., the percentage of candidate itemsets that are
found to be infrequent after performing support counting)?

1.2 Answer
1. First, before drawing the lattice, we should calculate the support of each itemset:

s1 = σ(a,b,d,e)/|T| = 2/10 = 0.2
s2 = σ(b,c,d)/|T| = 2/10 = 0.2
s3 = σ(a,b,d,e)/|T| = 2/10 = 0.2
s4 = σ(a,c,d,e)/|T| = 1/10 = 0.1
s5 = σ(b,c,d,e)/|T| = 1/10 = 0.1
s6 = σ(b,d,e)/|T| = 4/10 = 0.4
s7 = σ(c,d)/|T| = 4/10 = 0.4
s8 = σ(a,b,c)/|T| = 1/10 = 0.1
s9 = σ(a,d,e)/|T| = 4/10 = 0.4
s10 = σ(b,d)/|T| = 6/10 = 0.6

Then we should create a frequency table for the itemsets at each level.

1-itemsets          2-itemsets          3-itemsets
Item  Count         Item  Count         Item    Count
a     5             a,b   3             a,b,d   2
b     7             a,c   2             a,b,e   2
c     5             a,d   4             b,c,d   2
d     9             a,e   4             a,d,e   4
e     6             b,c   3             b,d,e   4
                    b,d   6
                    b,e   4
                    c,d   4
                    c,e   2
                    d,e   6

Figure 2: Frequency tables, (1-itemsets) – (2-itemsets) – (3-itemsets)

Figure 3: Labeled itemset lattice, (green: I) – (red: N) – (white: F)
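As a sanity check, the same frequent itemsets can be reproduced with the arules package that is used in Question 3. This is only a sketch, assuming the ten transactions are typed in manually from Figure 1:

library(arules)

# The ten transactions from Figure 1, entered by hand
trans <- as(list(
  c("a","b","d","e"), c("b","c","d"),     c("a","b","d","e"),
  c("a","c","d","e"), c("b","c","d","e"), c("b","d","e"),
  c("c","d"),         c("a","b","c"),     c("a","d","e"),
  c("b","d")
), "transactions")

# minsup = 30%; mine frequent itemsets rather than rules
freq <- apriori(trans, parameter = list(support = 0.3,
                                        target = "frequent itemsets"))
inspect(sort(freq, by = "support"))

The output should list the 15 itemsets labeled F in the lattice.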

2. The percentage of frequent itemsets is calculated as the number of frequent
itemsets in the lattice divided by the total number of itemsets (2^5 − 1 = 31,
excluding the empty set). Thus,

Freq = Σ(F)/31 = 15/31 ≈ 48.4%

3. The pruning ratio of the algorithm is calculated by summing the number of in-
frequent candidates (I) and non-candidates (N) and then dividing by the total
number of itemsets. Thus,

Prun = Σ(I + N)/31 = 16/31 ≈ 51.6%

4. The false alarm rate is calculated by dividing the number of itemsets found to
be infrequent after support counting by the total number of itemsets. Thus,

Alarm = Σ(I)/31 = 5/31 ≈ 16.1%
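For completeness, the three ratios can be checked with a few lines of R. The counts below are read off the labeled lattice of Figure 3, so they are only as good as the labeling:

# Counts from the labeled lattice (Figure 3)
nF <- 15; nI <- 5; nN <- 11        # frequent, infrequent, non-candidate
total <- nF + nI + nN              # 31 non-empty itemsets over 5 items

nF / total           # percentage of frequent itemsets, ~0.484
(nI + nN) / total    # pruning ratio as computed above, ~0.516
nI / total           # false alarm rate, ~0.161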

2 Question 2
2.1 Description
The following contingency table summarizes supermarket transaction data, where
"hot dogs" refers to the transactions containing hot dogs, "¬hot dogs" refers to the
transactions that do not contain hot dogs, "hamburgers" refers to the transactions
containing hamburgers, and "¬hamburgers" refers to the transactions that do not
contain hamburgers.

              hot dogs   ¬hot dogs   Σrow
hamburgers        2000         500   2500
¬hamburgers       1000        1500   2500
Σcol              3000        2000   5000

Figure 4: Contingency table

1. Suppose that the association rule "hot dogs ⇒ hamburgers" is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%,
is this association rule strong?

2. Based on the given data, is the purchase of hot dogs independent of the purchase
of hamburgers? If not, what kind of correlation relationship exists between the
two?

3. Compare the use of the all_confidence, max_confidence, Kulczynski, and cosine
measures with lift and correlation on the given data.

2.2 Answer
1. In order for the association rule to be strong, the support should be greater than
the minimum support threshold and the confidence should be greater than the
minimum confidence threshold.

In our case,

sup = σ(hot dogs, hamburgers)/|T| = 2000/5000 = 40%
conf = σ(hot dogs, hamburgers)/σ(hot dogs) = 2000/3000 = 66.7%

Both are greater than their thresholds, so we can say that the association rule is
strong.

2. In order to check the dependence and correlation between the two itemsets, we
should calculate their lift. Two itemsets are independent when the occurrence of
one (A) is independent of the occurrence of the other (B). That occurs when

P(A ∪ B) = P(A)P(B)

where P(A ∪ B), following the textbook's notation, denotes the probability that
a transaction contains both itemsets. Independence means that the lift, calculated
by the following equation, equals 1:

lift = P(A ∪ B) / (P(A)P(B))

If the lift is greater than 1, the itemsets are positively correlated, and if it is less
than 1, they are negatively correlated. So,

lift = P(hot dogs ∪ hamburgers) / (P(hot dogs) P(hamburgers))
     = (2000/5000) / ((3000/5000)(2500/5000)) = 1.33

Since the lift is greater than 1, the itemsets are positively correlated; the purchase
of hot dogs is not independent of the purchase of hamburgers.
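These numbers are straightforward to verify in R. A minimal sketch, assuming only the four counts from the contingency table in Figure 4:

n    <- 5000   # total transactions
hd   <- 3000   # transactions containing hot dogs
hb   <- 2500   # transactions containing hamburgers
both <- 2000   # transactions containing both

sup  <- both / n                              # 0.4
conf <- both / hd                             # 0.667
lift <- (both / n) / ((hd / n) * (hb / n))    # 1.333
c(support = sup, confidence = conf, lift = lift)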

3. In order to calculate the all_confidence, max_confidence, Kulczynski, and cosine
measures alongside lift and correlation, we will use the following equations:

all_conf(A, B) = sup(A ∪ B) / max(sup(A), sup(B))

max_conf(A, B) = max( sup(A ∪ B)/sup(A), sup(A ∪ B)/sup(B) )

Kulc(A, B) = (sup(A ∪ B)/2) · (1/sup(A) + 1/sup(B))

cosine(A, B) = sup(A ∪ B) / √(sup(A) sup(B))

Lift will be calculated using the equation from the previous sub-question.

            h∧b   ¬h∧b   h∧¬b   ¬h∧¬b   all_conf   max_conf   Kulc    cosine   Lift
Dataset    2000    500   1000    1500      0.667        0.8   0.733    0.730   1.33

Figure 5: Interestingness table (h = hot dogs, b = hamburgers)

All four null-invariant measures (all_conf = 0.667, max_conf = 0.8, Kulc = 0.733,
cosine = 0.730) lie well above 0.5, indicating a positive association between hot dogs
and hamburgers, which agrees with the lift of 1.33 and the positive correlation found
above. Unlike lift, however, these four measures are not influenced by the number of
transactions containing neither item.
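The same quantities can be computed in R directly from the supports. A minimal sketch, continuing from the counts above:

s_h <- 3000 / 5000; s_b <- 2500 / 5000; s_hb <- 2000 / 5000

all_conf <- s_hb / max(s_h, s_b)              # 0.667
max_conf <- max(s_hb / s_h, s_hb / s_b)       # 0.8
kulc     <- (s_hb / 2) * (1 / s_h + 1 / s_b)  # 0.733
cosine   <- s_hb / sqrt(s_h * s_b)            # 0.730
lift     <- s_hb / (s_h * s_b)                # 1.333
round(c(all_conf, max_conf, kulc, cosine, lift), 3)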

3 Question 3
3.1 Description
Install R and then the two packages arules and arulesViz. Answer the following ques-
tions using R. For each question, hand in your R code as well as your answer (result).

1. Load the “Groceries” data set. Please obtain the following information: (i) the
most frequent item, (ii) the length of the longest transaction, and (iii) the first
five transactions.

2. Mine all association rules with the minimum support 0.001 and the minimum
confidence 0.8.

3. Draw a scatter plot for all association rules. Here, the x-axis represents the
support, the y-axis represents the confidence, and the shading of a point
represents the lift. [Hint: use the "plot" function in the arulesViz package.]

4. Select the top-3 association rules according to the lift and print these rules.

5. Draw the top-3 rules as a graph such that a node becomes an item. [Hint: use
the “plot” function in the arulesViz package.]

Manuals for R packages:

• arules

• arulesViz

3.2 Answer
1. Before answering the questions, we should first install the packages in RStudio
and load the "Groceries" dataset.

# Install dependencies
install.packages("arules")

# Load libraries
library("Matrix")
library("arules")

# Load the Groceries dataset
data("Groceries")


Now we are able to run some statistics on the dataset. First, we can find the
most frequent item and the length of the longest transaction just by checking
the summary of the dataset. Thus,

# Take a look at the data
summary(Groceries)

Figure 6: Summary output

We can see that the most frequent item is "whole milk" and that the longest
transaction consists of 32 items. Below is the code for inspecting the first five
transactions, followed by the console output.

# Filter the data by index
inspect(Groceries[1:5])


Figure 7: Inspect output
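The same information can also be read off directly, without scanning the full summary. A small sketch using the itemFrequency and size helpers from arules:

# Top-5 items by support
head(sort(itemFrequency(Groceries), decreasing = TRUE), n = 5)

# Number of items in the longest transaction
max(size(Groceries))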

2. Now that we have run some simple statistics on the dataset and know what it
consists of, we are able to mine association rules from the itemsets.

# Apply Apriori and extract rules
rules <- apriori(Groceries, parameter = list(support = 0.001,
                                             confidence = 0.8))

# Check that rules exist with those minimum thresholds
rules

Figure 8: Rules output

When we inspect the first 10 rules, we see the list below.

# Inspect rules
inspect(rules[1:10])


Figure 9: First 10 rules output

And lastly, the summary of the rules is the following.

# Check the rules' summary
summary(rules)

Figure 10: Summary rules output

3. In order to draw the scatter plot, we should first install the arulesViz package
and then load the library.

# Install dependencies
install.packages("arulesViz")

# Load libraries
library("arulesViz")

# Plot rules
plot(rules)

Figure 11: Rules plot output

4. The top-3 association rules according to lift are the following.

# Show the rules sorted by the lift column
inspect(head(sort(rules, by = "lift"), n = 3))

Figure 12: Top-3 rules output
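As a side note, recent versions of arules also let head take a by argument, so the sort can be folded into a single call. A sketch, assuming such a version is installed:

# Equivalent one-liner in newer arules releases
inspect(head(rules, n = 3, by = "lift"))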

5. In order to draw the graph of those rules, we first need to save them in a
separate variable to feed into the plot.

# Create a subrules variable to draw the graph
subrules <- subset(rules, lift >= 8.34)

# Draw the graph
plot(subrules, method = "graph")

Figure 13: Subrules graph output
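A note on this selection: the threshold 8.34 appears to be read off the lift of the third rule in the previous output, so it would break if the data or the mining parameters changed. A sketch of a more robust alternative that reuses the sorted head from the previous step:

# Select the top-3 rules by lift without a hard-coded threshold
subrules <- head(sort(rules, by = "lift"), n = 3)
plot(subrules, method = "graph")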
