
Frequent-Itemset Mining Using Locality-Sensitive Hashing
Ashwini Vasudevan
The Oxford College of Engineering, Bangalore, India.
ashuvasuma@gmail.com

Abstract: The Apriori algorithm is a classical algorithm for the frequent-itemset mining problem. A significant bottleneck in Apriori is the number of I/O operations involved and the number of candidates it generates. We investigate the role of LSH techniques to overcome these problems, without adding much computational overhead. We propose randomized variations of Apriori that are based on asymmetric LSH defined over Hamming distance and Jaccard similarity.
I. INTRODUCTION

Mining frequent itemsets in a transaction database first appeared in the context of analyzing supermarket transaction data for discovering association rules [1, 2]; however, this problem has since found applications in diverse domains like finding correlations [13], finding episodes [9], and clustering [14]. Mathematically, each transaction can be regarded as a subset of the items (an "itemset"), namely those that are present in the transaction. Given a database D of such transactions and a support threshold θ ∈ (0, 1), the primary objective of frequent itemset mining is to identify θ-frequent itemsets (denoted by FI; these are subsets of items that appear in at least a θ-fraction of transactions). Computing FI is a challenging problem of data mining. The question of deciding if there exists any FI with k items is known to be NP-complete [7] (by relating it to the existence of bi-cliques of size k in a given bipartite graph), but on a more practical note, simply checking the support of any itemset requires reading the transaction database, something that is computationally expensive since these databases are usually of an extremely large size. The state-of-the-art approaches try to reduce the number of candidates, or not generate candidates at all. The best known approach in the former line of work is the celebrated Apriori algorithm [2]. Apriori is based on the anti-monotonicity property of partially-ordered sets, which says that no superset of an infrequent itemset can be frequent. This algorithm works in a bottom-up fashion by generating itemsets of size l in level l, starting at the first level. After finding frequent itemsets at level l, they are joined pairwise to generate (l+1)-sized candidate itemsets; FI are identified among the candidates by computing their support explicitly from the data. The algorithm terminates when no more candidates are generated. Broadly, there are two down-sides to this simple but effective algorithm. The first one is that the algorithm has to compute the support of every itemset among the candidates, even the ones that are highly infrequent. Secondly, if an itemset is infrequent, but all its subsets are frequent, Apriori doesn't have any easy way of detecting this without reading every transaction of the candidates. A natural place to look for fast algorithms over large data are randomized techniques, so we investigated if LSH could be of any help. An earlier work by Cohen et al. [5] was also motivated by the same idea but worked on a different problem (see Sect. 1.2). LSH is explained in Sect. 2, but roughly, it is a randomized hashing technique which allows efficient retrieval of approximately "similar" elements (here, itemsets).
A. Our Contribution

In this work, we propose LSH-Apriori, a basket of three explicit variations of Apriori that use LSH for computing FI. LSH-Apriori handles both the above mentioned drawbacks of the Apriori algorithm. First, LSH-Apriori significantly cuts down on the number of infrequent candidates that are generated, and further, due to its dimensionality reduction property, saves on reading every transaction; secondly, LSH-Apriori can efficiently filter out infrequent itemsets without looking at every candidate. The first two variations essentially reduce computing FI to the approximate nearest neighbor (cNN) problem for Hamming distance and Jaccard similarity. Both these approaches can drastically reduce the number of false candidates without much overhead, but have a non-zero probability of error in the sense that some frequent itemset could be missed by the algorithm. Then we present a third variation which also maps FI to elements in the Hamming space but avoids the problem of these false negatives, incurring a small cost in time and space complexity. Our techniques are based on asymmetric LSH [12] and LSH with one-sided error [10], which were proposed very recently.

B. Related Work

There are a few hash-based heuristics to compute FI which outperform the Apriori algorithm, and PCY [11] is one of the most notable among them. PCY focuses on using hashing to efficiently utilize the main memory over each pass of the database. However, our objective and approach are both fundamentally different from those of PCY.

The work that comes closest to ours is by Cohen et al. [5]. They developed a family of algorithms for finding interesting associations in a transaction database, also using LSH techniques. However, they specifically wanted to avoid any kind of filtering of itemsets based on itemset support. On the other hand, our problem is the vanilla frequent itemset mining problem, which requires filtering itemsets satisfying a given minimum support threshold.


C. Organisation of the Paper
In Sect. 2, we introduce the relevant concepts and give an overview of the problem. In Sect. 3, we build up the concept of LSH-Apriori which is required to develop our algorithms. In Sect. 4, we present three specific variations of LSH-Apriori for computing FI. The algorithms of Subsects. 4.1 and 4.2 are based on Hamming LSH and Minhashing, respectively. In Subsect. 4.3, we present another approach based on CoveringLSH which overcomes the problem of producing false negatives. In Sect. 5, we summarize the whole discussion. Proofs are omitted due to space constraints and can be found in the full version [3].
II. BACKGROUND
The input to the classical frequent itemset mining problem is a database D of n transactions {T1, ..., Tn} over m items {i1, ..., im} and a support threshold θ ∈ (0, 1). Each transaction, in turn, is a subset of those items. The support of an itemset I ⊆ {i1, ..., im} is the number of transactions that contain I. The objective of the problem is to determine every itemset with support at least θn. We will often identify an itemset I with its transaction vector (I[1], I[2], ..., I[n]), where I[j] is 1 if I is contained in Tj and 0 otherwise. An equivalent way to formulate the objective is to find itemsets with at least θn 1's in their transaction vectors. It will be useful to view D as a set of m transaction vectors, one for every item.
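The transaction-vector view can be made concrete with a tiny example; the toy database and the helper name item_vectors below are ours, purely for illustration of the θn support condition.

```python
# Toy illustration of the transaction-vector view used in Sect. II.
# The database and names below are illustrative, not from the paper.
transactions = [
    {"bread", "milk"},           # T1
    {"bread", "butter"},         # T2
    {"milk", "butter", "bread"}  # T3
]
items = sorted({i for t in transactions for i in t})
n = len(transactions)

def item_vectors(transactions, items):
    """Return, for every item, its n-length binary transaction vector."""
    return {it: [1 if it in t else 0 for t in transactions] for it in items}

vectors = item_vectors(transactions, items)
theta = 0.6
# An item is theta-frequent iff its vector contains at least theta*n ones.
frequent_items = [it for it, v in vectors.items() if sum(v) >= theta * n]
print(vectors)
print(frequent_items)  # 'bread' appears in all 3 transactions, 'milk'/'butter' in 2
```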
TABLE 1: Notations

D : Database of transactions {T1, ..., Tn}
n : Number of transactions
Dl : FI of level l: {I1, ..., Iml}
θ : Support threshold, ∈ (0, 1)
θl : Maximum support of any item in Dl
m : Number of items
ml : Number of FI of size l
ε : Error tolerance in LSH, ∈ (0, 1)
δ : Probability of error in LSH, ∈ (0, 1)
|v| : Number of 1s in v
A. Locality Sensitive Hashing
We first briefly explain the concept of locality
sensitive hashing (LSH).

Definition 1 (Locality Sensitive Hashing [8]). Let S be a set of m vectors in R^n, and U be the hashing universe. Then, a family H of functions from S to U is called (S0, (1-ε)S0, p1, p2)-sensitive (with ε ∈ (0, 1] and p1 > p2) for the similarity measure Sim(·, ·) if for any x, y ∈ S:
- if Sim(x, y) ≥ S0, then Pr_{h∈H}[h(x) = h(y)] ≥ p1,
- if Sim(x, y) ≤ (1-ε)S0, then Pr_{h∈H}[h(x) = h(y)] ≤ p2.

Not all similarity measures have a corresponding LSH. However, the following well-known result gives a sufficient condition for the existence of an LSH for any Sim.

Lemma 1. If φ is a strictly monotonic function and a family of hash functions H satisfies Pr_{h∈H}[h(x) = h(y)] = φ(Sim(x, y)) for some Sim : R^n × R^n → [0, 1], then the conditions of Definition 1 are true for Sim for any ε ∈ (0, 1).

The similarity measures that are of our interest are Hamming and Jaccard over binary vectors. Let |x| denote the Hamming weight of a binary vector x. Then, for vectors x and y of length n, Hamming distance is defined as Ham(x, y) = |x ⊕ y|, where x ⊕ y denotes the vector that is the element-wise Boolean XOR of x and y. Jaccard similarity is defined as JS(x, y) = ⟨x, y⟩ / |x ∨ y|, where ⟨x, y⟩ indicates the inner product and x ∨ y indicates the element-wise Boolean OR of x and y. LSH for these similarity measures are simple and well-known [4, 6, 8]. We recall them below; here I is some subset of {1, ..., n} (or, equivalently, an n-length transaction vector).

Definition 2 (Hash Function for Hamming Distance). For any particular bit position i, we define the function hi(I) := I[i]. We will use hash functions of the form gJ(I) = (hj1(I), hj2(I), ..., hjk(I)), where J = {j1, ..., jk} is a subset of {1, ..., n} and the hash values are binary vectors of length k.

Definition 3 (Minwise Hash Function for Jaccard Similarity). Let π be some permutation over {1, ..., n}. Treating I as a subset of indices, we will use hash functions of the form hπ(I) = arg min_{i∈I} π(i).

The probabilities that two itemsets hash to the same value for these hash functions are related to their Hamming distance and Jaccard similarity, respectively.
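As a concrete illustration of Definitions 2 and 3, the following sketch implements both hash families for binary transaction vectors. It is only an illustration of the standard constructions; the function names and toy vectors are ours, not part of the paper.

```python
import random

def hamming_lsh(J):
    """g_J from Definition 2: project a binary vector onto the bit positions in J."""
    return lambda vec: tuple(vec[j] for j in J)

def minwise_lsh(pi):
    """h_pi from Definition 3: index with the smallest rank under permutation pi
    among the positions where the vector is 1."""
    return lambda vec: min((i for i, b in enumerate(vec) if b), key=lambda i: pi[i])

n = 8
Ix = [1, 0, 1, 1, 0, 0, 1, 0]
Iy = [1, 0, 1, 0, 0, 0, 1, 0]

# Bit-sampling hash: collision probability per sampled bit is 1 - Ham(Ix, Iy)/n.
J = random.sample(range(n), 3)
g = hamming_lsh(J)
print(g(Ix), g(Iy))

# Minwise hash: collision probability equals the Jaccard similarity of the supports.
pi = list(range(n))
random.shuffle(pi)
h = minwise_lsh(pi)
print(h(Ix), h(Iy))
```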


B. Apriori Algorithm for Frequent Itemset Mining

As explained earlier, Apriori works level-wise and in its l-th level it generates all θ-frequent itemsets with l items each; for example, in the first level, the algorithm simply computes the support of individual items and retains the ones with support at least θn.

Figure 1: Apriori algorithm for frequent itemset mining

Apriori processes each level, say level-(l+1), by joining all pairs of θ-frequent compatible itemsets generated in level-l, and further filtering out the ones which have support less than θn. Here, two candidate itemsets (of size l each) are said to be compatible if their union has size exactly l+1. A high-level pseudocode of Apriori is given in Algorithm 1. All our algorithms rely on a good implementation of a set data structure whose runtime cost is not included in our analysis.
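Since the pseudocode of Algorithm 1 is only referenced here (Figure 1 is not reproduced), the following minimal sketch, written by us under the paper's definitions, shows the level-wise join-and-filter loop that lines 6-10 of Algorithm 1 describe; the helper names and toy database are illustrative.

```python
from itertools import combinations

def apriori(transactions, theta):
    """Level-wise Apriori: join compatible frequent itemsets, then filter by support.
    Illustrative sketch of Algorithm 1, not the authors' exact pseudocode."""
    n = len(transactions)
    min_support = theta * n

    def support(itemset):
        # Counting support requires a full pass over the database: the cost
        # that LSH-Apriori later tries to avoid for infrequent candidates.
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent = set(level)

    l = 1
    while level:
        # Join step (line 6): two size-l itemsets are compatible if their union has size l+1.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == l + 1}
        # Filter step (lines 7-10): keep candidates with support at least theta*n.
        level = {c for c in candidates if support(c) >= min_support}
        all_frequent |= level
        l += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(db, theta=0.6))
```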
III. LSH-APRIORI

The focus of this paper is to reduce the computation of processing all pairs of itemsets at each level in line 6 (which includes computing support by going through D). Suppose that level l outputs ml frequent itemsets. We will treat the output of level l as a collection of ml transaction vectors Dl = {I1, ..., Iml}, each of length n and one for each frequent itemset of the l-th level. Our approach involves defining appropriate notions of similarity between itemsets (represented by vectors) in Dl, similar to the approach followed by Cohen et al. [5]. Let Ii, Ij be two vectors, each of length n. Then, we use |Ii, Ij| to denote the number of bit positions where both the vectors have a 1.

Definition 4. Given a parameter 0 < θ < 1, we say that {Ii, Ij} is θ-frequent (or similar) if |Ii, Ij| ≥ θn, and {Ii, Ij} is (1-ε)θ-infrequent if |Ii, Ij| < (1-ε)θn. Furthermore, we say that Ij is similar to Ii if {Ii, Ij} is θ-frequent.

Let Iq be a frequent itemset at level l-1. Let FI(Iq, θ) be the set of itemsets Ia such that {Iq, Ia} is θ-frequent at level l. Our main contributions are a few randomized algorithms for identifying itemsets in FI(Iq, θ) with high probability.

Definition 5 (FI(Iq, θ, ε, δ)). Given a θ-frequent itemset Iq of size l-1, tolerance ε ∈ (0, 1) and error probability δ, FI(Iq, θ, ε, δ) is a set F' of itemsets of size l such that, with probability at least 1-δ, F' contains every Ia for which {Iq, Ia} is θ-frequent.
It is clear that FI(Iq, θ) ⊆ FI(Iq, θ, ε, δ) with high probability. This motivated us to propose LSH-Apriori, a randomized version of Apriori, that takes ε and δ as additional inputs and essentially replaces line 6 by LSH operations to combine every itemset Iq with only similar itemsets, unlike Apriori, which combines all pairs of itemsets. This potentially creates a significantly smaller C without missing out too many frequent itemsets. The modifications to Apriori are presented in Algorithm 2, and the following lemma, immediate from Definition 5, establishes correctness of LSH-Apriori.

Figure 2: LSH-Apriori level l+1 (only modifications to Apriori line 6)

Lemma 2. Let Iq and Ia be two θ-frequent compatible itemsets of size (l-1) such that the itemset J = Iq ∪ Ia is also θ-frequent. Then, with probability at least 1-δ, FI(Iq, θ, ε, δ) contains Ia (hence C contains J).
frequent itemsets. The modifications to Apriori are pre-sented Overhead: Let T(LSH) be the time required for hashing an
in Algorithm 2 and the following lemma, immediate from itemset for a par-ticular LSH and let Q(LSH) be the space
Definition 5, establishes correctness of LSH-Apriori. needed for storing respective hash values. The extra overhead
in terms of space will be simply mio-(LSH) in level 1 +1. With
respect to overhead in running time, LSH-Apriori requires
hashing each of the m1 itemsets twice, during pre-processing
and during querying. Thus total time overhead in this level is
(LSH, 1 + 1) = 2mlT(LSH).
Savings: Consider the itemsets in Dl that are compatible
with any Iq Dl. Among them are those whose combination
Figure 2: LSH-Apriori level 1+ 1 (only modifications with Iq do not generate a 0-frequent itemset for level 1 + 1; call
to Apriori line: 6) them as negative itemsets and denote their num-ber by r(Iq).
Apriori will have to read all n transactions of Iq, r(Iq) itemsets
in order to reject them. Some of these negative itemsets will be
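The pre-processing and query stages described above can be sketched generically as follows; this is our own illustrative rendering of the bucket bookkeeping (not the paper's Algorithm 2), parameterized by any hash function drawn from an LSH family.

```python
from collections import defaultdict

def preprocess(Dl, hash_fn):
    """Pre-processing stage: hash every frequent itemset of the previous level
    and record, for each bucket, the itemsets that landed there."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(Dl):
        buckets[hash_fn(vec)].append(idx)
    return buckets

def query(Iq_index, Dl, buckets, hash_fn, compatible):
    """Query stage: candidate partners for Iq are only the itemsets sharing its
    bucket (instead of all pairs, as in line 6 of Apriori); keep the compatible ones."""
    key = hash_fn(Dl[Iq_index])
    return [j for j in buckets[key] if j != Iq_index and compatible(Iq_index, j)]

# Illustrative wiring:
# Dl        : list of n-length 0/1 transaction vectors of the previous level's FI
# hash_fn   : any function drawn from an LSH family, e.g. a bit-sampling projection (Definition 2)
# compatible: a predicate implementing the size-(l+1) union condition
```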


The internal LSH subroutines may output false positives, itemsets that are not θ-frequent, but such itemsets are eventually filtered out in line 9 of Algorithm 1. Therefore, the output of LSH-Apriori does not contain any false positives. However, some frequent itemsets may be missing from its output (false negatives) with some probability depending on the parameter δ, as stated below in Theorem 3 (proof follows from the union bound).

Theorem 3 (Correctness). LSH-Apriori does not output any itemset which is not θ-frequent. If X is a θ-frequent itemset of size l, then the probability that LSH-Apriori does not output X is at most 2lδ.

The tolerance parameter ε can be used to balance the overhead from using hashing in LSH-Apriori with respect to its savings because of reading fewer transactions. Most LSH, including those that we will be using, behave somewhat like dimensionality reduction. As a result, the hashing operations do not operate on all bits of the vectors. Furthermore, the pre-condition of similarity for joining ensures that (w.h.p.) most infrequent itemsets can be detected before verifying them from D. To formalize this, consider any level l with ml θ-frequent itemsets Dl. We will compare the computation done by LSH-Apriori at level l+1 to what Apriori would have done at level l+1 given the same frequent itemsets Dl. Let cl+1 denote the number of candidates Apriori would have generated and ml+1 the number of frequent itemsets at this level (LSH-Apriori may generate fewer).

Overhead: Let τ(LSH) be the time required for hashing an itemset for a particular LSH and let σ(LSH) be the space needed for storing the respective hash values. The extra overhead in terms of space will be simply ml·σ(LSH) in level l+1. With respect to overhead in running time, LSH-Apriori requires hashing each of the ml itemsets twice, during pre-processing and during querying. Thus the total time overhead in this level is τ(LSH, l+1) = 2ml·τ(LSH).

Savings: Consider the itemsets in Dl that are compatible with any Iq ∈ Dl. Among them are those whose combination with Iq does not generate a θ-frequent itemset for level l+1; call them negative itemsets and denote their number by r(Iq). Apriori will have to read all n transactions of these r(Iq) itemsets in order to reject them. Some of these negative itemsets will be added to FI by LSH-Apriori; we will call them false positives and denote their count by FP(Iq). The rest, those which are correctly not added with Iq, we call true negatives and denote their count by TN(Iq). Clearly, r(Iq) = TN(Iq) + FP(Iq) and Σ_{Iq} r(Iq) = 2(cl+1 - ml+1). Suppose ρ(LSH) denotes the number of transactions a particular LSH-Apriori reads for hashing any itemset; due to the dimensionality reduction property of LSH, ρ(LSH) is always o(n). Then, LSH-Apriori is able to reject all itemsets in TN by reading only ρ(LSH) transactions for each of them; thus for itemset Iq in level l+1, a particular LSH-Apriori reads (n - ρ(LSH)) × TN(Iq) fewer transactions compared to a similar situation for Apriori. Therefore, the total savings at level l+1 is ς(LSH, l+1) = (n - ρ(LSH)) × Σ_{Iq} TN(Iq). In Sect. 4, we discuss this in more detail along with the respective LSH-Apriori algorithms.

IV. FI VIA LSH


Our similarity measure |Ia, Ib| can also be seen as the inner product of the binary vectors Ia and Ib. However, it is not possible to get any LSH for such a similarity measure because, for example, there can be three items Ia, Ib and Ic such that |Ia, Ib| ≥ |Ic, Ic|, which would imply that Pr(h(Ia) = h(Ib)) ≥ Pr(h(Ic) = h(Ic)) = 1, which is not possible. Noting the exact same problem, Shrivastava et al. introduced the concept of asymmetric LSH [12] in the context of binary inner product similarity. The essential idea is to use two different hash functions (for pre-processing and for querying), and they specifically proposed extending MinHashing by padding input vectors before hashing. We use the same pair of padding functions proposed by them for n-length binary vectors at a level l: P(n, l) for preprocessing and Q(n, l) for querying are defined as follows.
In P(I) we append (θl·n - |I|) many 1's followed by (θl·n + |I|) many 0's.

In Q(I) we append θl·n many 0's, then (θl·n - |I|) many 1's, then |I| many 0's.

Here, θl·n (at LSH-Apriori level l) denotes the maximum number of ones in any itemset in Dl. Therefore, we always have (θl·n - |I|) ≥ 0 in the padding functions. Furthermore, since the main loop of Apriori is not continued if no frequent itemset is generated at any level, θl > 0 is also ensured at any level that Apriori is executing.
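The two padding functions can be written down directly from the description above; the following sketch uses our own variable names (max_ones stands for θl·n) and only illustrates the construction and its length invariant.

```python
def pad_P(I, max_ones):
    """Pre-processing padding P: append (max_ones - |I|) ones, then (max_ones + |I|) zeros."""
    w = sum(I)
    return I + [1] * (max_ones - w) + [0] * (max_ones + w)

def pad_Q(I, max_ones):
    """Query padding Q: append max_ones zeros, then (max_ones - |I|) ones, then |I| zeros."""
    w = sum(I)
    return I + [0] * max_ones + [1] * (max_ones - w) + [0] * w

# Illustrative check: max_ones plays the role of theta_l * n, so both padded
# vectors end up with n + 2 * theta_l * n bits and the padding never goes negative.
Ix = [1, 1, 0, 1, 0, 0]
max_ones = 4
assert len(pad_P(Ix, max_ones)) == len(pad_Q(Ix, max_ones)) == len(Ix) + 2 * max_ones
```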
We use the above padding functions to reduce our problem of finding similar itemsets to finding nearby vectors under Hamming distance (using Hamming-based LSH in Subsect. 4.1 and CoveringLSH in Subsect. 4.3) and under Jaccard similarity (using MinHashing in Subsect. 4.2).
A. Hamming Based LSH

In the following lemma, we relate the Hamming distance of two itemsets Ix and Iy with their |Ix, Iy|.

Lemma 4. For two itemsets Ix and Iy, Ham(P(Ix), Q(Iy)) = 2(θl·n - |Ix, Iy|).
Therefore, it is possible to use an LSH for Hamming distance to find similar itemsets. We use this technique in the following algorithm to compute FI(Iq, θ, ε, δ) for all itemsets Iq.

Figure 3: LSH-Apriori (only lines 6a, 6b) using Hamming LSH
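Since the pseudocode of Algorithm 3 (Figure 3) is not reproduced here, the short sketch below only checks Lemma 4 numerically on toy vectors and applies the bit-sampling hash of Definition 2 to the padded vectors; the padding helpers repeat the illustrative ones given after the padding definitions.

```python
import random

def pad_P(I, max_ones):  # pre-processing padding (illustrative, as sketched earlier)
    w = sum(I)
    return I + [1] * (max_ones - w) + [0] * (max_ones + w)

def pad_Q(I, max_ones):  # query padding (illustrative, as sketched earlier)
    w = sum(I)
    return I + [0] * max_ones + [1] * (max_ones - w) + [0] * w

Ix = [1, 1, 0, 1, 0, 0]
Iy = [1, 0, 0, 1, 1, 0]
max_ones = 4             # plays the role of theta_l * n

Px, Qy = pad_P(Ix, max_ones), pad_Q(Iy, max_ones)

# Lemma 4: Ham(P(Ix), Q(Iy)) = 2 * (theta_l*n - |Ix, Iy|).
ham = sum(a != b for a, b in zip(Px, Qy))
common_ones = sum(a & b for a, b in zip(Ix, Iy))
assert ham == 2 * (max_ones - common_ones)

# Bit-sampling hash (Definition 2) on the padded vectors: theta-frequent pairs
# have small Hamming distance, hence a higher collision probability.
J = random.sample(range(len(Px)), 5)
print(tuple(Px[j] for j in J) == tuple(Qy[j] for j in J))
```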


The algorithm contains an optimization over the generic LSH-Apriori pseudocode (Algorithm 2). There is no need to separately execute lines 7-10 of Apriori; one can immediately set F ← C since LSH-Apriori computes support before populating FI.

Lemma 5. Algorithm 3 correctly outputs FI(Iq, θ, ε, δ) for all Iq ∈ Dl. Additional space required is o(ml²), which is also the total time overhead. The expected savings can be bounded by E[ς(l+1)] ≥ (n - o(ml))((cl+1 - 2ml+1) + (cl+1 - o(ml²))).

Expected savings outweigh the time overhead if n >> ml, cl+1 = Ω(ml²) and cl+1 > 2ml+1, i.e., in levels where the number of frequent itemsets generated is small compared to the number of transactions as well as to the number of candidates generated. The additional optimization (*) essentially increases the savings when all (l+1)-extensions of Iq are (1-ε)θ-infrequent; this behaviour will be predominant in the last few levels. It is easy to show that in this case FP(Iq) stays small with probability at least 1-δ, which in turn implies that |S| < L/δ. So, if we did not find any similar Ia within the first L/δ tries, then we can be sure, with reasonable probability, that there are no itemsets similar to Iq.
B. Min-Hashing Based LSH

Cohen et al. had given an LSH-based randomized algorithm for finding interesting itemsets without any requirement of high support [5]. We observed that their Minhashing-based technique [4] cannot be directly applied to the high-support version that we are interested in. The reason is roughly that Jaccard similarity and itemset similarity (w.r.t. θ-frequent itemsets) are not monotonic to each other. Therefore, we use padding to monotonically relate the Jaccard similarity of two itemsets Ix and Iy with their |Ix, Iy|.

Lemma 6. For two padded itemsets Ix and Iy, JS(P(Ix), Q(Iy)) = |Ix, Iy| / (2θl·n - |Ix, Iy|).

Once padded, we follow similar steps (as in [5]) to create a similarity preserving summary D̂l of Dl such that the Jaccard similarity of any column pair in Dl is approximately preserved in D̂l, and then explicitly compute FI(Iq, θ, ε, δ) from D̂l. D̂l is created by using λ independent minwise hashing functions (see Definition 3). λ should be carefully chosen since a higher value increases the accuracy of estimation, but at the cost of large summary vectors in D̂l. Let us define ĴS(Ii, Ij) as the fraction of rows in the summary matrix in which the min-wise entries of columns Ii and Ij are identical. Then, by Theorem 1 of Cohen et al. [5], we can get a bound on the number of required hash functions:

Theorem 7 (Theorem 1 of [5]). Let 0 < ε, δ < 1 and λ ≥ (2/(ε²w)) log(1/δ). Then for all pairs of columns Ii and Ij the following are true with probability at least 1-δ:
- If JS(Ii, Ij) ≥ s* ≥ w, then ĴS(Ii, Ij) ≥ (1-ε)s*,
- If JS(Ii, Ij) < w, then ĴS(Ii, Ij) < (1+ε)w.
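A minimal sketch of the summary construction described above: each of λ random permutations contributes one row of min-wise entries, and the estimate ĴS is the fraction of rows on which two columns agree. The helper names are ours and the choice of λ below is illustrative rather than the bound of Theorem 7.

```python
import random

def minhash_summary(Dl, num_hashes, seed=0):
    """Build the lambda x |Dl| summary matrix: one min-wise entry per (permutation, itemset)."""
    rng = random.Random(seed)
    n = len(Dl[0])
    summary = []
    for _ in range(num_hashes):
        pi = list(range(n))
        rng.shuffle(pi)
        # Row = minwise hash of every column under this permutation (Definition 3).
        summary.append([min((i for i, b in enumerate(col) if b), key=lambda i: pi[i])
                        for col in Dl])
    return summary

def estimated_js(summary, a, b):
    """Fraction of rows where columns a and b have identical min-wise entries."""
    return sum(row[a] == row[b] for row in summary) / len(summary)

# Toy padded columns (each vector must contain at least one 1 for minhash to be defined).
Dl = [[1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0]]
S = minhash_summary(Dl, num_hashes=200)
print(estimated_js(S, 0, 1), estimated_js(S, 0, 2))  # close to 1/2 and 0, respectively
```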
Figure 4: LSH-Apriori (only lines 6a, 6b) using Minhash LSH. (This algorithm can be easily boosted to o(ml) time by applying the banding technique (see Section 4 of [5]) on the minhash table.)

Lemma 8. Algorithm 4 correctly computes FI(Iq, θ, ε, δ) for all Iq ∈ Dl. Additional space required is O(λml), and the total time overhead is O((n + λ)ml). The expected savings is given by E[ς(l+1)] ≥ 2(1-δ)(n-λ)(cl+1 - ml+1).

The proof is omitted due to space constraints. Note that λ depends on ε and δ but is independent of n. This method should be applied only when λ < n. And in that case, for levels with a number of candidates much larger than the number of frequent itemsets discovered (i.e., cl+1 >> {ml, ml+1}), the time overhead would not appear significant compared to the expected savings.
C. Covering LSH

Due to their probabilistic nature, the LSH algorithms presented earlier have the limitation of producing false positives and, more importantly, false negatives. Since the latter cannot be detected, unlike the former, these algorithms may miss some frequent itemsets (see Theorem 3). In fact, once we miss some FI at a particular level, then all the FI which are "supersets" of that FI (in the subsequent levels) will be missed. Here we present another algorithm for the same purpose which overcomes this drawback. The main tool is a recent algorithm due to Pagh [10] which returns approximate nearest neighbors in the Hamming space. It is an improvement over the seminal LSH algorithm by Indyk and Motwani [8], also for Hamming distance. Pagh's algorithm has a small overhead over the latter; to be precise, the query time bound of [10] differs by at most ln(4) in the exponent in comparison with the time bound of [8]. However, its big advantage is that it generates no false negatives. Therefore, this LSH-Apriori version also does not miss any frequent itemset.

The LSH by Pagh is with respect to Hamming distance, so we first reduce our FI problem into the Hamming space by using the same padding given in Lemma 4. Then we use this LSH in the same manner as in Subsect. 4.1. Pagh coined his hashing scheme coveringLSH, which broadly means that, given a threshold r and a tolerance c > 1, the hashing scheme guarantees a collision for every pair of vectors that are within radius r. We will now briefly summarize coveringLSH for our requirement; refer to the paper [10] for full details.

Similar to Hamming LSH, we use a family of Hamming projections as our hash functions: H_A := {x ↦ x ∧ a | a ∈ A}, where A ⊆ {0,1}^((1+2θl)n). Now, given a query item Iq, the idea is to iterate through all hash functions h ∈ H_A and check if there is a collision h(P(Ix)) = h(Q(Iq)) for Ix ∈ Dl. We say that this scheme doesn't produce false negatives for the threshold 2(θl - θ)n if at least one collision happens whenever there is an Ix ∈ Dl with Ham(P(Ix), Q(Iq)) ≤ 2(θl - θ)n, and the scheme is efficient if the number of collisions is not too large when Ham(P(Ix), Q(Iq)) > 2(θl - (1-ε)θ)n (proved in Theorems 3.1, 4.1 of [10]). To make sure that all pairs of vectors within distance 2(θl - θ)n collide for some h, we need to make sure that some h maps their "mismatching" bit positions (between P(Ix) and Q(Iq)) to 0. We describe the construction of the hash functions next.


Covering LSH: The parameters relevant to LSH-Apriori are given above. Notice that after padding, the dimension of each item is n′, the threshold is r′ (the radius corresponding to the minimum support), and the tolerance is c. We start by choosing a random function φ : {1, ..., n′} → {0, 1}^(r′+1), which maps bit positions of the padded itemsets to bit vectors of length r′+1. We define a family of bit vectors a(v) ∈ {0, 1}^(n′), where a(v)_i = ⟨φ(i), v⟩ for i ∈ {1, ..., n′} and v ∈ {0, 1}^(r′+1), and ⟨φ(i), v⟩ denotes the inner product over F2. We define our hash function family H_A using all such vectors a(v) except a(0): A = {a(v) | v ∈ {0, 1}^(r′+1) \ {0}}.
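The family construction just described can be sketched directly; the following code is our illustrative reading of it (with small toy parameters), building one Hamming-projection mask a(v) per nonzero vector v and using bitwise AND as the projection.

```python
import random
from itertools import product

def covering_family(n_prime, r_prime, seed=0):
    """Build the masks a(v), v != 0, from a random map phi: {1..n'} -> {0,1}^(r'+1).
    a(v)_i is the inner product <phi(i), v> over F2."""
    rng = random.Random(seed)
    phi = [[rng.randint(0, 1) for _ in range(r_prime + 1)] for _ in range(n_prime)]
    family = []
    for v in product([0, 1], repeat=r_prime + 1):
        if not any(v):
            continue  # a(0) is excluded from A
        mask = [sum(p * b for p, b in zip(phi[i], v)) % 2 for i in range(n_prime)]
        family.append(mask)
    return family

def project(x, mask):
    """Hamming projection h_a(x) = x AND a."""
    return tuple(xi & ai for xi, ai in zip(x, mask))

# Toy usage: every pair of vectors within the covering radius collides on some mask.
A = covering_family(n_prime=10, r_prime=2)
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # Hamming distance 1 from x
print(any(project(x, a) == project(y, a) for a in A))
```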
Pagh described how to construct A′ ⊆ A [10, Corollary 4.1] such that H_{A′} has the very useful property of no false negatives while also ensuring very few false positives. We use H_{A′} for hashing in the same manner of Hamming projections as used in Subsect. 4.1. Let ψ be the expected number of collisions between any itemset Iq and items in Dl that are (1-ε)θ-infrequent with Iq. The following theorem captures the property of coveringLSH that is relevant for LSH-Apriori, described in Algorithm 5. It also bounds the number of hash functions, which controls the space and time overhead of LSH-Apriori. The proof of this theorem follows from Theorem 4.1 and Corollary 4.1 of [10].

Theorem 9. For a randomly chosen φ, a hash family H_{A′} described above and distinct Ix, Iq ∈ {0, 1}^n: if Ham(P(Ix), Q(Iq)) ≤ r′, then there exists h ∈ H_{A′} such that h(P(Ix)) = h(Q(Iq)).

Figure 5: LSH-Apriori (only lines 6a, 6b) using Covering LSH

Lemma 10. Algorithm 5 outputs all θ-frequent itemsets and only θ-frequent itemsets. Additional space required is O(ml^(1+ν)), which is also the total time overhead. The expected savings is given by E[ς(l+1)] ≥ 2(n - log ml - 1)((cl+1 - ml+1) - ml^(1+ν)).
Computing, Dallas, Texas, USA, 23-26 May 1998, pp. 604-
The (*) line is an additional optimisation similar to
613 (1998)
what we did for HammingLSH Sect. 4.1; it efficiently
[9] Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of
recognizes those frequent itemsets of the earlier level none of
frequent episodes in event sequences. Data Min. Knowl.
whose extensions are frequent. The guarantee of not miss-ing
Discov. 1(3), 259-289 (1997)
any valid itemset comes with a heavy price. Unlike the
[10] Pagh, R.: Locality-sensitive hashing without false
previous algorithms, the conditions under which expected
negatives. In: Proceedings of the Twenty-Seventh Annual
savings beats overhead are quite strin-gent, namely, c1+1 E
ACM-SIAM Symposium on Discrete Algorithms, SODA
{w(m?), w(m41)}, c > m1 > 2n/2 and E < 0.25 (since 1 < c <
2016, Arlington, VA, USA, 10-12 January 2016, pp. 1-9
2, these bounds ensure that v < 1 for later levels when al P.--,
(2016)
0).
V. CONCLUSION

In this work, we designed randomized algorithms using locality-sensitive hashing (LSH) techniques which efficiently output almost all the frequent itemsets with high probability, at the cost of a little space which is required for creating hash tables. We showed that the time overhead is usually small compared to the savings we get by using LSH. Our work opens up the possibility of addressing a wide range of problems that employ various versions of the frequent itemset and sequential pattern mining problems, which can potentially be randomized efficiently using LSH techniques.

VI. REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 26-28 May 1993, pp. 207-216 (1993)
[2] Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, Chile, 12-15 September 1994, pp. 487-499 (1994)
[3] Bera, D., Pratap, R.: Frequent-itemset mining using locality-sensitive hashing. CoRR, abs/1603.01682 (2016)
[4] Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630-659 (2000)
[5] Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C.: Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13(1), 64-78 (2001)
[6] Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases (VLDB 1999), Edinburgh, Scotland, UK, 7-10 September 1999, pp. 518-529 (1999)
[7] Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H., Sharma, R.S.: Discovering all most specific sentences. ACM Trans. Database Syst. 28(2), 140-174 (2003)
[8] Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23-26 May 1998, pp. 604-613 (1998)
[9] Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov. 1(3), 259-289 (1997)
[10] Pagh, R.: Locality-sensitive hashing without false negatives. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2016), Arlington, VA, USA, 10-12 January 2016, pp. 1-9 (2016)
[11] Park, J.S., Chen, M., Yu, P.S.: An effective hash based algorithm for mining association rules. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California, 22-25 May 1995, pp. 175-186 (1995)
[12] Shrivastava, A., Li, P.: Asymmetric minwise hashing for indexing binary inner products and set containment. In: Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18-22 May 2015, pp. 981-991 (2015)
[13] Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: generalizing association rules to dependence rules. Data Min. Knowl. Discov. 2(1), 39-68 (1998)
[14] Wang, H., Wang, W., Yang, J., Yu, P.S.: Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, 3-6 June 2002, pp. 394-405 (2002)
