Keywords: Lift, Mutual-Information
INTRODUCTION
The corpus used in our experiments consists of tweets from June 2014, uniformly drawn from the entire Twitter stream.
MEASURING SURPRISE
Each document $d_i$ in a stream is assigned a surprisal value with respect to a corpus $C$:

$$s_i = \mathit{SurprisalValue}(d_i, C)$$
A prerequisite for achieving this is to be able to extract features from the documents and count feature occurrences:

$$X_i = \mathit{SummarizeFeatures}(d_i)$$

where $X_i$ is a vector of pairs $(w_j, v_j)$ such that $w_j$ is the feature name and $v_j$ is the number of occurrences of that feature in the document. For example, we might summarize a document by counting word occurrences after removing stop words and allowing for stemming.
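As a concrete illustration, here is a minimal Python sketch of such a summarizer. The NLTK stop-word list and Porter stemmer are our own stand-ins; the paper does not specify which tokenizer, stop-word list or stemmer was used.

```python
import re
from collections import Counter

from nltk.corpus import stopwords    # assumed stand-in stop-word list
from nltk.stem import PorterStemmer  # assumed stand-in stemmer

STOP = set(stopwords.words("english"))
stem = PorterStemmer().stem

def summarize_features(document: str) -> Counter:
    """Summarize a document as a mapping from feature name to
    occurrence count, i.e. the vector of (w_j, v_j) pairs."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(stem(t) for t in tokens if t not in STOP)
```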
In text analytics the common means of determining the importance of a word in classifying a document is the product of Term Frequency (TF) and Inverse Document Frequency (IDF) [3]. TF measures how frequently a word appears in a given document. IDF is defined as:

$$\mathit{IDF}(w) = 1 + \log\left(\frac{M}{\mathit{NumContaining}(C, w)}\right)$$

where $M$ is the total number of documents in the corpus $C$ and $\mathit{NumContaining}(C, w)$ is the total number of documents that contain the word $w$. Multiplying TF by IDF amplifies the TFIDF score for words that are rather unusual in the corpus as a whole.

The Lift of a set of words measures how much more often the words co-occur than would be expected if they were independent:

$$\mathit{Lift}_1 = \frac{P(w_1 w_2 \ldots w_n)}{P(w_1)P(w_2)\cdots P(w_n)} \qquad (1)$$

where each probability is estimated empirically, i.e. $P(w_i) = \mathit{NumContaining}(C, w_i)/M$. The smallest non-zero values of the numerator and the denominator are $\frac{1}{M}$ and $\frac{1}{M^n}$ respectively, so the maximum value of $\mathit{Lift}_1$ is $M^{n-1}$ while the minimum is zero. For example, assume we found a tweet containing only the words love and dia drawn from the Twitter corpus for June 2014 described earlier. Then:
$$\frac{P(w_1 w_2)}{P(w_1)\,P(w_2)} = \frac{800 \cdot 130 \cdot 10^6}{2 \cdot 10^6 \cdot 5 \cdot 10^5} \approx 0.09$$
Brin et al. in [5] use the term Interest rather than Lift.
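Under the empirical estimates above, $\mathit{Lift}_1$ can be computed directly from document counts. The following is a minimal sketch; the dictionary layout and names are our own convention, not the paper's:

```python
from math import prod

def lift1(words, num_containing, cooc_count, M):
    """Empirical Lift_1 (Eq. 1).

    words:          the distinct words of the document
    num_containing: dict word -> NumContaining(C, w)
    cooc_count:     number of documents containing all the words
    M:              total number of documents in the corpus
    """
    p_joint = cooc_count / M                              # P(w1 w2 .. wn)
    p_indep = prod(num_containing[w] / M for w in words)  # P(w1)..P(wn)
    return p_joint / p_indep
```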
To address this issue we need to adjust the equation for $\mathit{Lift}_1$. We consider the co-occurrence of pairs of words $w_i$ and $w_j$ and treat the pairs as if they were independent, i.e. $P(w_i w_j)$ is treated as independent of $P(w_i w_k)$ and $P(w_k w_j)$. This is a gross simplification, but it is similar to what is done when using naive Bayes for classification, for example in spam filters, where the naive independence assumption still permits accurate classification; see Mirkin [6].
For a document with $n$ words we have $\frac{n(n-1)}{2}$ pairs; for simplicity we will denote this number as $K_{pairs}$ in what follows. We then apply the definition of $\mathit{Lift}_1$ to all word pairs and combine them as if they were independent, defining the multi-variate lift as:

$$\mathit{Lift}_2 = \frac{\prod_{i<j} P(w_i w_j)}{P(w_1)^{n-1} \cdots P(w_n)^{n-1}} \qquad (2)$$
This rather intimidating equation simplifies when we remember that all the probabilities are calculated empirically by dividing the number of documents that contain a word (or a pair of words) by the total number of documents. The numerator is therefore the product of the co-occurrence counts divided by $M^{K_{pairs}}$, while the denominator is the product of the occurrence counts, each raised to the power $n-1$, divided by $M^{2K_{pairs}}$. In consequence the equation is simply the product of the co-occurrence counts, multiplied by $M^{K_{pairs}}$ and divided by the product of the occurrence counts raised to the power $n-1$. When there are just two words in the document, $\mathit{Lift}_2$ is exactly the same as $\mathit{Lift}_1$, since $K_{pairs} = 1$.
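A sketch of this count-based form of $\mathit{Lift}_2$; the dictionaries mirror those used for lift1 above and are again our own convention:

```python
from itertools import combinations
from math import prod

def lift2(num_containing, pair_containing, M):
    """Count-based Lift_2 (Eq. 2).

    num_containing:  dict word -> number of documents containing the word
    pair_containing: dict frozenset({wi, wj}) -> number of documents
                     containing both words
    M:               total number of documents in the corpus
    """
    words = list(num_containing)
    n = len(words)
    k_pairs = n * (n - 1) // 2
    # Numerator: product of co-occurrence counts, one factor per pair.
    cooc = prod(pair_containing[frozenset(p)] for p in combinations(words, 2))
    # Denominator: product of occurrence counts, each to the power n-1.
    occ = prod(c ** (n - 1) for c in num_containing.values())
    return cooc * M ** k_pairs / occ
```

For even modest $M$ and document sizes, the $M^{K_{pairs}}$ factor exceeds the floating-point range, which is exactly the problem the logarithmic form below avoids.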
For large $M$, $M^{K_{pairs}}$ is huge. This motivates our next adaptation: instead of taking the probabilities of word occurrence and co-occurrence, we take their logarithms. Although the resulting value is not as meaningful as that defined for $\mathit{Lift}_1$ and $\mathit{Lift}_2$, it preserves the same ordering and therefore still allows different surprisal values to be compared.
Taking the logarithm of equation (2) gives:

$$\mathit{Lift}_3 = \sum_{i<j} \log P(w_i w_j) \;-\; (n-1)\sum_{i=1}^{n} \log P(w_i) \qquad (4)$$

Compare the classical information-theoretic definition of entropy as expected surprisal [8]:

$$H = -\sum_{x} p(x)\log(p(x)) \qquad (5)$$
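A log-domain sketch of the same computation; the floor applied to unseen pairs is our own crude choice (a proper smoothing scheme, cf. [7], would be more principled):

```python
from itertools import combinations
from math import log

def lift3(num_containing, pair_containing, M):
    """Log-domain multi-variate lift (Eq. 4): preserves the ordering
    of Lift_2 while avoiding the enormous M**Kpairs factor."""
    words = list(num_containing)
    n = len(words)
    total = 0.0
    for pair in combinations(words, 2):
        # Floor unseen pairs at one document to avoid log(0).
        count = max(pair_containing.get(frozenset(pair), 0), 1)
        total += log(count / M)              # log P(wi wj)
    total -= (n - 1) * sum(log(c / M) for c in num_containing.values())
    return total
```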
UPDATING A CORPUS
[Figure 1: General schema of the approach. Documents from the stream are summarized (Summarize Features) and used to update the current day's counters; surprisal values are then computed against windows of increasing length, e.g. SurprisalValue(d, Day 1), SurprisalValue(d, Day 1..4), ..., SurprisalValue(d, Day 1..64).]

[Figure 2: Number of occurrences (log scale) plotted against surprisal value.]
RESULTS
Surprisal values were computed using counters obtained from yesterday and from the previous 4, 16 and 64 days. Figure 1 shows the general schema of this approach.
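The per-day counters themselves need not be exact: as noted in the conclusion, they can be kept in a sketch such as the count-min sketch of Cormode and Muthukrishnan [12]. The class below is a minimal illustration; the width and depth parameters are our own choices, not the paper's:

```python
import hashlib

class CountMinSketch:
    """Approximate occurrence counter (count-min sketch [12]).
    Queries never under-count; they may over-count slightly."""

    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(),
                                     salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def query(self, key):
        return min(self.table[row][col] for row, col in self._buckets(key))
```

One sketch per day suffices: the counts for a 4-, 16- or 64-day window are obtained by summing the estimates from the corresponding daily sketches.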
CONCLUSION
We have described a means of measuring how typical a document is of a corpus through the introduction of a document surprisal value. We have argued that generalizing Lift offers a simple way of calculating surprisals for documents arriving in continuous streams. This allows an additional term to be associated with each document and used to rank documents when they are queried. We have motivated the practicality of such a system by describing how the counters required for the calculation of Lift can be implemented using well-known sketching techniques, and we have shown results from applying the generalized Lift algorithm to a corpus of rapidly changing documents.
REFERENCES
[1] Z. Harris, "Distributional structure," Word, vol. 10, no. 2–3, pp. 146–162, 1954.
[2] L. Chiticariu, Y. Li, and F. R. Reiss, "Rule-based information extraction is dead! Long live rule-based information extraction systems!" in EMNLP. ACL, 2013, pp. 827–832.
[3] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[4] M. McCandless, E. Hatcher, and O. Gospodnetic, Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, USA: Manning Publications Co., 2010.
[5] S. Brin et al., "Dynamic itemset counting and implication rules for market basket data," SIGMOD Rec., vol. 26, no. 2, pp. 255–264, June 1997.
[6] B. G. Mirkin, Core Concepts in Data Analysis: Summarization, Correlation and Visualization. Springer-Verlag, 2011.
[7] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ser. ACL '96. Stroudsburg, PA, USA: Association for Computational Linguistics, 1996, pp. 310–318.
[8] R. Fano, Transmission of Information: A Statistical Theory of Communications. Cambridge, MA: The MIT Press, 1961.
[9] D. M. Magerman and M. P. Marcus, "Parsing a natural language using mutual information statistics," 1990, pp. 984–989.
[10] K. W. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Comput. Linguist., vol. 16, no. 1, pp. 22–29, Mar. 1990. [Online]. Available: http://dl.acm.org/citation.cfm?id=89086.89095
[11] S. Rooney, "Scheduling intense applications, most surprising first," Science of Computer Programming, vol. 97, Part 3, pp. 309–319, 2015.
[12] G. Cormode and S. Muthukrishnan, "An improved data stream summary: The count-min sketch and its applications," J. Algorithms, vol. 55, no. 1, pp. 58–75, Apr. 2005.