
Proceedings of the Third International Conference on E-Technologies and Business on the Web, Paris, France 2015

Using Lift as a Practical Measure of Surprise in a Document Stream
Sean Rooney
IBM Research, Zurich Laboratory
8803 Rüschlikon, Switzerland
E-mail: sro@zurich.ibm.com
ABSTRACT
We describe how the concept of Lift can be generalized to order small documents in a corpus by their degree of similarity. This surprisal norm can be used in conjunction with other features to search over the corpus. From an information-theoretic point of view, the surprisal is the combination of the mutual information of all word pairs in a document. We show how the calculation of surprisals can be performed efficiently on a document stream using sketching techniques.
KEYWORDS
Information-Retrieval, Lift, Mutual-Information, Sketching, Surprisal

INTRODUCTION

Often we wish to rank the documents within a corpus by their degree of resemblance to each other, resemblance being defined by the number of words they share. We can think of the top-ranking documents as being surprising and the lowest-ranking documents as being expected. We term this objective measure of the degree to which we are surprised by a document within a corpus its surprisal. Note that a surprisal only has meaning in relation to the corpus from which the document is drawn.
Having associated a surprisal value with a document, we can use it when searching the corpus. For example, we may request the documents with the highest surprisal values when looking for outliers, or those with the lowest when looking for representative documents. This can be combined with other criteria in a general search, e.g. the K most surprising documents that contain the word cat.
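As a minimal illustration of this kind of query (not taken from the paper's implementation), the sketch below ranks documents by an already-computed surprisal value and filters on a keyword; the representation of the corpus as (text, surprisal) pairs is our assumption.

    def most_surprising(docs, keyword, k):
        # docs: iterable of (text, surprisal) pairs for documents already scored
        matching = [(s, t) for (t, s) in docs if keyword in t.lower().split()]
        return [t for (s, t) in sorted(matching, reverse=True)[:k]]

    # e.g. the 10 most surprising documents containing the word cat:
    # most_surprising(corpus, "cat", 10)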
Text analytics comprises two general approaches. The first, bag-of-words [1], ignores the structure of the document and considers it as a set of words, each of which occurs with some frequency. This method can be refined into a topic model, in which the frequency of word occurrence is used as an indicator of latent topics into which individual documents are classified, typically through the application of statistical machine learning techniques. The second is a syntactic model [2], in which knowledge about the structure of natural languages is used to identify relationships beyond simple co-occurrence, e.g. rules are applied to the datagram [Alice,Loves,Bob] and Alice and Bob are identified as objects linked by a relationship Loves. This model often involves the use of lexicons, either manually or automatically generated, which contain the rules of interest for the particular corpus. Chiticariu et al. in [2] discuss why academia has concentrated on statistical machine learning while commercial products use rule-based approaches. Interestingly, they claim this is because rule-based approaches are easier to incrementally refine and change.
We address two problems. First, how can a surprisal be calculated for a document within a corpus? Second, how can this be efficiently implemented in a corpus that is constantly being updated? The second problem affects the solution to the first, as we must be able to incrementally change the measure of surprise as new documents arrive. We use a simple bag-of-words model because, at its core, it simply requires the maintenance of sets of counters. As we shall show, this allows the use of sketching algorithms to trade precision for size, allowing estimates of surprisals to be quickly and incrementally calculated on a constantly changing stream.
We use Twitter as an illustrative example, in which tweets are small documents arriving at a rate of thousands per second. Twitter, via a publicly available API, transmits a sample of 1% of all tweets, normally drawn uniformly from the entire Twitter stream.

MEASURING SURPRISE

We assume a corpus C containing M text documents. Our goal, given a document d_i, is to determine a surprisal value s_i for that document within the corpus:

s_i = SurprisalValue(d_i, C)

A prerequisite for achieving this is to be able to extract features from the documents and count feature occurrences: X_i = SummarizeFeatures(d_i), where X_i is a vector of pairs (w_j, v_j) such that w_j is the feature name and v_j the number of occurrences of that feature in the document. For example, we might summarize a document by counting word occurrences after removing stop words and allowing for stemming.
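A minimal sketch of such a SummarizeFeatures step under the bag-of-words assumption is given below; the tokenization, the stop-word list and the omission of stemming are our simplifications, not the authors' implementation.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative only

    def summarize_features(document):
        # X_i as a mapping {w_j: v_j} of feature name to occurrence count
        words = re.findall(r"[a-z']+", document.lower())
        return Counter(w for w in words if w not in STOP_WORDS)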
In text analytics the common means of determining the importance of a word in classifying a document is the product of Term Frequency (TF) and Inverse Document Frequency (IDF) [3]. TF defines how frequently a word appears in a given document. IDF is defined as:

IDF(w) = 1 + log( M / NumContaining(C, w) )

where NumContaining(C, w) is the total number of documents that contain the word w. Multiplying TF by IDF amplifies the TFIDF score for words that are rather unusual in the corpus as a whole:

TFIDF(w, d_i) = TF(w, d_i) · IDF(w)

TFIDF is widely used to rank text documents; for example, the open-source search-engine library Lucene [4] uses TFIDF to order matching documents when both query and document contain text. A simple measure of surprise is the average of the TFIDF of a document's words, normalized for document length.
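A sketch of that simple TFIDF-based measure is given below, assuming the corpus statistics are held as a plain dictionary num_containing of per-word document counts together with the document total M; these data structures are our assumptions, not the paper's.

    import math

    def tfidf_surprise(doc_counts, num_containing, M):
        # doc_counts: {word: count in this document}, e.g. from summarize_features()
        doc_len = sum(doc_counts.values())
        if doc_len == 0:
            return 0.0
        score = 0.0
        for word, count in doc_counts.items():
            tf = count / doc_len                                   # term frequency
            idf = 1.0 + math.log(M / max(1, num_containing.get(word, 0)))
            score += tf * idf
        return score / len(doc_counts)                             # average TFIDF per distinct word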
Unfortunately, concentrating on word frequencies alone loses the information about the association between words. A document might contain a very common set of words that it would nevertheless be highly unusual to find together. For example, out of 130 million tweets measured from the Twitter sample stream in June 2014, 2 million contained the word love and half a million the word dia, but only 800 contained them both. The appearance of love and dia in the same document should be surprising, but the simple use of TFIDF will not allow this to be represented. Clearly more context needs to be maintained.

The best estimate of the probability of a word occurring in a corpus is given by the observed number of documents that contain that word divided by the total number of documents:

P(w_i) = NumContaining(C, w_i) / M

The concept of Lift [5] compares the likelihood of the co-occurrence of words happening by chance to that observed in the actual corpus (Brin et al. in [5] use the term Interest rather than Lift). Although Lift is normally used in reference to pairs of words, we can generalize it to all the words in a document:

Lift_1({w_1, ..., w_n}) = P(w_1 ∧ w_2 ∧ ... ∧ w_n) / ( P(w_1) P(w_2) ... P(w_n) )    (1)

Assume a text corpus has M documents; then the maximum possible value of Lift_1 will be attained by a combination of words that occurs only once, but always together, i.e.:

P(w_1 ∧ w_2 ∧ ... ∧ w_n) = 1/M

and

P(w_1) P(w_2) ... P(w_n) = 1/M^n

so the maximum value of Lift_1 is M^(n-1) while the minimum is zero. For example, assume we found a tweet containing only the words love and dia, drawn from the Twitter corpus for June 2014 described earlier. Then:

Lift_1(dia, love) = (800 · 130·10^6) / (2·10^6 · 5·10^5) = 0.09
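The sketch below computes Equation (1) directly from raw document counts; the argument names (cooc_count for the number of documents containing all the words together, occ_counts for the per-word document counts) are ours.

    def lift1(cooc_count, occ_counts, M):
        # Equation (1): P(w_1 ^ .. ^ w_n) / (P(w_1) .. P(w_n)), with probabilities
        # estimated as document counts divided by the corpus size M.
        p_together = cooc_count / M
        p_independent = 1.0
        for c in occ_counts:
            p_independent *= c / M
        return p_together / p_independent if p_independent > 0 else 0.0

    # With the rounded counts quoted in the text:
    # lift1(800, [2_000_000, 500_000], 130_000_000) evaluates to roughly 0.1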

The combination of love and dia thus occurs at only 9% of the frequency we would expect if the words in tweets were drawn at random from the total set of words used, i.e. this combination is rare and we can quantify by how much.

In order to use Lift_1 as a practical measure of surprise we have to make some refinements and simplifications. Any given combination of words is likely to be unique even among small documents, so the surprisal value will almost always be zero. To address this issue we need to adjust the equation for Lift_1.

We consider the co-occurrence of pairs of words w_i and w_j and treat the pairs as if they were independent, i.e. P(w_i ∧ w_j) is independent of P(w_i ∧ w_k) and P(w_k ∧ w_j). This is a gross simplification, but it is similar to what is done when using naive Bayes for classification, for example in spam filters, where the naive independence assumption still permits accurate classification; see Mirkin [6].
For a document with N words we have N(N-1)/2 pairs; for simplicity we will denote this number as Kpairs in what follows. We then apply the definition of Lift_1 to all word pairs and combine them as if they were independent. We define the multi-variate lift as:

Lift_2({w_1, .., w_n}) = ∏_{i≠j} P(w_i ∧ w_j) / ( P(w_1)^(n-1) · .. · P(w_n)^(n-1) )    (2)
This rather intimidating equation is simplified when we remember that all the probabilities are calculated empirically by dividing the number of documents that contain a word (or a word pair) by the total number of documents. The numerator is therefore the product of the co-occurrence counts divided by M^Kpairs, while the denominator is the product of the occurrence counts, each raised to the power n-1, divided by M^(2·Kpairs). In consequence, the equation is simply the product of the co-occurrence counts, multiplied by M^Kpairs and divided by the product of the occurrence counts raised to the power n-1. When there are just two words in the document, Lift_2 is exactly the same as Lift_1, i.e. Kpairs = 1.
For large M, M^Kpairs is huge. This motivates our next adaptation: instead of taking the probabilities of word occurrence and co-occurrence, we take their logarithms. Although the resulting value is not as directly meaningful as that defined for Lift_1 and Lift_2, it preserves the same ordering and therefore still allows different surprisal values to be compared.

Lift_3({w_1, .., w_n}) = Σ_{i≠j} log(P(w_i ∧ w_j)) - (n-1) Σ_i log(P(w_i))    (3)

Note that what is 1 for Lift_1 is zero for Lift_3, while what is smaller/greater than 1 for Lift_1 is negative/positive for Lift_3.
Some pairs of words may never have been observed, meaning that the overall surprisal will be zero although our intuition suggests it should be greater. The simplest way of handling this is Laplace additive smoothing; more complex smoothing algorithms are reviewed in [7]. When applying smoothing we give an initial value of one to every word-pair counter before adding the actual count from the set of observed documents; in addition, we increase M by the number of pairs. So, for example, imagine we observe a tweet containing a set of N words of which no two have been observed together in the corpus. The surprisal value of that tweet calculated using Lift_3 is then proportional to the inverse of the product of the occurrences of those words.
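A sketch of Equation (3) with this Laplace smoothing is given below; the dictionaries occ (per-word document counts) and cooc (per-pair document counts, keyed by a sorted word pair) and the corpus size M are our assumed representation of the corpus statistics, not the paper's.

    import math
    from itertools import combinations

    def lift3(words, occ, cooc, M, smoothing=1):
        # Equation (3) over the Kpairs unordered word pairs, with additive smoothing.
        words = sorted(set(words))
        n = len(words)
        if n < 2:
            return 0.0
        k_pairs = n * (n - 1) // 2
        m_smoothed = M + smoothing * k_pairs            # grow M by the number of pairs
        log_pairs = 0.0
        for wi, wj in combinations(words, 2):
            c = cooc.get((wi, wj), 0) + smoothing        # every pair starts with a count of one
            log_pairs += math.log(c / m_smoothed)
        log_singles = sum(math.log(max(occ.get(w, 0), 1) / M) for w in words)
        return log_pairs - (n - 1) * log_singles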
Lift_3 is similar to the idea of Mutual-Information [8]. Mutual-Information is a measure of the interdependence of two signals, i.e. how much can be inferred about one signal from knowledge of the other. For example, if I know how much ice cream is sold at a given location on every day of a year, what can I infer about the temperature at that location, and vice versa. The Mutual-Information of two random variables X, Y is often defined as:

I(X; Y) = H(X) - H(X|Y)    (4)

where H(X) is the Shannon entropy of X, i.e.:

H(X) = - Σ_x p(x) log(p(x))    (5)

Clearly, if X is independent of Y, then H(X) = H(X|Y) and I(X; Y) = 0. Lift_3 in effect calculates the surprisal of a document as the combination of the mutual information values of every pair of words in that document, using the corpus as a whole to calculate the relevant probabilities.
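As a small, self-contained illustration of Equations (4) and (5) — not part of the paper's method — the snippet below computes I(X;Y) from a joint probability table over two discrete variables.

    import math

    def entropy(probs):
        # Equation (5): H = -sum p log p, ignoring zero-probability outcomes
        return -sum(p * math.log(p) for p in probs if p > 0)

    def mutual_information(joint):
        # joint: {(x, y): p(x, y)} summing to 1
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        h_x = entropy(px.values())
        h_x_given_y = entropy(joint.values()) - entropy(py.values())   # H(X|Y) = H(X,Y) - H(Y)
        return h_x - h_x_given_y                                       # Equation (4)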
Mutual-Information has been applied to text analytics before. For example, [9] uses Mutual-Information as a means of parsing natural language without requiring a predefined grammar, while [10] uses Mutual-Information to generate a word association norm over a corpus of data; the authors motivate this by showing how such a norm would allow enhanced optical character recognition by factoring in the probability of word association during the analysis of the text. To the best of our knowledge, however, Mutual-Information has not been used for identifying representative or unusual documents within a corpus.


UPDATING A CORPUS

For every document we extract features, generate a new feature, namely its surprisal value, and then make the document indexable on the extracted and generated features using some standard indexing tool, e.g. Solr, Elasticsearch, etc.

While feature extraction is executed on each document independently of the corpus to which it belongs, the generation of the surprisal requires knowledge of the frequency of all words and co-occurrences in the corpus. If the corpus is static then this need only be computed once. However, if the corpus is being added to, then the calculation of a surprisal for a document is more problematic. In particular, on the arrival of each new document we need to calculate the surprisal value of the document and update the word and co-occurrence counts of the entire corpus.
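The per-document pipeline can be pictured as the sketch below; the callables passed in (the summarizer, the surprisal computation, the counter update and the indexing call) are placeholders for whatever concrete components — e.g. a Solr or Elasticsearch client — are actually used.

    def process_document(document, summarize, compute_surprisal, update_counts, index_document):
        features = summarize(document)            # X_i = SummarizeFeatures(d_i)
        surprisal = compute_surprisal(features)   # the generated feature s_i
        update_counts(features)                   # word and co-occurrence counters for the corpus
        index_document({"text": document, "surprisal": surprisal})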
Assume there are 100 million unique words in a corpus; this roughly approximates the number of unique terms observed on the Twitter sample stream over the course of a year. Assume furthermore that it takes 32 bytes to encode a word and its counter. Then it would require more than 3 GB of RAM to hold the individual word counters. However, many more combinations of words are possible; the exact number depends on the nature of the discourse.

As a simplifying model, assume that each word can only occur with exactly N-1 other words and that none of those words can occur with any word outside of this set, i.e. all words are divided into equivalence classes. The number of word combinations within a class is proportional to N^2, and there are (total number of words)/N such classes. Hence the total number of word combinations is proportional to (number of words) × N.
Let N = 1000; then holding the necessary counters in memory would require around 3 TB of RAM. We cannot easily hold such large data structures in memory, but storing them on disk would mean that the rate at which new documents can be processed is bounded by disk I/O. In the Twitter sample stream we receive over 50 new documents every second; if the full Twitter stream were processed this would be roughly 5000 documents per second.
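The back-of-envelope arithmetic behind these figures, using only the numbers quoted above:

    unique_words = 100_000_000          # ~unique terms over a year of the sample stream
    bytes_per_counter = 32

    word_counters_bytes = unique_words * bytes_per_counter
    print(word_counters_bytes / 1e9)    # ~3.2 GB for the individual word counters

    N = 1000                            # size of each equivalence class of words
    pair_counters_bytes = unique_words * N * bytes_per_counter
    print(pair_counters_bytes / 1e12)   # ~3.2 TB for the word-pair counters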
Sketching algorithms produce a summary of information about a data set that is much smaller than the data set itself, but that contains enough information to answer a known query. They are often applied when processing data streams, as the total amount of data is either unknown or unbounded. Typically, we seek algorithms that require poly-logarithmic space with regard to the size of the data set. Sketching can be contrasted with sampling, in as much as a sample contains detailed information about some subset of the data set, while a sketch contains a small amount of information about every element. For example, [11] describes how a probabilistic measure of the entropy of multiple continuous data streams can be calculated in order to determine the priority in which they are processed.
The Count-Min data structure [12] is a means by which counters can be efficiently stored by sacrificing exact values for probabilistic ones. Count-Min assumes P pairwise-independent hash functions h_1 .. h_P over a hash space with Q discrete buckets, and a matrix Count[P, Q] of integers with P rows and Q columns. On the arrival of a new entity x we update the data structure such that, for i = 1..P:

Count[i, h_i(x)] = Count[i, h_i(x)] + 1    (6)

To recover the count of a particular entity x we simply return the minimum of Count[i, h_i(x)] over i = 1..P. The size of the Count-Min structure is constant and independent of the distribution of the entities, although the likelihood of getting a close-to-exact result is not. Count-Min gives an (ε, δ)-guarantee: the estimate of the count is at most ε times the total number of entities away from the true answer with probability at least 1 - δ. The size of the hash space and the number of hash functions are then given by:

Q = e/ε,  P = ln(1/δ)    (7)
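A minimal Count-Min sketch along these lines is shown below; the seeded MD5 construction is an illustrative stand-in for properly pairwise-independent hash functions.

    import hashlib
    import math

    class CountMin:
        def __init__(self, epsilon=0.001, delta=0.01):
            self.Q = math.ceil(math.e / epsilon)        # width, Q = e / epsilon
            self.P = math.ceil(math.log(1.0 / delta))   # depth, P = ln(1 / delta)
            self.count = [[0] * self.Q for _ in range(self.P)]

        def _bucket(self, i, x):
            digest = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
            return int(digest, 16) % self.Q

        def add(self, x, amount=1):
            for i in range(self.P):                     # Equation (6)
                self.count[i][self._bucket(i, x)] += amount

        def estimate(self, x):
            return min(self.count[i][self._bucket(i, x)] for i in range(self.P))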

The surprisal value depends on the totality of the corpus analyzed, but the frequency of word usage can be expected to vary over time. We address this time variability of word frequency by creating a new Count-Min structure for every day and retaining a certain number of days into the past. Retaining a year's worth of such Count-Min structures would, under reasonable assumptions about word distribution and accuracy, require less than 1 GB of RAM.

The counters required for calculating the surprisal values can then be obtained from each of the Count-Min data structures for the required time period and summed. Multiple surprisals, each corresponding to a distinct time period, are retained and stored with a given document, e.g. we store a surprisal based on the counters obtained from yesterday and from the previous 4, 16 and 64 days. Figure 1 shows the general schema of this approach.
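The sketch below shows the per-day scheme, reusing the CountMin class above: one structure per day, with the count of a word or word pair over a window obtained by summing the per-day estimates. The class and method names, and the use of ISO date strings as keys, are our assumptions.

    class DailyCounters:
        def __init__(self, days_retained=365):
            self.days_retained = days_retained
            self.per_day = {}                          # ISO date string -> CountMin

        def add(self, day, item):
            self.per_day.setdefault(day, CountMin()).add(item)
            while len(self.per_day) > self.days_retained:
                self.per_day.pop(min(self.per_day))    # drop the oldest retained day

        def estimate(self, item, days):
            # summed estimate over the most recent `days` daily sketches
            recent = sorted(self.per_day)[-days:]
            return sum(self.per_day[d].estimate(item) for d in recent)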

Figure 1. Calculating Surprisals Over Different Time Periods (schema: the document stream is summarized, the current day's Count-Min structure is updated, and SurprisalValue(d, Day 1), SurprisalValue(d, Day 1..4) and SurprisalValue(d, Day 1..64) are computed from the retained per-day structures Day 1 .. Day 64).
RESULTS

We measure how well Equation 3 for the multi-variate Lift_3 performs in classifying a real corpus of documents, using an arbitrary day (1st February 2014). The sample stream for this day contains more than 4 million documents. In order to better understand which documents are given high and low surprisal values, we remove all tweets that are not labeled as English by Twitter, as well as all retweets. This leaves us with a corpus of approximately 1.2 million documents. We apply Equation 3 to the entire set and order the documents from lowest to highest surprisal value. Values close to zero indicate documents whose terms do not appear together more or less frequently than their independent frequencies would suggest; values below zero correspond to documents whose word pairs are unusual in the corpus, and values above zero to documents whose word pairs appear together frequently.
Figure 2 shows the distribution of the surprisal values as a histogram, each bin having a width of 100. The 1.2 million documents are spread over a range from -3413 to 1969. Figure 2 shows that the vast majority of the documents are in buckets close to 0; the biggest bucket is the range [-100:0], containing 930,000 documents. This skew towards negative values is due to the independence assumption giving the pairs a higher probability than would otherwise be the case.

Even with the log scale the histogram does not show the edges of the distribution well, so in Figure 3 we also give the surprisal values of the documents in order.
Figure 2. Distribution of Surprisal Values of the Corpus (Twitter sample stream): number of occurrences, on a log scale, per surprisal-value bin.

Figure 3. Ordered Surprisal Values of the Corpus: surprisal value plotted against the documents ordered by surprisal.

About 100 documents have surprisals less than -1300 and 100 have values greater than 900. Inspection of the 1000 highest and lowest tweets shows that these are qualitatively different from the average tweet. Documents with large negative surprisals are typically sports results, weather forecasts, stock quotes or tweets in foreign languages that Twitter has misidentified as English. Those with large positive values are documents which tend to repeat the same short message many times.
We stress that the use of surprisals by itself cannot determine whether a document is of interest to a human, but it does provide a simple numeric classification of documents which can then be used in conjunction with other search criteria.

CONCLUSION

Our work extends the idea of identifying which words are typical of a document to identifying which documents are typical of a corpus, through the introduction of a document surprisal value. We have argued that generalizing Lift offers a simple way of calculating surprisals for documents arriving in continuous streams. This allows an additional term to be associated with documents and used to rank them when they are queried. We have motivated the practicality of such a system by describing how the counters required for the calculation of Lift can be implemented using well-known sketching techniques, and we have shown results from applying the generalized Lift algorithm to a corpus of rapidly changing documents.

REFERENCES
[1] Z. Harris, "Distributional structure," Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[2] L. Chiticariu, Y. Li, and F. R. Reiss, "Rule-based information extraction is dead! Long live rule-based information extraction systems!" in EMNLP. ACL, 2013, pp. 827-832.
[3] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513-523, Aug. 1988.
[4] M. McCandless, E. Hatcher, and O. Gospodnetic, Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, USA: Manning Publications Co., 2010.
[5] S. Brin et al., "Dynamic itemset counting and implication rules for market basket data," SIGMOD Rec., vol. 26, no. 2, pp. 255-264, June 1997.
[6] B. G. Mirkin, Core Concepts in Data Analysis: Summarization, Correlation and Visualization. Springer-Verlag, 2011.
[7] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL '96. Stroudsburg, PA, USA: Association for Computational Linguistics, 1996, pp. 310-318.
[8] R. Fano, Transmission of Information: A Statistical Theory of Communications. Cambridge, MA: The MIT Press, 1961.
[9] D. M. Magerman and M. P. Marcus, "Parsing a natural language using mutual information statistics," 1990, pp. 984-989.
[10] K. W. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Comput. Linguist., vol. 16, no. 1, pp. 22-29, Mar. 1990. [Online]. Available: http://dl.acm.org/citation.cfm?id=89086.89095
[11] S. Rooney, "Scheduling intense applications most surprising first," Science of Computer Programming, vol. 97, part 3, pp. 309-319, 2015.
[12] G. Cormode and S. Muthukrishnan, "An improved data stream summary: The count-min sketch and its applications," J. Algorithms, vol. 55, no. 1, pp. 58-75, Apr. 2005.
