Text Compression
Using N-Grams
Dylan Freedman
Carmel High School
March 5, 2010
Introduction
With our society’s widespread proliferation of
information, it may come as a surprise that there is not
enough storage available on all the hard drives in the
world to fit the volume of digital data created by people,
and following current trends this situation will only
worsen. There are currently 4.6 billion cell phone
subscribers in the world and 1.7 billion internet users,
numbers which have been growing exponentially over the
past few years. The consequences of this flood of information are costly and
environmentally damaging, as new data centers are erected every day.
But is the situation hopeless? No—almost every conflict is met by mankind with
innovation, and abundance of information is no exception. Data compression is the tool
commonly used to reduce the size of files, and computer users deal with these techniques
daily whether they know it or not. Images on websites are almost always compressed with
a lossy algorithm called JPEG, and uncompressed audio and video are usually converted
to lower-quality file types such as MP3 or AVI. But most research and attention in the
realm of compression is devoted to media files or large general purpose files, both of
which take up a good amount of data. Due to a natural interest in linguistics, I wondered
why there were not any specialized file formats designed for compressing text.
As I became more involved in this field, I figured that English text would be a
good candidate for compression experimentation for the following reasons:
1. There are many external corpora of English text available to examine more of the
language’s linguistic properties, and an abundance of research has already been
conducted.
2. English is prolific and relevant, yet challenging for computers to comprehend.
3. Text files are typically small and do not consume much space; however, Internet
services and large companies often handle thousands of these files and could save
space and time with an efficient compressor.
4. Due to the relatively high compressibility of English through conventional means,
preexisting work on this subject is sparse compared to that on generic file
compression, leaving room for innovation.
My research began with an investigation of the commonly employed lossless
compression schemes, and over time I discovered that most fell into two categories: (1)
static coders - compression schemes that build a statistical model of the properties of the
file and then compress the file efficiently using this model, yet at the expense of having
to store these statistics (usually in the form of a frequency table) along with the
compressed contents; and (2) adaptive coders - those that feature an adaptive
compression scheme, starting with a limited, generic statistical model and dynamically
updating it as the file is read with the benefit of not having to store this model inside the
final file. Both these categories of compression generate statistical models, which allow
compressors and decompressors to represent more frequently occurring patterns with
fewer bits and less frequently occurring patterns with more bits, so as to produce an
optimally short file.
However, while researching these algorithms, I asked myself why a statistical
model needed to be generated for each and every file to be compressed, even if multiple
files shared some statistical properties. Does this redundancy not challenge the very
purpose of compression, where the goal is to eliminate all redundancy?
This led me to question the existing approaches. I soon found that both of these
compression methods were highly inefficient when dealing with small files. For a given
small file (<1 KB), static coding algorithms usually produced files larger than the
original, due to the requirement that the compressed archive include the file's statistical
model (which was often bigger than the file itself), and adaptive coders could not achieve
a high compression ratio either, as the sample of data within a small file was typically
insufficient to produce an effective statistical model. But what if a comprehensive,
external database existed that could generalize the statistical properties of a type of file?
If this were the case, a file of this type may not require a compressor to generate these
properties.
To take a simple example, consider the average English-speaking adult, who can
naturally understand the syntax and semantics of his language and could loosely be
considered an external linguistics engine. Claude Shannon, the father of information
theory, proposed an experiment in a 1950 paper in which he attempted to calculate the
entropy of English by asking an average English-speaking person to guess a phrase he
was thinking of letter-by-letter (including spaces). When a letter was guessed correctly,
Shannon would write under that letter the number of guesses required, and the subject
would then attempt to guess the next letter. Below is the result of one of his experiments
on a human subject:
In this experiment, the subject correctly guessed the letter on his first try for 79 of
the letters out of the 102 total letters in the sequence. He was able to perform such a feat
by logically choosing letters that have a high likelihood of following given the preceding
context of the phrase. Through a series of calculations, Shannon placed the redundancy of
English text at approximately 75%, meaning that a piece of text could optimally be
represented in 1/4 of the length.
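To see how that figure arises (the numbers below are a rough reconstruction, not taken directly from Shannon's paper): for a 27-symbol alphabet of 26 letters plus the space, the maximum entropy is log2 27 ≈ 4.75 bits per character, while Shannon's experiments bounded the entropy of printed English at roughly 0.6 to 1.3 bits per character. Taking a value near the top of that range:

```latex
R = 1 - \frac{H}{H_{\max}} \approx 1 - \frac{1.2}{\log_2 27} \approx 1 - \frac{1.2}{4.75} \approx 0.75
```

so an ideal coder could represent English in roughly a quarter of its original length, consistent with the 75% redundancy cited above.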
For the purposes of my experiment, I wondered how well this technique of predicting
text from context could be emulated by a computer. Was this process rooted in linguistic
understanding beyond computers' current capacity, or could it be feasibly implemented?
I researched existing corpora further and was ultimately drawn to Google’s 2006
n-gram corpus, titled Web 1T 5-gram Version 1. This database consists of n-grams up to
5-grams that appeared over 40 times in a one-trillion-word corpus of text from the
Internet. Used correctly, Google’s n-grams can be queried to determine a word’s
conditional probability given a context of up to 4 words in length.
The knowledge of this powerful, comprehensive database finally led to my
experimental question: From Google’s n-gram data is it possible to compress files
efficiently, and if so, does this method offer any advantages over traditional compression
algorithms when dealing with small English text files?
Hypothesis
Using Google’s n-gram database, Web 1T 5-gram Version 1, it is possible to
compress English text with a much higher compression ratio for small files than could be
achieved by conventional algorithms.
Querying the N-Gram Database
As distributed, the Google n-gram database was in an unwieldy format for
extracting useful data. After ordering the database from its distributor, the Linguistic Data
Consortium (LDC), and signing a license agreement, I received a six-DVD set in the mail
that contained the database in compressed gzip archives. The decompression process took
a couple of hours, and I was then required to move some files so that the database was
contained within five folders, “1GM,” “2GM,” … “5GM,” where the folder number
indicated the type of n-grams inside (i.e. “2GM” corresponds to 2-grams). Each folder
contained multiple text files labeled numerically, and an index file that listed the first n-
gram in each of the files. Inside each file were listings of n-grams in alphabetical
order, and each line also contained that n-gram’s observed count in the corpus,
separated by a tab character. Here is an example excerpt from one of the 4-gram files:
As shown above, each n-gram can be used to extract the conditional probability of a word
occurring. Using a line from the example above, infrastructure follows serve as the 500
times out of the total number of 4-grams that start with serve as the. For the
purposes of my project, I wanted to be able to extract the relative likelihood of a word
occurring in a given context. In other words, I needed a computationally feasible method
to determine the relative probability that a word would occur in comparison with the
other words that could have occurred following a given context. To accomplish this, I
devised a ranking system that relies on the fact that the arg max of the conditional
probabilities of the possible words following a given context is constant even when these
probabilities are scaled.
My first step involved sorting the database by the observed count within each set
of n-grams that share a base context. Following the example above (and assuming that
this deals with all 4-grams with the context, serve as the), the excerpt would be
rearranged in descending order of observed count. From this sorted table, I assigned
each n-gram a rank equal to one plus the number of n-grams in the database that shared
the same context and had a higher conditional probability than the given n-gram, so that
rankings start at 1. For instance, the n-gram serve as the information has a ranking of 3,
since information is the 3rd most common word following serve as the, according to
Google’s database.
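The ranking assignment described above can be sketched in Java (a minimal in-memory sketch; the class name and the counts in main are hypothetical, and the real implementation had to process the corpus on disk rather than hold it in maps):

```java
import java.util.*;

public class RankBuilder {
    // Assign ranks to the words following one shared context: the most
    // frequent word gets rank 1, the next rank 2, and so on.
    static Map<String, Integer> rank(Map<String, Long> countsForContext) {
        List<Map.Entry<String, Long>> entries =
            new ArrayList<>(countsForContext.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 1;
        for (Map.Entry<String, Long> e : entries) {
            ranks.put(e.getKey(), rank++);
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Hypothetical counts for 4-grams sharing the context "serve as the"
        // (not the actual corpus values).
        Map<String, Long> counts = new HashMap<>();
        counts.put("initial", 4374L);
        counts.put("input", 1699L);
        counts.put("information", 1356L);
        Map<String, Integer> ranks = rank(counts);
        System.out.println(ranks.get("information")); // prints 3
    }
}
```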
Then, to produce a more usable format, I created two files, each optimized for a specific
data retrieving functionality.
− The first file, referred to as the alphabetical ranking index, was created by sorting the
ranking table produced in the previous step alphabetically (see the Alphabetical
Ranking Index example below). This allows information to be extracted efficiently via
a binary search to determine the ranking of a given n-gram. With minimal effort, this
file can be queried in logarithmic time, and one could compute, say, that indicator is
the 8th most common word following serve as the (still assuming that the example
excerpt being used contains all the n-grams with a context of serve as the).
− The second file, referred to as the numerical ranking index, was created by swapping
the final word of each n-gram in the ranking table with its rank (see the Numerical
Ranking Index example below). Once again, this allows information to be extracted
efficiently via binary search, this time to determine the word that is the n-th most
likely to follow a given context. For example, a program searching for serve as the 13
would return the word with ranking 13 following the context serve as the. In this case,
initiation would be returned.
Alphabetical Ranking Index
serve as the indicator        8
serve as the indicators       16
serve as the indispensable    9
serve as the indispensible    20
serve as the individual       6
serve as the industrial       15
serve as the industry         4
serve as the info             17
serve as the informal         10
serve as the information      3
serve as the informational    18
serve as the infrastructure   5
serve as the initial          1
serve as the initiating       7
serve as the initiation       13
serve as the initiator        12
serve as the injector         14
serve as the inlet            19
serve as the inner            11
serve as the input            2

Numerical Ranking Index
serve as the 1     initial
serve as the 2     input
serve as the 3     information
serve as the 4     industry
serve as the 5     infrastructure
serve as the 6     individual
serve as the 7     initiating
serve as the 8     indicator
serve as the 9     indispensable
serve as the 10    informal
serve as the 11    inner
serve as the 12    initiator
serve as the 13    initiation
serve as the 14    injector
serve as the 15    industrial
serve as the 16    indicators
serve as the 17    info
serve as the 18    informational
serve as the 19    inlet
serve as the 20    indispensible
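The logarithmic-time lookup into the alphabetical ranking index can be sketched as follows (a toy in-memory stand-in; the real index is a set of large sorted files on disk, and the class and method names here are my own):

```java
import java.util.*;

public class AlphaIndex {
    // A small in-memory stand-in for one alphabetical ranking index file:
    // entries sorted by n-gram, each mapping an n-gram to its rank.
    private final String[] ngrams;   // sorted alphabetically
    private final int[] ranks;       // ranks[i] is the rank of ngrams[i]

    AlphaIndex(String[] ngrams, int[] ranks) {
        this.ngrams = ngrams;
        this.ranks = ranks;
    }

    // Binary search over the sorted entries, mirroring how the real index
    // file is queried in logarithmic time. Returns -1 if the n-gram is absent.
    int rankOf(String ngram) {
        int i = Arrays.binarySearch(ngrams, ngram);
        return i >= 0 ? ranks[i] : -1;
    }
}
```

For example, an index built from the excerpt above would answer rankOf("serve as the indicator") with 8.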
I implemented these techniques in Java, and after spending several days optimizing code
for virtual memory management, I was able to produce 10 output files, 5 of which were
created with the alphabetical ranking index, and 5 of which were created with the
numerical ranking index. Each subset of 5 files contained one file of data for 1-grams,
one for 2-grams, and so on through 5-grams.
Finally, querying the database required the creation of a modified binary search
program. The determination of the rank of a word from a given context or of the word at
a specified rank from a given context required the use of principles from the Katz back-
off model. Essentially, this meant that if a word did not exist in the database given a
specific context, the program would successively back off by chopping off the first word
of the context until the particular query was found. Mathematically, the conditional
probability of a word occurring given a specified context, P(w_i | w_(i-n+1) … w_(i-1)),
was taken to equal the conditional probability of the word occurring given that context
without its first word, P(w_i | w_(i-n+2) … w_(i-1)), whenever the former probability
was 0 (w_i denotes the word at position i, where i is the position of the word being
tested; n is the length of the n-grams being used). My program neglects the k, d, and α
constants present in the published version of this equation in order to maximize the
efficiency of the search. Whenever my program had to back off to a shorter context, the
rank being calculated was increased by the maximum ranking of the context before
backing off.
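The back-off rank computation can be sketched like this (a simplified, hypothetical version: the index is an in-memory map rather than the on-disk files, ranks are assumed to run contiguously from 1 for each context, and the k, d, and α constants are omitted, as in the text):

```java
import java.util.*;

public class BackoffRanker {
    // Hypothetical index: context -> (word -> rank), ranks starting at 1.
    // The empty-string context stands for the 1-gram table.
    private final Map<String, Map<String, Integer>> index;

    BackoffRanker(Map<String, Map<String, Integer>> index) {
        this.index = index;
    }

    // Rank of `word` after `context`, backing off by dropping the first
    // context word when the pair is not found. Each back-off adds the
    // maximum rank of the longer context, keeping codes unambiguous.
    int rank(String context, String word) {
        int offset = 0;
        String ctx = context;
        while (true) {
            Map<String, Integer> words =
                index.getOrDefault(ctx, Collections.emptyMap());
            Integer r = words.get(word);
            if (r != null) return offset + r;
            if (ctx.isEmpty()) return -1; // unknown even as a 1-gram
            offset += words.size();       // max rank under this context
            int space = ctx.indexOf(' ');
            ctx = space < 0 ? "" : ctx.substring(space + 1);
        }
    }
}
```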
With the created program, words could be queried from Google’s n-gram
database stored on an external hard drive at a rate of approximately 200 milliseconds per
word. In future experimentation, I would like to split the created n-gram files and index
them to attempt to increase the efficiency of this search.
Tokenization of the Text
To be able to compress text efficiently, the text had to first be tokenized in a
format uniform with the n-gram database’s tokenization system. As the database was not
well-documented, many of the rules that governed this tokenization system had to be
determined through trial and error. Several notable rules include the following:
− Words (including letters and numbers) and punctuation marks receive their own
tokens. Multiple occurrences of the same punctuation mark in succession also
receive one token; however, differing punctuation marks in succession receive a
token for each run of similar punctuation marks.
− Numbers with periods and commas inside them count as one token. Words may
also have a period inside them, provided that the character following the period is
not whitespace or a punctuation mark.
− Numerical dates separated by two or more slashes constitute one token; however,
in every other case the slash receives its own token. For example, 1/2/2003 is one
token, while the fraction 1/2 is three tokens: ‘1’, ‘/’, and ‘2’.
− The hyphen separating hyphenated words is a token. Multiple hyphens in
succession count as only one token.
− After end-of-sentence punctuation marks (. ? !), the database-specific tokens </S>
(end of sentence) and <S> (start of sentence) are required.
− Any token that does not occur anywhere in Google’s n-gram database is replaced
with the database-specific token, <UNK> (unknown word).
I created a class in Java called Tokenizer, which read tokens from an input text file in the
format mentioned above. This class was efficient in that it was stream-based: it offered a
method to read one token at a time from the input file, rather than taking the common
approach of loading the whole file into memory and then splitting it into tokens.
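A stream-based tokenizer in this spirit can be sketched as follows (heavily simplified: only two of the rules above are implemented, words and runs of identical punctuation, and the class name is my own, not the actual Tokenizer class):

```java
import java.io.*;

public class SimpleTokenizer {
    private final PushbackReader in;

    SimpleTokenizer(Reader reader) {
        this.in = new PushbackReader(reader);
    }

    // Returns the next token, or null at end of input. Reads from the
    // stream one character at a time; never loads the whole file.
    String nextToken() throws IOException {
        int c;
        do { c = in.read(); } while (c != -1 && Character.isWhitespace(c));
        if (c == -1) return null;
        StringBuilder sb = new StringBuilder();
        if (Character.isLetterOrDigit(c)) {
            // Rule: a word (letters and numbers) is one token.
            while (c != -1 && Character.isLetterOrDigit(c)) {
                sb.append((char) c);
                c = in.read();
            }
        } else {
            // Rule: a run of the same punctuation mark is one token.
            int mark = c;
            while (c == mark) {
                sb.append((char) c);
                c = in.read();
            }
        }
        if (c != -1) in.unread(c); // push back the lookahead character
        return sb.toString();
    }
}
```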
Compressing the Text
Compression relied on many factors that I altered during the course of testing;
however, there was an underlying design algorithm I created that proved to be effective
in terms of compression ratio. The basic steps are the following:
1. Read input text token by token, and for each token return the ranking in an n-gram
query with the preceding tokens as the context.
2. Encode these ranking numbers in a file. Initially, I implemented this step by
converting numbers into 2 bytes, which can represent any value in the range
1–65536 (2^8 × 2^8 = 2^16 values). Rankings above this upper bound were
converted to the <UNK> token. However, I soon found that variable byte encoding
produced more efficient results for my type of compression. In variable byte
encoding, numbers are encoded in groups of 7 bits, each recorded as an 8-bit
byte. If the number requires more bits beyond a given byte’s 7, that byte’s 8th bit
is a 1; otherwise it is a 0. This way, variable-length numbers can be encoded
without needing to specify each number’s length in the file.
3. For any unknown tokens, take the initial word which produced this token and add
it to a string of unknown words delimited by a space.
4. Compress the string of unknown words using a preexisting generic file
compression algorithm (I used gzip because of its efficiency and simplicity). The
theory behind using a generic compression algorithm for this step is that, since
unknown tokens are so rare to begin with (the number of unique words in the n-
gram database is over 13 million) there is likely some redundancy if there is more
than one unknown token. For example, in a fictional story featuring a character
with a highly unusual name, every occurrence of that name in the text will be
replaced with an unknown token and added to the unknown string, which will
then compress effectively through generic means due to this redundancy.
5. Apply an arithmetic coding algorithm to compress the encoded rankings in the
file. This algorithm is similar to Huffman coding in that it is a form of variable-
length entropy encoding; however, it works much more effectively than Huffman
coding does for symbols that occur at probabilities far from a power of ½. It will
compress the encoded ranking data well because, for the typical text file, these
values are skewed heavily towards the low end. Two variants of this algorithm
were tested: one which is adaptive and dynamically generates a table
representing the frequencies of the symbols it encounters, and one which is
also adaptive yet begins with a generic static table previously generated from
the frequencies of symbols present in a short story.
6. The length of the compressed unknown string is calculated, and this number is
stored using variable byte encoding.
7. The final compressed archive is assembled in the following order: length of the
compressed unknown string expressed in variable byte encoding, compressed
unknown string, rankings compressed in arithmetic coding. This process is fully
reversible for decompression.
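The variable byte encoding of step 2 can be sketched as follows (a minimal sketch; the ordering shown, least-significant 7-bit group first, is my assumption, as the text does not specify it):

```java
import java.io.*;

public class VarByte {
    // Variable byte encoding: each byte carries 7 bits of the number;
    // the 8th (high) bit is 1 when more bytes follow, 0 on the last byte.
    static void encode(int n, OutputStream out) throws IOException {
        while (n >= 0x80) {
            out.write((n & 0x7F) | 0x80); // low 7 bits, continuation flag set
            n >>>= 7;
        }
        out.write(n); // final byte, high bit 0
    }

    static int decode(InputStream in) throws IOException {
        int n = 0, shift = 0, b;
        do {
            b = in.read();
            n |= (b & 0x7F) << shift; // reassemble 7-bit groups
            shift += 7;
        } while ((b & 0x80) != 0);    // continue while the high bit is set
        return n;
    }
}
```

Encoding 1 takes a single byte, while 300 takes two, and decoding reverses the process exactly, so no length field is needed.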
I constructed an implementation of this general method in Java and then proceeded to
analyze compression ratios for varying strings of English text. I had to be careful to
avoid selecting samples of English text written before 2006, the year the Google
N-Gram database was published, as such text may already appear in the corpus and
skew the data. For instance, the string, “To be or not to be – that is the question,”
would compress well using the above method; had Shakespeare not written Hamlet,
this might not be the case.
Ultimately, I selected 12 samples of English text of various length, divided into
four categories: Technical, News, Generic, and Text Message. Technical writing was
comprised of an article I read while researching for my project that had to do with
Google’s n-gram data along with my abstract and hypothesis for this project. News
included a recent article about the proliferation of data in The Economist, a New York
Times article detailing acceleration defects in some current Toyota cars, and a headline
from my local paper, The Carmel Pine Cone. Generic writing dealt with three relatively
small paragraphs of text about the character Alice in Alice in Wonderland, SMS texting,
and the anatomy of the International System of Units (SI units). Lastly, text messages
were extracted from my mobile phone from three different friends: one with a formal
writing style, one with a semi-formal writing style, and one with an abbreviated,
informal writing style.
For each sample of text, I compressed and decompressed the file’s contents using
my algorithm and recorded the following variables:
− Input Size (Bytes) – the size of the original text file before compression takes
place
− Time (ms) – the amount of time it took to compress the file’s contents
− Token Count – the number of tokens within the uncompressed text file
− Avg. Ranking – the average rank value assigned to the tokens
− Unknown Tokens – the number of tokens which were not present in the n-gram
database
− Unknown Compress. Ratio – the compression ratio achieved by applying the
gzip algorithm to the unknown data
− Zip Compress. Ratio – the compression ratio achieved by compressing the
original file with zip compression (a commonly used algorithm for generic file
compression) on the highest compression setting
− Rar Compress. Ratio – the compression ratio achieved by compressing the
original file with rar compression (a high-powered algorithm commonly used for
generic file compression) on the highest compression setting
− Output Size 1 – the size of the compressed file after compression with the
adaptive arithmetic coder that starts with an empty frequency table
− Output Size 2 – the size of the compressed file after compression with the
adaptive arithmetic coder that starts with a frequency table previously generated
from a large sample of text
− Compress. Ratio 1 – the compression ratio achieved by dividing output size 1 by
input size
− Compress. Ratio 2 – the compression ratio achieved by dividing output size 2 by
input size
Following is a graph showing these variables’ experimentally determined values for each
of the 12 samples of English text being tested, sorted by decreasing initial size.
Analysis
There are some definite trends and inferences that can be extracted from the
previous chart. For example, zip and rar files—commonly used general compression
algorithms—feature increasing compression ratios as the list goes down and the file size
decreases. Rar compression ratios are also consistently lower than zip compression
ratios for the files in my sample of English text. Following is a chart that demonstrates
the exponential increase in compression ratio as the text file size decreases for rar files.
[Figure: “RAR Compression Ratio with Small Text Files,” plotting compression ratio
(0% to 350%) against input file size (16,000 bytes down to 0).]
For larger files, rar compression works well as it can build an effective statistical
model of the uncompressed file’s properties; however, short file compressibility is
minimal.
Perhaps most notably, Compress. Ratio 2 was less than or equal to Compress.
Ratio 1 for every file compressed. This is demonstrated in the following graph in which
the pink line representing Compression Ratio 2 is consistently below the blue line
representing Compression Ratio 1. Also notice that the gap between these two lines
grows wider as the text file gets smaller. That means that for large files, the ratio
achieved by loading the adaptive coder with an initial database of frequencies (Compress.
Ratio 2) is not significantly different from the ratio achieved when the adaptive coder
starts with an empty frequency table (Compress. Ratio 1).
[Figure: “Compress. Ratio 1 vs. Compress. Ratio 2,” plotting compression ratio
(0% to 60%) against input file size (16,000 bytes down to 0).]
[Figure: untitled chart plotting compression ratio (0% to 350%) against input file size
(300 bytes down to 0).]
References:
Begleiter, Ron, Ran El-Yaniv, et al. “On Prediction Using Variable Order Markov
Models.” <http://www.jair.org/media/1491/live-1491-2335-jair.pdf>.
Islam, Md. Rafiquel, and S. A. Ahson Rajon. “An Enhanced Short Text Compression
Scheme for Smart Devices.”
<http://www.academypublisher.com/ojs/index.php/jcp/article/view/05014958/1345>.
Franz, Alex, and Thorsten Brants. “All Our N-gram Are Belong to You.”
<http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html>.