Text Compression
Using N-Grams
Dylan Freedman
Carmel High School
March 5, 2010
Introduction
With our society’s widespread proliferation of
information, it may come as a surprise that there is not
enough storage available on all the hard drives in the
world to fit the volume of digital data created by people,
and following current trends this situation will only
worsen. There are currently 4.6 billion cell phone
subscribers in the world and 1.7 billion internet users,
numbers which have been growing exponentially over the
past few years. The consequences of this flood of information are costly and
environmentally damaging, as new data centers are erected every day.
But is the situation hopeless? No—almost every conflict is met by mankind with
innovation, and abundance of information is no exception. Data compression is the tool
commonly used to reduce the size of files, and computer users deal with these techniques
daily whether they know it or not. Images on websites are almost always compressed with
a lossy algorithm called JPEG, and uncompressed audio and video are usually converted
to lower-quality file types such as MP3 or AVI. But most research and attention in the
realm of compression is devoted to media files or large general purpose files, both of
which take up a good amount of data. Due to a natural interest in linguistics, I wondered
why there were not any specialized file formats designed for compressing text.
As I became more involved in this field, I figured that English text would be a
good candidate for compression experimentation for the following reasons:
1. There are many external corpora of English text available to examine more of the
language’s linguistic properties, and an abundance of research has already been
conducted.
2. English is prolific and relevant, yet challenging for computers to comprehend.
3. Text files are typically small and do not consume much space; however, Internet
services and large companies often handle thousands of these files and could save
space and time with an efficient compressor.
4. Due to the relatively high compressibility of English through conventional means,
preexisting work on this subject is sparse compared to that on generic file
compression, leaving room for innovation.
My research began with an investigation of the commonly employed lossless
compression schemes, and over time I discovered that most fell into two categories: (1)
static coders - compression schemes that build a statistical model of the properties of the
file and then compress the file efficiently using this model, yet at the expense of having
to store these statistics (usually in the form of a frequency table) along with the
compressed contents; and (2) adaptive coders - those that feature an adaptive
compression scheme, starting with a limited, generic statistical model and dynamically
updating it as the file is read with the benefit of not having to store this model inside the
final file. Both these categories of compression generate statistical models, which allow
compressors and decompressors to represent more frequently occurring patterns with
fewer bits and less frequently occurring patterns with more bits, so as to produce an
optimally short file.
However, while researching these algorithms, I asked myself why a statistical
model needed to be generated for each and every file to be compressed, even if multiple
files shared some statistical properties. Does this redundancy not challenge the very
purpose of compression, where the goal is to eliminate all redundancy?
This led me to question the existing approaches. I soon found that both of these
compression methods were highly inefficient when dealing with small files. For a given
small file (<1 KB), static coding algorithms usually produced files larger than the
original, due to the requirement that the compressed archive include the file's statistical
model (which was often bigger than the file itself), and adaptive coders could not achieve
a high compression ratio either, as the sample of data within a small file was typically
insufficient to produce an effective statistical model. But what if a comprehensive,
external database existed that could generalize the statistical properties of a type of file?
If this were the case, a file of this type may not require a compressor to generate these
properties.
To take a simple example, consider the average English-speaking adult, who can
naturally understand the syntax and semantics of his language and could loosely be
considered an external linguistics engine. Claude Shannon, the father of information
theory, proposed an experiment in a 1950 paper in which he attempted to calculate the
entropy of English by asking an average English-speaking person to guess a phrase he
was thinking of letter-by-letter (including spaces). When a letter was guessed correctly,
Shannon would write under that letter the number of guesses required, and the subject
would then attempt to guess the next letter. Below is the result of one of his experiments
on a human subject:
In this experiment, the subject correctly guessed the letter on his first try for 79 of
the letters out of the 102 total letters in the sequence. He was able to perform such a feat
by logically choosing letters that have a high likelihood of following given the preceding
context of the phrase. Through a series of calculations, Shannon placed the redundancy of
English text at approximately 75%, meaning that a piece of text could optimally be
represented in 1/4 of the length.
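To see how that figure arises (the numbers below are a rough reconstruction, not taken directly from Shannon's paper): for a 27-symbol alphabet of 26 letters plus the space, the maximum entropy is log2 27 ≈ 4.75 bits per character, while Shannon's experiments bounded the entropy of printed English at roughly 0.6 to 1.3 bits per character. Taking a value near the top of that range:

```latex
R = 1 - \frac{H}{H_{\max}} \approx 1 - \frac{1.2}{\log_2 27} \approx 1 - \frac{1.2}{4.75} \approx 0.75
```

so an ideal coder could represent English in roughly a quarter of its original length, consistent with the 75% redundancy cited above.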
For the purposes of my experiment, I wondered how well this technique of predicting
text from context could be emulated by a computer. Was this process rooted in linguistic
understanding beyond computers' current capacity, or could it be feasibly implemented?
I researched existing corpora further and was ultimately drawn to Google’s 2006
n-gram corpus, titled Web 1T 5-gram Version 1. This database consists of n-grams up to
5-grams that appeared over 40 times in a one-trillion-word corpus of text from the
Internet. Used correctly, Google’s n-grams can be queried to determine a word’s
conditional probability given a context of up to 4 words in length.
The knowledge of this powerful, comprehensive database finally led to my
experimental question: From Google’s n-gram data is it possible to compress files
efficiently, and if so, does this method offer any advantages over traditional compression
algorithms when dealing with small English text files?
Hypothesis
Using Google’s n-gram database, Web 1T 5-gram Version 1, it is possible to
compress English text with a much higher compression ratio for small files than could be
achieved by conventional algorithms.
Querying the N-Gram Database
As distributed, the Google n-gram database was in an unwieldy format for
extracting useful data. After ordering the database from its distributor, the Linguistic Data
Consortium (LDC), and signing a license agreement, I received a six-DVD set in the mail
that contained the database in compressed gzip archives. The decompression process took
a couple of hours, and I was then required to move some files so that the database was
contained within five folders, “1GM,” “2GM,” … “5GM,” where the folder number
indicated the type of n-grams inside (i.e. “2GM” corresponds to 2-grams). Each folder
contained multiple text files labeled numerically, and an index file that listed the first n-
gram in each of the files. Inside each file were listings of n-grams in alphabetical
order, and each line also contained that n-gram’s observed count in the corpus,
separated by a tab character. Here is an example excerpt from one of the 4-gram files:
As shown above, each n-gram can be used to extract the conditional probability of a word
occurring. Using a line from the example above, infrastructure follows serve as the 500
times out of the total number of 4-grams that start with serve as the. For the
purposes of my project, I wanted to be able to extract the relative likelihood of a word
occurring in a given context. In other words, I needed a computationally feasible method
to determine the relative probability that a word would occur in comparison with the
other words that could have occurred following a given context. To accomplish this, I
devised a ranking system that relies on the fact that the arg max of the conditional
probabilities of the possible words following a given context is constant even when these
probabilities are scaled.
My first step involved sorting the database by the observed count within each set
of n-grams that share a base context. Following the example above (and assuming that
this deals with all 4-grams with the context, serve as the), the excerpt would be
rearranged in descending order of observed count. From this sorted table, I assigned
each n-gram a rank equal to one plus the number of n-grams in the database that shared
the same context and had a higher conditional probability than the given n-gram, so that
rankings start at 1. For instance, the n-gram serve as the information has a ranking of 3,
since information is the 3rd most common word following serve as the, according to
Google’s database.
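The ranking assignment described above can be sketched in Java (a minimal in-memory sketch; the class name and the counts in main are hypothetical, and the real implementation had to process the corpus on disk rather than hold it in maps):

```java
import java.util.*;

public class RankBuilder {
    // Assign ranks to the words following one shared context: the most
    // frequent word gets rank 1, the next rank 2, and so on.
    static Map<String, Integer> rank(Map<String, Long> countsForContext) {
        List<Map.Entry<String, Long>> entries =
            new ArrayList<>(countsForContext.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 1;
        for (Map.Entry<String, Long> e : entries) {
            ranks.put(e.getKey(), rank++);
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Hypothetical counts for 4-grams sharing the context "serve as the"
        // (not the actual corpus values).
        Map<String, Long> counts = new HashMap<>();
        counts.put("initial", 4374L);
        counts.put("input", 1699L);
        counts.put("information", 1356L);
        Map<String, Integer> ranks = rank(counts);
        System.out.println(ranks.get("information")); // prints 3
    }
}
```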
Then, to produce a more usable format, I created two files, each optimized for a specific
data retrieving functionality.
− The first file, referred to as the alphabetical ranking index, was created by sorting the
ranking table produced in the previous step alphabetically (see the Alphabetical
Ranking Index example below). This allows information to be extracted efficiently via
a binary search to determine the ranking of a given n-gram. With minimal effort, this
file can be queried in logarithmic time, and one could compute, say, that indicator is
the 8th most common word following serve as the (still assuming that the example
excerpt being used contains all the n-grams with a context of serve as the).
− The second file, referred to as the numerical ranking index, was created by swapping
the final word of each n-gram in the ranking table with its rank (see the Numerical
Ranking Index example below). Once again, this allows information to be extracted
efficiently via binary search, this time to determine the word that is the n-th most
likely to follow a given context. For example, a program searching for serve as the 13
would return the word with ranking 13 following the context serve as the. In this case,
initiation would be returned.
Alphabetical Ranking Index
serve as the indicator        8
serve as the indicators       16
serve as the indispensable    9
serve as the indispensible    20
serve as the individual       6
serve as the industrial       15
serve as the industry         4
serve as the info             17
serve as the informal         10
serve as the information      3
serve as the informational    18
serve as the infrastructure   5
serve as the initial          1
serve as the initiating       7
serve as the initiation       13
serve as the initiator        12
serve as the injector         14
serve as the inlet            19
serve as the inner            11
serve as the input            2

Numerical Ranking Index
serve as the 1     initial
serve as the 2     input
serve as the 3     information
serve as the 4     industry
serve as the 5     infrastructure
serve as the 6     individual
serve as the 7     initiating
serve as the 8     indicator
serve as the 9     indispensable
serve as the 10    informal
serve as the 11    inner
serve as the 12    initiator
serve as the 13    initiation
serve as the 14    injector
serve as the 15    industrial
serve as the 16    indicators
serve as the 17    info
serve as the 18    informational
serve as the 19    inlet
serve as the 20    indispensible
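The logarithmic-time lookup into the alphabetical ranking index can be sketched as follows (a toy in-memory stand-in; the real index is a set of large sorted files on disk, and the class and method names here are my own):

```java
import java.util.*;

public class AlphaIndex {
    // A small in-memory stand-in for one alphabetical ranking index file:
    // entries sorted by n-gram, each mapping an n-gram to its rank.
    private final String[] ngrams;   // sorted alphabetically
    private final int[] ranks;       // ranks[i] is the rank of ngrams[i]

    AlphaIndex(String[] ngrams, int[] ranks) {
        this.ngrams = ngrams;
        this.ranks = ranks;
    }

    // Binary search over the sorted entries, mirroring how the real index
    // file is queried in logarithmic time. Returns -1 if the n-gram is absent.
    int rankOf(String ngram) {
        int i = Arrays.binarySearch(ngrams, ngram);
        return i >= 0 ? ranks[i] : -1;
    }
}
```

For example, an index built from the excerpt above would answer rankOf("serve as the indicator") with 8.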
I implemented these techniques in Java, and after spending several days optimizing code
for virtual memory management, I was able to produce 10 output files, 5 of which were
created with the alphabetical ranking index, and 5 of which were created with the
numerical ranking index. Each subset of 5 files contained one file of data for 1-grams,
one for 2-grams, and so on through 5-grams.
Finally, querying the database required the creation of a modified binary search
program. The determination of the rank of a word from a given context or of the word at
a specified rank from a given context required the use of principles from the Katz back-
off model. Essentially, this meant that if a word did not exist in the database given a
specific context, the program would successively back off by chopping off the first word
of the context until the particular query was found. Mathematically, the conditional
probability of a word occurring given a specified context, P(w_i | w_(i-n+1) … w_(i-1)),
was taken to equal the conditional probability of the word occurring given that context
without its first word, P(w_i | w_(i-n+2) … w_(i-1)), whenever the former probability
was 0 (w_i denotes the word at position i, where i is the position of the word being
tested; n is the length of the n-grams being used). My program neglects the k, d, and α
constants present in the published version of this equation in order to maximize the
efficiency of the search. Whenever my program had to back off to a shorter context, the
rank being calculated was increased by the maximum ranking of the context before
backing off.
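The back-off rank computation can be sketched like this (a simplified, hypothetical version: the index is an in-memory map rather than the on-disk files, ranks are assumed to run contiguously from 1 for each context, and the k, d, and α constants are omitted, as in the text):

```java
import java.util.*;

public class BackoffRanker {
    // Hypothetical index: context -> (word -> rank), ranks starting at 1.
    // The empty-string context stands for the 1-gram table.
    private final Map<String, Map<String, Integer>> index;

    BackoffRanker(Map<String, Map<String, Integer>> index) {
        this.index = index;
    }

    // Rank of `word` after `context`, backing off by dropping the first
    // context word when the pair is not found. Each back-off adds the
    // maximum rank of the longer context, keeping codes unambiguous.
    int rank(String context, String word) {
        int offset = 0;
        String ctx = context;
        while (true) {
            Map<String, Integer> words =
                index.getOrDefault(ctx, Collections.emptyMap());
            Integer r = words.get(word);
            if (r != null) return offset + r;
            if (ctx.isEmpty()) return -1; // unknown even as a 1-gram
            offset += words.size();       // max rank under this context
            int space = ctx.indexOf(' ');
            ctx = space < 0 ? "" : ctx.substring(space + 1);
        }
    }
}
```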
With the created program, words could be queried from Google’s n-gram
database stored on an external hard drive at a rate of approximately 200 milliseconds per
word. In future experimentation, I would like to split the created n-gram files and index
them to attempt to increase the efficiency of this search.
Tokenization of the Text
To be able to compress text efficiently, the text had to first be tokenized in a
format uniform with the n-gram database’s tokenization system. As the database was not
well-documented, many of the rules that governed this tokenization system had to be
determined through trial and error. Several notable rules include the following:
− Words (including letters and numbers) and punctuation marks receive their own
tokens. Multiple occurrences of the same punctuation mark in succession also
receive one token; however, differing punctuation marks in succession receive a
token for each run of similar punctuation marks.
− Numbers with periods and commas inside them count as one token. Words may
also have a period inside them, provided that the character following the period is
not whitespace or a punctuation mark.
− Numerical dates separated by two or more slashes constitute one token; however,
in every other case the slash receives its own token. For example, 1/2/2003 is one
token, while the fraction 1/2 is three tokens: ‘1’, ‘/’, and ‘2’.
− The hyphen separating hyphenated words is a token. Multiple hyphens in
succession count as only one token.
− After end-of-sentence punctuation marks (. ? !), the database-specific tokens </S>
(end of sentence) and <S> (start of sentence) are required.
− Any token that does not occur anywhere in Google’s n-gram database is replaced
with the database-specific token, <UNK> (unknown word).
I created a class in Java called Tokenizer, which read tokens from an input text file in the
format mentioned above. This class was efficient in that it was stream-based: it offered a
method to read one token at a time from the input file, rather than taking the common
approach of loading the whole file into memory and then splitting it into tokens.
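A stream-based tokenizer in this spirit can be sketched as follows (heavily simplified: only two of the rules above are implemented, words and runs of identical punctuation, and the class name is my own, not the actual Tokenizer class):

```java
import java.io.*;

public class SimpleTokenizer {
    private final PushbackReader in;

    SimpleTokenizer(Reader reader) {
        this.in = new PushbackReader(reader);
    }

    // Returns the next token, or null at end of input. Reads from the
    // stream one character at a time; never loads the whole file.
    String nextToken() throws IOException {
        int c;
        do { c = in.read(); } while (c != -1 && Character.isWhitespace(c));
        if (c == -1) return null;
        StringBuilder sb = new StringBuilder();
        if (Character.isLetterOrDigit(c)) {
            // Rule: a word (letters and numbers) is one token.
            while (c != -1 && Character.isLetterOrDigit(c)) {
                sb.append((char) c);
                c = in.read();
            }
        } else {
            // Rule: a run of the same punctuation mark is one token.
            int mark = c;
            while (c == mark) {
                sb.append((char) c);
                c = in.read();
            }
        }
        if (c != -1) in.unread(c); // push back the lookahead character
        return sb.toString();
    }
}
```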
Compressing the Text
Compression relied on many factors that I altered during the course of testing;
however, there was an underlying design algorithm I created that proved to be effective
in terms of compression ratio. The basic steps are the following:
1. Read input text token by token, and for each token return the ranking in an n-gram
query with the preceding tokens as the context.
2. Encode these ranking numbers in a file. Initially, I implemented this step by
converting numbers into 2 bytes, which can represent any value in the range
1–65536 (2^8 × 2^8 = 2^16 values). Rankings above this upper bound were
converted to the <UNK> token. However, I soon found that variable byte encoding
produced more efficient results for my type of compression. In variable byte
encoding, numbers are encoded in groups of 7 bits, each recorded as an 8-bit
byte. If the number requires more bits beyond a given byte’s 7, that byte’s 8th bit
is a 1; otherwise it is a 0. This way, variable-length numbers can be encoded
without needing to specify each number’s length in the file.
3. For any unknown tokens, take the initial word which produced this token and add
it to a string of unknown words delimited by a space.
4. Compress the string of unknown words using a preexisting generic file
compression algorithm (I used gzip because of its efficiency and simplicity). The
theory behind using a generic compression algorithm for this step is that, since
unknown tokens are so rare to begin with (the number of unique words in the n-
gram database is over 13 million) there is likely some redundancy if there is more
than one unknown token. For example, in a fictional story featuring a character
with a highly unusual name, every occurrence of that name in the text will be
replaced with an unknown token and added to the unknown string, which will
then compress effectively through generic means due to this redundancy.
5. Apply an arithmetic coding algorithm to compress the encoded rankings in the
file. This algorithm is similar to Huffman coding in that it is a form of variable-
length entropy encoding; however, it works much more effectively than Huffman
coding does for symbols that occur at probabilities far from a power of ½. It will
compress the encoded ranking data well because, for the typical text file, these
values are skewed heavily towards the low end. Two variants of this algorithm
were tested: one which is adaptive and dynamically generates a table
representing the frequencies of the symbols it encounters, and one which is
also adaptive yet begins with a generic static table previously generated from
the frequencies of symbols present in a short story.
6. The length of the compressed unknown string is calculated, and this number is
stored using variable byte encoding.
7. The final compressed archive is assembled in the following order: length of the
compressed unknown string expressed in variable byte encoding, compressed
unknown string, rankings compressed in arithmetic coding. This process is fully
reversible for decompression.
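The variable byte encoding of step 2 can be sketched as follows (a minimal sketch; the ordering shown, least-significant 7-bit group first, is my assumption, as the text does not specify it):

```java
import java.io.*;

public class VarByte {
    // Variable byte encoding: each byte carries 7 bits of the number;
    // the 8th (high) bit is 1 when more bytes follow, 0 on the last byte.
    static void encode(int n, OutputStream out) throws IOException {
        while (n >= 0x80) {
            out.write((n & 0x7F) | 0x80); // low 7 bits, continuation flag set
            n >>>= 7;
        }
        out.write(n); // final byte, high bit 0
    }

    static int decode(InputStream in) throws IOException {
        int n = 0, shift = 0, b;
        do {
            b = in.read();
            n |= (b & 0x7F) << shift; // reassemble 7-bit groups
            shift += 7;
        } while ((b & 0x80) != 0);    // continue while the high bit is set
        return n;
    }
}
```

Encoding 1 takes a single byte, while 300 takes two, and decoding reverses the process exactly, so no length field is needed.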
I constructed an implementation of this general method in Java and then proceeded to
analyze compression ratios for varying strings of English text. I had to be careful to
avoid selecting samples of English text written before 2006, the year the Google
N-Gram database was published, as such text may already appear in the corpus and
skew the data. For instance, the string, “To be or not to be – that is the question,”
would compress well using the above method; had Shakespeare not written Hamlet,
this might not be the case.
Ultimately, I selected 12 samples of English text of various length, divided into
four categories: Technical, News, Generic, and Text Message. Technical writing was
comprised of an article I read while researching for my project that had to do with
Google’s n-gram data along with my abstract and hypothesis for this project. News
included a recent article about the proliferation of data in The Economist, a New York
Times article detailing acceleration defects in some current Toyota cars, and a headline
from my local paper, The Carmel Pine Cone. Generic writing dealt with three relatively
small paragraphs of text about the character Alice in Alice in Wonderland, SMS texting,
and the anatomy of the International System of Units (SI units). Lastly, text messages
were extracted from my mobile phone from three different friends: one with a formal
writing style, one with a semi-formal writing style, and one with an abbreviated,
informal writing style.
For each sample of text, I compressed and decompressed the file’s contents using
my algorithm and recorded the following variables:
− Input Size (Bytes) – the size of the original text file before compression takes
place
− Time (ms) – the amount of time it took to compress the file’s contents
− Token Count – the number of tokens within the uncompressed text file
− Avg. Ranking – the average rank value assigned to the tokens
− Unknown Tokens – the number of tokens which were not present in the n-gram
database
− Unknown Compress. Ratio – the compression ratio achieved by applying the
gzip algorithm to the unknown data
− Zip Compress. Ratio – the compression ratio achieved by compressing the
original file with zip compression (a commonly used algorithm for generic file
compression) on the highest compression setting
− Rar Compress. Ratio – the compression ratio achieved by compressing the
original file with rar compression (a high-powered algorithm commonly used for
generic file compression) on the highest compression setting
− Output Size 1 – the size of the compressed file after compression with the
adaptive arithmetic coder that starts with an empty frequency table
− Output Size 2 – the size of the compressed file after compression with the
adaptive arithmetic coder that starts with a frequency table previously generated
from a large sample of text
− Compress. Ratio 1 – the compression ratio achieved by dividing output size 1 by
input size
− Compress. Ratio 2 – the compression ratio achieved by dividing output size 2 by
input size
Following is a graph showing these variables’ experimentally determined values for each
of the 12 samples of English text being tested, sorted by decreasing initial size.
Analysis
There are some definite trends and inferences that can be extracted from the
previous chart. For example, zip and rar files—commonly used general compression
algorithms—feature increasing compression ratios as the list goes down and the file size
decreases. Rar compression ratios are also consistently lower than zip compression
ratios for the files in my sample of English text. Following is a chart that demonstrates
the exponential increase in compression ratio as the text file size decreases for rar files.
[Figure: “RAR Compression Ratio with Small Text Files,” plotting compression ratio
(0% to 350%) against input file size (16,000 bytes down to 0).]
For larger files, rar compression works well as it can build an effective statistical
model of the uncompressed file’s properties; however, short file compressibility is
minimal.
Perhaps most notably, Compress. Ratio 2 was less than or equal to Compress.
Ratio 1 for every file compressed. This is demonstrated in the following graph in which
the pink line representing Compression Ratio 2 is consistently below the blue line
representing Compression Ratio 1. Also notice that the gap between these two lines
grows wider as the text file gets smaller. That means that for large files, the ratio
achieved by loading the adaptive coder with an initial database of frequencies (Compress.
Ratio 2) is not significantly different from the ratio achieved when the adaptive coder
starts with an empty frequency table (Compress. Ratio 1).
[Figure: “Compress. Ratio 1 vs. Compress. Ratio 2,” plotting compression ratio
(0% to 60%) against input file size (16,000 bytes down to 0).]
[Figure: untitled chart plotting compression ratio (0% to 350%) against input file size
(300 bytes down to 0).]
References:
Begleiter, Ron, Ran El-Yaniv, et al. “On Prediction Using Variable Order Markov
Models.” <http://www.jair.org/media/1491/live-1491-2335-jair.pdf>.
Islam, Md. Rafiquel, and S. A. Ahson Rajon. “An Enhanced Short Text Compression
Scheme for Smart Devices.”
<http://www.academypublisher.com/ojs/index.php/jcp/article/view/05014958/1345>.
Franz, Alex, and Thorsten Brants. “All Our N-gram Are Belong to You.”
<http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html>.