The cosine measure corresponds to taking the scalar product of the vectors and then dividing by their norms. It is the most frequently used similarity metric in word-space research. The advantage of using the cosine metric over other metrics is that it provides a fixed measure of similarity, which ranges from 1 (for identical vectors) through 0 (for orthogonal vectors) to -1 (for vectors pointing in opposite directions). Moreover, it is also comparatively efficient to compute.
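For illustration, a minimal sketch of this measure in Python (the function name and the toy vectors are ours, not part of the system described here):

```python
import math

def cosine_similarity(x, y):
    """Scalar product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 2], [2, 4]))    # identical direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal: 0.0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction: -1.0
```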
3.4. Problems Associated with Implementing Word Spaces

The dimension n used to define the word space corresponding to a text document is equal to the number of unique words in the document. The number of dimensions therefore increases as the size of the text increases: a text document containing a few thousand words will have a word space of a few thousand dimensions, and the computational overhead grows rapidly with the size of the text. The other problem is data sparseness. The majority of cells in the co-occurrence matrix constructed for the document will be zero, because most words in any language appear in limited contexts, i.e. the words they co-occur with are very limited.

The solution to this predicament is to reduce the high dimensionality of the vectors. A few algorithms attempt to solve this problem by dimensionality reduction. One of the simplest ways is to remove words belonging to certain grammatical classes. Another way is to employ Latent Semantic Analysis [17]. We have used Random Indexing [18] to address the problem of high dimensionality.

4. Random Indexing

Random Indexing (RI) [18] is based on Pentti Kanerva's [19] work on sparse distributed memory. Random Indexing was developed to tackle the problem of high dimensionality in the word space model. While dimensionality reduction does make the resulting lower-dimensional context vectors easier to compute with, it does not solve the problem of initially having to collect a potentially huge co-occurrence matrix. Even implementations that use powerful dimensionality reduction, such as SVD [17], need to initially collect the words-by-documents or words-by-words co-occurrence matrix. RI removes the need for the huge co-occurrence matrix: instead of first collecting co-occurrences in a co-occurrence matrix and then extracting context vectors from it, RI incrementally accumulates context vectors, which can then, if needed, be assembled into a co-occurrence matrix.

4.1. RI Algorithm

Random Indexing accumulates context vectors in a two-step process:

1. Each word in the text is assigned a unique, randomly generated vector called the index vector. The index vectors are sparse, high dimensional and ternary (i.e. their entries are 1, -1 or 0). Each word is also assigned an initially empty context vector which has the same dimensionality (r) as the index vector.

2. The context vectors are then accumulated by advancing through the text one word at a time and adding the index vectors of the context to the focus word's context vector. When the entire data has been processed, the r-dimensional context vectors are effectively the sums of the words' contexts.

For illustration we can again take the example of the sentence

'A friend in need is a friend indeed'

Let the dimension r of the index vectors be 10 for illustration purposes, and let the context be defined as one preceding and one succeeding word.

Let 'friend' be assigned the random index vector
[0 0 0 1 0 0 0 0 -1 0]
and 'need' be assigned the random index vector
[0 1 0 0 -1 0 0 0 0 0]

To compute the context vector of 'in' we need to sum up the index vectors of its context. Since the context is defined as one preceding and one succeeding word, the context of 'in' is 'friend' and 'need'. Summing their index vectors gives the context vector of 'in':
[0 1 0 1 -1 0 0 0 -1 0]

If a co-occurrence matrix has to be constructed, the r-dimensional context vectors can be collected into a matrix of order w x r, where w is the number of unique word types and r is the chosen dimensionality for each word.

Note that this is similar to constructing an n-dimensional unary context vector which has a single 1 in a different position for each word, where n is the number of distinct words. Mathematically, these n-dimensional unary vectors are orthogonal, whereas the r-dimensional random index vectors are only nearly orthogonal. However, there are many more nearly orthogonal than truly orthogonal directions in a high-dimensional space [18]. Random indexing is thus an advantageous tradeoff between the number of dimensions and orthogonality, as the r-dimensional random index vectors can be seen as approximations of the n-dimensional unary vectors.
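As a sketch of the two-step procedure above (in Python; the helper names are ours, and the index vectors of 'friend' and 'need' are fixed to the values used in the worked example rather than drawn at random):

```python
import random
from collections import defaultdict

def random_index_vector(r=10, nonzeros=2, rng=random):
    """Step 1: sparse ternary index vector with a few randomly placed +1s and -1s."""
    vec = [0] * r
    positions = rng.sample(range(r), 2 * nonzeros)
    for p in positions[:nonzeros]:
        vec[p] = 1
    for p in positions[nonzeros:]:
        vec[p] = -1
    return vec

def accumulate_context_vectors(tokens, index_vectors, r=10, window=1):
    """Step 2: add the index vectors of the surrounding words (one word on
    either side here) to the focus word's context vector."""
    context = defaultdict(lambda: [0] * r)
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[focus] = [c + v for c, v in
                                  zip(context[focus], index_vectors[tokens[j]])]
    return context

tokens = "a friend in need is a friend indeed".split()
index_vectors = {w: random_index_vector() for w in set(tokens)}
index_vectors["friend"] = [0, 0, 0, 1, 0, 0, 0, 0, -1, 0]
index_vectors["need"] = [0, 1, 0, 0, -1, 0, 0, 0, 0, 0]

context = accumulate_context_vectors(tokens, index_vectors)
print(context["in"])  # [0, 1, 0, 1, -1, 0, 0, 0, -1, 0], as in the text above
```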
Observe that both the unary vectors and the random index vectors assigned to the words construct the word space; the context vectors computed on the language data are used in mapping the words onto the word space. In our work we used Random Indexing because of the advantages discussed below.

4.2. Advantages of Random Indexing

Compared to other word space methodologies, the Random Indexing approach is unique in the following three ways.

First, it is an incremental method, which means that the context vectors can be used for similarity computations even after just a few examples have been encountered. By contrast, most other word space methods require the entire data to be sampled before similarity computations can be performed.

Second, it uses fixed dimensionality, which means that new data do not increase the dimensionality of the vectors. Increasing dimensionality can lead to significant scalability problems in other word space methods.

Third, it uses implicit dimension reduction, since the fixed dimensionality is much lower than the number of words in the data. This leads to a significant gain in processing time and memory consumption compared to word space methods that employ computationally expensive dimension reduction algorithms.

4.3. Assigning Semantic Vectors to Documents

The average term vector can be considered as the central theme of the document and is computed as

\bar{v} = \frac{1}{n} \sum_{i=1}^{n} \vec{c}_i

where n is the number of distinct words in the document and \vec{c}_i is the context vector of the i-th word.

While computing the semantic vectors for the sentences we subtract the average term vector \bar{v} from the context vectors of the words of the sentence to remove the bias from the system [21]. The semantic vector of a sentence is thus computed as

\vec{s} = \frac{1}{n} \sum_{i=1}^{n} (\vec{c}_i - \bar{v})

where n is the number of words in the focus sentence, i refers to the i-th word of the sentence and \vec{c}_i is the corresponding context vector.

Note that subtracting the mean vector reduces the magnitude of those term vectors which are close in direction to the mean vector, and increases the magnitude of term vectors which point most nearly opposite to the mean vector. Thus the words which occur very commonly in a text, such as auxiliary verbs and articles, will have little influence on the sentence vector so produced; typically, these words do not show any definitive pattern in the words they co-occur with. Further, the terms whose distribution is most distinctive will be given the most weight.
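A small sketch of the two computations above, the average term vector and the sentence vector (plain Python; the function names and the toy three-dimensional vectors are ours):

```python
def mean_vector(context_vectors):
    """Average of the context vectors of all distinct words: the 'central theme'."""
    n = len(context_vectors)
    dim = len(next(iter(context_vectors.values())))
    mean = [0.0] * dim
    for vec in context_vectors.values():
        mean = [m + v for m, v in zip(mean, vec)]
    return [m / n for m in mean]

def sentence_vector(sentence_words, context_vectors, mean):
    """Average of (context vector - mean vector) over the words of the sentence."""
    total = [0.0] * len(mean)
    for word in sentence_words:
        total = [t + (c - m) for t, c, m in
                 zip(total, context_vectors[word], mean)]
    return [t / len(sentence_words) for t in total]

# Toy example with made-up 3-dimensional context vectors:
ctx = {"solar": [2, 0, 1], "eclipse": [1, 1, 0], "the": [5, 5, 5]}
mu = mean_vector(ctx)
print(sentence_vector(["solar", "eclipse"], ctx, mu))
```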
5. The Experimental Setup

Our experimental data set consists of fifteen documents containing 200 to 300 words each. The processing of each document to generate a summary has been carried out as follows.

5.1. Mapping of Words onto the Word Space

Each word in the document was initially assigned a unique, randomly generated index vector of dimension 100 with ternary values (1, -1, 0). This provided an implicit dimensionality reduction of around 50%. The index vectors were constructed so that each vector of 100 units contained two randomly placed 1s and two randomly placed -1s; the rest of the units were assigned the value 0. Each word was also assigned an initially empty context vector of dimension 100. The dimension r assigned to the words depends upon the number of unique words in the text. Since our test data consisted of small paragraphs of 200-300 words each, vectors of dimension 100 sufficed; if larger texts containing thousands of words are to be summarized, vectors of larger dimension have to be employed.

We defined the context of a word as two words on either side, so a 2x2 sliding window was used to accumulate the context vector of the focus word. The context of a given word was also restricted to a single sentence, i.e. windows spanning sentence boundaries were not considered; in cases where the window would have extended into the preceding or the succeeding sentence, a unidirectional window was used. There is fair evidence supporting the use of a small context window. Kaplan [22] conducted experiments in which people successfully guessed the meaning of a word when the two words on either side of it were provided. Experiments conducted at SICS, Sweden [23] also indicate that a narrow context window is preferable for acquiring semantic information. These observations prompted us to use a 2x2 window. The window can also be weighted to give greater importance to the words lying closer to the focus word. For example, the weight vector [0.5 1 0 1 0.5] indicates that the words adjacent to the focus word are given weight 1 and the words at distance 2 are given weight 0.5. In our experiments we have used these weights for computing the context vectors.
5.2. Mapping of Sentences onto the Word Space

Once all the context vectors had been accumulated, semantic vectors for the sentences were computed. A mean vector was calculated from the context vectors of all the words in the text. This vector was subtracted from the context vectors of the words appearing in the sentence, and the resultants were summed and averaged to compute the semantic vector of the sentence.

5.3. Construction of a Completely Connected Undirected Graph

We constructed a weighted, completely connected, undirected graph from the text, wherein each sentence is represented by a node in the graph. The edge joining node i and node j is associated with a weight wij signifying the similarity between sentence i and sentence j.
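A sketch of the graph construction; here we assume, consistently with Section 3, that the edge weight wij is the cosine similarity between the semantic vectors of sentences i and j (the function names are ours):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def build_sentence_graph(sentence_vectors):
    """Completely connected undirected graph: w[i][j] is the similarity of sentences i and j."""
    n = len(sentence_vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = cosine(sentence_vectors[i], sentence_vectors[j])
    return w
```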
5.3.1. Assigning the Node Weights

5.4. Calculating Weights of the Sentences and Generating the Summary

Once the graph is constructed, our aim is to get rid of the redundant information in the text by removing the sentences of less importance. To achieve this, the sentences are ranked by applying a graph-based ranking algorithm. Various graph-based ranking algorithms are available in the literature; the one that we have used for this work is the weighted PageRank algorithm [24].

Weighted PageRank Algorithm. Let G = (V, E) be a directed graph with the set of vertices V and set of edges E, where E is a subset of V x V. For a given vertex Vi, let In(Vi) be the set of vertices that point to it (predecessors), and let Out(Vi) be the set of vertices that Vi points to (successors). Then the new node weight assigned by the weighted PageRank algorithm after one iteration is

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)

where w_{ji} is the weight of the edge from V_j to V_i and d is the damping factor [25].
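A sketch of the iteration above for our completely connected, undirected sentence graph, in which every node is both a predecessor and a successor of every other node; the damping factor d = 0.85 and the convergence threshold are conventional choices [25], not values fixed by this paper:

```python
def weighted_pagerank(w, d=0.85, tol=1e-6, max_iter=100):
    """w[i][j] is the weight of the edge between sentences i and j (w[i][i] == 0)."""
    n = len(w)
    score = [1.0] * n
    out_sum = [sum(row) for row in w]              # total weight leaving each node
    for _ in range(max_iter):
        new = [(1 - d) + d * sum((w[j][i] / out_sum[j]) * score[j]
                                 for j in range(n) if out_sum[j] > 0)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, score)) < tol:
            return new
        score = new
    return score

# The highest-scoring sentences, up to the desired 10%, 25% or 50% of the text,
# can then be selected in their original order to form the extract.
```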
As an illustration, consider the following text, which was summarized at the 10%, 25% and 50% levels by our summarizer and by the Copernic and Word summarizers:

A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun. This configuration can only occur during a new moon, when the Sun and Moon are in conjunction as seen from the Earth. In ancient times, and in some cultures today, solar eclipses are attributed to mythical properties. Total solar eclipses can be frightening events for people unaware of their astronomical nature, as the Sun suddenly disappears in the middle of the day and the sky darkens in a matter of minutes. However, the spiritual attribution of solar eclipses is now largely disregarded. Total solar eclipses are very rare events for any given place on Earth because totality is only seen where the Moon's umbra touches the Earth's surface. A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one. The 1999 total eclipse in Europe, said by some to be the most-watched eclipse in human history, helped to increase public awareness of the phenomenon. This was illustrated by the number of people willing to make the trip to witness the 2005 annular eclipse and the 2006 total eclipse. The next solar eclipse takes place on September 11, 2007, while the next total solar eclipse will occur on August 1, 2008.

The sentences selected manually by experts to create a summary are:

A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
This configuration can only occur during a new moon, when the Sun and Moon are in conjunction as seen from the Earth.
Total solar eclipses are very rare events for any given place on Earth because totality is only seen where the Moon's umbra touches the Earth's surface.
A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one.
The 1999 total eclipse in Europe, said by some to be the most-watched eclipse in human history, helped to increase public awareness of the phenomenon.
This was illustrated by the number of people willing to make the trip to witness the 2005 annular eclipse and the 2006 total eclipse.
The next solar eclipse takes place on September 11, 2007, while the next total solar eclipse will occur on August 1, 2008.

The summary generated by our summarizer:

10% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.

25% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
This configuration can only occur during a new moon, when the Sun and Moon are in conjunction as seen from the Earth.
Total solar eclipses can be frightening events for people unaware of their astronomical nature, as the Sun suddenly disappears in the middle of the day and the sky darkens in a matter of minutes.

50% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
This configuration can only occur during a new moon, when the Sun and Moon are in conjunction as seen from the Earth.
Total solar eclipses can be frightening events for people unaware of their astronomical nature, as the Sun suddenly disappears in the middle of the day and the sky darkens in a matter of minutes.
Total solar eclipses are very rare events for any given place on Earth because totality is only seen where the Moon's umbra touches the Earth's surface.
The 1999 total eclipse in Europe, said by some to be the most-watched eclipse in human history, helped to increase public awareness of the phenomenon.

The summary generated by Copernic:

10% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.

25% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one.

50% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
Total solar eclipses can be frightening events for people unaware of their astronomical nature, as the Sun suddenly disappears in the middle of the day and the sky darkens in a matter of minutes.
A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one.
The 1999 total eclipse in Europe, said by some to be the most-watched eclipse in human history, helped to increase public awareness of the phenomenon.

The summary generated by Word:

10% summary:
The next solar eclipse takes place on September 11, 2007, while the next total solar eclipse will occur on August 1, 2008.

25% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one.
The next solar eclipse takes place on September 11, 2007, while the next total solar eclipse will occur on August 1, 2008.

50% summary:
A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby totally or partially obscuring Earth's view of the Sun.
A total solar eclipse is a spectacular natural phenomenon and many people consider travel to remote locations in order to observe one.
The 1999 total eclipse in Europe, said by some to be the most-watched eclipse in human history, helped to increase public awareness of the phenomenon.

For larger texts, we used precision, recall and F, measures widely used in Information Retrieval [26], to evaluate our results. For each document an extract made manually by experts has been taken as the reference summary (denoted by Sref). We then compare the candidate summary (denoted by Scand) with the reference summary and compute the precision (p), recall (r) and F values as follows:

p = \frac{|S_{ref} \cap S_{cand}|}{|S_{cand}|}, \quad r = \frac{|S_{ref} \cap S_{cand}|}{|S_{ref}|}, \quad F = \frac{2pr}{p + r}

where |S| denotes the number of sentences in summary S.
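A sketch of these measures, treating each summary as the set of sentences it contains (the function name is ours):

```python
def precision_recall_f(s_ref, s_cand):
    """Precision, recall and F computed over the sentences common to both summaries."""
    common = len(set(s_ref) & set(s_cand))
    p = common / len(s_cand) if s_cand else 0.0
    r = common / len(s_ref) if s_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

s_ref = {"sentence 1", "sentence 2", "sentence 3"}
s_cand = {"sentence 1", "sentence 4"}
print(precision_recall_f(s_ref, s_cand))  # (0.5, 0.333..., 0.4)
```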
We also compute the precision, recall and F values for the summaries generated by Copernic [12] and the Word summarizer [14] by comparing them with Sref. Finally we compare the p, r and F values of our summarizer with these values. The values obtained for the fifteen documents are tabulated in Table 1.
Table 1. Precision (p), recall (r) and F values for the fifteen test documents at the 10% and 25% summary levels.

Text Number / Level   Our Approach (p, r, F)   Copernic (p, r, F)   Word (p, r, F)
1 10% 1.000 0.166 0.284 1.000 0.083 0.154 1.000 0.250 0.400
25% 0.800 0.333 0.424 0.500 0.166 0.252 0.800 0.333 0.424
2 10% 1.000 0.444 0.444 1.000 0.140 0.250 0.500 0.142 0.222
25% 1.000 0.429 0.601 1.000 0.429 0.545 0.333 0.142 0.200
3 10% 0.500 0.125 0.200 0.500 0.125 0.200 0.500 0.125 0.200
25% 0.600 0.375 0.462 0.750 0.375 0.500 0.600 0.375 0.462
4 10% 1.000 0.200 0.333 1.000 0.200 0.333 0 0 N.A.
25% 1.000 0.400 0.570 0.666 0.400 0.498 0.666 0.400 0.498
5 10% 1.000 0.143 0.249 1.000 0.143 0.249 0.500 0.143 0.222
25% 0.750 0.426 0.545 0.666 0.286 0.400 0.500 0.286 0.329
6 10% 1.00 0.300 0.451 1.000 0.200 0.333 0.500 0.100 0.167
25% 0.833 0.500 0.625 0.666 0.300 0.400 0.400 0.200 0.267
7 10% 1.000 0.222 0.364 1.000 0.222 0.364 0.500 0.111 0.181
25% 0.200 0.444 0.552 0.750 0.333 0.458 0.400 0.400 0.282
8 10% 1.000 0.250 0.400 1.000 0.250 0.400 0 0 N.A.
25% 1.000 0.500 0.666 1.000 0.500 0.666 0.500 0.250 0.400
9 10% 0.750 0.200 0.315 0.666 0.133 0.221 0.500 0.133 0.210
25% 0.875 0.466 0.608 0.857 0.400 0.545 0.625 0.333 0.434
10 10% 0.666 0.200 0.307 1.000 0.200 0.333 0.500 0.200 0.285
25% 0.833 0.500 0.625 0.800 0.400 0.533 0.714 0.500 0.588
11 10% 1.000 0.166 0.285 1.000 0.166 0.284 0.666 0.166 0.265
25% 0.857 0.500 0.632 0.666 0.333 0.444 0.875 0.583 0.699
12 10% 0.666 0.125 0.211 1.000 0.125 0.222 0.666 0.125 0.210
25% 0.875 0.438 0.583 0.750 0.375 0.500 0.750 0.375 0.500
13 10% 1.000 0.222 0.363 1.000 0.222 0.363 0.666 0.222 0.333
25% 0.800 0.444 0.571 0.750 0.333 0.461 0.600 0.333 0.428
14 10% 1.000 0.182 0.305 1.000 0.182 0.308 0.600 0.182 0.279
25% 0.833 0.454 0.593 0.800 0.364 0.499 0.600 0.364 0.453
15 10% 1.000 0.230 0.373 1.000 0.230 0.373 0.666 0.154 0.249
25% 0.857 0.461 0.599 0.666 0.307 0.419 0.714 0.384 0.499
The observations clearly indicate that the summaries generated by our method are closer to the human-generated summaries than the summaries produced by the Copernic and Word summarizers at the 10% and 25% levels for almost all the texts. At the 50% level too we obtained better results than Copernic and Word; however, limitations of space preclude us from showing those figures in the table.

6. Conclusions and Future Scope

In this paper we have proposed a summarization technique which involves mapping the words and sentences of a text onto a semantic space and exploiting their similarities to remove the less important sentences containing redundant information. The problem of the high dimensionality of the semantic space corresponding to the text has been tackled by employing Random Indexing, which is less expensive in computation and memory consumption than other dimensionality reduction approaches. The approach gives better results than the commercially available summarizers, namely Copernic and the Word summarizer.

In the future we plan to include a training algorithm using Random Indexing which will construct the word space from a previously compiled text database and then use it for summarization, so as to resolve ambiguities, such as polysemy, more efficiently.

We observed some abruptness in the summaries generated by our method. We plan to smooth out this abruptness by constructing Steiner trees of the graphs constructed for the texts.

In our present evaluation we have used measures like precision, recall and F, which are used primarily in the context of information retrieval. In the future we intend to use more summarization-specific techniques, e.g. ROUGE [27], to measure the efficacy of our scheme.
Text summarization is an important challenge in the present-day context, as huge volumes of text are being produced every day. We expect that the proposed approach will pave the way for developing an efficient AI tool for text summarization.

7. References

[1] Inderjeet Mani, Advances in Automatic Text Summarization, MIT Press, Cambridge, MA, USA, 1999.

[2] Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell, "Summarizing text documents: sentence selection and evaluation metrics", ACM SIGIR, 1999, pp. 121-128.

[3] E.H. Hovy and C.Y. Lin, "Automated Text Summarization in SUMMARIST", Proceedings of the Workshop on Intelligent Text Summarization, ACL/EACL-97, Madrid, Spain, 1997.

[4] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries", ACM SIGIR, 1998, pp. 335-336.

[5] Zha Hongyuan, "Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering", ACM, 2002.

[6] John Conroy and Dianne O'Leary, "Text Summarization via Hidden Markov Models and Pivoted QR Matrix Decomposition", ACM SIGIR, 2001.

[7] Daniel Marcu, "From discourse structures to text summaries", ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 82-88.

[8] J. Pollock and A. Zamora, "Automatic abstracting research at Chemical Abstracts Service", JCICS, 1975.

[9] Hans P. Luhn, "The automatic creation of literature abstracts", IBM Journal of Research and Development, 1958.

[10] Nidhika Yadav, M.Tech Thesis, Indian Institute of Technology Delhi, 2007.

[11] Yihong Gong and Xin Liu, "Generic text summarization using relevance measure and latent semantic analysis", ACM SIGIR, 2001, pp. 19-25.

[12] Copernic Summarizer homepage: http://www.copernic.com/en/products/summarizer.

[13] Rada Mihalcea and Paul Tarau, "An Algorithm for Language Independent Single and Multiple Document Summarization", Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.

[14] Word summarizer: www.microsoft.com/education/autosummarize.mspx

[15] M. Sahlgren, "The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces", Ph.D. dissertation, Department of Linguistics, Stockholm University, 2006.

[16] Z. Harris, Mathematical Structures of Language, Interscience Publishers, 1968.

[17] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham, "An Introduction to Latent Semantic Analysis", 45th Annual Computer Personnel Research Conference, ACM, 2004.

[18] M. Sahlgren, "An Introduction to Random Indexing", Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE), Copenhagen, Denmark, 2005.

[19] P. Kanerva, Sparse Distributed Memory, MIT Press, Cambridge, MA, USA, 1988.

[20] S. Kaski, "Dimensionality reduction by random mapping: Fast similarity computation for clustering", Proceedings of the International Joint Conference on Neural Networks (IJCNN'98), IEEE Service Center, 1998.

[21] Derrick Higgins and Jill Burstein, "Sentence similarity measures for essay coherence", Proceedings of the 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, May 2004.

[22] A. Kaplan, "An experimental study of ambiguity and context", Mechanical Translation, 2(2), 1955.

[23] J. Karlgren and M. Sahlgren, "From words to understanding", in Foundations of Real-World Intelligence, CSLI Publications, 2001.

[24] Rada Mihalcea, "Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization", Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, companion volume (ACL 2004), Barcelona, Spain, July 2004.

[25] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 1998.

[26] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Pearson Education, 1999.

[27] ROUGE: Recall-Oriented Understudy for Gisting Evaluation, http://www.isi.edu/~cyl/ROUGE/