
miraQA: Experiments with Learning Answer Context Patterns from the Web

César de Pablo-Sánchez¹, José Luis Martínez-Fernández¹, Paloma Martínez¹, and Julio Villena²

¹ Advanced Databases Group, Computer Science Department, Universidad Carlos III de Madrid,
Avda. Universidad 30, 28911 Leganés, Madrid, Spain
{cdepablo, jlmferna, pmf}@inf.uc3m.es

² DAEDALUS – Data, Decisions and Language S.A.,
Centro de Empresas “La Arboleda”, Ctra N-III km 7,300, 28031 Madrid, Spain
jvillena@daedalus.es

Abstract. We present the miraQA system, MIRACLE’s first experience in Question Answering for monolingual Spanish. The general architecture of the system developed for QA@CLEF 2004 is presented, together with evaluation results. miraQA is characterized by learning the rules for answer extraction from the Web using a Hidden Markov Model of the context in which answers appear. We used a supervised approach that trains on questions and answers from last year’s evaluation set.

1 Introduction
Question Answering has received a lot of attention in recent years due to advances in IR and NLP. As in other applications in these areas, the bulk of the research has been in English, while perhaps one of the most interesting applications of QA systems lies in cross-lingual and multilingual scenarios. Access to precise, high-quality information in a language that the user does not speak, or understands only poorly, would be an advantage over current IR systems in many situations. QA@CLEF [8] has encouraged the development of QA systems in languages other than English and in crosslingual scenarios.
QA systems are usually complex because of the number of different modules they use and the need for good integration among them. Even when questions expect a simple fact or a short definition as an answer, the requirement for more precise information has entailed the use of language- and domain-specific modules. On the other hand, approaches relying on data-intensive [4], machine learning and statistical techniques [10] have become widespread and achieved relative success. Moreover, the interest of these approaches for multilingual QA systems lies in the possibility of adapting them quickly to other target languages.
In this paper we present our first approach to the QA task. As we had not taken part in any of the QA evaluation forums before, most of the work has gone into integrating different available resources. So far, the system we present targets only the monolingual Spanish task. It explores the use of Hidden Markov Models [9] for Answer Extraction and uses Google (http://www.google.com) to collect training data. The results show that further improvements and tuning are needed, both in the system and in the answer extraction method. We expect to continue working on this system to enhance its results and to investigate the suitability of the approach for different languages.

2 Description
miraQA, the system that the MIRACLE group has developed for QA@CLEF 2004, represents our first attempt at the Question Answering task. The system has been developed for the monolingual Spanish subtask, as we are familiar with the available tools for Spanish. Although we only address this task, we believe that our approach to Answer Extraction could easily be adapted to other target languages, as it uses resources available for most languages, such as POS (Part-Of-Speech) taggers and partial parsers.
The architecture of the system follows the usual structure of a QA system, with the three modules shown in Figure 1: Question Analysis, Document Retrieval and Answer Extraction.

Fig. 1. miraQA architecture. The three modules are Question Analysis (POS tagging and parsing, question classifier producing the QA class and semantically tagged terms), Document Retrieval (IR engine over the EFE94/95 collection and a sentence extractor), and Answer Extraction (anchor searching, answer recognition with the QA class model, and answer ranking)

Besides the modules used in the question-answering phase, our approach requires a system to train the models used for answer recognition. This system uses pairs of questions and answers to query Google and selects relevant snippets that contain the answer and other question terms. To build the models we used the QA@CLEF 2003 [7] evaluation set, with the questions and the answers identified by the judges.


2.1 Question Analysis

This module classifies questions and selects the terms that are relevant for later processing stages. We have used a taxonomy of 17 classes, presented in Table 1. The classes were selected considering the type of the answer, the question form and the relation of the question terms to the answer; therefore, we refer to the classes in this taxonomy as question-answer (QA) classes. General QA classes were split into more specific ones depending on the number of examples in last year’s evaluation set. As we were planning to use a statistical approach for answer extraction, we also needed enough examples in every QA class, which determined when to stop subdividing.

Table 1. Question answer (QA) classes used in miraQA

Name     Time    Location  Cause
Person   Year    Country   Manner
Group    Month   City_0    Definition
Count    Day     City_1    Quantity
Rest

In this module, questions are analyzed using ms-tools [2], a package for language processing that contains a POS tagger (MACO) and a partial parser (TACAT), as well as other tools such as a Named Entity Recognition and Classification (NERC) module. MACO is able to recognize basic proper names (np), but the NERC module is needed to classify them. As this module is built with a statistical approach on a corpus of a different genre, its accuracy was not good enough for questions and we decided not to use it. We also modified TACAT to prevent prepositional attachment, as this was more appropriate for our purposes. Once the questions are tagged and parsed, a set of manually developed rules is used to classify them. This set of rules is also used to assign a semantic tag to some of the chunks according to the class they belong to. These tags are a crude attempt to represent the main relations between the answer and the units appearing in the question. A simple example for the question “¿Cuál es la capital de Croacia?” (“What is the capital city of Croatia?”) is shown in Figure 2, together with the rule that is applied.

Fig. 2. Analysis of question #1 in the QA@CLEF 2003 evaluation set, “¿Cuál es la capital de Croacia?”: the POS sequence (Fia pt vsip da nc sps np Fit) is parsed into chunks, and the sn chunk headed by “capital” and the grup-sp chunk “de Croacia” receive the semantic tags ##REL## and ##COUNTRY##
The following rule classifies the question into the city_1 QA class and assigns (M/) the ##REL## and ##COUNTRY## semantic tags (C/ means that the word is matched as a token, S/ as a lemma):
{13,city_1,S_[¿_Fia sn_[C/cuál] grup-verb_[S/ser]
sn_[C/capital;M/##REL##] M/##COUNTRY## ?_Fit ]}
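
As an illustration of how such chunk-level rules can be applied, the following Python sketch matches a parsed question against one hand-written pattern and attaches the semantic tags. It is a minimal mock-up under assumed data structures (the Chunk class and match_city_1 function are our own illustrative names, and the leading “¿” chunk is omitted); it is not the actual miraQA rule engine.

from dataclasses import dataclass

@dataclass
class Chunk:
    label: str           # chunk label from the partial parser, e.g. "sn", "grup-verb"
    tokens: list         # surface tokens
    lemmas: list         # lemmas from the POS tagger
    sem_tag: str = None  # semantic tag assigned by a rule, e.g. "##REL##"

# Hypothetical rule: interrogative "cuál", verb "ser", a noun phrase headed by
# "capital" (tagged ##REL##) followed by a chunk tagged ##COUNTRY##.
def match_city_1(chunks):
    """Return ('city_1', tagged chunks) if the rule fires, otherwise None."""
    if len(chunks) < 4:
        return None
    first, verb, rel, country = chunks[:4]
    if ("cuál" in [t.lower() for t in first.tokens]
            and "ser" in verb.lemmas
            and "capital" in [t.lower() for t in rel.tokens]):
        rel.sem_tag = "##REL##"
        country.sem_tag = "##COUNTRY##"
        return "city_1", chunks
    return None

# Parse of "¿Cuál es la capital de Croacia?" (simplified).
question = [
    Chunk("sn", ["Cuál"], ["cuál"]),
    Chunk("grup-verb", ["es"], ["ser"]),
    Chunk("sn", ["la", "capital"], ["el", "capital"]),
    Chunk("grup-sp", ["de", "Croacia"], ["de", "Croacia"]),
]
print(match_city_1(question))   # -> ('city_1', [...]) with ##REL## and ##COUNTRY## set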

2.2 Document Retrieval

The IR module retrieves the most relevant documents for a query and extracts those sentences that contain any of the words used in the query. The words that were assigned a semantic tag during question analysis are used to build the query. For robustness reasons, this content is scanned again to remove stopwords. Our system uses the Xapian (http://www.xapian.org) probabilistic engine to index and search for the most relevant documents. The last step of the retrieval module tokenizes each document using the DAEDALUS Tokenizer (http://www.daedalus.es) and extracts the sentences that contain relevant terms. The system assigns two scores to every sentence: the relevance measure given by Xapian to the document, and a second score proportional to the number of query terms found in the sentence.
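
A minimal sketch of this sentence selection step is given below, under assumed interfaces: score_sentences, the toy stopword list, and the relevance value are our own, not the actual module. Each sentence that contains at least one query term keeps the document relevance reported by the IR engine and receives a second score proportional to the number of query terms it contains.

import re

STOPWORDS = {"el", "la", "de", "es", "y", "en"}   # toy list for illustration only

def score_sentences(doc_text, doc_relevance, query_terms):
    """Return (sentence, doc_relevance, term_score) for sentences containing query terms."""
    terms = {t.lower() for t in query_terms if t.lower() not in STOPWORDS}
    scored = []
    for sentence in re.split(r"(?<=[.!?])\s+", doc_text):
        words = {w.lower() for w in re.findall(r"\w+", sentence)}
        hits = len(terms & words)
        if hits > 0:
            scored.append((sentence, doc_relevance, hits / len(terms)))
    return scored

print(score_sentences(
    "Zagreb es la capital de Croacia. El parlamento se reunió ayer.",
    doc_relevance=0.83,                   # relevance returned by the IR engine
    query_terms=["capital", "Croacia"],
))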

2.3 Answer Extraction

The answer extraction module uses a statistical approach to answer pinpointing, based on a syntactic-semantic context model of the answer built for each of the classes that the system uses. The following operations are performed:
1. Parsing and Anchor Searching. The sentences selected in the previous step are tagged and parsed using ms-tools. Chunks that contain any of the query terms are retagged with their semantic tags and are used as anchors. Finally, the system selects fragments in a window of words around the anchor terms, which are passed on to the next phase.
2. Answer Recognition. For every QA class we have previously trained an HMM that models the context of answers found in Google snippets, as explained below. A variant of the N-best recognition strategy is used to identify the most probable sequence of states (syntactic and semantic tags) that generated the POS sequence. A special semantic tag (##ANSWER##) represents the state in which the words that form the answer are generated. The recognition algorithm is guided to visit the states marked as anchors in order to find a path that passes through the answer state. The algorithm assigns a score to every computed path and candidate answer based on the log probabilities of the HMM (a toy decoding sketch is given after this list).
3. Ranking. Candidate answers are normalized (stopwords are removed) and ranked according to a weighted score that takes into account their length, the scores of the original documents and sentences, and the paths followed during recognition.
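
To make the recognition step concrete, the toy sketch below runs a plain Viterbi decode over an invented four-state model in which states are chunk/semantic tags and symbols are POS tags. The actual system uses an N-best variant guided through the anchor states, and all probabilities here are made up for illustration; only the idea of reading the answer off the state labelled ##ANSWER## is taken from the description above.

import math

def viterbi(symbols, states, log_init, log_trans, log_emit):
    """Return (best log-score, best state path) for a POS symbol sequence."""
    # each trellis cell holds (best log-score, best path) ending in that state
    trellis = [{s: (log_init[s] + log_emit[s].get(symbols[0], -9.0), [s]) for s in states}]
    for sym in symbols[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (score + log_trans[p].get(s, -9.0) + log_emit[s].get(sym, -9.0), path + [s])
                for p, (score, path) in trellis[-1].items()
            )
        trellis.append(column)
    return max(trellis[-1].values())

states = ["##ANSWER##", "g-v", "##REL##", "##COUNTRY##"]
log_init = {"##ANSWER##": math.log(0.5), "g-v": math.log(0.2),
            "##REL##": math.log(0.2), "##COUNTRY##": math.log(0.1)}
log_trans = {"##ANSWER##": {"g-v": math.log(0.8), "##REL##": math.log(0.2)},
             "g-v": {"##REL##": math.log(0.9), "##COUNTRY##": math.log(0.1)},
             "##REL##": {"##COUNTRY##": math.log(0.9), "g-v": math.log(0.1)},
             "##COUNTRY##": {"##ANSWER##": math.log(0.5), "g-v": math.log(0.5)}}
log_emit = {"##ANSWER##": {"np": math.log(0.9)}, "g-v": {"vs": math.log(0.9)},
            "##REL##": {"nc": math.log(0.8), "da": math.log(0.2)},
            "##COUNTRY##": {"np": math.log(0.7), "sp": math.log(0.3)}}

# POS sequence for a simplified "Zagreb es capital Croacia"
score, path = viterbi(["np", "vs", "nc", "np"], states, log_init, log_trans, log_emit)
print(score, path)   # the first np is decoded as ##ANSWER##, i.e. the candidate answer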

Fig. 3. Answer extraction for “Zagreb es la capital de Croacia y, junto a Belgrado, es...” (“Zagreb is the capital city of Croatia and, together with Belgrade, is...”). The model suggests the most probable sequence of states for the sequence of POS tags and assigns ##ANSWER## to the first np (proper noun), giving “Zagreb” as the candidate answer

Fig. 4. Architecture for the training of the extraction models: the question is analyzed (POS tagging and parsing, question classifier) as in the main system, question terms and answers are sent to Google (Web IR), the returned snippets are POS-tagged and parsed, anchors are located, and one QA class model is trained per class

2.4 Training for Answer Recognition

The models used in the answer extraction phase are trained beforehand from examples. To train them we have used the questions and answers from CLEF 2003. Questions are analyzed as in the main QA system. Question terms and answer strings are combined and sent to Google using the Google API (http://www.google.com/apis/). The snippets for the top 100
results are retrieved and stored to build the model. They are split into sentences, analyzed, and finally the terms that appeared in the question are tagged. The tag is either the semantic class assigned to that term in the question or the answer tag (##ANSWER##). Only sentences containing the answer and at least one of the other semantic tags are selected to train the model.
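
A minimal sketch of this filtering rule follows; the function and data names are ours, and matching is plain lowercase substring search rather than the tagged-chunk matching performed in the real pipeline.

def select_training_sentences(sentences, answer, tagged_terms):
    """Keep sentences that contain the answer and at least one semantically tagged term."""
    selected = []
    for sent in sentences:
        low = sent.lower()
        if answer.lower() not in low:
            continue
        if any(term.lower() in low for term in tagged_terms):
            selected.append(sent)
    return selected

snippet_sentences = [
    "Zagreb es la capital de Croacia y, junto a Belgrado, es...",
    "Croacia declaró su independencia en 1991.",
]
print(select_training_sentences(snippet_sentences, "Zagreb",
                                {"capital": "##REL##", "Croacia": "##COUNTRY##"}))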
To extract answers we train an HMM in which the states are the syntactic-semantic tags assigned to the chunks and the symbols are POS tags. To estimate the transition and emission probabilities of the automaton, we count the frequencies of the POS–POS and POS–chunk bigrams, and a simple add-one smoothing technique is applied. For every QA class we train a model that is used to estimate the score of a given sequence and to identify the answer, as explained above.
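
The estimation itself reduces to counting and smoothing. The sketch below is our own illustration with invented names and a single toy training sequence: it derives transition probabilities between chunk/semantic states and emission probabilities for POS symbols from bigram counts with add-one smoothing; in the system one such model is trained per QA class.

from collections import defaultdict

def estimate_hmm(sequences, states, pos_tags):
    """sequences: list of [(state, pos_tag), ...] built from the tagged training sentences."""
    trans = defaultdict(lambda: defaultdict(int))   # state -> next state -> count
    emit = defaultdict(lambda: defaultdict(int))    # state -> POS tag -> count
    for seq in sequences:
        for i, (state, pos) in enumerate(seq):
            emit[state][pos] += 1
            if i + 1 < len(seq):
                trans[state][seq[i + 1][0]] += 1
    trans_p, emit_p = {}, {}
    for s in states:                                # add-one smoothing and normalization
        t_total, e_total = sum(trans[s].values()), sum(emit[s].values())
        trans_p[s] = {t: (trans[s].get(t, 0) + 1) / (t_total + len(states)) for t in states}
        emit_p[s] = {p: (emit[s].get(p, 0) + 1) / (e_total + len(pos_tags)) for p in pos_tags}
    return trans_p, emit_p

training = [[("##ANSWER##", "np"), ("g-v", "vs"), ("##REL##", "da"),
             ("##REL##", "nc"), ("##COUNTRY##", "np")]]
trans_p, emit_p = estimate_hmm(training,
                               ["##ANSWER##", "g-v", "##REL##", "##COUNTRY##"],
                               ["np", "vs", "da", "nc"])
print(trans_p["##REL##"]["##COUNTRY##"], emit_p["##ANSWER##"]["np"])   # 0.333..., 0.4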

3 Results
We submitted one run for the monolingual Spanish task (mira041eses) that provides one exact answer to every question. Our system is unable to compute a confidence measure, so we simply assigned the default value of 0. There are two main kinds of questions, factoid and definition, and we tried the same approach for both. In addition, the question set contains some questions whose answer cannot be found in the document collection; the valid answer in that case is the NIL string.

Table 2. Results from mira041eses

Question type   Right   Wrong   IneXact   Unsupported
Factoid           18     154       4           1
Definition         0      17       3           0
Total             18     174       7           1

The results we obtained are fairly low compared with other systems. We attribute these poor results to the fact that the system is at a very early stage of development and tuning. We have drawn several conclusions from the analysis of correct and wrong answers that will guide our future work. The extraction algorithm works better for factoid questions than for definition questions. Among factoid questions, results are also better for certain QA classes (DATE, NAME, ...), which appear with higher frequency in our training set. For other QA classes (MANNER, DEFINITION) there were not enough examples to build a useful model. Another noteworthy fact is that our HMM algorithm is somewhat greedy when trying to identify the answer and shows some preference for words appearing near anchor terms. Finally, the algorithm is actually doing two jobs at once, as it identifies answers and, in some way, recognises answer types or entities according to patterns that were present in training answers of the same kind.
Another source of errors is induced by the document retrieval process and by the way we pose queries and score documents. The terms we select from a question are all given the same weight, although it is clear that giving more weight to proper names would favour the retrieval of more precise documents. Besides, the simple scoring schema that we used for sentences (one term, one point) helps mask some of the useful fragments.
Finally, some errors are also generated during the question classification step, as it is unable to handle some of the new surface forms introduced in this year’s question set. For that reason a catch-all class was also defined and used as a ragbag, but results were not expected to be good for that class. Moreover, POS tagging with MACO fails more frequently on questions, and these errors are propagated to the partial parsing. Our limited set of rules was not able to cope with some of these inaccurate parses.
The evaluation also reports results for the NIL answers we returned. In our case we returned 74 NIL answers and only 11 of them were correct (14.86%). NIL values were returned whenever the process did not provide any answer, and their high number is due to the chaining of the other problems mentioned above.

4 Future Work
Several lines of further research are open, following the deficiencies we have detected in the different modules of our system. One of the most straightforward improvements is the recognition of Named Entities and other specific types, which should entail changes and improvements in several modules. We believe that these improvements could enhance precision in answer recognition and also in retrieval.
With regard to the Question Analysis module, we are planning to improve the QA taxonomy as well as the coverage and precision of the rules. We are considering both manual and automatic methods for the acquisition of classification rules.
Besides the use of NEs in the Document Retrieval module, we need to improve the interface with the other two main subsystems. We are planning to develop better strategies for transforming questions into queries and more effective scoring mechanisms.
The results show that the answer extraction mechanism could work properly with appropriate training. We are interested in determining the amount of training data that would be needed to improve recognition results; we would probably need to acquire or generate a larger question-answer corpus. Along the same line, we expect to experiment with different finite-state approaches and learning techniques.
In a cross-cutting line, our interest lies in the development of multilingual and crosslingual QA. Some attempts were already made in this campaign to address more target languages, but they revealed that question classification needs a more robust approach in order to accept the output of current machine translation systems, at least for questions. Finally, we would like to explore whether our statistical approach to answer recognition is practical for other languages.

Acknowledgements
This work has been partially supported by the projects OmniPaper (European Union, 5th Framework Programme for Research and Technological Development, IST-2001-32174) and MIRACLE (Regional Government of Madrid, Regional Plan for Research, 07T/0055/2003).
Special mention should be made of our colleagues at the other members of the MIRACLE group: Ana García-Serrano, José Carlos González, José Miguel Goñi and Javier Alonso.

References
1. Abney, S., Collins, M., Singhal, A.: Answer Extraction. In: Proceedings of Applied Natural Language Processing (ANLP-2000) (2000)
2. Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Màrquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC'98), Granada, Spain (1998)
3. Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval. Addison Wesley, New York (1999)
4. Brill, E., Lin, J., Banko, M., Dumais, S., Ng, A.: Data-Intensive Question Answering. In: Proceedings of TREC 2001 (2001)
5. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey (2000)
6. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)
7. Magnini, B., Romagnoli, S., Vallin, A., Herrera, J., Peñas, A., Peinado, V., Verdejo, F., de Rijke, M.: The Multiple Language Question Answering Track at CLEF 2003 (2003). Available at http://clef.isti.cnr.it/2003/WN_web/36.pdf
8. Magnini, B., Vallin, A., Ayache, C., Erbach, G., Peñas, A., de Rijke, M., Rocha, P., Simov, K., Sutcliffe, R.: Overview of the CLEF 2004 Multilingual Question Answering Track. In: Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (eds.): Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany (2005)
9. Mérialdo, B.: Tagging English Text with a Probabilistic Model. Computational Linguistics 20 (1994) 155-171
10. Ravichandran, D., Hovy, E.H.: Learning Surface Text Patterns for a Question Answering System. In: Proceedings of the 40th ACL Conference, Philadelphia, PA (2002)
11. Vicedo, J.L.: Recuperando información de alta precisión. Los sistemas de Búsqueda de Respuestas. PhD Thesis, Universidad de Alicante (2003)
