
Geographic Information Extraction using Natural Language Processing in Wikipedia Texts


Edson B. de Lima¹, Clodoveu Augusto Davis Jr.¹
¹ Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG)
Belo Horizonte, MG, Brazil
edson@dcc.ufmg.br, clodoveu@dcc.ufmg.br

Abstract. Geographic information extraction from text is a rich source of diversified
data that can improve the semantic representation of Geographic Information Systems
(GIS). Linked Data provides a large amount of data, drawn mainly from integrated
knowledge bases. In recent years, Natural Language Processing (NLP) has made named
entity recognition a powerful tool for analyzing place names and for extracting
relations between location names and other entities. In the area of GIS and Linked Data
sources, software such as Geo-NER and LinkedOntoGazetteer is available to end users.
These tools provide a way to extract information from diversified sources and to enrich
the geographic context of places. However, text processing over such data sources still
leaves a large amount of geographic information unexploited. This work therefore
presents the results of an approach that processes the text of Wikipedia articles and
quantifies the geographic information related to them, using GIS attributes available in
LinkedOntoGazetteer.

1. Introduction
Today, a considerable amount of information is found in unstructured text, such as
Wikipedia¹ articles, forums, and social networks. However, it is hard to obtain specialized
information from a large collection of documents that cover many other topics. One such
type of information belongs to the geographic domain, where locations are obtained through
named entity recognition and relation mentions. The problem that arises is how to analyze
and quantify location information according to the features that classify a place within
the geographic domain. Some applications gather diversified data from sources that help in
working with spatial information. For example, LinkedOntoGazetteer² is a graph that
contains relationships between place names and non-place names. The main goal of the
system is to provide multiple pieces of information about places given a place name.
GeoNames³, in turn, is a geographic database that provides place features such as latitude
and longitude; it covers all countries and contains millions of place names.
In turn, text can be processed through information retrieval for several purposes. For
example, collecting patterns that help to identify location entities and to extract
relations between entity types is a key Natural Language Processing (NLP) task
[Chowdhury, 2003]. In the NLP context, a large set of features is extracted from the
sentences in the text. These features are then fed to an SVM classifier [Hearst et al.,
1998] that labels each word [Collobert et al., 2011]. However, choosing the features is a
hard process that consists of a sequence of empirical tasks based on linguistic intuition.

¹ https://www.wikipedia.org/
² http://aqui.io/log/
³ http://www.geonames.org/

Perea Ortega et al. [2009] present Geo-NER, a Geographic Information System (GIS) [Chang,
2006] application for detecting and recognizing geographic named entities, based on a
generic entity tagger and on other geographic resources generated from Wikipedia. Geo-NER
also uses a gazetteer, GeoNames, together with some purpose-built heuristics. However,
Geo-NER is restricted to GeoNames queries and heuristics, and lacks diversified data from
other sources. This work analyzes geographic information from LinkedOntoGazetteer (LOG),
which integrates data from sources such as DBpedia, Freebase, and GeoNames. It also aims
to evaluate which class of documents contains more geographic information, according to
the candidate place names obtained from location entities.

2. Geographic feature selection


Within a text, most location features are selected through named entities and the
relations extracted between these entities. Natural Language Processing is therefore a key
means of analyzing a document and extracting patterns that match location properties. In
this work, we propose an approach that uses two specific NLP techniques to collect location
attributes based on features present in relationships and in sentence structure. The first
technique is Named Entity Recognition (NER) [Finkel et al., 2005] over a text or corpus, in
this case a Wikipedia article. The NER module receives a text as input and breaks it into
sentences through a subroutine called sentence tokenization, which finds sentence
boundaries using cues such as the full stop or other sentence-ending marks. Each sentence
then passes through another subroutine, word tokenization, which applies a similar
procedure to split the sentence into word tokens.
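The two tokenization steps can be illustrated with a minimal sketch. It uses only regular expressions from the Python standard library and is not the tokenizer used in the experiments; the function names and the example text are hypothetical, and the boundary rule (end mark, whitespace, then a capital letter or digit) is a simplifying assumption that fails on abbreviations.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter or digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text.strip())
    return [part for part in parts if part]

def tokenize_words(sentence):
    """Return words (allowing inner hyphens/apostrophes) and punctuation marks as tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Hypothetical input text, used only for illustration.
text = ("New York City is the most populous city in the United States. "
        "It lies in the state of New York.")

for sentence in split_sentences(text):
    print(tokenize_words(sentence))
```

A production tokenizer handles abbreviations, quotations, and numbers with far more care; the sketch only shows how sentence boundaries and word tokens are derived in sequence.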
A set of further modules then analyzes the tokens to find patterns that characterize
these words. The Part-of-Speech (POS) tagger [Toutanova et al., 2003], for example, takes
tokens as input and labels each one with its part of speech. A group of POS-tagged tokens
forms a chunk, which is used to establish a name pattern such as a named entity or another
name expression. NER then uses these chunks to obtain named entities according to their
types. The NER model in this work uses only three entity types: PERSON, ORGANIZATION, and
LOCATION. Since LOCATION represents any place name, extracting location names with NER is
one of the main steps in our analysis of geographic information in documents. Figure 1
shows NER applied to a sentence to obtain the named entities that refer to location names.
In the example, New York City and United States are recognized as entities of type
LOCATION.
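As a rough illustration of how per-token NER labels become location names, the sketch below merges consecutive tokens that carry the LOCATION tag into a single mention. The tagged sentence is hand-written for this example (it is not tagger output), and the function name is hypothetical.

```python
from itertools import groupby

def extract_entities(tagged_tokens, wanted="LOCATION"):
    """Merge runs of consecutive tokens that share the wanted tag into one mention."""
    mentions = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == wanted:
            mentions.append(" ".join(token for token, _ in group))
    return mentions

# Hand-labeled (token, tag) pairs; "O" marks tokens outside any entity.
tagged = [
    ("New", "LOCATION"), ("York", "LOCATION"), ("City", "LOCATION"),
    ("is", "O"), ("a", "O"), ("city", "O"), ("in", "O"), ("the", "O"),
    ("United", "LOCATION"), ("States", "LOCATION"), (".", "O"),
]

print(extract_entities(tagged))  # ['New York City', 'United States']
```

Note that this simple grouping would fuse two distinct locations that appear side by side with no token between them; a tagger that emits begin/inside markers avoids that ambiguity.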
The second technique in this work is entity relation extraction, a task that runs after
the NER subroutine. Each relationship between a location name and a named entity of one of
the types above is extracted from the text as a triple: a subject and an object, each
representing a named entity, and a predicate that expresses the relationship found in the
sentence. Figure 2 shows all the relationships found in a sentence between two named
entities identified by the NER process. In this case, however, the relation extractor does
not assign entity types, so the entities have to be processed by the NER module to obtain
them.
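The combination of the two techniques can be sketched as follows: relation triples carry no entity types, so a separate lookup built from NER output decides which triples involve a location. The `Triple` type, the function name, and the sample data are hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

def location_triples(triples, entity_types):
    """Keep triples whose subject or object was typed LOCATION by the NER step."""
    kept = []
    for triple in triples:
        subject_type = entity_types.get(triple.subject)
        object_type = entity_types.get(triple.obj)
        if "LOCATION" in (subject_type, object_type):
            kept.append(triple)
    return kept

# Hand-written triples and entity types, for illustration only.
triples = [
    Triple("New York City", "is in", "United States"),
    Triple("Facebook", "was founded by", "Mark Zuckerberg"),
    Triple("Facebook", "is based in", "United States"),
]
entity_types = {
    "New York City": "LOCATION",
    "United States": "LOCATION",
    "Facebook": "ORGANIZATION",
    "Mark Zuckerberg": "PERSON",
}

for triple in location_triples(triples, entity_types):
    print(triple)
```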

3. Experimental Results
To evaluate the spatial semantic score of the proposed approach, we tested two classes of
Wikipedia articles, with 399 documents in the first class and 267 documents in the second.
The first class corresponds to a list of the most populous cities and states in the USA,
with each article describing the properties of a city or state. The second class covers the
main types of social networks in the world. It is important to keep in mind that the
algorithm does not obtain all named entities. The aim of this experiment is to verify which
class of articles contains more geographic information, according to all candidate place
names.
The named entities and relationships were obtained using the Open Information Extraction
(OpenIE) and NER annotators of Stanford CoreNLP [Manning et al., 2014], which provides an
NLP toolkit for text processing and parallel pipeline annotation. To obtain entity
relationship information, CoreNLP OpenIE extracts open-domain relation triples, each
representing a subject, a relation, and the object of the relation [Angeli et al., 2015].
According to the authors, the OpenIE extractor is useful for relation extraction tasks
where there is limited or no training data, and when speed is essential. As Wikipedia
articles come with no training data and the documents are large, both properties are
important for obtaining the place entities that take part in a relationship.
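To make the shape of these annotations concrete, the sketch below reads relation triples and LOCATION tokens from a small JSON document. The document is hand-written and only assumed to resemble the JSON that a CoreNLP server returns when the NER and OpenIE annotators are enabled (a `sentences` list whose items hold `tokens` and `openie` entries); the field names should be checked against the CoreNLP version in use, and `read_annotations` is a hypothetical helper.

```python
import json

# Hand-written sample, assumed to mimic CoreNLP JSON output (not real output).
sample_response = json.loads("""
{"sentences": [{
  "index": 0,
  "tokens": [
    {"index": 1, "word": "Chicago", "ner": "LOCATION"},
    {"index": 2, "word": "is", "ner": "O"},
    {"index": 3, "word": "in", "ner": "O"},
    {"index": 4, "word": "Illinois", "ner": "LOCATION"}
  ],
  "openie": [
    {"subject": "Chicago", "relation": "is in", "object": "Illinois"}
  ]
}]}
""")

def read_annotations(response):
    """Collect (subject, relation, object) triples and the words tagged LOCATION."""
    triples, locations = [], set()
    for sentence in response.get("sentences", []):
        for item in sentence.get("openie", []):
            triples.append((item["subject"], item["relation"], item["object"]))
        for token in sentence.get("tokens", []):
            if token.get("ner") == "LOCATION":
                locations.add(token["word"])
    return triples, locations

triples, locations = read_annotations(sample_response)
print(triples)
print(sorted(locations))
```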
After extracting the relations that involve place entities, we analyze location features
using the GeoNames feature classes and feature codes returned by the LinkedOntoGazetteer
(LOG) services [Moura et al., 2016] [Wick and Vatant, 2012]. According to Perea Ortega et
al. [2009], the GeoNames gazetteer assigns each geographic feature to one of nine classes
and has more than 645 subcategories, represented by feature codes. Tables 1 and 2 show the
main GeoNames feature classes and codes used in this work to classify each group of
articles by geographic features. For the experiments we chose only four feature classes and
four codes, because most place names are classified under these types. If a place name
belongs to feature class A (Administrative Boundary), its feature code must be one of the
ADM codes. Likewise, the Populated Place codes are usually associated with feature class P
(Populated Place).
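The pairing between feature classes and feature codes described above can be sketched as a small consistency check followed by a count. The sketch is restricted to the four classes and four codes used in this work; the candidate records are hypothetical, and the rule that classes L and H are accepted without a code check is an assumption made here because the text states no rule for them.

```python
from collections import Counter

# Feature classes used in this work.
FEATURE_CLASSES = {
    "A": "Administrative Boundary",
    "P": "Populated Place",
    "L": "Area",
    "H": "Hydrographic",
}
# Pairing stated in the text: class A with ADM codes, class P with PPL codes.
CODE_PREFIX = {"A": "ADM", "P": "PPL"}

def is_consistent(feature_class, feature_code):
    """Check the class/code pairing; classes without a stated rule are accepted."""
    prefix = CODE_PREFIX.get(feature_class)
    return prefix is None or feature_code.startswith(prefix)

def count_features(candidates):
    """Count feature classes and codes over the consistent candidate records."""
    classes, codes = Counter(), Counter()
    for feature_class, feature_code in candidates:
        if is_consistent(feature_class, feature_code):
            classes[feature_class] += 1
            codes[feature_code] += 1
    return classes, codes

# Hypothetical candidates for one place name: (feature class, feature code).
candidates = [("P", "PPL"), ("P", "PPLA"), ("A", "ADM1"), ("A", "PPL")]
classes, codes = count_features(candidates)
for feature_class, count in classes.items():
    print(FEATURE_CLASSES[feature_class], count)
print(dict(codes))
```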
Considering the three experiments, the results show that documents with geographic context
contain more GeoNames feature classes, mainly the Populated Place class. According to
Tables 3 and 4, the USA cities documents have the highest GeoNames feature class ratios,
followed by the social network and chat website documents. Similarly, Tables 5 and 6 show
that the GeoNames feature codes follow the same order as the

Figure 1. Named Entity Recognition applied to a sentence, with the recognized location
entities. [Manning et al., 2014]

Figure 2. Relation extraction applied to a sentence using the CoreNLP toolkit. [Manning et
al., 2014]

Table 1. Feature Class

API  Feature Class  Description
1    A              Administrative Boundary
2    P              Populated Place
3    L              Area
4    H              Hydrographic

Table 2. Feature Code

API  Feature Code  Description
1    ADM1          First Adm. Division
2    ADM2          Second Adm. Division
3    PPL           Populated Place Code
4    PPLA          Seat of First Adm. Div.

Table 3. GeoNames feature classes of place names. The results contain the candidate place
names for each location entity. Rows follow the document order of Table 4.

Documents        Administrative Boundary  Populated Places  Area   Hydrographic  TOTAL
USA Cities       21148                    235600            26436  13500         296684
Social networks  672                      7430              469    428           8999
Chat websites    24                       272               16     3             315

Table 4. Feature class ratio by number of location triples

Documents        A Ratio  P Ratio  L Ratio  H Ratio  Triples  LOCATION Triples  LOCATION Triples (%)
USA Cities       5.11     56.95    6.39     3.26     15861    11375             71.72
Social networks  4.61     50.98    3.22     2.94     1009     623               61.74
Chat websites    1.96     22.17    1.3      0.24     74       19                25.68

Table 5. GeoNames feature codes of place names extracted from Wikipedia documents. These
results also contain the candidate place names obtained from LinkedOntoGazetteer. Rows
follow the document order of Table 6.

Documents        ADM1  ADM2  PPL     PPLA  Total
US cities        2431  3193  211969  1340  218933
Social networks  93    98    6830    26    7047
Chat websites    3     7     249     4     263

feature classes across the document groups. However, the largest differences appear in the
PPL and PPLA feature codes, which indicate Populated Place properties. This shows that
ADM1 and ADM2 appear in all document groups in similar proportions, while PPL and PPLA
vary more widely.

Table 6. GeoNames feature code ratios

Context          ADM1  ADM2  PPL    PPLA  Triples  LOCATION Triples (%)  LOCATION Triples
US cities        0.8   1.05  69.44  0.44  15861    71.72                 11375
Social networks  0.81  0.86  59.84  0.23  1009     61.74                 623
Chat websites    0.29  0.68  24.31  0.39  74       25.68                 19

Therefore, in these experiments, texts with a geographic context or a related subject are
more likely to yield a larger number of place names and location properties. It is
important to keep in mind that the feature class ratios are calculated using the percentage
of LOCATION triples, so that only location mentions are analyzed.
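The text does not spell out the ratio formula. One reading that is consistent with the reported numbers is: the share of a feature class among all candidate place names, multiplied by the percentage of LOCATION triples. The sketch below applies this assumed formula to the USA cities counts (Table 3) and triple counts (Table 4); the function names are hypothetical.

```python
def location_share(location_triples, total_triples):
    """Percentage of triples that involve a LOCATION entity."""
    return 100.0 * location_triples / total_triples

def class_ratios(class_counts, location_triples, total_triples):
    """Assumed formula: class share among candidates times LOCATION triple percentage."""
    total_candidates = sum(class_counts.values())
    share = location_share(location_triples, total_triples)
    return {
        name: round(count / total_candidates * share, 2)
        for name, count in class_counts.items()
    }

# Counts for the USA cities documents, taken from Tables 3 and 4.
usa_cities = {"A": 21148, "P": 235600, "L": 26436, "H": 13500}
print(round(location_share(11375, 15861), 2))   # 71.72
print(class_ratios(usa_cities, 11375, 15861))
# {'A': 5.11, 'P': 56.95, 'L': 6.39, 'H': 3.26}
```

Under this assumption the computed values match the first row of Table 4, which suggests that the ratios of the different document groups are comparable only after weighting by how many triples mention a location.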

4. Conclusion and future work


This work has shown that extracting information from documents with geographic context,
and from texts on related subjects, helps to find more place names and location properties
when candidate locations are considered for each place name. LinkedOntoGazetteer supports
the analysis of geographic properties of place names by giving access to GeoNames feature
classes and to linked data related to the location entity names obtained by NLP tasks.
As future work, we intend to evaluate location relation extraction with a disambiguation
process that analyzes only the resolved place names, without retrieving all candidate place
names for a location name. A set of triples obtained from the documents is then essential
to establish relationships with other named entities, in order to disambiguate names and
to provide more semantic meaning to geographic information systems.

References
Angeli, G., Premkumar, M. J., and Manning, C. D. (2015). Leveraging linguistic structure
for open domain information extraction. Association for Computational Linguistics.
Chang, K.-T. (2006). Geographic information system. The International Encyclopedia of
Geography.
Chowdhury, G. G. (2003). Natural language processing. Annual Review of Information
Science and Technology, 37(1):51-89.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning
Research, 12(Aug):2493-2537.
Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 363-370.
Association for Computational Linguistics.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Schölkopf, B. (1998). Support
vector machines. IEEE Intelligent Systems and their Applications, 13(4):18-28.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D.
(2014). The Stanford CoreNLP natural language processing toolkit. In ACL (System
Demonstrations), pages 55-60.
Moura, T. H. V. M., Davis, C. A., and Fonseca, F. T. (2016). Reference data enhancement
for geographic information retrieval using linked data. Transactions in GIS, pages
n/a-n/a.
Perea Ortega, J. M., Martínez Santiago, F., Montejo Ráez, A., and Ureña López, L. A.
(2009). Geo-NER: un reconocedor de entidades geográficas para inglés basado en
GeoNames y Wikipedia.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich
part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology-Volume 1, pages 173-180. Association for
Computational Linguistics.
Wick, M. and Vatant, B. (2012). The GeoNames geographical database. Available from
World Wide Web: http://geonames.org.
