Abstract. Geographic information extraction from text is a rich source of diversified
data that can improve the semantic representation of Geographic Information Systems
(GIS). Linked Data provides a large amount of such data, diversified through the
integration of multiple knowledge bases. In recent years, Natural Language Processing
has increasingly relied on named entity recognition as a powerful tool for analyzing
place names and extracting relations between location names and other entities.
In the area of GIS and Linked Data sources, software such as Geo-NER and
LinkedOntoGazetteer is available to end users. These tools provide a way to extract
information from diversified sources and to enrich the geographic context of places.
However, when data sources and text processing are combined, a large amount of
geographic information is still missing. Thus, this work presents the results of an
approach that processes text and quantifies the geographic information in Wikipedia
articles by accessing GIS attributes in LinkedOntoGazetteer.
1. Introduction
Currently, a considerable amount of information is found in unstructured text and
documents, such as Wikipedia1 articles, forums, and social networks. However, it is
hard to obtain specialized information from a large collection of documents that discuss
many other specific topics. One such type of information belongs to the geographic domain,
where locations are obtained through named entity recognition and relation mentions. The
problem that arises is how to analyze and quantify location information according to the
features that classify a place within the geographic domain. Some applications draw
diversified data from sources that help to work with spatial information. For example,
LinkedOntoGazetteer2 is a graph that contains relationships between place names and
non-place names. The main goal of the system is to provide multiple pieces of information
about a place given its name. GeoNames3, on the other hand, is a geographic database that
provides place features such as latitude and longitude; it covers all countries and
contains millions of place names.
In turn, through information retrieval, a text can be processed for several purposes.
For example, collecting patterns that help to identify location named entities and to
extract relations between entity types is a key part of Natural Language Processing (NLP)
[Chowdhury, 2003] tasks. In the NLP context, a large set of features is extracted from
the sentences in the text. These features must then feed an SVM classifier
[Hearst et al., 1998] that labels each word [Collobert et al., 2011]. However, feature
selection is a hard process that consists of a sequence of empirical tasks based on
linguistic intuition.
1 https://www.wikipedia.org/
2 http://aqui.io/log/
3 http://www.geonames.org/
Perea Ortega et al. [2009] discuss Geo-NER, a Geographic Information System (GIS)
[Chang, 2006] for the detection and recognition of geographic named entities, based on a
generic entity tagger and other geographic resources generated from Wikipedia. Geo-NER
also uses a gazetteer, GeoNames, as a source, together with some purpose-built heuristics.
However, Geo-NER is restricted to GeoNames queries and heuristics, and lacks diversified
data from other sources. This work analyzes geographic information from
LinkedOntoGazetteer (LOG), which integrates data from sources such as DBpedia, Freebase,
and GeoNames. This paper also aims to evaluate which class of documents contains more
geographic information, according to the candidate place names obtained from location
named entities.
3. Experimental Results
To evaluate the spatial semantic score with the proposed approach, we tested two classes
of Wikipedia articles, with 399 documents in the first class and 267 documents in the
second. The first class corresponds to a list of the most populous cities and states in
the USA, with each article describing the properties of a city or state. The second class
covers the main types of social networks in the world. It is important to keep in mind
that the algorithm does not obtain all named entities. The aim of this experiment is to
verify which class of articles contains more geographic information according to its
candidate place names.
The named entities and relationships were obtained using the Open Information Extraction
(OpenIE) and NER annotators of Stanford CoreNLP [Manning et al., 2014], which provides
an NLP toolkit for text processing and parallel annotation pipelines. To obtain entity
relationship information, CoreNLP OpenIE extracts open-domain relation triples, each
representing a subject, a relation, and the object of the relation [Angeli et al., 2015].
According to the authors, the OpenIE extractor is useful for relation extraction tasks
where there is limited or no training data, and when speed is essential. As Wikipedia
articles have no training data and the documents are large, both properties are extremely
important for obtaining the place entities that take part in a relationship.
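As an illustration of this step, the following minimal sketch keeps only the triples in
which the subject or the object is tagged as a LOCATION and collects the candidate place
names from them. It does not call CoreNLP: the Triple class, the helper names, the sample
triples, and the ner_labels dictionary are all hypothetical stand-ins for the annotator
output, chosen only to show the filtering logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one OpenIE (subject, relation, object) triple."""
    subject: str
    relation: str
    obj: str


def location_triples(triples, ner_labels):
    """Keep triples whose subject or object carries the LOCATION NER label."""
    def is_location(span):
        return ner_labels.get(span) == "LOCATION"
    return [t for t in triples if is_location(t.subject) or is_location(t.obj)]


def candidate_place_names(triples, ner_labels):
    """Collect distinct LOCATION spans from location triples, in order of appearance."""
    names = []
    for t in location_triples(triples, ner_labels):
        for span in (t.subject, t.obj):
            if ner_labels.get(span) == "LOCATION" and span not in names:
                names.append(span)
    return names


# Illustrative data only; real input would come from the OpenIE and NER annotators.
triples = [
    Triple("Austin", "is capital of", "Texas"),
    Triple("Facebook", "was founded in", "2004"),
    Triple("Twitter", "is based in", "San Francisco"),
]
ner_labels = {
    "Austin": "LOCATION",
    "Texas": "LOCATION",
    "San Francisco": "LOCATION",
    "Facebook": "ORGANIZATION",
    "Twitter": "ORGANIZATION",
    "2004": "DATE",
}

kept = location_triples(triples, ner_labels)
places = candidate_place_names(triples, ner_labels)
location_share = 100 * len(kept) / len(triples)  # percentage of LOCATION triples
print(places, round(location_share, 2))
```

The percentage computed on the last lines corresponds to the share of LOCATION triples
among all extracted triples for a group of documents.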
After extracting the relations that involve place name entities, we analyze location
features using the GeoNames feature codes and feature classes returned in
LinkedOntoGazetteer (LOG) service responses [Moura et al., 2016] [Wick and Vatant, 2012].
According to Perea Ortega et al. [2009], the GeoNames gazetteer assigns each geographic
feature to one of nine classes and has more than 645 subcategories, represented by
feature codes. Tables 1 and 2 show some of the main GeoNames feature classes and codes
used in this work to classify each group of articles by geographic features. For the
experiments we chose only four feature classes and four feature codes because most place
names are classified under these types. If a place name belongs to feature class A
(Administrative Boundary), its feature code, among those considered here, is one of the
ADM codes. The Populated Place codes (PPL), in turn, are usually related to feature
class P (Populated Place).
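The classification described above can be sketched as a simple tally. The record layout
used here (dictionaries with feature_class and feature_code keys) is an assumption made
for illustration and is not the actual format of LOG responses; the sample records are
likewise invented. Only the pairing of ADM1 and ADM2 with class A, and of PPL and PPLA
with class P, comes from the text.

```python
from collections import Counter

# Feature codes considered in the experiments and the class each one belongs to.
FEATURE_CLASS_OF_CODE = {"ADM1": "A", "ADM2": "A", "PPL": "P", "PPLA": "P"}


def tally_features(records):
    """Count feature classes, and the selected feature codes, over place records."""
    class_counts = Counter()
    code_counts = Counter()
    for rec in records:
        class_counts[rec["feature_class"]] += 1
        code = rec["feature_code"]
        # Count a code only if it is one of the four selected codes and it is
        # consistent with the feature class reported for the same place.
        if FEATURE_CLASS_OF_CODE.get(code) == rec["feature_class"]:
            code_counts[code] += 1
    return class_counts, code_counts


# Hypothetical records standing in for gazetteer lookups of candidate place names.
records = [
    {"name": "Texas", "feature_class": "A", "feature_code": "ADM1"},
    {"name": "Travis County", "feature_class": "A", "feature_code": "ADM2"},
    {"name": "Austin", "feature_class": "P", "feature_code": "PPLA"},
    {"name": "Colorado River", "feature_class": "H", "feature_code": "STM"},
]

class_counts, code_counts = tally_features(records)
print(dict(class_counts), dict(code_counts))
```

Places outside the four selected codes, such as the stream in the sample, still count
toward their feature class but not toward any feature code total.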
Considering the three experiments, the results show that documents with a geographic
context contain more GeoNames feature classes, mainly the Populated Place class.
According to Tables 3 and 4, the US cities documents have the highest GeoNames feature
class ratio, followed by the social network and chat website documents. Similarly,
Tables 5 and 6 show that the GeoNames feature codes follow the same order as the feature
classes across the document groups. However, the largest difference is in the PPL and
PPLA feature codes, which indicate Populated Place properties. This shows that ADM1 and
ADM2 appear in all document groups in similar proportions, while PPL and PPLA vary more
widely.
Table 3. GeoNames feature classes of place names. The results contain the candidate
place names obtained from location named entities.
Administrative Boundary   Populated Places   Area    Hydrographic   Total    Context
21148                     235600             26436   13500          296684   US cities
672                       7430               469     428            8999     Social networks
24                        272                16      3              315      Chat websites
Table 5. GeoNames feature codes of place names extracted from Wikipedia documents. These
results also contain candidate place names obtained from LinkedOntoGazetteer.
ADM1   ADM2   PPL      PPLA   Total    Context
2431   3193   211969   1340   218933   US cities
93     98     6830     26     7047     Social networks
3      7      249      4      263      Chat websites
Table 6. GeoNames feature code ratios
ADM1   ADM2   PPL     PPLA   Triples   LOCATION Triples (%)   LOCATION Triples   Context
0.8    1.05   69.44   0.44   15861     71.72                  11375              US cities
0.81   0.86   59.84   0.23   1009      61.74                  623                Social networks
0.29   0.68   24.31   0.39   74        25.68                  19                 Chat websites
Therefore, in these experiments, texts with a geographic context or a related subject are
more likely to contain a larger number of place names and location properties. It is
important to keep in mind that the feature class ratios are calculated using the
percentage of LOCATION triples, so that only location events are analyzed.
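The text does not give the ratio formula explicitly. The sketch below uses one reading
that appears consistent with the published numbers: each feature code count is divided by
the total of the four code counts and then scaled by the percentage of LOCATION triples.
This formula is inferred, not stated by the authors, and the sketch further assumes that
the rows of Table 5 follow the same order as the contexts in Table 6. The function name
is hypothetical.

```python
def feature_code_ratios(code_counts, location_triples, total_triples):
    """Return (LOCATION triples %, {feature code: ratio %}), rounded to 2 decimals.

    Assumed formula: ratio = count / sum(counts) * LOCATION triples percentage.
    """
    location_pct = 100 * location_triples / total_triples
    total_codes = sum(code_counts.values())
    ratios = {
        code: round(count / total_codes * location_pct, 2)
        for code, count in code_counts.items()
    }
    return round(location_pct, 2), ratios


# Feature code counts from Table 5; LOCATION and total triple counts from Table 6.
groups = {
    "US cities": ({"ADM1": 2431, "ADM2": 3193, "PPL": 211969, "PPLA": 1340}, 11375, 15861),
    "Social networks": ({"ADM1": 93, "ADM2": 98, "PPL": 6830, "PPLA": 26}, 623, 1009),
    "Chat websites": ({"ADM1": 3, "ADM2": 7, "PPL": 249, "PPLA": 4}, 19, 74),
}

results = {name: feature_code_ratios(*args) for name, args in groups.items()}
for context, (location_pct, ratios) in results.items():
    print(context, location_pct, ratios)
```

Under these assumptions the computed values agree with the Table 6 entries to two
decimal places, which supports this reading without confirming it.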
References
Angeli, G., Premkumar, M. J., and Manning, C. D. (2015). Leveraging linguistic structure
for open domain information extraction. Association for Computational Linguistics.
Chang, K.-T. (2006). Geographic information system. The International Encyclopedia of
Geography.
Chowdhury, G. G. (2003). Natural language processing. Annual Review of Information
Science and Technology, 37(1):51-89.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning
Research, 12(Aug):2493-2537.
Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 363-370. Association
for Computational Linguistics.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Schölkopf, B. (1998). Support
vector machines. IEEE Intelligent Systems and their Applications, 13(4):18-28.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D.
(2014). The Stanford CoreNLP natural language processing toolkit. In ACL (System
Demonstrations), pages 55-60.
Moura, T. H. V. M., Davis, C. A., and Fonseca, F. T. (2016). Reference data enhancement
for geographic information retrieval using linked data. Transactions in GIS, pages
n/a-n/a.
Perea Ortega, J. M., Martínez Santiago, F., Montejo Ráez, A., and Ureña López, L. A.
(2009). Geo-NER: un reconocedor de entidades geográficas para inglés basado en
GeoNames y Wikipedia.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich
part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology-Volume 1, pages 173-180. Association for Computational
Linguistics.
Wick, M. and Vatant, B. (2012). The geonames geographical database. Available from
World Wide Web: http://geonames.org.