Geographic Information Retrieval Systems

Geographic Information Retrieval

An Overview

Problem Statement:

Introduction

● Geographic Information Retrieval can be seen as a specialized branch of traditional Information Retrieval.

● Information that relates to geographic space is called georeferenced information; the term is used frequently in Geographic Information Retrieval (GIR).

● Georeferenced information appears in all kinds of media, e.g. structured data such as maps, land surveys, airborne and satellite images, and tabulated observations.

● It can also be used by researchers looking for a certain area, for example an area inhabited by certain animals or affected by an epidemic.

Properties of Georeferenced Information:

● Much of the information available in digital libraries and on the Internet is georeferenced, although most of it is not expressed in terms of geographic coordinates.

● The geographical location and extent of a place name is often called its geographic footprint, and it is given by coordinates (longitude, latitude).

● Geographic Information Retrieval requires that place names and phrases that include direct or indirect references to place names be resolved and translated into footprints that can be indexed.

General Problems in GIR: Ambiguity/Lack of Precision in Place Names:

● Firstly, several places can share the same name, making the place names unique only within a limited geographic area.

● Secondly, some place names occurring in texts are temporal or cultural conventions rather than official names, requiring the user to understand the time, context or cultural environment the place names are used in to be able to link them to a geographic location.

● Thirdly, some place names change over time, e.g. Bangalore to Bengaluru, Calcutta to Kolkata.

● Fourthly, the geographic extension that the place name denotes can be extended, reduced or changed over time.

General Problems in GIR: (contd.)

○ Fifthly, the borders of a location can be fuzzy. (Kashmir?)

○ The same place name can be written differently in different texts, either because the author has misspelled the name or because there are several legitimate spellings of the same place name.

Information being fuzzy :

○ “About 200 kilometers south of the capital of Russia”: the direction may vary, and the distance may vary. South Africa has three capitals, so a similar phrase about South Africa would be ambiguous.

○ Often, people are imprecise in giving geographic direction, using one of the four general directions north, south, east or west, when the actual direction might be somewhere in between.

Impact of cognitive model on Geographic Information Retrieval

● Human understanding of geographic location is of two kinds: procedural and survey-based.

● Survey: Involves looking at maps and geographic location finding.

● Procedural: Involves exploring and navigating through the place so as to get the 'feel' of it.

● Using procedural method to locate or gain information is particularly difficult as it contains many phrases involving human ambiguity.

Cognitive model (continued)

● 'People link geographic distance with time': when talking about going from 'A' to 'B', people tend to use time as a way of expressing distance, e.g. "It takes two hours to get from 'A' to 'B' by car."

● 'Topology and metric distances': People are very good at describing topological aspects of a place, such as inclusion (e.g. naming the places inside an area) or coincidence (e.g. "this place is at the same place as ...").

● 'People have biases towards the east-west or north-south direction': People have a biased view of geographical areas and only a vague sense of direction when giving specifics. E.g. when asked where South America is with respect to North America, the usual answer is "south", while it really lies to the south-east.

Geo-referencing using Gazetteers

Gazetteer: a form of index that relates place names to the coordinates of locations and extents.

Here we focus on automatic geo-referencing based on the contents of the document text alone.

In the automated approach, most projects have based georeferencing on a combination of place name identification and natural language processing. The latter identifies phrases that modify the location pointed to by an occurrence of a place name (“200 km south of Moscow”) or that indicate a geo-reference without actually mentioning a specific place name (“Rosenborg's home field”).

Geo-referencing (continued)

Gazetteers have three basic components:

The name is the textual designator of a geographic location; the location is the coordinates of a point, line or area on the earth’s surface pointed to by a name; and the feature type is the type of location that a name points to (forest, agricultural area, river, inhabited location, etc.). The location that a place name refers to (the place name’s footprint) can be given as a point, a bounding box or a polygon, all represented by coordinates.

Geo-referencing (continued)

Centroid point:

Vague in terms of the geometry and size of the area. Requires little data storage.

Geo-referencing (continued)

Bounding Box:

Gives a better idea of the entire referenced area. Does not require a lot of data storage. However, it overlaps other areas around it and is inaccurate.

Geo-referencing (continued)

Approximated Polygon approach:

Most accurate in terms of referencing. However, it takes a lot of data storage space.

The best approach would be something in between the polygon and bounding box approaches, such as a polygon with a fixed number of points.
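The trade-off between the three footprint representations can be illustrated with a small sketch. The function names and the L-shaped example area are made up for illustration, and coordinates are treated as planar (x, y) values rather than true longitude/latitude.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def centroid_point(polygon: List[Point]) -> Point:
    """Average of the vertices: a cheap stand-in for a centroid footprint."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def box_area(box: Tuple[float, float, float, float]) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])

# Hypothetical L-shaped area (illustrative values only).
L_SHAPE = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]

box = bounding_box(L_SHAPE)
print("centroid point:", centroid_point(L_SHAPE), "(2 numbers stored)")
print("bounding box:", box, "(4 numbers stored)")
print("polygon:", len(L_SHAPE) * 2, "numbers stored")
print("true area:", polygon_area(L_SHAPE), "box area:", box_area(box))
```

In this example the bounding box covers 16 units of area while the polygon covers only 7, which is the overlap problem mentioned above; the polygon pays for its accuracy with three times the storage of the box.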

Searching for Georeferenced Information

Letting the user specify one or more place names as keywords in a traditional keyword-based query. When parsing the query, the GIR/IR system treats the place names it finds as special keywords that indicate the geographical scope of the user's information need. E.g. Googling for restaurants around you.

Letting users specify the geographic constraint of a query by drawing on one or more maps. E.g. Google Maps.

And what about GPS apps like "Here and Now" or "Google Latitude"?

Searching for Georeferenced Information

Typical Queries:

○ Point in Polygon - asking for georeferenced information that contains, surrounds or refers to a particular geographic point location

○ Region Queries - asking for anything contained in, adjacent to, or overlapping the region.

○ Distance and Buffer Zone Queries - asking for information within some fixed distance of a geographic object (point, line, polygon)

○ Path Queries - asking for the presence of a network structure that can be queried for network traversal information

○ Multimedia Queries - combining multiple geo-referenced information sources in resolving a query.
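As an illustration of the first query type, a point-in-polygon test can be sketched with the standard ray-casting technique. This is only a minimal sketch: the document names and footprints are hypothetical, and coordinates are treated as planar.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(point: Point, footprints: Dict[str, List[Point]]) -> List[str]:
    """Return the documents whose polygon footprint contains the query point."""
    return [doc for doc, poly in footprints.items() if point_in_polygon(point, poly)]

# Hypothetical document footprints (illustrative values only).
FOOTPRINTS = {
    "doc_city": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "doc_suburb": [(5, 5), (8, 5), (8, 8), (5, 8)],
}

print(point_query((2, 2), FOOTPRINTS))
```

Points lying exactly on an edge are not handled specially here; a production system would typically rely on a spatial index and a geometry library instead.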

Related Projects: SPIRIT (Spatially-Aware Information Retrieval on the Internet) - funded by the EC Fifth Framework Programme. Its aim is to improve search capabilities on the internet by using geographical and conceptual ontologies to model both the vocabulary and the spatial structure of places for purposes of IR. The ontology is envisioned as an extension to traditional gazetteers that also models related locations and helps rank hits based on geographic properties.

∙ ontologies that model geographical terminology;

∙ query expansion and relevance ranking procedures based on the geographical ontologies;

∙ machine learning techniques for the extraction of geographical context from web documents and for generating metadata providing spatial context;

∙ a multi-modal user interface providing textual input and interactive map feedback of the context of retrieved documents;

∙ spatial indices for web collections

Geo-Ontologies

Ontologies relating Geographical Terminology and Spatial Relationships

● Reference to a geographic place: <PL-Name, PL-Type, {(x,y)}>

○ e.g. <Charminar, Monument, {(x,y)}>

● Relative place reference: <Spatial Relationship, PL-Name, Type, PL-FP>

○ e.g. <In, Hyderabad, City, {(x,y)}>

A Query to SPIRIT will contain one or more references to a PL-REF

Geographic content is a set of <Place reference> expressions and the Geometric Footprint is a function of this set.

Basically, Geo-Ontologies can be applied in:

1) User's query interpretation: (+ domain-specific ontologies) for disambiguation of place names

2) System query formulation: to generate alternate names and spatially associated names

3) Metadata extraction: to extract information from free-text documents to generate footprint(s)

4) Relevance ranking: potential for geographical relevance ranking (Dominos Pizza? :) )

Geo-Ontologies

Ontology: "formal, explicit specification of a shared conceptualisation"

Geo-Ontologies

● Types of Atomic Queries:

○ A place name

○ An aspatial entity with relation to a place name

○ An aspatial entity with a spatial relation to a place name

○ A place name with spatial relation to a place name

○ A place type with spatial relation to a place name

○ A place type with spatial relation to a place type

● Geo Ontology = Geographic Feature Ontology + Geographic Type Ontology + Spatial Relation Ontology

User evaluation of the SPIRIT prototype gave results consistent with SPIRIT's priorities on innovative features. Yet users reported a feeling of frustration, which highlights that their requirements go beyond what SPIRIT achieved and that there is still more work to be done in this area.

The last publication on the website dates back to 2005.

Relevance

In Information Retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.

Geographic Information Retrieval is concerned with retrieving documents in response to a spatially related query. Thus, the ranking of documents by both textual and spatial relevance has to be considered.

The most common way to return a set of documents obtained from a Web query is by a ranked list. The search engine attempts to determine which document seems to be the most relevant to the user and will put it first in the list. In short, every document receives a score, or distance to the query, and the returned documents are sorted by this score or distance.

There are situations where the sorting by score may not be the most useful one. When a more complex query is done, composed of more than one query term or aspect, documents can also be returned with two or more scores instead of one.

For example, the Web search could be for campsites in the neighborhood of Neuschwanstein, and the documents returned ideally have a score for the query term “camping” and a score for the proximity to Neuschwanstein. This implies that a Web document resulting from this query can be mapped to a point in the 2-dimensional plane, where both axes represent a score. Such a plot places campsites near the castle Neuschwanstein, which is situated close to Schwangau, with the distance to the castle on the x-axis and the rating on the y-axis.

Another weakness of our methods lies in the way we treat multiple-footprint documents. While we assume that a query can have only one footprint (a user is interested in only one location), documents may have multiple footprints (refer to more than one location).

The method we followed so far in order to calculate the spatial score considers only the best-matching document footprint. For example, if a user is looking for “airports near London”, a document that refers to both “Gatwick” and “Stansted” is scored as referring only to “Gatwick”, since it is the nearer airport of the two. Such a document, however, should be scored higher than another that refers only to “Gatwick”, since it provides more relevant information. Another consideration is the number of footprints occurring: Gatwick’s official web pages should be more important than a web list of all airports in the UK.

For high-quality ranking two things are required. Firstly, we need a good spatial score between query and document footprints. Secondly, we need a good combination of the spatial and textual (BM25) scores. For finding spatial scores, the spatial relationships (distance, containment, and direction) were converted into numeric values that indicate how close, how much inside, or how much North-of the relationship between two objects is. Those numeric values were first attempts at obtaining a score to quantify spatial relationships.

However, certain issues do come up in this method. For example, let us assume three cities, A, B, and C, where A lies at equal distance (in a Euclidean sense) from B and C. If C is bigger than B, then the score of B being close to A should be lower than that of C being close to A. In other words, the distance scores of cities around A may depend on the context, i.e. which other cities are around A. Also, natural barriers can influence the concept of proximity. It matters a lot whether a distance of 10 km (as the crow flies) can be covered by a direct road, or requires a large detour around a mountain range (or a small road over a mountain pass).

In traditional information retrieval, the separate scores of each document would be combined into a single score (e.g., by a weighted sum or product) which produces the ranked list by sorting. Now, we are going to incorporate two pieces of information into the way that a spatial document score is calculated:

• The number n of unique footprints in a document.

• The frequencies f_1,…, f_n of occurrence of the footprints in the document.

Moreover, the total spatial score of a document will be derived from fractional score contributions of all occurring document footprints.

A simple way of taking into account all document footprints is to define the total spatial score as a linear combination (e.g. the simple average) of the individual scores of the footprints: S = 1/n * (s_1+…+s_n)

where s_i is the score of the ith document footprint with respect to the query footprint. To also incorporate the frequencies of occurrence f_i, let us define the weight of a footprint:

tf_i = 1 + log (f_i). A footprint that occurs in the document only once will get a weight of one, while any extra occurrences will increase the weight logarithmically. The total score may be calculated as

S = 1/(tf_1+…+tf_n) * (tf_1*s_1+…+tf_n*s_n),

that is the weighted average of the individual scores.

Considering again the example about “airports near London”, a scoring function like the last one would score Gatwick’s official web page higher than a web list of all UK airports. Moreover, it takes into account more than the best-matched document footprint. The last formula may serve as a starting point for improving the spatial scoring function.
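The two scoring formulas can be tried out with a short sketch. The per-footprint scores and frequencies below are made-up values, the function names are hypothetical, and the natural logarithm is assumed since the base of the log is not specified.

```python
import math
from typing import Sequence

def average_spatial_score(scores: Sequence[float]) -> float:
    """S = 1/n * (s_1 + ... + s_n)."""
    return sum(scores) / len(scores)

def weighted_spatial_score(scores: Sequence[float], freqs: Sequence[int]) -> float:
    """S = (tf_1*s_1 + ... + tf_n*s_n) / (tf_1 + ... + tf_n), with tf_i = 1 + log(f_i)."""
    if not scores or len(scores) != len(freqs):
        raise ValueError("need one frequency per footprint score")
    weights = [1.0 + math.log(f) for f in freqs]  # natural log is an assumption
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical documents for the query "airports near London".
# Gatwick's own page: a single footprint, mentioned many times.
gatwick_page = weighted_spatial_score(scores=[0.9], freqs=[12])
# A list of UK airports: many footprints, each mentioned once, most far from London.
airport_list = weighted_spatial_score(scores=[0.9, 0.8, 0.2, 0.1], freqs=[1, 1, 1, 1])

print("Gatwick page:", gatwick_page)
print("airport list:", airport_list)
```

With every frequency equal to one, all weights are one and the weighted score reduces to the simple average, which is why the airport list is pulled down by its distant footprints.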

Evaluation:

2 Indicators:

1) Recall = No. of relevant docs returned / Total no. of relevant docs

2) Precision = No. of relevant docs returned / Total no. of docs returned

Trec has been evaluated using the ISO 9241 standard, based on Effectiveness (can users find relevant docs?), Efficiency (resources consumed per result) and Satisfaction (user feedback).
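A minimal sketch of the two indicators, using the standard definitions (recall relative to all relevant documents, precision relative to all returned documents). The document identifiers are hypothetical.

```python
from typing import Iterable, Tuple

def precision_recall(returned: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query."""
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = len(returned_set & relevant_set)  # relevant documents that were returned
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 3 relevant documents exist in total.
precision, recall = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print("precision:", precision, "recall:", recall)
```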

Gazetteer Server and Service for UK Academia - James Reid

Gazetteer: a geographical dictionary or directory. It serves as a reference for information about places.

● Geographic searching is a powerful information retrieval tool, because the results obtained are more specific.

● Geographic searching is restricted because creating geographic metadata is very resource intensive, and for many resources the geographic metadata exists only as names.

● The geographic footprint is often not mentioned directly; there might be only a direct or indirect reference to the place.

Constant change in Geographic metadata:

● Names of places may vary.

● Names may have changed from time to time.

● Boundaries can be fuzzy.

● Names are spoken in some context.

GeoXwalk is a comprehensive gazetteer linking a vocabulary of current and historical geographical names to a standard spatial coding scheme (longitude, latitude). Technically, GeoXwalk has three basic components:

● Gazetteer database to support spatial searches.

● Middleware components to issue spatial/aspatial queries.

● Geo-parser to parse non-geographically-indexed documents for place names that can serve as references to them.

Gazetteer database

Each geographical feature must include:

● Feature name.

● Feature type.

● Geometry (spatial footprints).

Places can be marked out better using polygons than points. Explicit relationships between features can also be defined, which is particularly useful when the gazetteer holds a significant amount of historical data for which geometries don't exist.

Middleware components:

Protocols supported by geoXwalk are:

● ADL Gazetteer protocol

● OGC filter encoding implementation.

These are used to translate XML queries into database-specific SQL queries.

GeoParser

Most existing data and metadata have some sort of geo-reference, but not in a format that allows them to be spatially searched easily.

One associated task is how non-spatially-referenced documents could be spatially indexed. This could be done using a gazetteer as reference. A prototype geo-parser has been implemented that semi-automatically identifies place names in a document and extracts a suitable spatial footprint. Its rule-based approach takes into account the structure and context in which words occur.

One issue faced by GeoXwalk is map conflation, i.e. detecting duplicate entries, such as a place that is named differently in different languages but has the same geographic footprint.
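The idea of geo-parsing against a gazetteer can be sketched with a simple dictionary lookup. This is not GeoXwalk's rule-based parser: the toy gazetteer, the function names and the single-word matching are assumptions for illustration, and the coordinates are approximate.

```python
import re
from typing import Dict, List, Set, Tuple

Coords = Tuple[float, float]  # (longitude, latitude)

# Toy gazetteer: place name -> (feature type, approximate point footprint).
# Old and new names of the same city share one footprint.
GAZETTEER: Dict[str, Tuple[str, Coords]] = {
    "bengaluru": ("city", (77.59, 12.97)),
    "bangalore": ("city", (77.59, 12.97)),
    "kolkata": ("city", (88.36, 22.57)),
    "calcutta": ("city", (88.36, 22.57)),
}

def geoparse(text: str) -> List[Tuple[str, str, Coords]]:
    """Return (name as written, feature type, footprint) for every known place name."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        entry = GAZETTEER.get(token.lower())
        if entry is not None:
            feature_type, coords = entry
            found.append((token, feature_type, coords))
    return found

def distinct_footprints(matches: List[Tuple[str, str, Coords]]) -> Set[Coords]:
    """Collapse name variants that point to the same footprint."""
    return {coords for _, _, coords in matches}

matches = geoparse("The train from Calcutta reached Bangalore; Bengaluru was rainy.")
print(matches)
print(distinct_footprints(matches))
```

The three name occurrences collapse to two footprints, which is a very simple form of the duplicate detection described above. A real geo-parser would also need to handle multi-word names and the ambiguity of place names.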

Related Projects: GeoVSM

Geographic Vector Space Model: The project integrates coordinate-based geographic indexing with the keyword-based vector space model in representing the information space. Relevance measures are based on both geographic measures and thematic measures, which can be combined into a single measure.

Vector Space Model: One of the most popular models of document space developed in text-based information retrieval research. It is an algebraic model for representing text or graphical documents (and any objects, in general) as vectors of identifiers. Using a vector space model, the content of each geographic document can be approximately described by a vector of (content-bearing) terms, which are a combination of thematic subjects and place names.

● Documents and queries are represented as vectors. Each dimension corresponds to a separate term

An information retrieval system stores a representation of a document collection using a document-by-term matrix, where the element at position (i, j) corresponds to the frequency of occurrence of term j in the ith document. In the vector space model, all the objects (terms, documents, queries, concepts, etc.) can be similarly represented as vectors.

● Vector space model is well accepted as an effective approach in modelling thematic subspace, and it allows geographic information to be handled the same way as non-geographical information.
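The document-by-term matrix described above can be built in a few lines. This is a minimal sketch with made-up documents and whitespace tokenization; the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def build_matrix(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix) where matrix[i][j] is the frequency of term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical documents mixing thematic terms and a place name.
DOCS = [
    "camping near pittsburgh",
    "pittsburgh hotels pittsburgh airport",
]

vocab, matrix = build_matrix(DOCS)
print(vocab)
for row in matrix:
    print(row)
```

Note that the place name is just one more column of the matrix, on the same footing as the thematic terms.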

However, the vector space model has some serious problems when used for modeling the geographic subspace.

The geographic space is inherently continuous and cannot be adequately approximated using a set of place names (which are discrete in nature). If a document mentions four place names (Pittsburgh, Philadelphia, Harrisburg, and Hagerstown), the four place names will be treated as four independent dimensions in a vector space model, whereas in fact they are points (or regions) in a two-dimensional geographic space.

Additional concerns about using locational terms as geographic indexes include ambiguity in meaning, non-unique place names, place names that change over time, and spelling variations.

Geographical Model

● Geographical model of document space is capable of processing arbitrarily complex spatial queries.

● The most common spatial queries are believed to be of three types:

1. Point query: Return the geometric object that contains a given query point.

2. Region query: Given a region R, find all objects in the collection that intersect R.

3. Buffer zone: A buffer query involves two spatial data sets and a distance d. The answer to this query is the set of pairs of objects, one from each input set, that are within distance d of each other, e.g. “find house-power line pairs that are within 50 meters of each other.”

● Spatial indexing based on coordinates generates persistent indexes for documents, since it is well defined and is immune from any changes in place names, political boundaries, and linguistic variations
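Region and buffer queries can be sketched as follows. To keep the example short, object footprints are assumed to be bounding boxes for the region query and single points for the buffer query, with planar coordinates in meters; all names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def boxes_intersect(a: Box, b: Box) -> bool:
    """Two axis-aligned boxes intersect if they overlap on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def region_query(region: Box, objects: Dict[str, Box]) -> List[str]:
    """Find all objects whose bounding box intersects region R."""
    return [name for name, box in objects.items() if boxes_intersect(region, box)]

def buffer_query(set_a: Dict[str, Point], set_b: Dict[str, Point], d: float) -> List[Tuple[str, str]]:
    """Find pairs (one object from each set) that are within distance d of each other."""
    return [
        (name_a, name_b)
        for name_a, pa in set_a.items()
        for name_b, pb in set_b.items()
        if math.dist(pa, pb) <= d
    ]

# Hypothetical data (illustrative values only).
OBJECTS = {"park": (0, 0, 2, 2), "lake": (5, 5, 6, 6)}
HOUSES = {"h1": (0, 0), "h2": (200, 0)}
POWER_LINES = {"p1": (30, 40)}  # a power line reduced to one point for simplicity

print(region_query((1, 1, 3, 3), OBJECTS))
print(buffer_query(HOUSES, POWER_LINES, 50))
```

The nested loop in the buffer query compares every pair, which is fine for a sketch; real systems use spatial indexes to avoid the full comparison.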

VSM / Geographical model (contd..)

● Disadvantages of using the Geographical model in retrieving geographical information

- A considerable amount of geographical information exists in textual form that is not easily integrated into the geographical model for mapping and spatial analysis, due to the difficulties of natural language understanding for geo-referencing text.

GeoVSM

● Model obtained by combining the advantages of both the geographical model and vector space model.

● Each document will be indexed both by footprint (in geographical coordinate space) and by a term vector (in vector space).

● Geographical indexes will only represent the geographical scope of the document, and term vectors will only represent thematic scope of documents

Assume that any document has a limited geographic scope, GS_d, and a thematic scope, TS_d. Similarly, a query on a document collection also has a geographic scope, GS_q, and a thematic scope, TS_q. The degree of relevance of a document to a query can be determined by the following measure:

Rel(d, q) = ƒ(Sim_G(GS_d, GS_q), Sim_T(TS_d, TS_q))    (1)

where Sim_G(•) measures the similarity (i.e., the degree of overlapping) between the geographic scopes of the document and the query; Sim_T(•) measures the degree of overlapping between the thematic scopes of the document and the query; and ƒ(•) is a function for combining relevance measures of geographic dimensions and thematic dimensions.
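One possible instantiation of measure (1) is sketched below. The concrete choices are assumptions and are not given in the source: Sim_G is taken as the overlap area of two bounding boxes divided by the area of their union, Sim_T as the cosine of two term vectors, and ƒ as a weighted sum with a hypothetical parameter alpha.

```python
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def sim_g(a: Box, b: Box) -> float:
    """Geographic similarity: overlap area divided by union area (assumed measure)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(width, 0.0) * max(height, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sim_t(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Thematic similarity: cosine of two sparse term vectors (assumed measure)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def rel(gs_d: Box, ts_d: Dict[str, float], gs_q: Box, ts_q: Dict[str, float], alpha: float = 0.5) -> float:
    """Rel(d, q) = f(Sim_G, Sim_T), with f assumed to be a weighted sum."""
    return alpha * sim_g(gs_d, gs_q) + (1.0 - alpha) * sim_t(ts_d, ts_q)

# Hypothetical document and query scopes.
doc_box, doc_terms = (0.0, 0.0, 2.0, 2.0), {"camping": 2.0, "lake": 1.0}
query_box, query_terms = (1.0, 1.0, 3.0, 3.0), {"camping": 1.0}

print("Sim_G:", sim_g(doc_box, query_box))
print("Sim_T:", sim_t(doc_terms, query_terms))
print("Rel:", rel(doc_box, doc_terms, query_box, query_terms))
```

Changing alpha shifts the balance between the geographic and the thematic dimension; other combining functions, such as a product, would fit measure (1) equally well.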

References

* GeoVSM: An Integrated Retrieval Model for Geographic Information, Guoray Cai, School of Information Sciences and Technology, The Pennsylvania State University, 002K Thomas Building, University Park, PA 16802

* http://www.geo-spirit.org/public_deliverables.html

* http://www.geo-spirit.org/publications/SPIRIT_WP5_D17_5201_final.pdf

* http://www.geo-spirit.org/publications/SPIRIT_DeliverableD18_5302_final.pdf

* http://www.geo-spirit.org/publications/GIR_distrib_ranking.pdf

* Distributed Ranking Methods for Geographic Information Retrieval, by Marc van Kreveld, Iris Reinbacher, Avi Arampatzis, and Roelof van Zwol
