Geographical Information Retrieval Behrooz Rasuli Iranian Research Inst. For Information Science & Technol. [email protected] GIR

Geographic Information Retrieval (GIR)


DESCRIPTION

Geographic Information Retrieval (GIR) is a branch of Information Retrieval that has become an important field in Library and Information Science, GIS, Information Systems, and Computer Science. This presentation covers different models and methods for GIR, key issues in the user interface design of GIR systems, and related topics.


Page 1: Geographic Information Retrieval (GIR)

Geographical Information Retrieval

Behrooz Rasuli
Iranian Research Inst. For Information Science & Technol.
[email protected]

GIR

Page 2: Geographic Information Retrieval (GIR)

Address information is essential to people's daily lives. People often need to look up the addresses of unfamiliar locations on the Web and then use map services to mark the location and get directions. Although both address information and map services are available online, they are not well integrated.

Introduction . . .

Page 3: Geographic Information Retrieval (GIR)

General search engines are widely used to retrieve Web pages

Specialized search engines are dedicated to finding either particular types of resources or Web pages based on different criteria, e.g. language or geographic location

People use search engines to find Web pages of local services and events around them or in a particular area

Introduction . . .

Page 4: Geographic Information Retrieval (GIR)

Spatial data, geospatial data or geographic information is the data pertaining to the location of geographical entities together with their spatial dimensions

Location could be defined as “a place on the Internet where an Internet resource, such as a Web page, is stored”

Page 5: Geographic Information Retrieval (GIR)

Source Geography
◦ physical location of hosts
◦ signal processing and network-based techniques

Target Geography
◦ uses elements contained in the page to deduce locations (place names, postal addresses, and phone numbers)
◦ Challenge: involves evidence extraction, semantic analysis, and interpretation, in order to link Web pages to geographic locations

Geographic aspects of the Web can be explored using two approaches

Page 6: Geographic Information Retrieval (GIR)

Geographic Information Retrieval (GIR) is an applied research field that involves indexing, searching, retrieving, and browsing georeferenced information sources, and designing systems to execute these tasks effectively and efficiently

Like IR, GIR includes indexing, storage and ranking

GIR

Page 7: Geographic Information Retrieval (GIR)

Pattern extraction from raw text has already been done. For example, M. Hearst (1990s) developed an approach for discovering lexico-syntactic patterns for hypernyms

GIR History

Page 8: Geographic Information Retrieval (GIR)

Pattern-Based Methods
◦ Named Entity Recognition (NER)
◦ Gazetteer approach (Web-a-Where)
◦ Pattern-based method

Ontology-Based Methods
◦ OnLocus

Machine Learning Methods

GIR Methods

Page 9: Geographic Information Retrieval (GIR)

Pattern-Based Method

Page 10: Geographic Information Retrieval (GIR)

Few geographic search engines have been developed commercially; among them, Google Maps and Yahoo Local are notable

ambiguous dynamic nature of location names, various addressing styles, lack of geographic information, and multiple locations related to a Web resource

Page 11: Geographic Information Retrieval (GIR)

extract proper names from texts and documents

an algorithm that distinguishes six classes of location names: CITY, REGION, COUNTRY, ISLAND, RIVER, and MOUNTAIN

method is time-consuming and is not useful for real-time search

Named Entity Recognition (NER)

Page 12: Geographic Information Retrieval (GIR)

tagging individual place names (geotagger)
◦ finds and disambiguates geographic names (assigning a canonical taxonomy node to each phrase in the text)

1. Spotting
2. Disambiguation
3. Focus determination

crawling the Web, storing the resulting pages and indexing their contents

Gazetteer approach
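
The three gazetteer steps (spotting, disambiguation, focus determination) can be sketched roughly as follows. This is a minimal illustration, not the Web-a-Where implementation: the toy `GAZETTEER`, the two-level (place, country) taxonomy, and the voting rule used for disambiguation and focus are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy gazetteer: lower-cased name -> candidate (place, country) nodes.
GAZETTEER = {
    "paris": [("Paris", "France"), ("Paris", "United States")],
    "lyon": [("Lyon", "France")],
    "marseille": [("Marseille", "France")],
}

def spot(text):
    """Spotting: return the tokens that match a gazetteer entry."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t in GAZETTEER]

def disambiguate(names):
    """Disambiguation (assumed rule): prefer the candidate whose country is
    supported by the unambiguous names found in the same text."""
    context = Counter()
    for name in names:
        candidates = GAZETTEER[name]
        if len(candidates) == 1:
            context[candidates[0][1]] += 1
    return [max(GAZETTEER[name], key=lambda node: context[node[1]]) for name in names]

def focus(nodes):
    """Focus determination (assumed rule): the most frequent country on the page."""
    return Counter(country for _, country in nodes).most_common(1)[0][0]

names = spot("Flights from Paris to Lyon and Marseille.")
nodes = disambiguate(names)
page_focus = focus(nodes)
```

Here "Paris" is ambiguous on its own, and the unambiguous French cities in the same text tip the choice toward the French node.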

Page 13: Geographic Information Retrieval (GIR)

Basically, a geographic search engine must be able to find related addresses and location names and assign them to Web pages

Current address extraction techniques basically require large gazetteers which are expensive and unavailable for many countries

different markup styles e.g. HTML, XML and DOM

natural language processing models are not able to extract all addresses and location names from Web page contents

Page 14: Geographic Information Retrieval (GIR)

Different ways of mentioning an address in a Web page

Page 15: Geographic Information Retrieval (GIR)

large scale gazetteers

pattern-based model which uses HTML and visual segmentations to improve address extraction on Web pages

divide an address into its semantic components

automatic

much human effort

new location names

Page 16: Geographic Information Retrieval (GIR)

The proposed address extraction system consists of five components: HTML Pre-Processor, Parser, Knowledge Searcher, Decision Maker, and Knowledge Accumulator

Page 17: Geographic Information Retrieval (GIR)

analyze HTML tags and codes;
convert HTML files to XML (by employing the VIPS Demo software);
analyze and traverse the XML in depth to obtain content information;
sort the content items in a linear sequence together with their node numbers;
a node index is built

1. HTML Pre-processor
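
A rough sketch of the pre-processing idea: walk the markup, collect the text content in document order together with a node number, and build a node index. The slides use the VIPS Demo software for the HTML-to-XML conversion; the sketch below swaps in Python's standard `html.parser` instead, and the numbering scheme (one number per opening tag) is an assumption made for illustration.

```python
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Collects (node number, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.node_number = 0   # assumed scheme: incremented at every opening tag
        self.sequence = []     # linear sequence of (node number, text)

    def handle_starttag(self, tag, attrs):
        self.node_number += 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.sequence.append((self.node_number, text))

def preprocess(html):
    """Return the linear text sequence and a node index (node number -> texts)."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    node_index = {}
    for number, text in collector.sequence:
        node_index.setdefault(number, []).append(text)
    return collector.sequence, node_index

sequence, node_index = preprocess(
    "<div><p>No. 10, William Street</p><p>Toowong, Queensland</p></div>"
)
```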

Page 18: Geographic Information Retrieval (GIR)

It tries to find all candidate phrases (potential addresses) in a node;

divides a potential address into its components;

Each segment obtained in this step is used as the default search unit of the Database Searcher;

2. XML Parser

Page 19: Geographic Information Retrieval (GIR)

itemizes the elements of a potential address;
finds all possibilities of a potential address and forms them into a list of possible patterns in three steps:
◦ Standardizing Word Formats (different spellings, abbreviations, synonyms)
◦ Knowledge-Base Place Name Matching (separates elements at a finer level)
◦ Ambiguity Eliminating (tries to match place names)

3. Knowledge Searcher
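
The first step, standardizing word formats, might look something like the sketch below. The abbreviation table is a hypothetical stand-in for whatever spelling, abbreviation, and synonym knowledge the real system holds, and lower-casing is an assumed normalization.

```python
# Hypothetical abbreviation/synonym table; a real system would load this from its knowledge base.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
    "qld": "queensland",
}

def standardize(phrase):
    """Lower-case each word, drop commas and trailing periods, expand known abbreviations."""
    words = phrase.replace(",", " ").split()
    normalized = [w.lower().rstrip(".") for w in words]
    return [ABBREVIATIONS.get(w, w) for w in normalized]

tokens = standardize("10 William St., Toowong, QLD")
```

After this step, differently written forms of the same address element compare equal, which is what the later knowledge-base matching relies on.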

Page 20: Geographic Information Retrieval (GIR)

decides whether a candidate phrase is an address or not, by matching it with address patterns already stored in a database;
◦ Eliminating ambiguities and conflicts of place names (syntactic and semantic: geo/non-geo and geo/geo);
◦ Itemizing each potential address into its elements;
◦ Adding the missing parts to the address based on a location tree wherever possible

For example, the address “No. 10, William Street, Toowong, Queensland” will be modified to “No. 10, William Street, Toowong, Brisbane, Queensland, Australia”

4. Decision Maker
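
The address-completion step can be illustrated with a small location tree. This is a sketch under assumptions: the `PARENT` mapping is a hypothetical child-to-parent encoding of the tree, and the rule (find the most specific known location, then write out its full chain of ancestors) is one simple way to fill in the missing parts, assuming the address lists locations from most to least specific.

```python
# Hypothetical location tree, stored as child -> parent.
PARENT = {
    "Toowong": "Brisbane",
    "Brisbane": "Queensland",
    "Queensland": "Australia",
}
KNOWN = set(PARENT) | set(PARENT.values())

def complete_address(parts):
    """Replace the location suffix of an address with the full ancestor chain."""
    for i, part in enumerate(parts):
        if part in KNOWN:
            chain = [part]
            while chain[-1] in PARENT:      # climb until the root of the tree
                chain.append(PARENT[chain[-1]])
            return parts[:i] + chain
    return list(parts)                      # nothing recognized: leave unchanged

completed = complete_address(["No. 10", "William Street", "Toowong", "Queensland"])
```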

Page 21: Geographic Information Retrieval (GIR)

the last component of the system; it operates in two aspects:
◦ Location Accumulation;
◦ Address Pattern Accumulation

5. Knowledge Accumulator

Page 22: Geographic Information Retrieval (GIR)

there are 9 lemmas in the KB;
3 lemmas have multiple identities (Victoria, Churchill, Howard Avenue);
The following algorithm shows how place names are detected in phrases:
◦ PW - a candidate phrase
◦ Wi - the ith word in PW
◦ f - any syntactic format of Wi
◦ KB - Knowledge-Base
◦ Ci - Result Collection

Example

Inputs

Page 23: Geographic Information Retrieval (GIR)

PW(pre word, Wi) {
    if ((pre word + f) = a place name found in KB)
        add (pre word + f) to Ci;
    if ((pre word + f) = part of a name in KB) {
        pre word = pre word + f;
        PW(pre word, Wi+1); // try next word in PW
    }
}
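
A runnable reading of the pseudocode above, as a sketch only. The `KB` contents are hypothetical, the syntactic variants f are reduced to a single lower-cased form of each word, and starting the recursion at every word of the phrase is an assumption about how PW is invoked.

```python
# Hypothetical knowledge base of place names (lower-cased).
KB = {"victoria", "victoria park", "howard avenue", "churchill"}

def is_part_of_name(candidate):
    """True if some KB name continues after this candidate prefix."""
    return any(name.startswith(candidate + " ") for name in KB)

def pw(pre_word, words, i, results):
    """Extend pre_word with words[i]; record matches and recurse while a longer name is possible."""
    if i >= len(words):
        return
    candidate = (pre_word + " " + words[i].lower()).strip()
    if candidate in KB:
        results.append(candidate)            # add (pre word + f) to Ci
    if is_part_of_name(candidate):
        pw(candidate, words, i + 1, results)  # try next word in PW

def detect_place_names(phrase):
    """Run PW from every word position and collect all matches."""
    words = phrase.split()
    results = []
    for start in range(len(words)):
        pw("", words, start, results)
    return results

found = detect_place_names("12 Howard Avenue Victoria Park")
```

Note that both "victoria" and "victoria park" are collected; choosing between such overlapping results is left to the ambiguity elimination step.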

Page 24: Geographic Information Retrieval (GIR)

SyntacticAE(Potential) {
    current word = first word in Potential;
    C = NULL; // initialize C
    while (current word != EOF) {
        C = SAE(C, current word); // add longest result to C
        current word = next new word in Potential;
    }
}

Syntactic Ambiguity Elimination
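
One way to read the "add longest result" comment is as a greedy longest-match scan: at each position keep only the longest knowledge-base name that starts there, then continue after it. The sketch below follows that reading; the `KB` contents are hypothetical and the greedy strategy is an assumption, not a confirmed detail of the original system.

```python
# Hypothetical knowledge base of place names (lower-cased).
KB = {"victoria", "victoria park", "howard avenue", "churchill"}
MAX_WORDS = max(len(name.split()) for name in KB)

def syntactic_ae(potential):
    """Scan the phrase left to right, keeping the longest KB match at each position."""
    words = potential.lower().split()
    chosen = []
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest window first so that "victoria park" beats "victoria".
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]) in KB:
                match_len = n
                break
        if match_len:
            chosen.append(" ".join(words[i:i + match_len]))
            i += match_len   # continue with the next new word
        else:
            i += 1
    return chosen

resolved = syntactic_ae("12 Howard Avenue Victoria Park")
```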

Page 25: Geographic Information Retrieval (GIR)

inconsistencies between accumulated knowledge in the KB and information extracted from the Web:
◦ misspelling and synonymy
◦ incompleteness of the KB

• Keeping the Conflict
• Removing Meaningless Conflict Element
• Finding Synonymous Sub-Tree
• Merging Synonymous Sub-Tree

Conflict Elimination

Page 26: Geographic Information Retrieval (GIR)

Ontology-Based Method

Page 27: Geographic Information Retrieval (GIR)

Direct references
◦ place names, complete postal addresses

Indirect references
◦ postal codes and telephone area codes, or expressions that indicate relationships to other places which are directly referenced (for instance, “The hotel is two blocks from Times Square”)

References to places in Web pages

Page 28: Geographic Information Retrieval (GIR)

proposes a three-phase process for recognizing geographic evidence in Web pages:
◦ Extraction (selecting relevant Web content),
◦ Recognition (isolating references to places embedded in text, including dealing with ambiguity),
◦ Location (obtaining locations from the place descriptions previously recognized, using positioning data from gazetteers or from spatial databases)

OnLocus

Page 29: Geographic Information Retrieval (GIR)

an extraction ontology is able to identify objects and relationships;

ontology must describe rules for identifying elements within its domain that are present in Web pages

extraction

Page 30: Geographic Information Retrieval (GIR)

recognition of terms and expressions as place names;
◦ compared to a gazetteer: Alexandria and GeoNames

recognition

Page 31: Geographic Information Retrieval (GIR)
Page 32: Geographic Information Retrieval (GIR)

tries to determine an actual location from a gazetteer or by performing a process known as geocoding

Location of direct references
◦ matching and locating

Location of indirect references
◦ Formal: establish a correspondence between a code and the area it serves (supported by spatial databases)
◦ Informal: natural language interpretation is required

Location
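
For formal indirect references, the code-to-area correspondence can be sketched as a lookup. In the example below, the `AREA_CODES` table is a tiny illustrative sample standing in for a spatial database, and the phone-number pattern covers only one North American notation; both are assumptions made for the example.

```python
import re

# Small illustrative table; a real system would query a spatial database instead.
AREA_CODES = {
    "212": "New York, NY",
    "617": "Boston, MA",
}

# Matches numbers written like "(212) 555-0100" and captures the area code.
PHONE = re.compile(r"\((\d{3})\)\s*\d{3}-\d{4}")

def locate_indirect(text):
    """Return the areas served by the telephone area codes found in the text."""
    return [AREA_CODES[code] for code in PHONE.findall(text) if code in AREA_CODES]

areas = locate_indirect("Call (212) 555-0100 for reservations.")
```

Informal references such as "two blocks from Times Square" cannot be handled by a table like this, since they require natural language interpretation.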

Page 33: Geographic Information Retrieval (GIR)
Page 34: Geographic Information Retrieval (GIR)

Machine Learning Methods

Page 35: Geographic Information Retrieval (GIR)

apply Text Mining procedures to the Internet in order to classify places into different location types (e.g., Maebashi is a CITY, Honshu is an ISLAND) and to determine, for a given place name, where the place is (e.g. Maebashi is in Japan, Honshu is in the Pacific Ocean);

acquire exhaustive fine-grained gazetteers automatically and thus avoid hand-coding;

distinguish 6 location types (CITY, REGION, COUNTRY, ISLAND, RIVER, MOUNTAIN)

Machine Learning Method

Page 36: Geographic Information Retrieval (GIR)

dataset consists of 1260 names of locations

For each class, a set of patterns was constructed
◦ patterns have the form “KEYWORD+of+X” and “X+KEYWORD” (Alta Vista counts)

Each class has from 3 (ISLAND) up to 10 (MOUNTAIN) different keywords

Keywords and patterns were selected manually

Algorithm

Page 37: Geographic Information Retrieval (GIR)

For example, for the class CITY use 4 keywords (“city”, “town”, “mayor”, “streets”) and 7 corresponding patterns (“city+of+X”, “X+city”, “town+of+X”, “mayor+of+X”, “X+mayor”, “streets+of+X”, and “X+streets”)

Example
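
The CITY patterns above can be turned into search queries for a given place name, with the hit count of each query serving as one feature. In the sketch below, `hit_count` is a hypothetical callable standing in for a search-engine count lookup (the slides mention Alta Vista counts); how the counts are then fed to a learner is not specified in the slides and is left out.

```python
# The seven CITY patterns listed on the slide; "X" marks the place-name slot.
CITY_PATTERNS = [
    "city+of+X", "X+city", "town+of+X", "mayor+of+X",
    "X+mayor", "streets+of+X", "X+streets",
]

def build_queries(place, patterns):
    """Fill the place name into each pattern and turn '+' into spaces."""
    return [p.replace("+", " ").replace("X", place) for p in patterns]

def count_features(place, patterns, hit_count):
    """One feature per pattern: the number of hits reported for its query.

    hit_count is a hypothetical function: query string -> integer count.
    """
    return [hit_count(query) for query in build_queries(place, patterns)]

# Usage with made-up counts in place of a real search engine.
fake_counts = {"city of Maebashi": 120, "Maebashi city": 340}
features = count_features("Maebashi", CITY_PATTERNS, lambda q: fake_counts.get(q, 0))
```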

Page 38: Geographic Information Retrieval (GIR)

Thank You!

Presented in Information Retrieval Course, under supervision of Dr. Saeid Asadi