going further together
Information Search & Retrieval: Problems, solutions, trends…
Tony Rose, PhD MBCS CEng
Vice-Chair, BCS IRSG
Contents
The BCS Information Retrieval SG
What is IR anyway?
How search engines work
Why search is hard
Where’s it all going?
Information Retrieval SG
Growing rapidly
– 750+ members
Annual conference (ECIR)
– FDIA
Various 1-day events
– Search Solutions
Informer
Discounts for various events, e.g. SIGIR
… is free to join!
Information Retrieval SG
Traditional focus on search (text retrieval)
– Knowledge management, Multimedia retrieval, User experience, Information visualisation, extraction, summarisation, etc.
Latest issue of Informer:
– “Searching for the Music You Like”
– “Exploring Maps through Geo-referenced Images and RDF Shared Metadata”
– “Using Semantic Relations to improve Question Answering”
– “Modeling & Annotation of Dance Media Semantics”
What is IR?
“Science of searching for:
– information in documents
– documents themselves
– metadata which describe documents,
– within databases
…whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web”
The Need for IR
In a word … Infoglut
800 MB of recorded information is produced per person per year [Computing magazine]
Up to 80% of corporate information is unstructured
– Documents, emails, images, voicemail, etc.
So … can’t we just use Google?
How do Search Engines Work?
On the surface:
1. Understand what the user wants
2. Find documents about that topic
In reality:
1. Count words
2. Apply a simple equation
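The "count words, apply a simple equation" view can be sketched in a few lines. This is a minimal illustration using a basic tf-idf sum, which is one common choice and not necessarily the equation the slide has in mind; the function names and the three sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real engine does far more."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic tf-idf sum."""
    # Step 1: count words in every document.
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    scores = []
    for counts in doc_counts:
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_counts if term in c)
            if df == 0:
                continue
            # Step 2: a simple equation. Rarer terms get a higher weight.
            idf = math.log(n_docs / df)
            score += counts[term] * idf
        scores.append(score)
    return scores


# Hypothetical mini-collection, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "search engines count words",
]
scores = tf_idf_scores("cat mat", docs)
```

The first document contains both query terms, the second only the more common one, and the third neither, so the scores fall in that order.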
How do Search Engines Work?
1. Measure the conceptual distance between your query and each document in the DB
2. Return the best matches
[Source: Maristella Agosti, University of Padova]
The Central Problem in IR
Information Seeker: Concepts → Query Terms
Author: Concepts → Document Terms
Do these represent the same concepts?
[Source: Jimmy Lin, University of Maryland]
The Central Problem in IR
How do you represent the concepts?
– Documents and queries = “bag of words”
• Unordered set of terms + numeric weights
How do you calculate similarity?
– Set theory (e.g. Boolean)
– Algebraic (e.g. vector space)
– Probabilistic
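To make two of these similarity families concrete, here is a minimal sketch of a Boolean (set-theoretic) match and a vector-space (cosine) score over bags of words. The term weights are raw counts, which is an assumption made for brevity; the function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Unordered terms with numeric weights (here: raw counts)."""
    return Counter(text.lower().split())


def boolean_and_match(query, doc):
    """Set-theoretic model: match only if the document has every query term."""
    return set(bag_of_words(query)) <= set(bag_of_words(doc))


def cosine_similarity(query, doc):
    """Algebraic model: cosine of the angle between the two term vectors."""
    q, d = bag_of_words(query), bag_of_words(doc)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


doc = "information retrieval is the science of searching"
exact = boolean_and_match("information retrieval", doc)    # every term present
partial = boolean_and_match("music retrieval", doc)        # "music" is missing
graded = cosine_similarity("music retrieval", doc)         # partial credit
```

The Boolean model gives a yes/no answer, so a query with one missing term fails outright, while the vector-space model still returns a graded, non-zero score.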
IR models
[Source: Wikipedia]
How do we Evaluate Search?
Assume that results are either relevant or non-relevant
Precision:
– Proportion of retrieved documents that are relevant
Recall:
– Proportion of known-relevant documents that were actually retrieved
But what about: indexing / retrieval speed, query language, user experience, etc?
[Figure: relevant vs. retrieved documents]
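The two measures follow directly from the sets of retrieved and relevant documents. A small sketch, with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five known-relevant documents were found (recall 0.4).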
Why Search is Hard
Document representation
– Keywords are not enough
• Blind Venetian = Venetian Blind
– Terms are not independent
• Structural & discourse dependencies, co-references, etc.
Imperfect “stop lists”
– the, and, of…
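Both problems show up as soon as text is reduced to index terms. In this sketch the stop list is a small hypothetical one, chosen only to illustrate the effect:

```python
# Illustrative stop list; real lists are longer and vary between systems.
STOP_WORDS = {"the", "and", "of", "to", "be", "or", "not", "a", "in"}


def index_terms(text):
    """Reduce text to a sorted list of lowercase terms, minus stop words."""
    return sorted(t for t in text.lower().split() if t not in STOP_WORDS)


# Word order is lost, so two different meanings get the same representation.
same_terms = index_terms("Blind Venetian") == index_terms("Venetian blind")

# A phrase made entirely of stop words disappears from the index.
hamlet = index_terms("To be or not to be")
```

After indexing, "Blind Venetian" and "Venetian blind" are indistinguishable, and "To be or not to be" leaves no terms at all.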
Why Search is Hard
Morphological relationships
– Computer, computing, compute, computed…
Index documents using word stems
– False positives:
• organization, organ → organ
• police, policy → polic
• arm, army → arm
– False negatives:
• cylinder, cylindrical
• create, creation
• Europe, European
– Prefixes are particularly difficult
• Un*, dis*
• Delegate = de-leg-ate
• Ratify = rat-ify
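A toy suffix-stripping stemmer is enough to see both kinds of error. This is not the stemmer behind the slide's examples: the suffix list and the minimum stem length are assumptions chosen for illustration, and a production stemmer uses far more careful rules.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ["ization", "ation", "ical", "ing", "ed", "er", "e", "y"]
MIN_STEM_LENGTH = 3  # assumed guard against over-stripping short words


def naive_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Intended behaviour: related forms collapse to one stem.
family = {naive_stem(w) for w in ["computer", "computing", "compute", "computed"]}

# False positives: unrelated words end up with the same stem.
false_positives = [
    (naive_stem("organization"), naive_stem("organ")),
    (naive_stem("police"), naive_stem("policy")),
    (naive_stem("arm"), naive_stem("army")),
]

# False negatives: related words end up with different stems.
false_negatives = [
    (naive_stem("cylinder"), naive_stem("cylindrical")),
    (naive_stem("create"), naive_stem("creation")),
    (naive_stem("Europe"), naive_stem("European")),
]
```

With these rules, each pair in `false_positives` shares a stem and each pair in `false_negatives` does not, which mirrors the trade-off on the slide: more aggressive stripping conflates more unrelated words, and less aggressive stripping misses more related ones.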
Why Search is Hard
Named entity recognition
– Companies in New York
– New companies in York
NEs are highly discriminatory
– People
– Places
– Organisations
Many vertical applications
– e.g. bioscience
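The two example queries contain exactly the same words, so a bag-of-words index cannot separate them. A rough sketch of how recognising place names helps, using a tiny hypothetical gazetteer and a longest-match lookup (real named entity recognisers are considerably more sophisticated):

```python
# Hypothetical gazetteer of place names, for illustration only.
PLACES = {"new york", "york"}


def find_places(text):
    """Scan left to right, preferring a two-word place over a one-word place."""
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        pair = tokens[i:i + 2]
        if len(pair) == 2 and " ".join(pair) in PLACES:
            found.append(" ".join(pair))
            i += 2
        elif tokens[i] in PLACES:
            found.append(tokens[i])
            i += 1
        else:
            i += 1
    return found


q1 = "Companies in New York"
q2 = "New companies in York"
same_bag = sorted(q1.lower().split()) == sorted(q2.lower().split())
places_q1 = find_places(q1)
places_q2 = find_places(q2)
```

The bags of words are identical, yet the first query resolves to "new york" and the second to "york".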
Why Search is Hard
Semantic relationships
– Car = automobile
– Buy = purchase
– Sick = ill
Synonym rings
– Car, automobile, truck, bus, taxi...
– Appropriate level of abstraction depends on user & task
Development of subject-specific taxonomies
– “concept matching”
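One simple way to use synonym rings is query expansion: every query term is replaced by its whole ring before matching. A minimal sketch, using only the strict synonym pairs from the slide (the function name is illustrative):

```python
# Each ring is a set of terms treated as interchangeable.
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]


def expand_query(query):
    """Return the query terms plus every term sharing a ring with one of them."""
    expanded = set()
    for term in query.lower().split():
        expanded.add(term)
        for ring in SYNONYM_RINGS:
            if term in ring:
                expanded |= ring
    return expanded


expanded = expand_query("buy car")
```

Widening a ring to include "truck", "bus" and "taxi" would raise recall for a user who wants any vehicle, but hurt precision for one who wants only cars, which is why the right level of abstraction depends on the user and task.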
Why Search is Hard
Word sense disambiguation
– “Bank”
• Financial institution?
• Part of a river?
• An aerial manoeuvre?
Active research area
– Categorisation & clustering of results
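One classic family of approaches picks the sense whose associated vocabulary overlaps most with the surrounding words (a simplified, Lesk-style overlap). The sense "signatures" below are hypothetical word lists invented for this sketch, not taken from any dictionary.

```python
# Hypothetical signature words for each sense of "bank".
SENSES = {
    "financial institution": {"money", "account", "loan", "deposit", "cash"},
    "part of a river": {"river", "water", "shore", "fishing", "mud"},
    "aerial manoeuvre": {"aircraft", "pilot", "turn", "wing", "roll"},
}


def disambiguate(context):
    """Pick the sense sharing the most words with the context.

    Ties (including no overlap at all) fall back to the first sense listed.
    """
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))


s1 = disambiguate("she opened an account at the bank to deposit cash")
s2 = disambiguate("we went fishing on the bank of the river")
s3 = disambiguate("the pilot put the aircraft into a steep bank")
```

Short queries rarely carry this much context, which is part of why the problem remains hard for search.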
Google’s Insight
Exploit the link structure inherent in the web
– calculate measure of document’s value
• Independent of any query
– “PageRank”
Overall relevance based on 100+ parameters
– Constant battle with SEOs
Enterprise search is a different proposition…
– As is desktop search
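The query-independent value of a page can be sketched as a power iteration over the link graph. This is the textbook form of the PageRank idea, not Google's production system; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly over its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

No query is involved anywhere: page "c" ranks highest purely because the other pages link to it, and "d" ranks lowest because nothing links to it.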
Where’s it all going?
Vertical search
– Jobs, travel, health, people, etc.
Rich media search
– Audio, video, TV, images
Specialised content search
– blogs, news, classifieds
Social search
Personalisation
Where’s it all going?
Mobile search
Answer engines
– Active research community in Question Answering
Multi / cross-lingual search
Search agents
Human UI
Further Information
www.irsg.bcs.org
Informer
ECIR (March 2008, Glasgow)
Search Solutions 2008 (Sept 2008, London)