
Going further together Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng Vice-Chair, BCS IRSG




Contents

The BCS Information Retrieval SG

What is IR anyway?

How search engines work

Why search is hard

Where’s it all going?


Information Retrieval SG

Growing rapidly
– 750+ members

Annual conference (ECIR)
– FDIA

Various 1-day events
– Search Solutions

Informer

Discounts for various events, e.g. SIGIR

… is free to join!


Information Retrieval SG

Traditional focus on search (text retrieval)
– Knowledge management, Multimedia retrieval, User experience, Information visualisation, extraction, summarisation, etc.

Latest issue of Informer:
– “Searching for the Music You Like”
– “Exploring Maps through Geo-referenced Images and RDF Shared Metadata”
– “Using Semantic Relations to improve Question Answering”
– “Modeling & Annotation of Dance Media Semantics”


What is IR?

“Science of searching for:
– information in documents
– documents themselves
– metadata which describe documents,
– within databases
…whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web”


The Need for IR

In a word … Infoglut

800 MB of recorded information is produced per person per year [Computing magazine]

Up to 80% of corporate information is unstructured
– Documents, emails, images, voicemail, etc.

So… can’t we just use Google?


How do Search Engines Work?

On the surface:

1. Understand what the user wants

2. Find documents about that topic

In reality:

1. Count words

2. Apply a simple equation
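
The "count words, apply a simple equation" recipe can be sketched with a basic TF-IDF score. This is one common choice of equation, not necessarily the one the slide had in mind; the tokenizer, function names and sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    # Step 1: count words in every document.
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    scores = []
    for counts in doc_counts:
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_counts if term in c)
            if df == 0:
                continue
            # Step 2: a simple equation, term frequency times log(N / df).
            score += counts[term] * math.log(n_docs / df)
        scores.append(score)
    return scores


# Hypothetical three-document collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
scores = tf_idf_scores("cat mat", docs)
```

The first document scores highest because it contains both query terms, and "mat" is rarer across the collection than "cat", so it contributes more.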


How do Search Engines Work?

1. Measure the conceptual distance between your query and each document in the DB

2. Return the best matches

[Source: Maristella Agosti, University of Padova]
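
The two steps above can be sketched as a distance function plus a top-k selection. Jaccard distance over term sets stands in here for "conceptual distance"; that substitution, the function names and the sample documents are all assumptions, and a real engine would use a richer measure.

```python
import heapq


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two sets of terms."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def best_matches(query, docs, k=2):
    """Return the indices of the k documents closest to the query."""
    q = set(query.lower().split())
    # Step 1: measure the distance between the query and each document.
    scored = [
        (jaccard_distance(q, set(d.lower().split())), i)
        for i, d in enumerate(docs)
    ]
    # Step 2: return the best (smallest-distance) matches.
    return [i for _, i in heapq.nsmallest(k, scored)]


# Hypothetical document collection.
docs = [
    "venetian blind fitting guide",
    "history of the venetian republic",
    "guide to fitting a roller blind",
]
top = best_matches("venetian blind", docs, k=2)
```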


The Central Problem in IR

Information Seeker Author

Concepts Concepts

Query Terms Document Terms

Do these represent the same concepts?

[Source: Jimmy Lin, University of Maryland]


The Central Problem in IR

How do you represent the concepts?
– Documents and queries = “bag of words”
• Unordered set of terms + numeric weights

How do you calculate similarity?
– Set theory (e.g. Boolean)
– Algebraic (e.g. vector space)
– Probabilistic
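
A minimal sketch of two of the similarity families named above, applied to the same bag-of-words representation: a Boolean AND match (set theory) and cosine similarity (vector space). Raw term counts are used as the numeric weights, which is an assumption; the probabilistic family is not shown.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Unordered terms with numeric weights (here, raw counts)."""
    return Counter(text.lower().split())


def boolean_and(query_bag, doc_bag):
    """Set-theoretic match: every query term must occur in the document."""
    return all(term in doc_bag for term in query_bag)


def cosine(a, b):
    """Vector-space match: cosine of the angle between two term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents.
query = bag_of_words("search engine")
doc_a = bag_of_words("search engine ranking for enterprise search")
doc_b = bag_of_words("engine oil change")

match_a = boolean_and(query, doc_a)   # all query terms present
match_b = boolean_and(query, doc_b)   # "search" is missing
sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

The Boolean model gives a yes/no answer, so it cannot rank; the vector-space model gives a graded score, so doc_b still receives partial credit for sharing one term.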


IR models

[Source: Wikipedia]


How do we Evaluate Search?

Assume that results are either relevant or non-relevant

Precision:
– Proportion of retrieved documents that are relevant

Recall:
– Proportion of known-relevant documents that were actually retrieved

But what about: indexing / retrieval speed, query language, user experience, etc.?

[Figure: overlapping sets of relevant and retrieved documents]
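
The two definitions can be made concrete with a short sketch, assuming binary relevance as the slide does. The document IDs and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five known-relevant documents were found (recall 0.4).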


Why Search is Hard

Document representation
– Keywords are not enough
• Blind Venetian = Venetian Blind
– Terms are not independent
• Structural & discourse dependencies, co-references, etc.

Imperfect “stop lists”
– the, and, of…
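
A short sketch of both problems, assuming a plain bag-of-words index. The stop list below extends the slide's "the, and, of" with a few extra words purely for illustration.

```python
from collections import Counter

# Illustrative stop list; only "the", "and", "of" come from the slide.
STOP_WORDS = {"the", "and", "of", "to", "be", "or", "not", "a", "in"}


def bag(text, stop_words=frozenset()):
    """Bag of words with optional stop-word removal."""
    return Counter(t for t in text.lower().split() if t not in stop_words)


# Word order is discarded, so the two phrases become indistinguishable.
same = bag("blind Venetian") == bag("Venetian blind")

# A stop list can wipe out a query entirely.
lost = bag("to be or not to be", STOP_WORDS)
```

With this representation `same` is true, and `lost` is empty: every word of the query was treated as noise.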


Why Search is Hard

Morphological relationships
– Computer, computing, compute, computed…

Index documents using word stems
– False positives:
• organization, organ → organ
• police, policy → polic
• arm, army → arm
– False negatives:
• cylinder, cylindrical
• create, creation
• Europe, European
– Prefixes are particularly difficult
• Un*, dis*
• Delegate = de-leg-ate
• Ratify = rat-ify
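
The stemming errors above can be reproduced with a toy suffix stripper. This is not the Porter stemmer or any published algorithm; the suffix list is hypothetical and was chosen so that the slide's examples behave as described.

```python
# Hypothetical suffix list, longest first so "ization" wins over "e" or "s".
SUFFIXES = sorted(
    ["ization", "ation", "ing", "ed", "er", "es", "e", "y", "s"],
    key=len,
    reverse=True,
)


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Intended conflation: related forms share a stem.
family = {toy_stem(w) for w in ["computer", "computing", "compute", "computed"]}

# False positives: unrelated words collapse to the same stem.
false_positives = [
    (toy_stem("organization"), toy_stem("organ")),
    (toy_stem("police"), toy_stem("policy")),
    (toy_stem("arm"), toy_stem("army")),
]

# False negatives: related words end up with different stems.
false_negatives = [
    (toy_stem("cylinder"), toy_stem("cylindrical")),
    (toy_stem("create"), toy_stem("creation")),
    (toy_stem("Europe"), toy_stem("European")),
]
```

Adding more suffixes fixes some false negatives but creates new false positives, which is the trade-off the slide points at.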


Why Search is Hard

Named entity recognition
– Companies in New York
– New companies in York

NEs are highly discriminatory
– People
– Places
– Organisations

Many vertical applications
– e.g. bioscience
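
A minimal sketch of why named entities matter: the two example queries have identical bags of words, but a longest-match lookup against a gazetteer tells them apart. The gazetteer contents and function name are hypothetical, and real NER systems are considerably more sophisticated than a dictionary lookup.

```python
# Hypothetical gazetteer mapping token sequences to entity types.
GAZETTEER = {
    ("new", "york"): "PLACE",
    ("york",): "PLACE",
}


def tag_entities(text, gazetteer=GAZETTEER, max_len=2):
    """Greedy longest-match lookup of known entities in the text."""
    tokens = text.lower().split()
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first.
        for n in range(max_len, 0, -1):
            span = tuple(tokens[i:i + n])
            if span in gazetteer:
                entities.append((" ".join(span), gazetteer[span]))
                i += len(span)
                break
        else:
            i += 1  # no entity starts at this token
    return entities


q1 = "Companies in New York"
q2 = "New companies in York"

same_bag = sorted(q1.lower().split()) == sorted(q2.lower().split())
entities_1 = tag_entities(q1)
entities_2 = tag_entities(q2)
```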


Why Search is Hard

Semantic relationships
– Car = automobile
– Buy = purchase
– Sick = ill

Synonym rings
– Car, automobile, truck, bus, taxi...
– Appropriate level of abstraction depends on user & task

Development of subject-specific taxonomies
– “concept matching”
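
Synonym rings are often applied as query expansion: each query term is replaced by the group of terms in its ring. The sketch below assumes that approach; the rings are taken from the strict synonym pairs on the slide, and the function names are made up.

```python
# Rings built from the slide's synonym pairs.
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]


def expand_query(query, rings=SYNONYM_RINGS):
    """Turn each query term into a sorted group of interchangeable terms."""
    expanded = []
    for term in query.lower().split():
        group = {term}
        for ring in rings:
            if term in ring:
                group |= ring
        expanded.append(sorted(group))
    return expanded


def matches(expanded, doc):
    """A document matches if it contains one term from every group."""
    doc_terms = set(doc.lower().split())
    return all(any(t in doc_terms for t in group) for group in expanded)


expanded = expand_query("buy car")
hit = matches(expanded, "where to purchase an automobile")
miss = matches(expanded, "car hire")
```

Widening the "car" ring to include truck, bus and taxi would raise recall and lower precision, which is why the right level of abstraction depends on the user and the task.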


Why Search is Hard

Word sense disambiguation
– “Bank”
• Financial institution?
• Part of a river?
• An aerial manoeuvre?

Active research area
– Categorisation & clustering of results
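
One simple family of approaches picks the sense whose typical vocabulary overlaps most with the words around the ambiguous term. The sketch below is in that spirit only; the cue words for each sense of "bank" are invented for illustration, not drawn from any lexicon.

```python
# Hypothetical cue words for the three senses of "bank" listed above.
SENSES = {
    "financial institution": {"money", "account", "loan", "deposit", "cash"},
    "part of a river": {"river", "water", "fishing", "shore", "flood"},
    "aerial manoeuvre": {"aircraft", "pilot", "turn", "wing", "roll"},
}


def disambiguate(context, senses=SENSES):
    """Pick the sense sharing the most words with the context.

    Ties (including zero overlap) fall back to the first sense listed.
    """
    words = set(context.lower().split())
    return max(senses, key=lambda s: len(words & senses[s]))


sense_1 = disambiguate("she opened an account at the bank to deposit cash")
sense_2 = disambiguate("we went fishing on the bank of the river")
sense_3 = disambiguate("the pilot put the aircraft into a steep bank")
```

Short queries rarely carry this much context, which is one reason search engines fall back on categorising or clustering the results instead.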


Google’s Insight

Exploit the link structure inherent in the web
– calculate measure of document’s value
• Independent of any query
– “PageRank”

Overall relevance based on 100+ parameters
– Constant battle with SEOs

Enterprise search is a different proposition…
– As is desktop search
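
The query-independent link measure can be sketched as a textbook power iteration. This is a simplified illustration, not Google's production algorithm: the damping factor of 0.85 is the commonly quoted value, and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical web: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

No query appears anywhere in the computation: C ranks highest purely because three pages link to it, and A ranks second because C's only link points there.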


Where’s it all going?

Vertical search
– Jobs, travel, health, people, etc.

Rich media search
– Audio, video, TV, images

Specialised content search
– blogs, news, classifieds

Social search

Personalisation


Where’s it all going?

Mobile search

Answer engines
– Active research community in Question Answering

Multi / cross-lingual search

Search agents

Human UI


Further Information

www.irsg.bcs.org

Informer

ECIR (March 2008, Glasgow)

Search Solutions 2008 (Sept 2008, London)