Content Based Search

Rajesh Kumar Jain
Roll No: 07405402
(rkjain@cse.iitb.ac.in)

Agenda
• Motivation
• What Do People Want from Search Engine?
• Types of Search Engines
• Existing Search Engines (Google, Yahoo, Ask, AppliedSemantics)
• INIS – International Nuclear Information System
• AgroExplorer
• Our approach – Functional Architecture with example
• Conclusion and Future Work

Motivation
• The Web is a major source of information.
• Need for search engines
  • Efficient and time saving.
  • Language barrier.
  • Most relevant documents.
• Meaning Based Search
  • Used to retrieve the most relevant documents.
• Multilingual Search
  • Used to eliminate the language barrier.

What Do People Want from Search Engine?
• Integrated Solutions
• Distributed Solutions
• Efficient, Flexible Indexing and Retrieval
• Interfaces and Browsing
• Effective Retrieval
• Multimedia Retrieval
• Information Extraction
• Relevance Feedback

Types of Search Engines
• Individual search engines
  • Compile their own databases.
  • Further classified as:
    • Keyword based search engines: search on the keywords, e.g. Google.
    • Meaning based search engines: search on the meaning or semantics, e.g. AgroExplorer.
• Meta search engines
  • Do not compile their own databases.
  • Search the databases of different search engines, e.g. Dogpile.
• Subject directories
  • Created and maintained by human editors, e.g. LIBRARIANS' INDEX http://lii.org, INFOMINE http://infomine.ucr.edu, ACADEMIC INFO http://www.academicinfo.us

Existing Search Engines – Google
• Keyword Based Search
• PageRank
  • Relative importance of the web page.
• Anchor Text
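
The idea that PageRank measures the relative importance of a page from the links pointing to it can be illustrated with a minimal power-iteration sketch. The three-page link graph, the damping factor of 0.85 and the function name `pagerank` are assumptions made for illustration; this is a textbook-style simplification, not Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every page must appear as a key; a page with no out-links
    spreads its rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A and C both link to B, and B links back to A.
toy_web = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(toy_web)
```

In this toy graph, B ends up with the highest rank because two pages link to it, and C ends up lowest because nothing links to it.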

Existing Search Engines – Yahoo!
Yahoo! Search http://search.yahoo.com
• Huge (15 billion or more web pages)
• Relevancy ranking (word proximity and placement), not popularity ranking
• Capitalize OR, AND, or AND NOT. Put parentheses around words joined by OR.
• No search-size word limit (Google limits you to 32 terms)
• Services and tools similar to Google's

Existing Search Engines – Yahoo!
Differences between searching Google and Yahoo! Search
• Parentheses around ORed terms (sometimes works without parentheses):
  ("global warming" OR "greenhouse effect") rise "sea level" (california OR "los angeles" OR "san diego" OR "san francisco")
• Supports intitle: site: inurl: hostname: (hostname: is for the entire site name, e.g. hostname:google.com)
• Shortcuts available at http://tools.search.yahoo.com/shortcuts

Existing Search Engines – Ask.com
Ask.com http://ask.com
• Subject-Specific Popularity ranking (links from pages on the same subject as your search)
• Search results analyzed to provide:
  • BROADER & NARROWER TERMS suggestions
• Smaller database than Google or Yahoo! (about 2 billion)
• No differences between basic searching in Google and searching Ask.com

Existing Search Engines – AppliedSemantics
• Internet’s first meaning based search engine.
• Used in Google AdSense (advertising solutions).
• Uses CIRCA technology (Conceptual Information Retrieval and Communication Architecture).
• CIRCA has
  • a scalable, language independent ontology.
• The ontology has
  • millions of words with their meanings
  • conceptual relationships to other meanings.

CIRCA
• Identifies concepts related to specific words and phrases.
• Finds how close “phrase A” is to “concept B”.
• For a given query
  • Finds the distance between the query and various concepts in the database.
  • E.g. Query – “Colorado Bicycle trips”.
  • Possible concepts – region, bicycling, travel, etc.
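
The slides do not say how CIRCA computes the distance between a query and a concept. One simple way to picture it is to count links on the shortest path through an ontology graph, as in the sketch below. The mini-ontology, the averaging rule and the function names are all hypothetical and are not taken from CIRCA.

```python
from collections import deque

# Hypothetical mini-ontology: undirected "related to" links between concepts.
ONTOLOGY = {
    "colorado": ["region"],
    "region": ["colorado", "travel"],
    "bicycle": ["bicycling"],
    "bicycling": ["bicycle", "travel", "sport"],
    "trip": ["travel"],
    "travel": ["region", "bicycling", "trip"],
    "sport": ["bicycling"],
}


def concept_distance(start, goal):
    """Number of links on the shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in ONTOLOGY.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def query_distance(terms, concept):
    """Average distance from the reachable query terms to a concept."""
    distances = [concept_distance(term, concept) for term in terms]
    reachable = [d for d in distances if d is not None]
    return sum(reachable) / len(reachable) if reachable else None


query = ["colorado", "bicycle", "trip"]
to_travel = query_distance(query, "travel")
to_sport = query_distance(query, "sport")
```

With this toy graph, the query lies closer to "travel" than to "sport", so "travel" would be preferred as a related concept.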


INIS
There are three major INIS products:
• The INIS Database, which today contains 2.9 million bibliographic records; it is accessible by subscription only and currently has 1.3 million authorized users.
• A unique collection of over 850 000 full-text documents (non-conventional "grey" literature – NCL) in 63 languages, including many documents that cannot easily be found anywhere else.
• The INIS Multilingual Thesaurus – a major tool for describing nuclear information and knowledge in a structured form, which assists in multilingual and semantic searches.

INIS-Features and Benefits
• IAEA official design
• Direct access to NCL documents in PDF format
• Extended and configurable hyper-linking of external web addresses and emails, facilitating easier access to NCL documents on external systems or contacting authors
• Weekly email notifications
• Improved usability:
  • Allows users to see the query and its results at the same time
  • Allows users to preserve previously run queries for comparison purposes
  • Displays records in reverse chronological order, giving users quick access to the latest records
• Better documentation:
  • Tool-tips assist users in performing tasks
  • Static help pages with "how-to" documents, manuals and glossary of terms can be opened in a separate window for consultation

INIS-Features and Benefits
• Improved configurability:
  • Allows users to fully customize the search mask and search results pages
  • The interface can be used in English, German and Spanish, with Portuguese to be added soon. More languages can be added upon demand
  • Anonymous users can register their own profiles and enjoy personalized features
• Improved Index/Authority Navigator with search-composing assistant (CTRL-CLICK)
• Increased data export capabilities: new formats (XML, Excel, formatted text, delimited text, HTML), sorting of exports
• The type-ahead, search-ahead functionality "INIS Suggest" assists users when entering search terms and shows the hit count before the search is executed; this provides additional useful information when composing queries
• Searches are much faster, now enabling queries that used to time out in the old system. Most queries are estimated to be between 5 and 20 times faster


INIS-Features and Benefits

Support for concurrent users: a round-robin load balancer distributes the load among different databases

Improved maintenance: all update procedures are automated, require no human intervention and notify administrators in case of problems

Zero downtime per week: updates are transparent to users, who can use the system 24/7 without performance detriments.
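
The round-robin load balancing mentioned above can be sketched in a few lines: each incoming request is handed to the next backend in a fixed rotation. The class name and the database replica names below are hypothetical; the slides do not describe how INIS actually implements its balancer.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self):
        """Return the backend that should serve the next request."""
        return next(self._rotation)


# Hypothetical database replica names, not taken from INIS.
balancer = RoundRobinBalancer(["db-1", "db-2", "db-3"])
assigned = [balancer.next_backend() for _ in range(5)]
```

Five consecutive requests are assigned to db-1, db-2, db-3, db-1 and db-2, so concurrent users are spread evenly across the databases.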

AgroExplorer
• A meaning based multilingual search engine.
• Agriculture domain.
• UNL is used as interlingua.
• Supports English, Hindi and Marathi.

Methodology
• User phrases the query in their native language.
• System translates it to Universal Networking Language (UNL).
• UNL corpus is searched.
• Related documents in UNL are fetched.
• Fetched documents are converted to the native language.


Query Output
• Complete Expression Matching
  • Retrieves completely relevant documents where the query UNL graph is a subgraph of any sentence UNL graph.
• Partial Expression Matching
  • Retrieves relevant documents where the query UNL graph is a part of any sentence UNL graph.
• Universal Word Matching
  • Search on Universal Words, which are concepts, not just keywords.
• Keyword Based Matching
  • Traditional search. Lucene search engine used.
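
The first three matching levels can be illustrated by treating a UNL graph as a set of (relation, head, tail) triples. This is a minimal sketch under stated assumptions: the simplified universal words, the example sentence, and the reading of partial matching as "at least one relation shared" are illustrative choices, not AgroExplorer's actual implementation. Keyword matching through Lucene is left out.

```python
def universal_words(graph):
    """Collect the node labels (universal words) used in a set of triples."""
    return {word for _, head, tail in graph for word in (head, tail)}


def match_level(query, sentence):
    """Classify how well a query graph matches one sentence graph.

    Both graphs are sets of (relation, head, tail) triples.
    """
    if query <= sentence:
        return "complete"        # query graph is a subgraph of the sentence
    if query & sentence:
        return "partial"         # assumed: at least one relation is shared
    if universal_words(query) & universal_words(sentence):
        return "universal word"  # only concepts overlap, no relation does
    return "none"


# Hypothetical graph for "Farmers grow rice in Maharashtra".
sentence = {
    ("agt", "grow", "farmer"),
    ("obj", "grow", "rice"),
    ("plc", "grow", "Maharashtra"),
}

levels = [
    match_level({("obj", "grow", "rice")}, sentence),
    match_level({("agt", "grow", "farmer"), ("obj", "grow", "wheat")}, sentence),
    match_level({("mod", "price", "rice")}, sentence),
    match_level({("mod", "price", "cotton")}, sentence),
]
```

A search front end could try the levels in this order and fall back to the next one only when the stricter level returns no documents.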

Multilingual Information Retrieval
• Need
  • Document collection contains documents in many languages.
  • User may not be fluent enough to express the query in the document language.
• Approaches
  • Machine translation for text translation
  • Thesaurus/Dictionary Based
  • Corpus Based (Sub word clusters)

Our Approach – Functional Architecture

Example…

Commercial Description:
1. Automobile Radio and Stereo Retail Store;
2. Automobile Engine Rebuilding, Repair, and Exchange Workshop;
3. Car Repair and Retail Shop;
4. Jeep Repair and Retail Shop; and
5. Motor Mending and Replacement Workshop.

Example…
For our search, we shall compare these encoding and retrieval techniques:
• a flat list of words,
• a structured list of words,
• a flat list of word senses plus the linguistic ontology,
• a structured list of word senses, using WordNet’s ontology.

Method – Flat list of Words

NO.  QUERY              DESCRIPTIONS FOUND
1    Automobile         1, 2
2    Automobile Retail  1
3    Car Repair         3
4    Motor Repair       -
5    Engine Repair      2
6    Motor Exchange     -

Both recall and precision of this method are very poor.
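
The flat-list results can be reproduced with a small sketch that treats each of the five commercial descriptions as a bag of lowercase words and returns a description only when it contains every query word. The function names are illustrative assumptions; the descriptions and queries are the ones from the example.

```python
import re

DESCRIPTIONS = {
    1: "Automobile Radio and Stereo Retail Store",
    2: "Automobile Engine Rebuilding, Repair, and Exchange Workshop",
    3: "Car Repair and Retail Shop",
    4: "Jeep Repair and Retail Shop",
    5: "Motor Mending and Replacement Workshop",
}


def words(text):
    """Split text into a set of lowercase words, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_search(query):
    """Return the numbers of descriptions containing every query word."""
    wanted = words(query)
    return [number for number, text in DESCRIPTIONS.items()
            if wanted <= words(text)]


results = {q: flat_search(q) for q in [
    "Automobile", "Automobile Retail", "Car Repair",
    "Motor Repair", "Engine Repair", "Motor Exchange",
]}
```

"Motor Repair" finds nothing even though description 5 is a motor mending workshop, because exact word matching cannot tell that "mending" and "repair" mean the same thing.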

Method – Structured list of Words

NO.  BUSINESS TYPE  ACTIVITY     OBJECT  MARKET AREA
1    Store          Retail       Radio   Automobile
     Store          Retail       Stereo  Automobile
2    Workshop       Rebuilding   Engine  Automobile
     Workshop       Repair       Engine  Automobile
     Workshop       Exchange     Engine  Automobile
3    Shop           Retail       Car
     Shop           Repair       Car
4    Shop           Retail       Jeep
     Shop           Repair       Jeep
5    Workshop       Replacement  Motor
     Workshop       Mending      Motor


Method – Structured list of Words

 

Recall remains the same because we have not eliminated the semantic-match problems.
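
A minimal sketch of the structured encoding, using the rows of the structured table: each row becomes a record with named fields, and a query constrains individual fields instead of matching loose words. The `Record` type and the `structured_search` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    description: int
    business_type: str
    activity: str
    obj: str
    market_area: Optional[str]


RECORDS = [
    Record(1, "Store", "Retail", "Radio", "Automobile"),
    Record(1, "Store", "Retail", "Stereo", "Automobile"),
    Record(2, "Workshop", "Rebuilding", "Engine", "Automobile"),
    Record(2, "Workshop", "Repair", "Engine", "Automobile"),
    Record(2, "Workshop", "Exchange", "Engine", "Automobile"),
    Record(3, "Shop", "Retail", "Car", None),
    Record(3, "Shop", "Repair", "Car", None),
    Record(4, "Shop", "Retail", "Jeep", None),
    Record(4, "Shop", "Repair", "Jeep", None),
    Record(5, "Workshop", "Replacement", "Motor", None),
    Record(5, "Workshop", "Mending", "Motor", None),
]


def structured_search(**criteria):
    """Return description numbers with a record matching every given field."""
    found = []
    for record in RECORDS:
        if all(getattr(record, field) == value
               for field, value in criteria.items()):
            if record.description not in found:
                found.append(record.description)
    return found


sells_automobiles = structured_search(activity="Retail", obj="Automobile")
sells_for_automobiles = structured_search(activity="Retail",
                                          market_area="Automobile")
repairs_motors = structured_search(activity="Repair", obj="Motor")
```

Precision improves because a query for shops that retail automobiles no longer returns the car-radio store, which only lists "Automobile" as its market area. Recall does not: "Repair" of a "Motor" still misses description 5, which is encoded with "Mending".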

Method – WordNet Synset and Linguistic ontology

NO.  DISAMBIGUATED DESCRIPTION
1    [car, auto, automobile, machine, motorcar], [radio receiver, receiving set, radio set, radio, tuner, wireless], [stereo, stereo system, stereophonic system], [retail, sell retail], [shop, store]
2    [car, auto, automobile, machine, motorcar], [engine], [rebuilding], [repair, fix, fixing, mending, reparation], [substitution, exchange], [workshop, shop]
3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
4    [jeep, landrover], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
5    [motor], [repair, fix, fixing, mending, reparation], [replacement, replacing], [workshop, shop]

Method – Flat list of Word senses and Linguistic ontology

NO.  DESCRIPTIONS FOUND  DISAMBIGUATED QUERY
1    1, 2, 3, 4          [car, auto, automobile, machine, motorcar]
2    1, 3, 4             [car, auto, automobile, machine, motorcar], [retail, sell retail]
3    2, 3, 4             [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation]
4    2, 5                [motor], [repair, fix, fixing, mending, reparation]
5    -                   [locomotive, engine, locomotive engine, railway locomotive], [repair, fix, fixing, mending, reparation]
6    2, 5                [motor], [substitution, exchange]
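
The word-sense results can be reproduced with a sketch in which each synset is reduced to one identifier and a small is-a hierarchy lets a general query sense match more specific description senses. The three is-a links (jeep is a car, engine is a motor, replacement is an exchange) are assumptions chosen to mirror the results shown; a real system would look them up in WordNet rather than hard-code them.

```python
# Each description as a set of sense identifiers (one identifier per synset).
SENSES = {
    1: {"car", "radio", "stereo", "retail", "shop"},
    2: {"car", "engine", "rebuilding", "repair", "exchange", "workshop"},
    3: {"car", "repair", "retail", "shop"},
    4: {"jeep", "repair", "retail", "shop"},
    5: {"motor", "repair", "replacement", "workshop"},
}

# Assumed is-a links (child -> parent) standing in for the ontology.
PARENT = {"jeep": "car", "engine": "motor", "replacement": "exchange"}


def with_ancestors(sense):
    """Return the sense together with everything it is a kind of."""
    chain = {sense}
    while sense in PARENT:
        sense = PARENT[sense]
        chain.add(sense)
    return chain


def sense_search(query_senses):
    """Return descriptions covering every query sense, directly or via is-a."""
    found = []
    for number, senses in SENSES.items():
        covered = set()
        for sense in senses:
            covered |= with_ancestors(sense)
        if set(query_senses) <= covered:
            found.append(number)
    return found


results = [
    sense_search({"car"}),
    sense_search({"car", "retail"}),
    sense_search({"car", "repair"}),
    sense_search({"motor", "repair"}),
    sense_search({"locomotive", "repair"}),
    sense_search({"motor", "exchange"}),
]
```

Query 4 now finds description 5 because "mending" and "repair" share one synset, and it finds description 2 because an engine is a kind of motor. Query 5 finds nothing because the locomotive sense of "engine" is a different synset from the one used in description 2.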

Method – Flat list of Word senses and Linguistic ontology
• Decouple the user vocabulary from the data vocabulary, by covering the most common English words;
• Increase recall, by exploiting the hierarchy to make generic queries and recognizing synonyms;
• Increase precision, through the disambiguation mechanism and the ability to navigate the hierarchy to select specific queries.

Conclusion and Future action…
• Meaning based search engines can capture the concept or idea the user expresses in a query and can thus provide more accurate results than traditional keyword search engines.
• Universal Networking Language (UNL) can be used as an effective interlingua to represent information in documents written in natural languages.
• Multilingual search engines can help users access documents written in languages other than the query language.

Future Work
• The lack of a large scored, multilingual corpus and the adverse effects of polysemous words are found to be the cause of most of the limitations of MLIR systems. Research efforts are being directed towards these fields and towards approaches that use interlinguas such as UNL, subword clusters, etc. effectively for MLIR.
