Key Questions
What are we looking for? How do we find it? Why is it difficult?
“A prudent question is one-half of wisdom.” (Francis Bacon)
Search Engines 2
What are we looking for?
We are:
- Looking for X
  - Q&A: population of China
  - Known-item search: “Catcher in the Rye”
- Looking for something like/about X
  - General/background info: Taliban
  - Collection development: IR literature
  - Similar to (known) X: like “Catcher in the Rye”
  - WhatyoumacallX: “the rye-boy story”
- Looking for something
  - Problem resolution: how can we fight terrorism?
  - Knowledge development: what is IR?
- Looking
  - Need something, but don’t know what: what’s it all about?
  - Serendipity: Web surfing
How do we find it?
- Brute-force search
  - Easy to build, maintain, and use
  - Searcher does all the work; hard to get satisfaction
- Organize/structure the data (Information Organization)
  - Intuitive to use
  - Hard to build and maintain
  - Knowledge of the builder’s language and organization structure is crucial
- Use a search tool (Information Retrieval)
  - Easier to build and maintain: less manipulation of data
  - Sometimes works, sometimes not (helps to know the language of the data)
- Ask the experts (Expert System)
  - Easy and satisfying to use (by definition)
  - “Expert” knowledge is transitory and hard to encapsulate
- Go with the crowd (User Ratings > Recommender System > PageRank)
  - Relatively easy to build and maintain
  - Limited utility: doesn’t work with “unpopular” X
- Zen-Fusion search
Information Seeking Process: Dynamic, Interactive, Iterative
User:
- What am I looking for? Identification of info. need
- How do I find it? Query formulation
Intermediary:
- What are we looking for? Discovery of user’s information need; query representation
- Where is it? Query-document matching
Information:
- What is it? Collection; classification
- How is it found? Data structure; representation
IR vs. IO
- Information Organization: add structure & annotation
- Information Retrieval: create a searchable index
- Information Access: retrieve information
- Data Mining: discover knowledge
Information Retrieval
- Representation: indexing, term weighting (Raw Data, Searchable Index)
- Query formulation: “What is IR?”
- Search results: (ranked) document list

Raw data (documents):
D1: wd1 wd2 wd3
D2: wd2 wd4 wd2 wd3
D3: wd1 wd4

Searchable index (term-document matrix):
      D1  D2  D3
wd1    1   0   1
wd2    1   2   0
wd3    1   1   0
wd4    0   1   1

Search results (ranked):
1. D2
2. D1
3. D3
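The indexing and ranking steps on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the method the slides prescribe: the function names `build_index` and `rank` are hypothetical, scoring is a plain sum of term frequencies (no term weighting), and the query term `wd2` is an assumption, since the slide does not say which terms the query contains. With that assumed query, the sketch happens to reproduce the ranking shown (D2, D1, D3).

```python
from collections import Counter

# The three toy documents from the slide.
docs = {
    "D1": ["wd1", "wd2", "wd3"],
    "D2": ["wd2", "wd4", "wd2", "wd3"],
    "D3": ["wd1", "wd4"],
}

def build_index(docs):
    """Map each term to {doc_id: term frequency} (one row of the matrix)."""
    index = {}
    for doc_id, words in docs.items():
        for term, tf in Counter(words).items():
            index.setdefault(term, {})[doc_id] = tf
    return index

def rank(query_terms, index, doc_ids):
    """Score each document by summed term frequency of the query terms."""
    scores = {doc_id: 0 for doc_id in doc_ids}
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # sorted() is stable, so ties keep their original document order.
    return sorted(doc_ids, key=lambda doc_id: -scores[doc_id])

index = build_index(docs)
ranking = rank(["wd2"], index, list(docs))  # assumed query term: wd2
print(ranking)  # ['D2', 'D1', 'D3']
```

A real system would typically replace the raw counts with a weighting scheme; the raw-count version is kept here so the numbers match the matrix on the slide.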
Information Organization
- Representation: NLP & machine learning (Raw Data, Organized Data)
- Query formulation: “What is IR?”
- Search results: document groups
Natural Language Processing (NLP)
- Research area, technique, and tool for: knowledge discovery, data mining
- Lexical analysis using: part-of-speech (POS) tagging, sentence parsing
Machine Learning
- Research area, technique, and tool for: information organization, knowledge discovery, data mining
- Information organization via:
  - Supervised learning (automatic classification)
  - Unsupervised learning (clustering)
[Figure: classification into Class 1 and Class 2, contrasted with clustering]
Clustering: IO for IR
- Document clustering
  - Cluster hypothesis: documents having similar contents tend to be relevant to the same query
  - Rank clusters by query-cluster similarity: cluster documents based on vector similarity
  - Post-retrieval clustering: Scatter-Gather
- Keyword clustering
  - Automatic thesaurus construction: query expansion
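To make "cluster documents based on vector similarity" and "rank clusters by query-cluster similarity" concrete, here is a minimal sketch. It uses single-pass threshold clustering with cosine similarity, a deliberately simple stand-in and not Scatter-Gather. The toy documents, the threshold of 0.5, the query term `wd2`, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy documents (bags of words).
docs = {
    "D1": ["wd1", "wd2", "wd3"],
    "D2": ["wd2", "wd4", "wd2", "wd3"],
    "D3": ["wd1", "wd4"],
}

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def single_pass_cluster(docs, threshold):
    """Assign each document to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"members": [...], "centroid": Counter}
    for doc_id, words in docs.items():
        vec = Counter(words)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]),
                   default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(doc_id)
            # Summed vector serves as centroid; cosine ignores its scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"members": [doc_id], "centroid": Counter(vec)})
    return clusters

def rank_clusters(query_terms, clusters):
    """Order clusters by cosine similarity between query and centroid."""
    q = Counter(query_terms)
    return sorted(clusters, key=lambda c: cosine(q, c["centroid"]),
                  reverse=True)

clusters = single_pass_cluster(docs, threshold=0.5)
ranked = rank_clusters(["wd2"], clusters)
print([c["members"] for c in ranked])  # [['D1', 'D2'], ['D3']]
```

Single-pass clustering depends on document order and on the threshold, so it is a sketch of the idea and not a recommendation.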
Classification: IO for IR
- Document categorization: classify documents into manually defined categories
  - Supports hierarchical browsing, query expansion via relevance feedback
- Document indexing: assign keywords to documents
  - Automatic indexing with controlled vocabulary, metadata generation
- Document filtering: e.g. news delivery, email spam filtering
- Query classification: collection selection, algorithm selection
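As one possible illustration of document filtering by supervised classification, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing. The slides do not name a specific classifier, so the choice of Naive Bayes is an assumption, and the training messages, labels, and function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (label, words) pairs, for illustration only.
training = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham", ["meeting", "agenda", "now"]),
    ("ham", ["project", "meeting"]),
]

def train(labeled_docs):
    """Collect per-class document counts, word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return the label with the highest log-probability for the words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][w] + 1) /
                                  (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify(["free", "money"], model))       # spam
print(classify(["project", "meeting"], model))  # ham
```

The same train-then-classify pattern applies to the other uses on this slide, such as assigning documents to manually defined categories, with the labels swapped accordingly.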