Praveen Manvi
April - 2009
Search Domain Basics
Objectives
Search Goals
Business Models
Structured Vs Un-Structured content
Search Terminologies
Technologies behind search
Goal : “Make it like this”
Simple, Mostly accurate & fast
But that’s not always possible
Business Models
Sponsored Search
Content Match
TLR Confidential
It’s all about billboards
Vertical search, or domain-specific search
Structured Vs Un-structured Data
Unstructured – 80%, Structured – 20%
Relational = structured; all other = unstructured.
Why not use SQL/RDBMS?
SQL search is limited: pattern matching such as LIKE '%bla bla%' is constrained by the schema and by SQL itself (a limited DSL).
Cannot handle user input that is misspelled but phonetically correct.
Difficult to implement search requirements such as proximity: if "Java" appears close to "Serialization", the content is probably about software.
Difficult to scale, manage changes & implement parallelization (Map-Reduce).
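The point about phonetically correct input can be illustrated with a phonetic key. The slides do not name an algorithm, so the sketch below assumes American Soundex; the function name `soundex` and the sample words are illustrative only. Two spellings that sound alike map to the same key, which a plain LIKE pattern cannot do.

```python
# Minimal American Soundex sketch (an assumed choice of phonetic algorithm).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key for a word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


# Different spellings, same phonetic key.
print(soundex("Smith"), soundex("Smyth"))
print(soundex("Serialization"), soundex("Serialisation"))
```

A search engine could store such a key next to each indexed term and compare keys instead of raw strings when the exact spelling finds nothing.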
Sample Search requests
Sample Collection: Sun JDK classes
How many times has the “synchronized” keyword been used in JDK Java classes outside the java.lang package?
How many static methods are present in JDK classes that have synchronized methods?
How many Java classes in the Collections framework use the synchronized keyword and have more than 200 lines?
Search Terminologies
Proximity search: A search where users can specify that the documents returned should have the words near each other.
Concept Search: A search for documents related conceptually to a word, rather than specifically containing the word itself.
Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
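Proximity and Boolean search can both be sketched on top of a positional index, a map from each term to the documents and word positions where it occurs. This is a minimal illustration, not how any particular engine does it; the function names (`build_positional_index`, `near`, and so on) and the three sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map term -> doc_id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def boolean_and(index, a, b):
    return set(index.get(a, {})) & set(index.get(b, {}))


def boolean_or(index, a, b):
    return set(index.get(a, {})) | set(index.get(b, {}))


def boolean_not(index, docs, a):
    return set(docs) - set(index.get(a, {}))


def near(index, a, b, max_distance):
    """Documents where a and b occur within max_distance words of each other."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        positions_a = index[a][doc_id]
        positions_b = index[b][doc_id]
        if any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b):
            hits.add(doc_id)
    return hits


docs = {
    1: "Java serialization writes objects to a byte stream",
    2: "Java is an island in Indonesia; serialization is unrelated here",
    3: "Python pickling is similar to serialization",
}
index = build_positional_index(docs)
print(sorted(boolean_and(index, "java", "serialization")))  # both words anywhere
print(sorted(near(index, "java", "serialization", 2)))      # words close together
```

The AND query matches the document about the island as well, while the proximity query keeps only the document where the two words are adjacent, which is the "Java close to Serialization" case.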
Contd…
Stemming: The ability of a search to include the "stem" of words. For example, stemming allows a user to enter "running" and also get back results for the stem word "run."
Lemmatization: The process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
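The difference between the two can be shown with a rough sketch. The stemmer below is a deliberately naive suffix stripper (not the Porter algorithm or any library's stemmer), and the lemma table is a tiny hand-written stand-in for a real dictionary; `naive_stem`, `lemmatize` and `LEMMAS` are hypothetical names.

```python
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hand-written lemma table, standing in for a real dictionary.
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}


def naive_stem(word):
    """Strip one common suffix; a crude approximation of stemming."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words like "class" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if (len(stem) >= 2 and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word


def lemmatize(word):
    """Look the word up in the lemma table, falling back to the stemmer."""
    return LEMMAS.get(word.lower(), naive_stem(word))


for w in ("running", "runs", "ran"):
    print(w, naive_stem(w), lemmatize(w))
```

Suffix stripping handles "running" and "runs" but leaves "ran" untouched; grouping "ran" with "run" needs the dictionary lookup, which is the part lemmatization adds.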
Contd…
Noise or stop words: Conjunctions, prepositions, articles and other words such as AND, TO and A that appear often in documents yet alone may contain little meaning.
Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
Index: A normalized representation of the words in the documents.
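These three terms fit together in a small inverted index. The sketch below assumes a toy stop-word list and a two-entry thesaurus, both invented for the example, and the names `normalize`, `build_index` and `search` are illustrative rather than any product's API.

```python
import re
from collections import defaultdict

# Toy stop-word list and thesaurus, made up for this example.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}
THESAURUS = {"car": {"automobile"}, "automobile": {"car"}}


def normalize(text):
    """Lowercase, tokenize and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND together the query terms, expanding each with its synonyms."""
    results = None
    for term in normalize(query):
        matches = set()
        for variant in {term} | THESAURUS.get(term, set()):
            matches |= index.get(variant, set())
        results = matches if results is None else results & matches
    return results or set()


docs = {
    1: "The automobile is parked in the garage",
    2: "A car and a bicycle",
    3: "Bicycle repair guide",
}
index = build_index(docs)
print(sorted(search(index, "the car")))      # synonym expansion finds document 1 too
print(sorted(search(index, "car bicycle")))
```

Stop words never enter the index, so "the" in a query is ignored, and the thesaurus lets a query for "car" match a document that only says "automobile".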
Contd…
Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.
Web Search Vs Enterprise Search
Web Search: Content is public & generic. Uses keywords and links, with relevancy based on some kind of historic traffic. Usually HTTP crawlers are used for content acquisition.
Enterprise Search: Also contains private documents that are domain specific. Content should be of the highest quality, not necessarily the most popular. Information/metadata needs to be secure, with role-based access to the content. It has to support security (Realms, Roles), SLAs and many other requirements.
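Role-based access can be sketched as a filter applied to search hits. This is only an illustration under simple assumptions: each document carries a set of roles allowed to see it, matching is a plain substring test, and the `Document` class, `secure_search` function and role names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: int
    text: str
    allowed_roles: frozenset  # roles permitted to see this document


def secure_search(docs, query, user_roles):
    """Return ids of matching documents the user's roles may access."""
    needle = query.lower()
    roles = set(user_roles)
    return [
        d.doc_id
        for d in docs
        if needle in d.text.lower() and d.allowed_roles & roles
    ]


docs = [
    Document(1, "Quarterly revenue forecast", frozenset({"finance"})),
    Document(2, "Revenue recognition policy", frozenset({"finance", "legal"})),
    Document(3, "Public revenue summary", frozenset({"finance", "legal", "staff"})),
]
print(secure_search(docs, "revenue", ["staff"]))
print(secure_search(docs, "revenue", ["legal"]))
```

The same query returns different results for different users, which is the main behavioural difference from web search where every document is public.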
Search Technologies
RDBMS to store metadata
Cache service – for fast access
Parsers – to interpret input queries
Internationalization – for handling different languages
Search DSL – catering to a particular domain
Map/Reduce, Parallelization & Algorithms
Indexing, File storage systems / Multi-threading
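The Map/Reduce idea can be shown in miniature with a word count: a map step counts terms in each document independently, and a reduce step merges the partial counts. The sketch uses a thread pool from the Python standard library purely for illustration; a real deployment would distribute the map step across machines, and the function names and sample sources are made up.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_count(text):
    """Map step: count the terms of one document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(docs, workers=4):
    """Run the map step over a thread pool, then fold the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_count, docs))
    return reduce(reduce_counts, partials, Counter())


sources = [
    "public synchronized void run()",
    "private static synchronized int next()",
    "public int size()",
]
totals = word_count(sources)
print(totals["synchronized"], totals["public"])
```

Because each document is counted on its own, the map step can be split across as many workers as are available, and only the small partial counts need to be combined.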
Thank You!