Praveen Manvi April - 2009 Search Domain Basics

Search domain basics


Page 1: Search domain basics

Praveen Manvi

April - 2009

Search Domain Basics

Page 2: Search domain basics

Objectives

  • Search goals
  • Business models
  • Structured vs. unstructured content
  • Search terminologies
  • Technologies behind search

Page 3: Search domain basics

Goal: “Make it like this”

Simple, mostly accurate, and fast

Page 4: Search domain basics

But that’s not always possible

Page 5: Search domain basics

Business Models

Sponsored Search

Page 6: Search domain basics

Content Match

Page 7: Search domain basics

TLR Confidential

It’s all about billboards

Page 8: Search domain basics
Page 9: Search domain basics
Page 10: Search domain basics


Vertical search, or domain-specific search

Page 11: Search domain basics

Structured vs. Unstructured Data

  • Unstructured: 80%; structured: 20%
  • Relational = structured; all other = unstructured

Page 12: Search domain basics

Why not use SQL/RDBMS?

  • SQL search is limited: %bla bla% pattern matching is constrained by the schema and by SQL itself (a limited DSL)
  • Cannot handle bad user input that is misspelled but phonetically correct
  • Difficult to implement search requirements such as proximity: if “Java” appears close to “Serialization”, the content is probably about software
  • Difficult to scale, manage changes, and implement parallelization (Map-Reduce)
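The first two limits can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for this example; they are not from the slides.

```python
import sqlite3

# Hypothetical two-row collection; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("Java object serialization with ObjectOutputStream",),
        ("Serialization of garden produce for winter storage",),
    ],
)


def like_search(term):
    """Return ids of rows whose body contains `term` as a substring."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


# Substring matching cannot tell software content from gardening content.
exact = like_search("serialization")
# A phonetically reasonable spelling variant matches nothing at all.
misspelled = like_search("serialisation")
```

Here `exact` contains both rows, relevant or not, while `misspelled` is empty: LIKE offers no ranking, no spelling tolerance, and no notion of words appearing near each other.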

Page 13: Search domain basics

Sample Search Requests

Sample collection: Sun JDK classes

  • How many times has the “synchronized” keyword been used in JDK Java classes outside the java.lang package?
  • How many static methods are present in JDK classes that have synchronized methods?
  • How many Java classes in the Collections framework use the synchronized keyword and have more than 200 lines?
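As a rough sketch of the first request, assuming the sources are already loaded into a dictionary keyed by fully qualified class name (the `SOURCES` data and the `count_keyword` helper are hypothetical, and the snippets are not real JDK code):

```python
import re

# Hypothetical stand-in for the JDK sources: class name -> source text.
SOURCES = {
    "java.lang.StringBuffer": "public synchronized int length() { return count; }",
    "java.util.Vector": (
        "public synchronized int size() { return elementCount; }\n"
        "public synchronized boolean isEmpty() { return elementCount == 0; }"
    ),
    "java.util.ArrayList": "public int size() { return size; }",
}


def count_keyword(sources, keyword, excluded_package):
    """Count whole-word occurrences of `keyword` outside `excluded_package`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    total = 0
    for class_name, text in sources.items():
        # Package is everything before the last dot of the class name.
        package = class_name.rsplit(".", 1)[0]
        if package == excluded_package:
            continue
        total += len(pattern.findall(text))
    return total


hits = count_keyword(SOURCES, "synchronized", "java.lang")
```

This is a plain text scan: it would also count the word inside comments or string literals, and it excludes only the exact package, not its subpackages. Answering the other two requests needs structural information (methods, modifiers, line counts) that a search index would have to capture as fields.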

Page 14: Search domain basics

Search Terminologies

  • Proximity search: A search where users can specify that the documents returned should have the words near each other.
  • Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself.
  • Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
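Proximity and Boolean search can both be sketched on top of a small positional index. The documents and function names below are assumptions for illustration only, not any particular engine's API.

```python
from collections import defaultdict

# Hypothetical three-document collection.
DOCS = {
    1: "java serialization writes objects to a stream",
    2: "java is an island and serialization is unrelated to coffee",
    3: "python pickling writes objects to a stream",
}


def build_positional_index(docs):
    """Map each word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


INDEX = build_positional_index(DOCS)


def proximity(index, a, b, max_distance):
    """Docs where `a` and `b` occur within `max_distance` words of each other."""
    result = set()
    for doc_id in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(
            abs(pa - pb) <= max_distance
            for pa in index[a][doc_id]
            for pb in index[b][doc_id]
        ):
            result.add(doc_id)
    return result


def boolean_and(index, a, b):
    """Docs containing both words."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def boolean_or(index, a, b):
    """Docs containing either word."""
    return set(index.get(a, {})) | set(index.get(b, {}))


def boolean_not(index, a, b):
    """Docs containing `a` but not `b`."""
    return set(index.get(a, {})) - set(index.get(b, {}))
```

With this data, `boolean_and(INDEX, "java", "serialization")` matches documents 1 and 2, while `proximity(INDEX, "java", "serialization", 2)` keeps only document 1, where the two words are adjacent.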

Page 15: Search domain basics

Contd…

  • Stemming: The ability of a search to include the "stem" of words. For example, stemming allows a user to enter "running" and also get back results for the stem word "run."
  • Lemmatization: The process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
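The difference between the two can be sketched as follows. The stemmer here is a deliberately crude suffix stripper (not the Porter algorithm), and the lemma table is a tiny hand-written assumption; real systems use much larger rule sets and dictionaries.

```python
def naive_stem(word):
    """Strip a few common English suffixes; deliberately crude."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # Strip a plural "s", but leave words such as "class" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical lemma dictionary: inflected form -> dictionary form.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

Rule-based stemming handles "running" and "runs" but leaves the irregular form "ran" unchanged, whereas the dictionary lookup maps all three to "run".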

Page 16: Search domain basics

Contd…

  • Noise or stop words: Conjunctions, prepositions, articles, and other words such as AND, TO and A that appear often in documents yet alone may contain little meaning.
  • Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
  • Index: Normalized representation of words
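These three terms fit together in a small inverted index. The sketch below assumes a made-up stop-word list, thesaurus, and document set; all names are illustrative.

```python
from collections import defaultdict

STOP_WORDS = {"and", "to", "a", "the", "of", "in"}
# Hypothetical synonym list.
THESAURUS = {"car": {"automobile"}, "automobile": {"car"}}

DOCS = {
    1: "A guide to the automobile",
    2: "The history of the bicycle",
}


def tokenize(text):
    """Normalize: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Map each normalized word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)


def search(index, query):
    """Return docs matching any query word or one of its synonyms."""
    hits = set()
    for word in tokenize(query):
        for candidate in {word} | THESAURUS.get(word, set()):
            hits |= index.get(candidate, set())
    return hits
```

A query for "a car" finds document 1 even though the word "car" never appears in it, and stop words such as "the" never enter the index.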

Page 17: Search domain basics

Contd…

  • Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.

Page 18: Search domain basics

Web Search vs. Enterprise Search

  • Web search: Content is public and generic. Uses keywords and links (relevancy) based on some kind of historic traffic. HTTP crawlers are usually used for content acquisition.
  • Enterprise search: Also contains private, domain-specific documents. Content should be of the highest quality, not necessarily popular. Information and metadata need to be secured with role-based access to the content. It has to support security (realms, roles), SLAs, and many other requirements.
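Role-based access in enterprise search can be sketched as a filter applied alongside matching. The `Document` type, role names, and sample documents are assumptions for illustration; production systems typically enforce this inside the index or through a security service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: int
    title: str
    allowed_roles: frozenset


# Hypothetical documents with the roles permitted to read each one.
DOCUMENTS = [
    Document(1, "Public product FAQ", frozenset({"employee", "contractor"})),
    Document(2, "Confidential pricing FAQ", frozenset({"sales"})),
]


def secure_search(documents, term, user_roles):
    """Return ids of matching documents the user is allowed to see."""
    term = term.lower()
    roles = set(user_roles)
    return [
        d.doc_id
        for d in documents
        if term in d.title.lower() and d.allowed_roles & roles
    ]
```

A user with only the "employee" role searching for "faq" sees document 1; adding the "sales" role also returns document 2.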

Page 19: Search domain basics

Search Technologies

  • RDBMS to store metadata
  • Cache service: for fast access
  • Parsers: to interpret input queries
  • Internationalization: for handling different languages
  • Search DSL: catering to a particular domain
  • Map/Reduce, parallelization, and algorithms
  • Indexing, file storage systems, multi-threading
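The Map/Reduce item can be sketched as a word count: each chunk is mapped to partial counts, and the partial counts are then reduced into one total. The chunks and function names are assumptions; a thread pool stands in for the cluster of workers a real Map/Reduce framework would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical document chunks, one per worker task.
CHUNKS = [
    "search index search",
    "index cache",
    "search cache cache",
]


def map_chunk(text):
    """Map step: count the words in a single chunk."""
    return Counter(text.split())


def merge(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


def word_count(chunks, workers=2):
    """Run the map step on a thread pool, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())
```

Because the map step touches each chunk independently, the same structure scales out to many machines; in CPython, threads mainly illustrate the shape of the computation rather than giving a CPU speedup.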

Page 20: Search domain basics

Contd…

Page 21: Search domain basics

Thank You!