nitin-pande
DESCRIPTION
This presentation gives an introduction to search engines: what they are and how they work. It also has a brief introduction to Solr and Lucene.
ENTERPRISE SEARCH
an introduction
Web Search
Desktop Search
Enterprise Search
so what is a
Search Engine?
a SOFTWARE
• that builds an index on text
• answers queries using that index
Any search application has two major components
INDEXING component - of importance to us developers (read headache)
SEARCH component - of importance to the users
INDEXING component: data is indexed into INDEX FILES
SEARCH component: user sends search query, receives search results
Let’s start with
INDEXING
is it easy to search here . . .
or here . . .
• that's information like garbage
• no structure
• comes in all kinds of shapes, sizes, formats
• and this is what indexing does:
• makes data available in a structured format, easily accessible through search
so what all needs to be
Indexed and Searched ?
various FILE FORMATS
Text Files
HTML
PDF
MS Word
PPT
coming from various DATA SOURCES
Emails
CMS
File System
Database
Web Pages
data (documents)
text that should be indexed is fed to the Analyzer:
• removing stop words such as "a" or "the"
• converting all text to lowercase letters for case-insensitive searching
• stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")
tokenized text is passed to the Index Writer
Index Writer writes the INDEX FILES
user sends search query, receives search results
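The analyzer steps above can be sketched in a few lines of Python. This is only a toy illustration, not Lucene's actual Analyzer: the stop-word list, the suffix list, and the `stem` and `analyze` names are assumptions made for the example, and real stemmers (such as Porter's) use far more careful rules.

```python
import re

# Assumed, deliberately tiny stop-word and suffix lists (illustration only).
STOP_WORDS = {"a", "an", "the"}
SUFFIXES = ("ing", "ed", "er", "s")


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The fisher fished a fish while fishing"))
```

The output of `analyze` is the "tokenized text" that would be handed to the index writer.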
Document 1: Coffee isn't my cup of tea.
Document 2: Chocolate, men, coffee - some things are better rich.
INDEX
coffee - 1, 2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
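The index for these two documents is an inverted index: each term maps to the documents that contain it. A minimal sketch of building one follows, assuming a stop-word list chosen so that the words missing from the slide's index ("isn't", "my", "of", "some", "are") are dropped. The `build_index` and `tokenize` names are hypothetical, and stemming is left out to keep the example short.

```python
import re
from collections import defaultdict

# Assumed stop words: the terms that the slide's index leaves out.
STOP_WORDS = {"a", "the", "of", "my", "some", "are", "isn't"}

DOCS = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}


def tokenize(text):
    """Lowercase the text and keep word tokens that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index(DOCS)
for term, doc_ids in index.items():
    print(term, "-", doc_ids)
```

Answering a one-word query is then a dictionary lookup, for example `index.get("coffee", [])`.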
And now the
SEARCH Component
data is indexed into INDEX FILES
user sends search query, receives search results
search terms
Search Request Terms
Taxonomy
Spelling Index
Correct Search Terms + Incorrect Search Terms
Search Terms + Related Terms from Taxonomy + Concept IDs
Search engine (INDEX)
Search results with
1) Actual Location of the result
2) Rank
3) Details
4) Facet Categorization
Results’ Page
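The search flow above (spelling correction, taxonomy expansion, index lookup, ranked results) could look roughly like the following sketch. Everything here is an assumption for illustration: the tiny `INDEX` and `TAXONOMY` dictionaries are made up, `difflib` stands in for a real spelling index, and ranking by the number of matched terms is just one simple choice.

```python
import difflib

# Hypothetical index (term -> document ids) and taxonomy (term -> related terms).
INDEX = {"coffee": [1, 2], "cup": [1], "tea": [1], "chocolate": [2]}
TAXONOMY = {"coffee": ["tea"]}


def correct(term, vocabulary):
    """Replace a misspelled term with the closest indexed term, if any."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term


def search(query):
    """Return document ids ranked by how many query terms they match."""
    vocabulary = list(INDEX)
    terms = [correct(t, vocabulary) for t in query.lower().split()]
    # Add related terms from the taxonomy after the corrected terms.
    expanded = list(terms)
    for term in terms:
        expanded.extend(r for r in TAXONOMY.get(term, []) if r not in expanded)
    scores = {}
    for term in expanded:
        for doc_id in INDEX.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("cofee"))
```

Here the misspelled "cofee" is corrected to "coffee", expanded with "tea", and the matching documents are returned in rank order.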
introducing
LUCENE
Full-text search library
Open Source
Documents are collections of fields (fed in as XML when used via Solr)
Can operate on its own or via Solr
Ways of storing fields of any document:
Indexed means it is searchable
Stored means the content can be displayed in the search results; you may choose to store a field without making it searchable. Example: "summary associated with a page"
Tokenized means it is run through an Analyzer that converts the content into a sequence of tokens
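To make the three options concrete, here is a toy model in Python. It is not Lucene's API: the `Field` class, the `add_document` function, and the whitespace tokenizer are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True    # searchable
    stored: bool = True     # returned with search results
    tokenized: bool = True  # split into tokens before indexing


def add_document(doc_id, fields, index, store):
    """Update a toy index and a toy stored-field table for one document."""
    for f in fields:
        if f.indexed:
            # Untokenized fields are indexed as a single exact term.
            terms = f.value.lower().split() if f.tokenized else [f.value]
            for term in terms:
                index.setdefault((f.name, term), set()).add(doc_id)
        if f.stored:
            store.setdefault(doc_id, {})[f.name] = f.value


index, store = {}, {}
add_document(1, [
    Field("body", "Solr is a search server", stored=False),  # searchable only
    Field("summary", "An intro to Solr", indexed=False),     # display only
    Field("id", "05991", tokenized=False),                   # exact match
], index, store)

print(sorted(index))
print(store)
```

The "body" text can be found by a search but is not returned, while the "summary" is returned with results but cannot be searched.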
introducing
SOLR
Solr
Lucene
Index
• open source
• handles index and query requests to Lucene via HTTP and XML (also JSON)
• manages document update, add and delete requests to Lucene
• straightforward schema and config files
• comprehensive HTML Admin Interfaces
• highly configurable
Adding Documents to SOLR
HTTP POST to /update
<add>
  <doc boost="2">
    <field name="type">05991</field>
    <field name="from">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
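As a rough sketch, the same kind of add request could be built and posted from Python using only the standard library. The URL is an assumption (the slides only say "/update" and use localhost:8983 for queries), `build_add_xml` and `post_update` are hypothetical helper names, the document boost is omitted, and a commit may also be needed before the document becomes searchable.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update URL; adjust to the actual Solr installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def post_update(xml_body, url=SOLR_UPDATE_URL):
    """POST the XML message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


message = build_add_xml([
    ("type", "05991"),
    ("category", "search"),
    ("category", "lucene"),
])
print(message)
# post_update(message)  # uncomment when a Solr server is actually running
```

Repeating a field name, as with "category" here, produces a multi-valued field in the message, just like in the XML above.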
schema.xml file
defines field indexing and display
solrconfig.xml file
defines cache size, faceted field type, request handler customization
Deleting Documents
• Delete by Id
<delete><id>05591</id></delete>
• Delete by Query (multiple documents)
<delete>
<query>manufacturer:microsoft</query>
</delete>
Search Results
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
Default Parameters
param   default    description
q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl      *          Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<response>
  <responseHeader>
    <status>0</status>
    <QTime>1</QTime>
  </responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
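A client could build the query URL and read such a response as in the sketch below. It uses only the Python standard library. The `build_select_url` and `parse_results` names are hypothetical, and the sample response is a trimmed copy of the one above so that the example runs without a Solr server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
"""


def build_select_url(base, **params):
    """Append URL-encoded query parameters (q, start, rows, fl, ...) to base."""
    return base + "?" + urlencode(params)


def parse_results(xml_text):
    """Return (numFound, list of {field name: text}) from a Solr XML response."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = [
        {child.get("name"): child.text for child in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


url = build_select_url("http://localhost:8983/solr/select",
                       q="video", start=0, rows=2, fl="name,price")
total, docs = parse_results(SAMPLE_RESPONSE)
print(url)
print(total, docs)
```

Note that `urlencode` percent-encodes the comma in `fl`, which is equivalent to the literal comma in the URL shown earlier. Field values come back as strings, so a price would still need converting with `float` before doing arithmetic.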
Solr Core
Lucene
Admin Interface
Standard Request Handler
Disjunction Max Request Handler
Custom Request Handler
Update Handler
Caching
XML Update Interface
Config
Analysis
HTTP Request Servlet
Concurrency
Update Servlet
XML Response Writer
Replication
Schema
Search requests hit here
New documents to be added here