8/13/2019 Lucene and Solr
http://slidepdf.com/reader/full/lucene-and-solr 1/24
Lucene and Solr
Lucene
Created by Doug Cutting in 1999
Donated to Apache in 2001
Features
◦ Highly scalable
◦ Java (1.4)
◦ Ports to many other languages
◦ No crawler
◦ No document parsing
◦ No “PageRank”
Lucene
Powered by Lucene
◦ IBM Omnifind Y! Edition
◦ Technorati
◦ Wikipedia
◦ Internet Archive
◦ monster.com
Indexing
Logical structure
◦ Index is a collection of documents
◦ Documents are a collection of fields
◦ Fields are the content
    Stored – stored verbatim for retrieval with results
    Indexed – tokenized and made searchable
◦ Indexed terms are stored in an inverted index
Physical structure
◦ Multiple documents (with all fields) are stored in segments
    mergeFactor
◦ All segments together make up the index
IndexWriter is the interface object for the entire index
Indexing
Documents
    0   Little Red Riding Hood
    1   Robin Hood
    2   Little Women
Inverted index (term → document IDs)
    aardvark    (no documents)
    hood        0, 1
    little      0, 2
    red         0
    riding      0
    robin       1
    women       2
    zoo         (no documents)
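The term-to-document mapping for the three example titles can be reproduced with a minimal Python sketch. It assumes the simplest possible analysis (lowercasing and splitting on whitespace); the function name `build_inverted_index` is illustrative and is not part of Lucene.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased, whitespace-separated term to the sorted IDs
    of the documents that contain it (hypothetical helper, not Lucene API)."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms alphabetically, as a term dictionary would be.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access, which is why term queries against an inverted index are cheap compared with scanning every document.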
Indexing
Analysis
◦ Extract tokens from text (tokenizer)
    Whitespace
    Hyphens
◦ Manipulate or modify tokens (token filter)
    Stemming
    Removal
◦ Tokenizer / token filter chains are called analyzers
Indexing
Analysis chain example
Input: LexCorp BFG-9000
WhitespaceTokenizer → LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1) → Lex | Corp | LexCorp | BFG | 9000
LowerCaseFilter → lex | corp | lexcorp | bfg | 9000
Searching
Query Creation
◦ Query parser
◦ Manual query construction from terms
◦ Example: title:"Bell" author:"Hemingway"^3.0
Query terms are analyzed
◦ Same analyzer for indexing and searching on each field
Searching
Index-time analysis
Input: LexCorp BFG-9000
WhitespaceTokenizer → LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1) → Lex | Corp | LexCorp | BFG | 9000
LowerCaseFilter → lex | corp | lexcorp | bfg | 9000
Query-time analysis
Input: Lex corp bfg9000
WhitespaceTokenizer → Lex | corp | bfg9000
WordDelimiterFilter (catenateWords=0) → Lex | corp | bfg | 9000
LowerCaseFilter → lex | corp | bfg | 9000
A match!
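The two analysis chains can be imitated in a few lines of Python to see why the query matches. This is only a rough sketch: the regular expression is a simplified stand-in for WordDelimiterFilter (it splits on case changes, letter/digit boundaries, and punctuation), and all function names here are hypothetical rather than Lucene or Solr APIs.

```python
import re

# Simplified word-part pattern: an uppercase run, a capitalized or
# lowercase word, or a run of digits. An assumption for illustration.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def whitespace_tokenize(text):
    return text.split()


def word_delimiter(tokens, catenate_words=False):
    out = []
    for token in tokens:
        parts = _PART.findall(token)
        out.extend(parts)
        if catenate_words:
            run = []
            for part in parts + [""]:  # empty sentinel flushes the last run
                if part.isalpha():
                    run.append(part)
                else:
                    if len(run) > 1:  # only join runs of two or more word parts
                        out.append("".join(run))
                    run = []
    return out


def lowercase(tokens):
    return [t.lower() for t in tokens]


def analyze(text, catenate_words):
    return lowercase(word_delimiter(whitespace_tokenize(text), catenate_words))


index_terms = analyze("LexCorp BFG-9000", catenate_words=True)
query_terms = analyze("Lex corp bfg9000", catenate_words=False)
is_match = set(query_terms) <= set(index_terms)
print(index_terms, query_terms, is_match)
```

Every query term appears among the indexed terms, so the document is found even though the user typed the product name with different spacing, casing, and punctuation.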
Searching
Many query types
◦ Term
◦ Phrase
    "bad wolf"
◦ Proximity
    "quick fox"~4
◦ Wildcard and prefix
    pla?e (plate or place or plane)
    practic* (practice or practical or practically)
◦ Fuzzy (edit distance)
    planting~0.75 (granting or planning)
    roam~ (default is 0.5)
◦ Range
    date:[05072007 TO 05232007] (inclusive)
    author:{king TO mason} (exclusive)
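The matching rules behind the wildcard, fuzzy, and range examples can be sketched in plain Python. This is an illustration of the concepts only, with hypothetical helper names; in particular, Lucene turns the edit distance into a similarity score that is compared against the threshold after `~`, and that scoring step is not reproduced here.

```python
import re


def wildcard_match(pattern, term):
    """'?' matches exactly one character, '*' matches any run of characters."""
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, term) is not None


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def in_range(value, low, high, inclusive):
    """[low TO high] includes the endpoints, {low TO high} excludes them."""
    return low <= value <= high if inclusive else low < value < high


print(wildcard_match("pla?e", "plate"), wildcard_match("practic*", "practically"))
print(edit_distance("planting", "granting"), edit_distance("planting", "planning"))
print(in_range("king", "king", "mason", inclusive=False))
```

Range comparisons here are plain string comparisons, which is also why the way dates are written into a field determines whether a range query over them behaves as expected.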
Searching
Multiple searchers at once
◦ Thread safe
Additions or deletions to the index are not reflected in already open searchers
◦ Must be closed and reopened
Use commit or optimize on IndexWriter
Lucene Sub-projects
Nutch
◦ Web crawler with document parsing
Hadoop
◦ Distributed data processor
◦ Implements MapReduce
Solr
Solr
Developed by Yonik Seeley at CNET
Donated to Apache in 2006
Features
◦ Servlet
◦ Web Administration Interface
◦ XML/HTTP, JSON Interfaces
◦ Faceting
◦ Schema to define types and fields
◦ Highlighting
◦ Caching
◦ Index Replication (Master / Slaves)
◦ Pluggable
◦ Java 5
Solr
Powered by Solr
◦ Netflix
◦ CNET
◦ Smithsonian
◦ AOL: sports and music
◦ RightNow ??
◦ Drupal module
◦ GameSpot
Configuration (solrconfig.xml)
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>
<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>
Schema (schema.xml)
Fields
<uniqueKey>id</uniqueKey>
<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>
<copyField source="keywords" dest="keywordsSorted"/>
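How a field name is resolved against explicit and dynamic field declarations can be sketched as follows. The lookup order shown (explicit fields first, then the first dynamic pattern that matches, in the order listed) is a simplifying assumption for illustration, and `resolve_field_type` is a hypothetical helper, not a Solr API.

```python
from fnmatch import fnmatchcase

# Field declarations copied from the schema excerpt.
EXPLICIT_FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}
DYNAMIC_FIELDS = [("*_i", "integer"), ("desc_*", "string")]


def resolve_field_type(name):
    """Return the declared type for a field name, or None if nothing matches."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for name in ["products", "count_i", "desc_long", "unknown"]:
    print(name, resolve_field_type(name))
```

Dynamic fields let documents carry fields such as `count_i` or `desc_long` without each one being declared in advance.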
Schema
Analyzers
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>
Insertion
◦ HTTP POST to http://localhost:8983/solr/update/
<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
Documents or fields can have boosts attached
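A client-side sketch of the insertion request, using only the Python standard library. The helper names `build_add_message` and `post_update` are hypothetical, the `Content-Type` header is an assumption, and the POST only works against a Solr instance actually listening at the URL from the slide, so the example builds and prints the message without sending it.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Build an <add> message; each document is a list of (name, value)
    pairs so that multi-valued fields such as 'skills' can repeat."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def post_update(xml_body, url="http://localhost:8983/solr/update/"):
    """POST an update message to Solr (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


message = build_add_message([
    [("employeeId", "05991"), ("office", "Bridgewater"),
     ("skills", "Perl"), ("skills", "Java")],
])
print(message)
```

Building the XML with a library rather than string concatenation keeps field values containing `<` or `&` correctly escaped.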
Update / Delete
Inserting a document with an already present uniqueKey will erase the original
Deleting
◦ By uniqueKey field
    <delete><id>05991</id></delete>
◦ By query
    <delete><query>name:Anthony</query></delete>
<commit/>
<optimize/>
Search
Faceting
◦ Available in StandardRequestHandler and DisMaxRequestHandler
Search
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock
<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
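A client might build the faceting request and read the counts back roughly as follows. This is a sketch using the Python standard library; `facet_url` and `parse_facet_counts` are hypothetical helper names, and the sample response is a trimmed copy of the one shown on the slide rather than live server output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def facet_url(base, query, facet_fields):
    """Build a select URL; facet.field is repeated once per facet field."""
    params = [("q", query), ("rows", 0), ("facet", "true"),
              ("facet.limit", -1), ("facet.mincount", 1)]
    params += [("facet.field", name) for name in facet_fields]
    return base + "?" + urlencode(params)


def parse_facet_counts(xml_text):
    """Return {field: {value: count}} from a Solr XML facet response."""
    root = ET.fromstring(xml_text)
    fields = root.find("lst[@name='facet_counts']/lst[@name='facet_fields']")
    return {
        field.get("name"): {c.get("name"): int(c.text) for c in field.findall("int")}
        for field in fields.findall("lst")
    }


SAMPLE_RESPONSE = """
<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""

url = facet_url("http://localhost:8983/solr/select", "ipod", ["cat", "inStock"])
counts = parse_facet_counts(SAMPLE_RESPONSE)
print(url)
print(counts)
```

Setting `rows=0` asks for the facet counts only, without returning any of the matching documents themselves.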
Many more features
Replication
◦ Master / Slave architecture for load balancing and backups
More-like-this
Easy to add RequestHandlers and ResponseWriters
Responses in many formats
Hit highlighting
Sources
http://lucene.apache.org/
http://lucene.apache.org/solr/
http://people.apache.org/~yonik/presentations/