Lucene Boot Camp
Grant Ingersoll, Lucid Imagination
Nov. 4, 2008 New Orleans, LA
Schedule
• In-depth Indexing/Searching
– Performance, Internals
– Filters, Sorting
• Terms and Term Vectors
• Class Project
• Q & A
Day I Recap
• Indexing
– IndexWriter
– Document/Field
– Analyzer
• Searching
– IndexSearcher
– IndexReader
– QueryParser
• Analysis
• Contrib
Indexing In-Depth
• Deletions and Updates
• Optimize
• Important Internals
– File Formats
– Segments, Commits, Merging
– Compound File System
• Performance
Lucene File Formats and Structures
• http://lucene.apache.org/java/2_4_0/fileformats.html
• A Lucene index is made up of one or more Segments
• Lucene tracks Documents internally by an int “id”
• This id may change across index operations
– You should not rely on it unless you know your index isn’t changing
• You can ask for a Document by this id on the IndexReader
Segments
• Each Segment is an independent index containing:
– Field Names
– Stored Field values
– Term Dictionary, proximity info and normalization factors
– Term Vectors (optional)
– Deleted Docs
• Compound File System (CFS) stores all of these logical pieces in a single file
How Lucene Indexes
• Lucene indexes Documents into memory
– At certain trigger points, the in-memory segments are flushed/committed to the Directory
•Can be forced by calling commit()
– Segments are periodically merged (more in a moment)
Segments and Merging
• May be created when new documents are added
• Are merged from time to time based on segment size in relation to:
– MergePolicy
– MergeScheduler
– Optimization
Merge Policy
• Identifies Segments to be merged
• Two Current Implementations
– LogDocMergePolicy
– LogByteSizeMergePolicy
• mergeFactor - Max # of segments allowed before merging
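To make mergeFactor concrete, here is a toy simulation in plain Java. It is not Lucene's MergePolicy API: the class and method names are hypothetical, and it assumes every added document is flushed as its own one-document segment and that a merge fires as soon as mergeFactor segments of equal size sit next to each other. That is only roughly how a logarithmic merge policy behaves.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a logarithmic merge policy; this is NOT Lucene's MergePolicy API.
final class MergeFactorSketch {

    // Simulates adding docCount documents, one flushed segment per document,
    // and returns the sizes (in documents) of the segments that remain.
    static List<Integer> index(int docCount, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < docCount; i++) {
            segments.add(1); // assumption: each document becomes a new segment
            mergeIfNeeded(segments, mergeFactor);
        }
        return segments;
    }

    // While the newest mergeFactor segments all have the same size,
    // replace them with one merged segment; merges can cascade.
    private static void mergeIfNeeded(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) {
                    return; // the trailing run is not uniform, so nothing to merge
                }
            }
            segments.subList(n - mergeFactor, n).clear();
            segments.add(size * mergeFactor);
        }
    }
}
```

In this model, MergeFactorSketch.index(25, 10) leaves two 10-document segments plus five single-document ones, while a mergeFactor of 3 keeps far fewer segments around at the cost of merging more often.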
MergeScheduler
• Responsible for performing the merge
• Two Implementations:
– Serial - blocking
– Concurrent - new, background
Optimize
• Optimize is the process of merging segments down into a single segment
• This process can yield significant speedups in search
• Can be slow
• Can also do partial optimizes
Final Thoughts On Merging
• Usually don’t have to think about it, except when to optimize
• In high-update, performance-critical environments, you may need to dig into it more, as merging can sometimes cause long pauses
• Good to optimize when you can; otherwise, keep a low mergeFactor
Deletion
• A deletion only marks the Document as deleted
– Doesn’t get physically removed until a merge
• Deletions can be a bit confusing
– Both IndexReader and IndexWriter have delete methods
•By: id, term(s), Query(s)
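The mark-now, remove-at-merge behavior can be sketched with a toy segment in plain Java. This is not Lucene code: the class is hypothetical, and maxDoc()/numDocs() are named after the IndexReader methods only to make the analogy easier to follow. It also shows why internal ids should not be relied on once the index changes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of mark-then-merge deletion; this is NOT Lucene's API.
final class DeletionSketch {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    // Adds a document and returns its internal id.
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Deleting only sets a bit; the document still occupies its slot.
    void delete(int id) {
        if (id < 0 || id >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + id);
        }
        deleted.set(id);
    }

    int maxDoc() { return docs.size(); }                          // slots, deleted included
    int numDocs() { return docs.size() - deleted.cardinality(); } // live documents only

    String document(int id) { return docs.get(id); }

    // A merge rewrites the segment without the marked documents,
    // so the surviving documents are renumbered.
    DeletionSketch merge() {
        DeletionSketch merged = new DeletionSketch();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id)) {
                merged.add(docs.get(id));
            }
        }
        return merged;
    }
}
```

After adding three documents and deleting id 1, maxDoc() is still 3 while numDocs() is 2; only the merged copy drops the slot, and the document that used to be id 2 becomes id 1.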
Task
– Build your index from yesterday and then try some deletes
•Id, term, Query
– Also try out an optimize on a FSDirectory against the full Reuters sample
– 15-20 minutes
Updates
• Updates are always a delete and an add
• Updates are always a delete and an add
– Yes, that is a repeat!
– Nature of data structures used in search
• See IndexWriter.updateDocument()
Performance Factors
• setRAMBufferSizeMB
– New model for automagically controlling indexing factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs
• maxBufferedDocs
– Minimum # of docs buffered before a flush occurs and a new segment is created
– Usually, Larger == faster, but more RAM
More Factors
• mergeFactor
– How often segments are merged
– Smaller == less RAM, better for incremental updates
– Larger == faster, better for batch indexing
• maxFieldLength
– Limit the number of terms indexed for a Field in a Document
• Analysis
• Reuse
– Document, TokenStream, Token
Index Threading
• IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
– Index to separate Directory instances
– Merge using IndexWriter.addIndexes
– Could also distribute and collect
Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2 and 2.3
– contrib/benchmark/conf:
•indexing.alg
•indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results
Version        Records/Sec   Avg. T Mem
2.2                    421          39M
Trunk                2,122          52M
Trunk-mt (4)         3,680          57M
Your results will depend on analysis, etc.
Searching
• Earlier we touched on basics of search using the QueryParser
• Now look at:
– Searcher/IndexReader Lifecycle
– Query classes
– More details on the QueryParser
– Filters
– Sorting
Lifecycle
• Recall that the IndexReader loads a snapshot of the index into memory
– This means updates made since loading the index will not be seen
• Business rules are needed to define how often to reload the index, if at all
– IndexReader.isCurrent() can help
• Loading an index is an expensive operation
– Do not open a Searcher/IndexReader for every search
Reopen
• It is possible to have IndexReader reopen new or changed segments
– Saves some of the cost of loading a new index
• Does not close the old reader, so the application must
• See DeletionsUpdatesTest.testReopen()
Query Classes
• TermQuery is the basis for all non-span queries
• BooleanQuery combines multiple Query instances as clauses
– should
– required
• PhraseQuery finds terms occurring near each other, position-wise
– “slop” is an edit distance counted in term positions: how many position moves are allowed between the query terms
• Take 2-3 minutes to explore Query implementations
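As a rough picture of how BooleanQuery clauses combine, the sketch below works on plain sets of matching document ids. It is a toy model, not Lucene's API: the class name is made up, scoring is ignored, and the prohibited list is an addition that corresponds to the "-" operator of the query syntax.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of BooleanQuery clause semantics over sets of matching doc ids;
// this is NOT Lucene's API, and scoring is ignored.
final class BooleanClauseSketch {

    static Set<Integer> match(List<Set<Integer>> required,
                              List<Set<Integer>> should,
                              List<Set<Integer>> prohibited) {
        Set<Integer> result = new TreeSet<>();
        if (!required.isEmpty()) {
            // Every required clause must match: intersect them.
            // "should" clauses would then only influence the score.
            result.addAll(required.get(0));
            for (Set<Integer> clause : required) {
                result.retainAll(clause);
            }
        } else {
            // With no required clauses, at least one "should" clause must match.
            for (Set<Integer> clause : should) {
                result.addAll(clause);
            }
        }
        // Prohibited clauses remove documents from the result.
        for (Set<Integer> clause : prohibited) {
            result.removeAll(clause);
        }
        return result;
    }
}
```

Thinking in terms of these three lists, rather than AND/OR/NOT, tends to make the behavior of mixed queries easier to predict.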
Spans
• Spans provide information about where matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery classes
– SpanNearQuery is useful for doing phrase matching
QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
– Better to think in terms of required (+ operator) and not allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of the QP
Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single term that can be used for comparison
• The SortField defines the different sort types available
– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
Sorting II
• Look at Searcher, Sort and SortField
• Custom sorting is done with a SortComparatorSource
• Sorting can be very expensive
– Terms are cached in the FieldCache
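The cost of sorting is easier to see with a toy model of a field cache: one array slot per document in the index, holding that document's single sort term. The sketch below is plain Java with hypothetical names, not the FieldCache API. It assumes the cache array has already been filled for every document, which is the expensive, memory-hungry step.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of sorting hits through a per-document term cache;
// this is NOT Lucene's FieldCache or Sort API.
final class FieldSortSketch {

    // fieldCache[docId] holds the single sort term of that document.
    // The array covers every document in the index, not just the hits.
    static List<Integer> sortByField(List<Integer> hits, String[] fieldCache) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byTerm = fieldCache[a].compareTo(fieldCache[b]);
            // Break ties on the document id so the order is stable and predictable.
            return byTerm != 0 ? byTerm : Integer.compare(a, b);
        });
        return sorted;
    }
}
```

Because the comparison needs exactly one value per document, a Field that produces several terms per document has no well-defined sort position in this scheme.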
Filters
• Filters restrict the search space to a subset of Documents
• Use Cases
– Search within a Search
– Restrict by date
– Rating
– Security
– Author
Filter Classes
• QueryWrapperFilter (QueryFilter)
– Restrict to subset of Documents that match a Query
• RangeFilter
– Restrict to Documents that fall within a range
– Better alternative to RangeQuery
• CachingWrapperFilter
– Wrap another Filter and provide caching
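Conceptually a filter is a set of allowed document ids that is intersected with the query's matches. The sketch below models that with java.util.BitSet in plain Java. The DocFilter interface, the range helper, and the caching wrapper are all hypothetical stand-ins for illustration, not the Lucene classes listed above.

```java
import java.util.BitSet;

// Toy model of filters as bit sets of allowed doc ids; this is NOT Lucene's Filter API.
final class FilterSketch {

    // Hypothetical stand-in for a filter: one bit per allowed document id.
    interface DocFilter {
        BitSet bits();
    }

    // Range filter over a per-document numeric value (a rating, a date stored as an int, ...).
    static DocFilter range(int[] valueByDoc, int lower, int upper) {
        return () -> {
            BitSet bits = new BitSet(valueByDoc.length);
            for (int doc = 0; doc < valueByDoc.length; doc++) {
                if (valueByDoc[doc] >= lower && valueByDoc[doc] <= upper) {
                    bits.set(doc);
                }
            }
            return bits;
        };
    }

    // Caching wrapper: computes the wrapped filter's bits once and reuses them.
    static final class CachingFilter implements DocFilter {
        private final DocFilter inner;
        private BitSet cached;
        int computations = 0; // exposed only so the caching effect is observable

        CachingFilter(DocFilter inner) {
            this.inner = inner;
        }

        @Override
        public BitSet bits() {
            if (cached == null) {
                cached = inner.bits();
                computations++;
            }
            return cached;
        }
    }

    // Restricts the query's hits to the documents the filter allows.
    static BitSet apply(BitSet hits, DocFilter filter) {
        BitSet result = (BitSet) hits.clone();
        result.and(filter.bits());
        return result;
    }
}
```

A cached bit set is only valid for the index snapshot it was computed against, so in this model a reloaded index would need a fresh CachingFilter.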
Task
• Modify your program to sort by a field and to filter by a query or some other criteria
– ~15 minutes
Searchers
• MultiSearcher
– Search over multiple Searchables, including remote
• MultiReader
– Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes
• ParallelMultiSearcher
– Like MultiSearcher, but threaded
• RemoteSearchable
– RMI based remote searching
• Look at MultiSearcherTest in example code
Expert Results
• Searcher has several “expert” methods
• HitCollector allows low-level access to all Documents as they are scored
Search Performance
• Search speed is based on a number of factors:
– Query Type(s)
– Query Size
– Analysis
– Occurrences of Query Terms
– Optimize
– Index Size
– Index type (RAMDirectory, other)
– Usual Suspects
•CPU
•Memory
•I/O
•Business Needs
Query Types
• Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards
• Avoid starting a WildcardQuery with a wildcard
• Use ConstantScoreRangeQuery instead of RangeQuery
• Be careful with range queries and dates
– User mailing list and Wiki have useful tips for optimizing date handling
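Why wildcards are costly, and leading wildcards especially so, can be illustrated with a sorted set standing in for the term dictionary. This is a plain-Java sketch with hypothetical names, not Lucene's WildcardQuery. Every term it returns would become one clause of the rewritten BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Toy model of wildcard expansion against a sorted term dictionary;
// this is NOT Lucene's WildcardQuery.
final class WildcardSketch {

    // Trailing wildcard ("luc*"): seek to the prefix in the sorted terms
    // and stop at the first term that no longer starts with it.
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> expanded = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }

    // Leading wildcard ("*r"): there is nowhere to seek to,
    // so every term in the dictionary has to be examined.
    static List<String> expandSuffix(SortedSet<String> terms, String suffix) {
        List<String> expanded = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                expanded.add(term);
            }
        }
        return expanded;
    }
}
```

On a large index a short prefix can expand to thousands of terms, which is where both the slowdown and errors about too many clauses come from.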
Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same terms
• Disambiguation
– May be useful when doing synonym expansion
– Difficult to automate and may be slower
– Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
– Use most important words
– “Important” can be defined in a number of ways
Usual Suspects
• CPU
– Profile your application
• Memory
– Examine your heap size, garbage collection approach
• I/O
– Cache your Searcher
•Define business logic for refreshing based on indexing needs
– Warm your Searcher before going live -- See Solr
• Business Needs
– Do you really need to support Wildcards?
– What about date range queries down to the millisecond?
FieldSelector
• Prior to version 2.1, Lucene always loaded all Fields in a Document
• FieldSelector API addition allows Lucene to skip large Fields
– Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
• Makes storage of original content more viable without large cost of loading it when not used
• FieldSelectorTest in example code
Relevance
• At some point along your journey, you will get results that you think are “bad”
• Is it a big deal?
– Content, Content, Content!
– Relevance Judgments
– Don’t break other queries just to “fix” one
• Hardcode it!
– A query doesn’t always have to result in a “search”
Scoring and Similarity
• Lucene has a sophisticated scoring mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight and Scorer classes
Explanations
• The explain(Query, int) method is useful for understanding why a Document scored the way it did
• Shows all the pieces that went into scoring the result:
– TF, DF, boosts, etc.
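To get a feel for the pieces an explanation lists, the sketch below recomputes a simplified field weight for a single term in plain Java. It assumes the DefaultSimilarity formulas (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1)), length norm as 1 / sqrt(number of terms in the field)) and leaves out queryNorm, coord, and the lossy encoding of norms, so its numbers will not match real explain() output exactly. The class and method names are hypothetical.

```java
import java.util.Locale;

// Simplified single-term scoring sketch, assuming DefaultSimilarity-style formulas;
// this is NOT Lucene's Explanation API and omits queryNorm, coord and norm encoding.
final class ExplainSketch {

    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Index-time boost is folded into the norm together with the length normalization.
    static double fieldNorm(int fieldLength, double boost) {
        return boost / Math.sqrt(fieldLength);
    }

    static double fieldWeight(int freq, int docFreq, int numDocs, int fieldLength, double boost) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm(fieldLength, boost);
    }

    // Renders the factors in a layout loosely modeled on an explanation tree.
    static String explain(int freq, int docFreq, int numDocs, int fieldLength, double boost) {
        return String.format(Locale.ROOT,
                "%.3f = fieldWeight, product of:\n"
                        + "  %.3f = tf(freq=%d)\n"
                        + "  %.3f = idf(docFreq=%d, numDocs=%d)\n"
                        + "  %.3f = fieldNorm(length=%d, boost=%.1f)",
                fieldWeight(freq, docFreq, numDocs, fieldLength, boost),
                tf(freq), freq,
                idf(docFreq, numDocs), docFreq, numDocs,
                fieldNorm(fieldLength, boost), fieldLength, boost);
    }
}
```

Changing one input at a time (a rarer term, a longer field, a higher boost) shows which factor is responsible when a Document ranks higher or lower than expected.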
Tuning Relevance
• FunctionQuery from Solr (variation in Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• Boosts
Task
• Open Luke and try some queries and then use the “explain” button
• Or, write some code to do explains on a query and some documents
• See how Query type, boosting, other factors play a role in the score
Terms and Term Vectors
• Sometimes you need access to the Term Dictionary:
– Auto suggest
– Frequency information
• Sometimes you need a Document-centric view of terms, frequencies, positions and offsets
– Term Vectors
Term Information
• TermEnum gives access to terms and how many Documents they occur in
– IndexReader.terms()
• TermDocs gives access to the frequency of a term in a Document
– IndexReader.termDocs()
– TermPositions extends TermDocs and provides access to position and payload info
– IndexReader.termPositions()
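The three views above (terms with document counts, per-document frequencies, per-document positions) map onto a small inverted index. The sketch below builds one in plain Java. It is a toy with hypothetical names and whitespace tokenization, not the TermEnum/TermDocs/TermPositions API, and payloads are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (doc id -> positions); this is NOT Lucene's API.
final class TermInfoSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Assumption: lowercased, whitespace-separated tokens stand in for real analysis.
    void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Term dictionary in sorted order (the TermEnum-like view).
    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    // Number of documents containing the term.
    int docFreq(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Positions of the term inside one document (the TermPositions-like view).
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        List<Integer> found = docs == null ? null : docs.get(docId);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    // Frequency of the term inside one document (the TermDocs-like view).
    int freq(String term, int docId) {
        return positions(term, docId).size();
    }
}
```

A sorted term dictionary with document counts is also what makes features such as auto suggest cheap: the candidates for a prefix sit next to each other and already carry their popularity.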
Term Vectors
• Term Vectors give access to term frequency information in a given Document
– IndexReader.getTermFreqVector
• TermVectorMapper provides callbacks for working with Term Vectors
TermsTest
• Provides samples of working with terms and term vectors
Lunch ?
1-2:30
Recap
• Indexing
• Searching
• Performance
• Odds and Ends
– Explains
– FieldSelector
– Relevance
– Terms and Term Vectors
Class Project
• Your chance to really dig in and get your hands dirty
• Ask Questions
• Options…
Option I
• Start building out your Lucene Application!
– Index your Data (or any data)
•Threading/Updates/Deletions
•Analysis
– Search
•Caching/Warming
•Dealing with Updates
•Multi-threaded
– Display
Option II
• Dig deeper into an area of interest
– Performance
•How fast can you index?
•Search? Queries per Second?
– Analysis
– Query Parsing
– Scoring
– Contrib
Option III
• Dig into JIRA issues and find something to fix in Lucene
• https://issues.apache.org/jira/secure/Dashboard.jspa
• http://wiki.apache.org/lucene-java/HowToContribute
Option V
• Other?
– Architecture Review/Discussion
– Use Case Discussion
Project Post-Mortem
• Volunteers to share?
Open Discussion
• Multilingual Best Practices
– UNICODE
– One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr
Resources
• [email protected]
• Lucid Imagination
– Support
– Training
– Value Add
– [email protected]
Finally…
• Please take the time to fill out a survey to help me improve this training
– Located in base directory of source
– Email it to me at [email protected]
• There are several Lucene related talks on Wednesday