Presentation at the Java Hellenic User Group (JHUG) about the open source search engine Lucene
Introduction to Information Retrieval with Lucene
By Stylianos Gkorilas
Introductions
Presenter: Architect / Development Team Leader @ Trasys Greece
Java EE projects for European Agencies
IR (Information Retrieval)
The tracing and recovery of specific information from stored data.
IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.
Lucene
Open Source – Apache Software License (http://lucene.apache.org)
Founder: Doug Cutting
0.01 release in March 2000 (on SourceForge)
1.2 release in June 2002 (first Apache Jakarta release)
Became its own top-level Apache project in 2005
Current version is 3.1
More Lucene Intro…
Lucene is a high-performance, scalable IR library (not a ready-to-use application)
A number of full-featured search applications are built on top of it (more later…)
Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)
Lucene "Powered By" apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucene-java/PoweredBy
Components of a Search Application (1/4)
Acquire Content
Gather and scope the content, e.g. from the web with a spider or crawler, from a CMS, a database, or a file system
Projects that help:
Solr: handles RDBMS and XML feeds, and rich documents through Tika integration
Nutch: web crawler, a sister project at Apache
Grub: open source web crawler
Components of a Search Application (2/4)
Build Document
Define the document: the unit of the search engine
Has fields
De-normalization involved
Projects that help (usually the same frameworks cover both this and the previous step):
Compass and its evolution, ElasticSearch
Hibernate Search
DBSight
Oracle/Lucene Integration
Components of a Search Application (3/4)
Analyze Document
Handled by Analyzers, built-in and contributed
Built with tokenizers and token filters
Index Document
Through the Lucene API or your framework of choice
Search User Interface / Render Results
Application specific
Components of a Search Application (4/4)
Query Builder
Lucene provides one; frameworks provide extensions, but so may the application itself, e.g. an advanced search form
Run Query
Retrieve documents by running the query built
Three common theoretical models:
Boolean model
Vector space model
Probabilistic model
Administration
e.g. tuning options
Analytics reporting
How Lucene models content
Documents
Fields
Denormalization of content
Flexible Schema
Inverted Index
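As a tiny worked example of an inverted index, assume two one-field documents (contents are illustrative):

Doc 1: "jhug meeting saturday"
Doc 2: "lucene meetup saturday"

Term     -> Documents
jhug     -> 1
lucene   -> 2
meeting  -> 1
meetup   -> 2
saturday -> 1, 2

Instead of scanning every document for a term, the engine looks the term up and reads its posting list.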
Basic Lucene Classes
Indexing
IndexWriter
Directory
Analyzer
Document
Field
Searching
IndexSearcher
Query
TopDocs
Term
QueryParser
Basic Indexing
Adding documents
RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc); // the document is analyzed and buffered
writer.close();          // flushes, commits and releases the write lock
Deleting and updating documents (sketched below)
Field options:
Store
Analyze
Norms
Term vectors
Boost
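A minimal sketch of deletes, updates and the field options above, assuming Lucene 3.x (field names and values are illustrative):

// delete every document whose "id" field contains the given term
writer.deleteDocuments(new Term("id", "42"));
// an update is an atomic delete-then-add keyed on a term (newDoc: a freshly built Document)
writer.updateDocument(new Term("id", "42"), newDoc);
// field options are chosen per Field at construction time
Field title = new Field("title", "Lucene intro",
    Field.Store.YES,                          // store the original value
    Field.Index.ANALYZED,                     // pass it through the analyzer
    Field.TermVector.WITH_POSITIONS_OFFSETS); // keep term vectors
title.setBoost(1.5f); // static boost, applied at indexing time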
Scoring – The formula
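The formula itself (the classic Lucene DefaultSimilarity scoring function, reconstructed here to match the factors described below):

score(q,d) = coord(q,d) × queryNorm(q) × Σ over t in q of [ tf(t in d) × idf(t)² × boost(t.field in d) × lengthNorm(t.field in d) ]

where: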
tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.
idf(t): Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf.
boost(t.field in d): Field & Document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.
lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.
coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents
queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.
Querying – the API
Variety of Query class implementations (a combination sketch follows the list):
TermQuery
PhraseQuery
TermRangeQuery
NumericRangeQuery
PrefixQuery
BooleanQuery
WildcardQuery
FuzzyQuery
MatchAllDocsQuery
…
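A minimal sketch combining some of these, assuming an analyzed "post" text field and a numeric "year" field (both illustrative):

TermQuery text = new TermQuery(new Term("post", "lucene"));
Query years = NumericRangeQuery.newIntRange("year", 2000, 2011, true, true);
BooleanQuery both = new BooleanQuery();
both.add(text, BooleanClause.Occur.MUST);    // the term must match
both.add(years, BooleanClause.Occur.SHOULD); // matching the range improves the score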
Querying - Example
private void indexSingleFieldDocs(Field[] fields) throws Exception {
IndexWriter writer = new IndexWriter(directory,
new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < fields.length; i++) {
Document doc = new Document();
doc.add(fields[i]);
writer.addDocument(doc);
}
writer.optimize();
writer.close();
}
public void wildcard() throws Exception {
indexSingleFieldDocs(new Field[]
{ new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
IndexSearcher searcher = new IndexSearcher(directory, true);
Query query = new WildcardQuery(new Term("contents", "?ild*"));
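// "?ild*" matches wild, mild and mildew, but not child: '?' consumes exactly one character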
TopDocs matches = searcher.search(query, 10);
}
Querying - QueryParser
Query query = new QueryParser(Version.LUCENE_31, "subject",
    analyzer).parse("(clinical OR ethics) AND methodology");
Query syntax examples:
trachea AND esophagus
trachea esophagus (the default operator is OR)
cough AND (trachea OR esophagus)
trachea NOT esophagus
full_title:trachea
"trachea disease"
"trachea disease"~5 (proximity)
is_gender_male:y
[2010-01-01 TO 2010-07-01] (range)
esophaguz~ (fuzzy)
Trachea^5 esophagus (term boost)
Analyzers - Internals
Analyzers run at indexing and at querying time
Inside an analyzer:
Operates on a TokenStream
A token has a text value and metadata like:
start/end character offsets
token type
position increment
optionally application-specific bit flags and a byte[] payload
TokenStream is abstract; Tokenizer and TokenFilter are the concrete ones
A Tokenizer reads chars and produces tokens; a token filter ingests tokens and produces new ones
The composite pattern is implemented and they form a chain of one another (see the sketch below)
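A minimal sketch of such a chain, assuming the Lucene 3.x API (constructor signatures vary slightly across 3.x releases):

public class MyAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // the tokenizer produces the raw tokens; each filter rewrites the previous stream
    TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return stream;
  }
}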
Analyzers – building blocks
Analyzers can be created by combining token streams (order is important)
Building blocks provided in core:
Tokenizers: CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, LetterTokenizer, LowerCaseTokenizer, SinkTokenizer, StandardTokenizer
Token filters: LowerCaseFilter, StopFilter, PorterStemFilter, TeeTokenFilter, ASCIIFoldingFilter, CachingTokenFilter, LengthFilter, StandardFilter
Analyzers - core
WhitespaceAnalyzer: splits tokens at whitespace
SimpleAnalyzer: divides text at non-letter characters and lowercases
StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words
KeywordAnalyzer: treats the entire text as a single token
StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese/Japanese/Korean characters, alphanumerics, and more; also lowercases and removes stop words
Analyzers – Example (1/2)
Analyzing "The JHUG meeting is on this Saturday"
WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]
SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]
StopAnalyzer:
[jhug] [meeting] [saturday]
StandardAnalyzer:
[jhug] [meeting] [saturday]
Analyzers – Example (2/2)
Analyzing "XY&Z Corporation - [email protected]"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [[email protected]]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [[email protected]]
Analyzers – Beyond the built in
Language-specific analyzers under contrib/analyzers: language-specific stemming and stop-word removal
Sounds-like analyzers, e.g. a MetaphoneReplacementAnalyzer that transforms terms into their phonetic roots
SynonymAnalyzer
Nutch analysis: bigrams for stop words
Stemming analysis, e.g. the PorterStemFilter: it stems words using the Porter stemming algorithm created by Dr. Martin Porter, best defined in his own words: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."
SnowballAnalyzer: Stemming for many European languages
Filters
Narrow the search space
Overloaded search methods accept Filter instances (see the sketch after this list)
Examples:
TermRangeFilter
NumericRangeFilter
PrefixFilter
QueryWrapperFilter
SpanQueryFilter
ChainedFilter
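A minimal usage sketch, assuming Lucene 3.x and an indexed "title" field (illustrative):

// restrict any query to documents whose title terms fall between "d" and "j" (inclusive)
Filter filter = new TermRangeFilter("title", "d", "j", true, true);
TopDocs docs = searcher.search(query, filter, 10);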
Example: Filters for Security
Constraints known at indexing time:
Index the constraint as a field
Search by wrapping a TermQuery on the constraint field with a QueryWrapperFilter
Factoring in information at search time: a custom Filter
The Filter will access an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions
Return a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document at that position is available to be searched against the query; unset bits mean the documents won't be considered in the search (see the sketch below)
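A minimal sketch of such a custom filter, assuming Lucene 3.x; the "acl" field and the per-user permission term are hypothetical stand-ins for the external privilege store lookup:

public class PermissionFilter extends Filter {
  private final String user;
  public PermissionFilter(String user) { this.user = user; }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc()); // one bit per document, all unset
    TermDocs termDocs = reader.termDocs(new Term("acl", user));
    try {
      while (termDocs.next()) {
        bits.set(termDocs.doc()); // enable the documents this user may see
      }
    } finally {
      termDocs.close();
    }
    return bits; // OpenBitSet implements DocIdSet
  }
}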
Internals - Concurrency
Any number of IndexReaders may be open; IndexSearchers use underlying IndexReaders
Only one IndexWriter at a time: locking is done with a write-lock file
IndexReaders may be open while the index is being changed by an IndexWriter: a reader sees the changes only after the writer commits and the reader is reopened
Both are thread-safe/thread-friendly classes
Internals - Indexing concepts
The index is made up of segment files
Deleting documents does not actually delete; it only marks documents for deletion
Index writes are buffered and flushed periodically
Segments need to be merged:
automatically by the IndexWriter
through explicit calls to optimize
There is the notion of commit (as you would expect), which has 4 steps:
1. Flush buffered documents and deletions
2. Sync files; force the OS to write to stable storage of the underlying I/O system
3. Write and sync the segments_N file
4. Remove old commits
Internals - Transactions
Two-phase commit is supported: prepareCommit performs steps 1, 2 and most of 3 (see the sketch below)
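A minimal sketch of the two calls, assuming Lucene 3.x:

try {
  writer.prepareCommit(); // steps 1, 2 and most of 3
  // ... e.g. commit a related database transaction here ...
  writer.commit();        // finishes step 3 and removes old commits
} catch (Exception e) {
  writer.rollback();      // discards everything since the last commit and closes the writer
}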
Lucene implements the ACID transactional model:
Atomicity: all-or-nothing commit
Consistency: e.g. an update means both delete and add
Isolation: IndexReaders cannot see what has not been committed
Durability: the index is not corrupted and persists in storage
Architectures
Cluster nodes sharing a remote file system index: slower than local; possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS)
Index in a database: much slower
Separate write and read indexes (replication): relies on the IndexDeletionPolicy feature of Lucene; out of the box in Solr and ElasticSearch
Autonomous search servers (e.g. Solr, ElasticSearch): loose coupling through JSON or XML
Frameworks – Compass
Document definition via JPA mapping:
<compass-core-mapping package="eu.emea.eudract.model.entity">
<class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
<id name="ctaIdentificationId">
<meta-data>cta_id</meta-data>
</id>
<dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name
</dynamic-meta-data>
<property name="fullTitle">
<meta-data>cta_full_title</meta-data>
</property><property name="sponsorProtocolVersionDate">
<meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
</property>
<property name="isResubmission">
<meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
</property>
<component name="eudractNumber" />
</class>
<class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
<property name="eudractNumberId">
<meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
<meta-data>eudract_number</meta-data>
</property>
<property name="paediatricClinicalTrial">
<meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial
</meta-data>
</property>
</class>
.....
</compass-core-mapping>
Frameworks – Solr
Document definition via DB mapping:
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<field column="MANU" name="manu" />
<field column="WEIGHT" name="weight" />
<field column="PRICE" name="price" />
<field column="POPULARITY" name="popularity" />
<field column="INSTOCK" name="inStock" />
<field column="INCLUDES" name="includes" />
<entity name="feature" query="select description from feature where item_id='${item.ID}'">
<field name="features" column="description" />
</entity>
<entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
<entity name="category" query="select description from category where id =
'${item_category.CATEGORY_ID}'">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Frameworks – Compass/Lucene Configuration
<compass name="default">
<setting name="compass.transaction.managerLookup">
org.compass.core.transaction.manager.OC4J</setting>
<setting name="compass.transaction.factory">
org.compass.core.transaction.JTASyncTransactionFactory</setting>
<setting name="compass.transaction.lockPollInterval">400</setting>
<setting name="compass.transaction.lockTimeout">90</setting>
<setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
<!--<setting name="compass.engine.connection">
jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
<!--<setting name="compass.engine.store.jdbc.connection.provider.class">-->
<!--org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider-->
<!--</setting>-->
<!--<setting name="compass.engine.ramBufferSize">512</setting>-->
<!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
<setting name="compass.converter.dashHandlingConverter.type">
eu.emea.eudract.compasssearch.DashHandlingConverter
</setting>
<setting name="compass.converter.shortToYesNoNaConverter.type">
eu.emea.eudract.compasssearch.ShortToYesNoNaConverter
</setting>
<setting name="compass.converter.shortToPerDayOrTotalConverter.type">
eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter
</setting>
<setting name="compass.engine.store.jdbc.dialect">
org.apache.lucene.store.jdbc.dialect.OracleDialect
</setting>
<setting name="compass.engine.analyzer.default.type">
org.apache.lucene.analysis.standard.StandardAnalyzer
</setting>
</compass>
Cool extra features – Spellchecking
You will need a dictionary of valid words; you could use the unique terms in your index
Given the dictionary you could:
Use a sounds-like algorithm such as Soundex or Metaphone
Or use n-grams, e.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel, and as 4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for squirel still shares four 3-grams (squ, qui, uir, rel) and two 4-grams (squi, quir) with squirrel, so it would score high! (see the sketch below)
To present or not to present (the suggestion): maybe use the Levenshtein distance
Other ideas:
Use the rest of the terms in the query to bias
Maybe combine distance with the frequency of the term
Check result counts in the initial and corrected searches
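Lucene's contrib spellchecker packages the n-gram approach; a minimal usage sketch, assuming Lucene 3.x contrib and an existing "post" field (illustrative):

// build a spell index from the unique terms of an existing field
SpellChecker spell = new SpellChecker(spellIndexDirectory);
spell.indexDictionary(new LuceneDictionary(indexReader, "post"));
// ask for the 5 closest suggestions to the misspelled word
String[] suggestions = spell.suggestSimilar("squirel", 5);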
Even More features
Sorting
Use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery (see the sketch below)
Beware: it uses the FieldCache, which resides in RAM!
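A minimal sorting sketch, assuming Lucene 3.x and an indexed, un-analyzed "date" field (illustrative):

Sort sort = new Sort(new SortField("date", SortField.STRING, true)); // true = reverse order
TopDocs docs = searcher.search(query, null, 10, sort); // no filter, top 10 by date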
SpanQueries
Match on distance between terms (span); a family of queries like SpanNearQuery, SpanOrQuery and others
Synonyms
Injection during indexing or during searching?
A MultiPhraseQuery is appropriate for search time
Injection during indexing allows faster searches
Leverage a synonyms knowledge base; a good strategy is to convert it into an index
The key thing is to understand that synonyms must be injected at the same position increment (see the sketch below)
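A minimal sketch of index-time injection with the attribute API, assuming Lucene 3.1 and a hypothetical one-entry synonym map (fast -> quick):

public class SimpleSynonymFilter extends TokenFilter {
  private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
  private String pendingSynonym;

  public SimpleSynonymFilter(TokenStream in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingSynonym != null) {          // emit the queued synonym...
      term.setEmpty().append(pendingSynonym);
      posIncr.setPositionIncrement(0);     // ...at the same position as the original
      pendingSynonym = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    if ("fast".equals(term.toString())) {  // hypothetical synonym lookup
      pendingSynonym = "quick";
    }
    return true;
  }
}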
Spatial Searches
Answer the query "Greek Restaurants Near Me"
An efficient technique is to use grids:
assign non-unique grid numbers to areas (e.g. in a Mercator space)
index documents with a field containing the grid numbers that match their longitude and latitude
MoreLikeThis
One use of term vectors
Function Queries
e.g. add boosts for fields at search time
Some last things to bear in mind
It would be wise to back up your index
You can have hot backups (supported through the SnapshotDeletionPolicy)
Performance has some trade-offs:
search latency
indexing throughput
near-real-time results
index replication
index optimization
Resource consumption:
disk space
file descriptors
memory
'Luke' is a really handy tool
You can repair a corrupted index (corrupted segments are just lost… D'oh!)
Resources
Book: Lucene in Action
Solr: http://lucene.apache.org/solr/
Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model
IR Links: http://wiki.apache.org/lucene-java/InformationRetrieval
That’s it
Questions?