Lucene introduction / overview, also touching on Lucene 2.9/3.0 features
Lucene Introduction
Otis Gospodnetic, Sematext Int’l
@[email protected]
http://jroller.com/otis
http://sematext.com/
About Otis
• Lucener since pre-Apache (ca. 2000)
• Committer: Lucene, Solr, Nutch, Mahout, Open Relevance
• Lucene in Action 1 & 2 co-author
• Solr in Action author
• Sematext co-founder
What is Lucene?
• Free, ASL, Java IR library, Jar
• Doug Cutting, ASF, 2001
• Application agnostic: Indexing & Searching
• High performance, scalable
• No dependencies
• Heavily ported
What Lucene Ain’t
• Turn key “solution”
• Application (it is a library, so there is no installer or wizard)
• (Web) crawler
• Insert-doc-format-here parser / filter
The Lucene Family
• Lucene vs. Apache Lucene vs. Java Lucene: IR library
• Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE
• Solr: Search server
• Droids: Standalone framework for writing crawlers
• Lucene.Net: C#, Incubator graduate
• Lucy: C Lucene impl
• Mahout: Hadoop-loving ML library
• Open Relevance: Relevance judgments
• PyLucene: Python port
Integration
[Diagram: Data Sources → Gather → Parse → Make Doc → Index; Search UI → Search App (e.g. webapp) → Search → Index]
Integration: Rich Doc Indexing
[Diagram: HTML, PDF and MS Word documents → Gather → Parse with Tika → Make Doc → Index]
Lucene Strengths
• Simple API
• Fast
• Concurrent indexing and searching
• Incremental indexing
• NRT: Near-Real-Time
• Boolean + Vector space, sorting, etc.
• Cheap
Query Types
• Single and multi-term queries
• Phrase queries (sloppiness allowed)
• Wildcard and fuzzy
• Range queries
• “Boolean”: required, prohibited, “should”
• Grouping
• Fields
Query Syntax
• +monkey +banana (same as: monkey AND banana)
• +dog -snoopy (same as: dog AND NOT snoopy)
• "pork flu"
• "pork flu" -"new york" (same as: "pork flu" NOT "new york")
• "sweet pork"~3
• natur*
• schmidt~
• createDate:[200901 TO 201001]
• author:doug
• author:"doug cutting"
• author:"doug cutting" AND project:(lucene OR nutch OR hadoop)
• title:lucene^5.0 body:lucene
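The required / prohibited / "should" semantics behind the + and - prefixes can be sketched in plain Java. This is a toy illustration, not Lucene's QueryParser or BooleanQuery: the class name ToyBooleanMatcher and its matches method are made up for this example, and it handles only single-term clauses (no phrases, fields, wildcards or boosts).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration only: this is not Lucene's QueryParser or BooleanQuery.
class ToyBooleanMatcher {

    // Returns true if the document's terms satisfy a query made of
    // +required, -prohibited and plain ("should") single-term clauses.
    static boolean matches(String query, Set<String> docTerms) {
        boolean hasRequired = false;
        boolean hasShould = false;
        boolean shouldMatched = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                hasRequired = true;
                // A missing required term rejects the document.
                if (!docTerms.contains(clause.substring(1))) {
                    return false;
                }
            } else if (clause.startsWith("-")) {
                // A present prohibited term rejects the document.
                if (docTerms.contains(clause.substring(1))) {
                    return false;
                }
            } else {
                hasShould = true;
                if (docTerms.contains(clause)) {
                    shouldMatched = true;
                }
            }
        }
        // Without any required clause, at least one "should" clause has to match.
        if (!hasRequired) {
            return hasShould && shouldMatched;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("dog", "snoopy", "banana"));
        System.out.println(matches("+dog -snoopy", doc));   // false: snoopy is prohibited
        System.out.println(matches("+dog +banana", doc));   // true: both required terms present
        System.out.println(matches("monkey banana", doc));  // true: one "should" term matches
    }
}
```

In this sketch a query made only of prohibited clauses matches nothing, since there is no positive clause to select documents.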
Code: FS Indexer
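import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
  private IndexWriter writer;

  // create=true: any existing index in indexDir is overwritten.
  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
  }

  public void close() throws IOException {
    writer.close();
  }

  // Indexes each readable file in dataDir that the filter accepts.
  public void index(String dataDir, FileFilter filter) throws Exception {
    File[] files = new File(dataDir).listFiles();
    for (File f : files) {
      if (f.isDirectory() || !f.canRead() || (filter != null && !filter.accept(f))) {
        continue;
      }
      Document doc = new Document();
      doc.add(new Field("contents", new FileReader(f)));  // analyzed, not stored
      doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
      writer.addDocument(doc);
    }
  }
}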
Indexing Pipeline
[Diagram: adding a Document runs it through a Tokenizer and TokenFilter, then the DocumentWriter, into the Inverted Index]
Indexer Pipeline: Analysis
Source: Lucene in Action
• 1 Tokenizer
• N TokenFilters
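The "1 Tokenizer, N TokenFilters" shape can be sketched in plain Java as one splitting step followed by a chain of list-to-list filters. This is a toy illustration and not Lucene's Tokenizer/TokenFilter API: the class ToyAnalysisChain, its method names and the three-word stop list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: these are not Lucene's Tokenizer/TokenFilter classes.
class ToyAnalysisChain {

    // The single tokenizer: split the text on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter 1: lower-case every token.
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter 2: drop tokens that appear in the stop-word set.
    static List<String> removeStopWords(List<String> in, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "an"));
        String text = "The quick brown fox jumped over the lazy dogs";
        // One tokenizer, then two filters applied in order.
        List<String> tokens = removeStopWords(lowerCase(tokenize(text)), stop);
        System.out.println(tokens); // [quick, brown, fox, jumped, over, lazy, dogs]
    }
}
```

Filter order matters in this sketch: lower-casing runs before stop-word removal so that "The" is caught by the lower-case stop list.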
Analysis in Action
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
Field Options
• Doc has 1+ Fields. Field has name+value
• Field.Index: NO, ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS
• Field.Store: YES, NO
• Field.TermVector: NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS
Inverted Index
Source: developer.apple.com
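The idea of an inverted index, a map from each term to the documents that contain it, can be sketched in a few lines of plain Java. This is a toy illustration and not Lucene's index structure or file format: the class ToyInvertedIndex and its add and lookup methods are made up for this example, and it keeps no positions, frequencies or norms.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: this is not how Lucene stores its index.
class ToyInvertedIndex {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Splits the text on non-word characters and records docId under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "Lucene is a Java library");
        index.add(1, "Solr is a search server built on Lucene");
        System.out.println(index.lookup("lucene")); // [0, 1]
        System.out.println(index.lookup("solr"));   // [1]
        System.out.println(index.lookup("nutch"));  // []
    }
}
```

A lookup is a single map access per term, which is why searching does not need to scan the documents themselves.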
Index Directory
# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen
Details: http://lucene.apache.org/java/2_9_0/fileformats.html
Code: Searcher
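import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
  public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true);  // true = read-only

    // Version-taking constructor: the two-argument form is deprecated in 2.9.
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    TopDocs hits = is.search(query, 10);  // top 10 hits
    System.err.println("Found " + hits.totalHits + " document(s)");

    for (int i = 0; i < hits.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = hits.scoreDocs[i];
      Document doc = is.doc(scoreDoc.doc);  // load stored fields
      System.out.println(doc.get("filename"));
    }
    is.close();
  }
}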
Code: Doc Deletion
Via IndexReader
void deleteDocument(int docNum) Deletes the document numbered docNum
int deleteDocuments(Term term) Deletes all documents that have a given term indexed.
Via IndexWriter
void deleteAll() Delete all documents in the index.
void deleteDocuments(Query query) Deletes the document(s) matching the provided query.
void deleteDocuments(Query[] queries) Deletes the document(s) matching any of the provided queries.
void deleteDocuments(Term term) Deletes the document(s) containing term.
void deleteDocuments(Term[] terms) Deletes the document(s) containing any of the terms.
Code: Doc Updates
Via IndexWriter facade
void updateDocument(Term term, Document doc) Updates a document by first deleting the document(s) containing term and then adding the new document.
void updateDocument(Term term, Document doc, Analyzer analyzer) Updates a document by first deleting the document(s) containing term and then adding the new document.
Pitfalls
• Update = delete + add
• No partial doc update
• No joins
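The first two pitfalls can be illustrated with a plain-Java toy: an update is a delete followed by an add, so the replacement document must be supplied in full and any field left out is gone. This is a sketch under stated assumptions and not Lucene's IndexWriter: the class ToyUpdatableIndex, its methods and the sample field names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: this is not Lucene's IndexWriter.
class ToyUpdatableIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    // Appends a document and returns its (new) internal id.
    int addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
        deleted.add(false);
        return docs.size() - 1;
    }

    // Marks every live document whose field equals value as deleted.
    int deleteDocuments(String field, String value) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && value.equals(docs.get(i).get(field))) {
                deleted.set(i, true);
                count++;
            }
        }
        return count;
    }

    // Update = delete the matching document(s), then add the replacement.
    int updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        return addDocument(newDoc);
    }

    // Returns the document, or null if it has been deleted.
    Map<String, String> get(int id) {
        return deleted.get(id) ? null : docs.get(id);
    }

    // Counts documents that have not been deleted.
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        ToyUpdatableIndex index = new ToyUpdatableIndex();
        Map<String, String> v1 = new HashMap<>();
        v1.put("id", "42");
        v1.put("title", "Lucene intro");
        v1.put("author", "otis");
        int oldId = index.addDocument(v1);

        // The replacement omits "author"; there is no partial update to keep it.
        Map<String, String> v2 = new HashMap<>();
        v2.put("id", "42");
        v2.put("title", "Lucene introduction");
        int newId = index.updateDocument("id", "42", v2);

        System.out.println(oldId + " -> " + newId);            // 0 -> 1
        System.out.println(index.get(newId).get("author"));    // null
        System.out.println(index.numDocs());                   // 1
    }
}
```

The practical consequence is that the application has to keep, or be able to rebuild, the full content of a document in order to change any one of its fields.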
Performance Tips
• Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS
• Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector
Details:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Lucene 2.9 & 3.0
• Per segment searching and caching (can lead to much faster reopen among other things)
• Near real-time search (aka NRT)
• New Query types
• Smarter, more scalable multi-term queries (wildcard, range, etc)
• Freshly optimized Collector/Scorer API
• Improved Unicode support and the addition of Collation contrib
• New Attribute based TokenStream API
• New QueryParser framework in contrib with a core QueryParser replacement impl included
• Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required
• New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
• New fast-vector-highlighter for large documents
• Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
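For context on the last bullet, the kind of external pre-processing it refers to can be sketched as zero-padding: terms are compared as text, so numbers need a fixed width for text order to agree with numeric order. This plain-Java toy only illustrates that older workaround and does not show how the trie encoding works; the class ToyPaddedNumbers, the pad method and the width of 10 digits are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy illustration only: shows text pre-processing of numbers, not Lucene's trie encoding.
class ToyPaddedNumbers {

    // Encodes a non-negative int as a fixed-width string so that
    // lexicographic order equals numeric order.
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    public static void main(String[] args) {
        // Unpadded: text order disagrees with numeric order.
        List<String> raw = new ArrayList<>(Arrays.asList("300", "2", "10"));
        Collections.sort(raw);
        System.out.println(raw);    // [10, 2, 300]

        // Padded: text order now matches numeric order.
        List<String> padded = new ArrayList<>(Arrays.asList(pad(300), pad(2), pad(10)));
        Collections.sort(padded);
        System.out.println(padded); // [0000000002, 0000000010, 0000000300]
    }
}
```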
Community
[email protected] [email protected]
"I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."
Resources
• http://lucene.apache.org/java (Wiki, MLs, javadoc)
• http://manning.com/lucene (LIA2 soon, MEAP available)
• @lucene
Contact
@otisg
sematext.com
jroller.com/otis
blog.sematext.com