1 © Copyright 2012 EMC Corporation. All rights reserved.
Searching Large XML Databases using Lucene
Amsterdam, September 19, 2012
Petr Pleshachkov, EMC
My Background
Petr Pleshachkov, Principal Software Engineer
xDB/xPlore team in Rotterdam
– My site: EMC Netherlands
– Other xPlore/xDB sites: Pleasanton (California), Shanghai (China), and Grenoble (France)
Areas of expertise:
– Semistructured data management
– Databases: transaction management, query optimization, full-text search
Academia & Research:
– PhD in Computer Science, ISP RAS
Agenda
Overview of EMC Documentum xDB/xPlore
Integration of Lucene into xDB
xDB transaction model & Lucene transaction management
Performance analysis
Future optimizations
Introducing Documentum xPlore
• EMC Documentum is a leading supplier of Enterprise Content Management software
• xPlore provides ‘Integrated Search’ for Documentum
– but is built as a standalone search engine to replace FAST InStream
– Widely deployed across Documentum environments worldwide (70+ countries)
• xPlore Search Engine is built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software
Key values which xDB brings to xPlore
Flexible, hierarchical query & data models
Joins
High-throughput, low-latency indexing
– See documents within seconds after saving
Leverage B-tree indexes when appropriate
– Lucene doesn’t fit all uses
Rich, innovative query language
Enterprise, single unified database
Why build a search engine over an XML database?
Documentum xDB
Formerly the X-Hive database
– 100% Java
– XML stored in persistent DOM format
▪ Each XML node can be located through a 64-bit identifier
▪ Structure mapped to pages
▪ Easy to operate on GB-sized XML files
Full transactional database
Query language: XQuery
Indexing & optimization
– Palette of index options the optimizer can pick from
– At its simplest: indexLookup(key) -> node id
Backup/restore, scalability, multi-node architecture
xDB Data Storage Model
An XML document can be thought of as a collection of elements and attributes (‘XML nodes’)
This node structure can be represented as a tree (the DOM model)
[Figure: XML nodes A, B, C, D, E shown as a tree and as entries stored on a database page]
Libraries & Indexes
Scope of an index covers all XML files in all sub-libraries
[Figure: library hierarchy with indexes and XML files A, B, C; legend: xDB library, xDB index, xDB XML file]
Lucene Integration
Both value and full-text queries supported
– XML SubPaths mapped into Lucene fields
– Tokenized and value-based indexes available
Composite key queries supported
– Lucene index is much more flexible than B-tree composite indexes
– Skip lists
Multipath Index Definition
<PLAY>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>BRUTUS</SPEAKER>
        <LINE>I am not gamesome: I do lack some part</LINE>
        <p><LINE>Listen great things</LINE></p>
      </SPEECH>
      <SPEECH>
        <SPEAKER>CASSIUS</SPEAKER>
        <LINE>Then, Brutus, I have much mistook your passion;</LINE>
        <LINE>By means whereof this breast of mine hath buried</LINE>
        <p><LINE>Thoughts of great value, worthy cogitations.</LINE></p>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
INDEX ROOT PATH: //SPEECH
SubPath1: (/SPEAKER, VALUE_COMPARISON)
SubPath2: (//LINE, FULL_TEXT_SEARCH)
Mapping to Native Lucene Structures
Lucene Document 1
Field          Index mode     Store       Value
/SPEAKER/txt   NOT_ANALYZED   STORE.NO    BRUTUS
/LINE/tkn      ANALYZED       STORE.NO    I am not gamesome: I do lack some part
/p/LINE/tkn    ANALYZED       STORE.NO    Listen great things
XHIVE_NODE     NOT_ANALYZED   STORE.YES   1430532
Lucene Document 2
Field          Index mode     Store       Value
/SPEAKER/txt   NOT_ANALYZED   STORE.NO    CASSIUS
/LINE/tkn      ANALYZED       STORE.NO    By means whereof this breast of mine hath buried
/LINE/tkn      ANALYZED       STORE.NO    Then, Brutus, I have much mistook your passion;
/p/LINE/tkn    ANALYZED       STORE.NO    Thoughts of great value, worthy cogitations.
XHIVE_NODE     NOT_ANALYZED   STORE.YES   1430537
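The mapping above can be illustrated with a small Python sketch. It is not xDB code: it only assumes that each element matching the index root path (//SPEECH) becomes one document, that /SPEAKER yields a "/txt" field, and that every LINE below the root yields a "/tkn" field named after its relative path. The helper names (walk, to_lucene_docs) are hypothetical, and the node IDs are simply passed in from the slide.

```python
import xml.etree.ElementTree as ET

PLAY = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>BRUTUS</SPEAKER>
<LINE>I am not gamesome: I do lack some part</LINE>
<p><LINE>Listen great things</LINE></p></SPEECH>
<SPEECH><SPEAKER>CASSIUS</SPEAKER>
<LINE>Then, Brutus, I have much mistook your passion;</LINE>
<LINE>By means whereof this breast of mine hath buried</LINE>
<p><LINE>Thoughts of great value, worthy cogitations.</LINE></p></SPEECH>
</SCENE></ACT></PLAY>"""


def walk(elem, path=""):
    """Yield (path relative to the index root, element) for every descendant."""
    for child in elem:
        child_path = f"{path}/{child.tag}"
        yield child_path, child
        yield from walk(child, child_path)


def to_lucene_docs(xml_text, node_ids):
    """Build one list of (field name, value) pairs per SPEECH element."""
    root = ET.fromstring(xml_text)
    docs = []
    for speech, node_id in zip(root.iter("SPEECH"), node_ids):
        fields = []
        for path, elem in walk(speech):
            text = (elem.text or "").strip()
            if path == "/SPEAKER":        # SubPath1: value comparison
                fields.append((path + "/txt", text))
            elif elem.tag == "LINE":      # SubPath2: //LINE, full-text search
                fields.append((path + "/tkn", text))
        # The node ID is the only stored field; here it is given, not computed.
        fields.append(("XHIVE_NODE", str(node_id)))
        docs.append(fields)
    return docs


docs = to_lucene_docs(PLAY, node_ids=[1430532, 1430537])
for doc in docs:
    print(doc)
```

The sketch keeps fields in document order, so the two /LINE/tkn fields of the second document appear in the order of the XML source.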
Lucene Inverted List
[Figure: term dictionary and document store]
Term dictionary (each term points to a posting list of Lucene doc IDs, such as {1}, {2} or {1, 2}):
/LINE/tkn: great, lack, passion, …
/p/LINE/tkn: great, …
/SPEAKER/txt: brutus, cassius
XHIVE_NODE: 1430532, 1430537
Document store:
Doc ID   XHIVE_NODE
1        1430532
2        1430537
Lucene Query Mapping
for $SPEECH score $s in collection('col1')//SPEECH[SPEAKER='CASSIUS' and //LINE contains text 'great']
order by $s
return $SPEECH
BooleanQuery(
  TermQuery1,
  BooleanQuery(TermQuery2, TermQuery3, BooleanClause.Occur.SHOULD),
  BooleanClause.Occur.MUST)
TermQuery1 = TermQuery(new Term('/SPEAKER/txt', 'CASSIUS'))
TermQuery2 = TermQuery(new Term('/LINE/tkn', 'great'))
TermQuery3 = TermQuery(new Term('/p/LINE/tkn', 'great'))
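How such a query resolves against posting lists can be sketched in plain Python. This is a toy model, not the Lucene API: the analysis rule (lowercase, split on non-letters for "/tkn" fields, single unmodified term for "/txt" fields) is an assumption, and the helpers build_index, term_query, any_of and all_of are hypothetical names. It relies only on the general behaviour that MUST clauses intersect and that a query made only of SHOULD clauses matches documents satisfying at least one of them.

```python
import re
from collections import defaultdict

# Field contents of the two Lucene documents from the mapping slide.
DOCS = {
    1: [("/SPEAKER/txt", "BRUTUS"),
        ("/LINE/tkn", "I am not gamesome: I do lack some part"),
        ("/p/LINE/tkn", "Listen great things")],
    2: [("/SPEAKER/txt", "CASSIUS"),
        ("/LINE/tkn", "Then, Brutus, I have much mistook your passion;"),
        ("/LINE/tkn", "By means whereof this breast of mine hath buried"),
        ("/p/LINE/tkn", "Thoughts of great value, worthy cogitations.")],
}


def build_index(docs):
    """Map (field, term) -> set of doc IDs, i.e. a tiny inverted list."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields:
            if field.endswith("/tkn"):   # assumed analyzer for tokenized fields
                terms = re.findall(r"[a-z]+", value.lower())
            else:                        # value fields are indexed as one term
                terms = [value]
            for term in terms:
                index[(field, term)].add(doc_id)
    return index


def term_query(index, field, term):
    """Posting list for one term (empty set if the term is unknown)."""
    return set(index.get((field, term), set()))


def any_of(*postings):
    """SHOULD-only boolean query: union of the clause results."""
    return set().union(*postings)


def all_of(*postings):
    """MUST clauses: intersection of the clause results."""
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


index = build_index(DOCS)
hits = all_of(
    term_query(index, "/SPEAKER/txt", "CASSIUS"),
    any_of(term_query(index, "/LINE/tkn", "great"),
           term_query(index, "/p/LINE/tkn", "great")),
)
print(hits)  # {2}
```

Scoring is left out on purpose; the sketch only shows which documents match.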
Lucene SubIndexes
Each user transaction creates a separate Lucene subIndex
Transaction performs all the updates in its own index
The delete operation does not physically touch subIndexes created by other transactions
A pair (minLSN, maxLSN) is associated with each subIndex, which is used to construct a global index snapshot
Blacklists
The delete operation of a transaction:
– Physically deletes the document from the transaction’s own subIndex
– Adds a pair (subIndexMinLSN, NODE_ID) to the blacklist structure
The index view constructor applies blacklists to eliminate deleted documents
Periodically, a merge operation merges small subIndexes into a bigger one and physically deletes the blacklisted documents
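The interplay of subIndexes, blacklists and merging can be sketched as follows. This is a simplified model under stated assumptions, not the xDB implementation: a subIndex is taken to be visible to a snapshot once its maxLSN is not greater than the snapshot LSN, a blacklist entry identifies its target subIndex by minLSN, and the time at which a delete happened is ignored. The class and function names and the node ID 1430540 are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubIndex:
    min_lsn: int
    max_lsn: int
    node_ids: set = field(default_factory=set)


def build_view(sub_indexes, blacklist, snapshot_lsn):
    """Return the node IDs visible at snapshot_lsn.

    Assumption: a subIndex is visible once max_lsn <= snapshot_lsn.
    Blacklisted (min_lsn, node_id) pairs are filtered out of the view.
    """
    visible = set()
    for si in sub_indexes:
        if si.max_lsn <= snapshot_lsn:
            visible |= {n for n in si.node_ids
                        if (si.min_lsn, n) not in blacklist}
    return visible


def merge(sub_indexes, blacklist):
    """Merge subIndexes into one and physically drop blacklisted documents."""
    merged = SubIndex(min(si.min_lsn for si in sub_indexes),
                      max(si.max_lsn for si in sub_indexes))
    for si in sub_indexes:
        merged.node_ids |= {n for n in si.node_ids
                            if (si.min_lsn, n) not in blacklist}
    # Entries that pointed at the merged subIndexes are no longer needed.
    merged_keys = {si.min_lsn for si in sub_indexes}
    remaining = {(m, n) for (m, n) in blacklist if m not in merged_keys}
    return merged, remaining


si1 = SubIndex(10, 19, {1430532, 1430537})
si2 = SubIndex(20, 29, {1430540})     # hypothetical third node ID
blacklist = {(10, 1430537)}           # a delete aimed at a document in si1

view_25 = build_view([si1, si2], blacklist, snapshot_lsn=25)
view_30 = build_view([si1, si2], blacklist, snapshot_lsn=30)
merged, remaining = merge([si1, si2], blacklist)
print(view_25, view_30, merged, remaining)
```

The point of the design is that a deleting transaction never rewrites another transaction's subIndex; the physical cleanup is deferred to the merge.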
xDB transaction management
ARIES-based ACID transactions
– Every page has a Log Sequence Number (pageLSN)
– Buffer manager tracks dirty pages using RecLSNs
– ALL updates are logged on a per-page basis, including updates performed during rollbacks
– An asynchronous thread periodically runs the checkpoint procedure
– The recovery procedure:
▪ Repeat history: redo all the updates since the last successful checkpoint
▪ Undo incomplete transactions
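The redo-then-undo recovery procedure can be sketched in a few lines of Python. This is a heavily simplified illustration of the ARIES idea, not xDB's recovery code: the log is assumed to start at the last checkpoint, pages hold a single string value, and the log records that a real system writes during undo are omitted. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int
    txn: str
    kind: str          # "update" or "commit"
    page: str = ""
    before: str = ""   # page value before the update (used by undo)
    after: str = ""    # page value after the update (used by redo)


def recover(pages, page_lsn, log):
    """Redo history, then undo incomplete transactions."""
    committed = {r.txn for r in log if r.kind == "commit"}

    # Redo pass: reapply every update the page has not seen yet,
    # regardless of whether its transaction committed.
    for r in log:
        if r.kind == "update" and r.lsn > page_lsn.get(r.page, 0):
            pages[r.page] = r.after
            page_lsn[r.page] = r.lsn

    # Undo pass: roll back updates of uncommitted transactions,
    # newest first.
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            pages[r.page] = r.before
    return pages


log = [
    LogRecord(1, "T1", "update", "P1", before="a", after="b"),
    LogRecord(2, "T2", "update", "P2", before="x", after="y"),
    LogRecord(3, "T1", "commit"),
    LogRecord(4, "T2", "update", "P1", before="b", after="c"),
]
# On-disk state at crash time: P1 already contains the update with LSN 1.
pages = {"P1": "b", "P2": "x"}
page_lsn = {"P1": 1, "P2": 0}

recovered = recover(pages, page_lsn, log)
print(recovered)  # {'P1': 'b', 'P2': 'x'}
```

T1 committed, so its update survives; T2 never committed, so both of its updates are rolled back after being redone.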
xDB transaction isolation
READ_WRITE transactions follow the two-phase locking rule:
– Expanding phase: locks are acquired and no locks are released
– Shrinking phase: locks are released and no locks are acquired
READ_ONLY transactions do not acquire any locks!
– The data snapshot at the moment of transaction start is used
– Recent changes are undone at the page level using log records
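The lock-free read path can be pictured as reconstructing an older page image from the log. The sketch below is an assumption-laden illustration, not xDB code: a page is a single value, the per-page log is a list of (lsn, before, after) tuples in ascending LSN order, and the handling of transactions that were still uncommitted at snapshot time is left out. The function name page_as_of is hypothetical.

```python
def page_as_of(current_value, page_log, snapshot_lsn):
    """Reconstruct a page image as of snapshot_lsn.

    Walks the page's log records from newest to oldest and undoes
    every update that is newer than the snapshot.
    """
    value = current_value
    for lsn, before, _after in reversed(page_log):
        if lsn > snapshot_lsn:
            value = before
    return value


page_log = [(5, "v0", "v1"), (8, "v1", "v2"), (12, "v2", "v3")]

print(page_as_of("v3", page_log, snapshot_lsn=9))   # v2
print(page_as_of("v3", page_log, snapshot_lsn=4))   # v0
print(page_as_of("v3", page_log, snapshot_lsn=12))  # v3
```

Because the reader works on a private reconstructed image, it never blocks writers and needs no locks.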
How to integrate Lucene into the transactional xDB database?
Old solution (xDB 10.1/10.2 releases)
– All Lucene files are stored in a separate directory
– A new transaction model for Lucene indexes is implemented
– Lucene does not use the xDB buffer pool
– Backup/restore and replication do not use xDB mechanisms
New solution (xDB 10.3)
– All Lucene files are stored in an xDB data segment
– The xDB transaction model is used, since all the updates go through xDB data pages
– Backup/restore and replication are supported automatically
Lucene Index Access Model
New LIDirectoryImpl class is implemented (extends the Directory class)
LIDirectory class stores all files in xDB blob objects
LIIndexInput class extends BufferedIndexInput
– void readInternal(byte[] b, int offset, int len)
▪ Reads data from the blob
▪ The blob object is buffered at the xDB buffer management level
LIIndexOutput class extends BufferedIndexOutput
– void flushBuffer(byte[] b, int offset, int len)
▪ Writes Lucene data to the blob object
▪ The operation is logged automatically at the buffer manager level
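The division of labour between the buffered stream classes and the blob storage can be mimicked with a toy Python model. None of this is the Lucene or xDB API: BlobDirectory, BufferedBlobOutput and BufferedBlobInput are hypothetical stand-ins, a blob is just a bytearray, the buffer size of 4 bytes is chosen to make the buffering visible, and the file name is made up. The only idea carried over from the slide is that all storage access is funnelled through one write hook (flushBuffer) and one read hook (readInternal).

```python
class BlobDirectory:
    """Toy directory that keeps each index file in one in-memory blob."""

    def __init__(self):
        self.blobs = {}  # file name -> bytearray acting as the blob

    def create_output(self, name):
        self.blobs[name] = bytearray()
        return BufferedBlobOutput(self.blobs[name])

    def open_input(self, name):
        return BufferedBlobInput(self.blobs[name])


class BufferedBlobOutput:
    def __init__(self, blob, buffer_size=4):
        self.blob = blob
        self.buffer = bytearray()
        self.buffer_size = buffer_size

    def write(self, data):
        self.buffer += data
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        self.flush_buffer(bytes(self.buffer))
        self.buffer.clear()

    def flush_buffer(self, data):
        # Counterpart of flushBuffer: the only place that writes to the blob.
        self.blob += data


class BufferedBlobInput:
    def __init__(self, blob, buffer_size=4):
        self.blob = blob
        self.buffer_size = buffer_size
        self.pos = 0            # next position to read in the file
        self.buffer = b""       # currently buffered chunk
        self.buffer_start = 0   # file offset of the buffered chunk

    def read_internal(self, offset, length):
        # Counterpart of readInternal: the only place that reads the blob.
        return bytes(self.blob[offset:offset + length])

    def read(self, length):
        out = bytearray()
        while len(out) < length and self.pos < len(self.blob):
            rel = self.pos - self.buffer_start
            if not (0 <= rel < len(self.buffer)):   # buffer miss: refill
                self.buffer_start = self.pos
                self.buffer = self.read_internal(self.pos, self.buffer_size)
                rel = 0
            chunk = self.buffer[rel:rel + length - len(out)]
            out += chunk
            self.pos += len(chunk)
        return bytes(out)


directory = BlobDirectory()
out = directory.create_output("_0.cfs")   # hypothetical file name
out.write(b"hello ")
out.write(b"lucene")
out.flush()

inp = directory.open_input("_0.cfs")
first = inp.read(5)
second = inp.read(7)
print(first, second)
```

In the setup described on the slide, the blob itself sits on buffered, logged database pages, which is what lets the index inherit the database's transaction handling without Lucene being aware of it.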
Lucene Index Access Model (cont’d)
[Figure: Indexer and Queries; IndexWriter and IndexReader with Lucene caches; LIDirectoryImpl with LIIndexOutput (flushBuffer) and LIIndexInput (readInternal); Lucene blob objects on buffered data pages]
Lucene SubIndex Storage Model
[Figure: components shown are LIDirectoryStore, two LiFileEntryStore entries, a directory page, BlobStore pages, blob tails, and blob pages]
Lucene Index Master Record (MIR)
• Tracks information about all subIndexes and their state
• Represented as a concurrent B-tree index
• Used for Lucene index view construction
• Updated concurrently by ingest transactions and merging/cleaning tasks
• Periodically, asynchronous tasks merge subIndexes into a bigger one
[Figure: subIndex entries SI_1, SI_2, SI_3, …, SI_N with their directory objects and blob objects]
SubIndexes Merging
[Figure: subIndexes labelled B to H, a Final Index, and the New Final Index produced by merging]
Ingest performance analysis (in seconds)
[Bar chart: ingest time in seconds (axis 0 to 3000) for 10,000, 50,000 and 100,000 documents; series: xDB 10.3 (pre-release) and xDB 10.2. Data labels as extracted, with decimal commas: 180,956; 1009,459; 2149,636; 205,068; 1015,937; 2526,601]
Query performance analysis
(response time in ms)
[Bar chart: axis 0 to 16 ms; series: xDB 10.3 (pre-release) and xDB 10.2. Data labels as extracted, with decimal commas: 7,088; 10,08; 7,713; 14,013]
Q1 series: queries with range and 3 value-comparison conditions
Q2 series: queries with full-text and 2 value-comparison conditions
Future optimizations
Reduce number of separate subIndexes
Final/NonFinal merge optimizations
Advanced buffer management techniques
Concurrent Lucene MultiPath Index