ApacheCon EU 2014 presentation about the flexible architecture for search in Apache Jackrabbit Oak.
Flexible search in Apache Jackrabbit Oak
Tommaso Teofili
Apache Jackrabbit Oak
• Scalable content repository
• JCR 2.0
• Designed for concurrent access (MVCC)
• Pluggable components (storage, indexes)
• Powering AEM 6.0
18/11/14 2
Oak Architecture
• Oak-JCR
• Oak-Core
  – MVCC (node states and immutable trees)
  – Core components (Security, Query engine, …)
  – Plugins
• Oak-MK
  – Pluggable storage
Oak – the Query Engine
• Query languages
  – XPath
  – SQL-2
• Selects the index(es) expected to perform best
  – Search is delegated to the underlying indexes
  – No index? The repository is traversed
• ACLs applied afterwards
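
The selection, traversal fallback, and late ACL check listed above can be sketched in plain Java. This is a toy model, not Oak code: it assumes each index can report an estimated cost for a query, and the names `CostedIndex`, `IndexSelectionSketch`, `matchesQuery`, and `canRead` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for an index; this is not Oak's QueryIndex interface.
interface CostedIndex {
    // Estimated cost of answering the query; POSITIVE_INFINITY means "cannot answer".
    double estimateCost(String query);
    // Paths of the nodes the index returns for the query.
    List<String> query(String query);
}

final class IndexSelectionSketch {

    // Pick the cheapest index able to answer the query, or null if none can.
    static CostedIndex selectIndex(List<CostedIndex> indexes, String query) {
        CostedIndex best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (CostedIndex index : indexes) {
            double cost = index.estimateCost(query);
            if (cost < bestCost) {
                best = index;
                bestCost = cost;
            }
        }
        return best;
    }

    // Delegate to the selected index, or traverse every path when no index applies;
    // the access check (canRead) runs on the candidates afterwards.
    static List<String> execute(List<CostedIndex> indexes, List<String> repositoryPaths,
                                String query, Predicate<String> matchesQuery,
                                Predicate<String> canRead) {
        CostedIndex index = selectIndex(indexes, query);
        List<String> candidates = new ArrayList<>();
        if (index != null) {
            candidates.addAll(index.query(query));
        } else {
            for (String path : repositoryPaths) {
                if (matchesQuery.test(path)) {
                    candidates.add(path);
                }
            }
        }
        List<String> visible = new ArrayList<>();
        for (String path : candidates) {
            if (canRead.test(path)) {
                visible.add(path);
            }
        }
        return visible;
    }

    // Test helper: an index with a fixed cost and fixed results.
    static CostedIndex fixedIndex(final double cost, final List<String> results) {
        return new CostedIndex() {
            public double estimateCost(String query) { return cost; }
            public List<String> query(String query) { return results; }
        };
    }
}
```

Because the access check runs last, an index may return paths the session is not allowed to read; they are dropped before results are handed back.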
Indexing – the IndexEditor API
• NodeState before = builder.getNodeState();
• builder.child("a").setProperty("foo", "bar");
• NodeState after = builder.getNodeState();
• NodeState indexed = editorHook.processCommit(before, after, …); // who said MVCC?
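
The before/after pattern above, where an index is updated from the difference between two immutable states, can be illustrated with a small stand-alone model. It is only a sketch under simplifying assumptions: a state is reduced to a map from node path to the value of one property, and the class `DiffIndexingSketch` and its methods are hypothetical, not Oak's IndexEditor API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DiffIndexingSketch {

    // Toy property index: property value -> paths of the nodes carrying it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // "before" and "after" map node path -> property value (no null values);
    // neither snapshot is modified, only the index is updated from their diff.
    void processCommit(Map<String, String> before, Map<String, String> after) {
        for (Map.Entry<String, String> entry : before.entrySet()) {
            String path = entry.getKey();
            if (!entry.getValue().equals(after.get(path))) {
                remove(entry.getValue(), path); // node deleted or value changed
            }
        }
        for (Map.Entry<String, String> entry : after.entrySet()) {
            String path = entry.getKey();
            if (!entry.getValue().equals(before.get(path))) {
                add(entry.getValue(), path); // node added or value changed
            }
        }
    }

    // Paths currently indexed under the given value (read-only view).
    Set<String> lookup(String value) {
        Set<String> paths = index.get(value);
        return paths == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(paths);
    }

    private void add(String value, String path) {
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(path);
    }

    private void remove(String value, String path) {
        Set<String> paths = index.get(value);
        if (paths != null) {
            paths.remove(path);
            if (paths.isEmpty()) {
                index.remove(value);
            }
        }
    }
}
```

Only entries that differ between the two snapshots touch the index, so the work per commit is proportional to the size of the change rather than the size of the repository.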
Searching – the QueryIndex API
• Filter filter = … ; // "select * from [nt:folder]"
• filter.restrictPath("/somenode", Filter.PathRestriction.DIRECT_CHILDREN);
• Cursor cursor = queryIndex.query(filter, nodeState); // search against a state
• IndexRow row = cursor.next(); // results
Searching – Filters
• Full text expressions
• Property restrictions
• Path restrictions
  – Exact
  – Parent
  – Child
  – Descendant
• Node type restrictions
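
To make the four path restrictions concrete, the sketch below checks a candidate path against a restricted path. The enum and helper names are hypothetical and chosen to mirror the slide's wording; Oak's own `Filter.PathRestriction` constants are named differently (for example `DIRECT_CHILDREN`).

```java
final class PathRestrictionSketch {

    // Hypothetical enum following the slide's terms, not Oak's Filter.PathRestriction.
    enum Restriction { EXACT, PARENT, CHILD, DESCENDANT }

    // Parent of an absolute path, or null for the root.
    static String parentOf(String path) {
        if (path.equals("/")) {
            return null;
        }
        int lastSlash = path.lastIndexOf('/');
        return lastSlash == 0 ? "/" : path.substring(0, lastSlash);
    }

    // True if candidate lies strictly below base, at any depth.
    static boolean isDescendant(String base, String candidate) {
        if (base.equals(candidate)) {
            return false;
        }
        String prefix = base.equals("/") ? "/" : base + "/";
        return candidate.startsWith(prefix);
    }

    // Does candidate satisfy the restriction relative to the restricted path base?
    static boolean matches(Restriction restriction, String base, String candidate) {
        switch (restriction) {
            case EXACT:
                return candidate.equals(base);
            case PARENT:
                return candidate.equals(parentOf(base));
            case CHILD:
                return base.equals(parentOf(candidate));
            case DESCENDANT:
                return isDescendant(base, candidate);
            default:
                throw new IllegalArgumentException("Unknown restriction: " + restriction);
        }
    }
}
```

Appending "/" before the prefix test keeps a sibling such as /somenode2 from being mistaken for a descendant of /somenode.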
Configuring indexes
• Indexes are declared by adding “query index configuration” nodes in the repository
  – Type
  – Asynchronous
  – Reindex
  – Index-specific properties
In-repository indexes
• Data structures designed as content
  – Property index
  – Ordered property index
  – Node type index
  – Reference index
Lucene index
• Full text and (sorted) property restrictions
• Stored in repository
• Tika for indexing binaries
• Configurable indexing rules (boost), codec, analyzers
19/11/14 10
Lucene index
• Interesting facts
  – DocValues for sorted property restrictions
  – Uncompressed stored fields
  – Property exists queries
    • TermRange vs Wildcard vs Term vs MatchAll + FieldExistsFilter
Solr index
• Full text, property, path restrictions
• Embedded or remote Solr(Cloud)
• Configurable
  – Mapping restriction / fields
  – Page size
  – Commit policy
• Most settings are configured on the Solr side
Problems
• Hard to express complex queries
• Cannot leverage the underlying indexes' advanced capabilities
Native language support
• Leverage underlying index capabilities
  – Multiple query languages/parsers
• More accurate full text queries (and results)
  – … where native('lucene', 'name:(hello world) "hello world"^3')
• Advanced index capabilities (e.g. MLT)
  – … where native('solr', 'mlt?q=path:/content/sample1&mlt.fl=jcr:title')
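
A native expression is passed as a string literal inside the SQL-2 statement, so any single quote it contains has to be doubled. The helper below is a minimal sketch of assembling such a statement; the class `NativeQuerySketch`, its methods, and the `select * from [...]` prefix are illustrative assumptions, since the slide elides the first part of the query.

```java
final class NativeQuerySketch {

    // Wrap a value in single quotes, doubling embedded single quotes as SQL-2
    // string literals require.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Build a SQL-2 statement with a native(...) condition. The select clause and
    // node type are illustrative; only the native(...) part comes from the slide.
    static String nativeQuery(String nodeType, String language, String nativeExpression) {
        return "select * from [" + nodeType + "] where native("
                + quote(language) + ", " + quote(nativeExpression) + ")";
    }

    public static void main(String[] args) {
        // Lucene query syntax passed straight through to the Lucene index.
        System.out.println(nativeQuery("nt:base", "lucene",
                "name:(hello world) \"hello world\"^3"));
        // Solr "more like this" request passed straight through to the Solr index.
        System.out.println(nativeQuery("nt:base", "solr",
                "mlt?q=path:/content/sample1&mlt.fl=jcr:title"));
    }
}
```

The resulting string would then be handed to the JCR query manager with the SQL-2 language selected; that step is omitted here because it needs a running repository.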
Adding more indexes
• Create an IndexEditor
  – Turn a diff into an “indexable”
• Create a QueryIndex
  – Turn a Filter into an index-specific query
• “Declare” the index
Looking forward
• Results aggregation features (e.g. facets)
• More configuration options (Lucene, Solr)
• Smarter index selection
• Cover indexes
Thanks