Date: 21 August 2010 | Enterprise Search, EAI, Semantic Web | Open Source Search & Retrieval Platform | Marc Teutelink

Open source enterprise search and retrieval platform

DESCRIPTION

This technical talk describes the use of Apache-based open source software (Apache Lucene/SOLR, Apache Nutch, Apache Tika and Apache ServiceMix) in the implementation of an enterprise search and retrieval platform. The platform is the result of years of experience with enterprise search technologies combined with enterprise application integration and semantic (web) technologies, in both commercial and open source environments. The talk dives into the conceptual architecture of a typical search solution, based on a real-world use case, and then presents the accompanying framework that makes easy and swift implementation of enterprise search solutions possible on top of this architecture. The architecture describes an innovative enterprise search solution, specifying all components necessary for collecting and indexing content (the collection process, which consists of inbound, splitting, validating, filtering, enriching and indexing components) and for publishing the content (the publication process, which consists of inbound, validating, request enriching, searching, grouping, response enriching and presentation components). The framework can be seen as an orchestrator and contains all tools, components, default configurations and flow descriptions necessary to build enterprise solutions according to this architecture. It is based entirely on open source technologies, most of them Apache projects.

Page 1: Open source enterprise search and retrieval platform

Date: 21 August 2010

Enterprise Search, EAI, Semantic Web

Open Source Search & Retrieval Platform

Marc Teutelink

Page 2: Open source enterprise search and retrieval platform

How Apache open source software is used during the implementation of an Enterprise Search and Retrieval Platform (Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)

Page 3: Open source enterprise search and retrieval platform

Marc Teutelink [email protected] @mteutelink

• Software architect at Luminis
• 15+ years of experience in software development; specialized in Enterprise Search, Enterprise Application Integration and Semantic Web technology
• Currently writing “Enterprise Search in Action” for Manning (mid-2011)

Page 4: Open source enterprise search and retrieval platform

Agenda

• Enterprise Search
  • What is Enterprise Search: functions and features
  • Challenges
  • Logical architecture
• Enterprise Search Solution
  • Technology stack
  • Collection process
  • Publication process
  • Enricher framework
  • Deployment
• Conclusion

Page 5: Open source enterprise search and retrieval platform

What is Enterprise Search?

“Enterprise Search offers a solution for searching, finding and presenting enterprise-related information in the broadest sense of the word”

Enterprise search is all about searching through documents of any type and format, from any source, located anywhere, with the utmost flexibility

• Web search: limited to public documents on the web
• Desktop search: limited to private documents on the local machine
• Enterprise search: no limitations on document type and location

Page 6: Open source enterprise search and retrieval platform

Enterprise Search (features)

• Information sources and types
  • Wide range of sources: local and remote filesystems, content repositories, e-mail, databases, internet, intranet and extranet
  • Type not limited: any type, ranging from structured to unstructured data, text and binary formats, and compound formats (zip)
• Usage
  • Not limited to interactive use: also automated business processes
• Security
  • Integration with the enterprise security infrastructure
• User interaction and personalization
  • Identity enables more personalized search results

Page 7: Open source enterprise search and retrieval platform

Enterprise Search (features)

• Extended metadata
  • More metadata: better and more precise search results
  • More control over the schema (for example dynamic fields)
• Ranking
  • More control over ranking: personalized ranking (group)
• Data extraction and derivation
  • Extract data using various techniques: XPath, XQuery
  • Derive data using external knowledge models: RDBMS, RDF store, web services
  • Conditional extraction & derivation
• Managing and monitoring
  • On-the-fly management (JMX)
  • Real-time monitoring

Page 8: Open source enterprise search and retrieval platform

Enterprise Search (features)

• User interfaces
  • Web search
    • All about selling advertisements to the masses
    • Generic & minimalistic screens; focus on ads
  • Enterprise search
    • All about finding: rich navigation; focus on quick find
    • Small targeted audience
    • Specialized and customized screens (use of ontologies, taxonomies and classifications)
    • Use of identity (results customized to the user) and Web 2.0
    • Grouping: field collapsing, faceted search & clustering

Page 9: Open source enterprise search and retrieval platform

Enterprise Search (Challenges)

• Performance and scalability
• Rich functions and features
• Manageability
• Flexibility
• Easy maintenance
• Quick issue and problem solving
• Reduce total cost of ownership

Page 10: Open source enterprise search and retrieval platform

Enterprise Search (Challenges)

Commercial Search Engines?

Page 11: Open source enterprise search and retrieval platform

Enterprise Search (Challenges)

Apache Based (Open Source) Search & Retrieval Platform

Page 12: Open source enterprise search and retrieval platform

Enterprise Search (Logical Architecture)

[Diagram: the logical architecture, consisting of two processes around the search engine.]

Collection Process (from Sources to the Search Engine)
• Content Inbound: pull (crawling), pull (harvesting), push (SOAP/REST)
• Content Validation: syntactic, semantic
• Content Enrichment: extraction, enhancement, filtering
• Indexing

Publication Process (between Actors and the Search Engine)
• Request Inbound: HTTP GET (URL), HTTP POST (SOAP/REST), API (Java, Perl, ...)
• Request Validation: syntactic, semantic
• Request Enrichment: redirection (suggestions), enhancement (add/remove clauses)
• Searching & Ordering: filtering, grouping, sorting
• Response Enrichment: redirection (more like this), enhancement (metadata, editorial)
• Response Outbound: stateless (XSLT, SolrJS), stateful (webapp framework)

Page 13: Open source enterprise search and retrieval platform

Enterprise Search (Collection Process)

Sources
• Any document format
• Any type
  • Structured and unstructured
  • Textual and binary
  • Compound
• Residing anywhere
• Security

Page 14: Open source enterprise search and retrieval platform

Enterprise Search (Collection Process)

Content Inbound
• Pull (crawling/spidering)
  • Internet, intranet & extranet
  • Local and remote filesystems
• Pull (harvesting)
  • Databases
  • Content repositories / management systems
  • Web services inbound
• Push
  • Web services (SOAP/REST)
  • Real-time indexing

Page 15: Open source enterprise search and retrieval platform

Enterprise Search (Collection Process)

Content Validation
• Syntactic validation
  • Based on DTD / XML Schema
  • Structure and limited content
• Semantic validation
  • Based on algorithms: Groovy, XPath, regex, …
• Think about exception handling
• Placed anywhere in the flow
  • During inbound: XML Schema validation
  • After enrichment: validate derived metadata
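The two validation stages can be illustrated with a small Python sketch. It is not the platform's code: the required elements (`title`, `url`), the URL rule and the function names are assumptions made for illustration. A well-formedness check stands in for DTD/XML Schema validation, since the Python standard library has no schema validator. Rejected messages are collected separately, in the spirit of an invalid message channel.

```python
import re
import xml.etree.ElementTree as ET

REQUIRED = ("title", "url")                      # hypothetical required elements
URL_RE = re.compile(r"^(https?|file)://\S+$")    # hypothetical semantic rule


def syntactic_errors(xml_text):
    """Structure check: well-formed XML that contains the required elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return ["not well-formed: %s" % exc]
    return ["missing <%s>" % name for name in REQUIRED if root.find(name) is None]


def semantic_errors(xml_text):
    """Content check: the url element must hold an absolute URL."""
    root = ET.fromstring(xml_text)
    url = (root.findtext("url") or "").strip()
    return [] if URL_RE.match(url) else ["url is not an absolute http(s)/file URL"]


def validate(messages):
    """Return (valid messages, [(invalid message, errors), ...])."""
    valid, invalid = [], []
    for msg in messages:
        # Semantic validation only runs when the syntactic stage found no errors.
        errors = syntactic_errors(msg) or semantic_errors(msg)
        if errors:
            invalid.append((msg, errors))
        else:
            valid.append(msg)
    return valid, invalid
```

Keeping the rejected messages together with their error lists makes the "think about exception handling" point concrete: nothing is dropped silently, and the invalid list can be inspected or reprocessed later.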

Page 16: Open source enterprise search and retrieval platform

Enterprise Search (Collection Process)

Content Enrichment
• Extraction
  • Metadata
  • Content (free text of document)
• Enhancing
  • Derive new and alter existing metadata
• Filtering
  • Remove (parts of) metadata
• Leverage external knowledge models
• Conditional enrichment

Page 17: Open source enterprise search and retrieval platform

Enterprise Search (Collection Process)

Indexing
• Store in search engine(s)
  • Content-based routing
  • Document boosting
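Content-based routing and document boosting can be sketched in a few lines of Python. Everything named here is hypothetical: the index names, the `type` and `source` fields and the boost value are illustrative assumptions, not part of the platform described in the talk.

```python
def route(document, routes, default="general"):
    """Return the name of the first index whose predicate accepts the document."""
    for index_name, predicate in routes:
        if predicate(document):
            return index_name
    return default


def boost_for(document):
    """Give documents from a trusted source a higher index-time boost."""
    return 2.0 if document.get("source") == "intranet" else 1.0


# Hypothetical routing table: (target index, predicate on the document).
ROUTES = [
    ("people", lambda d: d.get("type") == "person"),
    ("news", lambda d: d.get("type") == "news"),
]
```

For example, `route({"type": "news"}, ROUTES)` selects the `news` index, while a document without a matching type falls through to the default index.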

Page 18: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Request Inbound
• HTTP/GET
  • URL based with parameters
  • Response in XML, JSON, …
• HTTP/POST
  • XML (SOAP, REST) request
  • XML (SOAP, REST) response
• API
  • Java, Perl, …
  • Wrappers on HTTP/GET
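A URL-based request with parameters might be assembled as in the following Python sketch. The parameter names `q`, `fq`, `rows` and `wt` are standard Solr query parameters; the base URL and the helper name `build_search_url` are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, filters=(), rows=10, fmt="json"):
    """Build an HTTP GET search URL; 'wt' selects the response format."""
    params = [("q", query), ("rows", str(rows)), ("wt", fmt)]
    params += [("fq", f) for f in filters]   # one fq parameter per filter query
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint of a local Solr instance.
url = build_search_url("http://localhost:8983/solr/select",
                       "enterprise search", filters=["type:news"])
```

An API wrapper in Java, Perl or another language would typically do the same thing: build such a URL, issue the GET request and parse the XML or JSON response.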

Page 19: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Request Validation
• Syntactic validation
  • Correct query syntax?
• Semantic validation
  • Correct field filters?
  • Based on algorithms: Groovy, regex
• Placed anywhere in the flow
  • At inbound: XML Schema validation
  • At enrichment: validate derived request clauses

Page 20: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Request Enrichment
• Redirection
  • Spelling suggestions
  • Metadata suggestions
• Enhancing
  • Add/remove clauses
  • Stemming, synonyms, stop words
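As an illustration of request enhancement, the sketch below removes stop words and adds synonym clauses to a query. It is a simplified stand-in with hypothetical word lists and a hypothetical function name; in Solr itself, stop words and synonyms are normally handled by analysis filters rather than by rewriting the query string.

```python
STOP_WORDS = {"the", "a", "of"}                                # hypothetical list
SYNONYMS = {"car": ["automobile"], "tv": ["television"]}       # hypothetical list


def enrich_query(query):
    """Drop stop words (remove clauses) and OR in synonyms (add clauses)."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue
        variants = [term] + SYNONYMS.get(term, [])
        if len(variants) > 1:
            clauses.append("(" + " OR ".join(variants) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)
```

With these lists, the query "The price of a car" becomes `price AND (car OR automobile)`.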

Page 21: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Searching & Ordering
• Filtering
  • Field search
• Grouping
  • Add group information
  • Field collapsing, faceted search & clustering
• Sorting
  • Sort on field
  • Ranking
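Two of the grouping techniques can be shown on an in-memory result list. This is a conceptual Python sketch with hypothetical field names, not how a search engine computes them internally: faceting counts results per field value, and field collapsing keeps one representative result per field value.

```python
from collections import Counter


def facet_counts(docs, field):
    """Faceted search: number of results per value of the given field."""
    return Counter(d[field] for d in docs if field in d)


def collapse(docs, field):
    """Field collapsing: keep the first (highest ranked) result per field value."""
    seen = {}
    for d in docs:
        seen.setdefault(d.get(field), d)   # later duplicates are ignored
    return list(seen.values())


# Hypothetical ranked result list.
results = [
    {"id": 1, "site": "intranet", "type": "news"},
    {"id": 2, "site": "intranet", "type": "person"},
    {"id": 3, "site": "extranet", "type": "news"},
]
```

Here `facet_counts(results, "type")` reports two news items and one person, and `collapse(results, "site")` keeps results 1 and 3.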

Page 22: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Response Enrichment
• Redirection
  • Suggestions
  • More like this
• Enhancing
  • Add/remove response fields
  • Schema information
  • Editorial information

Page 23: Open source enterprise search and retrieval platform

Enterprise Search (Publication Process)

Response Outbound
• Stateless
  • No security
  • XSLT, SolrJS
• Stateful
  • Security
  • Web 2.0
  • Web application framework

Page 24: Open source enterprise search and retrieval platform

Technology Stack (Collection Process)

• Use an ESB for the flow: Apache ServiceMix with Camel
  • Leverage standard ESB components (transformers, validation, splitter, filter, routers, scripting)
  • Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
  • Custom: crawler based on Apache Nutch
    • Leverage only the crawl framework
    • Extend NutchIndexWriter, asynchronously pushing crawled documents back into the ESB flow (reply-to)
• The ESB makes a distributed flow possible: content-based routing
• Hot deploy: easy maintenance
• Reusing services across collection processes
• Search engine independent

Page 25: Open source enterprise search and retrieval platform

Collection Process Flow

[Diagram: the collection process flow expressed in enterprise integration patterns. Components shown:]

• Content Inbound: Push Inbound (Message Endpoint), Channel Transformer (Message Translator)
• Content Validation: Syntactic Validation (Channel Purger), Semantic Validation (Channel Purger), each with an Invalid Message Channel
• Splitter: 1 documents message is split into N document messages
• Content Enrichment: Content Enricher, Content Filter
• Content Indexer: HTTP Transport (Channel Adapter), SOLR document message, Lucene/SOLR (SolrJ), Lucene/Solr index

Page 26: Open source enterprise search and retrieval platform

Technology Stack(Publication Process)

• Use the flow from Apache Lucene/Solr
  • Leverage standard Solr components (synonyms, stop words, stemming, MLT, spelling, faceted search, …)
  • Custom components, using Solr’s extensibility framework
    • Security: authority field in the schema with Apache Shiro integration
    • Field filters (zip code, …)
• User interfaces
  • Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
  • Stateful: Apache Wicket with Spring

Page 27: Open source enterprise search and retrieval platform

Enterprise Search (Logical Architecture)

[Diagram: the logical architecture repeated, showing the collection process (content inbound, content validation, content enrichment, indexing) and the publication process (request inbound, request validation, request enrichment, searching & ordering, response enrichment, response outbound) around the search engine.]

Page 28: Open source enterprise search and retrieval platform

Enterprise Search (Logical Architecture)

[Diagram: the logical architecture annotated with the technologies used: Lucene/SOLR, ServiceMix/Camel, Nutch, Apache Wicket and SolrJS/XSLT.]

Page 29: Open source enterprise search and retrieval platform

Enterprise Search (Logical Architecture)

[Diagram: the logical architecture with the Luminis Enricher Framework highlighted.]

Page 30: Open source enterprise search and retrieval platform

Luminis Enricher Framework

• Custom enricher framework
  • Existing ESB & SOLR enricher capabilities are not sufficient
  • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields
  • Same enricher to be used for:
    • Collection process: document enriching, filtering & splitting
    • Publication process:
      • Search requests: “first-components” SearchComponent
      • Search responses: “last-components” SearchComponent

Page 31: Open source enterprise search and retrieval platform

Luminis Enricher Framework

[Diagram: the enricher within the collection process flow. Components shown:]

• Content Inbound: Push Inbound (Message Endpoint)
• Content Validation: Syntactic Validation (Channel Purger), Semantic Validation (Channel Purger), each with an Invalid Message Channel
• Splitter: 1 documents message is split into N document messages
• Content Enrichment: Enricher (Content Enricher, Content Filter)
• Content Indexer: SOLR Indexer (Channel Adapter), SOLR document message, Lucene/SOLR (SolrJ), Lucene/Solr index

Page 32: Open source enterprise search and retrieval platform

Luminis Enricher Framework

[Diagram: the enricher within the publication process, next to the collection process flow of the previous slide. Components shown:]

• Query (SOLRQueryRequest)
• RequestHandler (SearchHandler) with three component lists: "first-components", "components" and "last-components"
• SearchComponents: query, facet, mlt, highlight, stats, debug
• Response (XML)
• XSLTResponseWriter (QueryResponseWriter) with the XML2HTML stylesheet (XSLT)
• Result ((X)HTML)

Page 33: Open source enterprise search and retrieval platform

Luminis Enricher Framework(architecture)

• Pipe-and-filter architecture
  • Documents flow through a series of actions
  • Output from one action is input to the next action
  • Fields from the input document can be used in an action’s clauses: values in expressions are filled in by replacing Velocity-style patterns (${field}) with field values
• Conditional flows supported
• Reuse of flows & subflows supported
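The pipe-and-filter idea and the ${field} substitution can be sketched in Python as follows. This is not the Luminis implementation: the action helpers, the dictionary-of-lists document model and the rule that a pattern is replaced by the first value of a field are all assumptions made for illustration.

```python
import re

PATTERN = re.compile(r"\$\{(\w+)\}")   # matches Velocity-style ${field} patterns


def fill(expression, document):
    """Replace each ${field} with the first value of that field."""
    return PATTERN.sub(lambda m: str(document[m.group(1)][0]), expression)


def remove_value(field, value):
    """Action: remove one value from a field."""
    def action(document):
        out = {k: list(v) for k, v in document.items()}    # never mutate the input
        out[field] = [v for v in out.get(field, []) if v != value]
        return out
    return action


def add_from_lookup(field, expression, lookup):
    """Action: fill the expression from the document, then add the looked-up value."""
    def action(document):
        out = {k: list(v) for k, v in document.items()}
        out.setdefault(field, []).append(lookup(fill(expression, document)))
        return out
    return action


def run(document, actions):
    """Pipe: the output of one action is the input of the next."""
    for action in actions:
        document = action(document)
    return document
```

A flow is then just a list of actions, for example `run(doc, [remove_value("A", "A2"), add_from_lookup("C", "select C where B=${B}", lookup)])`, where `lookup` is any callable that resolves the filled-in expression against an external knowledge model.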

Page 34: Open source enterprise search and retrieval platform

Luminis Enricher Framework(architecture)

[Diagram: an example flow.]

• Input: Document [[A1,A2],[B]]
• Action (remove A2) produces Document [[A1],[B]]
• The condition If [B=3] (YES/NO) selects between Action (select C where ${B}) and Action (select C where ${A})
• The resulting documents are Document [[A1],[B],[C1]] and Document [[A1],[B],[C2]]

Page 35: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Configuration)

• Enricher flow and expression configuration via an XML-based DSL
  • Conditional: if-then-else & switch-case-else (with regex support)
  • Actions: add & remove fields and field values using expressions
  • Expression handlers currently supported:
    • Field
    • Function (execute methods via Java reflection)
    • HttpClient (retrieve content by URL described by field values)
    • XSLT, XPath, XQuery (external XML databases)
    • JDBC
    • SPARQL (OpenRDF)
    • Apache Lucene/Solr
    • Apache Tika (metadata and text extraction)

Page 36: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Examples)

<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <field name="b">BB2</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <multivalue-field name="c">CC2</multivalue-field>
  <if test="field::c" pattern="CC2">
    <then>
      <field name="e">EE1</field>
    </then>
  </if>
  <if test="field::a">
    <then>
      <field name="f">FF1</field>
    </then>
  </if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>
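To make the Field example easier to follow, here is a minimal Python interpreter for the elements it uses. The semantics are assumptions, since the slides do not spell them out: `field` overwrites a value, `multivalue-field` appends, an `if` without `pattern` tests that the field exists, and `pattern` must match a whole field value. The real framework may behave differently.

```python
import re
import xml.etree.ElementTree as ET

CONFIG = """
<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <field name="b">BB2</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <multivalue-field name="c">CC2</multivalue-field>
  <if test="field::c" pattern="CC2"><then><field name="e">EE1</field></then></if>
  <if test="field::a"><then><field name="f">FF1</field></then></if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>
"""


def condition_holds(document, element):
    """Evaluate test="field::<name>" with an optional regex pattern."""
    name = element.get("test").split("::", 1)[1]
    values = document.get(name, [])
    pattern = element.get("pattern")
    if pattern is None:
        return bool(values)                       # assumed: field exists
    return any(re.fullmatch(pattern, v) for v in values)


def apply_actions(document, node):
    """Apply the child actions of an enricher (or then) element in order."""
    for el in node:
        if el.tag == "field":
            document[el.get("name")] = [el.text or ""]          # assumed: overwrite
        elif el.tag == "multivalue-field":
            document.setdefault(el.get("name"), []).append(el.text or "")
        elif el.tag == "if":
            then = el.find("then")
            if then is not None and condition_holds(document, el):
                apply_actions(document, then)
        elif el.tag == "rename-field":
            if el.get("name") in document:
                document[el.text] = document.pop(el.get("name"))
        elif el.tag == "remove-field":
            document.pop(el.get("name"), None)
    return document


enriched = apply_actions({}, ET.fromstring(CONFIG))
```

Under these assumptions the enriched document ends up with the fields c (CC1, CC2), e (EE1), f (FF1) and d (BB2): field a is removed at the end, and b has been renamed to d.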

Page 37: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Examples)

<enricher name="XPath"
          xmlns:str="http://exslt.org/strings"
          xmlns:fn="http://www.w3.org/2005/xpath-functions"
          xmlns:html="http://www.w3.org/1999/xhtml">
  <field name="Description" expression-type="xpath">
    //html:meta[@name='DC.description']/@content
  </field>
  <multivalue-field name="Type" expression-type="xpath">
    //html:meta[@name='DC.type' and
      (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
       @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
       @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')]/@content
  </multivalue-field>
  <field name="publisher" expression-type="xpath">
    fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
  </field>
  <field name="publisher" expression-type="xpath">
    fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
              //html:meta[@name='DC.creator']/@content)
  </field>
</enricher>

Page 38: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Examples)

<enricher name="SPARQL">
  <field name="place">http://www.my.com/#channels</field>
  <field expression-type="sparql" repository="TESTRDF">
    <![CDATA[
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?definition
      WHERE { ?${place} skos:definition ?definition. }
    ]]>
  </field>
</enricher>

Page 39: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Examples)

<enricher name="HttpAndTika">
  <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
  <field expression-type="http" name="content.file">field:content.url</field>
  <field name="auteur" source="field::content.file">xpath://H1</field>
  <multivalue-field expression-type="tika.meta" source="field::content.file"/>
  <field name="content" expression-type="tika.text" source="field::content.file"/>
  <switch test="field::content.url">
    <case pattern=".*\.rijksweb\.nl.*"><field name="source">Rijksweb</field></case>
    <case pattern=".*\.deventer\.nl.*"><field name="source">Gemeente Deventer</field></case>
    <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
    <else><field name="source">Overige</field></else>
  </switch>
</enricher>
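The switch-case-else construct in the last example amounts to trying regular expressions in order and falling back to a default. A Python sketch of that logic, reusing the patterns and values from the example, is given below. It assumes that a pattern has to match the whole field value, which the slides do not state, and the function name is hypothetical.

```python
import re

# (pattern, value for the "source" field), tried in document order.
CASES = [
    (r".*\.rijksweb\.nl.*", "Rijksweb"),
    (r".*\.deventer\.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]


def classify(url, cases=CASES, default="Overige"):
    """Return the value of the first case whose pattern matches, else the default."""
    for pattern, source in cases:
        if re.fullmatch(pattern, url):   # assumed: whole-value match
            return source
    return default
```

With the URL used in the example, `classify("http://na.apachecon.com/c/acna2010/speakers/501")` falls through to the else branch and yields "Overige".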

Page 40: Open source enterprise search and retrieval platform

Luminis Enricher Framework(Technology)

• Enricher and expression handlers are Java-based OSGi services:
  • Hot pluggable and updatable
  • Flow and expression configuration changes: no restart
  • Extensible: new expression handlers are immediately available in actions after installing the OSGi bundle
• Runs in Apache Felix
  • Collection process: ServiceMix contains an OSGi container
  • Publication process: custom OSGi loader for Lucene/Solr
• Centralized & transactional provisioning (Apache Ace) of components & configuration

Page 41: Open source enterprise search and retrieval platform

Deployment Architecture

[Diagram: deployment architecture. Nodes shown:]

• Firewall and HTTP load balancer (devices)
• Slave publication servers (Slave1 and Slave2), each an Apache Tomcat container with Lucene/SOLR (Apache), Wicket (Apache), Enricher (Luminis) and Felix OSGi (Apache); configuration: SOLR schema.xml, SOLR solrconfig.xml, Luminis Enricher.xml
• Master collection server, an Apache Tomcat container with ServiceMix (Apache), Nutch (Apache), Tika (Apache), Lucene/SOLR (Apache), Enricher (Luminis) and Felix OSGi (Apache); configuration: SOLR solrconfig.xml, SOLR schema.xml, Luminis Enricher.xml, ServiceMix config.xml
• Knowledge models: an SQL database and an RDF triple store (OpenRDF)
• Deployment server: Ace (Apache) on Felix OSGi (Apache)
• Connections are labelled HTTP, HTTP/REST, JDBC and PROVISIONING

Page 42: Open source enterprise search and retrieval platform

Conclusions

• An enterprise search solution is not Google search
• Open source paves the way, but misses some ingredients
  • Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel, Wicket, MySQL, OpenRDF, Felix/Ace
  • Missing ingredients: Enricher
• Interesting developments:
  • Apache Chemistry (CMIS)
  • Apache Clerezza
  • Apache Nutch
  • Apache Connectors Framework (ManifoldCF)

Page 43: Open source enterprise search and retrieval platform

Questions & (answers?)

Marc Teutelink

[email protected]

@mteutelink

MEAP December 2010