#SummitNow Super Size Your Search 6th November 2013 Piergiorgio Lucidi (Sourcesense) Fran Alvarez (Zaizi)

Super Size Your Search


DESCRIPTION

As organisations store more and more information in their Alfresco content hubs, search and discovery of content become increasingly important. Alfresco comes bundled with Apache Lucene and Apache Solr for search. Although these provide full-text capabilities, they do not have the scalability and functionality of newer, cloud-scalable search software such as Apache Solr Cloud 4, Elastic Search and Amazon Cloud Search. Searching across multiple Alfresco instances, including Alfresco Cloud, is also quite a challenge, and none of the existing approaches is good enough to be production ready. This talk shows you how to index and search content stored in one or more Alfresco repositories, other CMIS repositories or file systems using Apache Solr Cloud 4, Elastic Search or Amazon Cloud Search, while still ensuring the confidentiality of the documents based on the permissions configured in Alfresco or any other repository.


Page 1: Super Size Your Search

Super Size Your Search
6th November 2013
Piergiorgio Lucidi (Sourcesense)
Fran Alvarez (Zaizi)

Page 2: Super Size Your Search

Piergiorgio Lucidi
• Open Source ECM Specialist at Sourcesense
• Alfresco Certified Trainer / Engineer
• Alfresco Wiki Gardener / Community Star
• Alfresco forum supporter
• Global Moderator of the Italian forum
• Author and Technical Reviewer at Packt
• PMC Member and Mentor at ASF
• Project Leader in the JBoss Community

Page 3: Super Size Your Search

Overview
How to build and manage your search server:
1. Scenario
2. Introducing Apache ManifoldCF
3. Zaizi Integrated Search Solution

Page 4: Super Size Your Search

Scenario
An overview of the typical complex search architecture

Page 5: Super Size Your Search

Scenario - Alfresco limitations
Alfresco supports these search engines:
• Apache Lucene (embedded)
• Apache Solr (provided by Alfresco)
• Development is needed if other repositories must be involved
Every other approach must be implemented (ScheduledActions, WebScripts, etc.)

Page 6: Super Size Your Search

Scenario – Embedded

Simple Search Architecture

Alfresco is the only repository involved in the architecture, using the embedded search engine:
• the repository must take care of the indexes and also manage index transactions

[Diagram: front-end applications, Alfresco, Apache Lucene, indexes]

Page 7: Super Size Your Search

Scenario – Embedded - Cluster

Embedded

Not easy to scale out with Lucene:
1. Every cluster node must have its own search indexes
2. The cluster must synchronize the indexes

[Diagram: two Alfresco nodes, each with Apache Lucene and its own indexes, connected through JGroups]

Page 8: Super Size Your Search

Scenario – Simple Architecture

Simple search architecture

Alfresco is the only repository involved in the architecture, with an external search server:
1. The search server can be used to publish contents in the front-end architecture
2. The repository stays in the logical back end

[Diagram: Alfresco, search engine with its indexes, front-end applications]

Page 9: Super Size Your Search

Scenario – Publish with search
A search engine can be used for:
• advanced management of search indexes
• scaling out
• executing complex searches on contents
• publishing contents in the FE architecture

Page 10: Super Size Your Search

Scenario – Publish with search

Publish with search architecture

Alfresco is the only repository involved in the architecture, with an external search server:
1. The search server can be used to publish contents in the front-end architecture (HTML)
2. The repository stays in the logical back end

[Diagram: BackEnd with Alfresco and its Lucene / Solr indexes; FrontEnd with the search engine, its indexes and the front-end applications]


Page 12: Super Size Your Search

Scenario – Complex Architecture
1. Alfresco is only one of the platforms that must be involved in your search architecture
2. You don’t want to increase the development effort
3. You just want something to configure

Page 13: Super Size Your Search

Scenario – Complex Architecture

Architecture with different ECM systems

Alfresco is one of the content platforms that must be involved in the indexing process

[Diagram: Alfresco, SharePoint, FileNet, CMIS, JIRA, Google Drive and DropBox around a search engine with its indexes]

Page 14: Super Size Your Search

Scenario – Complex Architecture

Architecture with different ECM systems

Alfresco is one of the content platforms that must be involved in the indexing process

[Diagram: Alfresco, SharePoint, FileNet, CMIS, JIRA, Google Drive and DropBox around a search engine with its indexes, with a "?" added]


Page 16: Super Size Your Search


Introducing Apache ManifoldCF

Page 17: Super Size Your Search

Apache ManifoldCF - History
The ManifoldCF code base was granted by MetaCarta to the Apache Software Foundation in December 2009.

The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments.

The project graduated as an Apache Top Level Project in July 2012.

Page 18: Super Size Your Search

Apache ManifoldCF – What is it?
Open Source crawler
• crawling model (add, change, delete)
• schedule jobs to create indexes
• get contents from repositories
• push contents to search servers
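
The add / change / delete crawling model can be pictured with a small sketch. This is not ManifoldCF code: the function name `diff_crawl` and the idea of comparing two `{document_id: version}` snapshots are assumptions made only for illustration.

```python
def diff_crawl(previous, current):
    """Compare two {document_id: version} snapshots of a repository.

    Returns the ids that were added, changed or deleted since the
    previous crawl. Both arguments are plain dicts (hypothetical model).
    """
    added = sorted(doc for doc in current if doc not in previous)
    deleted = sorted(doc for doc in previous if doc not in current)
    changed = sorted(
        doc for doc in current
        if doc in previous and current[doc] != previous[doc]
    )
    return {"add": added, "change": changed, "delete": deleted}


previous = {"a.pdf": "v1", "b.doc": "v1", "c.txt": "v3"}
current = {"a.pdf": "v2", "c.txt": "v3", "d.odt": "v1"}
result = diff_crawl(previous, current)
print(result)
```

Each of the three lists would then drive a different action on the search server: index the new documents, re-index the changed ones and remove the deleted ones.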

Page 19: Super Size Your Search

Apache ManifoldCF – What is it?

[Diagram: Repositories 1 to 4 on one side, Apache ManifoldCF in the middle, Search Servers 1 to 4 on the other side]

Page 20: Super Size Your Search

Apache ManifoldCF – What is it?
Out of the box it is distributed as a webapp:
• REST API
• Authority Service
• ACL indexes
• Crawler UI
It can be embedded in any Java application

Page 21: Super Size Your Search

Apache ManifoldCF – Why?
• Reliability
• Incremental
• Flexible
• Multi repositories
• Security model
• Monitoring

Page 22: Super Size Your Search

ManifoldCF – Why? - Reliability
Job scheduling and configuration are stored in the database to maintain the state of all executions

[Diagram: Repositories 1 to 4, Apache ManifoldCF (Pull Agent Daemon and Database), Search Servers 1 to 4]
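
A minimal sketch of why keeping job state in a database helps reliability: documents that were not finished before a stop are still marked as pending afterwards. The table layout, the function names and the use of SQLite are assumptions for illustration only; they do not reflect ManifoldCF's actual schema.

```python
import sqlite3


def open_state(path=":memory:"):
    # SQLite stands in for the crawler database in this sketch.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_queue ("
        "job TEXT, doc_id TEXT, status TEXT, PRIMARY KEY (job, doc_id))"
    )
    return conn


def enqueue(conn, job, doc_ids):
    # Documents already known to the job are left untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO job_queue VALUES (?, ?, 'pending')",
        [(job, doc_id) for doc_id in doc_ids],
    )
    conn.commit()


def mark_done(conn, job, doc_id):
    conn.execute(
        "UPDATE job_queue SET status = 'done' WHERE job = ? AND doc_id = ?",
        (job, doc_id),
    )
    conn.commit()


def pending(conn, job):
    rows = conn.execute(
        "SELECT doc_id FROM job_queue "
        "WHERE job = ? AND status = 'pending' ORDER BY doc_id",
        (job,),
    )
    return [row[0] for row in rows]


conn = open_state()
enqueue(conn, "alfresco-to-solr", ["doc1", "doc2", "doc3"])
mark_done(conn, "alfresco-to-solr", "doc1")
# After a restart only the unfinished documents would be picked up again.
remaining = pending(conn, "alfresco-to-solr")
print(remaining)
```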

Page 23: Super Size Your Search

ManifoldCF – Why? - Incremental
Content changesets are obtained from the repository API

[Diagram: Apache ManifoldCF (Pull Agent Daemon and Database) queries Repository 1 and receives complete changesets]
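
The incremental case can be sketched as a query for everything modified after a stored checkpoint. `FakeRepository`, `changes_since` and the integer timestamps are hypothetical stand-ins for a real repository API.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    modified: int  # modification time as a simple integer tick


class FakeRepository:
    """Hypothetical repository that can report complete changesets."""

    def __init__(self, docs):
        self.docs = docs

    def changes_since(self, checkpoint):
        return [doc for doc in self.docs if doc.modified > checkpoint]


def incremental_crawl(repo, checkpoint):
    """Return the changed document ids and the checkpoint for the next run."""
    changed = repo.changes_since(checkpoint)
    new_checkpoint = max((doc.modified for doc in changed), default=checkpoint)
    return [doc.doc_id for doc in changed], new_checkpoint


repo = FakeRepository([Doc("a", 5), Doc("b", 12), Doc("c", 20)])
ids, checkpoint = incremental_crawl(repo, checkpoint=10)
print(ids, checkpoint)
```

Running the crawl again with the returned checkpoint yields no documents until something in the repository changes.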

Page 24: Super Size Your Search

ManifoldCF – Why? - Flexible
If the repository can't supply all the changes, ManifoldCF can discover them through crawling

[Diagram: Apache ManifoldCF (Pull Agent Daemon and Database) queries the repository, receives incomplete changesets and performs change discovery]

Page 25: Super Size Your Search

ManifoldCF – Why? – Multi repo
Jobs can retrieve contents from the following repositories:
• Google Drive
• Dropbox
• HDFS
• CMIS-compliant
• Alfresco
• IBM FileNet
• EMC Documentum
• Microsoft SharePoint
• OpenText LiveLink
• Autonomy Meridio
• Memex Patriarch
• Windows Share/DFS
• Generic JDBC
• Generic Filesystem
• Generic RSS and Web

Page 26: Super Size Your Search

ManifoldCF – Why? – Multi repo
Jobs can ingest contents to the following search servers:
• Apache Solr
• ElasticSearch
• OpenSearchServer
• MetaCarta GTS

Page 27: Super Size Your Search

ManifoldCF – Why? - Security
Retrieve per-content ACLs

[Diagram: Repositories 1 to 4, Apache ManifoldCF, Search Servers 1 to 4; an Authority Service backed by Authority 1 and Authority 2 provides access tokens]

Page 28: Super Size Your Search

ManifoldCF – Why? - Security
Retrieve per-content ACLs

[Diagram: Repositories 1 to 4, Apache ManifoldCF, Search Servers 1 to 4; an Authority Service backed by Authority 1 and Authority 2 provides user access tokens, which lead to user-specific search results]
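
How access tokens turn into user-specific results can be sketched as a filter over search hits. The field names `allow` and `deny`, the token strings and the function `visible_results` are assumptions for illustration; a real deployment would apply this filter inside the search server.

```python
def visible_results(hits, user_tokens):
    """Keep only the hits the user may see.

    A hit is visible when at least one of its allow tokens matches the
    user's tokens and none of its deny tokens does.
    """
    user_tokens = set(user_tokens)
    visible = []
    for hit in hits:
        if set(hit.get("deny", [])) & user_tokens:
            continue  # an explicit deny always wins
        if set(hit.get("allow", [])) & user_tokens:
            visible.append(hit["id"])
    return visible


hits = [
    {"id": "report.pdf", "allow": ["GROUP_finance"], "deny": []},
    {"id": "memo.doc", "allow": ["GROUP_everyone"], "deny": ["alice"]},
    {"id": "plan.odt", "allow": ["alice", "GROUP_board"], "deny": []},
]
# Tokens as an authority service might return them for the user "alice".
alice_tokens = ["alice", "GROUP_everyone"]
alice_results = visible_results(hits, alice_tokens)
print(alice_results)
```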

Page 29: Super Size Your Search

ManifoldCF – Why? – Monitoring
The Crawler UI allows you to:
• configure jobs and connectors
• monitor job execution
• monitor content ingestion
• status reports
  • document status
  • queue status
• history reports
  • simple history
  • maximum activity
  • maximum bandwidth
  • result histogram

Page 30: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

Page 31: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector

Page 32: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector
• Output Connector

Page 33: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector
• Output Connector
• Authority Connector

Page 34: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector: query to retrieve contents
• Output Connector
• Authority Connector

Page 35: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector: query to retrieve contents
• Output Connector: metadata mapping, content ingestion
• Authority Connector

Page 36: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector: query to retrieve contents
• Output Connector: metadata mapping, content ingestion
• Authority Connector: retrieve content ACEs

Page 37: Super Size Your Search

ManifoldCF – Architecture

[Diagram: Repository, Job, Search Server; ACLs]

• Repository Connector: query to retrieve contents
• Output Connector: metadata mapping, content ingestion
• Authority Connector: retrieve content ACEs
• Job: verbal description, crawling model, scheduling
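
The roles of the connectors can be sketched as follows. The class names, the `fetch` / `ingest` methods and the dictionary layout are hypothetical and chosen for illustration; the real connector interfaces are Java classes with a different shape.

```python
class InMemoryRepositoryConnector:
    """Hypothetical repository connector: queries a source for contents."""

    def __init__(self, documents):
        self.documents = documents

    def fetch(self):
        # A real connector would query the repository API here.
        yield from self.documents


class InMemoryOutputConnector:
    """Hypothetical output connector: maps metadata and ingests content."""

    def __init__(self, field_mapping):
        self.field_mapping = field_mapping
        self.index = {}

    def ingest(self, document):
        # Rename repository fields to the names the search server expects.
        fields = {
            self.field_mapping.get(name, name): value
            for name, value in document["metadata"].items()
        }
        # The ACL travels with the document so it can be enforced at query time.
        fields["allow_tokens"] = list(document["acl"])
        self.index[document["id"]] = fields


def run_job(repository, output):
    """A job ties one repository connector to one output connector."""
    ingested = 0
    for document in repository.fetch():
        output.ingest(document)
        ingested += 1
    return ingested


repository = InMemoryRepositoryConnector([
    {"id": "doc-1", "metadata": {"cm:title": "Budget"}, "acl": ["GROUP_finance"]},
    {"id": "doc-2", "metadata": {"cm:title": "Roadmap"}, "acl": ["GROUP_everyone"]},
])
output = InMemoryOutputConnector({"cm:title": "title"})
count = run_job(repository, output)
print(count, output.index["doc-1"])
```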

Page 38: Super Size Your Search


Who is using ManifoldCF?

Page 39: Super Size Your Search

ManifoldCF - Resources

The project is available at http://manifoldcf.apache.org/

From this website you can access the mailing lists, the documentation and the download links for binaries and source.

Page 40: Super Size Your Search

ManifoldCF – Resources - Book
ManifoldCF in Action by Karl Wright, published by Manning
Karl is the original developer and the principal committer of Apache ManifoldCF
The book is available at http://www.manning.com/wright

Page 41: Super Size Your Search


Zaizi Integrated Search Solution

Page 42: Super Size Your Search

Fran Alvarez
• Director of Zaizi Iberia and Lead Architect
• Alfresco Certified Engineer
• Responsible for large Alfresco architectures
• Semantic Consultant for Sensefy
• Alfresco Meetups Organizer

Page 43: Super Size Your Search

Alfresco + Solr Approach
Quite a good architecture
• Performance issues are solved
• Different architectures depending on business requirements

However…
• It does not cover some use cases or scenarios
• It does not leverage Cloud benefits or the latest technologies
• With huge data volumes there are other approaches

How can we solve the limitations and enhance the benefits?

Page 44: Super Size Your Search

Alfresco + Solr Approach
• Decouples the Search solution from Alfresco
• Allows implementing different Search solutions
• Allows changing the Search solution without changing anything in Alfresco
  • Not even a property!
• Provides an API to integrate it with Alfresco as the search engine
  • Even other repository vendors! E.g. Filesystem, SharePoint, Documentum, FileNet, Drupal…
• And preserves security permissions in the results
  • Alfresco permissions are indexed and used during search

It’s included in our Semantic solution: Sensefy!

Page 45: Super Size Your Search

What we’ve done in Manifold
Repository Connector:
• Alfresco Repository Connector: New implementation
  • Removing the dependency on the Alfresco Solr API
Output Connectors:
• Cloud Search Output Connector: Design & Development
• Elastic Search Output Connector: Improvements
• Solr Cloud Output Connector: Configuration for Alfresco
Authority Connector:
• Alfresco Authority Connector: Design & Development
  • Similar approach to Alfresco Solr
  • ACL reads for Users and Groups in Alfresco

Page 46: Super Size Your Search


Scenarios

Let’s see some examples

Page 47: Super Size Your Search

I: Several Alfresco instances
Current Approach:
• Each Alfresco has its own Search subsystem
• They can’t share indexes
Implications:
• Federated search is not an option
• Results can’t be merged
  • If they were, which result set should come first?
Conclusion
Results could be presented to users in different tabs or “manually” merged. Not the best approach.

Page 48: Super Size Your Search

I: Several Alfresco instances
Zaizi Approach:
• Our solution acts as the search box
  • Which manages a single index
Implications:
• All documents are sent to the same index
• Users can select results from either all Alfresco instances or a subset
Conclusion
Search across repositories
Could be based on Elastic Search, Solr Cloud, Amazon Cloud Search, etc.
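
A minimal sketch of searching one shared index across several Alfresco instances, optionally restricted to a subset. The `source` field, the instance names and the fixed `score` values are assumptions for illustration; a real search server computes relevance itself.

```python
def search(index, term, sources=None):
    """Search one shared index, optionally restricted to some instances."""
    term = term.lower()
    hits = [doc for doc in index if term in doc["text"].lower()]
    if sources is not None:
        hits = [doc for doc in hits if doc["source"] in sources]
    # One index means one ranking, so results from all instances merge naturally.
    hits.sort(key=lambda doc: doc["score"], reverse=True)
    return [(doc["source"], doc["id"]) for doc in hits]


index = [
    {"id": "1", "source": "alfresco-hr", "text": "Travel policy", "score": 0.4},
    {"id": "2", "source": "alfresco-sales", "text": "Sales policy 2013", "score": 0.9},
    {"id": "3", "source": "alfresco-legal", "text": "Policy on contracts", "score": 0.7},
    {"id": "4", "source": "alfresco-hr", "text": "Holiday calendar", "score": 0.8},
]
everything = search(index, "policy")
subset = search(index, "policy", sources={"alfresco-hr", "alfresco-legal"})
print(everything)
print(subset)
```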

Page 49: Super Size Your Search

II: Alfresco + Other data providers
Current Approach:
• Alfresco has its own Search subsystem
• Other repositories may (or may not) have their own Search subsystem
Implications:
• Different data providers mean different formats
  • E.g. Filesystem does not support CMIS
• Alfresco can’t reach external data
Conclusion
No way to merge results and present them uniformly to end users

Page 50: Super Size Your Search

II: Alfresco + Other data providers
Zaizi Approach:
• Both Alfresco and other repositories share a Search subsystem (Manifold)
Implications:
• Results from Alfresco and other providers will have the same format in our solution
  • They will speak ‘our’ language
• Alfresco reaches external data when communicating with our solution
Conclusion
Results are present and accessible across data providers

Page 51: Super Size Your Search

III: Alfresco + O(TB) data
Current Approach:
• Alfresco has its own Search subsystem
• All data is in one Solr instance (or several, if clustered)
Implications:
• Every Solr node manages the whole index
• No chance to apply scaling techniques for indexing:
  • Sharding, Replication…
Conclusion
Huge servers are required and performance might be compromised

Page 52: Super Size Your Search

III: Alfresco + O(TB) data
Zaizi Approach:
• Alfresco uses our solution
• Data is indexed in the search solution that suits best:
  • Amazon Cloud Search, Solr Cloud, Elastic Search…
Implications:
• The cloud search solution manages the index
• Indexing techniques can be applied according to use cases
  • Sharding, Replication
Conclusion
A search strategy can be adopted and easily implemented with the search solution that fits best
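
Sharding can be pictured as routing each document to one of several index partitions by hashing its id. This is only a sketch of the idea: the function names and the CRC32-based routing are assumptions, and the search servers named above handle routing themselves.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Group document ids by the shard they are routed to."""
    shards = {number: [] for number in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = ["doc-%d" % number for number in range(1000)]
shards = distribute(doc_ids, 4)
print({shard: len(ids) for shard, ids in shards.items()})
```

Because every node then holds only a part of the index, no single server has to manage all of it; replication would add copies of each shard on further nodes.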

Page 53: Super Size Your Search

Apache Manifold: Other benefits
Can extract, index and map information from any other source
• Apache Stanbol, RedLink, any other data enricher
• Our solution will gather everything in one place
  • Documents, entities…
Permissions are checked just once
• Everything is in the same place, even user authorization capabilities
• Performance and scalability are improved
• Faceted search and other search capabilities are combined with this permission feature

Page 54: Super Size Your Search


Demo

Page 55: Super Size Your Search

Conclusions
Zaizi solution allows searching and indexing in the most popular Cloud Search solutions
• Other Search solutions can be integrated as well
Zaizi solution allows retrieving information from the most popular repositories
• Other data providers can be integrated too
• It solves plenty of current issues related to search and indexing in Alfresco
• Can be used outside Alfresco, or with Alfresco and any other data repository
Zaizi solution manages permissions and security from the most popular repositories and the latest Cloud search technologies
Fully supported by us!

Page 56: Super Size Your Search


Conclusions

Page 57: Super Size Your Search

What’s coming
Powerful User Interface
• Admin functions
• Wide range of facets
• UI for Share

Benchmarking

New connectors
• Filesystem authority
• RedLink repository
• Stanbol repository

Alfresco Search Subsystem?

Page 58: Super Size Your Search
