OCTOBER 13-16, 2016 • AUSTIN, TX
Solr & R to deploy Custom Search Interfaces
Patrick Beaucamp
Chairman – Bpm-Conseil - France patrick.beaucamp@bpm-conseil.com
Presentation Agenda
Solr & R Integration inside AklaBox
AklaBox Presentation
AklaBox & Solr + R & GoJS & OSM
Demo Platform: AklaBox
Going further: Vanilla Air, Spark & R & Solr
Certified on Cloudera & HortonWorks
Runs on Hadoop: Solr/Cloud, HDFS ...
Ready for OpenStack
AklaBox Presentation
User Interface
AklaBox Presentation
Upload your documents
Share your documents
Collaborate on documents
Search on documents
Synchronize your documents
Publish your documents
Document Viewer
AklaBox Presentation
WorkFlow
Synchro
Mobile
AklaBox Presentation
Standard Search Interface
Solr & R Integration inside AklaBox
• Why do I get this list when I search inside the document repository?
• What counts when I run a search: the weight of every word?
• If a word appears 100 times in a document, is the document more valuable for my search?
• Maybe the document I'm looking for does not use the exact word spelling?
• How do I take multi-language support into account?
Solr & R Integration inside AklaBox
• We needed to review our module and rethink how we can help users deploy their own search policy
• R was a natural choice to create a new search algorithm
  – We use R for our data mining development
  – R contains packages to inspect documents
  – R has virtually no limit when analyzing and classifying documents
  – We read a lot about R & search engines …
Solr & R Integration inside AklaBox
• When do we analyze documents with R?
  – Before Solr indexation
  – After Solr indexation
• Choice: before Solr indexation
  – We add metadata to every document, such as top words, class of document …
  – We create classes for documents, and relations between classes
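The "before indexation" choice can be sketched as follows. This is a minimal illustration in Python rather than R, under stated assumptions: the stop-word list, the helper names and the field names (`content_txt`, `keywords_ss`, `doc_class_s`) are hypothetical and are not AklaBox's actual schema or code. The class label is passed in by hand here, where the talk relies on R to compute it.

```python
import json
import re
from collections import Counter

# Hypothetical stop-word list; the talk does not say which words are filtered.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def top_words(text, limit=5):
    """Return the `limit` most frequent non-stop-words in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


def build_solr_document(doc_id, text, doc_class, limit=5):
    """Build the enriched document that would be handed to Solr for indexing."""
    return {
        "id": doc_id,
        "content_txt": text,
        "keywords_ss": top_words(text, limit),  # assumed multi-valued field
        "doc_class_s": doc_class,               # assumed single-valued field
    }


doc = build_solr_document(
    "doc-1",
    "Solr indexes the document. R analyses the document before Solr indexes it.",
    doc_class="technical",
    limit=3,
)
print(json.dumps(doc, indent=2))
```

Because the metadata is computed before the document reaches Solr, the index only has to store and search the extra fields; it never runs the analysis itself.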
Solr & R IntegraHon inside AklaBox
Keywords are added to the Solr index
Solr & R Integration inside AklaBox
R packages:
• tm: text-mining functions (stemming, word frequency, word manipulation, etc.)
  – TF-IDF function (term frequency-inverse document frequency)
• Matrix: complex matrix manipulation
• cluster: fanny & kmeans functions, to calculate classes on various groups
• libsvm: svm, predict & tune functions, for automatic word classification
• sampling: to create & manipulate different data sets
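To make the TF-IDF item concrete, here is a minimal sketch of the textbook weighting, written in Python for illustration only. It is not the tm package's implementation, which may differ in details such as the logarithm base or normalisation; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    tf  = occurrences of the term in the document / terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "solr search index".split(),
    "solr cluster".split(),
    "r text mining".split(),
]
scores = tf_idf(docs)
for tokens, row in zip(docs, scores):
    print(tokens, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "search" outweighs "solr" because "solr" also appears in another document. This is one answer to the "100 times in a document" question: repetition raises the term frequency, but a word that is common across the whole repository is weighted down.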
Solr & R Integration inside AklaBox
+
• The R algorithm runs when the document is uploaded
• We keep only a small number of words per document (parameter)
• We create classes for documents
• We can manage other concerns, such as internationalisation
• The R package can be switched (other algorithm, new deployment)
• Easy & flexible to deploy and maintain
• No impact on Solr
-
• The Solr index is a gold mine … and we don't run analysis on it
AklaBox & Solr + R & GoJS & OSM
Mind Map with Word Associations
AklaBox & Solr + R & GoJS & OSM
Map Visualization
OSM Visualization
Demonstration
• Other business cases
  – Document management: pre-classification of documents (pharmaceutical industry)
  – Search engine: analysis of websites during the crawling process
• Open door to new development
  – Phonetic search (to solve the word spelling problem)
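The talk does not name a phonetic algorithm. As one possible illustration of the idea, here is a sketch of classic American Soundex in Python: words that sound alike map to the same four-character key, so a misspelled query can still match. The function name is hypothetical and the choice of Soundex is an assumption, not the authors' method.

```python
def soundex(word):
    """Return the 4-character American Soundex key for `word`."""
    # Consonant groups that share a digit; vowels, y, h and w have no code.
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in group:
            codes[letter] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    key = letters[0].upper()
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters that share a code
        digit = codes.get(letter, "")
        if digit and digit != previous:
            key += digit
        # A vowel resets `previous`, so the same code after a vowel counts again.
        previous = digit
    return (key + "000")[:4]


for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

Indexing such a key alongside each keyword would let "Robert" and "Rupert" (both `R163`) be retrieved by the same query, at the cost of some false matches.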
Vanilla Air, Spark, Spark SQL for Solr
New technologies are emerging … well, they are already here!
Vanilla Air, Spark, Spark SQL for Solr
• Vanilla Air
  – Can process R packages
  – Can scale with a growing number of documents
www.vanillasmartdata.com
Vanilla Air, Spark, Spark SQL for Solr
Easy switch in architecture -> scalability
Vanilla Air, Spark, Spark & R & Solr
Spark 1.5 (Sept. 2015): support for YARN cluster mode in R
Vanilla Air, Spark, Spark & R & Solr
We now have Spark & Solr tools: SolrRDD
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ
https://github.com/LucidWorks/spark-solr
Vanilla Air, Spark, Spark & R & Solr
Admin side – Running complex R programs on the Solr index, using Vanilla Air
Lucky One !