OCTOBER 11-14, 2016, BOSTON, MA

Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by Paul Nelson, Search Technologies



Searching the Enterprise Data Lake with Solr - Watch us do it!

Paul Nelson, Chief Architect, Search Technologies

THERE WILL BE A DEMO
Stay Tuned!

205+ Search Consultants Worldwide
• Founded 2005
• Deep search expertise
• 900+ customers worldwide
• Consistent profitability
• Search engines & Big Data
• Vendor independent

Offices: San Diego; San Jose, CR; Cincinnati; Manila, PH; Washington (HQ); London, UK; Frankfurt, DE; Prague, CZ

Agenda
• The Enterprise Data Lake (EDL)
• Why Search the EDL?
• The Process
• How To: Step By Step
• And then what?

In The Beginning
[Diagram labels: Computer Users; Application; Database; Dashboards; Reports; Search & Troubleshooting; Alerts]

This Evolved to Data Warehouses
[Diagram labels: Many Computer Users; Dozens of Applications; Extract, Transform, Load; Enterprise Data Warehouse; Dashboards; Reports; Search & Troubleshooting; Alerts]

And Now the Enterprise Data Lake
[Diagram labels: Many, many, many Computer Users; Hundreds of Applications; Enterprise Data Lake (Raw Data and Processed Data); Dashboards; Reports; Search & Troubleshooting; Alerts; Analyze]

What’s new about the Data Lake?
• Ingest RAW DATA
• Keep it FOREVER
• Make it ALL AVAILABLE
• Analyze it ONLY WHEN NEEDED
• Do it at MASSIVE SCALE

Why the Data Lake?
• You never know what’s important up front
  – New data mining techniques invented daily
  – Therefore, keep everything
• There is too much data variety
  – Therefore, only process what you need
• Save money by not ETL’ing useless stuff
• There are many different use cases
  – Shared re-use of data by anyone
  – Data is power! Power to the people!

But Now There’s a Problem:
• Tens of thousands of databases
• Billions of records

How to find the data you need?

SO LET’S SEARCH THE DATA LAKE

“People today think search and big data are separate but in two or three years, everyone will wonder why we ever thought that.”
Doug Cutting, Chief Architect, Cloudera; Creator of Lucene & Hadoop

The Process
1. Ingest
2. Research the Data
3. Configure Solr
4. Parse & Index
5. Search & Analyze
6. Production

1. Ingest
[Diagram labels: Load Data; HDFS; Hadoop]
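A minimal sketch of the ingest step, assuming the raw files sit on a local disk and are copied into HDFS unchanged with the standard `hdfs dfs` shell. The paths and the `build_ingest_commands` helper are hypothetical and not from the talk; only `-mkdir -p` and `-put` come from the HDFS file system shell.

```python
import shlex


def build_ingest_commands(local_files, hdfs_dir):
    """Return the `hdfs dfs` commands that load raw files into HDFS as-is.

    `local_files` and `hdfs_dir` are illustrative values, not paths from the talk.
    """
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        # -put copies a local file into the target HDFS directory unchanged.
        commands.append(["hdfs", "dfs", "-put", path, hdfs_dir])
    return commands


if __name__ == "__main__":
    for cmd in build_ingest_commands(["/tmp/raw/orders.csv"], "/data/raw/orders"):
        # Print rather than execute, so the sketch runs without a cluster.
        print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine with a Hadoop client installed, each command list could be handed to `subprocess.run(cmd, check=True)` instead of being printed.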

2. Research the Data
[Diagram labels: Research; HDFS; Hadoop]
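One way to research a data set before designing the index is to profile a small sample: which types each column holds, how many distinct values it has, and how often it is empty. The sketch below assumes well-formed CSV with a header row; the sample data and the function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def guess_type(value):
    """Classify one raw string as 'int', 'float', or 'string'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Summarize each column: observed types, distinct values, empty cells."""
    stats = defaultdict(lambda: {"types": set(), "distinct": set(), "empty": 0})
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if not value:
                stats[column]["empty"] += 1
                continue
            stats[column]["types"].add(guess_type(value))
            stats[column]["distinct"].add(value)
    return {
        column: {
            "types": sorted(s["types"]),
            "distinct": len(s["distinct"]),
            "empty": s["empty"],
        }
        for column, s in stats.items()
    }


# Hypothetical sample standing in for a few lines pulled from the data lake.
SAMPLE = "id,city,amount\n1,Boston,10.5\n2,Boston,\n3,London,7\n"

if __name__ == "__main__":
    for column, summary in profile_csv(SAMPLE).items():
        print(column, summary)
```

Columns with few distinct values tend to be good facet candidates, while mixed or mostly empty columns usually need a closer look before they go into the schema.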

3. Configure Solr
[Diagram labels: solrconfig.xml; schema.xml; HDFS; Hadoop]
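As an illustration of the schema side of this step, the sketch below renders `<field>` entries for schema.xml from a column-to-type mapping. The `TYPE_MAP` names are assumptions: they have to match `<fieldType>` definitions in your own schema.xml, and the `indexed`/`stored` choices are placeholders rather than recommendations from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a profiled column type to a Solr field type name.
# The names must match <fieldType> entries in your own schema.xml.
TYPE_MAP = {"int": "long", "float": "double", "string": "string"}


def field_elements(columns):
    """Render schema.xml <field> lines for a {column_name: profiled_type} mapping."""
    lines = []
    for name, kind in columns.items():
        field = ET.Element(
            "field",
            {
                "name": name,
                "type": TYPE_MAP[kind],
                "indexed": "true",
                "stored": "true",
            },
        )
        lines.append(ET.tostring(field, encoding="unicode"))
    return lines


if __name__ == "__main__":
    for line in field_elements({"id": "int", "city": "string", "amount": "float"}):
        print(line)
```

The printed lines would be pasted into the fields section of schema.xml before the configuration is uploaded.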

4. Parse & Index
[Diagram labels: Morphlines; Index; HDFS; Hadoop]
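Morphlines describe parsing as a chain of commands, each taking a record and handing a transformed record to the next. The sketch below mimics that idea in plain Python to show the shape of the step; it is not the Kite Morphlines API or its HOCON configuration, and the field names and date format are invented for the example.

```python
from datetime import datetime


def parse_line(record):
    """Split a raw comma-separated line into named fields (stand-in for a CSV reader)."""
    order_id, city, ordered_on = record["raw"].split(",")
    return {"id": order_id, "city": city, "ordered_on": ordered_on}


def normalize_date(record):
    """Rewrite the date into the ISO-8601 UTC form that Solr date fields expect."""
    parsed = datetime.strptime(record["ordered_on"], "%m/%d/%Y")
    return {**record, "ordered_on": parsed.strftime("%Y-%m-%dT%H:%M:%SZ")}


def run_pipeline(raw_lines, commands):
    """Pass each line through every command in order and collect the documents."""
    documents = []
    for line in raw_lines:
        record = {"raw": line}
        for command in commands:
            record = command(record)
        documents.append(record)
    return documents


if __name__ == "__main__":
    docs = run_pipeline(["1,Boston,10/11/2016"], [parse_line, normalize_date])
    print(docs)
```

In a real morphline the last command in the chain would load each record into Solr; here the documents are simply returned so the sketch runs without a cluster.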

5. Search & Analyze
[Diagram labels: Hue; Morphlines; Index; HDFS; Hadoop]
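Once the index exists, the same analysis that a dashboard performs can be reproduced with a plain Solr select request that asks for facet counts. The sketch below only builds the request URL; the host, collection name, and field names are hypothetical, while `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, collection, query, facet_fields, rows=10):
    """Build a Solr select URL that returns matching documents plus facet counts."""
    params = [("q", query), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field to be counted.
    params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and collection; substitute your own.
    print(facet_query_url("http://localhost:8983/solr", "orders", "city:Boston", ["city"]))
```

Fetching the URL with any HTTP client returns JSON whose facet counts section holds the per-value counts.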

6. Move to Production
• Testing, Quality Control
  – Field processing
  – Search Features
  – Analytics
• Incremental Processing
  – Flume, Spark Streaming, Incremental Batches
• Workflow / Scheduled Jobs (Oozie)
• Security Controls
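For the incremental-batch option, one common pattern is to remember a watermark (the newest modification time already indexed) and pick up only files newer than it on each scheduled run. The sketch below assumes modification times are available as comparable numbers; the function name and the sample listing are illustrative and not from the talk.

```python
def select_new_files(files, watermark):
    """Return files modified after `watermark` plus the advanced watermark.

    `files` maps a path to its modification time (any comparable number).
    """
    fresh = sorted(path for path, mtime in files.items() if mtime > watermark)
    # The watermark never moves backwards, even when nothing new arrived.
    new_watermark = max([watermark] + [files[path] for path in fresh])
    return fresh, new_watermark


if __name__ == "__main__":
    listing = {"/data/raw/a.csv": 100, "/data/raw/b.csv": 200, "/data/raw/c.csv": 300}
    batch, watermark = select_new_files(listing, 150)
    print(batch, watermark)
```

A scheduled job would persist the returned watermark between runs and index only the returned batch.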

WATCH US DO IT!

Resources
• HDFS File System Commands
  – https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
• solrctl Reference Guide
  – https://www.cloudera.com/documentation/enterprise/5-7-x/topics/search_solrctl_ref.html
• Morphlines Reference Guide
  – http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html
  – https://github.com/typesafehub/config/blob/master/HOCON.md
• MapReduce Indexer Tool
  – https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr
• Crunch Indexer
  – https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-crunch
• Lily HBase Indexer
  – http://www.cloudera.com/documentation/enterprise/latest/topics/search_hbase_batch_indexer.html

What’s Next
• Explore other analytic interfaces
  – Banana, Zoom Data
• Spark
  – Streaming Data
  – Complex Analytics → Store results in Solr → More analytics!
• Index Many More Collections
  – Create a Process: Data research → Data Model Design → Implement
• Self-Service Ingestion
  – Document processes for others to use
  – Templates for ingestion
• Hire Search Technologies!

QUESTIONS? ANSWERS!
Thank you!