Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by Paul Nelson, Search Technologies

October 11-14, 2016 • Boston, MA

Searching the Enterprise Data Lake with Solr - Watch us do it!

Paul Nelson

Chief Architect, Search Technologies

THERE WILL BE A DEMO Stay Tuned!

205+ Search Consultants Worldwide

•  Founded 2005
•  Deep search expertise
•  900+ customers worldwide
•  Consistent profitability
•  Search engines & Big Data
•  Vendor independent

Offices: Washington (HQ); San Diego; Cincinnati; San Jose, CR; London, UK; Frankfurt, DE; Prague, CZ; Manila, PH

Agenda

•  The Enterprise Data Lake (EDL)
•  Why Search the EDL?
•  The Process
•  How To: Step By Step
•  And then what?

In The Beginning

[Diagram: Computer Users; Application; Database; outputs: Dashboards, Reports, Search & Troubleshooting, Alerts]

This Evolved to Data Warehouses

[Diagram: Many Computer Users; Dozens of Applications; Extract, Transform, Load; Enterprise Data Warehouse; outputs: Dashboards, Reports, Alerts]

And Now the Enterprise Data Lake

[Diagram: Many, many, many Computer Users; Hundreds of Applications; Enterprise Data Lake holding Raw Data and Processed Data; outputs: Dashboards, Reports, Alerts, Analyze]

What’s new about the Data Lake?

•  Ingest RAW DATA
•  Keep it FOREVER
•  Make it ALL AVAILABLE
•  Analyze it ONLY WHEN NEEDED
•  Do it at MASSIVE SCALE

Why the Data Lake?

•  You never know what’s important up front
   –  New data mining techniques invented daily
   –  Therefore, keep everything
•  There is too much data variety
   –  Therefore, only process what you need
•  Save money by not ETL’ing useless stuff
•  There are many different use cases
   –  Shared re-use of data by anyone
   –  Data is power! Power to the people!

But Now There’s a Problem:

•  Tens of thousands of databases
•  Billions of records

How to find the data you need?

SO LET’S SEARCH THE DATA LAKE

“People today think search and big data are separate but in two or three years, everyone will wonder why we ever thought that.”

Doug Cutting, Chief Architect, Cloudera; Creator of Lucene & Hadoop

The Process

1. Ingest
2. Research the Data
3. Configure Solr
4. Parse & Index
5. Search & Analyze
6. Production

1. Ingest

[Diagram: Load Data into HDFS on Hadoop]
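The load step can be sketched as a small Python helper that assembles the HDFS shell commands for it. This is a minimal sketch, not the commands used in the demo: the file name and HDFS directory are hypothetical, and it assumes only the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` file system shell commands.

```python
import shlex


def build_ingest_commands(local_files, hdfs_dir):
    """Return the HDFS shell commands that would copy raw files into the lake.

    The commands are returned as argument lists rather than executed, so the
    sketch can be inspected without a Hadoop cluster.
    """
    # Create the target directory (and any missing parents) first.
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    # Copy each raw file unchanged; the data lake keeps the original bytes.
    for path in local_files:
        commands.append(["hdfs", "dfs", "-put", path, hdfs_dir])
    return commands


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    for cmd in build_ingest_commands(["events-2016-10.csv"], "/data/raw/events"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

On a cluster node, each argument list could be passed to `subprocess.run(cmd, check=True)` to perform the load.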

2. Research the Data

[Diagram: Research the data in HDFS on Hadoop]
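One way to research a data set before designing the index is to profile a sample of it: how often each field is filled and which values dominate. The sketch below is an illustration under assumptions, not the method shown in the demo; the sample columns (`id`, `country`, `status`) are made up, and the sample is assumed to be well-formed CSV pulled from HDFS.

```python
import csv
import io
from collections import Counter


def profile_fields(rows):
    """Summarize each field: fill rate, distinct value count, most common values."""
    filled = Counter()
    values = {}
    total = 0
    for row in rows:
        total += 1
        for name, value in row.items():
            # Register the field even when this row leaves it empty.
            counter = values.setdefault(name, Counter())
            if value is not None and value.strip():
                filled[name] += 1
                counter[value.strip()] += 1
    return {
        name: {
            "fill_rate": filled[name] / total,
            "distinct": len(counter),
            "top": counter.most_common(3),
        }
        for name, counter in values.items()
    }


if __name__ == "__main__":
    # Hypothetical sample; in practice this would be a slice of a file in HDFS.
    sample = "id,country,status\n1,US,open\n2,US,\n3,DE,closed\n"
    for field, stats in profile_fields(csv.DictReader(io.StringIO(sample))).items():
        print(field, stats)
```

A field that is almost always filled with few distinct values is a candidate facet; a sparsely filled field may not be worth indexing at all.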

3. Configure Solr

[Diagram: Hadoop with HDFS; Solr configuration files solrconfig.xml and schema.xml]
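Much of the schema.xml work is declaring one `<field>` element per field found during data research. The following sketch generates those elements from a name-to-type map. It is an illustration only: the field names are hypothetical, and the type names (`string`, `text_general`, `tdate`) are assumed to be defined in the schema template being edited.

```python
import xml.etree.ElementTree as ET


def schema_fields(field_types, multi_valued=()):
    """Render Solr <field> elements from a mapping of field name to field type."""
    lines = []
    for name, solr_type in field_types.items():
        attrs = {"name": name, "type": solr_type, "indexed": "true", "stored": "true"}
        if name in multi_valued:
            attrs["multiValued"] = "true"
        # Serialize one self-closing <field .../> element per field.
        lines.append(ET.tostring(ET.Element("field", attrs), encoding="unicode"))
    return lines


if __name__ == "__main__":
    # Hypothetical fields; the type names are assumed to exist in the schema.
    fields = {"id": "string", "message": "text_general", "timestamp": "tdate", "tags": "string"}
    for line in schema_fields(fields, multi_valued=["tags"]):
        print(line)
```

The printed elements would be pasted into the fields section of schema.xml before the collection is created.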

4. Parse & Index

[Diagram: Hadoop with HDFS; Morphlines; Index]
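A morphline is a chain of commands, each of which transforms a record and hands it to the next, ending with a document ready for the index. The Python sketch below imitates that idea only; it is not Morphlines syntax or its API. The pipe-delimited log format and the three commands are hypothetical, and source timestamps are assumed to already be in UTC.

```python
from datetime import datetime


def parse_line(record):
    """Split a raw pipe-delimited line into named fields."""
    ts, level, message = record["raw"].split("|", 2)
    return {"timestamp": ts, "level": level, "message": message}


def normalize_timestamp(record):
    """Rewrite the timestamp in the ISO 8601 form used by Solr date fields."""
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {**record, "timestamp": parsed.strftime("%Y-%m-%dT%H:%M:%SZ")}


def drop_debug(record):
    """Filter step: returning None removes the record from the pipeline."""
    return None if record["level"] == "DEBUG" else record


def run_pipeline(lines, commands):
    """Pass each raw line through the commands in order, yielding index-ready docs."""
    for line in lines:
        record = {"raw": line}
        for command in commands:
            record = command(record)
            if record is None:
                break
        if record is not None:
            yield record


if __name__ == "__main__":
    raw = ["2016-10-12 09:30:00|INFO|started", "2016-10-12 09:30:01|DEBUG|noise"]
    for doc in run_pipeline(raw, [parse_line, normalize_timestamp, drop_debug]):
        print(doc)
```

Keeping each command small makes it easy to test parsing rules on a handful of records before running them over the whole data set.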

5. Search & Analyze

[Diagram: Hadoop with HDFS; Morphlines; Index; Hue]
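Behind a search dashboard, analysis over the index comes down to Solr queries with facets. The sketch below builds such a request URL against the standard `select` handler using the common `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host, collection name, and field names are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a Solr select URL that returns JSON, optionally with field facets."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # Repeat facet.field once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, collection, and field names.
    print(build_search_url("http://localhost:8983/solr", "events",
                           "level:ERROR", facet_fields=["host", "level"]))
```

The resulting URL can be fetched with any HTTP client; the facet counts in the response give a quick breakdown of matching records by field value.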

6. Move to Production

•  Testing, Quality Control
   –  Field processing
   –  Search Features
   –  Analytics
•  Incremental Processing
   –  Flume, Spark Streaming, Incremental Batches
•  Workflow / Scheduled Jobs (Oozie)
•  Security Controls
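Incremental batches can be sketched as a checkpoint comparison: each scheduled run indexes only the files that arrived since the previous run. This is a minimal sketch under assumptions, not a production design. The listing is assumed to be a set of (path, modification time) pairs obtained from HDFS, and the function name and paths are hypothetical.

```python
def select_new_files(listing, checkpoint):
    """Return (files_to_index, new_checkpoint) for one incremental run.

    listing: iterable of (path, modification_time) pairs.
    checkpoint: modification time of the newest file already indexed, or None
    on the first run.
    """
    new_files = sorted(
        (mtime, path)
        for path, mtime in listing
        if checkpoint is None or mtime > checkpoint
    )
    if not new_files:
        # Nothing arrived since the last run; keep the old checkpoint.
        return [], checkpoint
    # Index oldest first and advance the checkpoint to the newest file seen.
    return [path for _, path in new_files], new_files[-1][0]


if __name__ == "__main__":
    # Hypothetical listing with integer modification times.
    listing = [("/data/raw/a", 100), ("/data/raw/b", 200)]
    files, checkpoint = select_new_files(listing, None)
    print(files, checkpoint)
```

A file that lands later with the same modification time as the checkpoint would be skipped by this comparison, so a real job would need a stricter rule, such as also tracking the paths already processed.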

WATCH US DO IT!

Resources

•  HDFS File System Commands
   –  https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
•  solrctl Reference Guide
   –  https://www.cloudera.com/documentation/enterprise/5-7-x/topics/search_solrctl_ref.html
•  Morphlines Reference Guide
   –  http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html
   –  https://github.com/typesafehub/config/blob/master/HOCON.md
•  MapReduce Indexer Tool
   –  https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr
•  Crunch Indexer
   –  https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-crunch
•  Lily HBase Indexer
   –  http://www.cloudera.com/documentation/enterprise/latest/topics/search_hbase_batch_indexer.html

What’s Next

•  Explore other analytic interfaces
   –  Banana, Zoom Data
•  Spark
   –  Streaming Data
   –  Complex Analytics → Store results in Solr → More analytics!
•  Index Many More Collections
   –  Create a Process: Data research → Data Model Design → Implement
•  Self-Service Ingestion
   –  Document processes for others to use
   –  Templates for ingestion
•  Hire Search Technologies!

QUESTIONS? ANSWERS! Thank you!
