32
SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIME Ryan Tabora - Think Big Analytics NoSQL Search Roadshow - June 6, 2013 1

SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SEARCHING BILLIONS OF PRODUCT LOGS IN

REAL TIMERyan Tabora - Think Big Analytics

NoSQL Search Roadshow - June 6, 2013

1

Page 2: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

WHO AM I?

Ryan Tabora

Think Big Analytics - Senior Data Engineer

Lover of dachshunds, bass, and zombies

2

Page 3: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

OVERVIEW

Primers

What are product logs?

How do they apply to big data?

Real use case

Real issues and designs

Conclusion

3

Page 4: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

PRODUCT LOGS?•Device data

• IT, Energy, Healthcare, Manufacturing, Telecom ...

• These devices are pushing data back home (pull works too!)

• As more devices are sold/installed, more and more data comes back to ‘home base’

4

Page 5: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

• Realtime Visualization

• Realtime Response

• Ad Hoc Analysis

• Full Historical Capture

• Blended Data Sets

POWER OF DEVICE DATA

DEVICEDATA

5

Page 6: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

TRADITIONAL APPROACHES

• SQL: PostGres, MySQL, Oracle, Microsoft

• SQL provides many of the search features required for typical search applications

• Joins, regex, group by, sorting, etc

• But the these technologies can only scale so far...

6

Page 7: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

• Hadoop

• HBase/Cassandra/Accumulo

• Search features are very limited

• HBase row scans, primary key index

• Cassandra limited secondary indexing

NEW TECHNIQUESSTORING DATA

7

Page 8: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

•What is an index?

• Lucene

• Paralleling Index Creation

•MapReduce/Flume/Storm

• Real Time Search

• Searching before it hits disk

NEW TECHNIQUESINDEXING DATA

8

Page 9: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

• Solr/ElasticSearch

• Both build on top of Lucene

• Search servers

• RESTful HTTP APIs

• Easy to administer

• Add powerful text/numerical search capabilities

NEW TECHNIQUESSEARCHING DATA

9

Page 10: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

BASIC SEARCH FEATURES

• Boolean logic (AND, OR + -)

• Sorting and Group By

• Range queries

• Phrase/Prefix/Fuzzy queries

10

Page 11: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

ADVANCED SEARCH FEATURES

• Custom ranking/scoring

•More like this

• Auto suggest

• Faceting/Highlighting

• Geo-spacial search

11

Page 12: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SCALING SEARCH

• ElasticSearch and SolrCloud both have distributed features built in

• Auto-sharding

• Replication

•Query routing

• Transaction log

12

Page 13: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

USE CASE

Problem

Sample Solution

Core Design Issues

Other Solutions

13

Page 14: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

THE PROBLEM

Home Base

NetApp FilerNetApp FilerDevice

Client A

NetApp FilerNetApp FilerDevice

Client B

NetApp FilerNetApp FilerDevice

Client C

Log

Log

Log

LogsREST API

Full SQL Access

Flat File Access

LatestAll Applications

Engineers& Analysts

14

Page 15: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SEARCH APPLICATION FEATURES

• Find last three days of raw logs from an entire cluster

• Group capacity available grouped by machine serial number and show the largest capacities first

• Search all device header lines for “FAILURE”

• View all hard disk objects that have product number 2341AB

• Find all motherboards with an associated customer ticket

15

Page 16: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SAMPLE SOLUTION

Logs

Ingestion

Parsing/Loading

Custom

RESTful Search API

QueriesIndexing

HDFS

MapReduce

16

Page 17: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

INGESTION

17

Page 18: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

PARSING, LOADING, AND INDEXING

Load HBase with parsed objects

Store HBase ROW_IDStore pointer to raw file in HDFSIndex a number of desired fields

/ingestion/sequencefile1

/ingestion/sequencefile2

/ingestion/sequencefile3

1534 4562 5323 7232

4601 5105

0

0

0 1492 2987 4767 5987

18

Page 19: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

INSIDE OF HBASE

...... ......

...... ...

...... .........

object5

…...

object4

...

object3object2

...

object1rowkey

19

Page 20: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

THE SOLR DOCUMENT

2343sfOffset/ingest/file2sequenceFile

1333-2241-3411cluster_id42ADFF-BZMM

...

configs.log...

2013/05/12

WARNING: DISK DEAD...

headercontentsfile_namedate_sentsystem_id

rowkey

Solr Document

20

Page 21: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SEARCH APPLICATION

Search Application

Query

Data locations

Stored Object

UserQuery Results

2

1

34

5

8

67

Raw Data Location

Raw DataRow_id

21

Page 22: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

CORE DESIGN ISSUES

• Changing the Solr schema (manual reindex)

• Elastic shard scaling (manual reindex)

•No distributed joining (denormalizing the data)

• Replication*

•Manually managing Solr partitioning/sharding*

•Write durability*

22

Page 23: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

SOLRCLOUD

• Automatic shard creation, routing

• Replication

• Limited to a fixed number of shards defined on initial creation

• ZooKeeper for coordination

• Large community

23

Page 24: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

ELASTICSEARCH

• Similar feature set to Solr

• Purpose built for easily managing a distributed index

• Rapidly growing community

• Custom built coordination mechanism

• JSON based API

24

Page 25: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

• Integrates Cassandra and Solr

• Automatic indexing in Solr/storing in Cassandra

• Automatic partitioning

• Automatic reindexing

•Not limited to fixed number of shards

• Proprietary and costs money

DATASTAX ENTERPRISE

+

25

Page 26: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

• Collecting and analyzing device data/product logs can be a very difficult challenge

• You can use NoSQL and search technologies like Solr or ElasticSearch in unison...

• ...but it is not always easy to integrate search with NoSQL

CONCLUSION

26

Page 27: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

QUESTIONS?

• Feel free to reach out if you have any questions or need help with big data/search!

• http://ryantabora.com

• http://thinkbiganalytics.com

• http://www.slideshare.net/ratabora

[email protected]

• @ryantabora

27

Page 28: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

BONUS SLIDES

28

Page 29: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

HBASE AND SOLR

• Automatic partitioning/reindexing

• Automatic index updates on HBase inserts/deletes

•Mapping HBase cells to a Solr schema

•No perfect commercial/open source solution yet

•Many many many more...

29

Page 30: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

HBASE + SOLRAUTOMATIC INDEXING

• HBase coprocessors are like storedprocs/triggers

•New, powerful, and dangerous

• Triggers on HBase puts/deletes

•Mapping data to a schema?

30

Page 31: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

HBASE + SOLR WRITE DURABILITY

Solr Shard 1

Solr Shard 3

Solr Shard 2

HBase Table - SOLR_QUEUE

MapReduce Indexing Application

Solr Queue Reader

Create SolrDocument from raw ASUP1

Get oldest SolrDocument from HBase Queue Table

3

Use custom hash algorithm to determine which shard to add SolrDocument to

4

Query Solr, if SolrDocument was added, then remove it

from the SOLR_QUEUE

5

SolrDocument

SolrDocument

Add SolrDocument to HBase for durability2

31

Page 32: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES

HBASE + SOLRELASTIC SHARDING

• HBase’s distributing mechanism uses the concept of regions to split data across many nodes

• Region splitting can be automatic or manual (performance degradation as regions split)

• Piggybacking Solr sharding on HBase Region splitting

32