Securely explore your data WHAT'S NEXT FOR BIGTABLE? Adam Fuchs, CTO Sqrrl Data, Inc. May 22, 2014

What's Next for Google's BigTable


Page 1: What's Next for Google's BigTable

Securely explore your data

WHAT'S NEXT FOR BIGTABLE?

Adam Fuchs, CTO Sqrrl Data, Inc. May 22, 2014

Page 2: What's Next for Google's BigTable

TODAY’S TALK

•  History of the World: Part 3
•  Bigtable/Accumulo Technology Overview
•  Accumulo Demonstration
•  Database Technology Survey

© 2014 Sqrrl Data, Inc. | All Rights Reserved 2

Page 3: What's Next for Google's BigTable

TIMELINE OF RELEVANT EVENTS

2006: Google’s BigTable Paper
2008: NSA Builds Accumulo
2011: NSA Open Sources Accumulo
2012: Sqrrl Founded
2013: 1st Sqrrl Release and Customers

Page 4: What's Next for Google's BigTable

Accumulo is a:
•  Apache Software Foundation (ASF) Open-Source Software Project
•  Clone of Google’s Bigtable
•  Secure, Sorted Key-Value Store
•  Row-level ACID (locally) Distributed NoSQL Database

Page 5: What's Next for Google's BigTable

Sqrrl is:
•  A commercial software company located in Cambridge, MA
•  A Search and Exploration Platform built with Apache Accumulo
•  An exciting startup with a long roadmap of challenging problems to solve
•  Hiring!

Page 6: What's Next for Google's BigTable


Page 7: What's Next for Google's BigTable

BIGTABLE & ACCUMULO TECH OVERVIEW

1.  Data Model & API
2.  Underlying Architecture
3.  Distinguishing Features

Page 8: What's Next for Google's BigTable

ACCUMULO DATA FORMAT

An Accumulo key is a 5-tuple, consisting of:
•  Row: Controls Atomicity
•  Column Family: Controls Locality
•  Column Qualifier: Controls Uniqueness
•  Visibility Label: Controls Access
•  Timestamp: Controls Versioning

Accumulo Key/Value Example

Row        Col. Fam.      Col. Qual.      Visibility     Timestamp   Value
John Doe   Notes          PCP             PCP_JD         20120912    Patient suffers from an acute …
John Doe   Test Results   Cholesterol     JD|PCP_JD      20120912    183
John Doe   Test Results   Mental Health   JD|PSYCH_JD    20120801    Pass
John Doe   Test Results   X-Ray           JD|PHYS_JD     20120513    1010110110100…
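To make the key structure concrete, here is a minimal Python sketch that models the 5-tuple key, its sort order, and visibility filtering using the rows from the example table. It is not the Accumulo API: `Key`, `sort_key`, `visible`, and `scan` are hypothetical names, and the label evaluator is deliberately simplified to flat expressions ("A|B" meaning any, "A&B" meaning all), without the parenthesized nesting real visibility labels allow.

```python
from collections import namedtuple

# Simplified model of the 5-tuple key from the slide (not the Accumulo API).
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_key(key):
    # Key components sort ascending, except the timestamp, which sorts
    # newest first so the latest version of a cell is seen first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def visible(label, authorizations):
    # Simplification: only flat labels are handled, "A|B" (any term) or
    # "A&B" (all terms). An empty label is visible to everyone.
    if not label:
        return True
    if "|" in label:
        return any(term in authorizations for term in label.split("|"))
    return all(term in authorizations for term in label.split("&"))


def scan(entries, authorizations):
    # Return the entries the caller may see, in key order.
    allowed = [(k, v) for k, v in entries if visible(k.visibility, authorizations)]
    return sorted(allowed, key=lambda kv: sort_key(kv[0]))


# The four rows of the example table, inserted out of order on purpose.
ENTRIES = [
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100…"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
]

# A reader holding only the PCP_JD authorization sees two of the four cells.
for key, value in scan(ENTRIES, {"PCP_JD"}):
    print(key.family, key.qualifier, value)
```

In this model a reader with only the `PCP_JD` authorization sees the notes and the cholesterol result, while a reader with `JD` sees the three test results but not the notes.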

Page 9: What's Next for Google's BigTable

THE ACCUMULO CLIENT API

[Class diagram, reconstructed as an outline]

Instance: new ZooKeeperInstance(...) or new MockInstance()
    getConnector(...) → Connector
Connector: TableOperations, InstanceOperations, SecurityOperations
    createScanner(...) → Scanner
    createBatchScanner(...) → BatchScanner
    createBatchWriter(...) → BatchWriter
Scanner / BatchScanner: Range, IteratorOption
    iterator() → Map.Entry (Key, Value)
BatchWriter: Mutation
    addMutation(...)

Page 10: What's Next for Google's BigTable

ACCUMULO TABLETS

•  Collections of KV pairs form Tables
•  Tables are partitioned into Tablets
•  Metadata tablets hold info about other tablets, forming a 3-level hierarchy
•  A Tablet is a unit of work for a Tablet Server

[Diagram: tablet hierarchy]

Well-Known Location (zookeeper)
    Root Tablet: -∞ to ∞
        Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
        Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞

Table: Adam’s Table
    Data Tablet: -∞ : thing
    Data Tablet: thing : ∞
Table: Encyclopedia
    Data Tablet: -∞ : Ocelot
    Data Tablet: Ocelot : Yak
    Data Tablet: Yak : ∞
Table: Foo
    Data Tablet: -∞ to ∞
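The partitioning of a table into tablets can be sketched as a binary search over the table's split points. The Python below is an illustrative model only: the `Table` class and `tablet_for` method are hypothetical, and it assumes each tablet covers rows greater than the previous split and up to and including its own end row. The same kind of lookup would apply at each level of the hierarchy (root, metadata, data).

```python
import bisect


class Table:
    """Toy model of a table partitioned into tablets by sorted split points."""

    def __init__(self, name, split_points=()):
        self.name = name
        self.splits = sorted(split_points)

    def tablet_for(self, row):
        # Assumption: tablet i covers (splits[i-1], splits[i]], so a row equal
        # to a split point belongs to the tablet that ends at that split.
        i = bisect.bisect_left(self.splits, row)
        start = self.splits[i - 1] if i > 0 else "-inf"
        end = self.splits[i] if i < len(self.splits) else "+inf"
        return (start, end)


# The three tables from the diagram.
adams_table = Table("Adam's Table", ["thing"])
encyclopedia = Table("Encyclopedia", ["Ocelot", "Yak"])
foo = Table("Foo")

print(encyclopedia.tablet_for("Panda"))
print(foo.tablet_for("anything"))
```

A table with no split points, like Foo, has a single tablet covering the whole key range; adding split points is what lets the work be spread across tablet servers.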

Page 11: What's Next for Google's BigTable

ACCUMULO PROCESSES

[Architecture diagram. Components: Applications, Tablet Servers (each hosting Tablets), Master, Zookeeper (three nodes), HDFS. Arrow labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority.]

Page 12: What's Next for Google's BigTable

TABLET DATA FLOW

[Data flow diagram. Labels: Writes, In-Memory Map, Write Ahead Log (For Recovery), Minor Compaction, Sorted, Indexed Files (three), Merging / Major Compaction, Iterator Tree (three), Tablet, Reads, Scan.]
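The data flow named on this slide follows the familiar log-structured pattern: writes land in an in-memory map and a write-ahead log, minor compactions flush the map to a sorted file, major compactions merge files, and scans merge all sources. The sketch below is a toy Python model under those assumptions. The `Tablet` class, its `memory_limit` parameter, and the choice to clear the log at flush time are illustrative simplifications, not Accumulo internals.

```python
import heapq


class Tablet:
    """Toy log-structured tablet: in-memory map, write-ahead log, sorted files."""

    def __init__(self, memory_limit=3):
        self.wal = []        # write-ahead log, kept only for recovery
        self.memory = {}     # in-memory map of recent writes
        self.files = []      # immutable sorted files, oldest first
        self.memory_limit = memory_limit

    def write(self, key, value):
        self.wal.append((key, value))
        self.memory[key] = value
        if len(self.memory) >= self.memory_limit:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.memory:
            self.files.append(sorted(self.memory.items()))
            self.memory = {}
            self.wal = []    # simplification: flushed data no longer needs the log

    def major_compact(self):
        # Merge all sorted files into one, keeping the newest value per key.
        if self.files:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        # Reads merge every file plus the in-memory map (the newest source).
        return list(self._merge(self.files + [sorted(self.memory.items())]))

    @staticmethod
    def _merge(sources):
        # Sources are ordered oldest first. Tagging entries with -age makes
        # the newest version of a key sort first within the merged stream.
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        last = object()
        for key, _, value in heapq.merge(*tagged):
            if key != last:
                yield (key, value)
                last = key


tablet = Tablet(memory_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)   # reaches the limit and triggers a minor compaction
tablet.write("a", 3)   # newer version of "a" lives only in memory
print(tablet.scan())
```

Because every source is already sorted, the merge is a streaming operation, which is what allows an iterator tree to sit on the scan and compaction paths.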

Page 13: What's Next for Google's BigTable

ITERATOR FRAMEWORK

Iterator Operations:
•  File Reads
•  Block Caching
•  Merging
•  Deletion
•  Isolation
•  Locality Groups
•  Range Selection
•  Column Selection
•  Cell-level Security
•  Versioning
•  Filtering
•  Aggregation
•  Partitioned Joins
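Several of the operations listed above can be pictured as small stages stacked over a sorted stream of entries. The following Python sketch uses generators to stack column selection, versioning, and filtering. The function names and the `(row, column, timestamp, value)` entry layout are assumptions made for illustration and do not mirror Accumulo's iterator interface.

```python
from itertools import groupby

# Entries are (row, column, timestamp, value), sorted by row and column,
# with the newest timestamp first within each cell.
DATA = [
    ("r1", "age", 5, "31"),
    ("r1", "age", 3, "30"),
    ("r1", "name", 4, "Ann"),
    ("r2", "age", 7, "45"),
    ("r2", "name", 2, "Bob"),
]


def column_selection(source, columns):
    # Pass through only the requested columns.
    return (entry for entry in source if entry[1] in columns)


def versioning(source, max_versions=1):
    # Keep at most max_versions entries per cell; input order puts newest first.
    for _, versions in groupby(source, key=lambda e: (e[0], e[1])):
        for i, entry in enumerate(versions):
            if i < max_versions:
                yield entry


def filtering(source, predicate):
    # Drop entries whose value fails the predicate.
    return (entry for entry in source if predicate(entry[3]))


# Stack the stages: each one consumes the stage below it lazily.
stack = filtering(
    versioning(column_selection(iter(DATA), {"age"})),
    lambda value: int(value) > 40,
)
print(list(stack))
```

Each stage consumes and produces a sorted stream, so stages compose freely and the same stack could in principle run at scan time or at compaction time.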

Page 14: What's Next for Google's BigTable

WORD COUNT: SUMMING AGGREGATING ITERATOR

Input Corpus
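A summing aggregation for word count can be sketched as follows: ingest emits one `(word, 1)` entry per occurrence without reading anything back, and a combining step sums entries that share a key. This is a Python illustration with hypothetical names (`tokenize`, `summing_combiner`) and a made-up three-line corpus, not the Accumulo combiner API.

```python
from itertools import groupby
from operator import itemgetter


def tokenize(corpus):
    # Ingest side: emit one (word, 1) entry per occurrence; no reads needed.
    return [(word, 1) for line in corpus for word in line.lower().split()]


def summing_combiner(sorted_entries):
    # Combine adjacent entries that share a key by summing their counts.
    for word, group in groupby(sorted_entries, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


corpus = ["the quick brown fox", "the lazy dog", "the fox"]
entries = sorted(tokenize(corpus))
counts = dict(summing_combiner(entries))
print(counts["the"], counts["fox"])

# Summation is associative, so combining part of the data early (as a
# compaction might) and combining again at scan time gives the same totals.
partially_combined = list(summing_combiner(entries[:4])) + entries[4:]
assert dict(summing_combiner(sorted(partially_combined))) == counts
```

The final assertion is the property that makes this kind of aggregation safe to apply incrementally: partial sums can be merged in any grouping without changing the result.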


Page 15: What's Next for Google's BigTable

Ingesters Queriers Tablet Servers

ACCUMULO LATENCIES

Input Batch Writer

In-Memory

Map

Scan Iterators

Scanner/Batch

Scanner

In-Memory

Map

RFile

Compaction

Iterators

Scan Iterators

RFile

Compaction

Iterators

In-Memory

Map

RFiles

Compaction

Iterators

Scan Iterators

Output

~ms ~ms ~ms

ms

- min

© 2014 Sqrrl Data, Inc. | All Rights Reserved 15

Page 16: What's Next for Google's BigTable

ACCUMULO THROUGHPUT

[Same pipeline diagram as the previous slide: Ingesters → Input → Batch Writer → Tablet Servers (In-Memory Map, RFiles, Compaction Iterators, Scan Iterators) → Scanner/Batch Scanner → Output → Queriers. Latency annotations: ~ms, ~ms, ~ms, and ms - min.]

•  Scan: ~1M entries/s per node
•  Ingest: ~200K entries/s per node
•  Read-Modify-Write Latency: ~ms, so >1K entries/s is challenging with R-M-W
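One way to read the last figure: if each read-modify-write update must wait for a round trip of about a millisecond before the next can start, updates that serialize on each other top out near a thousand per second. The short Python calculation below works through that arithmetic. The function name and the assumption of strictly serialized updates are illustrative, and the 200K figure is the per-node ingest rate quoted on the slide.

```python
def max_serial_rmw_rate(latency_ms):
    """Upper bound on updates per second when each update waits one round trip."""
    return 1000 / latency_ms


rmw_rate = max_serial_rmw_rate(1)       # ~1 ms per read-modify-write
ingest_rate = 200_000                   # ~200K entries/s per node (from the slide)

print(rmw_rate)
print(ingest_rate / rmw_rate)           # how far blind writes outpace serialized R-M-W
```

Under these assumptions blind ingest is roughly two orders of magnitude faster than serialized read-modify-write, which is the motivation for pushing aggregation into the server side.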

Page 17: What's Next for Google's BigTable

Securely explore your data

DEMO

Page 18: What's Next for Google's BigTable

R-M-W VS. COMPACTION-TIME AGGREGATION

Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
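The contrast on this slide can be sketched by counting the operations each approach needs to maintain a counter. In the Python below, `ReadModifyWriteCounter` reads the current value before every update, while `CombinerCounter` appends deltas blindly and sums them later. Both classes are hypothetical illustrations of the two patterns, not HBase or Accumulo code.

```python
class ReadModifyWriteCounter:
    """Each increment reads the current value, then writes the new one."""

    def __init__(self):
        self.store = {}
        self.reads = 0
        self.writes = 0

    def increment(self, key, delta=1):
        self.reads += 1                      # round trip to fetch the value
        current = self.store.get(key, 0)
        self.writes += 1                     # round trip to store the result
        self.store[key] = current + delta

    def get(self, key):
        return self.store.get(key, 0)


class CombinerCounter:
    """Increments are blind appends; deltas are summed at compaction or scan."""

    def __init__(self):
        self.log = []                        # appended (key, delta) entries
        self.reads = 0
        self.writes = 0

    def increment(self, key, delta=1):
        self.writes += 1                     # no read on the write path
        self.log.append((key, delta))

    def compact(self):
        # Compaction-time aggregation: collapse deltas to one entry per key.
        totals = {}
        for key, delta in self.log:
            totals[key] = totals.get(key, 0) + delta
        self.log = sorted(totals.items())

    def get(self, key):
        # Scan-time aggregation over whatever has not been compacted yet.
        return sum(delta for k, delta in self.log if k == key)


rmw, combined = ReadModifyWriteCounter(), CombinerCounter()
for word in ["fox", "dog", "fox", "fox", "dog"]:
    rmw.increment(word)
    combined.increment(word)

print(rmw.get("fox"), combined.get("fox"))
print(rmw.reads, combined.reads)
```

Both counters return the same totals, but only the read-modify-write version pays a read for every update; the combiner version defers that work to a batch step.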

Page 19: What's Next for Google's BigTable

SURVEY OF DATABASE TECHNOLOGY

•  Exercises in Center-Seeking
    •  SQL vs. NoSQL
    •  Ingest-time vs. Query-time Analytics
    •  ACID vs. BASE
    •  Normalized vs. Denormalized Data Models
•  Primary Use Cases for Sqrrl+Accumulo

Page 20: What's Next for Google's BigTable

SQL VS. NOSQL

NoSQL
•  Optimized for get/put operations
•  Specialized for client languages
•  High concurrency
•  More client-side control

Hybrid
•  Extend and evolve SQL
•  Standardize and incorporate NoSQL paradigms

SQL
•  Optimized for joins
•  Strong mathematical roots in set theory
•  Automatic query optimization

Page 21: What's Next for Google's BigTable

INGEST-TIME VS. QUERY-TIME ANALYTICS

Ingest-Time
•  Optimized for online statistics
•  Can reduce storage footprint
•  Can be indexed for low latency
•  Leverages a variety of indexes
•  Requires extensive data organization at ingest

Hybrid
•  Create partial summary at ingest (Question-focused datasets, knowledge bases, etc.)
•  Support ad-hoc queries over summaries
•  Leverage all known indexing strategies **

Query-Time
•  Can compute holistic statistics, like ranking, topN, etc.
•  Ad-hoc analytics: don’t know the query ahead of time
•  High latency and low concurrency at scale
•  Leverages block indexes, columnar layout
•  Ingest can be “stream to disk”
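The distinction between online and holistic statistics can be illustrated with a small Python sketch. `IngestTimeStats` keeps a running count and total so the mean is available without storing raw records, while `query_time_top_n` needs the full record set to rank paths by frequency. The class, the function, and the `(path, size)` records are hypothetical examples.

```python
from collections import Counter


class IngestTimeStats:
    """Online statistics: updated per record, raw records are not retained."""

    def __init__(self):
        self.count = 0
        self.total = 0

    def ingest(self, size):
        self.count += 1
        self.total += size

    def mean(self):
        return self.total / self.count


def query_time_top_n(records, n):
    # Holistic statistic: ranking needs to see every stored record.
    return Counter(path for path, _ in records).most_common(n)


# Hypothetical (path, response size) records.
records = [("/a", 10), ("/b", 20), ("/a", 30), ("/c", 40), ("/a", 50), ("/b", 60)]

stats = IngestTimeStats()
for _, size in records:
    stats.ingest(size)

print(stats.mean())                    # answered from two stored numbers
print(query_time_top_n(records, 2))    # answered by scanning all records
```

The ingest-time summary answers its one question from constant state, which is where the storage and latency benefits come from; the query-time path stays flexible at the cost of scanning everything.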

Page 22: What's Next for Google's BigTable

ACID VS. BASE

ACID
•  Atomicity: all or nothing for a group of operations
•  Consistency and Isolation: support simple reasoning for distributed, multithreaded clients
•  Durability: simple reasoning for whether data might be lost

Hybrid
•  Must make some relaxations for performance at scale (under failure modes)
•  Many options for “Lightweight” transaction support
•  Accumulo limits atomicity, consistency, and isolation to row-level operations

BASE
•  Basically Available: ensure that core operations always complete in an advertised time
•  Soft-State: relaxation of referential integrity, etc.
•  Eventual Consistency: relaxation of

Page 23: What's Next for Google's BigTable

NORMALIZED VS. DENORMALIZED DATA MODELS

Normalized
•  “Normal Form Relational Database”
•  Minimizes data footprint
•  Minimizes cost of data maintenance
•  Can lead to expensive joins at query time

Hybrid
•  Start with document store
•  Introduce links/edges for quick joins
•  Dynamically adapt to flexible or sparse schemas
•  Similar to property graphs

Denormalized
•  “Document Store”
•  Flexible schema lets applications adapt quickly to changing environments
•  Pre-joined to eliminate joins at query-time
•  Optimized for “append-only” data
•  Can inflate data sizes and slow data ingest
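The trade-off between the two models can be seen in a few lines of Python. The normalized form keeps users and orders separate and joins them at query time; the denormalized form pre-joins them into one document per user. All names and data here (`users`, `orders`, `denormalize`) are made up for illustration.

```python
# Normalized: each fact is stored once, and related facts are linked by id.
users = {1: "alice", 2: "bob"}
orders = [
    {"order_id": 10, "user_id": 1, "item": "disk"},
    {"order_id": 11, "user_id": 2, "item": "cable"},
    {"order_id": 12, "user_id": 1, "item": "rack"},
]


def orders_for_normalized(name):
    # Query-time join: resolve the user id, then match it against orders.
    ids = {uid for uid, user_name in users.items() if user_name == name}
    return [order["item"] for order in orders if order["user_id"] in ids]


def denormalize(users, orders):
    # Ingest-time pre-join: build one document per user holding its orders.
    documents = {name: {"name": name, "orders": []} for name in users.values()}
    for order in orders:
        documents[users[order["user_id"]]]["orders"].append(order["item"])
    return documents


documents = denormalize(users, orders)


def orders_for_denormalized(name):
    # No join needed: the document already contains the answer.
    return documents[name]["orders"]


print(orders_for_normalized("alice"))
print(orders_for_denormalized("alice"))
```

Both queries return the same items. The denormalized lookup avoids the join, but the pre-join work moves to ingest, and any fact copied into several documents has to be updated in each of them.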

Page 24: What's Next for Google's BigTable

KNOWLEDGE-BASE USE CASE

Data sources shown: HR, Netflow, Proxy Logs, Email, Social Media

Example log record shown on the slide:

2014-04-14 06:36:09 429 73.105.179.202 [email protected] 500 POST application/json HTTPS “wikipedia.org:443/grouchinesses/?215=felled&297=wading&768=shimmies...” "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31” 208.80.152.201

Page 25: What's Next for Google's BigTable

STREAM PROCESSING USE CASE

[Diagram labels: DATA, SPE, Dashboards, Actions, Interactive Analysis Tools (Discovery + Forensics)]

1.  SPE queries Sqrrl to enrich streaming data
2.  SPE persists results in Sqrrl for future query
3.  SPE takes action automatically
4.  SPE issues data-driven alerts
5.  Sqrrl provides context for dashboards
6.  Analysis tools use Sqrrl to search and manipulate historical data

Page 26: What's Next for Google's BigTable

SQRRL OPERATIONALIZES ACCUMULO WITH...

•  Data-Centric Security
•  Petabyte Scale and Operational Speeds
•  Document and Graph Data Models
•  SqrrlQL, including Aggregates, Secure Full-Text Search, and Secure Graph Search
•  Analytics, including Real-Time Statistics and Hadoop Integrations

Page 27: What's Next for Google's BigTable

MODERNIZING VISUALIZATION


Sqrrl is building the next generation of operational analytics visualizations

Page 28: What's Next for Google's BigTable

UPCOMING EVENTS

Accumulo Summit 2014
•  June 12 in College Park, MD
•  http://accumulosummit.com
•  Multiple tracks of talks from the leaders of the Accumulo community

IEEE HPEC Conference 2014
•  September 9-11 in Waltham, MA
•  http://www.ieee-hpec.org/
•  Accumulo Users Group Meeting as a Special Event
•  Accumulo tutorial

Watch for more meetup opportunities coming soon!