Design Decisions and Trade-offs in Apache Accumulo

Aaron Cordova, CTO, Koverse Inc.

33rd International Conference on Massive Storage Systems and Technology

Source: storageconference.us/2017/Presentations/Cordova.pdf

“Design is not just what it looks like and feels like. Design is how it works.”

–Steve Jobs

“There ain’t no such thing as a free lunch.”

–Milton Friedman, others

2003 Mountain View California

Web Search for a Planet: The Google Cluster Architecture

“… the most important factors that influence its design: energy efficiency and price-performance ratio.”

“… we provide reliability in software rather than in server-class hardware, so we can use commodity PCs to build a high-end computing cluster at a low-end price.”

Web Search for a Planet: The Google Cluster Architecture

Different queries run on different processors

Partitioned Index

A single query uses multiple processors

More than 15,000 commodity-class PCs

Fault-tolerance built into software

Superior performance at a fraction of the cost of a system built from fewer, but more expensive high-end servers
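
The partitioned-index idea on this slide can be illustrated with a small sketch. It is not Google's implementation: the sharding rule (document id modulo the shard count), the function names, and the sample documents are all assumptions made for illustration. Each shard holds its own inverted index, and a single query is fanned out to every shard in parallel before the partial results are merged.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_shards(docs, num_shards):
    """Partition documents by id and build one inverted index per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[doc_id % num_shards]  # hypothetical sharding rule
        for word in text.lower().split():
            index[word].add(doc_id)
    return shards

def search(shards, word):
    """Fan one query out to every shard in parallel, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda index: index.get(word, set()), shards))
    return sorted(set().union(*partials))

# Sample documents are invented for this sketch.
docs = {
    1: "cheap commodity PCs",
    2: "fault tolerant software",
    4: "commodity clusters",
}
shards = build_shards(docs, num_shards=2)
print(search(shards, "commodity"))
```

Because every shard answers independently, losing one shard in this toy model would shrink the result set rather than fail the query, which is one way to read "fault-tolerance built into software".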

2005 Jeffrey Dean, University of Washington

BigTable

Motivation

lots of semi-structured data behind Google apps

multiple versions of crawled web pages

user information

satellite imagery

geographical data

100s of millions of users, many queries per second

“scale is too large for most commercial databases”

“even if scale were not too large, the cost would be very high …”

“… requiring high-end hardware that doesn’t match well with infrastructure”

“building internally means system can be applied across many projects for low incremental cost”

“low-level storage optimizations help performance significantly”

“because we're able to develop code at all levels, can take advantage of storage and network transfer optimizations, much harder to do when running on top of a database layer”

“also fun and challenging to build large scale systems”

Large-scale Incremental Processing Using Distributed Transactions and Notifications

2010 OSDI

Percolator, a system for incrementally processing updates to a large data set, is used to produce Google's websearch index, persisted in BigTable

“The indexing system could store the repository in a DBMS and update individual documents while using transactions to maintain invariants …”

“However, existing DBMSs can’t handle the sheer volume of data: Google’s indexing system stores tens of petabytes across thousands of machines.”

BigTable Design Objectives

want a lot of asynchronous processes to continuously update and read from their part of the global state

want access to most current data at any time

need high read/write rates

efficient scans over all or interesting subsets of data

efficient joins of large one-to-one and one-to-many data sets

want to examine data changes over time

BigTable Design Decisions

Highly consistent, not eventually consistent

Designed for a single data center, not geographically distributed data centers

Keys organized via sorting, partitioned into ranges, not hashing

Service of each range is decoupled from storage, so reassignment doesn’t require data movement

Support for single-row transactions
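
Two of these decisions, sorted keys partitioned into ranges and serving decoupled from storage, can be sketched in a few lines. This is a toy model rather than BigTable's code: the class name, the split points, and the server names are hypothetical. Split points divide the sorted row space into contiguous ranges, a binary search routes a row to its range, and reassigning a range changes only a pointer.

```python
import bisect

class RangePartitionedTable:
    """Toy router: split points divide the sorted row space into ranges."""

    def __init__(self, split_points, servers):
        # N split points define N + 1 contiguous key ranges (tablets).
        self.split_points = sorted(split_points)
        num_tablets = len(self.split_points) + 1
        # Assignment maps tablet number -> serving host; storage lives elsewhere.
        self.assignment = {t: servers[t % len(servers)] for t in range(num_tablets)}

    def tablet_for(self, row):
        # A row equal to a split point belongs to the tablet that ends there.
        return bisect.bisect_left(self.split_points, row)

    def server_for(self, row):
        return self.assignment[self.tablet_for(row)]

    def tablets_for_range(self, start_row, end_row):
        """Tablets touched by a scan of [start_row, end_row]; always contiguous."""
        return list(range(self.tablet_for(start_row), self.tablet_for(end_row) + 1))

    def reassign(self, tablet, new_server):
        # Only the pointer changes; no key-value data is copied.
        self.assignment[tablet] = new_server

# Split points and server names are invented for this sketch.
table = RangePartitionedTable(split_points=["g", "p"], servers=["ts1", "ts2", "ts3"])
print(table.server_for("george"), table.tablets_for_range("a", "h"))
```

With hashing, a scan over a row range would have to visit every partition; with sorted ranges it touches only the adjacent tablets that cover the range.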

BigTable Features

Distributed multi-level map - interesting data model

Fault tolerant, persistent

Scalable

    Thousands of servers
    Terabytes of in-memory data
    Petabytes of disk-based data
    Millions of reads and writes per second, efficient scans

Self managing

    Servers can be added / removed dynamically
    Servers adjust to load imbalance

BigTable Data Model

Key Value

(sorted) (not sorted)

BigTable Data Model

Key: row ID, Column, Timestamp

Value

Key consists of three main components

BigTable Data Model

[Figure: two versions of the same table stacked along a time axis. Axes: rows, columns, time]

          age   phone      sneakers   hat
bill      49    555-1212   $100       -
george    37    -          $80        $30

          age   phone      sneakers   hat
bill      49    555-1212   $100       -
george    38    -          $80        $30

BigTable Data Model

row      column     time       value
bill     age        Jun 2010   49
bill     phone      Jun 2010   555-1212
bill     sneakers   Apr 2010   $100
george   age        Oct 2009   38
george   sneakers   Nov 2009   $80
george   hat        Dec 2009   $30
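
The table above can be read as a single sorted map from (row, column, timestamp) to value. The following is a minimal sketch of that reading, not BigTable's storage format: the class name is hypothetical, timestamps are encoded as YYYYMM integers for convenience, and the older "37" entry is given an invented date purely to show versioning.

```python
import bisect

class SortedKV:
    """Toy sorted map from (row, column, timestamp) to value."""

    def __init__(self):
        self._keys = []     # kept sorted at all times
        self._values = {}

    def put(self, row, column, timestamp, value):
        # Negating the timestamp makes newer versions sort first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_row, end_row):
        """Yield entries with start_row <= row <= end_row in key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] <= end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1

    def latest(self, row, column):
        """Return the newest value for one cell, or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

t = SortedKV()
t.put("george", "age", 200910, "38")
t.put("george", "age", 200810, "37")   # hypothetical date for the older version
t.put("bill", "age", 201006, "49")
t.put("bill", "phone", 201006, "555-1212")
print(list(t.scan("bill", "bill")), t.latest("george", "age"))
```

Keeping every version under the same (row, column) prefix is what lets a reader either take the newest value or walk the history of a cell with one sequential scan.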

BigTable Data Model

A read request for a single key or a range of keys is routed to one server. It is designed to require a minimal number of seeks on a cheap spinning disk, read data sequentially, and return, typically in less than a second.

BigTable loads key-value pairs into memory in blocks and caches recently read blocks, enabling applications to exploit temporal and spatial locality in access patterns.

BigTable Data Model

Key: row ID, Column (Family, Qualifier), Timestamp

Value

Column is split into two additional components

BigTable Data Model

Column families can be assigned to locality groups which are stored together on disk.

This allows scanning columns within a locality group without reading other columns from disk.

Locality groups can be marked as being served from memory, loaded lazily.
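
A rough sketch of why locality groups help, under simplifying assumptions: each group is modeled as a separate in-memory list standing in for a file on disk, and the class name, group names, and configuration shape are hypothetical. A scan restricted to some column families opens only the files of the groups those families belong to.

```python
from collections import defaultdict

class LocalityGroupStore:
    """Toy store that keeps each locality group in its own 'file'."""

    def __init__(self, groups):
        # groups: group name -> set of column families (hypothetical config shape)
        self.family_to_group = {f: g for g, fams in groups.items() for f in fams}
        self.files = defaultdict(list)   # group -> [(row, family, qualifier, value)]
        self.files_read = []             # which files the last scan touched

    def put(self, row, family, qualifier, value):
        group = self.family_to_group.get(family, "default")
        self.files[group].append((row, family, qualifier, value))

    def scan(self, families):
        """Return entries for the requested families, reading only their groups."""
        needed = sorted({self.family_to_group.get(f, "default") for f in families})
        self.files_read = needed
        hits = [e for g in needed for e in self.files[g] if e[1] in families]
        return sorted(hits)

store = LocalityGroupStore({
    "profile": {"attribute"},
    "commerce": {"purchases", "returns"},
})
store.put("bill", "attribute", "age", "49")
store.put("bill", "purchases", "sneakers", "$100")
store.put("george", "returns", "hat", "$30")
print(store.scan({"attribute"}), store.files_read)
```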

BigTable Data Model

Column families must be declared beforehand, but column qualifiers need not be; they can be created dynamically during ingest.

Rows can be very large, millions of columns or more.

Rows within a table need not all have the same set of columns. No penalty for highly sparse and dynamic data.

BigTable Architecture

GFS

MapReduce

BigTable

Chubby

Applications

Architecture: Tables

[Diagram, three animation frames. Labels: BigTable, Tablet Servers, Master, Table; the second frame labels partitions P1, P2, P3]

Architecture: Splits

row      col fam     col qual   time       value
bill     attribute   age        Jun 2010   49
bill     attribute   phone      Jun 2010   555-1212
bill     purchases   sneakers   Apr 2010   $100
george   attribute   age        Oct 2009   38
george   purchases   sneakers   Nov 2009   $80
george   returns     hat        Dec 2009   $30

[Diagram, three animation frames. Labels: BigTable, Tablet Servers, Master]

Metadata Hierarchy

root

md1 md2 md3 (metadata table)

user1 user2 index1 index2 (user tables)

Architecture: Lookup

[Diagram, six animation frames. Labels: Accumulo, Tablet Servers, Master, Client, ZooKeeper. Frame captions, in order:]

1. Client knows ZooKeeper, finds root tablet

2. Scan root tablet, find metadata tablet that describes the user table we want

3. Read location info of tablets of user table and cache it

4. Read directly from server holding the tablets we want

5. Find other tablets via cache lookups
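
The lookup sequence on these slides can be walked through with plain dictionaries. This is a sketch under stated assumptions, not the Accumulo client: the `Client` class, the server names, and the shape of the `servers` dictionary are hypothetical, and the root tablet is simplified to map a table name straight to one metadata server.

```python
import bisect

def find_tablet(tablets, row):
    """tablets: sorted list of (end_row, location); the last end_row is None."""
    end_rows = [end for end, _ in tablets[:-1]]
    return tablets[bisect.bisect_left(end_rows, row)][1]

class Client:
    def __init__(self, zookeeper, servers):
        self.zookeeper = zookeeper   # stand-in: {"root_tablet": server name}
        self.servers = servers       # stand-in: server name -> hosted data
        self.cache = {}              # table name -> tablet locations
        self.metadata_reads = 0

    def locate(self, table, row):
        if table not in self.cache:
            # Step 1: ZooKeeper tells us which server hosts the root tablet.
            root_server = self.servers[self.zookeeper["root_tablet"]]
            # Step 2: the root tablet points at the metadata tablet for the table.
            md_server = self.servers[root_server["root"][table]]
            # Step 3: read the table's tablet locations and cache them.
            self.cache[table] = md_server["metadata"][table]
            self.metadata_reads += 1
        # Step 5: later lookups are answered from the cache.
        return find_tablet(self.cache[table], row)

    def read(self, table, row):
        # Step 4: talk directly to the tablet server holding the row.
        server = self.servers[self.locate(table, row)]
        return server["data"].get((table, row))

# Server names and layout are invented for this sketch.
servers = {
    "ts1": {"root": {"user1": "ts2"}},
    "ts2": {"metadata": {"user1": [("g", "ts3"), (None, "ts4")]}},
    "ts3": {"data": {("user1", "bill"): "49"}},
    "ts4": {"data": {("user1", "george"): "38"}},
}
client = Client({"root_tablet": "ts1"}, servers)
print(client.read("user1", "bill"), client.read("user1", "george"))
```

Note that the Master never appears in the read path of this sketch, matching the slides, where the client goes from ZooKeeper to the metadata tablets to the tablet servers.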

Architecture: Recovery

[Diagram, eight animation frames. Labels: BigTable, Tablet Servers, Master, DataNodes, NameNode. Frame captions, in order: "Master reassigns", "Replay Write-ahead Log"]
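
The two captions in the recovery frames, "Master reassigns" and "Replay Write-ahead Log", suggest the following flow. The sketch is a simplified model built on assumptions: all class and method names are hypothetical, the shared storage is a plain object standing in for the distributed file system, and failure detection is reduced to an explicit call.

```python
class SharedStorage:
    """Stand-in for the distributed file system; survives server failures."""
    def __init__(self):
        self.wal = {}     # tablet -> list of logged (key, value) mutations
        self.files = {}   # tablet -> dict of flushed key-value data

class TabletServer:
    def __init__(self, name, storage):
        self.name, self.storage = name, storage
        self.memtables = {}   # tablet -> in-memory writes, lost on a crash

    def write(self, tablet, key, value):
        # Log first, then apply in memory.
        self.storage.wal.setdefault(tablet, []).append((key, value))
        self.memtables.setdefault(tablet, {})[key] = value

    def load(self, tablet):
        # Replay the write-ahead log to rebuild what was only in memory.
        memtable = {}
        for key, value in self.storage.wal.get(tablet, []):
            memtable[key] = value
        self.memtables[tablet] = memtable

    def read(self, tablet, key):
        flushed = self.storage.files.get(tablet, {})
        return self.memtables.get(tablet, {}).get(key, flushed.get(key))

class Master:
    def __init__(self, servers):
        self.servers = servers
        self.assignment = {}   # tablet -> server name

    def assign(self, tablet, server_name):
        self.assignment[tablet] = server_name
        self.servers[server_name].load(tablet)

    def handle_failure(self, dead_name):
        live = [n for n in self.servers if n != dead_name]
        del self.servers[dead_name]
        for tablet, owner in list(self.assignment.items()):
            if owner == dead_name:
                self.assign(tablet, live[0])   # data stays in shared storage

storage = SharedStorage()
servers = {n: TabletServer(n, storage) for n in ("ts1", "ts2")}
master = Master(servers)
master.assign("t1", "ts1")
servers["ts1"].write("t1", "bill", "49")
master.handle_failure("ts1")
print(master.assignment["t1"], servers["ts2"].read("t1", "bill"))
```

Because both the log and the flushed files live in shared storage, the new server needs no data transfer from the failed one, only a replay.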

“Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time”

“New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions”

“We have gotten a substantial amount of flexibility from designing our own data model for Bigtable.”

“our control over Bigtable’s implementation, and the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise”

2008 National Security Agency

“a team of computer scientists and mathematicians at the National Security Agency were evaluating the use of various big data technologies, including Apache Hadoop and HBase, in an effort to help solve the issues involved with storing and processing large amounts of data of different sensitivity levels.”

“After reviewing existing solutions and comparing the stated objectives of existing open source projects to the agency’s goals, the team began a new implementation of Google’s BigTable.”

“the team extended the BigTable design with additional features that included a method for labeling each key-value pair with its own access information, called Column Visibilities, and a mechanism for performing additional server-side functionality, called Iterators.”

“In 2011 Accumulo became a public open source incubator project hosted by the Apache Software Foundation, and in March 2012 Accumulo graduated to top-level project status.”

Accumulo Data Model

Key: row ID, Column (Family, Qualifier, Visibility), Timestamp

Value

Accumulo introduces an additional column component

Accumulo Data Model

[Figure: two versions of the same table stacked along a time axis. Labels: rows, time, column family (three families), column qualifiers, private, public]

          age   phone      sneakers   hat
bill      49    555-1212   $100       -
george    37    -          $80        $30

          attribute:age   attribute:phone   purchases:sneakers   returns:hat
bill      49              555-1212          $100                 -
george    38              -                 $80                  $30

Accumulo Data Model

row      col fam     col qual   col vis   time       value
bill     attribute   age        public    Jun 2010   49
bill     attribute   phone      private   Jun 2010   555-1212
bill     purchases   sneakers   public    Apr 2010   $100
george   attribute   age        private   Oct 2009   38
george   purchases   sneakers   public    Nov 2009   $80
george   returns     hat        public    Dec 2009   $30
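
To make the effect of the visibility column concrete, here is a minimal sketch of a scan that filters on it. It is a simplification, not the Accumulo API: the `scan` function is hypothetical, and each entry carries a single label as in the table above, whereas real Accumulo visibilities can be boolean expressions over several labels.

```python
ENTRIES = [
    # (row, col fam, col qual, col vis, time, value), copied from the table above
    ("bill", "attribute", "age", "public", "Jun 2010", "49"),
    ("bill", "attribute", "phone", "private", "Jun 2010", "555-1212"),
    ("bill", "purchases", "sneakers", "public", "Apr 2010", "$100"),
    ("george", "attribute", "age", "private", "Oct 2009", "38"),
    ("george", "purchases", "sneakers", "public", "Nov 2009", "$80"),
    ("george", "returns", "hat", "public", "Dec 2009", "$30"),
]

def scan(entries, authorizations):
    """Return only the entries whose visibility label the reader holds."""
    return [e for e in entries if e[3] in authorizations]

public_view = scan(ENTRIES, {"public"})
full_view = scan(ENTRIES, {"public", "private"})
print(len(public_view), len(full_view))
```

A reader holding only the "public" authorization never sees bill's phone number or george's age, even though those entries sit in the same rows as data the reader may see.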

Accumulo API

Column families can be created dynamically

Introduced batch scanners and maintained support for large rows, both of which enable building tables that serve as secondary indexes
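
One way to see why batch scans and large rows matter for secondary indexes is the sketch below. It uses ordinary dictionaries in place of tables, and the function names and the extra record "ann" are hypothetical. The index has one row per value, so a common value produces a very wide row, and answering a query means fetching many unrelated record rows at once.

```python
def build_index(records):
    """Index table: one row per value, one column per record containing it."""
    index = {}
    for record_id, fields in records.items():
        for field, value in fields.items():
            # A popular value yields a very wide index row.
            index.setdefault(value, set()).add((field, record_id))
    return index

def batch_lookup(records, record_ids):
    """Stand-in for a batch scan: fetch many unrelated rows in one request."""
    return {rid: records[rid] for rid in sorted(record_ids)}

def query(records, index, field, value):
    """Find matching record ids in the index, then fetch the full records."""
    ids = {rid for f, rid in index.get(value, ()) if f == field}
    return batch_lookup(records, ids)

records = {
    "bill": {"age": "49", "phone": "555-1212"},
    "george": {"age": "38"},
    "ann": {"age": "49"},   # hypothetical extra record
}
index = build_index(records)
print(query(records, index, "age", "49"))
```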

‘Archaeologist’s Approach’ to Data

Allow the data to inform you about its schema

Avoid making assumptions about the data, or changing it, for as long as possible

Store, protect, index data to allow exploration and discovery

Use bulk processing like Spark to create clean, summarized derivatives of data. Preserve original in case assumptions prove to be false and reprocessing is required.

Accumulo Architecture

HDFS

MapReduce

Accumulo

ZooKeeper

Applications

Spark

Accumulo Proof Points

AWS benchmark

    Tested at 300, 500, and 1000 machines
    100 million entries written per second
    408 terabytes
    7.56 trillion total entries
    Several hardware failures, zero interruptions

https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf

Ingest Benchmark

[Plot: millions of entries per second (y-axis, 0 to 100) against size of cluster (x-axis, 0 to 1000)]

Scan Latency

[Plot: average scan latency in ms (y-axis, 0 to 0.05) against size of cluster (x-axis, 0 to 1000)]

Administrative Overhead

[Plot: number of events (y-axis, 0 to 12) against size of cluster (x-axis, 0 to 1000). Series: Failed Machines, Admin Intervention]

Accumulo Proof Points

Graph processing benchmark

    1200 machines
    4.4 trillion vertices
    70.4 trillion edges
    1 petabyte processed
    149 million edges traversed per second

http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

Accumulo Proof Points

D4M benchmark

    D4M is a data model integrating Accumulo with pMatlab
    216 machines
    115 million inserts per second
    Used checkpointing instead of write-ahead log

http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/31.pdf

Accumulo Architecture

HDFS

MapReduce

Accumulo

ZooKeeper

Applications

Spark

HDFS

Accumulo in the Enterprise

Besides the intelligence community, Accumulo receives special interest from highly regulated industries such as finance and healthcare, often because of its strong security features and scalability.

Accumulo is supported by all Hadoop vendors, and several companies have built commercial products on Accumulo.

Enterprises have so much data in so many different systems that the ‘archaeologist’s approach’ is warranted

Bringing data together physically in Accumulo and protecting it logically is a major enabler to data science initiatives

Accumulo in the Enterprise

Three strengths make it attractive as a place for gathering data:

1. Flexible schema handling, columns created dynamically, making it possible to load data without fully characterizing it first, and to handle inconsistent or changing data

2. Highly scalable

3. Fine-grained access control, avoiding creating a security problem just because there are multiple levels of data sensitivity

Accumulo in the Enterprise

Accumulo has good support for secondary indexing, making it possible to query data on values in any field.

Support for analytical frameworks like MapReduce and Spark makes it possible to process data in situ. Accumulo also serves as a good place to host and serve up analytical results for interactive consumption by users, services, or applications.

Challenges with Accumulo in the Enterprise

As with BigTable, new users are sometimes uncertain of how best to use the Accumulo interface and data model.

While open source components allow organizations some control over the entire storage stack, many organizations lack the expertise to modify these components.

Mapping organizational security policies to Accumulo column visibilities remains an exercise left to the reader

Typical Architecture

HDFS

MapReduce

Accumulo

ZooKeeper

Custom Applications

Spark

Vendor (e.g. Koverse)

Tableau, Excel, etc

Ingest Index Query

Profile Sample

Analytic Flows Security

Resources

accumulo.apache.org

@ApacheAccumulo on Twitter

#accumulo on FreeNode IRC

Aaron Cordova

@aaroncordova

www.koverse.com