Apache Accumulo Introduction

Introduction to Apache Accumulo


DESCRIPTION

Description of Apache Accumulo including data model, scaling and recovery features, API, security, and applications


Page 1: Introduction to Apache Accumulo

Apache Accumulo

Introduction

Page 2: Introduction to Apache Accumulo

Introduction

• Aaron Cordova

• Founded Accumulo project with several others

• Led development through release 1.0


Page 3: Introduction to Apache Accumulo

Agenda

• Introduction

• Data Model

• API

• Architecture - scaling, recovery

• Security

• Data-lifecycle

• Applications

Page 4: Introduction to Apache Accumulo

Introduction

Page 5: Introduction to Apache Accumulo

History

• Began writing in summer of 2008, after comparing design goals with the BigTable paper and the existing implementations HBase and Hypertable

• Released internal version 1.0 summer of 2009.

• September 2011 accepted as an Apache Incubator project. Doug Cutting, founder of Hadoop, was the Champion Sponsor

• Feb 2012 1.4 Released

• March 2012 graduates to a top level Apache project

• V 1.5 due out soon

Page 6: Introduction to Apache Accumulo

Introduction

• Accumulo is a sparse, distributed, sorted, multi-dimensional map

• Modeled after Google’s BigTable design

• Scales to trillions of records and 100s of Terabytes

• Features automatic load balancing, high-availability, dynamic control over data layout

Page 7: Introduction to Apache Accumulo

Data Model

Page 8: Introduction to Apache Accumulo

Data Model

Key: Row ID, Column, Timestamp

Column: Family, Qualifier, Visibility

Each Key maps to a Value

Page 9: Introduction to Apache Accumulo

Data Model (Logical 2D table structure)

row     attribute:age  attribute:phone  purchases:sneakers  returns:hat
bill    49             555-1212         $100                -
george  38             -                $80                 $30

Page 10: Introduction to Apache Accumulo

Physical layout (sorted keys)

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30
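The sorted layout above can be imitated in a few lines of Python. This is only a sketch: the tuple layout, the `sort_key` helper, and the integer `YYYYMM` timestamps are assumptions made for illustration, not Accumulo code. Keys are ordered by row, column family, qualifier, and visibility, with newer timestamps first.

```python
# Each entry: (row, family, qualifier, visibility, timestamp, value).
# Timestamps are simplified to YYYYMM integers for illustration.
entries = [
    ("george", "returns", "hat", "public", 200912, "$30"),
    ("bill", "purchases", "sneakers", "public", 201004, "$100"),
    ("george", "attribute", "age", "private", 200910, "38"),
    ("bill", "attribute", "phone", "private", 201006, "555-1212"),
    ("george", "purchases", "sneakers", "public", 200911, "$80"),
    ("bill", "attribute", "age", "public", 201006, "49"),
]

def sort_key(entry):
    """Order by row, family, qualifier, visibility, then newest timestamp first."""
    row, family, qualifier, visibility, timestamp, _value = entry
    return (row, family, qualifier, visibility, -timestamp)

table = sorted(entries, key=sort_key)
for entry in table:
    print(*entry)
```

Whatever order the entries arrive in, sorting by this key reproduces the physical layout shown on the slide.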

Page 11: Introduction to Apache Accumulo

High-level API

Page 12: Introduction to Apache Accumulo

Accumulo API

• To use Accumulo, you must write an application using the Accumulo Java client library. There is no SQL (hence NoSQL)

• Data is packaged into Mutation objects which are added to a BatchWriter which sends them to TabletServers

• Clients can scan a set of key value pairs by specifying optional start and end keys (Range) and obtaining a Scanner. Iterating over the scanner returns sorted key value pairs for that range. Each scan takes milliseconds to start.

• Can scan over a subset of the columns

• Can send a set of Ranges to a BatchScanner, get matching key value pairs, unsorted

Page 13: Introduction to Apache Accumulo

Insert

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

Page 14: Introduction to Apache Accumulo

Insert

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

bill attribute phone private Jun 2010 555-1212

Page 15: Introduction to Apache Accumulo

Insert

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

Page 16: Introduction to Apache Accumulo

Scan - Full key lookup

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

bill attribute phone private Jun 2010

Page 17: Introduction to Apache Accumulo

Scan - Single row

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

bill

Page 18: Introduction to Apache Accumulo

Scan - Multiple Rows

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

bill - will

Page 19: Introduction to Apache Accumulo

Scan - Multiple Rows, Selected Columns

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

bill - will, fetch purchases
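The scan slides above can be sketched as a binary search followed by a sequential read. The code below is a toy model in plain Python, not the Accumulo client API: the `scan` function, its parameters, and the in-memory `TABLE` are hypothetical, and visibility and timestamps are left out for brevity.

```python
from bisect import bisect_left

# Sorted (row, family, qualifier, value) entries, as in the slides.
TABLE = [
    ("bill", "attribute", "age", "49"),
    ("bill", "attribute", "phone", "555-1212"),
    ("bill", "purchases", "sneakers", "$100"),
    ("george", "attribute", "age", "38"),
    ("george", "purchases", "sneakers", "$80"),
    ("george", "returns", "hat", "$30"),
]

def scan(table, start_row, end_row, fetch_families=None):
    """Yield entries with start_row <= row <= end_row, optionally limited to some families."""
    # Binary search finds the first entry at or after start_row.
    index = bisect_left(table, (start_row,))
    while index < len(table) and table[index][0] <= end_row:
        entry = table[index]
        if fetch_families is None or entry[1] in fetch_families:
            yield entry
        index += 1

single_row = list(scan(TABLE, "bill", "bill"))
purchases = list(scan(TABLE, "bill", "will", fetch_families={"purchases"}))
print(purchases)
```

Because the keys are sorted, the scan only touches the entries inside the requested range, and the results come back already sorted.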

Page 20: Introduction to Apache Accumulo

Architecture - Scaling and Recovery

Page 21: Introduction to Apache Accumulo

Performance

• Accumulo ‘scales’ because aggregate read and write performance increases as more machines are added, and because individual read/write performance remains very good even with trillions of key-value pairs already in the system

• Sources: http://www.slideshare.net/acordova00/accumulo-on-ec2

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

[Figure: thousands of writes per second (log scale, 10 to 10000) versus number of machines (1, 16, 64, 256, 1024), comparing Accumulo, BigTable circa 2006, and Cassandra]

Page 22: Introduction to Apache Accumulo

Accumulo Prerequisites

• One to hundreds of computers with local hard drives, connected via ethernet

• Password-less SSH access

• Local directory for write-ahead logs

• Hadoop and ZooKeeper installed, configured, and running

Page 23: Introduction to Apache Accumulo

Architecture

HDFS MapReduce

Accumulo

ZooKeeper

Page 24: Introduction to Apache Accumulo

Architecture: HDFS

HDFS

NameNode

DataNodes

File

Page 25: Introduction to Apache Accumulo

Architecture: HDFS

HDFS

NameNode

DataNodes

Block 1, Block 2

Page 26: Introduction to Apache Accumulo

Architecture: HDFS

HDFS

NameNode

DataNodes

Page 27: Introduction to Apache Accumulo

Architecture: Tables

Accumulo

Tablet Servers

Master

Table

Page 28: Introduction to Apache Accumulo

Architecture: Tables

Accumulo

Tablet Servers

Master

P1 P2 P3

Page 29: Introduction to Apache Accumulo

Architecture: Tables

Accumulo

Tablet Servers

Master

Page 30: Introduction to Apache Accumulo

Architecture: Writes

HDFS

P1

File1

MemTable

Page 31: Introduction to Apache Accumulo

Architecture: Writes

HDFS

P1

File1

MemTable

Client

Write-ahead Log

Page 32: Introduction to Apache Accumulo

Architecture: Writes

HDFS

File1 File 2

P1 MemTable

Write-ahead Log

Page 33: Introduction to Apache Accumulo

Architecture: Writes

HDFS

File1 File 2

P1 MemTable

Write-ahead Log (crossed out)
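The write-path diagrams above can be read as three steps: log the mutation, apply it to the MemTable, and later flush the MemTable to a new sorted file, after which the logged entries are no longer needed. The following toy model sketches that sequence. The `Tablet` class, its attribute names, and the entry-count flush threshold are assumptions for illustration; plain lists stand in for HDFS files.

```python
class Tablet:
    """Toy model of one tablet's write path (names are illustrative)."""

    def __init__(self, flush_threshold=3):
        self.write_ahead_log = []   # record of mutations not yet in a file
        self.memtable = {}          # in-memory map of key -> value
        self.files = []             # immutable sorted files (stand-in for HDFS)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # Log first so the mutation survives a crash, then apply it in memory.
        self.write_ahead_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the MemTable out as a new sorted file.
        self.files.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The logged mutations now live in a file, so the log can be dropped.
        self.write_ahead_log = []

tablet = Tablet(flush_threshold=2)
tablet.write(("bill", "attribute", "age"), "49")
tablet.write(("bill", "attribute", "phone"), "555-1212")
tablet.write(("george", "attribute", "age"), "38")
print(len(tablet.files), len(tablet.write_ahead_log))
```

After the third write, one file holds the first two entries, and only the unflushed third entry remains in the log and the MemTable.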

Page 34: Introduction to Apache Accumulo

Architecture: Splits

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

Page 35: Introduction to Apache Accumulo

Architecture: Splits

Accumulo

Tablet Servers

Master

Page 36: Introduction to Apache Accumulo

Architecture: Splits

Accumulo

Tablet Servers

Master

Page 37: Introduction to Apache Accumulo

Architecture: Splits

Accumulo

Tablet Servers

Master

Page 38: Introduction to Apache Accumulo

Sorted keys - dynamic partitioning

• Because keys are sorted, tables can be partitioned based on the data

• Partitions (tablets) are uniform in size, regardless of data distribution (as long as single rows are smaller than the partition size)

• Partitioning is not based on the number of servers

• Servers can be added, removed, or fail at any time; the system is always automatically balanced
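A minimal sketch of data-driven splitting, assuming a tablet is a sorted list of entries and that size is measured in entries rather than bytes. The `split_tablet` function and its `max_entries` parameter are hypothetical. The cut always falls on a row boundary, which is why the uniform-size property depends on single rows being smaller than the partition size.

```python
def split_tablet(entries, max_entries):
    """Split a sorted list of (row, column, value) entries at a row boundary.

    Returns one or two tablets. Rows are never divided, so a tablet that
    holds a single oversized row cannot be split.
    """
    if len(entries) <= max_entries:
        return [entries]
    middle_row = entries[len(entries) // 2][0]
    # Cut at the first entry of the middle row...
    cut = next(i for i, entry in enumerate(entries) if entry[0] == middle_row)
    if cut == 0:
        # ...or just past it when the middle row is also the first row.
        cut = next(
            (i for i, entry in enumerate(entries) if entry[0] != middle_row),
            len(entries),
        )
    if cut == len(entries):
        return [entries]  # only one row present: cannot split
    return [entries[:cut], entries[cut:]]

tablet = [
    ("bill", "attribute:age", "49"),
    ("bill", "attribute:phone", "555-1212"),
    ("bill", "purchases:sneakers", "$100"),
    ("george", "attribute:age", "38"),
    ("george", "purchases:sneakers", "$80"),
    ("george", "returns:hat", "$30"),
]
tablets = split_tablet(tablet, max_entries=4)
print([len(t) for t in tablets])
```

The split point comes from the data itself, not from the number of servers, so the resulting tablets can be assigned to any server.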

Page 39: Introduction to Apache Accumulo

Partitioning Contrast

• Some relational databases allow partitioning. May require users to choose a field or two on which to partition. Hopefully that field is uniformly distributed

• Hash-based systems (default Cassandra, CouchDB, Riak, Voldemort) avoid this problem, but at the cost of range scans. Some support range scans via other means.

• Many systems couple partition storage with partition service, requiring data movement to rebalance partition service (MongoDB, Cassandra, etc)

Page 40: Introduction to Apache Accumulo

Architecture: Reads

File1 File 2

P1 MemTable

Client

Merge
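The read diagram above shows a client receiving a merge of the MemTable and the tablet's files. Since every source is already sorted, a streaming merge is enough. The sketch below uses Python's `heapq.merge`. The `merged_read` function is hypothetical, and resolving duplicate keys by source age is a simplification made for this sketch (in Accumulo the timestamp is part of the key).

```python
import heapq

def merged_read(memtable, files):
    """Merge sorted sources into one sorted stream, newest source winning ties.

    `files` is ordered oldest to newest; `memtable` holds the newest data.
    """
    # Rank 0 is the MemTable, higher ranks are progressively older files.
    sources = [sorted(memtable.items())] + [list(f) for f in reversed(files)]
    tagged = [
        [(key, rank, value) for key, value in source]
        for rank, source in enumerate(sources)
    ]
    previous_key = object()  # sentinel that equals no real key
    for key, _rank, value in heapq.merge(*tagged):
        if key != previous_key:  # first hit per key comes from the newest source
            yield key, value
            previous_key = key

file1 = [("bill", "49"), ("george", "38")]
file2 = [("bill", "50")]
memtable = {"harry": "27"}
result = list(merged_read(memtable, [file1, file2]))
print(result)
```

No source is re-sorted at read time; the merge only compares the current head of each sorted source.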

Page 41: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 42: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 43: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 44: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 45: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Master reassigns

Page 46: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Replay Write-ahead Log

Page 47: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 48: Introduction to Apache Accumulo

Architecture: Recovery

Accumulo

Tablet Servers

Master

DataNodes

NameNode

Page 49: Introduction to Apache Accumulo

Metadata Hierarchy

root

metadata table: md1 md2 md3

user tables: user1 user2 index1 index2

Page 50: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Page 51: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Client knows ZooKeeper, finds root tablet

Page 52: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Scan root tablet, find metadata tablet that describes the user table we want

Page 53: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Read location info of tablets of user table and cache it

Page 54: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Read directly from server holding the tablets we want

Page 55: Introduction to Apache Accumulo

Architecture: Lookup

Accumulo

Tablet Servers

Master

Client

ZooKeeper

Find other tablets via cache lookups
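The lookup sequence above (root tablet, then metadata tablet, then the tablet server, with results cached) can be sketched as two sorted lookups plus a cache. Everything named here is hypothetical: the `ROOT` and `METADATA` contents, the server names, and the `Client` class. Caching by individual row is a simplification, and the sketch assumes all rows sort below the `"~"` sentinel.

```python
from bisect import bisect_left

# Hypothetical location data: each level maps sorted tablet end rows to a location.
ROOT = [("m", "md1"), ("~", "md2")]                # metadata tablet per key range
METADATA = {
    "md1": [("g", "server1"), ("m", "server2")],   # user tablets ending at "g", "m"
    "md2": [("t", "server3"), ("~", "server4")],
}

def find(level, row):
    """Return the location of the first tablet whose end row is >= row."""
    end_rows = [end_row for end_row, _location in level]
    return level[bisect_left(end_rows, row)][1]

class Client:
    def __init__(self):
        self.cache = {}          # row -> tablet server, filled on first lookup
        self.metadata_reads = 0  # how often the hierarchy had to be walked

    def locate(self, row):
        if row not in self.cache:
            self.metadata_reads += 1
            metadata_tablet = find(ROOT, row)                        # scan root tablet
            self.cache[row] = find(METADATA[metadata_tablet], row)   # scan metadata
        return self.cache[row]

client = Client()
first = client.locate("bill")
second = client.locate("bill")    # served from the cache
other = client.locate("george")
print(first, other, client.metadata_reads)
```

Once a location is cached, repeated reads go straight to the tablet server without consulting the metadata hierarchy again.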

Page 56: Introduction to Apache Accumulo

Security

Page 57: Introduction to Apache Accumulo

Security

• Design and Guarantees

• Data Labeling

• Authentication

• User Configuration

Page 58: Introduction to Apache Accumulo

Data Security

• Accumulo will only return cells whose visibility labels are satisfied by user credentials presented at Scan time

• Two necessary conditions

• Correctly labeling data on ingest

• Presenting right user credentials

Page 59: Introduction to Apache Accumulo

Security Labels

Key: row ID, column, timestamp

Column: family, qualifier, visibility

Each key maps to a value

Extension of BigTable data model

Page 60: Introduction to Apache Accumulo

Column Visibility

row col fam col qual col vis time value

bill attribute age public Jun 2010 49

bill attribute phone private Jun 2010 555-1212

bill purchases sneakers public Apr 2010 $100

george attribute age private Oct 2009 38

george purchases sneakers public Nov 2009 $80

george returns hat public Dec 2009 $30

Page 61: Introduction to Apache Accumulo

Security Label Syntax

• A & B - both A and B required

• A | B - must have either A or B

• (A | B) & C - must have C and either A or B

• A | (B & C) - must have A or both B and C

• A & (B | (C & D)) - must have A and either B or both C and D

Page 62: Introduction to Apache Accumulo

Security Label Example

• Drive needs:

• license&over15

• Join military:

• (over17|(over16&parentConsent)) & (greencard|USCitizen)

• Access to Classified data

• TS&SI&(USA|GBR|NZL|CAN|AUS)
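The label syntax and examples above can be checked with a small recursive-descent evaluator. This is a sketch, not Accumulo's visibility parser: the `is_visible` function is hypothetical, it does not validate malformed labels, and it gives `&` higher precedence than `|` when the two are mixed without parentheses (the slide examples always parenthesize). An empty label is treated as visible to everyone.

```python
import re

def is_visible(label, authorizations):
    """Return True if the set of authorizations satisfies the visibility label."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", label)
    if not tokens:
        return True  # an unlabeled cell is treated as visible to everyone
    position = 0

    def parse_or():
        nonlocal position
        result = parse_and()
        while position < len(tokens) and tokens[position] == "|":
            position += 1
            right = parse_and()   # always parse so the position advances
            result = result or right
        return result

    def parse_and():
        nonlocal position
        result = parse_term()
        while position < len(tokens) and tokens[position] == "&":
            position += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            result = parse_or()
            position += 1         # skip the closing ")"
            return result
        return token in authorizations

    return parse_or()

military = "(over17|(over16&parentConsent)) & (greencard|USCitizen)"
print(is_visible(military, {"over16", "parentConsent", "USCitizen"}))
print(is_visible("TS&SI&(USA|GBR|NZL|CAN|AUS)", {"TS", "CAN"}))
```

A cell is returned only when the authorizations presented at scan time make its whole label expression true.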

Page 63: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Trusted Client Auth Service

User

ID, password, cert

auths

verify

auths data

data

Page 64: Introduction to Apache Accumulo

Trusted Client Responsibility

• Ensure that credentials belong to the user

• Ensure that the user is authenticated

Page 65: Introduction to Apache Accumulo

Application Authorization

• Each Trusted Client application must have its maximum authorizations set before any authorizations can be passed to Accumulo

• The Trusted Client limits the set of authorizations by application

Page 66: Introduction to Apache Accumulo

Application Authorization Example

• Data may be labeled with any combination of the following:

{ personal, research, finance, diet, cancer }

• We wish to limit certain applications to a subset

Page 67: Introduction to Apache Accumulo

Example Table

row   colF       colQ  col vis                       value
row0  name       -     personal|finance              John
row0  age        -     personal|research             49
row0  phone      -     personal|finance              555-1212
row0  owed       -     personal|finance              $5440
row0  diagnosis  -     personal|(research & cancer)  melanoma
row0  diagnosis  -     personal|(research & diet)    diabetes

Page 68: Introduction to Apache Accumulo

Application Authorizations

Cancer Research: cancer diagnoses, age

Diabetes Research: diet info, age

Accounting System: balance, name, phone

Personal Records Management: all

Page 69: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

Researcher

ID, password, cert

Cancer Research App

Page 70: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

ID, password, cert

verify

Researcher

Cancer Research App

Page 71: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

ID, password, cert

research, cancer, diabetes

verify

Researcher

Cancer Research App

Page 72: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

ID, password, cert

research,cancer

Researcher

Cancer Research App

Page 73: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

ID, password, cert

data

research,cancer

Researcher

Cancer Research App

Page 74: Introduction to Apache Accumulo

Security Perimeter

Security Model

Accumulo

Auth Service

ID, password, cert

data

data

research,cancer

Researcher

Cancer Research App
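The sequence above, where a researcher holding research, cancer, and diabetes ends up scanning with only research and cancer, amounts to a set intersection performed by the trusted client. The sketch below assumes a hypothetical per-application table `APP_MAX_AUTHS` and a helper `authorizations_for_scan`. The entries for the other applications are illustrative guesses based on the application list in the slides.

```python
# Hypothetical maximum authorizations configured per application.
APP_MAX_AUTHS = {
    "Cancer Research App": {"research", "cancer"},
    "Diabetes Research App": {"research", "diet"},
    "Accounting System": {"finance"},
}

def authorizations_for_scan(app_name, user_auths):
    """Authorizations the trusted client passes on for this user and application."""
    # A user never gains authorizations from the app, and the app never
    # exposes more than its configured maximum.
    return set(user_auths) & APP_MAX_AUTHS[app_name]

researcher_auths = {"research", "cancer", "diabetes"}
passed = authorizations_for_scan("Cancer Research App", researcher_auths)
print(sorted(passed))
```

The same researcher using a different application would scan with a different, possibly empty, set of authorizations.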

Page 75: Introduction to Apache Accumulo

Data life-cycle

Page 76: Introduction to Apache Accumulo

Data Model

Key: Row ID, Column, Timestamp

Column: Family, Qualifier, Visibility

Each Key maps to a Value

Page 77: Introduction to Apache Accumulo

Versions

rowID  family  qualifier  timestamp  value
row1   fam1    qual1      1005       2
row1   fam1    qual1      1004       5
row1   fam1    qual1      1003       3
row1   fam1    qual1      1002       2
row1   fam1    qual1      1001       7

What can we do with multiple versions of the same data?

Page 78: Introduction to Apache Accumulo

Iterators

• Mechanism for adding online functionality to tables

• Aggregation (called Combiners)

• Age-Off

• Filtering (including by security label)

Page 79: Introduction to Apache Accumulo

Versioning Iterators

rowID  family  qualifier  timestamp  value
row1   fam1    qual1      1005       2
row1   fam1    qual1      1004       5
row1   fam1    qual1      1003       3
row1   fam1    qual1      1002       2
row1   fam1    qual1      1001       7
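A versioning iterator can be sketched as a filter that keeps only the newest few entries per key. The function name `versioning_iterator` and its `max_versions` parameter are hypothetical, and the sketch assumes the input is already sorted by key with the newest timestamp first, as in the table above.

```python
from itertools import groupby

def versioning_iterator(entries, max_versions=1):
    """Keep only the newest max_versions entries for each (row, family, qualifier).

    Assumes entries are sorted by key with the newest timestamps first.
    """
    for _key, versions in groupby(entries, key=lambda entry: entry[:3]):
        for index, entry in enumerate(versions):
            if index < max_versions:
                yield entry

entries = [
    ("row1", "fam1", "qual1", 1005, "2"),
    ("row1", "fam1", "qual1", 1004, "5"),
    ("row1", "fam1", "qual1", 1003, "3"),
    ("row1", "fam1", "qual1", 1002, "2"),
    ("row1", "fam1", "qual1", 1001, "7"),
]
latest = list(versioning_iterator(entries))
newest_two = list(versioning_iterator(entries, max_versions=2))
print(latest)
```

With `max_versions=1` a reader sees only the entry with timestamp 1005; raising the limit exposes older versions as well.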

Page 80: Introduction to Apache Accumulo

Filtering Iterators

• Age Off

• RegEx

• Arbitrary filtering

Page 81: Introduction to Apache Accumulo

Age Off

• Can specify a particular date - e.g. delete everything older than July 1, 2007

• Can specify a time period - e.g. delete everything older than 6 months

Page 82: Introduction to Apache Accumulo

Age-Off

rowID  family  qualifier  timestamp  value
row1   fam1    qual1      1005       2
row1   fam1    qual1      1004       5
row1   fam1    qual1      1003       3
row1   fam1    qual1      1002       2
row1   fam1    qual1      1001       7

Current Time: 1103

K/V pair is more than 100 sec. old

Page 83: Introduction to Apache Accumulo

Age-Off

rowID  family  qualifier  timestamp  value
row1   fam1    qual1      1005       2
row1   fam1    qual1      1004       5
row1   fam1    qual1      1003       3
row1   fam1    qual1      1002       2
row1   fam1    qual1      1001       7

Current Time: 1104

K/V pair is more than 100 sec. old

Page 84: Introduction to Apache Accumulo

Age-Off

rowID  family  qualifier  timestamp  value
row1   fam1    qual1      1005       2
row1   fam1    qual1      1004       5
row1   fam1    qual1      1003       3
row1   fam1    qual1      1002       2
row1   fam1    qual1      1001       7

Current Time: 1105

K/V pair is more than 100 sec. old
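The age-off slides above apply one rule, "drop any pair more than 100 seconds old", at three successive times. A minimal sketch of that rule follows. The `age_off` function and its parameters are hypothetical names, and the timestamps are treated as seconds.

```python
def age_off(entries, current_time, ttl):
    """Drop entries whose timestamp is more than ttl older than current_time."""
    return [entry for entry in entries if current_time - entry[3] <= ttl]

entries = [
    ("row1", "fam1", "qual1", 1005, "2"),
    ("row1", "fam1", "qual1", 1004, "5"),
    ("row1", "fam1", "qual1", 1003, "3"),
    ("row1", "fam1", "qual1", 1002, "2"),
    ("row1", "fam1", "qual1", 1001, "7"),
]

for now in (1103, 1104, 1105):
    kept = [entry[3] for entry in age_off(entries, now, ttl=100)]
    print(now, kept)
```

Under this rule the entry with timestamp 1003 is still exactly 100 seconds old at time 1103, so it survives until time 1104.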

Page 85: Introduction to Apache Accumulo

Manual Deletes

• Can insert ‘deletes’. They are inserted like other key-value pairs; any matching key with an older timestamp is suppressed from reads

• Compactions write non-deleted data to new files

• Old files are then removed from HDFS

• To ensure data is deleted from disk,

• write deletes (they are now absent from query results)

• compact (can compact a particular range of a table if it’s large)
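The delete-then-compact procedure above can be sketched with a marker value. In this toy model, `DELETE`, `read`, and `compact` are hypothetical names. A delete marker hides older entries for the same key at read time, and a compaction rewrites the data without the hidden entries or the markers themselves.

```python
DELETE = object()  # marker value standing in for a delete entry

def read(entries):
    """Yield live (key, timestamp, value) entries, honoring delete markers.

    Entries must be sorted by key, newest timestamp first.
    """
    deleted_key = None
    for key, timestamp, value in entries:
        if value is DELETE:
            deleted_key = key      # suppress this key's older versions
        elif key != deleted_key:
            yield key, timestamp, value

def compact(entries):
    """Rewrite the data without deleted entries or their markers."""
    return list(read(entries))

entries = [
    (("bill", "attribute", "phone"), 1010, DELETE),
    (("bill", "attribute", "phone"), 1001, "555-1212"),
    (("george", "attribute", "age"), 1002, "38"),
]
visible = list(read(entries))   # deleted data is already hidden from reads
new_file = compact(entries)     # and is physically gone after compaction
print(new_file)
```

Before compaction the phone number is hidden but still stored in `entries`; only the compacted output no longer contains it.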

Page 86: Introduction to Apache Accumulo

Garbage Collection

• Garbage collector compares the files in HDFS with the set of files currently active

• When files are no longer on the active list, GC waits for a while, then deletes from HDFS

Page 87: Introduction to Apache Accumulo

Applications

• Fast lookups / scan on extremely large tables with flexible schemas, varying security

• Large index across heterogeneous data sets

• Continuous Summary Analytics via Iterators

• Secure Storage of key value pairs for MapReduce jobs

Page 88: Introduction to Apache Accumulo

Where does your data come from?

• BigTable was designed to store data for web applications serving millions of users. Web application creates all the data. Many NoSQL databases are designed solely for this purpose. Accumulo can certainly support that.

• However, many organizations have lots of data from various sources. Different schema, different security levels. Bringing them together for analysis is very valuable. Accumulo can support this too.

Page 89: Introduction to Apache Accumulo

Indexing and queries

• BigTable data model supports building a wide variety of indexes

• Simple strings, numbers, geo points, ip addresses, etc

• Each has to be coupled with query code

• New applications should examine their data access use cases; the indexes and query code needed to support them can then be written

• Best applications are constructed so each user request is a single scan, or a small number of scans
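One way to picture an index paired with query code is an index table whose row is the indexed value, so that a user request becomes a single range scan. The sketch below is an illustration only: the `records`, the layout of `index_table`, and the `lookup` function are all hypothetical.

```python
from bisect import bisect_left

records = {
    "rec1": {"name": "bill", "city": "columbia"},
    "rec2": {"name": "george", "city": "columbia"},
    "rec3": {"name": "bill", "city": "boston"},
}

# Index table: one sorted entry per (field value, field name, record ID).
index_table = sorted(
    (value, field, record_id)
    for record_id, fields in records.items()
    for field, value in fields.items()
)

def lookup(value):
    """Return record IDs whose indexed value matches, using one range scan."""
    start = bisect_left(index_table, (value,))
    matches = []
    for indexed_value, _field, record_id in index_table[start:]:
        if indexed_value != value:
            break                  # past the end of the matching range
        matches.append(record_id)
    return matches

print(lookup("bill"))
print(lookup("columbia"))
```

Each query reads one contiguous slice of the sorted index, which is the access pattern the last bullet recommends.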

Page 90: Introduction to Apache Accumulo

Compared to MapReduce

• Hadoop’s HDFS stores simple files. Usually unsorted.

• MapReduce is designed to process all or most of the files at once.

• Accumulo maintains a set of sorted files in HDFS

• Accumulo scans are designed to access a small portion of the data quickly.

• Fairly complementary

Page 91: Introduction to Apache Accumulo

Tough use case

• Ran MapReduce on some input data set to create a large result set.

• Now have a few new records, want to update the result set

• MapReduce has to process all the data again, have to wait

• Accumulo allows users to perform a limited set of operations to update a result set incrementally, using Iterators

• Result sets are always up to date, immediately after insert

Page 92: Introduction to Apache Accumulo

Combiners

row col fam col qual col vis time value

bill perf June_calls P June 1 9

bill perf June_calls P June 4 3

bill perf July_calls P July 3 4

bill perf July_calls P July 11 7

bill perf August_calls P Aug 12 5

bill perf August_calls P Aug 29 2

Page 93: Introduction to Apache Accumulo

Combiners

row col fam col qual col vis time value

bill perf June_calls P - 12

bill perf July_calls P - 11

bill perf August_calls P - 7

Page 94: Introduction to Apache Accumulo

Combiners

• Almost equivalent to Reduce of MapReduce except:

• Cannot assume we have seen all the values for a particular key

• Exactly equivalent to a Combiner function
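Because a combiner cannot assume it has seen every value for a key, its function must give the same answer whether it runs once over everything or in stages over partial results. Summation has that property, as the sketch below shows using the call counts from the earlier Combiners slide. The `summing_combiner` function is a hypothetical stand-in, not Accumulo's class of a similar name.

```python
from collections import defaultdict

def summing_combiner(entries):
    """Sum the values seen for each key; the input may be only a partial set."""
    totals = defaultdict(int)
    for key, value in entries:
        totals[key] += value
    return sorted(totals.items())

calls = [
    ("June_calls", 9), ("June_calls", 3),
    ("July_calls", 4), ("July_calls", 7),
    ("August_calls", 5), ("August_calls", 2),
]

# Combine everything at once...
all_at_once = summing_combiner(calls)
# ...or combine two partial batches, then combine the partial results.
partial = summing_combiner(calls[:3]) + summing_combiner(calls[3:])
in_stages = summing_combiner(partial)
print(all_at_once == in_stages, dict(all_at_once))
```

Both routes give 12 for June, 11 for July, and 7 for August, matching the combined table in the slides.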

Page 95: Introduction to Apache Accumulo

Combiners

• Useful Combiners:

• Event count (StringSummation or LongSummation aggregator)

• Event hour occurrence histogram (NumArraySummation aggregator)

• Event duration histogram (NumArraySummation aggregator)

Page 96: Introduction to Apache Accumulo

Conceptual Graph Representation

a

c

b

e

f

d

g

Page 97: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     1.0
c    edge     b         -        -     1.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0
g    edge     e         -        -     1.0
g    edge     f         -        -     1.0

Page 98: Introduction to Apache Accumulo

Edge Weights

• Summing Combiners are typically used to efficiently and incrementally update edge weights

• See SummingCombiner

Page 99: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     1.0
c    edge     b         -        -     1.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0

Incoming: a, edge, f, 1.0

Page 100: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     2.0
c    edge     b         -        -     1.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0

Page 101: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     2.0
c    edge     b         -        -     1.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0

Incoming: c, edge, b, 6.0

Page 102: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     2.0
c    edge     b         -        -     7.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0

Page 103: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     2.0
c    edge     b         -        -     7.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0

Incoming: a, edge, f, 2.3

Page 104: Introduction to Apache Accumulo

Edge table

row  col fam  col qual  col vis  time  value
a    edge     f         -        -     4.3
c    edge     b         -        -     7.0
c    edge     d         -        -     1.0
d    edge     b         -        -     1.0
d    edge     e         -        -     1.0
e    edge     d         -        -     1.0
f    edge     g         -        -     1.0
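The edge-table walkthrough above can be sketched as blind appends plus a sum applied at read time, so the writer never has to read the current weight before updating it. The names `insert_edge` and `scan_weights` are hypothetical, and a Python list stands in for the table.

```python
from collections import defaultdict

# Raw entries as inserted: ((source, target), weight).
edge_entries = [
    (("a", "f"), 1.0), (("c", "b"), 1.0), (("c", "d"), 1.0),
    (("d", "b"), 1.0), (("d", "e"), 1.0), (("e", "d"), 1.0),
    (("f", "g"), 1.0),
]

def insert_edge(entries, source, target, weight):
    """Blind write: append the new weight without reading the current one."""
    entries.append(((source, target), weight))

def scan_weights(entries):
    """Apply a summing combiner at read time, one total per edge."""
    totals = defaultdict(float)
    for edge, weight in entries:
        totals[edge] += weight
    return dict(sorted(totals.items()))

# The three incoming entries from the slides.
insert_edge(edge_entries, "a", "f", 1.0)
insert_edge(edge_entries, "c", "b", 6.0)
insert_edge(edge_entries, "a", "f", 2.3)

weights = scan_weights(edge_entries)
print(weights[("a", "f")], weights[("c", "b")])
```

The totals for the two updated edges match the final table, and the five untouched edges keep their original weight of 1.0.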

Page 105: Introduction to Apache Accumulo

Edge Table Applications

• Graph Analytics - traversal, neighbors, connected components

• Neighborhood = feature vector. Vector-based machine learning techniques. Nearest neighbor search, clustering, classification

• Automated dossiers, fact accumulation - ‘tell me everything we know about X’ in a single scan

• Find entities based on features - ‘show me everyone who has feature value > x’ or ‘with < 5 neighbors of type k’

Page 106: Introduction to Apache Accumulo

RDF Triples

row col fam col qual col vis time value

DC is_capital_of USA 1.0

Don vacations_in Arctic 7.0

Don is_employed_by MI6 1.0

Sean has_status “007” 1.0

Sean starred_with Ursula 1.0

Sean starred_with Anya 0.7

Sean starred_with Teresa 0.3

Page 108: Introduction to Apache Accumulo

Additional Training

Page 109: Introduction to Apache Accumulo

Additional Training

• Talked about the basics today

• 3 days of developer training with hands on examples covering

• installation, configuration, read / write API, MapReduce, security, table configuration, indexing specific types, querying index tables, combiners, custom iterators, table constraints, storing relational data, joins, high performance considerations, document-partitioned indexing (text search), machine learning, object persistence

• 2 days of administrator training covering

• hardware selection, process assignment, troubleshooting, maintenance, replication and high availability, cluster modification, failure handling

Page 110: Introduction to Apache Accumulo

Next Scheduled Training Sessions

• March 5-7 Columbia MD

• April 9-11 Columbia MD

• http://www.tetraconcepts.com/training
