aaron-cordova
Description of Apache Accumulo including data model, scaling and recovery features, API, security, and applications
Apache Accumulo
Introduction
• Aaron Cordova
• Founded Accumulo project with several others
• Led development through release 1.0
Agenda
• Introduction
• Data Model
• API
• Architecture - scaling, recovery
• Security
• Data-lifecycle
• Applications
Introduction
History
• Development began in the summer of 2008, after comparing design goals with the BigTable paper and the existing implementations HBase and Hypertable
• Released internal version 1.0 summer of 2009.
• September 2011 accepted as an Apache Incubator project. Doug Cutting, founder of Hadoop, was the Champion Sponsor
• Feb 2012 1.4 Released
• March 2012 graduates to a top level Apache project
• V 1.5 due out soon
Introduction
• Accumulo is a sparse, distributed, sorted, multi-dimensional map
• Modeled after Google’s BigTable design
• Scales to trillions of records and hundreds of terabytes
• Features automatic load balancing, high-availability, dynamic control over data layout
Data Model
Key: row ID | Column (Family, Qualifier, Visibility) | Timestamp
Value
Data Model (Logical 2D table structure)
row      attribute:age  attribute:phone  purchases:sneakers  returns:hat
bill     49             555-1212         $100                -
george   38             -                $80                 $30
Physical layout (sorted keys)
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
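The physical layout above can be imitated in a few lines of Python. This is a toy sketch, not the Accumulo API: the Key tuple and the sort_key helper are hypothetical names, and integer timestamps such as 201006 stand in for the dates on the slide. It assumes that entries with the same row and column sort newest timestamp first.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int  # e.g. 201006 stands in for "Jun 2010"

def sort_key(key: Key):
    # Ascending on row and column fields, descending on timestamp
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

# Entries are inserted in arbitrary order on purpose
entries = {
    Key("george", "returns", "hat", "public", 200912): "$30",
    Key("bill", "purchases", "sneakers", "public", 201004): "$100",
    Key("george", "attribute", "age", "private", 200910): "38",
    Key("bill", "attribute", "phone", "private", 201006): "555-1212",
    Key("george", "purchases", "sneakers", "public", 200911): "$80",
    Key("bill", "attribute", "age", "public", 201006): "49",
}

sorted_keys = sorted(entries, key=sort_key)
for k in sorted_keys:
    print(k.row, k.family, k.qualifier, k.visibility, k.timestamp, entries[k])
```

Sorting the keys reproduces the row order of the physical layout table, regardless of the order in which the entries were written.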
High-level API
Accumulo API
• To use Accumulo, you must write an application using the Accumulo Java client library. There is no SQL (hence NoSQL)
• Data is packaged into Mutation objects which are added to a BatchWriter which sends them to TabletServers
• Clients can scan a set of key value pairs by specifying optional start and end keys (Range) and obtaining a Scanner. Iterating over the scanner returns sorted key value pairs for that range. Each scan takes milliseconds to start.
• Can scan over a subset of the columns
• Can send a set of Ranges to a BatchScanner, get matching key value pairs, unsorted
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
bill attribute phone private Jun 2010 555-1212
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan - Full key lookup
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan key: bill attribute phone private Jun 2010
Scan - Single row
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan range: row bill
Scan - Multiple Rows
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan range: rows bill - will
Scan - Multiple Rows, Selected Columns
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan range: rows bill - will, fetch only the purchases column family
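The scan slides above can be modeled as a binary search to the start of a range followed by a sequential read. The sketch below is a toy in plain Python, not the Accumulo Scanner: the scan function and its fetch_family parameter are hypothetical, and visibility and timestamp are left out to keep the keys short.

```python
import bisect

# (row, family, qualifier) -> value, kept sorted; visibility and time omitted
table = sorted([
    (("bill", "attribute", "age"), "49"),
    (("bill", "attribute", "phone"), "555-1212"),
    (("bill", "purchases", "sneakers"), "$100"),
    (("george", "attribute", "age"), "38"),
    (("george", "purchases", "sneakers"), "$80"),
    (("george", "returns", "hat"), "$30"),
])
keys = [k for k, _ in table]

def scan(start_row, end_row, fetch_family=None):
    """Yield sorted entries whose row lies in [start_row, end_row]."""
    # Seek to the first key of start_row, then read sequentially
    lo = bisect.bisect_left(keys, (start_row,))
    for key, value in table[lo:]:
        if key[0] > end_row:
            break
        if fetch_family is None or key[1] == fetch_family:
            yield key, value

single_row = list(scan("bill", "bill"))
multi_row = list(scan("bill", "will"))
purchases = list(scan("bill", "will", fetch_family="purchases"))
```

Because the keys are sorted, every scan touches one contiguous run of the table, which is why a scan can start quickly even on a very large table.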
Architecture - Scaling and Recovery
Performance
• Accumulo ‘scales’ because aggregate read and write performance increases as more machines are added, and because individual read/write performance remains very good even with trillions of key-value pairs already in the system
• Sources: http://www.slideshare.net/acordova00/accumulo-on-ec2
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
[Figure: thousands of writes per second (10 to 10,000) versus number of machines (1 to 1024) for Accumulo, BigTable circa 2006, and Cassandra]
Accumulo Prerequisites
• One to hundreds of computers with local hard drives, connected via ethernet
• Password-less SSH access
• Local directory for write-ahead logs
• Hadoop and ZooKeeper installed, configured, and running
Architecture
HDFS MapReduce
Accumulo
ZooKeeper
Architecture: HDFS
[Diagram sequence: HDFS with a NameNode and DataNodes; a File is divided into Block 1 and Block 2, which are placed on DataNodes]
Architecture: Tables
[Diagram sequence: Accumulo with a Master and Tablet Servers; a Table is divided into partitions P1, P2, and P3, which are distributed across the Tablet Servers]
Architecture: Writes
[Diagram sequence: a Client writes to partition P1; the write goes to the Write-ahead Log and the MemTable; the MemTable is written to HDFS as File 2 alongside File1; in the final step the Write-ahead Log is crossed out (X)]
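The write path in the slides above can be sketched as a toy model. The Tablet class, its flush_threshold parameter, and the row names are hypothetical and only illustrate the idea; the sketch also assumes that the crossed-out log on the last slide means log entries are no longer needed once the MemTable has been written to a file.

```python
class Tablet:
    """Toy tablet: write-ahead log + in-memory table + sorted files."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stands in for the on-disk write-ahead log
        self.memtable = {}   # recent writes held in memory
        self.files = []      # each flush produces one sorted, immutable file
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))   # log first, for durability
        self.memtable[key] = value      # then apply in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted file, then drop the log
        # entries that the new file now covers.
        self.files.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()

tablet = Tablet()
for i, row in enumerate(["george", "bill", "will", "amy"]):
    tablet.write(row, i)
```

After the third write the MemTable is flushed to one sorted file and the log is emptied; the fourth write sits only in the log and the MemTable until the next flush.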
Architecture: Splits
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Architecture: Splits
Accumulo
Tablet Servers
Master
Sorted keys - dynamic partitioning
• Because keys are sorted, tables can be partitioned based on the data
• Partitions (tablets) are uniform in size regardless of data distribution (as long as single rows are smaller than the partition size)
• Partitioning is not based on the number of servers
• Servers can be added, removed, or fail at any time; the system rebalances automatically
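A minimal sketch of data-driven partitioning, assuming tablet size is counted in rows rather than bytes. The split_tablet function and the sample row IDs are hypothetical; the point is only that split points are chosen from the sorted data itself, so skewed keys still yield evenly sized tablets.

```python
def split_tablet(sorted_rows, max_rows):
    """Split a sorted run of row IDs into tablets of at most max_rows rows.

    Returns (tablets, split_points). Size is counted in rows here for
    simplicity; a real system would measure bytes.
    """
    tablets = [sorted_rows[i:i + max_rows]
               for i in range(0, len(sorted_rows), max_rows)]
    # The last row of every tablet except the final one is a split point
    split_points = [t[-1] for t in tablets[:-1]]
    return tablets, split_points

# Skewed data: most rows start with "a", yet tablets stay evenly sized
rows = sorted(["aa", "ab", "ac", "ad", "ae", "af", "m", "z"])
tablets, split_points = split_tablet(rows, max_rows=3)
```

A hash-based scheme would also spread these rows evenly, but it would scatter neighboring rows and so give up range scans.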
Partitioning Contrast
• Some relational databases allow partitioning. May require users to choose a field or two on which to partition. Hopefully that field is uniformly distributed
• Hash-based systems (default Cassandra, CouchDB, Riak, Voldemort) avoid this problem, but at the cost of range scans. Some support range scans via other means.
• Many systems couple partition storage with partition service, requiring data movement to rebalance partition service (MongoDB, Cassandra, etc)
Architecture: Reads
[Diagram: a Client read of partition P1 is served by a Merge of the MemTable, File1, and File 2]
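The merge on the read path can be sketched with Python's heapq.merge, which lazily combines already-sorted inputs. The file contents and row names below are made up for illustration, and the read function is a hypothetical helper, not part of any Accumulo API.

```python
import heapq

# Two sorted files plus an unsorted in-memory table (sample data)
file1 = [("bill", "49"), ("george", "38")]
file2 = [("carol", "27"), ("will", "61")]
memtable = {"dave": "33", "amy": "52"}

def read(sources):
    """Merge several sorted (key, value) runs into one sorted stream."""
    return list(heapq.merge(*sources))

merged = read([file1, file2, sorted(memtable.items())])
merged_rows = [row for row, _ in merged]
```

Each source only needs to be sorted on its own; the merge presents them to the client as a single sorted view.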
Architecture: Recovery
[Diagram sequence: Accumulo (Master, Tablet Servers) running over HDFS (NameNode, DataNodes). Steps labeled on the slides:]
• Master reassigns
• Replay Write-ahead Log
Metadata Hierarchy
root
metadata table: md1 md2 md3
user tables: user1 user2 index1 index2
Architecture: Lookup
[Diagram sequence: Client, ZooKeeper, and Accumulo (Master, Tablet Servers)]
• Client knows ZooKeeper, finds root tablet
• Scan root tablet, find metadata tablet that describes the user table we want
• Read location info of tablets of user table and cache it
• Read directly from server holding the tablets we want
• Find other tablets via cache lookups
Security
• Design and Guarantees
• Data Labeling
• Authentication
• User Configuration
Data Security
• Accumulo will only return cells whose visibility labels are satisfied by user credentials presented at Scan time
• Two necessary conditions
• Correctly labeling data on ingest
• Presenting right user credentials
Security Labels
Key: row ID | column (family, qualifier, visibility) | timestamp
value
Extension of BigTable data model
Column Visibility
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Security Label Syntax
• A & B - both A and B required
• A | B - must have either A or B
• (A | B) & C - must have C and either A or B
• A | (B & C) - must have A or both B and C
• A & (B | (C & D)) - must have A and either B or both C and D
Security Label Example
• Drive needs:
• license&over15
• Join military:
• (over17|(over16&parentConsent)) & (greencard|USCitizen)
• Access to Classified data
• TS&SI&(USA|GBR|NZL|CAN|AUS)
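The label syntax and examples above can be checked with a small evaluator. This is a hedged sketch in plain Python, not Accumulo's own visibility parser: is_visible is a hypothetical helper, it accepts only letters, digits, and underscores in terms, does no error handling, and simply evaluates left to right if & and | are mixed without parentheses.

```python
import re

def is_visible(label, auths):
    """Return True if the set of auths satisfies the visibility label."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", label)
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_expr()
            pos += 1  # skip the closing ")"
            return result
        return tok in auths  # a plain term is satisfied if the user holds it

    return parse_expr()

military = "(over17|(over16&parentConsent)) & (greencard|USCitizen)"
classified = "TS&SI&(USA|GBR|NZL|CAN|AUS)"

print(is_visible(military, {"over16", "parentConsent", "USCitizen"}))
print(is_visible(classified, {"TS", "CAN"}))
```

The first call is satisfied through the over16 and parentConsent branch; the second is not, because SI is missing.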
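Security Model
[Diagram: inside the Security Perimeter, a User sends ID, password, cert to a Trusted Client; the Trusted Client verifies them with the Auth Service and receives auths; the Trusted Client sends auths to Accumulo and receives data; data is returned to the User]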
Trusted Client Responsibility
• Ensure that credentials belong to the user
• Ensure that the user is authenticated
Application Authorization
• Trusted Client applications must have max authorizations set before they can be passed
• The Trusted Client limits the set of authorizations by application
Application Authorization Example
• Data may be labeled with any combination of the following:
{ personal, research, finance, diet, cancer }
• We wish to limit certain applications to a subset
Example Table
row colF ColQ col vis value
row0 name - personal|finance John
row0 age - personal|research 49
row0 phone - personal|finance 555-1212
row0 owed - personal|finance $5440
row0 diagnosis - personal|(research & cancer) melanoma
row0 diagnosis - personal|(research & diet) diabetes
Application Authorizations
Cancer Research: cancer diagnoses, age
Diabetes Research: diet info, age
Accounting System: balance, name, phone
Personal Records Management: all
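One way to read the application list above is as a per-application cap that is intersected with the user's own authorizations. The sketch below assumes that reading; the APP_MAX_AUTHS mapping, its label choices per application, and the effective_auths helper are hypothetical, and the researcher's authorizations are taken from the researcher walkthrough in this deck.

```python
# Hypothetical per-application maximum authorizations, following the
# application list above
APP_MAX_AUTHS = {
    "cancer_research": {"research", "cancer"},
    "diabetes_research": {"research", "diet"},
    "accounting": {"finance"},
    "personal_records": {"personal", "research", "finance", "diet", "cancer"},
}

def effective_auths(app, user_auths):
    """Authorizations a trusted client may pass on for this user and app."""
    return set(user_auths) & APP_MAX_AUTHS[app]

# The auth service reports these for the researcher
researcher = {"research", "cancer", "diabetes"}
passed = effective_auths("cancer_research", researcher)
print(sorted(passed))
```

With this cap in place the Cancer Research App passes only research and cancer, even though the auth service reported more for the same user.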
Security Model
[Diagram sequence inside the Security Perimeter: Researcher, Cancer Research App, Auth Service, Accumulo]
• Researcher presents ID, password, cert to the Cancer Research App
• The app verifies them with the Auth Service
• The Auth Service returns research, cancer, diabetes
• The app passes research, cancer to Accumulo
• Accumulo returns data to the app
• The app returns data to the Researcher
Data life-cycle
Data Model
Key: row ID | Column (Family, Qualifier, Visibility) | Timestamp
Value
Versions
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
What can we do with multiple versions of the same data?
Iterators
• Mechanism for adding online functionality to tables
• Aggregation (called Combiners)
• Age-Off
• Filtering (including by security label)
Versioning Iterators
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
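A versioning iterator can be sketched as a filter that keeps only the newest entries of each column. This is an illustrative toy, not Accumulo's iterator interface: keep_newest and its max_versions parameter are hypothetical names, and the input is assumed to be sorted newest first within each column, as in the table above.

```python
from itertools import groupby

# (row, family, qualifier, timestamp, value), newest first per column
entries = [
    ("row1", "fam1", "qual1", 1005, 2),
    ("row1", "fam1", "qual1", 1004, 5),
    ("row1", "fam1", "qual1", 1003, 3),
    ("row1", "fam1", "qual1", 1002, 2),
    ("row1", "fam1", "qual1", 1001, 7),
]

def keep_newest(sorted_entries, max_versions=1):
    """Keep only the newest max_versions entries of each column."""
    kept = []
    # Entries of the same row/family/qualifier are adjacent in sorted order
    for _, versions in groupby(sorted_entries, key=lambda e: e[:3]):
        kept.extend(list(versions)[:max_versions])
    return kept

latest = keep_newest(entries)
latest_two = keep_newest(entries, max_versions=2)
```

Because the newest version sorts first, the filter can stop after the first few entries of each column without looking at the rest.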
Filtering Iterators
• Age Off
• RegEx
• Arbitrary filtering
Age Off
• Can specify a particular date - e.g. delete everything older than July 1, 2007
• Can specify a time period - e.g. delete everything older than 6 months
Age-Off
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
Current Time: 1103, then 1104, then 1105
At each step the slide marks the K/V pair that is now more than 100 sec. old
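The age-off rule on these slides can be written as a one-line filter. A minimal sketch, assuming timestamps and the current time are in seconds and that "more than 100 sec. old" means strictly older than 100 seconds; age_off and max_age are hypothetical names.

```python
def age_off(entries, current_time, max_age=100):
    """Drop entries whose timestamp is more than max_age behind current_time."""
    return [e for e in entries if current_time - e[3] <= max_age]

# (row, family, qualifier, timestamp, value)
entries = [
    ("row1", "fam1", "qual1", 1005, 2),
    ("row1", "fam1", "qual1", 1004, 5),
    ("row1", "fam1", "qual1", 1003, 3),
    ("row1", "fam1", "qual1", 1002, 2),
    ("row1", "fam1", "qual1", 1001, 7),
]

# Timestamps that survive at each of the three times shown on the slides
surviving = {t: [e[3] for e in age_off(entries, t)] for t in (1103, 1104, 1105)}
print(surviving)
```

As the clock advances by one second, one more version crosses the 100-second boundary and drops out of the result.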
Manual Deletes
• Can insert ‘deletes’. They are inserted like other key-value pairs; any matching keys with an older timestamp are suppressed from reads
• Compactions write non-deleted data to new files
• Old files are then removed from HDFS
• To ensure data is deleted from disk:
• write deletes (the data is now absent from query results)
• compact (can compact a particular range of a table if it’s large)
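The two steps above can be sketched as follows. This is a toy model with hypothetical names (DELETE, visible, compact): a delete marker hides entries for the same key with an equal or older timestamp at read time, and a full compaction rewrites only the surviving data so that the old file can be removed.

```python
DELETE = object()  # marker value standing in for a delete entry

def visible(entries):
    """Read path: hide delete markers and the older entries they cover."""
    result, deleted_at = [], {}
    # Newest first per key; at equal timestamps the delete marker sorts first
    ordered = sorted(entries, key=lambda e: (e[0], -e[1], e[2] is not DELETE))
    for key, ts, value in ordered:
        if value is DELETE:
            deleted_at.setdefault(key, ts)
        elif ts > deleted_at.get(key, -1):
            result.append((key, ts, value))
    return result

def compact(entries):
    """Full compaction: write only the non-deleted data to a new file."""
    return visible(entries)

old_file = [("bill", 1, "49"), ("bill", 2, DELETE), ("george", 1, "38")]
new_file = compact(old_file)
print(new_file)
```

Before compaction the deleted value is still on disk and merely hidden from reads; afterwards it exists only in the old file, which is then eligible for removal.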
Garbage Collection
• Garbage collector compares the files in HDFS with the set of files currently active
• When files are no longer on the active list, GC waits for a while, then deletes from HDFS
Applications
• Fast lookups / scan on extremely large tables with flexible schemas, varying security
• Large index across heterogeneous data sets
• Continuous Summary Analytics via Iterators
• Secure Storage of key value pairs for MapReduce jobs
Where does your data come from?
• BigTable was designed to store data for web applications serving millions of users. Web application creates all the data. Many NoSQL databases are designed solely for this purpose. Accumulo can certainly support that.
• However, many organizations have lots of data from various sources. Different schema, different security levels. Bringing them together for analysis is very valuable. Accumulo can support this too.
Indexing and queries
• BigTable data model supports building a wide variety of indexes
• Simple strings, numbers, geo points, ip addresses, etc
• Each has to be coupled with query code
• New applications should first examine their data access use cases; indexes and query code to support them can then be written
• Best applications are constructed so each user request is a single scan, or a small number of scans
Compared to MapReduce
• Hadoop’s HDFS stores simple files. Usually unsorted.
• MapReduce is designed to process all or most of the files at once.
• Accumulo maintains a set of sorted files in HDFS
• Accumulo scans are designed to access a small portion of the data quickly.
• Fairly complementary
Tough use case
• Ran MapReduce on some input data set to create a large result set.
• Now have a few new records, want to update the result set
• MapReduce has to process all the data again, have to wait
• Accumulo allows users to perform a limited set of operations to update a result set incrementally, using Iterators
• Result sets are always up to date, immediately after insert
Combiners
row col fam col qual col vis time value
bill perf June_calls P June 1 9
bill perf June_calls P June 4 3
bill perf July_calls P July 3 4
bill perf July_calls P July 11 7
bill perf August_calls P Aug 12 5
bill perf August_calls P Aug 29 2
Combiners
row col fam col qual col vis time value
bill perf June_calls P - 12
bill perf July_calls P - 11
bill perf August_calls P - 7
Combiners
• Almost equivalent to Reduce of MapReduce except:
• Cannot assume we have seen all the values for a particular key
• Exactly equivalent to a Combiner function
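The call-count tables above can be reproduced with a summing combiner sketched in plain Python. The summing_combiner function is a hypothetical stand-in, not Accumulo's combiner class; it shows why a combiner need not see every value for a key at once: partial sums can be combined again later.

```python
from collections import defaultdict

# (row, family, qualifier) -> values written at different times
writes = [
    (("bill", "perf", "June_calls"), 9),
    (("bill", "perf", "June_calls"), 3),
    (("bill", "perf", "July_calls"), 4),
    (("bill", "perf", "July_calls"), 7),
    (("bill", "perf", "August_calls"), 5),
    (("bill", "perf", "August_calls"), 2),
]

def summing_combiner(entries):
    """Sum all values seen so far for each key."""
    totals = defaultdict(int)
    for key, value in entries:
        totals[key] += value
    return dict(totals)

totals = summing_combiner(writes)

# A late write is folded into the earlier partial result, not the raw data
late_write = (("bill", "perf", "June_calls"), 1)
updated = summing_combiner(list(totals.items()) + [late_write])
```

Summation is associative, so combining partial results gives the same totals as combining all raw values at once.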
Combiners
• Useful Combiners:
• Event count (StringSummation or LongSummation aggregator)
• Event hour occurrence histogram (NumArraySummation aggregator)
• Event duration histogram (NumArraySummation aggregator)
Conceptual Graph Representation
[Figure: graph with nodes a, b, c, d, e, f, g]
Edge table
row col fam col qual col vis time value
a edge f 1.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
g edge e 1.0
g edge f 1.0
Edge Weights
• Summing Combiners are typically used to efficiently and incrementally update edge weights
• See SummingCombiner
Edge table
row col fam col qual col vis time value
a edge f 1.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
Incoming: a, edge, f, 1.0
Updated row: a edge f 2.0
Incoming: c, edge, b, 6.0
Updated row: c edge b 7.0
Incoming: a, edge, f, 2.3
Updated row: a edge f 4.3
Edge Table Applications
• Graph Analytics - traversal, neighbors, connected components
• Neighborhood = feature vector. Vector-based machine learning techniques. Nearest neighbor search, clustering, classification
• Automated dossiers, fact accumulation - ‘tell me everything we know about X’ in a single scan
• Find entities based on features - ‘show me everyone who has feature value > x’ or ‘with < 5 neighbors of type k’
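The graph applications above reduce to row lookups on the edge table. The sketch below uses the edges from the conceptual graph slide; neighbors and reachable are hypothetical helpers, and the list filter stands in for what would be a single-row scan on a sorted table.

```python
# (source row, destination qualifier, weight) from the edge table
edges = sorted([
    ("a", "f", 1.0), ("c", "b", 1.0), ("c", "d", 1.0),
    ("d", "b", 1.0), ("d", "e", 1.0), ("e", "d", 1.0),
    ("f", "g", 1.0), ("g", "e", 1.0), ("g", "f", 1.0),
])

def neighbors(node):
    """Outgoing edges of node. A linear filter here; in a sorted table
    these rows are contiguous and can be read with one scan."""
    return {dst: weight for src, dst, weight in edges if src == node}

def reachable(start):
    """Breadth-first traversal built from repeated neighbor lookups."""
    seen, frontier = {start}, [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            for dst in neighbors(node):
                if dst not in seen:
                    seen.add(dst)
                    next_frontier.append(dst)
        frontier = next_frontier
    return seen

print(neighbors("c"))
print(sorted(reachable("a")))
```

The neighborhood returned for a node is also its feature vector in the sense of the second bullet: a mapping from neighbor to weight.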
RDF Triples
row col fam col qual col vis time value
DC is_capital_of USA 1.0
Don vacations_in Arctic 7.0
Don is_employed_by MI6 1.0
Sean has_status “007” 1.0
Sean starred_with Ursula 1.0
Sean starred_with Anya 0.7
Sean starred_with Teresa 0.3
RDF Triples - RYA
• See RYA project : http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
Additional Training
• Talked about the basics today
• 3 days of developer training with hands on examples covering
• installation, configuration, read / write API, MapReduce, security, table configuration, indexing specific types, querying index tables, combiners, custom iterators, table constraints, storing relational data, joins, high performance considerations, document-partitioned indexing (text search), machine learning, object persistence
• 2 days of administrator training covering
• hardware selection, process assignment, troubleshooting, maintenance, replication and high availability, cluster modification, failure handling
Next Scheduled Training Sessions
• March 5-7 Columbia MD
• April 9-11 Columbia MD
• http://www.tetraconcepts.com/training