Apache Accumulo Overview

Bill Havanki
Solutions Architect, Cloudera Government Solutions

Agenda

• Quick History
• Storage Model
• Loading and Querying
• Daemons
• Getting Started, a.k.a. the Pitch

A Quick History

Google BigTable

Compressed, high-performance, scalable, distributed sorted map

Google BigTable

• Began development in 2004
• Built on Google File System
• Non-relational
• Byte-oriented and schemaless
• Stores data in the petabyte range
• Research paper published in 2006

Child(ren) of BigTable

• Apache HBase (begun 2006, top-level 2010)
• Apache Cassandra (begun 2008-ish, top-level 2010)
• Apache Accumulo ...

From Cloudbase to Accumulo

• Started in 2008 as a National Security Agency project
• Submitted to Apache Incubator in 2011 (and renamed)
• Top-level project in 2012

Storage Model

Key / Value Store

Accumulo stores tables of key / value pairs

Key / Value Store

A row is a sorted sequence of key / value pairs
Each pair is a cell

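The Key

row | column (family, qualifier, visibility) | timestamp

An example key

row: bhavanki
column: family = personal, qualifier = middle, visibility = PII
timestamp: 1401041295

Another example key

column: family = employment, qualifier = salary, visibility = FIN
timestamp: 1401041296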

It’s all bytes

All key and value data are stored as bytes, except the timestamp, which is a long

There are no built-in data types, but lexicoders help with common types

Key components are usually UTF-8 strings
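Because keys are compared as raw bytes, a numeric type needs an encoding whose byte order matches its numeric order. The sketch below illustrates that idea for a long. It is not Accumulo's Lexicoder API; the class and method names are hypothetical, and the sign-bit flip is one common way to get an order-preserving encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper, not Accumulo's Lexicoder API: an order-preserving long encoding. */
final class SortableLongCodec {

  /** Encodes a long as 8 big-endian bytes with the sign bit flipped. */
  static byte[] encode(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
  }

  /** Reverses encode(). */
  static long decode(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
  }

  /** Compares arrays as unsigned bytes, the way a byte-oriented sorted store would. */
  static int compareBytes(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    long[] samples = {-5L, 0L, 7L, 1401041295L};
    for (int i = 0; i + 1 < samples.length; i++) {
      // Prints whether each encoded value sorts before the next one.
      System.out.println(compareBytes(encode(samples[i]), encode(samples[i + 1])) < 0);
    }
  }
}
```

Without the sign-bit flip, negative values would sort after positive ones when compared as unsigned bytes.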

Some rows for you

row       cf        cq        cv       ts          value
bhavanki  job       employer           2013-09-01  Cloudera
bhavanki  personal  beer               2013-09-15  Omission
bhavanki  personal  house     NOMUGGL  2014-01-25  Ravenclaw
brees     job       employer           2013-10-01  White Cliffs
brees     personal  house     NOMUGGL  2014-01-01  Hufflepuff
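The sample rows can be modeled as a sorted map to show how key order works. This is a minimal sketch, not Accumulo's Key class: CellKey is a hypothetical stand-in, the timestamps are the slide's dates written as yyyymmdd numbers purely for illustration, and String comparison stands in for byte comparison (equivalent for ASCII). Keys sort by row, family, qualifier, and visibility, then newest timestamp first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative model only; CellKey is a stand-in, not Accumulo's Key class. */
final class SortedTableSketch {

  static final class CellKey implements Comparable<CellKey> {
    final String row, family, qualifier, visibility;
    final long timestamp;

    CellKey(String row, String family, String qualifier, String visibility, long timestamp) {
      this.row = row;
      this.family = family;
      this.qualifier = qualifier;
      this.visibility = visibility;
      this.timestamp = timestamp;
    }

    /** Row, family, qualifier, visibility ascending; then newest timestamp first. */
    @Override
    public int compareTo(CellKey other) {
      int c = row.compareTo(other.row);
      if (c == 0) c = family.compareTo(other.family);
      if (c == 0) c = qualifier.compareTo(other.qualifier);
      if (c == 0) c = visibility.compareTo(other.visibility);
      if (c == 0) c = Long.compare(other.timestamp, timestamp);
      return c;
    }
  }

  /** Inserts the sample cells out of order; the map keeps them sorted by key. */
  static TreeMap<CellKey, String> sampleTable() {
    TreeMap<CellKey, String> table = new TreeMap<>();
    table.put(new CellKey("brees", "personal", "house", "NOMUGGL", 20140101L), "Hufflepuff");
    table.put(new CellKey("bhavanki", "personal", "house", "NOMUGGL", 20140125L), "Ravenclaw");
    table.put(new CellKey("brees", "job", "employer", "", 20131001L), "White Cliffs");
    table.put(new CellKey("bhavanki", "personal", "beer", "", 20130915L), "Omission");
    table.put(new CellKey("bhavanki", "job", "employer", "", 20130901L), "Cloudera");
    return table;
  }

  /** Returns the values in the order a full-table scan of this model would see them. */
  static List<String> valuesInKeyOrder() {
    return new ArrayList<>(sampleTable().values());
  }

  public static void main(String[] args) {
    sampleTable().forEach((k, v) ->
        System.out.println(k.row + " " + k.family + ":" + k.qualifier + " -> " + v));
  }
}
```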

Visibility Labels

Boolean expression

Specialist | (Management & SpecTraining)

Authorizations are provided in each scan
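To make the label semantics concrete, here is a small sketch that evaluates a visibility expression against a set of authorizations. It is a simplified stand-in, not Accumulo's own evaluator: the class name is hypothetical, tokens are treated as plain strings with no quoting, and it assumes that mixing & and | at the same level requires parentheses.

```java
import java.util.Set;

/** Simplified evaluator for illustration; not Accumulo's visibility implementation. */
final class VisibilitySketch {
  private final String expr;
  private final Set<String> auths;
  private int pos = 0;

  private VisibilitySketch(String expr, Set<String> auths) {
    this.expr = expr.replace(" ", "");
    this.auths = auths;
  }

  /** Returns true if the authorizations satisfy the label; an empty label is visible to all. */
  static boolean isVisible(String label, Set<String> auths) {
    VisibilitySketch parser = new VisibilitySketch(label, auths);
    if (parser.expr.isEmpty()) return true;
    boolean result = parser.parseExpression();
    if (parser.pos != parser.expr.length()) {
      throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
    }
    return result;
  }

  // expression := term (('&' | '|') term)*, with a single operator kind per level
  private boolean parseExpression() {
    boolean result = parseTerm();
    char firstOp = 0;
    while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
      char op = expr.charAt(pos++);
      if (firstOp == 0) {
        firstOp = op;
      } else if (op != firstOp) {
        throw new IllegalArgumentException("Mixing & and | requires parentheses");
      }
      boolean right = parseTerm(); // always parse, even if the result is already decided
      result = (op == '&') ? (result && right) : (result || right);
    }
    return result;
  }

  // term := '(' expression ')' | token
  private boolean parseTerm() {
    if (pos < expr.length() && expr.charAt(pos) == '(') {
      pos++;
      boolean inner = parseExpression();
      if (pos >= expr.length() || expr.charAt(pos) != ')') {
        throw new IllegalArgumentException("Missing closing parenthesis");
      }
      pos++;
      return inner;
    }
    int start = pos;
    while (pos < expr.length() && "&|()".indexOf(expr.charAt(pos)) < 0) pos++;
    if (start == pos) throw new IllegalArgumentException("Expected a token at position " + pos);
    return auths.contains(expr.substring(start, pos));
  }
}
```

With the label from the slide, a scan holding only Specialist sees the cell, while a scan holding only Management does not.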

Locality Groups

You can identify sets of one or more column families as locality groups

Data in a locality group is stored together for improved read performance

Tablets

A table consists of one or more tablets

employees -> employees;H | employees;S | employees;~
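The tablet names in the diagram suggest how a row is matched to a tablet: each tablet covers the rows up to and including its end row. The sketch below models that lookup with a sorted map. The class is hypothetical, the split points H and S come from the diagram, and ";~" is used for the last tablet only because the slide writes it that way.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative lookup only; not how an Accumulo client actually locates tablets. */
final class TabletLocatorSketch {
  private final String table;
  // End row (inclusive) of each tablet except the last, mapped to the tablet's name.
  private final TreeMap<String, String> tabletsByEndRow = new TreeMap<>();

  TabletLocatorSketch(String table, String... splitPoints) {
    this.table = table;
    for (String split : splitPoints) {
      tabletsByEndRow.put(split, table + ";" + split);
    }
  }

  /** Returns the name of the tablet whose row range contains the given row. */
  String tabletFor(String row) {
    Map.Entry<String, String> entry = tabletsByEndRow.ceilingEntry(row);
    // Rows past the last split point fall into the final tablet, written ";~" on the slide.
    return entry != null ? entry.getValue() : table + ";~";
  }

  public static void main(String[] args) {
    TabletLocatorSketch locator = new TabletLocatorSketch("employees", "H", "S");
    for (String row : new String[] {"Brees", "Havanki", "Zimmerman"}) {
      System.out.println(row + " -> " + locator.tabletFor(row));
    }
  }
}
```

Note that comparison is case-sensitive, so a lowercase row such as bhavanki sorts after S and lands in the last tablet of this model.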

Tablets

Tablets map to data files in HDFS

employees;H -> rfile 1
employees;S -> rfile 2
employees;~ -> rfile 3

Tablets

Data is also kept in write-ahead logs and the memtable

employees;H -> rfile 1, walogs, memtable

Loading and Querying

Java Client API

Read using scanners

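    // conn is an existing Connector; Range is from org.apache.accumulo.core.data,
    // Text from org.apache.hadoop.io, Entry from java.util.Map
    Scanner s = conn.createScanner("employees", new Authorizations());
    s.setRange(new Range("alice", "eve"));      // rows from "alice" through "eve"
    s.fetchColumnFamily(new Text("personal"));  // only the "personal" column family
    for (Entry<Key, Value> e : s)
      employeeIds.add(e.getKey().getRow());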

Java Client API

Read access via iterator pattern
• server-side system iterators handle timestamps, authorization checks, and lots more
• iterators almost always wrap other iterators, forming a chain
• you can define your own, client-side or server-side
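The chaining idea can be sketched with plain java.util.Iterator wrappers. This is an illustration only: these classes are hypothetical and do not implement Accumulo's iterator interface. One stage keeps the newest version of each cell, and a second stage applies a caller-supplied filter to whatever the first stage emits.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Illustrative only; these classes do not implement Accumulo's iterator interface. */
final class IteratorChainSketch {

  static final class Cell {
    final String coordinates; // row plus column, e.g. "bhavanki job:employer"
    final long timestamp;
    final String value;

    Cell(String coordinates, long timestamp, String value) {
      this.coordinates = coordinates;
      this.timestamp = timestamp;
      this.value = value;
    }
  }

  /** Wraps another iterator and skips cells the predicate rejects. */
  static final class FilteringIterator implements Iterator<Cell> {
    private final Iterator<Cell> source;
    private final Predicate<Cell> keep;
    private Cell next;

    FilteringIterator(Iterator<Cell> source, Predicate<Cell> keep) {
      this.source = source;
      this.keep = keep;
      advance();
    }

    private void advance() {
      next = null;
      while (next == null && source.hasNext()) {
        Cell candidate = source.next();
        if (keep.test(candidate)) next = candidate;
      }
    }

    @Override
    public boolean hasNext() {
      return next != null;
    }

    @Override
    public Cell next() {
      if (next == null) throw new NoSuchElementException();
      Cell result = next;
      advance();
      return result;
    }
  }

  /** Versioning stage: input is sorted newest first, so keep the first cell per coordinates. */
  static Iterator<Cell> newestVersionOnly(Iterator<Cell> source) {
    String[] last = {null};
    return new FilteringIterator(source, cell -> {
      boolean isNew = !cell.coordinates.equals(last[0]);
      last[0] = cell.coordinates;
      return isNew;
    });
  }

  /** Chains source -> versioning -> caller-supplied filter, and collects the values. */
  static List<String> scan(List<Cell> sortedCells, Predicate<Cell> filter) {
    Iterator<Cell> chain =
        new FilteringIterator(newestVersionOnly(sortedCells.iterator()), filter);
    List<String> values = new ArrayList<>();
    while (chain.hasNext()) values.add(chain.next().value);
    return values;
  }
}
```

Each stage only knows about the iterator it wraps, which is what lets stages be stacked in any combination.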

Java Client API

Scanners fetch sorted rows from one range
Batch scanners fetch unsorted rows from multiple ranges in parallel
Isolated scanners ensure that you do not see a row mid-change

MapReduce

AccumuloInputFormat
AccumuloOutputFormat

MapReduce

AccumuloRowInputFormat
AccumuloRowOutputFormat

Command-line / manual access to Accumulo data
• scan, insert, delete
• iterator management
• table management (creation, deletion, cloning)
• user and authorization management
• table splitting and merging
• ... more

Bulk Import

Got lots of data to import quickly?
• Use MR job to format data using AccumuloFileOutputFormat
• Import files using shell

Trade off latency / availability for throughput

Daemons

Tablet Server

Serves tablets (table data)
• writes data to walog, memtable; deals with compaction
• serves data for reads from files, memtable
• handles recovery from walogs in case of server failure

Most client calls go to tablet servers
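The tablet server duties listed above can be sketched as a toy in-memory model. Everything here is hypothetical and greatly simplified: the write-ahead log is just a list, "files" are immutable sorted maps, and recovery simply replays the log. It shows the general pattern of logging before updating the memtable, flushing the memtable to a file, and reading from the memtable and files together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of one tablet's write and read path; not Accumulo code. */
final class TabletWritePathSketch {
  private final List<String[]> writeAheadLog = new ArrayList<>(); // durable in a real system
  private TreeMap<String, String> memtable = new TreeMap<>();     // sorted, in memory
  private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first

  /** Writes go to the log first, then to the memtable. */
  void write(String key, String value) {
    writeAheadLog.add(new String[] {key, value});
    memtable.put(key, value);
  }

  /** Flush: the memtable becomes a sorted file, so its log entries are no longer needed. */
  void flush() {
    if (memtable.isEmpty()) return;
    files.add(memtable);
    memtable = new TreeMap<>();
    writeAheadLog.clear();
  }

  /** Compaction: merges all files into one; newer files overwrite older entries. */
  void compactFiles() {
    TreeMap<String, String> merged = new TreeMap<>();
    for (TreeMap<String, String> file : files) merged.putAll(file);
    files.clear();
    if (!merged.isEmpty()) files.add(merged);
  }

  /** Reads check the memtable first, then files from newest to oldest. */
  String read(String key) {
    if (memtable.containsKey(key)) return memtable.get(key);
    for (int i = files.size() - 1; i >= 0; i--) {
      String value = files.get(i).get(key);
      if (value != null) return value;
    }
    return null;
  }

  /** Simulates losing the memtable in a crash, then rebuilding it by replaying the log. */
  void crashAndRecover() {
    memtable = new TreeMap<>();
    for (String[] entry : writeAheadLog) memtable.put(entry[0], entry[1]);
  }

  int fileCount() {
    return files.size();
  }
}
```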

Master

• assigns tablets to tablet servers
• detects tablet server failures and reassigns tablets
• balances tablet assignments over time
• coordinates table operations

Multiple masters are supported for failover, but only one is active at a time
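Assignment and reassignment can be illustrated with a toy model. This is a sketch under simple assumptions, not the master's actual algorithm: the class is hypothetical, and "least loaded" here just means the live server hosting the fewest tablets, with ties broken by server name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of tablet assignment; not the Accumulo master's actual algorithm. */
final class AssignmentSketch {
  // Live tablet servers mapped to the tablets they currently host.
  private final Map<String, Set<String>> tabletsByServer = new TreeMap<>();

  void addServer(String server) {
    tabletsByServer.putIfAbsent(server, new TreeSet<>());
  }

  /** Assigns a tablet to the live server hosting the fewest tablets and returns that server. */
  String assign(String tablet) {
    if (tabletsByServer.isEmpty()) {
      throw new IllegalStateException("no live tablet servers");
    }
    String target = null;
    for (Map.Entry<String, Set<String>> entry : tabletsByServer.entrySet()) {
      if (target == null || entry.getValue().size() < tabletsByServer.get(target).size()) {
        target = entry.getKey();
      }
    }
    tabletsByServer.get(target).add(tablet);
    return target;
  }

  /** Removes a failed server and reassigns the tablets it was hosting. */
  void serverFailed(String server) {
    Set<String> orphaned = tabletsByServer.remove(server);
    if (orphaned == null) return;
    for (String tablet : orphaned) assign(tablet);
  }

  /** Number of tablets hosted by the server, or 0 if it is not live. */
  int load(String server) {
    Set<String> hosted = tabletsByServer.get(server);
    return hosted == null ? 0 : hosted.size();
  }
}
```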

Everybody Else in Accumulo

Garbage Collector (GC) - identifies and deletes files in HDFS that are no longer needed
Tracer - listens for and stores distributed trace messages using a special table

Everybody Else in Accumulo

• Monitor - collects and serves status information
  • server status
  • log inspection
  • performance data
  • table inspection

Everybody Else outside Accumulo

• HDFS (as part of Apache Hadoop)
  • stores tablet files
  • stores write-ahead logs (1.5+)

• MapReduce (Hadoop)
  • bulk import
  • batch processing

• Apache ZooKeeper

Getting Started, a.k.a. the Pitch

Easy as 1-2-3?

1. Install Hadoop (HDFS and MapReduce)
2. Install ZooKeeper
3. Install Accumulo!

Making Steps 1 and 2 Easier

Use a complete, pre-packaged Hadoop distribution... like CDH!

a leading commercial distribution centered on Apache Hadoop

• many ecosystem components
• configured / updated to work together

Making Steps 1 and 2 Easier

Cloudera Manager
• deployment
• configuration
• operation
• security

Making Step 3 Easier

Standard Apache Accumulo installation is via tarball
• no longer shipping RPM / DEB / ...

Using CDH/CM you can use:
• a tarball, RPM or DEB with Accumulo packaged for CDH
• a parcel (like RPM / ZIP) for easier upgrades

• 1.4.4 and 1.4.5 available now
• 1.6.0 soon

Where to Go for More

• http://accumulo.apache.org/
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/accumulo.html

Accumulo Summit

Join us on June 12

Quick Thanks

• My slide reviewers
  • Sean Busbey
  • Mike Drob
• Accumulo community
• You all for listening

Thank you!
Bill Havanki
bhavanki@clouderagovt.com
