Apache Accumulo Overview

An overview of the Apache Accumulo high-performance, scalable, distributed key/value store.

Bill Havanki
Solutions Architect, Cloudera Government Solutions

©2014 Cloudera, Inc. All rights reserved.

Agenda

• Quick History
• Storage Model
• Loading and Querying
• Daemons
• Getting Started, a.k.a. the Pitch

A Quick History

Google BigTable

Compressed, high-performance, scalable, distributed sorted map

• Began development in 2004
• Built on Google File System
• Non-relational
• Byte-oriented and schemaless
• Stores data in the petabyte range
• Research paper published in 2006

Child(ren) of BigTable

• Apache HBase (begun 2006, top-level 2010)
• Apache Cassandra (begun 2008-ish, top-level 2010)
• Apache Accumulo ...

From Cloudbase to Accumulo

• Started in 2008 as a National Security Agency project
• Submitted to Apache Incubator in 2011 (and renamed)
• Top-level project in 2012

Storage Model

Key / Value Store

Accumulo stores tables of key / value pairs

A row is a sorted sequence of key / value pairs
Each pair is a cell

The Key

• row
• column: family, qualifier, visibility
• timestamp

An example key

• row: bhavanki
• column family: personal
• column qualifier: middle
• column visibility: PII
• timestamp: 1401041295

Another example key

• row: brees
• column family: employment
• column qualifier: salary
• column visibility: FIN
• timestamp: 1401041296
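Accumulo sorts keys by row, then column family, qualifier, and visibility, and finally by timestamp with the newest version first. The sketch below models that ordering with plain JDK types. `SketchKey` is a hypothetical class made up for illustration, not the Accumulo `Key` class, and it compares strings rather than raw bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an Accumulo key; not part of the Accumulo API.
final class SketchKey {
    final String row;
    final String family;
    final String qualifier;
    final String visibility;
    final long timestamp;

    SketchKey(String row, String family, String qualifier, String visibility, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.visibility = visibility;
        this.timestamp = timestamp;
    }

    // Ascending by row, family, qualifier, visibility; then newest timestamp first.
    static final Comparator<SketchKey> ORDER =
        Comparator.comparing((SketchKey k) -> k.row)
            .thenComparing(k -> k.family)
            .thenComparing(k -> k.qualifier)
            .thenComparing(k -> k.visibility)
            .thenComparing((a, b) -> Long.compare(b.timestamp, a.timestamp));

    // Returns a sorted copy and leaves the input list untouched.
    static List<SketchKey> sorted(List<SketchKey> keys) {
        List<SketchKey> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<SketchKey> keys = Arrays.asList(
            new SketchKey("brees", "employment", "salary", "FIN", 1401041296L),
            new SketchKey("bhavanki", "personal", "middle", "PII", 1401041295L));
        for (SketchKey k : sorted(keys)) {
            System.out.println(k.row + " " + k.family + ":" + k.qualifier + " @" + k.timestamp);
        }
    }
}
```

Sorting newest-first means a reader that stops at the first matching key for a cell sees the latest version.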

It’s all bytes

All key and value data are stored as bytes, except the timestamp, which is a long

There are no built-in data types, but lexicoders help with common types

Key components are usually UTF-8 strings
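Because keys are compared as bytes, a number stored as text sorts badly ("10" comes before "9"). An order-preserving encoding avoids that. The sketch below shows one common approach for a signed long, using only the JDK. It illustrates the idea behind lexicoders and is not Accumulo's own implementation; `SortableLongSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Illustrative order-preserving encoding for longs; names are hypothetical.
final class SortableLongSketch {

    // Flip the sign bit so negatives sort before positives,
    // then write the result as 8 big-endian bytes.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Reverse of encode: read 8 big-endian bytes and flip the sign bit back.
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the way a byte-oriented store orders keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        long[] samples = {-5L, 3L, 9L, 10L};
        for (int i = 0; i + 1 < samples.length; i++) {
            int cmp = compareBytes(encode(samples[i]), encode(samples[i + 1]));
            System.out.println(samples[i] + " vs " + samples[i + 1] + ": " + Integer.signum(cmp));
        }
    }
}
```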

Some rows for you

row       cf        cq        cv       ts          value
bhavanki  job       employer           2013-09-01  Cloudera
bhavanki  personal  beer               2013-09-15  Omission
bhavanki  personal  house     NOMUGGL  2014-01-25  Ravenclaw
brees     job       employer           2013-10-01  White Cliffs
brees     personal  house     NOMUGGL  2014-01-01  Hufflepuff

(cf = column family, cq = column qualifier, cv = column visibility, ts = timestamp)

Visibility Labels

Each label is a Boolean expression over authorization tokens

Specialist | (Management & SpecTraining)

Authorizations are provided in each scan
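To make the label semantics concrete, here is a minimal sketch of how an expression such as the one above could be checked against a scan's authorizations. It is a toy evaluator written against the JDK only, not Accumulo's visibility code, and `VisibilitySketch` is a hypothetical name. Like Accumulo, it rejects expressions that mix & and | without parentheses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy evaluator for visibility expressions; not the Accumulo implementation.
final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    // Returns true if the given authorizations satisfy the label expression.
    static boolean isVisible(String expr, String... auths) {
        VisibilitySketch p = new VisibilitySketch(expr, new HashSet<>(Arrays.asList(auths)));
        if (p.expr.isEmpty()) {
            return true; // an empty label is visible to everyone
        }
        boolean result = p.parseExpression();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    // expression := term (op term)*, where op is & or |, one kind per nesting level
    private boolean parseExpression() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm(); // always parse, even if the result is already known
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := '(' expression ')' | token
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpression();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && "&|()".indexOf(expr.charAt(pos)) < 0) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a token at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        String label = "Specialist | (Management & SpecTraining)";
        System.out.println(isVisible(label, "Management", "SpecTraining"));
        System.out.println(isVisible(label, "Management"));
    }
}
```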

Locality Groups

You can identify sets of one or more column families as locality groups

Data in a locality group is stored together for improved read performance

Tablets

A table is made up of one or more tablets

Table: employees
Tablets: employees;H, employees;S, employees;~
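Each tablet covers a contiguous range of rows ending at (and including) its end row, so finding the tablet for a row is a sorted lookup over the split points. The sketch below assumes split points H and S as in the diagram and labels the last tablet `~` as the slide does. `TabletLookupSketch` is a hypothetical name, and it compares Java strings rather than raw bytes.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustrative row-to-tablet lookup; not how Accumulo's client actually locates tablets.
final class TabletLookupSketch {
    private final String table;
    private final TreeSet<String> splits = new TreeSet<>();

    TabletLookupSketch(String table, String... splitPoints) {
        this.table = table;
        this.splits.addAll(Arrays.asList(splitPoints));
    }

    // A tablet holds rows greater than the previous split and up to its own end row.
    // Rows past the last split fall into the final tablet, shown as "~" on the slide.
    String tabletFor(String row) {
        String endRow = splits.ceiling(row);
        return table + ";" + (endRow == null ? "~" : endRow);
    }

    public static void main(String[] args) {
        TabletLookupSketch employees = new TabletLookupSketch("employees", "H", "S");
        // Comparison is case-sensitive, so these sample rows are capitalized like the splits.
        for (String row : new String[] {"Brees", "Havanki", "Young"}) {
            System.out.println(row + " -> " + employees.tabletFor(row));
        }
    }
}
```

Note that an all-lowercase row such as bhavanki sorts after both uppercase split points, so with these splits it would land in the last tablet.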

Tablets map to data files in HDFS

employees;H: rfile 1
employees;S: rfile 2
employees;~: rfile 3

Data is also kept in write-ahead logs and the memtable

employees;H: rfile 1, walogs, memtable
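The relationship between the three stores can be sketched roughly as follows: a write is logged first, then applied to the sorted in-memory map; a flush turns the memtable into a new sorted file; a read prefers the newest data. This is a heavily simplified model using JDK collections. `TabletWriteSketch` and its methods are hypothetical, and clearing the log on flush is a simplification of how real write-ahead logs are retired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified model of a tablet's write path; not Accumulo code.
final class TabletWriteSketch {
    private final List<String[]> walog = new ArrayList<>();                 // stand-in for the write-ahead log
    private TreeMap<String, String> memtable = new TreeMap<>();             // sorted in-memory map
    private final List<TreeMap<String, String>> files = new ArrayList<>();  // stand-ins for rfiles

    // Log first so the write can be replayed after a crash, then update the memtable.
    void write(String key, String value) {
        walog.add(new String[] {key, value});
        memtable.put(key, value);
    }

    // Flush: the current memtable becomes a new sorted file.
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        files.add(memtable);
        memtable = new TreeMap<>();
        walog.clear(); // simplification: flushed entries no longer need replaying
    }

    // Simulate losing the in-memory state, then rebuild it from the log.
    void crashAndRecover() {
        memtable = new TreeMap<>();
        for (String[] entry : walog) {
            memtable.put(entry[0], entry[1]);
        }
    }

    // Newest data wins: check the memtable, then files from newest to oldest.
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = files.size() - 1; i >= 0; i--) {
            String value = files.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int fileCount() {
        return files.size();
    }

    public static void main(String[] args) {
        TabletWriteSketch tablet = new TabletWriteSketch();
        tablet.write("bhavanki", "Cloudera");
        tablet.crashAndRecover();
        System.out.println(tablet.read("bhavanki"));
    }
}
```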

Loading and Querying

Java Client API

Read using scanners

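// conn is an existing Connector; employeeIds is a collection of Text row IDs
Scanner s = conn.createScanner("employees", new Authorizations());
s.setRange(new Range("alice", "eve"));        // rows "alice" through "eve", inclusive
s.fetchColumnFamily(new Text("personal"));    // only the "personal" column family
for (Entry<Key, Value> e : s)
    employeeIds.add(e.getKey().getRow());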

Read access via iterator pattern
• server-side system iterators handle timestamps, authorization checks, and lots more
• iterators almost always wrap other iterators, forming a chain
• you can define your own, client-side or server-side
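The chaining idea can be illustrated with ordinary `java.util.Iterator` objects: each stage wraps a source and passes along only what it accepts. The sketch below stacks a column-family filter under a simple visibility filter. It is not Accumulo's iterator interface (server-side iterators implement the richer `SortedKeyValueIterator`), and `FilterSketch`, the array layout of an entry, and the single-token visibility check are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical wrapper: passes through only the entries that match a predicate.
final class FilterSketch<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T pending;
    private boolean hasPending;

    FilterSketch(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
    }

    // Advance the wrapped iterator until an accepted entry is found or it runs out.
    @Override
    public boolean hasNext() {
        while (!hasPending && source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                pending = candidate;
                hasPending = true;
            }
        }
        return hasPending;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        hasPending = false;
        return pending;
    }

    // Entries are {row, family, qualifier, visibility, value}.
    // Chain: source -> keep "personal" family -> keep entries the caller may see.
    static List<String> personalValues(List<String[]> entries, Set<String> auths) {
        Iterator<String[]> byFamily =
            new FilterSketch<String[]>(entries.iterator(), e -> e[1].equals("personal"));
        Iterator<String[]> byVisibility =
            new FilterSketch<String[]>(byFamily, e -> e[3].isEmpty() || auths.contains(e[3]));
        List<String> values = new ArrayList<>();
        while (byVisibility.hasNext()) {
            values.add(byVisibility.next()[4]);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[] {"bhavanki", "job", "employer", "", "Cloudera"},
            new String[] {"bhavanki", "personal", "beer", "", "Omission"},
            new String[] {"bhavanki", "personal", "house", "NOMUGGL", "Ravenclaw"});
        System.out.println(personalValues(entries, Collections.<String>emptySet()));
    }
}
```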

Scanners fetch sorted rows from one range
Batch scanners fetch unsorted rows from multiple ranges in parallel
Isolated scanners ensure that you do not see a row mid-change

MapReduce

AccumuloInputFormat
AccumuloOutputFormat

AccumuloRowInputFormat
AccumuloRowOutputFormat

Shell

Command-line / manual access to Accumulo data
• scan, insert, delete
• iterator management
• table management (creation, deletion, cloning)
• user and authorization management
• table splitting and merging
• ... more

Bulk Import

Got lots of data to import quickly?
• Use an MR job to format data using AccumuloFileOutputFormat
• Import the files using the shell

Trade off latency / availability for throughput

Daemons

Tablet Server

Serves tablets (table data)
• writes data to walog, memtable; deals with compaction
• serves data for reads from files, memtable
• handles recovery from walogs in case of server failure

Most client calls go to tablet servers

Master

• assigns tablets to tablet servers
• detects tablet server failures and reassigns tablets
• balances tablet assignments over time
• coordinates table operations

Multiple masters are supported for failover, but only one is active

Everybody Else in Accumulo

• Garbage Collector (GC) - identifies and deletes files in HDFS that are no longer needed
• Tracer - listens for and stores distributed trace messages using a special table

• Monitor - collects and serves status information
  • server status
  • log inspection
  • performance data
  • table inspection

Everybody Else outside Accumulo

• HDFS (as part of Apache Hadoop)
  • stores tablet files
  • stores write-ahead logs (1.5+)
• MapReduce (Hadoop)
  • bulk import
  • batch processing
• Apache ZooKeeper

Getting Started, a.k.a. the Pitch

Easy as 1-2-3?

1. Install Hadoop (HDFS and MapReduce)
2. Install ZooKeeper
3. Install Accumulo!

Making Steps 1 and 2 Easier

Use a complete, pre-packaged Hadoop distribution... like CDH!

a leading commercial distribution centered on Apache Hadoop

• many ecosystem components
• configured / updated to work together

Cloudera Manager
• deployment
• configuration
• operation
• security

Making Step 3 Easier

Standard Apache Accumulo installation is via tarball
• no longer shipping RPM / DEB / ...

Using CDH/CM you can use:
• a tarball, RPM or DEB with Accumulo packaged for CDH
• a parcel (like RPM / ZIP) for easier upgrades

• 1.4.4 and 1.4.5 available now
• 1.6.0 soon

Where to Go for More

• http://accumulo.apache.org/
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/accumulo.html

Accumulo Summit

Join us on June 12

Quick Thanks

• My slide reviewers
  • Sean Busbey
  • Mike Drob
• Accumulo community
• You all for listening

Thank you!
Bill Havanki
bhavanki@clouderagovt.com
