Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]


Mike Walch

Using Fluo to incrementally process data in Accumulo

Problem: Maintain counts of inbound links

Example Data:

  Website       # Inbound Links
  fluo.io       0
  github.com    3
  apache.org    2
  nytimes.com   0

Example Graph: (diagram of links among fluo.io, github.com, apache.org, and nytimes.com)

Solution 1 - Maintain counts using batch processing

Link count change log:

  Website       # Inbound
  fluo.io       +1
  github.com    -1
  apache.org    +1
  github.com    -1
  nytimes.com   +1
  apache.org    +1

Last Hour Aggregates:

  Website       # Inbound
  fluo.io       +1
  github.com    -23
  apache.org    +65
  nytimes.com   +105

Historical:

  Website       # Inbound
  fluo.io       53
  github.com    1,385,192
  apache.org    2,528,190
  nytimes.com   53,395,000

Latest Counts:

  Website       # Inbound
  fluo.io       54
  github.com    1,385,169
  apache.org    2,528,255
  nytimes.com   53,395,105

(Diagram: Internet -> WebCrawler -> WebCache produces the link count change log; one MapReduce job rolls the change log up into the last-hour aggregates, and a second MapReduce job merges the aggregates with the historical counts to produce the latest counts.)

Solution 2 - Maintain counts using Fluo

Fluo Table:

  Website       # Inbound
  fluo.io       53
  github.com    1,385,192
  apache.org    2,528,190
  nytimes.com   53,395,000

(Diagram: Internet -> WebCrawler -> WebCache, which applies +1/-1 updates directly to the Fluo table as links change.)

Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo

(Chart: distribution of websites by # inbound links, from nytimes.com and github.com at the head of the curve down to fluo.io in the long tail. Popular sites are updated every hour using MapReduce; the long tail is updated in real time using Fluo.)

Fluo 101 - Basics

- Provides cross-row transactions and snapshot isolation, which make it safe to perform concurrent updates (see the sketch after this list)

- Allows for incremental processing of data

- Based on Google’s Percolator paper

- Started as a side project by Keith Turner in 2013

- Originally called Accismus

- Tested using synthetic workloads

- Almost ready for production environments
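For example, a single Fluo transaction can read and update multiple rows atomically. The sketch below is my own illustration rather than code from the talk; it assumes the io.fluo client API used later in the deck, string-encoded integer counts, and a caller-supplied count column.

import io.fluo.api.client.FluoClient;
import io.fluo.api.data.Column;
import io.fluo.api.types.StringEncoder;
import io.fluo.api.types.TypeLayer;
import io.fluo.api.types.TypedTransaction;

public class CrossRowExample {

  private static final TypeLayer TYPE_LAYER = new TypeLayer(new StringEncoder());

  // Atomically move one inbound link from one page's count to another's.
  // Both reads see the same snapshot, and both writes commit together or not at all.
  public static void moveLink(FluoClient client, String fromPage, String toPage,
      Column countCol) {
    try (TypedTransaction tx = TYPE_LAYER.wrap(client.newTransaction())) {
      String fromVal = tx.get().row(fromPage).col(countCol).toString();
      String toVal = tx.get().row(toPage).col(countCol).toString();
      int from = (fromVal == null) ? 0 : Integer.parseInt(fromVal);
      int to = (toVal == null) ? 0 : Integer.parseInt(toVal);

      tx.mutate().row(fromPage).col(countCol).set(Integer.toString(from - 1));
      tx.mutate().row(toPage).col(countCol).set(Integer.toString(to + 1));

      // If a concurrent transaction wrote the same rows, this commit fails
      // and the caller can simply retry.
      tx.commit();
    }
  }
}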

Fluo 101 - Accumulo vs Fluo

- Fluo is a transactional API built on top of Accumulo

- Fluo stores its data in Accumulo

- Fluo uses Accumulo conditional mutations for transactions

- Fluo has a table structure (row, column, value) similar to Accumulo except Fluo has no timestamp

- Each Fluo application runs its own processes:
  - The oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions

Fluo 101 - Architecture

(Architecture diagram: Fluo runs alongside Accumulo, HDFS, Zookeeper, and YARN. Each Fluo application (Application 1, Application 2) has its own Fluo oracle and Fluo workers running observers in YARN, and stores its data in its own Accumulo table (Table1, Table2). Fluo clients for each application run on a separate client cluster.)

Fluo 101 - Client API

Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)

public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the document if it is not already in the table
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
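As a usage sketch (my own addition, not from the talk), the snippet below creates a FluoClient from a properties file and calls the method above; the file name, document id, content, and the CONTENT_COL definition are made-up placeholders, and the import paths assume the io.fluo beta package layout.

import java.io.File;

import io.fluo.api.client.FluoClient;
import io.fluo.api.client.FluoFactory;
import io.fluo.api.config.FluoConfiguration;
import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.types.StringEncoder;
import io.fluo.api.types.TypeLayer;
import io.fluo.api.types.TypedTransaction;

public class IngestExample {

  // Hypothetical content column; the talk does not show how CONTENT_COL is defined
  private static final Column CONTENT_COL = new Column(Bytes.of("d"), Bytes.of("doc"));

  // Same logic as the addDocument() method on the slide, made static here
  public static void addDocument(FluoClient fluoClient, String docId, String content) {
    TypeLayer typeLayer = new TypeLayer(new StringEncoder());
    try (TypedTransaction tx = typeLayer.wrap(fluoClient.newTransaction())) {
      if (tx.get().row(docId).col(CONTENT_COL).toString() == null) {
        tx.mutate().row(docId).col(CONTENT_COL).set(content);
        tx.commit();
      }
    }
  }

  public static void main(String[] args) {
    // Connection settings (Zookeeper, application name, etc.) come from the
    // properties file created when the Fluo application was configured
    FluoConfiguration config = new FluoConfiguration(new File("fluo.properties"));

    try (FluoClient client = FluoFactory.newClient(config)) {
      addDocument(client, "doc1", "my first hello world");
    }
  }
}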

Fluo 101 - Observers

- Developers can write observers that are triggered when a column is modified and are run by Fluo workers.

- Best practice: do work/transactions in observers rather than in client code

public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}

Example Fluo Application

- Problem: Maintain word & document counts as documents are added to and deleted from Fluo in real time

- The Fluo client performs two actions:
  1. Add a document to the table
  2. Mark a document for deletion

- These actions trigger two observers (a sketch of the Add Observer follows this list):
  - Add Observer - increases word and document counts
  - Delete Observer - decreases counts and cleans up
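Here is a rough sketch of what the Add Observer might look like. It is my own illustration, not the talk's code: it follows the row layout from the walkthrough slides ("w : <word>" rows and a "total : docs" row) and the TypedObserver API shown earlier; the CONTENT_COL/COUNT_COL definitions and import paths are assumptions.

import java.util.Arrays;
import java.util.HashSet;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.observer.Observer.NotificationType;
import io.fluo.api.observer.Observer.ObservedColumn;
import io.fluo.api.types.TypedObserver;
import io.fluo.api.types.TypedTransactionBase;

public class AddObserver extends TypedObserver {

  // Hypothetical columns, defined the same way CONTENT_COL is used in the earlier slides
  private static final Column CONTENT_COL = new Column(Bytes.of("d"), Bytes.of("doc"));
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("cnt"));

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column col) {
    // The notification fired because a document's content column was set
    String content = tx.get().row(row).col(CONTENT_COL).toString();
    if (content == null) {
      return;
    }

    // Increment the count of every unique word in the document
    for (String word : new HashSet<>(Arrays.asList(content.split("\\s+")))) {
      incrementBy(tx, "w : " + word, 1);
    }

    // Increment the total document count
    incrementBy(tx, "total : docs", 1);
  }

  private void incrementBy(TypedTransactionBase tx, String countRow, int delta) {
    String current = tx.get().row(countRow).col(COUNT_COL).toString();
    int count = (current == null) ? 0 : Integer.parseInt(current);
    tx.mutate().row(countRow).col(COUNT_COL).set(Integer.toString(count + delta));
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}

The Delete Observer would do the reverse: decrement each word count, decrement the document total, and clean up the document's rows.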

Add first document to table

Fluo Table:

  Row        Column   Value
  d : doc1   doc      my first hello world

(Diagram: a Fluo Client on the Client Cluster writes the row; the AddObserver and DeleteObserver run in Fluo workers.)

An observer increments word counts

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  w : first      cnt      1
  w : hello      cnt      1
  w : my         cnt      1
  w : world      cnt      1
  total : docs   cnt      1

A second document is added

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      2
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      2
  total : docs   cnt      2

First document is marked for deletion

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc1       delete
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      2
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      2
  total : docs   cnt      2

Observer decrements counts and deletes document

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc1       delete
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      1
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      1
  total : docs   cnt      1

Things to watch out for...

- Collisions occur when two transactions update the same data at the same time
  - Only one transaction will succeed; the others need to be retried
  - A few collisions are OK, but too many can slow computation
  - Avoid collisions by not updating the same row/column in every transaction (see the sketch after this list)

- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's changes
  - The result is different than if the transactions were serialized
  - Prevent write skew by making both transactions update the same row/column; if they run concurrently, a collision will occur and only one transaction will succeed
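As one way to apply the "avoid updating the same row/column" advice, the sketch below (my own, not from the talk) spreads a hot counter such as the document total across a fixed number of bucket rows; each writer picks a bucket at random, so concurrent increments rarely touch the same row. The column and row naming are hypothetical.

import java.util.concurrent.ThreadLocalRandom;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.types.TypedTransactionBase;

public class CounterBuckets {

  private static final int NUM_BUCKETS = 128;

  // Hypothetical count column
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("cnt"));

  // Each transaction updates one randomly chosen bucket row instead of the
  // single hot row, so concurrent increments rarely collide.
  public static void incrementTotal(TypedTransactionBase tx) {
    String bucketRow = "total : docs : " + ThreadLocalRandom.current().nextInt(NUM_BUCKETS);
    String current = tx.get().row(bucketRow).col(COUNT_COL).toString();
    int count = (current == null) ? 0 : Integer.parseInt(current);
    tx.mutate().row(bucketRow).col(COUNT_COL).set(Integer.toString(count + 1));
  }
}

A reader then sums all of the bucket rows in a single snapshot to get the total; the trade-off is a slightly more expensive read in exchange for far fewer collisions on writes.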

How does Fluo fit in?

(Chart: large join throughput on the vertical axis, lower to higher; processing latency on the horizontal axis, slower to faster. Batch processing (MapReduce, Spark) has the highest join throughput but the slowest latency; stream processing (Storm) has the fastest latency but lower join throughput; incremental processing (Fluo, Percolator) sits between the two.)

Don’t use Fluo if...

1. You want to do ad-hoc analysis on your data (use batch processing instead)

2. Your incoming data is being joined with a small data set (use stream processing instead)

Use Fluo if...

1. You want to maintain a large-scale computation using a series of small transactional updates

2. Periodic batch processing jobs are taking too long to join new data with existing data

Fluo Application Lifecycle

1. Use batch processing to seed computation with historical data

2. Use Fluo to process incoming data and maintain computation in real-time

3. While processing, Fluo can be queried and notifications can be sent to users (see the snapshot sketch below)
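For the query side of step 3, here is a minimal sketch (my own addition) of reading a value through a read-only snapshot with the FluoClient API; the row and column used are made up to match the word count example.

import io.fluo.api.client.FluoClient;
import io.fluo.api.client.Snapshot;
import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;

public class QueryExample {

  // Reads the current count for a word using a read-only snapshot; the read is
  // consistent even while observers are concurrently updating the table.
  public static long getWordCount(FluoClient client, String word, Column countCol) {
    try (Snapshot snap = client.newSnapshot()) {
      Bytes value = snap.get(Bytes.of("w : " + word), countCol);
      return (value == null) ? 0 : Long.parseLong(value.toString());
    }
  }
}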

Major Progress

2010
- Google releases the Percolator paper

2013
- Keith Turner starts work on a Percolator implementation for Accumulo as a side project (originally called Accismus)

2014
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and workers can be run in YARN
- Project name changed to Fluo

2015
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running a Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created stress test

Fluo Stress Test

- Motivation: needed a test that stresses Fluo and is easy to verify for correctness

- The stress test computes the number of unique integers by building a bitwise trie

- New integers are added at leaf nodes

- Observers watch all nodes, create parents, and percolate totals up to the root node

- The test runs successfully if the count at the root is the same as the number of leaf nodes

- Multiple transactions can operate on the same nodes, causing collisions (a simplified sketch of the percolation step follows the diagram below)

(Trie diagram: 4-bit integers such as 1110, 1100, 0101, and 0001 are stored at leaf nodes; intermediate nodes hold the count of unique integers under each prefix (11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1); the root holds the total, xxxx = 5.)
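To make the percolation step concrete, below is a heavily simplified sketch of the idea; it is my own illustration, not the actual fluo-stress code. It assumes each trie node is stored in a row such as "node:1110" or "node:11xx" with a hypothetical count column; when a node's count changes, the observer re-sums the parent's four children and writes the parent, which re-triggers the observer one level up until the root is reached.

import java.util.ArrayList;
import java.util.List;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.observer.Observer.NotificationType;
import io.fluo.api.observer.Observer.ObservedColumn;
import io.fluo.api.types.TypedObserver;
import io.fluo.api.types.TypedTransactionBase;

public class NodeObserver extends TypedObserver {

  // Hypothetical column holding each node's count of unique integers below it
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("count"));

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column col) {
    String node = row.toString();    // e.g. "node:1110" or "node:11xx"
    String parent = parentOf(node);  // e.g. "node:11xx" or "node:xxxx"
    if (parent == null) {
      return;  // the root has no parent; percolation stops here
    }

    // Re-sum the parent's four children and write the result; writing the
    // parent's count column notifies this observer again one level up.
    int sum = 0;
    for (String child : childrenOf(parent)) {
      String val = tx.get().row(child).col(COUNT_COL).toString();
      sum += (val == null) ? 0 : Integer.parseInt(val);
    }
    tx.mutate().row(parent).col(COUNT_COL).set(Integer.toString(sum));
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(COUNT_COL, NotificationType.STRONG);
  }

  // Mask two more bits to get the parent key: "node:1110" -> "node:11xx",
  // "node:11xx" -> "node:xxxx", root -> null
  private static String parentOf(String node) {
    String bits = node.substring("node:".length());
    if (bits.equals("xxxx")) {
      return null;
    }
    int firstX = bits.indexOf('x');
    int keep = (firstX == -1) ? bits.length() - 2 : firstX - 2;
    return "node:" + bits.substring(0, keep) + "xxxx".substring(keep);
  }

  // Fill in the parent's first two masked bits: "node:11xx" -> 1100, 1101, 1110, 1111
  private static List<String> childrenOf(String parent) {
    String bits = parent.substring("node:".length());
    int firstX = bits.indexOf('x');
    List<String> children = new ArrayList<>();
    for (String fill : new String[] {"00", "01", "10", "11"}) {
      children.add("node:" + bits.substring(0, firstX) + fill + bits.substring(firstX + 2));
    }
    return children;
  }
}

Because sibling nodes all rewrite the same parent row, concurrent transactions collide there, which is exactly the behavior the stress test is meant to exercise.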

Easy to run Fluo

1. On a machine with Maven and Git, clone the fluo-dev and fluo repos

2. Follow some basic configuration steps

3. Run the following commands

It’s just as easy to run a Fluo cluster on Amazon EC2

fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds Fluo distribution and deploys locally
fluo new myapp      # Creates configuration for ‘myapp’ Fluo application
fluo init myapp     # Initializes ‘myapp’ in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of ‘myapp’ in YARN
fluo scan myapp     # Prints snapshot of data in Fluo table of ‘myapp’

Fluo Ecosystem

- fluo - Main project repo
- fluo-quickstart - Simple Fluo example
- fluo-stress - Stresses Fluo on a cluster
- fluo-io.github.io - Fluo project website
- phrasecount - In-depth Fluo example
- fluo-deploy - Runs Fluo on an EC2 cluster
- fluo-dev - Helps developers run Fluo locally

Future Direction

- Primary focus: release a production-ready 1.0 with a stable API

- Other possible work:
  - Fluo-32: Real-world example application, possibly using CommonCrawl data
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale Fluo workers up and down based on workload

Get involved!

1. Experiment with Fluo
   - API has stabilized
   - Tools and development process make it easy
   - Not recommended for production yet (wait for 1.0)

2. Contribute to Fluo
   - ~85 open issues on GitHub
   - Review-then-commit process
