
Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]


Page 1: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Mike Walch

Using Fluo to incrementally process data in Accumulo

Page 2: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Problem: Maintain counts of inbound links

Example Graph (nodes): fluo.io, github.com, apache.org, nytimes.com

Example Data

Website        # Inbound Links
fluo.io        0
github.com     3
apache.org     2
nytimes.com    0

Page 3: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Solution 1 - Maintain counts using batch processing

Link count change log

Website        # Inbound
fluo.io        +1
github.com     -1
apache.org     +1
github.com     -1
nytimes.com    +1
apache.org     +1

Last Hour Aggregates

Website        # Inbound
fluo.io        +1
github.com     -23
apache.org     +65
nytimes.com    +105

Historical

Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000

Latest Counts

Website        # Inbound
fluo.io        54
github.com     1,385,169
apache.org     2,528,255
nytimes.com    53,395,105

Diagram components: Internet, WebCrawler, WebCache, MapReduce (two jobs)
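The two batch steps on this slide (summing the change log, then adding the sums to the historical counts) can be sketched in plain Java. This is only an illustration of the arithmetic, not MapReduce or Fluo code; the class and method names (`LinkCountBatch`, `Change`, `aggregate`, `applyAggregates`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the batch approach; all names here are illustrative assumptions. */
final class LinkCountBatch {

    /** One entry of the link count change log: a website and a +1 or -1 delta. */
    static final class Change {
        final String website;
        final long delta;

        Change(String website, long delta) {
            this.website = website;
            this.delta = delta;
        }
    }

    /** Step 1: sum the change log per website (the "last hour aggregates"). */
    static Map<String, Long> aggregate(List<Change> changeLog) {
        Map<String, Long> aggregates = new LinkedHashMap<>();
        for (Change change : changeLog) {
            aggregates.merge(change.website, change.delta, Long::sum);
        }
        return aggregates;
    }

    /** Step 2: add the aggregates to the historical counts to produce the latest counts. */
    static Map<String, Long> applyAggregates(Map<String, Long> historical,
                                             Map<String, Long> aggregates) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        aggregates.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }
}
```

With the slide's numbers, applying the aggregates (+1, -23, +65, +105) to the historical counts (53; 1,385,192; 2,528,190; 53,395,000) yields the latest counts shown above (54; 1,385,169; 2,528,255; 53,395,105).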

Page 4: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Solution 2 - Maintain counts using Fluo

Fluo Table

Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000

Diagram components: Internet, WebCrawler, WebCache; +1 / -1 updates flow from the WebCrawler into the Fluo table
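In contrast to the batch approach, the incremental approach applies each +1 or -1 to the stored count as soon as the crawler reports it. The following is a minimal in-memory sketch of that idea only; it does not use the Fluo API, and `IncrementalLinkCounts` and its methods are hypothetical names. In Fluo each update would run inside a transaction.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a table of inbound link counts; names are illustrative assumptions. */
final class IncrementalLinkCounts {
    private final Map<String, Long> inbound = new HashMap<>();

    /** Seed the table with existing counts, for example from a batch job. */
    IncrementalLinkCounts(Map<String, Long> seed) {
        inbound.putAll(seed);
    }

    /** The crawler found a new link pointing at the site: apply +1 immediately. */
    void linkAdded(String site) {
        inbound.merge(site, 1L, Long::sum);
    }

    /** The crawler found that a link to the site was removed: apply -1 immediately. */
    void linkRemoved(String site) {
        inbound.merge(site, -1L, Long::sum);
    }

    long count(String site) {
        return inbound.getOrDefault(site, 0L);
    }
}
```

There is no change log and no hourly job here: the count is current after every update, at the cost of one small write per discovered link.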

Page 5: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo

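[Chart: # Inbound Links across the website distribution, with nytimes.com, github.com and fluo.io marked. The popular sites are labeled "Update every hour using MapReduce"; the long tail is labeled "Update in real-time using Fluo".]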

Page 6: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo 101 - Basics

- Provides cross-row transactions and snapshot isolation, which make it safe to do concurrent updates

- Allows for incremental processing of data

- Based on Google’s Percolator paper

- Started as a side project by Keith Turner in 2013

- Originally called Accismus

- Tested using synthetic workloads

- Almost ready for production environments

Page 7: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo 101 - Accumulo vs Fluo

- Fluo is a transactional API built on top of Accumulo

  - Fluo stores its data in Accumulo

  - Fluo uses Accumulo conditional mutations for transactions

  - Fluo has a table structure (row, column, value) similar to Accumulo except Fluo has no timestamp

- Each Fluo application runs its own processes

  - Oracle allocates timestamps for transactions

  - Workers run user code (called observers) that perform transactions

Page 8: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo 101 - Architecture

[Architecture diagram. Components shown: a Client Cluster with Fluo clients for App 1 and App 2; Fluo Application 1 and Fluo Application 2, each with a Fluo Oracle and Fluo Workers running observers (Observer1, Observer2, ObserverA); Accumulo (Table1, Table2); HDFS; Zookeeper; YARN.]

Page 9: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo 101 - Client API

Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)

public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the content if nothing is stored for this document yet
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}

Page 10: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo 101 - Observers

- Developers can write observers, which are triggered when a column is modified and are run by Fluo workers.

- Best practice: Do work/transactions in observers rather than in client code

public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}

Page 11: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Example Fluo Application

- Problem: Maintain word & document counts in real time as documents are added to and deleted from Fluo

- Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion

- Which triggers two observers:
  - Add Observer - increase word and document counts
  - Delete Observer - decrease counts and clean up

Page 12: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Add first document to table

Fluo Table

Row            Column   Value
d : doc1       doc      my first hello world

Fluo Client

Client Cluster

AddObserver

DeleteObserver

Page 13: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

An observer increments word counts

Fluo Table

Row            Column   Value
d : doc1       doc      my first hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : world      cnt      1
total : docs   cnt      1

Fluo Client

Client Cluster

AddObserver

DeleteObserver

Page 14: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

A second document is added

Fluo Table

Row            Column   Value
d : doc1       doc      my first hello world
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : doc    cnt      2

Fluo Client

Client Cluster

AddObserver

DeleteObserver

Page 15: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

First document is marked for deletion

Fluo Table

Row            Column   Value
d : doc1       doc      my first hello world
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : doc    cnt      2

Fluo Client

Client Cluster

AddObserver

DeleteObserver

Page 16: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Observer decrements counts and deletes document

Fluo Table

Row            Column   Value
d : doc1       doc      my first hello world
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : second     cnt      1
w : world      cnt      1
total : doc    cnt      1

Fluo Client

Client Cluster

AddObserver

DeleteObserver
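The walk-through on the preceding slides can be approximated with a small in-memory simulation. This is a sketch only: it does not use the Fluo API, it runs the observers synchronously instead of on workers, and it assumes that "clean up" means removing the document rows and any word whose count reaches zero. The class name `WordCountSketch` and the "row/column" key layout are illustrative assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy simulation of the add and delete observers; not Fluo code. */
final class WordCountSketch {
    // Stand-in for the Fluo table, keyed by "row/column" (an assumed layout).
    final Map<String, String> table = new TreeMap<>();

    /** Client action 1: add the document, then run the add observer as a worker would. */
    void addDocument(String docId, String content) {
        table.put("d:" + docId + "/doc", content);
        addObserver(docId);
    }

    /** Client action 2: mark the document for deletion, then run the delete observer. */
    void markForDeletion(String docId) {
        table.put("d:" + docId + "/delete", "");
        deleteObserver(docId);
    }

    private void addObserver(String docId) {
        for (String word : table.get("d:" + docId + "/doc").split("\\s+")) {
            adjust("w:" + word, +1);
        }
        adjust("total:docs", +1);
    }

    private void deleteObserver(String docId) {
        String content = table.remove("d:" + docId + "/doc");
        table.remove("d:" + docId + "/delete");
        if (content == null) {
            return; // nothing was stored for this document
        }
        for (String word : content.split("\\s+")) {
            adjust("w:" + word, -1);
        }
        adjust("total:docs", -1);
    }

    /** Adds delta to a count row; rows that reach zero are removed (assumed clean-up rule). */
    private void adjust(String row, int delta) {
        int updated = count(row) + delta;
        if (updated == 0) {
            table.remove(row + "/cnt");
        } else {
            table.put(row + "/cnt", Integer.toString(updated));
        }
    }

    int count(String row) {
        String value = table.get(row + "/cnt");
        return value == null ? 0 : Integer.parseInt(value);
    }
}
```

Adding "my first hello world" as doc1 and "second hello world" as doc2 gives the counts from the third table (hello and world at 2, total at 2). Marking doc1 for deletion then brings hello, world and the document total back to 1 under the assumptions above.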

Page 17: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Things to watch out for...

- Collisions occur when two transactions update the same data at the same time

  - Only one transaction will succeed. Others need to be retried.

  - Some collisions are OK, but too many can slow computation

  - Avoid collisions by not updating the same row/column on every transaction

- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing the other's update

  - Result is different than if the transactions were serialized

  - Prevent write skew by making both transactions update the same row/column. If concurrent, a collision will occur and only one transaction will succeed.
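Both effects can be reproduced with a toy store that gives each transaction a snapshot at start and rejects a commit if any cell it wrote was committed after the transaction began. This is a simplified sketch, not how Fluo is implemented: `MiniStore`, `Tx`, the single-threaded counter standing in for oracle timestamps, and the "guard" cell are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy store with snapshot reads and first-committer-wins commits. Not Fluo code. */
final class MiniStore {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Long> lastCommitTs = new HashMap<>();
    private long clock = 0; // stands in for timestamps handed out by an oracle

    /** A transaction that reads from the snapshot taken when it began. */
    final class Tx {
        private final long startTs = ++clock;
        private final Map<String, Integer> snapshot = new HashMap<>(values);
        private final Map<String, Integer> writes = new HashMap<>();

        int get(String key) { return snapshot.getOrDefault(key, 0); }

        void set(String key, int value) { writes.put(key, value); }

        /** Returns false on a collision: a written key was committed after this transaction began. */
        boolean commit() {
            for (String key : writes.keySet()) {
                if (lastCommitTs.getOrDefault(key, 0L) > startTs) {
                    return false;
                }
            }
            long commitTs = ++clock;
            for (Map.Entry<String, Integer> e : writes.entrySet()) {
                values.put(e.getKey(), e.getValue());
                lastCommitTs.put(e.getKey(), commitTs);
            }
            return true;
        }
    }

    Tx begin() { return new Tx(); }

    int read(String key) { return values.getOrDefault(key, 0); }

    /** Two concurrent transactions each read x and y, then clear one of them. Returns x + y. */
    static int writeSkewDemo(boolean touchSharedCell) {
        MiniStore store = new MiniStore();
        Tx init = store.begin();
        init.set("x", 1);
        init.set("y", 1);
        init.commit();

        Tx t1 = store.begin();
        Tx t2 = store.begin();
        // Each transaction sees x + y == 2 and assumes it may clear one cell
        // while keeping the invariant x + y >= 1.
        if (t1.get("x") + t1.get("y") == 2) t1.set("x", 0);
        if (t2.get("x") + t2.get("y") == 2) t2.set("y", 0);
        if (touchSharedCell) {
            // Hypothetical shared cell that forces the two transactions to collide.
            t1.set("guard", 1);
            t2.set("guard", 1);
        }
        t1.commit();
        t2.commit(); // fails only when both transactions wrote the shared cell
        return store.read("x") + store.read("y");
    }
}
```

Without the shared cell, both commits succeed because the writes are disjoint, and the invariant is broken (x + y ends at 0). With the shared cell, the second commit collides and fails, so x + y stays at 1; a real application would then retry the failed transaction against fresh data.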

Page 18: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

How does Fluo fit in?

[Chart: large join throughput (lower to higher) versus processing latency (slower to faster)]

Batch Processing: MapReduce, Spark

Incremental Processing: Fluo, Percolator

Stream Processing: Storm

Page 19: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Don’t use Fluo if...

1. You want to do ad-hoc analysis on your data (use batch processing instead)

2. Your incoming data is being joined with a small data set (use stream processing instead)

Page 20: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Use Fluo if...

1. You want to maintain a large-scale computation using a series of small transactional updates

2. Periodic batch processing jobs are taking too long to join new data with existing data

Page 21: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo Application Lifecycle

1. Use batch processing to seed computation with historical data

2. Use Fluo to process incoming data and maintain computation in real-time

3. While Fluo is processing, it can be queried and notifications can be sent to users

Page 22: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Major Progress

2010 2013 2014 2015

Google releases Percolator paper

Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)

Fluo can process transactions

1.0.0-alpha released

Oracle and worker can be run in YARN

Changed project name to Fluo

1.0.0-beta releasing soon

Solidified Fluo Client/Observer API

Automated running Fluo cluster on Amazon EC2

Multi-application support

Improved how observer notifications are found

Created Stress Test

Page 23: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo Stress Test

- Motivation: Needed a test that stresses Fluo and is easy to verify for correctness

- The stress test computes the number of unique integers by building a bitwise trie

- New integers are added at leaf nodes

- Observers watch all nodes, create parents, and percolate total up to root node

- Test runs successfully if the count at the root is the same as the number of leaf nodes

- Multiple transactions can operate on same nodes causing collisions

[Trie diagram: root xxxx = 5; children 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1; leaf labels shown: 1110, 1100, 0101, 0001, 1110]
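The counting idea behind the stress test can be sketched without Fluo. The version below is a simplification under stated assumptions: it uses 4-bit integers and 2 bits per trie level to mirror the slide's node labels, and it percolates each new leaf to the root synchronously, whereas the real test does this with observers and transactions. The class name `TrieCountSketch` and the sample integers are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts unique integers by percolating new leaves up a bitwise trie. Not Fluo code. */
final class TrieCountSketch {
    static final int BITS = 4;           // integer width, matching the slide's 4-bit labels
    static final int BITS_PER_LEVEL = 2; // assumed fan-out: 2 bits per trie level

    // Node label (for example "11xx" or "1110") -> number of unique leaves beneath it.
    final Map<String, Integer> counts = new HashMap<>();

    /** Adds an integer; a duplicate leaves every count unchanged. */
    void add(int value) {
        String leaf = toBits(value);
        if (counts.putIfAbsent(leaf, 1) != null) {
            return; // leaf already exists, nothing to percolate
        }
        // Walk from the leaf's parent up to the root, adding one at each ancestor.
        for (int keep = BITS - BITS_PER_LEVEL; keep >= 0; keep -= BITS_PER_LEVEL) {
            String node = leaf.substring(0, keep) + "x".repeat(BITS - keep);
            counts.merge(node, 1, Integer::sum);
        }
    }

    /** Zero-padded binary form of the low BITS bits of value. */
    static String toBits(int value) {
        String bits = Integer.toBinaryString(value & ((1 << BITS) - 1));
        return "0".repeat(BITS - bits.length()) + bits;
    }

    int count(String node) {
        return counts.getOrDefault(node, 0);
    }
}
```

As a usage example with integers chosen for illustration, adding 1110, 1100, 1111, 0101, 0001 and then 1110 again gives a root count of 5, with 3 under 11xx, 0 under 10xx and 1 each under 01xx and 00xx. The root count equals the number of distinct leaves, which is the property the test verifies.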

Page 24: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Easy to run Fluo

1. On machine with Maven+Git, clone the fluo-dev and fluo repos

2. Follow some basic configuration steps

3. Run the following commands

fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds Fluo distribution and deploys it locally
fluo new myapp      # Creates configuration for the 'myapp' Fluo application
fluo init myapp     # Initializes 'myapp' in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
fluo scan myapp     # Prints a snapshot of the data in the Fluo table of 'myapp'

It’s just as easy to run a Fluo cluster on Amazon EC2

Page 25: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Fluo Ecosystem

- fluo - Main project repo

- fluo-quickstart - Simple Fluo example

- fluo-stress - Stresses Fluo on cluster

- fluo-io.github.io - Fluo project website

- phrasecount - In-depth Fluo example

- fluo-deploy - Run Fluo on EC2 cluster

- fluo-dev - Helps developers run Fluo locally

Page 26: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Future Direction

- Primary focus: Release a production-ready 1.0 with a stable API

- Other possible work:

  - Fluo-32: Real world example application

    - Possibly using CommonCrawl data

  - Fluo-58: Support writing observers in Python

  - Fluo-290: Support running Fluo on Mesos

  - Fluo-478: Automatically scale up & down Fluo workers based on workload

Page 27: Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Get involved!

1. Experiment with Fluo
  - API has stabilized
  - Tools and development process make it easy
  - Not recommended for production yet (wait for 1.0)

2. Contribute to Fluo
  - ~85 open issues on GitHub
  - Review-then-commit process