Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]


Mike Walch

Using Fluo to incrementally process data in Accumulo

Problem: Maintain counts of inbound links

Example Data:

  Website       # Inbound Links
  fluo.io       0
  github.com    3
  apache.org    2
  nytimes.com   0

Example Graph: (diagram of links among fluo.io, github.com, apache.org, and nytimes.com)

Solution 1 - Maintain counts using batch processing

Link count change log:

  Website       # Inbound
  fluo.io       +1
  github.com    -1
  apache.org    +1
  github.com    -1
  nytimes.com   +1
  apache.org    +1

Last Hour Aggregates:

  Website       # Inbound
  fluo.io       +1
  github.com    -23
  apache.org    +65
  nytimes.com   +105

Historical:

  Website       # Inbound
  fluo.io       53
  github.com    1,385,192
  apache.org    2,528,190
  nytimes.com   53,395,000

Latest Counts:

  Website       # Inbound
  fluo.io       54
  github.com    1,385,169
  apache.org    2,528,255
  nytimes.com   53,395,105

(Diagram: Internet -> WebCrawler -> WebCache produces the link count change log; one MapReduce job rolls the change log up into the last-hour aggregates, and a second MapReduce job merges the aggregates with the historical counts to produce the latest counts.)

Solution 2 - Maintain counts using Fluo

Fluo Table:

  Website       # Inbound
  fluo.io       53
  github.com    1,385,192
  apache.org    2,528,190
  nytimes.com   53,395,000

(Diagram: Internet -> WebCrawler -> WebCache, which applies +1/-1 updates directly to the Fluo table as links change.)

Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo

(Chart: distribution of websites by # inbound links, from nytimes.com and github.com at the head of the curve down to fluo.io in the long tail. Popular sites are updated every hour using MapReduce; the long tail is updated in real time using Fluo.)

Fluo 101 - Basics

- Provides cross-row transactions and snapshot isolation, which make it safe to perform concurrent updates (see the sketch after this list)

- Allows for incremental processing of data

- Based on Google’s Percolator paper

- Started as a side project by Keith Turner in 2013

- Originally called Accismus

- Tested using synthetic workloads

- Almost ready for production environments
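For example, a single Fluo transaction can read and update multiple rows atomically. The sketch below is my own illustration rather than code from the talk; it assumes the io.fluo client API used later in the deck, string-encoded integer counts, and a caller-supplied count column.

import io.fluo.api.client.FluoClient;
import io.fluo.api.data.Column;
import io.fluo.api.types.StringEncoder;
import io.fluo.api.types.TypeLayer;
import io.fluo.api.types.TypedTransaction;

public class CrossRowExample {

  private static final TypeLayer TYPE_LAYER = new TypeLayer(new StringEncoder());

  // Atomically move one inbound link from one page's count to another's.
  // Both reads see the same snapshot, and both writes commit together or not at all.
  public static void moveLink(FluoClient client, String fromPage, String toPage,
      Column countCol) {
    try (TypedTransaction tx = TYPE_LAYER.wrap(client.newTransaction())) {
      String fromVal = tx.get().row(fromPage).col(countCol).toString();
      String toVal = tx.get().row(toPage).col(countCol).toString();
      int from = (fromVal == null) ? 0 : Integer.parseInt(fromVal);
      int to = (toVal == null) ? 0 : Integer.parseInt(toVal);

      tx.mutate().row(fromPage).col(countCol).set(Integer.toString(from - 1));
      tx.mutate().row(toPage).col(countCol).set(Integer.toString(to + 1));

      // If a concurrent transaction wrote the same rows, this commit fails
      // and the caller can simply retry.
      tx.commit();
    }
  }
}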

Fluo 101 - Accumulo vs Fluo

- Fluo is a transactional API built on top of Accumulo

- Fluo stores its data in Accumulo

- Fluo uses Accumulo conditional mutations for transactions

- Fluo has a table structure (row, column, value) similar to Accumulo except Fluo has no timestamp

- Each Fluo application runs its own processes:
  - The oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions

Fluo 101 - Architecture

(Architecture diagram: Fluo runs alongside Accumulo, HDFS, Zookeeper, and YARN. Each Fluo application (Application 1, Application 2) has its own Fluo oracle and Fluo workers running observers in YARN, and stores its data in its own Accumulo table (Table1, Table2). Fluo clients for each application run on a separate client cluster.)

Fluo 101 - Client API

Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)

public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the document if it is not already in the table
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
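As a usage sketch (my own addition, not from the talk), the snippet below creates a FluoClient from a properties file and calls the method above; the file name, document id, content, and the CONTENT_COL definition are made-up placeholders, and the import paths assume the io.fluo beta package layout.

import java.io.File;

import io.fluo.api.client.FluoClient;
import io.fluo.api.client.FluoFactory;
import io.fluo.api.config.FluoConfiguration;
import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.types.StringEncoder;
import io.fluo.api.types.TypeLayer;
import io.fluo.api.types.TypedTransaction;

public class IngestExample {

  // Hypothetical content column; the talk does not show how CONTENT_COL is defined
  private static final Column CONTENT_COL = new Column(Bytes.of("d"), Bytes.of("doc"));

  // Same logic as the addDocument() method on the slide, made static here
  public static void addDocument(FluoClient fluoClient, String docId, String content) {
    TypeLayer typeLayer = new TypeLayer(new StringEncoder());
    try (TypedTransaction tx = typeLayer.wrap(fluoClient.newTransaction())) {
      if (tx.get().row(docId).col(CONTENT_COL).toString() == null) {
        tx.mutate().row(docId).col(CONTENT_COL).set(content);
        tx.commit();
      }
    }
  }

  public static void main(String[] args) {
    // Connection settings (Zookeeper, application name, etc.) come from the
    // properties file created when the Fluo application was configured
    FluoConfiguration config = new FluoConfiguration(new File("fluo.properties"));

    try (FluoClient client = FluoFactory.newClient(config)) {
      addDocument(client, "doc1", "my first hello world");
    }
  }
}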

Fluo 101 - Observers

- Developers can write observers that are triggered when a column is modified and are run by Fluo workers.

- Best practice: do work/transactions in observers rather than in client code

public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}

Example Fluo Application

- Problem: Maintain word & document counts as documents are added to and deleted from Fluo in real time

- The Fluo client performs two actions:
  1. Add a document to the table
  2. Mark a document for deletion

- These actions trigger two observers (a sketch of the Add Observer follows this list):
  - Add Observer - increases word and document counts
  - Delete Observer - decreases counts and cleans up
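Here is a rough sketch of what the Add Observer might look like. It is my own illustration, not the talk's code: it follows the row layout from the walkthrough slides ("w : <word>" rows and a "total : docs" row) and the TypedObserver API shown earlier; the CONTENT_COL/COUNT_COL definitions and import paths are assumptions.

import java.util.Arrays;
import java.util.HashSet;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.observer.Observer.NotificationType;
import io.fluo.api.observer.Observer.ObservedColumn;
import io.fluo.api.types.TypedObserver;
import io.fluo.api.types.TypedTransactionBase;

public class AddObserver extends TypedObserver {

  // Hypothetical columns, defined the same way CONTENT_COL is used in the earlier slides
  private static final Column CONTENT_COL = new Column(Bytes.of("d"), Bytes.of("doc"));
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("cnt"));

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column col) {
    // The notification fired because a document's content column was set
    String content = tx.get().row(row).col(CONTENT_COL).toString();
    if (content == null) {
      return;
    }

    // Increment the count of every unique word in the document
    for (String word : new HashSet<>(Arrays.asList(content.split("\\s+")))) {
      incrementBy(tx, "w : " + word, 1);
    }

    // Increment the total document count
    incrementBy(tx, "total : docs", 1);
  }

  private void incrementBy(TypedTransactionBase tx, String countRow, int delta) {
    String current = tx.get().row(countRow).col(COUNT_COL).toString();
    int count = (current == null) ? 0 : Integer.parseInt(current);
    tx.mutate().row(countRow).col(COUNT_COL).set(Integer.toString(count + delta));
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}

The Delete Observer would do the reverse: decrement each word count, decrement the document total, and clean up the document's rows.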

Add first document to table

Fluo Table:

  Row        Column   Value
  d : doc1   doc      my first hello world

(Diagram: a Fluo Client on the Client Cluster writes the row; the AddObserver and DeleteObserver run in Fluo workers.)

An observer increments word counts

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  w : first      cnt      1
  w : hello      cnt      1
  w : my         cnt      1
  w : world      cnt      1
  total : docs   cnt      1

A second document is added

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      2
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      2
  total : docs   cnt      2

First document is marked for deletion

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc1       delete
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      2
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      2
  total : docs   cnt      2

Observer decrements counts and deletes document

Fluo Table:

  Row            Column   Value
  d : doc1       doc      my first hello world
  d : doc1       delete
  d : doc2       doc      second hello world
  w : first      cnt      1
  w : hello      cnt      1
  w : my         cnt      1
  w : second     cnt      1
  w : world      cnt      1
  total : docs   cnt      1

Things to watch out for...

- Collisions occur when two transactions update the same data at the same time
  - Only one transaction will succeed; the others need to be retried
  - A few collisions are OK, but too many can slow computation
  - Avoid collisions by not updating the same row/column in every transaction (see the sketch after this list)

- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's changes
  - The result is different than if the transactions were serialized
  - Prevent write skew by making both transactions update the same row/column; if they run concurrently, a collision will occur and only one transaction will succeed
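As one way to apply the "avoid updating the same row/column" advice, the sketch below (my own, not from the talk) spreads a hot counter such as the document total across a fixed number of bucket rows; each writer picks a bucket at random, so concurrent increments rarely touch the same row. The column and row naming are hypothetical.

import java.util.concurrent.ThreadLocalRandom;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.types.TypedTransactionBase;

public class CounterBuckets {

  private static final int NUM_BUCKETS = 128;

  // Hypothetical count column
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("cnt"));

  // Each transaction updates one randomly chosen bucket row instead of the
  // single hot row, so concurrent increments rarely collide.
  public static void incrementTotal(TypedTransactionBase tx) {
    String bucketRow = "total : docs : " + ThreadLocalRandom.current().nextInt(NUM_BUCKETS);
    String current = tx.get().row(bucketRow).col(COUNT_COL).toString();
    int count = (current == null) ? 0 : Integer.parseInt(current);
    tx.mutate().row(bucketRow).col(COUNT_COL).set(Integer.toString(count + 1));
  }
}

A reader then sums all of the bucket rows in a single snapshot to get the total; the trade-off is a slightly more expensive read in exchange for far fewer collisions on writes.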

How does Fluo fit in?

(Chart: large join throughput on the vertical axis, lower to higher; processing latency on the horizontal axis, slower to faster. Batch processing (MapReduce, Spark) has the highest join throughput but the slowest latency; stream processing (Storm) has the fastest latency but lower join throughput; incremental processing (Fluo, Percolator) sits between the two.)

Don’t use Fluo if...

1. You want to do ad-hoc analysis on your data (use batch processing instead)

2. Your incoming data is being joined with a small data set (use stream processing instead)

Use Fluo if...

1. You want to maintain a large-scale computation using a series of small transactional updates

2. Periodic batch processing jobs are taking too long to join new data with existing data

Fluo Application Lifecycle

1. Use batch processing to seed computation with historical data

2. Use Fluo to process incoming data and maintain computation in real-time

3. While processing, Fluo can be queried and notifications can be sent to users (see the snapshot sketch below)
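For the query side of step 3, here is a minimal sketch (my own addition) of reading a value through a read-only snapshot with the FluoClient API; the row and column used are made up to match the word count example.

import io.fluo.api.client.FluoClient;
import io.fluo.api.client.Snapshot;
import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;

public class QueryExample {

  // Reads the current count for a word using a read-only snapshot; the read is
  // consistent even while observers are concurrently updating the table.
  public static long getWordCount(FluoClient client, String word, Column countCol) {
    try (Snapshot snap = client.newSnapshot()) {
      Bytes value = snap.get(Bytes.of("w : " + word), countCol);
      return (value == null) ? 0 : Long.parseLong(value.toString());
    }
  }
}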

Major Progress

2010
- Google releases the Percolator paper

2013
- Keith Turner starts work on a Percolator implementation for Accumulo as a side project (originally called Accismus)

2014
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and workers can be run in YARN
- Project name changed to Fluo

2015
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running a Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created stress test

Fluo Stress Test

- Motivation: needed a test that stresses Fluo and is easy to verify for correctness

- The stress test computes the number of unique integers by building a bitwise trie

- New integers are added at leaf nodes

- Observers watch all nodes, create parents, and percolate totals up to the root node

- The test runs successfully if the count at the root is the same as the number of leaf nodes

- Multiple transactions can operate on the same nodes, causing collisions (a simplified sketch of the percolation step follows the diagram below)

(Trie diagram: 4-bit integers such as 1110, 1100, 0101, and 0001 are stored at leaf nodes; intermediate nodes hold the count of unique integers under each prefix (11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1); the root holds the total, xxxx = 5.)
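To make the percolation step concrete, below is a heavily simplified sketch of the idea; it is my own illustration, not the actual fluo-stress code. It assumes each trie node is stored in a row such as "node:1110" or "node:11xx" with a hypothetical count column; when a node's count changes, the observer re-sums the parent's four children and writes the parent, which re-triggers the observer one level up until the root is reached.

import java.util.ArrayList;
import java.util.List;

import io.fluo.api.data.Bytes;
import io.fluo.api.data.Column;
import io.fluo.api.observer.Observer.NotificationType;
import io.fluo.api.observer.Observer.ObservedColumn;
import io.fluo.api.types.TypedObserver;
import io.fluo.api.types.TypedTransactionBase;

public class NodeObserver extends TypedObserver {

  // Hypothetical column holding each node's count of unique integers below it
  private static final Column COUNT_COL = new Column(Bytes.of("stat"), Bytes.of("count"));

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column col) {
    String node = row.toString();    // e.g. "node:1110" or "node:11xx"
    String parent = parentOf(node);  // e.g. "node:11xx" or "node:xxxx"
    if (parent == null) {
      return;  // the root has no parent; percolation stops here
    }

    // Re-sum the parent's four children and write the result; writing the
    // parent's count column notifies this observer again one level up.
    int sum = 0;
    for (String child : childrenOf(parent)) {
      String val = tx.get().row(child).col(COUNT_COL).toString();
      sum += (val == null) ? 0 : Integer.parseInt(val);
    }
    tx.mutate().row(parent).col(COUNT_COL).set(Integer.toString(sum));
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(COUNT_COL, NotificationType.STRONG);
  }

  // Mask two more bits to get the parent key: "node:1110" -> "node:11xx",
  // "node:11xx" -> "node:xxxx", root -> null
  private static String parentOf(String node) {
    String bits = node.substring("node:".length());
    if (bits.equals("xxxx")) {
      return null;
    }
    int firstX = bits.indexOf('x');
    int keep = (firstX == -1) ? bits.length() - 2 : firstX - 2;
    return "node:" + bits.substring(0, keep) + "xxxx".substring(keep);
  }

  // Fill in the parent's first two masked bits: "node:11xx" -> 1100, 1101, 1110, 1111
  private static List<String> childrenOf(String parent) {
    String bits = parent.substring("node:".length());
    int firstX = bits.indexOf('x');
    List<String> children = new ArrayList<>();
    for (String fill : new String[] {"00", "01", "10", "11"}) {
      children.add("node:" + bits.substring(0, firstX) + fill + bits.substring(firstX + 2));
    }
    return children;
  }
}

Because sibling nodes all rewrite the same parent row, concurrent transactions collide there, which is exactly the behavior the stress test is meant to exercise.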

Easy to run Fluo

1. On a machine with Maven and Git, clone the fluo-dev and fluo repos

2. Follow some basic configuration steps

3. Run the following commands

It’s just as easy to run a Fluo cluster on Amazon EC2

fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds Fluo distribution and deploys locally
fluo new myapp      # Creates configuration for ‘myapp’ Fluo application
fluo init myapp     # Initializes ‘myapp’ in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of ‘myapp’ in YARN
fluo scan myapp     # Prints snapshot of data in Fluo table of ‘myapp’

Fluo Ecosystem

- fluo - Main project repo
- fluo-quickstart - Simple Fluo example
- fluo-stress - Stresses Fluo on a cluster
- fluo-io.github.io - Fluo project website
- phrasecount - In-depth Fluo example
- fluo-deploy - Runs Fluo on an EC2 cluster
- fluo-dev - Helps developers run Fluo locally

Future Direction

- Primary focus: release a production-ready 1.0 with a stable API

- Other possible work:
  - Fluo-32: Real-world example application, possibly using CommonCrawl data
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale Fluo workers up and down based on workload

Get involved!

1. Experiment with Fluo
   - API has stabilized
   - Tools and development process make it easy
   - Not recommended for production yet (wait for 1.0)

2. Contribute to Fluo
   - ~85 open issues on GitHub
   - Review-then-commit process
