Transcript
Page 1: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

© 2013 42six Solutions, All Rights Reserved, www.42six.com

Framework for Big Data Discovery and Analytics

Page 2: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Problem Statement:

Hadoop MapReduce

• We can look across all our data to answer questions!

Developers can write MapReduce code to analyze data, but don’t know what to look for; the analysts know what to look for, but don’t know how to write code.

Technology is not the problem. It’s enabling the analyst to effectively leverage technology and reuse it.

Page 3: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Typical Analyst Workflow:

• I have an entity I want to learn more about

• Everything is indexed by entities

• We can ask questions of Big Data, but they aren’t Big Questions – we always start with an entity

We should be able to:

• Have a pattern and see entities that match that pattern

• We can ask complex questions of Big Data

Page 4: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Naïve Way: Custom MapReduce job for each question

Amino Way: Pre-compute features (micro-analytics), the building blocks of questions, and let analysts mix those on the fly to ask complex questions

The Amino index executes analysts’ complex questions as a real-time scan, which means less competition for cluster resources and better scalability.

Scales to billions of entities and features
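
As a rough illustration of the "Amino Way", the sketch below treats each pre-computed feature as a set of entities and answers a question by intersecting those sets at query time. The feature names and handles are hypothetical, and plain Python sets stand in for the Amino index.

```python
# Hypothetical pre-computed features: feature name -> entities that have it.
# In the Amino approach these are produced ahead of time, not per question.
precomputed = {
    "followers_10_to_200": {"stevetouw", "saarba", "zzrka"},
    "tweets_per_day_0_to_6": {"stevetouw", "aaabbb", "saarba"},
    "handle_starts_with_s": {"stevetouw", "saarba", "szaban"},
}


def ask(*feature_names):
    """Answer a question by intersecting already-computed feature sets."""
    sets = [precomputed[name] for name in feature_names]
    return set.intersection(*sets)


matches = ask(
    "followers_10_to_200",
    "tweets_per_day_0_to_6",
    "handle_starts_with_s",
)
print(sorted(matches))
```

No new batch job runs when the analyst changes the combination of features; only the lookup changes.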

Page 5: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Live Demo

What could go wrong?…

Page 6: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Amino Framework

Feature Creation API

• Abstracts the complexities of MapReduce

• Focus on logic of the feature/micro-analytic

• Write-once DataLoader for each data source

• Simple and powerful data joins

Amino Index

• AminoOutputFormat

• Bulk Ingest into Accumulo

Query API

• Iterators

Page 7: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Workflow

Page 8: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Benefits

• Data Agnostic

• Not a black box

• Fully scalable

• Crowd source micro-analytics

• Inherent cross-datasource linked indexes

• Encourages sharing of knowledge, discovery

• Index built to support machine learning

• Security considered up front – index is in Accumulo

• Built on open source, for open source

Page 9: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Feature Creation

• Can join multiple datasets

• Keys are established in the DataLoader

Any external job can output this format and it will be indexed properly during indexing jobs

Notice there’s no key – that’s on purpose!
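
A minimal sketch of that separation, assuming a loader that picks the entity key and a feature function that only computes a value. Every name here (`twitter_loader`, `starts_with_letter`, `run_feature`) is hypothetical; this is not the Amino Feature Creation API.

```python
# Hypothetical names throughout; this is not the Amino Feature Creation API.

def twitter_loader(raw_rows):
    """DataLoader role: establish the entity key once for this data source."""
    for row in raw_rows:
        yield row["handle"], row  # the key is decided here, not in the feature


def starts_with_letter(record):
    """Feature logic only: returns a feature value and never emits a key."""
    return record["handle"][0].upper()


def run_feature(loader, feature, raw_rows):
    """Framework role: pair each key from the loader with the feature output."""
    return {key: feature(record) for key, record in loader(raw_rows)}


rows = [
    {"handle": "stevetouw", "followers": 150},
    {"handle": "aaabbb", "followers": 3},
]
result = run_feature(twitter_loader, starts_with_letter, rows)
print(result)
```

Because the feature never sees the key, the same feature function could be reused with a different loader.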

Page 10: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Index Goals

Now that all our features are indexed, let’s let the analysts start building!

• Fast scans

• Highly dimensional scans

• Data compression

• Simple query structure

Page 11: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Accumulo Index 1: More Dimensions than Entities

Row                                   CF            CQ         Value
Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap

Example:

Row               CF         CQ  Value
2:Twitter:handle  stevetouw  0   010011010010011

JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does not achieve the best compression, but rather improves query processing time.

Indexes in the bit vector represent the features that entity falls in – a feature vector
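
The row layout above can be sketched as follows. This assumes bit positions are read left to right starting at 0 and shows the bitmap uncompressed (the slides store it compressed with JavaEWAH); the function name and the list of feature indexes are hypothetical.

```python
def index1_entry(shard, source, bucket, entity, salt, feature_indexes, width):
    """Return (row, cf, cq, value) for one entity.

    value is an uncompressed bit string in which position i is '1' when the
    entity falls in feature i (left-to-right ordering is an assumption).
    """
    bits = ["0"] * width
    for i in feature_indexes:
        bits[i] = "1"
    row = f"{shard}:{source}:{bucket}"
    return (row, entity, str(salt), "".join(bits))


# Hypothetical feature indexes chosen so the value matches the slide's example.
entry = index1_entry(
    2, "Twitter", "handle", "stevetouw", 0, [1, 4, 5, 7, 10, 13, 14], 15
)
print(entry)
```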

Page 12: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

At Query Time…

Bloom filter based on the lexicographically first and last values of each dimension of the query

Dimension                        First     Last
Number of followers: 10 - 200    aachimba  zzrka
Number of tweets per day: 0 - 6  aaabbb    zyrbb
Handle starts with letter: S     saarba    szaban  (smallest range)

Dimensions map to a query bit vector

000001001111000101000011100101010011100101

Note: there is an index for every possible value between the ratio features
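
One way to picture the range selection on this slide: given the first and last value for each query dimension, scan only the narrowest range. The sketch assumes a small, hypothetical sorted list of indexed handles (`mmiller` is invented), and it pairs the first two dimensions with their bounds by slide order, which is also an assumption.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted handles present in the index.
handles = sorted([
    "aaabbb", "aachimba", "mmiller", "saarba", "saarra",
    "stevetouw", "szaban", "zyrbb", "zzrka",
])

# Lexicographically first and last value for each query dimension.
bounds = {
    "followers 10-200": ("aachimba", "zzrka"),
    "tweets/day 0-6": ("aaabbb", "zyrbb"),
    "handle starts with S": ("saarba", "szaban"),
}


def span(first, last):
    """Number of indexed handles inside the inclusive range [first, last]."""
    return bisect_right(handles, last) - bisect_left(handles, first)


# Pick the dimension whose range covers the fewest rows and scan only that.
smallest = min(bounds, key=lambda dim: span(*bounds[dim]))
scan_range = bounds[smallest]
print(smallest, scan_range)
```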

Page 13: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Accumulo Iterator Time!!

Row               CF         CQ  Value
2:Twitter:handle  saarba     0   00101011001110
2:Twitter:handle  saarra     0   00101111010100
2:Twitter:handle  stevetouw  0   01111100001100
2:Twitter:handle  szaban     0   00110011001111

Push our query bit vector through the range found in the previous step

If the result of the bitwise operation contains an index at each dimension, we have a match!
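
The matching rule can be sketched with plain integers as bit vectors. The per-dimension masks below are hypothetical (the slides do not say which bit positions belong to which dimension), so the matches printed here are illustrative only; the entity bit strings are the ones shown in the table.

```python
# Hypothetical masks: which bit positions belong to each query dimension.
dimension_masks = {
    "followers 10-200": int("00111000000000", 2),
    "tweets/day 0-6": int("00000001100000", 2),
    "handle starts with S": int("00000000000010", 2),
}

# Bit vectors for the rows in the scanned range (values from the slide).
rows = {
    "saarba": "00101011001110",
    "saarra": "00101111010100",
    "stevetouw": "01111100001100",
    "szaban": "00110011001111",
}


def is_match(entity_bits):
    """True when the bitwise AND leaves at least one bit in every dimension."""
    entity = int(entity_bits, 2)
    return all(entity & mask for mask in dimension_masks.values())


hits = [handle for handle, bits in rows.items() if is_match(bits)]
print(hits)
```

In Accumulo this check would run inside an iterator on the tablet servers, so only matching rows travel back to the client.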

Page 14: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

What is the Salt For?

Row                                   CF            CQ         Value
Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap

Row               CF         CQ  Value
2:Twitter:handle  stevetouw  0   0100110100100101

Collisions are possible (using a 32-bit vector). The salt is used to hash the feature indexes, so you need as many matches in the previous step as you have salts.

We have used 3 salts with 15 billion records and have had no collisions.
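
A minimal sketch of the salting idea, under stated assumptions: `md5` is only a stand-in because the slides do not name the hash function, "32 bit vector" is read here as a 32-bit index space, and the feature strings are hypothetical.

```python
import hashlib

NUM_SALTS = 3          # the slide reports using 3 salts
INDEX_SPACE = 2 ** 32  # assumption: "32 bit vector" means 32-bit positions


def feature_index(feature, salt):
    """Hash a feature to a bit position under one salt (md5 is a stand-in)."""
    digest = hashlib.md5(f"{salt}:{feature}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % INDEX_SPACE


def entity_vectors(features):
    """One set of bit positions per salt for a single entity."""
    return [{feature_index(f, s) for f in features} for s in range(NUM_SALTS)]


def has_feature(vectors, feature):
    """Accept only if the feature's bit is set under every salt."""
    return all(
        feature_index(feature, s) in vectors[s] for s in range(NUM_SALTS)
    )


vectors = entity_vectors(["starts_with:S", "followers:150"])
print(has_feature(vectors, "starts_with:S"))
```

A false positive would require two different features to collide under every salt at once, which is why each extra salt makes collisions much less likely.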

Page 15: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Benefits of this Index

• Tables are small, bit vector compression is good, only one row per entity

• Works great if you have more dimensions than you have entities or the ranges in your dimensions are good bloom filters (like “handle starts with letter …”)

• No matter how many dimensions, the query will always be as fast as the smallest range

• All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable

• Represents a feature vector for your entities – great for machine learning

Page 16: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Accumulo Index 2: More Entities than Dimensions

Row         CF                                 CQ             Value
shard:salt  Data Source#Bucket Name#FeatureId  Feature Value  Compressed Bitmap

Example:

Row  CF                     CQ  Value
2:0  Twitter#handle#123456  s   0100110100101001

123456 could map to feature “Handle starts with letter”

Indexes in the bit vector represent the entities that fall in that feature

So handle stevetouw could map to index 73 (for salt 0)
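
This second layout inverts the first: each row is a feature value, and the bitmap marks entities. The sketch below assumes a hypothetical entity-to-position table for salt 0 (only 73 for stevetouw comes from the slide) and uses Python sets in place of compressed bitmaps.

```python
from collections import defaultdict

# Entity -> bit position for salt 0. 73 is the slide's example; the other
# positions are hypothetical.
position = {"stevetouw": 73, "saarba": 12, "aaabbb": 40}

# Hypothetical values of feature 123456 ("Handle starts with letter").
starts_with = {"stevetouw": "s", "saarba": "s", "aaabbb": "a"}


def build_rows(shard, salt, source, bucket, feature_id, values):
    """Group entity positions by feature value: (row, cf, cq) -> positions."""
    rows = defaultdict(set)
    for entity, value in values.items():
        key = (f"{shard}:{salt}", f"{source}#{bucket}#{feature_id}", value)
        rows[key].add(position[entity])
    return dict(rows)


rows = build_rows(2, 0, "Twitter", "handle", 123456, starts_with)
print(rows)
```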

Page 17: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

That Same Query Again…

Number of followers: 10 – 200 (feature id: 444411)

Number of tweets per day: 0 – 6 (feature id: 555522)

Handle starts with letter: S (feature id: 123456)

Row  CF                     CQ   Value
2:0  Twitter#handle#444411  10   0010111011100
2:0  Twitter#handle#444411  11   0101010101101
……
2:0  Twitter#handle#444411  200  0000001011000
2:0  Twitter#handle#555522  0    1111110001101
2:0  Twitter#handle#555522  1    1010100000100
……
2:0  Twitter#handle#555522  6    1111001010000
2:0  Twitter#handle#123456  S    1111110001101

Rows for feature 444411: OR

Rows for feature 555522: OR

Across the three features: AND

Magic iterator that handles all the boolean logic
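
A sketch of that boolean logic, reading the slide's OR/OR/AND labels as "OR the rows within a feature, then AND across features". It uses only the rows printed on the slide (the elided rows are left out) and plain integers instead of compressed bitmaps; the function name is hypothetical.

```python
from functools import reduce

# (feature id, feature value, bitmap) for the rows shown on the slide.
rows = [
    ("444411", "10", "0010111011100"),
    ("444411", "11", "0101010101101"),
    ("444411", "200", "0000001011000"),
    ("555522", "0", "1111110001101"),
    ("555522", "1", "1010100000100"),
    ("555522", "6", "1111001010000"),
    ("123456", "S", "1111110001101"),
]


def evaluate(rows, width=13):
    """OR bitmaps that share a feature id, then AND the per-feature results."""
    by_feature = {}
    for feature_id, _value, bits in rows:
        by_feature[feature_id] = by_feature.get(feature_id, 0) | int(bits, 2)
    combined = reduce(lambda a, b: a & b, by_feature.values())
    return format(combined, f"0{width}b")


result = evaluate(rows)
print(result)  # 0111110001101
```

Each '1' left in the result is an entity index that satisfies all three dimensions for this shard:salt.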

Page 18: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

More Details

Row  CF                     CQ   Value
2:0  Twitter#handle#444411  10   0010111011100
2:0  Twitter#handle#444411  11   0101010101101
……
2:0  Twitter#handle#444411  200  0000001011000
2:0  Twitter#handle#555522  0    1111110001101
2:0  Twitter#handle#555522  1    1010100000100
……
2:0  Twitter#handle#555522  6    1111001010000
2:0  Twitter#handle#123456  S    1111110001101

Rows for feature 444411: OR

Rows for feature 555522: OR

Across the three features: AND

The same entity is guaranteed to always land in the same shard:salt no matter the feature

We are left with a set of indexes for each salt, now what?

Page 19: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Convert Indexes to Entities

Row    CF                                              CQ            Value
shard  Index Position#Data Source#Bucket Name#Salt     Bucket Value

Example:

Row  CF                   CQ         Value
2    73#Twitter#handle#0  stevetouw

The iterator scans the rows using a CF filter with the indexes desired

The iterator ensures it sees the same CQ once per salt (“# of salts” times in total) before it sends that CQ back as a result

Again, this uses the power of iterators, pushing code to the data rather than doing the salt set operation in the web tier
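
The "seen once per salt" rule can be sketched as a counter over the lookup rows. The lookup table, the positions, and the per-salt hit sets below are all hypothetical (only 73 for stevetouw under salt 0 comes from the slides), and a dictionary stands in for the Accumulo table.

```python
from collections import Counter

NUM_SALTS = 3

# Hypothetical reverse lookup: (index position, salt) -> bucket value (entity).
lookup = {
    (73, 0): "stevetouw", (18, 1): "stevetouw", (5, 2): "stevetouw",
    (41, 0): "saarba", (77, 1): "saarba", (9, 2): "saarba",
}

# Hypothetical index positions left over from the boolean step, per salt.
hits_by_salt = {0: {73, 41}, 1: {18}, 2: {5, 9}}


def resolve(hits_by_salt):
    """Return entities whose position was hit under every salt."""
    seen = Counter()
    for salt, positions in hits_by_salt.items():
        for pos in positions:
            entity = lookup.get((pos, salt))
            if entity is not None:
                seen[entity] += 1
    return sorted(e for e, count in seen.items() if count == NUM_SALTS)


entities = resolve(hits_by_salt)
print(entities)
```

Here saarba is hit under only two of the three salts, so it is treated as a collision and dropped.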

Page 20: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Benefits of this Index

• Tables are small, bit vector compression is good

• Works great if you have more entities than you have dimensions (most likely scenario)

• Affords the ability to do full boolean logic in-iterator, rather than just ANDs as in the previous index

• All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable

Page 21: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Conclusion

• Amino helps non-technical folk leverage MapReduce cleanly and without hogging cluster resources

• Accumulo iterators are the reason for the index performance

• Amino is all about sharing and reuse: crowd-source the building blocks and save analysts’ hypotheses. The more people touch Amino, the smarter it becomes

• Open source (documentation needs help): https://github.com/amino-cloud/amino

Page 22: The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Questions?

Steve Touw, [email protected]

Barrett Stabile, [email protected]

Joe Bruner, [email protected]

Sapan Shah, [email protected]