© 2013 42six Solutions, All Rights Reserved, www.42six.com
Framework for Big Data Discovery and Analytics
Hadoop MapReduce
• We can look across all our data to answer questions!
Developers can write MapReduce code to analyze data, but don’t know what to look for; the analysts know what to look for, but don’t know how to write code.
Technology is not the problem. The problem is enabling analysts to leverage the technology effectively and reuse it.
Problem Statement:
Typical Analyst Workflow:
• I have an entity I want to learn more about
• Everything is indexed by entities
• We can ask questions of Big Data, but they aren’t Big Questions – we always start with an entity
We should be able to:
• Have a pattern and see entities that match that pattern
• We can ask complex questions of Big Data
Naïve Way: Custom MapReduce job for each question
Amino Way: Pre-compute features (micro-analytics), the building blocks of questions, and let analysts mix those on the fly to ask complex questions
The Amino index executes analysts’ complex questions as a real-time scan, which means less competition for cluster resources and better scalability.
Scales to billions of entities and features
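The contrast between the two approaches can be sketched in a few lines. This is only an illustration: the feature names, the entity names, and the `ask` helper are hypothetical, and plain Python sets stand in for the real Amino index.

```python
# Hypothetical precomputed feature index: feature name -> set of entities.
# In the naive approach each question would need its own batch job; here
# the features are computed once and a question is just a set operation.
feature_index = {
    "followers_10_to_200": {"alice", "bob", "steve"},
    "tweets_per_day_0_to_6": {"bob", "steve", "zed"},
    "handle_starts_with_s": {"steve", "sam"},
}


def ask(all_of):
    """Return the entities that fall in every named feature (logical AND)."""
    sets = [feature_index[name] for name in all_of]
    return set.intersection(*sets) if sets else set()


question = [
    "followers_10_to_200",
    "tweets_per_day_0_to_6",
    "handle_starts_with_s",
]
print(sorted(ask(question)))
```

A new question only changes the list passed to `ask`; nothing has to be recomputed over the raw data.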
Live Demo
What could go wrong?…
Amino Framework
Feature Creation API
• Abstracts the complexities of MapReduce
• Focus on logic of the feature/micro-analytic
• Write-once DataLoader for each data source
• Simple and powerful data joins
Amino Index
• AminoOutputFormat
• Bulk Ingest into Accumulo
Query API
• Iterators
Workflow
• Data Agnostic
• Not a black box
• Fully scalable
• Crowd source micro-analytics
• Inherent cross-datasource linked indexes
• Encourages sharing of knowledge, discovery
• Index built to support machine learning
• Security considered up front – index is in Accumulo
• Built on open source, for open source
Benefits
Feature Creation
• Can join multiple datasets
• Keys are established in the DataLoader
Any external job can output this format and it will be indexed properly during indexing jobs
Notice there’s no key – that’s on purpose!
Index Goals
Now that all our features are indexed, let’s let the analysts start building!
• Fast scans
• Highly dimensional scans
• Data compression
• Simple query structure
Accumulo Index 1: More Dimensions than Entities
Row | CF | CQ | Value
Shard Number:Data Source:Bucket Name | Bucket Value | Hash Salt | Compressed Bitmap
Example:
Row | CF | CQ | Value
2:Twitter:handle | stevetouw | 0 | 010011010010011
JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does not achieve the best compression, but rather improves query processing time.
Indexes set in the bit vector represent the features that the entity falls in – a feature vector
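A minimal sketch of this row layout, assuming an uncompressed Python integer in place of the JavaEWAH bitmap. The `EntityRow` class, the helper names, and the feature indexes chosen for `stevetouw` are illustrative, not part of Amino.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRow:
    row: str    # "shard:datasource:bucket"
    cf: str     # bucket value, i.e. the entity itself
    cq: str     # hash salt
    value: int  # feature bit vector (uncompressed in this sketch)


def make_entity_row(shard, datasource, bucket, entity, salt, feature_indexes):
    """Build one index row; bit i is set when the entity falls in feature i."""
    bits = 0
    for i in feature_indexes:
        bits |= 1 << i
    return EntityRow(f"{shard}:{datasource}:{bucket}", entity, str(salt), bits)


def features_of(row):
    """Read the feature vector back as a sorted list of feature indexes."""
    return [i for i in range(row.value.bit_length()) if (row.value >> i) & 1]


# The feature indexes 1, 4, 5 and 7 are made up for illustration.
example = make_entity_row(2, "Twitter", "handle", "stevetouw", 0, [1, 4, 5, 7])
print(example.row, example.cf, example.cq, features_of(example))
```

There is one such row per entity and salt, which is why this layout stays small when there are more dimensions than entities.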
At Query Time…
Bloom filter based on the lexicographically first and last entity of each dimension of the query
• Number of followers: 10 - 200 (First: aachimba, Last: zzrka)
• Number of tweets per day: 0 - 6 (First: aaabbb, Last: zyrbb)
• Handle starts with letter: S (First: saarba, Last: szaban) – smallest range
Dimensions map to a query bit vector:
000001001111000101000011100101010011100101
Note: for ratio features, there is an index for every possible value within the queried range
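The range-narrowing step can be sketched as follows. The slide picks the smallest single range; this sketch instead intersects all of the ranges, which is an assumption on my part. An entity that matches every dimension must lie inside every range, so the intersection is never wider than the smallest one. The function name is hypothetical.

```python
def narrowest_scan_range(dimension_ranges):
    """Intersect the (first, last) entity ranges of all query dimensions.

    Returns None when the ranges do not overlap, in which case no entity
    can match every dimension and the scan can be skipped.
    """
    start = max(first for first, _ in dimension_ranges)
    end = min(last for _, last in dimension_ranges)
    return (start, end) if start <= end else None


# First/last values taken from the slide's three query dimensions.
ranges = [
    ("aachimba", "zzrka"),  # number of followers: 10 - 200
    ("aaabbb", "zyrbb"),    # number of tweets per day: 0 - 6
    ("saarba", "szaban"),   # handle starts with letter: S
]
print(narrowest_scan_range(ranges))
```

For the slide's values the result coincides with the range of the "handle starts with letter" dimension.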
Accumulo Iterator Time!!
Row | CF | CQ | Value
2:Twitter:handle | saarba | 0 | 00101011001110
2:Twitter:handle | saarra | 0 | 00101111010100
2:Twitter:handle | stevetouw | 0 | 01111100001100
2:Twitter:handle | szaban | 0 | 00110011001111
Push our query bit vector through the range found in the previous step
If the result of the bitwise operation contains an index at each dimension, we have a match!
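A rough sketch of that matching rule, assuming the bitwise operation is an AND and that each query dimension owns a known group of bit positions. The bit assignments, entity vectors, and function names below are invented for illustration, and the code runs as ordinary Python rather than inside an Accumulo iterator.

```python
def bits_of(indexes):
    """Pack a collection of bit positions into an integer bit vector."""
    vector = 0
    for i in indexes:
        vector |= 1 << i
    return vector


def matches(entity_bits, dimension_masks):
    """An entity matches when the AND leaves at least one bit per dimension."""
    return all(entity_bits & mask for mask in dimension_masks)


# Hypothetical layout: bits 0-3 are follower counts, bits 4-6 are tweets
# per day, bit 7 is "handle starts with S".
query = [bits_of([0, 1, 2, 3]), bits_of([4, 5, 6]), bits_of([7])]

# Hypothetical entity feature vectors found in the scanned range.
scanned = {
    "saarba": bits_of([2, 7]),        # no tweets-per-day feature in range
    "stevetouw": bits_of([2, 5, 7]),  # one hit in every dimension
    "szaban": bits_of([5, 7, 9]),     # no follower-count feature in range
}
hits = [name for name, bits in scanned.items() if matches(bits, query)]
print(hits)
```

Because each row is tested independently, the check can run where the data lives and only the matching entities need to be returned.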
What is the Salt For?
Row | CF | CQ | Value
Shard Number:Data Source:Bucket Name | Bucket Value | Hash Salt | Compressed Bitmap
Row | CF | CQ | Value
2:Twitter:handle | stevetouw | 0 | 0100110100100101
Collisions are possible (using a 32-bit vector). Each salt is used to hash the feature indexes, so an entity needs as many matches in the previous step as there are salts.
We have used 3 salts with 15 billion records and have had no collisions
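The effect of the salts can be sketched like this. The hash function (SHA-256 truncated to 32 positions), the salt values, and all function names are assumptions made for the example; the slides do not say how Amino hashes feature indexes.

```python
import hashlib

VECTOR_BITS = 32   # width of the bit vector, as mentioned on the slide
SALTS = (0, 1, 2)  # three salts, as mentioned on the slide


def position(feature_id, salt):
    """Hash a feature id with a salt to a bit position (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}:{feature_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % VECTOR_BITS


def encode(feature_ids, salt):
    """Build the salted bit vector for an entity's features."""
    bits = 0
    for feature_id in feature_ids:
        bits |= 1 << position(feature_id, salt)
    return bits


def matches_all_salts(entity_features, query_features):
    """Require every queried feature to be present under every salt."""
    for salt in SALTS:
        vector = encode(entity_features, salt)
        for feature_id in query_features:
            if ((vector >> position(feature_id, salt)) & 1) == 0:
                return False
    return True


print(matches_all_salts([444411, 555522, 123456], [444411, 123456]))
```

Two different features may share a bit position under one salt, but for a false match they would have to collide under every salt, which is far less likely.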
Benefits of this Index
• Tables are small, bit vector compression is good, only one row per entity
• Works great if you have more dimensions than you have entities, or if the ranges in your dimensions make good Bloom filters (like “handle starts with letter …”)
• No matter how many dimensions, the query will always be as fast as the smallest range
• All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
• Represents a feature vector for your entities – great for machine learning
Accumulo Index 2: More Entities than Dimensions
Row | CF | CQ | Value
shard:salt | Data Source#Bucket Name#FeatureId | Feature Value | Compressed Bitmap
Example:
Row | CF | CQ | Value
2:0 | Twitter#handle#123456 | s | 0100110100101001
123456 could map to feature “Handle starts with letter”
Indexes in the bit vector represent the entities that fall in that feature
So handle stevetouw could map to index 73 (for salt 0)
That Same Query Again…
Number of followers: 10 – 200 (feature id: 444411)
Number of tweets per day: 0 – 6 (feature id: 555522)
Handle starts with letter: S (feature id: 123456)
Row | CF | CQ | Value
2:0 | Twitter#handle#444411 | 10 | 0010111011100
2:0 | Twitter#handle#444411 | 11 | 0101010101101
2:0 | Twitter#handle#444411 | 200 | 0000001011000
2:0 | Twitter#handle#555522 | 0 | 1111110001101
2:0 | Twitter#handle#555522 | 1 | 1010100000100
2:0 | Twitter#handle#555522 | 6 | 1111001010000
2:0 | Twitter#handle#123456 | S | 1111110001101
……
……
Rows of the same feature id are combined with OR (444411, 555522); the per-feature results are then combined with AND
Magic iterator that handles all the boolean logic
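The boolean logic on this slide (OR within a feature, AND across features) can be sketched over the rows that are visible above. The elided rows are not filled in, so the result only reflects the seven rows shown. The `evaluate` function and the query dictionary are hypothetical, and plain integers stand in for compressed bitmaps.

```python
from functools import reduce
from operator import and_, or_

# (feature id, feature value, entity bitmap) for the rows shown on the slide.
rows = [
    ("444411", "10", "0010111011100"),
    ("444411", "11", "0101010101101"),
    ("444411", "200", "0000001011000"),
    ("555522", "0", "1111110001101"),
    ("555522", "1", "1010100000100"),
    ("555522", "6", "1111001010000"),
    ("123456", "S", "1111110001101"),
]


def evaluate(rows, query):
    """query maps feature id -> accepted feature values.

    Bitmaps of accepted values are ORed within a feature, then the
    per-feature results are ANDed together.
    """
    per_feature = []
    for feature_id, accepted in query.items():
        bitmaps = [
            int(bits, 2)
            for fid, value, bits in rows
            if fid == feature_id and value in accepted
        ]
        per_feature.append(reduce(or_, bitmaps, 0))
    return reduce(and_, per_feature) if per_feature else 0


query = {
    "444411": {"10", "11", "200"},  # number of followers
    "555522": {"0", "1", "6"},      # number of tweets per day
    "123456": {"S"},                # handle starts with letter
}
result = evaluate(rows, query)
print(format(result, "013b"))
```

Each bit that survives in `result` is an entity index for this shard and salt, which still has to be converted back into an entity.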
The same entity is guaranteed to always land in the same shard:salt no matter the feature
We are left with a set of indexes for each salt, now what?
More Details
Convert Indexes to Entities
Row | CF | CQ | Value
shard | Index Position#Data Source#Bucket Name#Salt | Bucket Value | (empty)
Example:
Row | CF | CQ | Value
2 | 73#Twitter#handle#0 | stevetouw | (empty)
The iterator scans the rows using a CF filter with the indexes desired
The iterator ensures it sees the same CQ once per salt (“# of salts” times) before it sends that CQ back as a result
Again, use the power of iterators: push code to the data rather than doing the salt set operation in the web tier
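The "same CQ once per salt" rule can be sketched with a counter. Only the mapping of `stevetouw` to index 73 under salt 0 comes from the slides; every other index position, the second entity, and the function name are made up, and the sketch assumes each entity appears at most once per salt.

```python
from collections import Counter


def resolve_entities(lookup_rows, wanted, num_salts):
    """Return entities whose index position was requested under every salt.

    lookup_rows: iterable of (index_position, salt, entity) triples.
    wanted: dict mapping salt -> set of index positions from the query.
    """
    seen = Counter()
    for index_position, salt, entity in lookup_rows:
        if index_position in wanted.get(salt, ()):
            seen[entity] += 1
    return sorted(entity for entity, count in seen.items() if count == num_salts)


lookup_rows = [
    (73, 0, "stevetouw"),  # from the slide: index 73 for salt 0
    (12, 1, "stevetouw"),  # hypothetical positions for salts 1 and 2
    (40, 2, "stevetouw"),
    (5, 0, "saarba"),
    (73, 1, "saarba"),     # hypothetical hit under one salt only
    (9, 2, "saarba"),
]
wanted = {0: {73}, 1: {12, 73}, 2: {40}}
print(resolve_entities(lookup_rows, wanted, num_salts=3))
```

An entity that shows up under only some of the salts is treated as a hash collision and dropped before any result leaves the tablet server.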
Benefits of this Index
• Tables are small, bit vector compression is good
• Works great if you have more entities than you have dimensions (most likely scenario)
• Affords the ability to do full boolean logic in-iterator, rather than just ANDs as in the previous index
• All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
Conclusion
• Amino helps non-technical folk leverage MapReduce cleanly and without hogging cluster resources
• Accumulo iterators are the reason for the index performance
• Amino is all about sharing and reuse: crowd source the building blocks and save analysts’ hypotheses. The more people touch Amino, the smarter it becomes
• Open source (documentation needs help): https://github.com/amino-cloud/amino
Questions?
Steve Touw
Barrett Stabile
Joe Bruner
Sapan Shah