The Amino Analytical Framework - Leveraging Accumulo to the Fullest

DESCRIPTION

Speaker: Steve Touw, CTO, 42six Solutions, a CSC Company. Amino is an open source analytical framework that takes a building-blocks approach to data discovery: it pre-computes features about data at the most granular level possible and then lets analysts and data scientists easily combine those features into more complex questions. The magic behind Amino is its custom Accumulo index, which strives to provide fast scans, highly dimensional scans, data compression, and a simple query structure. The index leverages Accumulo iterators to do much of the scan-time logic, which places no limit on the dimensionality of the query. Iterators are what make Accumulo unique and enable the Amino index to execute complex queries.

Transcript

  • 1. Framework for Big Data Discovery and Analytics 2013 42six Solutions, All Rights Reserved, www.42six.com

  • 2. Hadoop MapReduce
    We can look across all our data to answer questions!
    Problem Statement: Developers can write MapReduce code to analyze data, but don't know what to look for; the analysts know what to look for, but don't know how to write code.
    Technology is not the problem. It's enabling the analyst to effectively leverage technology and reuse it.

  • 3. Typical Analyst Workflow: I have an entity I want to learn more about
    - Everything is indexed by entities
    - We can ask questions of Big Data, but they aren't Big Questions: we always start with an entity
    We should be able to:
    - Have a pattern and see entities that match that pattern
    - Ask complex questions of Big Data

  • 4. Naive Way: Custom MapReduce job for each question
    Amino Way: Pre-compute features (micro-analytics), the building blocks of questions, and let analysts mix those on the fly to ask complex questions
    - The Amino index executes analysts' complex questions as a real-time scan: less competition for resources, more scalable
    - Scales to billions of entities and features

  • 5. Live Demo
    What could go wrong?

  • 6. Amino Framework
    Feature Creation API
    - Abstracts the complexities of MapReduce
    - Focus on logic of the feature/micro-analytic
    - Write-once DataLoader for each data source
    - Simple and powerful data joins
    Amino Index
    - AminoOutputFormat
    - Bulk ingest into Accumulo
    - Query API
    - Iterators

  • 7. Workflow

  • 8. Benefits
    - Data agnostic
    - Not a black box
    - Fully scalable
    - Crowd source micro-analytics
    - Inherent cross-datasource linked indexes
    - Encourages sharing of knowledge, discovery
    - Index built to support machine learning
    - Security considered up front: the index is in Accumulo
    - Built on open source, for open source

  • 9. Feature Creation
    - Can join multiple datasets
    - Keys are established in the DataLoader
    Any external job can output this format and it will be indexed properly during indexing jobs.
    Notice there's no key: that's on purpose!

  • 10. Index Goals
    Now all our features are indexed, let's let the analysts start building!
    - Fast scans
    - Highly dimensional scans
    - Data compression
    - Simple query structure

  • 11. Accumulo Index 1: More Dimensions than Entities
    Row     Shard Number : Data Source : Bucket Name
    CF      Bucket Value
    CQ      Hash Salt
    Value   Compressed Bitmap
    Example:
    Row                 CF          CQ   Value
    2:Twitter:handle    stevetouw   0    010011010010011
    JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does not achieve the best compression, but rather improves query processing time.
    Indexes in the bit vector represent the features that entity falls in: a feature vector.

  • 12. At Query Time
    Bloom filter based on the lexicographical first and last of each dimension of the query:
    - Number of followers: 10 - 200 (First: aachimba, Last: zzrka)
    - Number of tweets per day: 0 - 6 (First: aaabbb, Last: zyrbb)
    - Handle starts with letter: S (First: saarba, Last: szaban), the smallest range
    Dimensions map to a query bit vector: 000001001111000101000011100101010011100101
    Note there is an index for every possible value between the ratio features.

  • 13. Accumulo Iterator Time!!
    Row                 CF          CQ   Value
    2:Twitter:handle    saarba      0    00101011001110
    2:Twitter:handle    saarra      0    00101111010100
    2:Twitter:handle    stevetouw   0    01111100001100
    2:Twitter:handle    szaban      0    00110011001111
    Push our query bit vector through the range found in the previous step.
    If the result of the bitwise operation contains an index at each dimension, we have a match!

  • 14. What is the Salt For?
    Row     Shard Number : Data Source : Bucket Name
    CF      Bucket Value
    CQ      Hash Salt
    Value   Compressed Bitmap
    Example:
    Row                 CF          CQ   Value
    2:Twitter:handle    stevetouw   0    0100110100100101
    Collisions are possible (using a 32 bit vector). The salt is used to hash the feature indexes, so you need as many matches in the previous step as you have salts.
    We have used 3 salts with 15 billion records and have had no collisions.

  • 15. Benefits of this Index
    - Tables are small, bit vector compression is good, only one row per entity
    - Works great if you have more dimensions than you have entities, or if the ranges in your dimensions are good bloom filters (like "handle starts with letter")
    - No matter how many dimensions, the query will always be as fast as the smallest range
    - All processing/boolean logic occurs on the nodes (thanks, iterators), fully scalable
    - Represents a feature vector for your entities: great for machine learning

  • 16. Accumulo Index 2: More Entities than Dimensions
    Row     shard:salt
    CF      Data Source#Bucket Name#FeatureId
    CQ      Feature Value
    Value   Compressed Bitmap
    Example:
    Row   CF                      CQ   Value
    2:0   Twitter#handle#123456   s    0100110100101001
    - 123456 could map to the feature "Handle starts with letter"
    - Indexes in the bit vector represent the entities that fall in that feature
    - So handle stevetouw could map to index 73 (for salt 0)

  • 17. That Same Query Again
    - Number of followers: 10 - 200 (feature id: 444411)
    - Number of tweets per day: 0 - 6 (feature id: 555522)
    - Handle starts with letter: S (feature id: 123456)
    Row   CF                      CQ    Value
    2:0   Twitter#handle#444411   10    0010111011100
    2:0   Twitter#handle#444411   11    0101010101101
    2:0   Twitter#handle#444411   200   0000001011000
    2:0   Twitter#handle#555522   0     1111110001101
    2:0   Twitter#handle#555522   1     1010100000100
    2:0   Twitter#handle#555522   6     1111001010000
    2:0   Twitter#handle#123456   S     1111110001101
    Rows for the values of one feature are combined with OR; the features are then combined with AND.
    Magic iterator that handles all the boolean logic.

  • 18. More Details
    Same rows as slide 17.
    - The same entity is guaranteed to always land in the same shard:salt, no matter the feature
    - We are left with a set of indexes for each salt. Now what?

  • 19. Convert Indexes to Entities
    Row     shard
    CF      Index Position#Data Source#Bucket Name#Salt
    CQ      Bucket Value
    Value   (empty)
    Example:
    Row   CF                    CQ
    2     73#Twitter#handle#0   stevetouw
    - The iterator scans the rows using a CF filter with the indexes desired
    - The iterator ensures it sees the same CQ once per salt before it sends the resulting CQs back
    - Again, use the power of iterators and push code to the data rather than doing the salt set operation in the web tier

  • 20. Benefits of this Index
    - Tables are small, bit vector compression is good
    - Works great if you have more entities than you have dimensions (the most likely scenario)
    - Affords the ability to do full boolean logic in-iterator, rather than just ANDs as in the previous index
    - All processing/boolean logic occurs on the nodes (thanks, iterators), fully scalable

  • 21. Conclusion
    - Amino helps non-technical folk leverage MapReduce cleanly and without hogging cluster resources
    - Accumulo iterators are the reason for the index performance
    - Amino is all about sharing and reuse: crowd source the building blocks and save analysts' hypotheses; the more people touching Amino, the smarter it becomes
    - Open source (documentation needs help): https://github.com/aminocloud/amino

  • 22. Questions?
    Steve Touw, steve@42six.com
    Barrett Stabile, bstabile@42six.com
    Joe Bruner, jbruner@42six.com
    Sapan Shah, sshah@42six.com
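
The boolean logic described for Index 2 (slide 17) can be sketched in a few lines of Python. This is an illustration only, not Amino's iterator code: plain Python integers stand in for the JavaEWAH compressed bitmaps, the bit strings are the ones shown in the slide 17 rows as read from this transcript, and the function names and the convention that the leftmost bit is index position 0 are assumptions made for the example.

```python
from functools import reduce

# Rows as read from slide 17: feature id -> {feature value: bitmap of entities}.
# All rows belong to shard:salt "2:0". Plain bit strings stand in for the
# compressed bitmaps stored in Accumulo.
INDEX_ROWS = {
    "444411": {"10": "0010111011100", "11": "0101010101101", "200": "0000001011000"},
    "555522": {"0": "1111110001101", "1": "1010100000100", "6": "1111001010000"},
    "123456": {"S": "1111110001101"},
}
WIDTH = 13  # every bitmap in the example is 13 bits wide


def or_within_feature(bitmaps):
    """OR together the bitmaps of all values requested for one feature."""
    return reduce(lambda acc, bits: acc | int(bits, 2), bitmaps, 0)


def and_across_features(rows):
    """AND the per-feature results, so an entity must satisfy every feature."""
    per_feature = [or_within_feature(values.values()) for values in rows.values()]
    return reduce(lambda acc, bitmap: acc & bitmap, per_feature)


def matching_positions(rows, width=WIDTH):
    """Return the entity index positions set in the combined bitmap.

    Assumption: the leftmost bit is index position 0.
    """
    bits = format(and_across_features(rows), f"0{width}b")
    return [i for i, bit in enumerate(bits) if bit == "1"]


print(matching_positions(INDEX_ROWS))  # [1, 2, 3, 4, 5, 9, 10, 12]
```

In Amino this combination runs inside an Accumulo iterator on the tablet servers, once per salt; the resulting index positions are then converted back to entities as described on slide 19.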