Big Data Analysis Patterns: tying real-world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics, in which large-scale data and relatively simple algorithms drive results, rather than complex models built on sample data. When you are ready to extract benefits from your data, how do you decide which approach, which algorithm, and which tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of applications, each illustrated with concrete use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.
1
Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013
2
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (Twitter, GitHub)
• [email protected]
3
Announcements
• Next ATLHUG Meeting - Sept. 26
– How Google Does Big Data
• Wednesday
– MapR Data Warehouse Offload Roadshow
• MapR Upcoming Training
– MapR M7 & HBase for Developers on August 27 in Campbell, CA
– MapR M7 & HBase for Developers on Sept 17 in Reston, VA
– MapR M5 for Administrators on Oct 3 in Campbell, CA
4
BIG DATA
5
6
Big Data is not new!
But the tools are.
7
The Good News in Big Data:
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
8
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?
9
Ask a Different Question
It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?
10
Picking the Best Solution
Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try
But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.
11
Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries over data such as:
• Full text
• Geographical data
• Statistically weighted data
Solr is a small data tool that has flourished in a big data world
12
Apache Mahout
Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification
Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.
13
Apache Drill
• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce
14
Storm
• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher-level abstraction over:
– Message Queues
– Worker Logic
“The Hadoop of Realtime”
15
Titan
• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elasticsearch
• Faunus
– Batch processing of large graphs
• Fulgora
– Graph traversals on subset
– In-memory
16
Using the Answers to Guide Your Choices
For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
17
Big Data Decision Tree
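[Figure: decision tree with three questions. “How big is your data?” (<10 GB, mid, >200 GB); “What size queries?” (single element at a time, one pass over 100%, multiple passes over big chunks; big storage, streaming); “Response time?” (<100 s, human scale, or throughput rather than response). Leaves are labeled A, B, C, D, E, and ??.]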
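As a rough sketch, the three questions can be turned into a small workload profile before any tool is chosen. The 10 GB, 200 GB, and 100 s thresholds come from the slide; the bucket names, the function names, and the idea of returning a tuple are assumptions made for illustration. The mapping from a profile to the leaves A through E is not recoverable from the slide text, so it is deliberately left out.

```python
# Thresholds come from the slide; bucket names are illustrative assumptions.
SMALL_GB = 10
LARGE_GB = 200
HUMAN_SCALE_SECONDS = 100

# The three query shapes named on the slide.
QUERY_PATTERNS = {"single element", "one pass", "multiple passes"}


def size_bucket(stored_gb):
    """Classify stored data volume as small, mid, or large."""
    if stored_gb < SMALL_GB:
        return "small"
    if stored_gb > LARGE_GB:
        return "large"
    return "mid"


def response_bucket(max_response_seconds=None):
    """None means only throughput matters, not individual response time."""
    if max_response_seconds is not None and max_response_seconds < HUMAN_SCALE_SECONDS:
        return "human scale"
    return "throughput"


def workload_profile(stored_gb, query_pattern, max_response_seconds=None):
    """Combine the three answers into one tuple describing the workload."""
    if query_pattern not in QUERY_PATTERNS:
        raise ValueError(f"unknown query pattern: {query_pattern!r}")
    return (
        size_bucket(stored_gb),
        query_pattern,
        response_bucket(max_response_seconds),
    )


# Hypothetical example: 5 TB stored, full scans, no interactive deadline.
profile = workload_profile(5_000, "one pass")
```

Writing the answers down this way keeps the discussion on the shape of the problem first and the choice of product second.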
18
Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value
19
Business Value
20
Business Value
21
Telecommunications Giant
ETL Offload
22
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important
Data Shape
Telecommunications
23
Techniques
Analytics
ETL
Telecommunications
24
Techniques
ETL (Hadoop) + Analytics (Teradata)
Telecommunications
25
Business Value
Telecommunications
26
Credit Card Issuer
27
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations
Data Shape
Credit Card Issuer
28
History matrix
One row per user
One column per thing
A Recommendation Engine with Mahout and Solr/Lucene
Techniques
Credit Card Issuer
29
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
Techniques
Credit Card Issuer
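A minimal sketch of the cooccurrence idea, assuming the history matrix is held as a mapping from user to the set of things that user bought. It counts, for every pair of items, how many users have both. This is plain Python for illustration and not Mahout's implementation; the names `cooccurrence` and `history` and the sample data are hypothetical, and the counts are raw, with no statistical weighting or filtering.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Build an item-item count matrix from a user -> set-of-items history.

    Returns a dict of dicts: matrix[a][b] is the number of users whose
    history contains both a and b. The diagonal is omitted.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        # Every unordered pair in one user's row contributes one count.
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {item: dict(others) for item, others in counts.items()}


# Hypothetical history matrix: one row per user, one column per thing.
history = {
    "alice": {"coffee", "bagel", "juice"},
    "bob": {"coffee", "bagel"},
    "carol": {"coffee", "juice"},
    "dave": {"tea"},
}

matrix = cooccurrence(history)
```

Here `matrix["coffee"]["bagel"]` is 2, because two users bought both, and "tea" has no row because it never appears alongside another item.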
30
Cooccurrence matrix can also be implemented as a search index
Techniques
Credit Card Issuer
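To illustrate how a cooccurrence matrix can behave like a search index, the sketch below treats each item as a document whose indicator field lists the items it cooccurs with, and treats a user's recent history as the query. It is a toy inverted index in plain Python and does not use Solr's API; `build_index`, `recommend`, the match-count scoring, and the sample data are all assumptions made for illustration.

```python
def build_index(indicators):
    """Invert item -> indicator items into indicator -> items (documents)."""
    index = {}
    for item, indicator_items in indicators.items():
        for indicator in indicator_items:
            index.setdefault(indicator, set()).add(item)
    return index


def recommend(index, user_history, top_n=3):
    """Score each item by how many of its indicators appear in the query."""
    scores = {}
    for seen in user_history:
        for item in index.get(seen, ()):
            if item not in user_history:  # do not recommend what the user has
                scores[item] = scores.get(item, 0) + 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical indicator fields, e.g. one row of a cooccurrence matrix each.
indicators = {
    "bagel": {"coffee", "juice"},
    "coffee": {"bagel", "juice"},
    "juice": {"bagel", "coffee"},
    "scone": {"coffee", "tea"},
    "tea": {"scone"},
}

index = build_index(indicators)
recs = recommend(index, {"coffee", "tea"})
```

With this data the query `{"coffee", "tea"}` ranks "scone" first, since it matches both query terms. A real search engine would add relevance weighting on top of the simple match count used here.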
31
[Diagram labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr indexing; Index shards; Solr indexers. Timing labels: 20 Hrs, 3 Hrs.]
Techniques
Credit Card Issuer
32
[Diagram labels: User history; Web tier; Item meta-data; Solr search; Index shards; Solr indexers. Timing labels: 8 Hrs, 3 Min.]
Techniques
Credit Card Issuer
33
Techniques
[Diagram labels: Purchase History, Merchant Information, and Merchant Offers; Hadoop; Recommendation Engine Results (Mahout); Export (4 hrs); Import (4 hrs); Presentation Data Store (DB2); five App boxes.]
Credit Card Issuer
34
Techniques
[Diagram labels: Purchase History, Merchant Information, and Merchant Offers; Hadoop; Recommendation Engine Results (Mahout); Index Update (3 min); Recommendation Search Index (Solr); five App boxes.]
Credit Card Issuer
35
Business Value
Credit Card Issuer
36
Idle Alerts
Waste & Recycling Leader
37
• Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries
Data Shape
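A quick back-of-the-envelope calculation shows what "arriving quickly" means here. The truck count and interval are from the slide; the assumption that every truck reports exactly once per interval, around the clock, is ours and gives an upper bound.

```python
TRUCKS = 20_000          # from the slide
INTERVAL_SECONDS = 5     # from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(trucks, interval_seconds):
    """Average arrival rate if every truck reports once per interval."""
    return trucks / interval_seconds


def events_per_day(trucks, interval_seconds):
    """Daily volume under the same always-reporting assumption."""
    return events_per_second(trucks, interval_seconds) * SECONDS_PER_DAY


rate = events_per_second(TRUCKS, INTERVAL_SECONDS)   # 4,000 events per second
daily = events_per_day(TRUCKS, INTERVAL_SECONDS)     # 345,600,000 events per day
```

That is about 4,000 location events per second, or roughly 346 million per day at most.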
38
Techniques
[Diagram labels: Truck Geolocation Data; Realtime Stream Computation (Storm); Batch Computation (MapReduce); Immediate Alerts; Tax Reduction Reporting; Hadoop Storage; Shortest Path Graph Algorithm (Titan); Route Optimization.]
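The slide names a shortest path graph algorithm for route optimization without saying which one. As one common choice, here is a minimal Dijkstra sketch over a weighted road graph held in a plain dictionary. It does not use Titan's API; the function name, the graph layout, and the sample road network are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm over graph[node][neighbor] = non-negative cost.

    Returns (total_cost, path), or (inf, []) if the target is unreachable.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical road network; edge weights could be minutes of driving.
roads = {
    "depot": {"a": 4, "b": 1},
    "b": {"a": 2, "landfill": 7},
    "a": {"landfill": 3},
    "landfill": {},
}

cost, path = shortest_path(roads, "depot", "landfill")
```

In this example the cheapest route goes depot, b, a, landfill with a total cost of 6, beating the direct depot to a leg.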
39
Business Value
40
Social Engagement Application
Beverage Company
41
• Tweets, FB Messages
• Person, Activity links
• Graph Traversal
Data Shape
42
Consumer Activity Graph
Wal*Mart.com
CVS
Dollar General
Ebay
Ebay Motors
Toys R Us
StubHub
Shopping.com
Sam’s
43
Techniques
Property Graph (Titan)
Key/Value Store (MapR M7)
Social Activity Stream
Graph Traversal (Faunus/Fulgora)
44
Business Value
45
Fraud Detection
Data Lake
46
• Anti-Money Laundering
• Consumer Transactions
Data Sources
47
Techniques
Anti-Money Laundering System
Consumer Transactions System
48
Techniques
AML
Consumer Transactions
Data Lake(Hadoop)
Suspicious Events
Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
Analyst
49
Business Value
50
Machine Learning
Search Relevance
DNA Matching
51
• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)
Data Sources
52
Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching
53
Business Value
54
Traffic Analytics
55
• Inrix Road Segment Data
– Avg Speed / minute / segment
– Reference Speeds
• Road Segment Geolocation Data
Data Sources
56
Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
– Alternate Routes
• Predictive Congestion Analysis
– Growth & Term Assumptions
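The slide does not spell out how the time offset correlations are computed. One plausible reading is a lagged correlation between the per-minute average speeds of two road segments: shift one series by a few minutes and keep the shift that correlates best. The sketch below assumes that reading, uses plain Pearson correlation, and expects two series of equal length; the function names and the speed values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_offset(upstream, downstream, max_offset):
    """Find the lag (in samples) at which downstream best follows upstream."""
    best = (0, pearson(upstream, downstream))
    for offset in range(1, max_offset + 1):
        # Pair upstream[t] with downstream[t + offset].
        r = pearson(upstream[:-offset], downstream[offset:])
        if r > best[1]:
            best = (offset, r)
    return best


# Hypothetical average speeds per minute on two neighboring segments.
upstream = [60, 58, 40, 30, 45, 59, 61, 60]
downstream = [61, 60, 59, 41, 31, 46, 60, 62]

offset, r = best_offset(upstream, downstream, max_offset=3)
```

With these made-up series the slowdown on the second segment trails the first by one minute, so the best offset is 1.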
57
58
59
Business Value
60
Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
– Hadoop, Storm, Solr, MPP, Visualizations
• Increase Revenue
• Decrease Costs
61
Questions?