THE HOLY GRAIL OF DATA ANALYTICS

Dan Lynn, CEO

• Data Services
• Data Strategy
• Data Integration / BI / Analytics
• Modernize Data Infrastructures
• Custom Applications & APIs

• Distributed over 6 states!
• Fully-virtualized staff

www.agildata.com

Dan Lynn, CEO

Co-Founder @ FullContact
15 years building data systems
Techstars
[email protected]

www.agildata.com

All product names, logos, and brands are property of their respective owners. All company, product and service names used are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.

Free MySQL Performance Analyzer

www.agildata.com/gibbs

AgilData Scalable Cluster

TRADE-OFFS

OLTP vs OLAP

OLTP OVERVIEW

• “Online Transaction Processing”

• Database is optimized for low latency access to current data

• Short transactions (INSERT, UPDATE, DELETE)

• High concurrency

• Examples:

• Add item to shopping cart

• Reset password
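
As a rough illustration of the "add item to shopping cart" example, the sketch below runs one short OLTP-style transaction with Python's built-in sqlite3 module. The cart_item table, its columns, and the add_to_cart helper are hypothetical names chosen for this example; they do not come from the slides.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_item ("
    " cart_id INTEGER, sku TEXT, qty INTEGER,"
    " PRIMARY KEY (cart_id, sku))"
)

def add_to_cart(conn, cart_id, sku, qty):
    """One short transaction that touches a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE cart_item SET qty = qty + ? WHERE cart_id = ? AND sku = ?",
            (qty, cart_id, sku),
        )
        if cur.rowcount == 0:  # item not in the cart yet, so insert it
            conn.execute(
                "INSERT INTO cart_item (cart_id, sku, qty) VALUES (?, ?, ?)",
                (cart_id, sku, qty),
            )

add_to_cart(conn, 1, "widget", 2)
add_to_cart(conn, 1, "widget", 1)
row = conn.execute(
    "SELECT qty FROM cart_item WHERE cart_id = 1 AND sku = 'widget'"
).fetchone()
```

Each call reads or writes one row by its key, which is the access pattern an OLTP database is tuned for.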

OLAP OVERVIEW

• Online Analytical Processing

• Database is optimized for aggregation of historical data

• Aggregations can span millions or billions of records

• Low(er) concurrency

• Examples:

• What is our average shopping cart size, grouped by week and by affiliate?

• What are the top 5 paths that users take when navigating our website?
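
The first example question above could be expressed roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module; the cart table and its created_at, affiliate, and item_count columns are assumptions made for illustration, and a real OLAP system would run this kind of query over far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per shopping cart.
conn.execute("CREATE TABLE cart (created_at TEXT, affiliate TEXT, item_count INTEGER)")
conn.executemany(
    "INSERT INTO cart VALUES (?, ?, ?)",
    [
        ("2016-06-06", "a", 2),
        ("2016-06-07", "a", 4),
        ("2016-06-07", "b", 5),
        ("2016-06-13", "a", 1),
    ],
)

# Average cart size, grouped by week of year and by affiliate.
rows = conn.execute(
    "SELECT strftime('%Y-%W', created_at) AS week,"
    "       affiliate,"
    "       AVG(item_count) AS avg_size"
    " FROM cart"
    " GROUP BY week, affiliate"
    " ORDER BY week, affiliate"
).fetchall()

for week, affiliate, avg_size in rows:
    print(week, affiliate, avg_size)
```

Unlike the OLTP case, the query has to read every cart row, but only two or three columns of each.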

HOW DATABASES OPTIMIZE FOR OLTP

• Optimized for reading or updating an entire row (e.g. the full customer record)

• Data is written to and read from disk on a row-by-row basis.

• Indexes are used to construct full business object from multiple tables via JOINs (e.g. SELECT * FROM order o JOIN customer c ON c.id = o.customer_id)

• Hadoop and NoSQL systems generally behave the same.

• Scan performance is limited
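
The JOIN mentioned above can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the column lists, the sample rows, and the order_customer_idx index are assumptions, and the table name is quoted because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "order" (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- Index that lets the engine find a customer's orders without a full scan.
    CREATE INDEX order_customer_idx ON "order" (customer_id);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO "order" VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0);
""")
conn.row_factory = sqlite3.Row  # lets each result row be read by column name

rows = conn.execute(
    'SELECT o.id AS order_id, o.total, c.name'
    ' FROM "order" o JOIN customer c ON c.id = o.customer_id'
    ' WHERE c.id = ?'
    ' ORDER BY o.id',
    (1,),
).fetchall()

# Each result is a full "business object" assembled from both tables.
orders = [dict(r) for r in rows]
```

The lookup touches only the rows for one customer, which is why this pattern stays fast under high concurrency but does not help when a query must scan the whole table.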

HOW DATABASES OPTIMIZE FOR OLAP

• Optimized for aggregating columns (e.g. SELECT AVG(unit_price * qty) FROM order_line GROUP BY c.id)

• Data is laid out on disk on a per-column basis.

• Great for scans, not so good for random row-level access

• Doesn’t support random UPDATEs
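
To make the per-column layout concrete, here is a toy in-memory sketch in Python. It is not how any particular database stores data; the order_line sample values and the helper names avg_line_total and read_row are made up for illustration.

```python
from array import array

# Hypothetical order_line rows: (id, unit_price, qty).
rows = [(1, 9.99, 2), (2, 4.50, 1), (3, 20.00, 3)]

# Column-oriented layout: one contiguous array per column.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "unit_price": array("d", (r[1] for r in rows)),
    "qty": array("q", (r[2] for r in rows)),
}

def avg_line_total(cols):
    """Aggregate by scanning only the two columns the query needs."""
    totals = [p * q for p, q in zip(cols["unit_price"], cols["qty"])]
    return sum(totals) / len(totals)

def read_row(cols, i):
    """Random row access must visit every column array once."""
    return tuple(cols[name][i] for name in cols)

average = avg_line_total(columns)
second_row = read_row(columns, 1)
```

The aggregate never reads the id column at all, while rebuilding or updating a single row means touching every array, which mirrors the trade-off listed above.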

HOW HADOOP OPTIMIZES FOR OLAP

• Data is partitioned in HDFS in append-only blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).

• These blocks are spread out across the cluster.

• Processing (i.e. queries) is sent to the data, instead of bringing the data to the application for processing.

• Columnar data formats like Parquet can be stored on HDFS for very fast scan performance.

• Updates are very expensive.
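
The idea of sending processing to the data can be sketched in plain Python: each block computes a small partial result where it lives, and only those partials are combined. The blocks list and the local_aggregate and combine helpers are hypothetical; this does not use Hadoop or any of its APIs.

```python
from functools import reduce

# Hypothetical blocks of values, imagined as living on different nodes.
blocks = [[3, 5, 7], [10, 20], [1]]

def local_aggregate(block):
    """Runs where the block is stored; returns a small (sum, count) pair."""
    return (sum(block), len(block))

def combine(a, b):
    """Merges two partial results; only these small pairs cross the network."""
    return (a[0] + b[0], a[1] + b[1])

partials = [local_aggregate(b) for b in blocks]
total, count = reduce(combine, partials, (0, 0))
average = total / count
```

Only one pair per block is moved, no matter how large each block is.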

DATABASE: Scan Performance vs. Updatability

THE LAMBDA ARCHITECTURE

• Data Stream (Kafka, etc…)

• Write to HDFS, then Batch Computation (MapReduce, Spark) produces Batch Views

• Speed Layer (Storm, Spark Streaming, Flink, etc…) produces Real-time views

• Serving Layer (HBase, MySQL, PostgreSQL, etc…)
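
A toy sketch of how the pieces of the Lambda Architecture fit together at query time, in plain Python. The event data, the BATCH_CUTOFF value, and all function names are assumptions for illustration; a real deployment would use the systems named above rather than in-memory counters.

```python
from collections import Counter

# Hypothetical page-view events from the data stream: (timestamp, page).
events = [
    (1, "/home"), (2, "/cart"), (3, "/home"),
    (101, "/home"), (102, "/cart"), (103, "/cart"),
]
BATCH_CUTOFF = 100  # events before this timestamp were covered by the last batch run

def batch_view(events, cutoff):
    """Batch layer: periodically recomputed over all older data."""
    return Counter(page for ts, page in events if ts < cutoff)

def realtime_view(events, cutoff):
    """Speed layer: covers only the events the batch run has not seen yet."""
    return Counter(page for ts, page in events if ts >= cutoff)

def query(page, batch, realtime):
    """Serving layer: answers by merging the batch and real-time views."""
    return batch[page] + realtime[page]

batch = batch_view(events, BATCH_CUTOFF)
realtime = realtime_view(events, BATCH_CUTOFF)
home_views = query("/home", batch, realtime)
cart_views = query("/cart", batch, realtime)
```

The same counting logic appears twice, once per layer, which hints at the maintenance cost of keeping two code paths in agreement.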

APACHE KUDU

• Apache Project (incubating)

• Started at Cloudera, growing industry adoption.

• Currently v0.9.1

• 1.0 release likely coming out in September 2016

Source: http://www.slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-on-fast-data

APACHE KUDU USE CASES

• Online Reporting
  • Examples: Operational Data Store, customer-facing analytics, real-time dashboards
  • Workload: Inserts, updates, scans, random lookups

• Time Series
  • Examples: Market analytics, fraud detection, risk monitoring, message queueing
  • Workload: Inserts, updates, scans, random lookups

• Machine Data Analysis
  • Examples: Network threat detection, devops monitoring and alerting
  • Workload: Inserts, scans, random lookups

THE ROAD AHEAD

• Reactive processing

• Dynamic / intelligent indexing

• High performance mutable message queueing

LINKS

• Kudu project website: http://kudu.apache.org/

• Details about OLTP vs OLAP workloads: http://datawarehouse4u.info/OLTP-vs-OLAP.html

• Analyst perspective on Kudu: http://www.dbms2.com/2015/09/28/introduction-to-cloudera-kudu/

www.agildata.com

[email protected]

@danklynn

Thanks!

CREDITS

• Grail image: https://upload.wikimedia.org/wikipedia/commons/1/10/London-Victoria_and_Albert_Museum-Grail-02.jpg

• Balanced scales: https://commons.wikimedia.org/wiki/File:Balanced_scale_of_Justice.svg