Building Analytical Applications on Hadoop

Josh Wills | Director of Data Science

November 2012

About Me

What Are ‘Analytical Applications’?

The Humble Dashboard

Crossfilter with Flight Information

New York Times Electoral Vote Map

New York Times Electoral Vote Map (Detail)

Analytical Applications vs. Frameworks

A Case Study

Developing Analytical Applications

2012: The Predicting of the President

RealClearPolitics

• Simple Average of Polls

• Transparent

• Simple Interactions
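
A simple average of polls can be sketched in a few lines. The poll records below are made up for illustration (they are not real polls), and the field names are hypothetical; the slide only says that the method is a simple, transparent average.

```python
from statistics import mean

# Hypothetical poll margins (candidate A minus candidate B, in points).
# These numbers are invented for illustration only.
polls = [
    {"pollster": "Poll 1", "margin": 2.0},
    {"pollster": "Poll 2", "margin": -1.0},
    {"pollster": "Poll 3", "margin": 3.0},
    {"pollster": "Poll 4", "margin": 4.0},
]


def simple_average(polls):
    """Unweighted mean of the margins: every poll counts equally."""
    return mean(p["margin"] for p in polls)


average_margin = simple_average(polls)
print(f"Average margin: {average_margin:+.1f}")
```

Because every poll carries the same weight, anyone can reproduce the number by hand, which is what makes this approach transparent.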

FiveThirtyEight

• “Foxy” Model

• Opaque

• Simple Interactions with a richer UI

Princeton Election Consortium

• Medians and Polynomials

• Transparent

• Rich Interactions
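
The slide gives only the phrase "Medians and Polynomials". One plausible reading, sketched here as an assumption rather than a description of the Consortium's actual code, is: take the median poll margin in each state, turn it into a win probability, and multiply one polynomial per state to get the full distribution of electoral votes. The state data and the margin-to-probability mapping below are placeholders.

```python
from statistics import median

# Hypothetical inputs: per-state poll margins and electoral votes.
state_polls = {
    "State A": {"votes": 3, "margins": [1.0, 2.0, 6.0]},
    "State B": {"votes": 4, "margins": [-3.0, -1.0, 0.5]},
}


def win_probability(median_margin):
    """Toy linear mapping from margin to win probability (an assumption)."""
    return min(1.0, max(0.0, 0.5 + median_margin / 10.0))


def electoral_vote_distribution(states):
    """Multiply one polynomial per state: (1 - p) + p * x**votes.

    dist[k] is the probability of winning exactly k electoral votes,
    assuming the states are independent of each other.
    """
    dist = [1.0]  # the polynomial "1": zero votes with certainty
    for info in states.values():
        p = win_probability(median(info["margins"]))
        new_dist = [0.0] * (len(dist) + info["votes"])
        for k, prob in enumerate(dist):
            new_dist[k] += prob * (1.0 - p)          # state lost
            new_dist[k + info["votes"]] += prob * p  # state won
        dist = new_dist
    return dist


dist = electoral_vote_distribution(state_polls)
for votes, prob in enumerate(dist):
    if prob > 0:
        print(f"{votes} electoral votes: {prob:.2f}")
```

Using the median rather than the mean makes each state estimate less sensitive to a single outlying poll, and the polynomial product yields the exact distribution without simulation.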

How Did They Do?

A Few of These, Because They’re Fun

A Few of These, Because They’re Fun

A Few of These, Because They’re Fun

Here’s the Rub: One Expert Beat Nate

Index Funds, Hedge Funds, and Warren Buffett

A Brief Introduction to Hadoop

Data Storage in 2001: Databases

• Structured schemas

• Intensive processing done where data is stored

• Somewhat reliable

• Expensive at scale

Data Storage in 2001: Filers

• No schemas, stores any kind of file

• No data processing capability

• Reliable

• Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics

• No individual record is particularly valuable

• Having every record is incredibly valuable

• Web index

• Recommendation systems

• Sensor data

• Market basket analysis

• Online advertising

Introduction to Hadoop

The Hadoop Distributed File System

• Based on the Google File System

• Data stored in large files

• Large block size: 64MB to 256MB per block

• Blocks are replicated to multiple nodes in the cluster
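
The arithmetic behind large blocks and replication can be sketched as follows. The 128MB block size is simply one value inside the range above, and the replication factor of 3 is a chosen example value; both are parameters of this illustration, not figures from the slides.

```python
import math

MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, stored block replicas, raw bytes stored).

    The block size and replication factor are illustrative defaults.
    """
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    replicas = num_blocks * replication
    # The final block only occupies as much space as it holds, so raw
    # storage is the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, replicas, raw_bytes


blocks, replicas, raw = block_layout(1000 * MB)
print(f"{blocks} blocks, {replicas} replicas, {raw // MB} MB of raw storage")
```

A file of 1000MB therefore needs only a handful of block entries in the file system's metadata, while each block still lives on several machines, so losing one node does not lose data.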

Simple, Reliable Processing: MapReduce

• Map Stage

• Embarrassingly parallel

• Shuffle Stage: Large-scale distributed sort

• Reduce Stage

• Process all of the values that have the same key in a single step

• Process the data where it is stored

• Write once and you’re done.
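
The three stages above can be imitated on a single machine with a word count. This is a minimal in-memory sketch of the programming model, not Hadoop's API: the function names are made up for illustration, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(record):
    """Map: emit (key, value) pairs for one record, independent of all others."""
    for word in record.split():
        yield (word.lower(), 1)


def shuffle_stage(pairs):
    """Shuffle: sort by key so that all values for one key end up together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(key, values):
    """Reduce: process every value that shares a key in a single step."""
    return key, sum(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_stage(record)]
    return dict(reduce_stage(key, values) for key, values in shuffle_stage(mapped))


counts = run_job(["hadoop stores data", "hadoop processes data"])
print(counts)
```

Because `map_stage` looks at one record at a time, it can be run on any number of machines without coordination; only the shuffle needs to move data between them.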

Developing Analytical Applications with Hadoop

Novelty is the Enemy of Adoption

The Best Way to Get Started: Apache Hive

• Apache Hive

• Data Warehouse System on top of Hadoop

• SQL-based query language

• SELECT, INSERT, CREATE TABLE

• Includes some MapReduce-specific extensions
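
To show the kind of statements listed above (CREATE TABLE, INSERT, SELECT), the sketch below runs them against Python's built-in sqlite3 module. SQLite stands in for Hive purely so the snippet runs without a cluster: Hive's dialect, type names, and data-loading mechanisms differ in their details, and nothing here is Hive's API. The table and its contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")

# CREATE TABLE: declare a schema for the (made-up) poll data.
conn.execute("CREATE TABLE polls (state TEXT, margin REAL)")

# INSERT: load a few illustrative rows.
conn.executemany(
    "INSERT INTO polls VALUES (?, ?)",
    [("OH", 2.0), ("OH", 4.0), ("FL", -1.0), ("FL", 1.0)],
)

# SELECT: the aggregate query an analyst would write instead of a
# hand-coded MapReduce job.
rows = conn.execute(
    "SELECT state, AVG(margin) AS avg_margin "
    "FROM polls GROUP BY state ORDER BY state"
).fetchall()
conn.close()

for state, avg_margin in rows:
    print(state, avg_margin)
```

The point of the slide carries over: analysts who already know SQL can ask questions of the data without learning a new programming model first.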

Borrowing Abstractions

Improving the UX (http://github.com/cloudera/impala)

Moving Beyond the Abstractions

Making the Abstract Concrete

Cloudera’s Data Science Course

Analytical Applications I Love

The Experiments Dashboard

Adverse Drug Events

Gene Sequencing and Analytics

The Doctor’s Perspective

A Couple of Themes

1. Structure the data in the way that makes sense for the problem.

2. Interactive inputs, not just interactive outputs.

3. Simpler interfaces that yield more sophisticated answers.

Working Towards The Dream

Moving Beyond MapReduce

Developing Analytical Applications

The Cambrian Explosion…of Frameworks

It’s Frameworks All The Way Down: Spark

• Developed at Berkeley’s AMP Lab

• Defines operations on distributed in-memory collections

• Written in Scala

• Supports reading from and writing to HDFS

IFATWD: GraphLab

• Developed at CMU

• Lower-level primitives

• (but higher than MPI)

• Map/Reduce => Update/Sort

• Flexible, allows for asynchronous computations

• Reads from HDFS

Playing with YARN

BranchReduce (http://github.com/cloudera/branchreduce)
