Dan Bassett, Jonathan Canfield December 13, 2011

What is Hadoop?

• Allows for the distributed processing of large data sets across clusters of computers
• Open-source project written in Java
• Actively supported
• Inspired by a project that Google started

What’s the big deal?

• Changes the economics and dynamics of large-scale computing
• Scalable
• Cost effective
• Flexible
• Fault tolerant

Commercially supported

• InfoSphere BigInsights
• Silicon Graphics CloudRack
• EMC Greenplum
• Google App Engine
• Oracle Big Data Appliance
• Cloudera CDH, Professional Services
• Microsoft Windows Server, SQL Server

Who Uses Hadoop?

Prominent Users

• Facebook - claims to have the largest Hadoop cluster in the world, at 30 PB.
• Yahoo! - claims to have the world’s largest Hadoop production application.
• eBay - 5.3 PB on a 532-node cluster.
• New York Times - processed 4 TB of image data into 11 million PDFs at a cost of about $240.
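To put those figures on a per-unit basis, here is a small back-of-the-envelope calculation. It uses only the numbers quoted on the slide; the decimal conversion of 1 PB = 1000 TB is an assumption, since the slide does not say which units it means.

```python
# Figures quoted on the "Prominent Users" slide.
EBAY_STORAGE_PB = 5.3
EBAY_NODES = 532
NYT_PDFS = 11_000_000
NYT_COST_USD = 240

# Assumption: decimal units (1 PB = 1000 TB); the slide does not specify.
TB_PER_PB = 1000

# Average storage held by each eBay node.
tb_per_node = EBAY_STORAGE_PB * TB_PER_PB / EBAY_NODES

# Cost of the New York Times job per thousand PDFs produced.
cost_per_thousand_pdfs = NYT_COST_USD / NYT_PDFS * 1000

print(f"eBay: about {tb_per_node:.1f} TB per node")
print(f"NYT: about ${cost_per_thousand_pdfs:.3f} per 1,000 PDFs")
```

That works out to roughly 10 TB per eBay node and about two cents per thousand PDFs for the New York Times job.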

HOW DOES IT WORK?

Architecture

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• MapReduce Engine

File System (HDFS)

• One big file system from many nodes
• Fault-tolerant
• Runs on low-cost commodity hardware
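As a rough illustration of how a file system spread over many nodes can tolerate a node failure, here is a toy model in Python. It is not the HDFS API: the block size, the replication factor of 3, the node names, and every function name are assumptions made up for this sketch.

```python
BLOCK_SIZE = 4    # bytes per block; hypothetical and tiny so the example stays readable
REPLICATION = 3   # copies kept of each block; assumed for this sketch


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(len(blocks))
    }


def read_file(blocks, placement, live_nodes):
    """Reassemble the file, requiring at least one live replica per block."""
    out = []
    for block_id, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[block_id]):
            raise IOError(f"block {block_id} has no live replica")
        out.append(block)
    return b"".join(out)


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    data = b"hello hadoop"
    blocks = split_into_blocks(data)
    placement = place_blocks(blocks, nodes)
    # node3 has failed, yet every block still has a replica on a live node.
    print(read_file(blocks, placement, {"node1", "node2", "node4"}))
```

Because each block lives on several nodes, losing one node leaves the file readable. In this toy model a read only fails once every node holding some block is down.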

MapReduce Engine

• Splits input data
• Assigns work to nodes
• Processed in parallel
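The three points above can be sketched as a word count in plain Python. This is a single-machine illustration of the MapReduce idea, not Hadoop's Java API: worker threads stand in for cluster nodes, and all function names here are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Split the input into `num_splits` chunks, one per worker."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_words(chunk):
    """Map phase: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines, num_workers=3):
    chunks = split_input(lines, num_workers)
    # Threads stand in for cluster nodes; on a real cluster the map tasks
    # would run on separate machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(text))
```

Each chunk is mapped independently, which is what lets the work be spread across nodes. Only the grouping step needs to see the output of every map task.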

MapReduce Illustration

MapReduce Step 1

MapReduce Step 2

MapReduce Step 3

MapReduce Step 4

MapReduce Step 5

MapReduce Step 6

MapReduce Illustration

Resources

• Project Home: http://hadoop.apache.org/
• Wikipedia: http://en.wikipedia.org/wiki/Apache_Hadoop
• IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/