© 2013 IBM Corporation

AVNET – Hadoop Fundamentals I

Romeo Kienzler, IBM Innovation Center Zurich

Hadoop Fundamentals I


IBM Innovation Center DACH/Zurich, Romeo Kienzler



Agenda

1) Welcome
2) What is big data?
3) Introduction to Hadoop
4) BigInsights
5) Hadoop architecture
6) Lab 1 – Core Hadoop
7) MapReduce
8) Lab 2 – MapReduce
9) Pig, Jaql, Hive, BigSQL, SystemT/AQL
10) Lab 3 – Pig, Hive, and Jaql
11) Certification on BigDataUniversity


What is BIG data?


Traditional Business Intelligence / Data Warehousing

...60 percent were unsatisfied with their data warehousing system.¹

¹http://www.information-management.com/issues/20010601/3494-1.html

[Diagram slides: Big Data vs. Hadoop; Business Intelligence / Data Warehouse]


Map-Reduce → Hadoop → BigInsights


Why is Big Data important?

The gap between the data AVAILABLE to an organization and the data an organization can PROCESS is a missed opportunity: organizations are able to process less and less of the available data, so enterprises become "more blind" to new opportunities.

100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net (80 % of them spam and viruses). => Prefiltering is becoming more and more important.


What is BIG data?

Volume – terabytes, petabytes, even exabytes (Store)
Variety – all kinds of data, all kinds of analytics; traditional and non-traditional data sources (Analyze)
Velocity – analyze data in hours instead of days, days instead of weeks
Agility – dynamically responsive, rapid data exploration (Explore)

Volume x Variety x Velocity = Value


BigData Analytics


BigData Analytics – Predictive Analytics


BigData Analytics – Correlation / Text / NLP


BigData Analytics – Feature Extraction

Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately.¹

¹: Wikipedia
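As a minimal illustration of that definition (not tied to any IBM tooling), a bag-of-words sketch in Python reduces a free-text document to a handful of counts over a fixed vocabulary; the vocabulary and sample sentence here are made up:

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    # Tokenize on word characters only, lowercase everything.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # The document is now described by len(vocabulary) numbers
    # instead of its full text: fewer resources, same gist.
    return [counts[term] for term in vocabulary]

vocab = ["hadoop", "data", "sql"]
print(bag_of_words("Big data needs Hadoop; Hadoop stores data.", vocab))
# [2, 2, 0]
```

The resulting vector is what a downstream predictive model would actually consume.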


BigData Analytics – Predictive Analytics

Storage / Data + CPUs / Algorithms → Business Value / Insight


"sometimes it's not who has the best algorithm that wins; it's who has the most data."

(C) Google Inc.

The Unreasonable Effectiveness of Data¹

¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

No sampling => work with the full dataset => long-tail distributions


Real-time / In-Memory Computing: InfoSphere Streams / Watson


The Paris Hilton Problem

Watson Workshop: What is Watson?


Introduction to Hadoop


BigInsights


BigInsights Demonstration


Hadoop Architecture


HDFS – Hadoop Distributed File System
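HDFS stores a file as fixed-size blocks (64 MB by default in Hadoop 1.x), each replicated on several DataNodes, while the NameNode keeps only the block-to-node mapping. A toy sketch of that idea in Python; the node names are invented and the round-robin placement is a simplification of HDFS's real rack-aware policy:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # HDFS 1.x default block size
REPLICATION = 3                  # default replication factor

def place_blocks(file_size, nodes):
    """Split a file into blocks and assign each block to
    REPLICATION distinct nodes, round-robin style."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file -> 4 blocks
print(len(layout), layout[0])
# 4 ['node1', 'node2', 'node3']
```

Losing any single node never loses a block, because each block lives on three nodes; this is what lets Hadoop run on unreliable commodity hardware.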


Lab 1 – Hadoop Architecture

1) Start from chapter 1.2
2) Replace /home/biadmin with /home/biadminX where X is your user ID
3) In chapter 1.3 skip task 1.3.1._1 and go to http://10.199.20.51:8080 instead
4) Skip 1.3.5
5) In chapter 1.3.6._30 use any file you like on your desktop computer


Map-Reduce


Data Parallelism


Aggregated bandwidth between CPU, main memory, and hard drive

Scanning 1 TB (at 10 GByte/s per node):

- 1 node - 100 sec
- 10 nodes - 10 sec
- 100 nodes - 1 sec
- 1000 nodes - 100 msec
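This data-parallel scan is exactly the pattern MapReduce automates: a map phase over input splits, a shuffle that groups intermediate pairs by key, and a reduce phase per key. A pure-Python sketch of the three phases using the classic word-count example (no cluster, just the control flow):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in one input split.
    return [(w, 1) for w in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Each reducer sees one key with all its values.
    return (key, sum(values))

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the framework runs many map and reduce tasks in parallel and moves the computation to where the blocks live; only the two user-written functions change per job.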


Lab 2 – MapReduce

1) Skip task 1.1._1, use PuTTY to connect to [email protected] instead
2) Replace /home/biadmin with /home/biadminX where X is your user ID
3) In 1.1._4 - 1.1._6 replace output with /home/biadminX/output where X is your user ID
4) Skip chapter 1.2
5) Chapter 1.3 is optional (using your local virtual machine), maybe during lunch break :)


Pig, Jaql, Hive, BigSQL, SystemT/AQL


SQL for BigInsights

Data warehouse augmentation is a very common use case for Hadoop.

While highly scalable, MapReduce is notoriously difficult to use:
– The Java API is tedious and requires programming expertise
– Unfamiliar languages (e.g. Pig) also require expertise
– Many different file formats, storage mechanisms, configuration options, etc.
– Joins, grouping, and sorting are tedious to orchestrate

SQL support opens the data to a much wider audience:
– Familiar, widely known syntax
– Common catalog for identifying data and structure
– Clear separation of defining the what (you want) vs. the how (to get it)
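The what-vs-how split is easy to make concrete: a one-line `SELECT dept, COUNT(*) FROM employees GROUP BY dept` states only the result wanted, while MapReduce-style code must orchestrate the emit/group/aggregate steps by hand. A minimal Python sketch of that imperative side (table and column names are invented for illustration):

```python
from collections import defaultdict

def group_count(rows, key):
    # Hand-rolled equivalent of: SELECT key, COUNT(*) ... GROUP BY key
    counts = defaultdict(int)
    for row in rows:           # "map": visit every record
        counts[row[key]] += 1  # "shuffle + reduce": group by key and sum
    return dict(counts)

employees = [{"dept": "sales"}, {"dept": "eng"}, {"dept": "sales"}]
print(group_count(employees, "dept"))
# {'sales': 2, 'eng': 1}
```

Every new grouping, join, or sort means more of this plumbing; the SQL engine generates it from the declarative query instead.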


Query Processing

Big SQL consists of two query processing engines:
– The SQL optimization engine
– Jaql as the query execution engine

[Diagram: client → SQL engine (optimizer) → Jaql runtime]


Big SQL vs. Alternatives

There are a number of SQL solutions; where does Big SQL fit in?

Hive
– Open source
  • Established Hadoop component
  • Active development community
– Restrictive SQL syntax
  • No subqueries (Hive 0.11 adds non-correlated subquery support)
  • No windowed aggregates (Hive 0.11 adds windowed aggregate support)
  • ANSI join syntax only
– Limited type support
  • No varchar(n), decimal(p,s), etc.
– Poor client support
  • Limited JDBC and ODBC drivers
– Poor low-latency query support (via local MapReduce)


Big SQL vs. Alternatives (cont.)

Impala
– Recently open sourced
– Achieves low latency by bypassing the MapReduce infrastructure
  • Installs a completely separate execution infrastructure
  • Can lead to resource scheduling conflicts
– Execution engine is C++
  • Great for performance, but makes extending it difficult (e.g. UDFs and UDAs)
  • Support for a limited set of file formats
– Currently limited to broadcast joins
  • All tables must fit in memory (aggregate cluster memory)
  • Scalability limitation for larger clusters
– Uses Hive 0.9 query syntax (more limitations than the current Hive)
– Uses Hive 0.9 type system (more limitations than the current Hive)
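The broadcast-join restriction is easy to picture: the small table is copied (broadcast) to every worker and held in memory as a hash map, so each worker joins its own slice of the big table locally without a shuffle. A sketch of what one worker does; the tables and key names are invented for illustration:

```python
def broadcast_join(small_table, big_table_partition, key):
    # Build an in-memory hash map of the broadcast (small) table...
    lookup = {row[key]: row for row in small_table}
    # ...then probe it for every row of this worker's partition.
    joined = []
    for row in big_table_partition:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

depts = [{"dept_id": 1, "dept": "sales"}, {"dept_id": 2, "dept": "eng"}]
emps = [{"dept_id": 1, "name": "ann"}, {"dept_id": 3, "name": "bob"}]
print(broadcast_join(depts, emps, "dept_id"))
# [{'dept_id': 1, 'name': 'ann', 'dept': 'sales'}]
```

The catch the slide points at: the `lookup` table must fit in every worker's memory, which caps how large the "small" side may grow on a big cluster.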


Lab 3 – Querying Data with Pig, Hive, Jaql

1) PuTTY to [email protected]
2) Skip task 1.1._2, start the Jaql shell using the command /opt/ibm/biginsights/jaql/bin/jaqlshell
3) In 1.1._5 replace biadmin with biadminX where X is your user ID
4) Skip chapter 1.2 (optional using virtual machine)
5) In 1.3._2 replace biadmin with biadminX where X is your user ID
6) Instead of task 1.3._2 type /opt/ibm/biginsights/pig/bin/pig
7) In 1.3._4 replace sampleData/NewsGroups.csv with /user/biadminX/sampleData/NewsGroups.csv
8) Skip chapter 1.4 (optional using virtual machine)
9) Skip 1.5._12 and _13 and type /opt/ibm/biginsights/hive/bin/hive instead
10) Type "use biadminX" where X is your user ID
11) Continue with task _14


NoSQL Databases

Column store
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB

JSON / document store
– MongoDB
– CouchDB

Key / value store
– Amazon DynamoDB
– Voldemort

Graph DBs
– DB2 SPARQL extension
– Neo4j

MPP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum

http://nosql-database.org/ lists more than 150 systems.


CAP theorem / Brewer's theorem:¹ it is impossible for a distributed computer system to simultaneously guarantee all three properties:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request knows whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)

What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability

BASE, the new ACID:
– Basically Available
– Soft state
– Eventual consistency
  • Monotonic read consistency
  • Monotonic write consistency
  • Read your own writes
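A toy two-replica key-value store makes the BASE trade-off concrete: a write lands on one replica first, a read from the other replica may return stale data (soft state), and a later anti-entropy sync makes the replicas converge (eventual consistency). Purely illustrative, not modeled on any particular product:

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def write(self, key, value, version):
        self.data[key] = (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))[0]

def sync(a, b):
    # Anti-entropy: for each key, both replicas keep the newest version.
    for key in set(a.data) | set(b.data):
        newest = max(a.data.get(key, (None, 0)),
                     b.data.get(key, (None, 0)),
                     key=lambda pair: pair[1])
        a.data[key] = b.data[key] = newest

r1, r2 = Replica(), Replica()
r1.write("x", "new", version=2)  # write accepted by one replica only
stale = r2.read("x")             # None: r2 has not seen the write yet
sync(r1, r2)                     # replicas converge
assert r2.read("x") == "new"
```

The stale read is exactly what a CP system would refuse to serve; an AP system serves it anyway and relies on convergence later.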


Certification

1) Go to www.bigdatauniversity.com
2) Search for "hadoop fundamentals"
3) Choose "Hadoop Fundamentals I – Version 2"
4) Sign up, then login with an existing account or one of the following:

Take the test:


Questions?