THIRD EDITION Hadoop: The Definitive Guide Tom White O'REILLY® Beijing Cambridge Farnham Köln Sebastopol Tokyo

THIRD EDITION Hadoop: The Definitive Guide •• ••• ••• ••• ••library02.embl.de/InmagicGenie/DocumentFolder/TableO… ·  · 2015-03-03THIRD EDITION Hadoop:


THIRD EDITION

Hadoop: The Definitive Guide

Tom White

O'REILLY® Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo


Table of Contents

Foreword xv
Preface xvii

1. Meet Hadoop 1
Data! 1
Data Storage and Analysis 3
Comparison with Other Systems 4
Relational Database Management System 4
Grid Computing 6
Volunteer Computing 8
A Brief History of Hadoop 9
Apache Hadoop and the Hadoop Ecosystem 12
Hadoop Releases 13
What's Covered in This Book 15
Compatibility 15

2. MapReduce 17
A Weather Dataset 17
Data Format 17
Analyzing the Data with Unix Tools 19
Analyzing the Data with Hadoop 20
Map and Reduce 20
Java MapReduce 22
Scaling Out 30
Data Flow 30
Combiner Functions 33
Running a Distributed MapReduce Job 36
Hadoop Streaming 36
Ruby 36
Python 39


Hadoop Pipes 40
Compiling and Running 41

3. The Hadoop Distributed Filesystem 43
The Design of HDFS 43
HDFS Concepts 45
Blocks 45
Namenodes and Datanodes 46
HDFS Federation 47
HDFS High-Availability 48
The Command-Line Interface 49
Basic Filesystem Operations 50
Hadoop Filesystems 52
Interfaces 53
The Java Interface 55
Reading Data from a Hadoop URL 55
Reading Data Using the FileSystem API 57
Writing Data 60
Directories 62
Querying the Filesystem 62
Deleting Data 67
Data Flow 67
Anatomy of a File Read 67
Anatomy of a File Write 70
Coherency Model 72
Data Ingest with Flume and Sqoop 74
Parallel Copying with distcp 75
Keeping an HDFS Cluster Balanced 76
Hadoop Archives 77
Using Hadoop Archives 77
Limitations 79

4. Hadoop I/O 81
Data Integrity 81
Data Integrity in HDFS 81
LocalFileSystem 82
ChecksumFileSystem 83
Compression 83
Codecs 85
Compression and Input Splits 89
Using Compression in MapReduce 90
Serialization 93
The Writable Interface 94


Writable Classes 96
Implementing a Custom Writable 103
Serialization Frameworks 108
Avro 110
Avro Data Types and Schemas 111
In-Memory Serialization and Deserialization 114
Avro Datafiles 117
Interoperability 118
Schema Resolution 121
Sort Order 123
Avro MapReduce 124
Sorting Using Avro MapReduce 128
Avro MapReduce in Other Languages 130
File-Based Data Structures 130
SequenceFile 130
MapFile 137

5. Developing a MapReduce Application 143
The Configuration API 144
Combining Resources 145
Variable Expansion 146
Setting Up the Development Environment 146
Managing Configuration 148
GenericOptionsParser, Tool, and ToolRunner 150
Writing a Unit Test with MRUnit 154
Mapper 154
Reducer 156
Running Locally on Test Data 157
Running a Job in a Local Job Runner 157
Testing the Driver 160
Running on a Cluster 161
Packaging a Job 162
Launching a Job 163
The MapReduce Web UI 165
Retrieving the Results 168
Debugging a Job 170
Hadoop Logs 175
Remote Debugging 177
Tuning a Job 178
Profiling Tasks 179
MapReduce Workflows 181
Decomposing a Problem into MapReduce Jobs 181
JobControl 183


Apache Oozie 183

6. How MapReduce Works 189
Anatomy of a MapReduce Job Run 189
Classic MapReduce (MapReduce 1) 190
YARN (MapReduce 2) 196
Failures 202
Failures in Classic MapReduce 202
Failures in YARN 204
Job Scheduling 206
The Fair Scheduler 207
The Capacity Scheduler 207
Shuffle and Sort 208
The Map Side 208
The Reduce Side 210
Configuration Tuning 211
Task Execution 214
The Task Execution Environment 215
Speculative Execution 215
Output Committers 217
Task JVM Reuse 219
Skipping Bad Records 220

7. MapReduce Types and Formats 223
MapReduce Types 223
The Default MapReduce Job 227
Input Formats 234
Input Splits and Records 234
Text Input 245
Binary Input 249
Multiple Inputs 250
Database Input (and Output) 251
Output Formats 251
Text Output 252
Binary Output 253
Multiple Outputs 253
Lazy Output 257
Database Output 258

8. MapReduce Features 259
Counters 259
Built-in Counters 259
User-Defined Java Counters 264


User-Defined Streaming Counters 268
Sorting 268
Preparation 269
Partial Sort 270
Total Sort 274
Secondary Sort 277
Joins 283
Map-Side Joins 284
Reduce-Side Joins 285
Side Data Distribution 288
Using the Job Configuration 288
Distributed Cache 289
MapReduce Library Classes 295

9. Setting Up a Hadoop Cluster 297
Cluster Specification 297
Network Topology 299
Cluster Setup and Installation 301
Installing Java 302
Creating a Hadoop User 302
Installing Hadoop 302
Testing the Installation 303
SSH Configuration 303
Hadoop Configuration 304
Configuration Management 305
Environment Settings 307
Important Hadoop Daemon Properties 311
Hadoop Daemon Addresses and Ports 316
Other Hadoop Properties 317
User Account Creation 320
YARN Configuration 320
Important YARN Daemon Properties 321
YARN Daemon Addresses and Ports 324
Security 325
Kerberos and Hadoop 326
Delegation Tokens 328
Other Security Enhancements 329
Benchmarking a Hadoop Cluster 331
Hadoop Benchmarks 331
User Jobs 333
Hadoop in the Cloud 334
Apache Whirr 334


10. Administering Hadoop 339
HDFS 339
Persistent Data Structures 339
Safe Mode 344
Audit Logging 346
Tools 347
Monitoring 351
Logging 352
Metrics 352
Java Management Extensions 355
Maintenance 358
Routine Administration Procedures 358
Commissioning and Decommissioning Nodes 359
Upgrades 362

11. Pig 367
Installing and Running Pig 368
Execution Types 368
Running Pig Programs 370
Grunt 370
Pig Latin Editors 371
An Example 371
Generating Examples 373
Comparison with Databases 374
Pig Latin 375
Structure 376
Statements 377
Expressions 381
Types 382
Schemas 384
Functions 388
Macros 390
User-Defined Functions 391
A Filter UDF 391
An Eval UDF 394
A Load UDF 396
Data Processing Operators 399
Loading and Storing Data 399
Filtering Data 400
Grouping and Joining Data 402
Sorting Data 407
Combining and Splitting Data 408
Pig in Practice 409


Parallelism 409
Parameter Substitution 410

12. Hive 413
Installing Hive 414
The Hive Shell 415
An Example 416
Running Hive 417
Configuring Hive 417
Hive Services 419
The Metastore 421
Comparison with Traditional Databases 423
Schema on Read Versus Schema on Write 423
Updates, Transactions, and Indexes 424
HiveQL 425
Data Types 426
Operators and Functions 428
Tables 429
Managed Tables and External Tables 429
Partitions and Buckets 431
Storage Formats 435
Importing Data 441
Altering Tables 443
Dropping Tables 443
Querying Data 444
Sorting and Aggregating 444
MapReduce Scripts 445
Joins 446
Subqueries 449
Views 450
User-Defined Functions 451
Writing a UDF 452
Writing a UDAF 454

13. HBase 459
HBasics 459
Backdrop 460
Concepts 460
Whirlwind Tour of the Data Model 460
Implementation 461
Installation 464
Test Drive 465
Clients 467


Java 467
Avro, REST, and Thrift 470
Example 472
Schemas 472
Loading Data 473
Web Queries 476
HBase Versus RDBMS 479
Successful Service 480
HBase 481
Use Case: HBase at Streamy.com 481
Praxis 483
Versions 483
HDFS 484
UI 485
Metrics 485
Schema Design 486
Counters 486
Bulk Load 487

14. ZooKeeper 489
Installing and Running ZooKeeper 490
An Example 492
Group Membership in ZooKeeper 492
Creating the Group 493
Joining a Group 495
Listing Members in a Group 496
Deleting a Group 498
The ZooKeeper Service 499
Data Model 499
Operations 501
Implementation 506
Consistency 507
Sessions 509
States 511
Building Applications with ZooKeeper 512
A Configuration Service 512
The Resilient ZooKeeper Application 515
A Lock Service 519
More Distributed Data Structures and Protocols 521
ZooKeeper in Production 522
Resilience and Performance 523
Configuration 524


15. Sqoop 527
Getting Sqoop 527
Sqoop Connectors 529
A Sample Import 529
Text and Binary File Formats 532
Generated Code 532
Additional Serialization Systems 533
Imports: A Deeper Look 533
Controlling the Import 535
Imports and Consistency 536
Direct-mode Imports 536
Working with Imported Data 536
Imported Data and Hive 537
Importing Large Objects 540
Performing an Export 542
Exports: A Deeper Look 543
Exports and Transactionality 545
Exports and SequenceFiles 545

16. Case Studies 547
Hadoop Usage at Last.fm 547
Last.fm: The Social Music Revolution 547
Hadoop at Last.fm 547
Generating Charts with Hadoop 548
The Track Statistics Program 549
Summary 556
Hadoop and Hive at Facebook 556
Hadoop at Facebook 556
Hypothetical Use Case Studies 559
Hive 562
Problems and Future Work 566
Nutch Search Engine 567
Data Structures 568
Selected Examples of Hadoop Data Processing in Nutch 571
Summary 580
Log Processing at Rackspace 581
Requirements/The Problem 581
Brief History 582
Choosing Hadoop 582
Collection and Storage 582
MapReduce for Logs 583
Cascading 589
Fields, Tuples, and Pipes 590


Operations 593
Taps, Schemes, and Flows 594
Cascading in Practice 595
Flexibility 598
Hadoop and Cascading at ShareThis 599
Summary 603
TeraByte Sort on Apache Hadoop 603
Using Pig and Wukong to Explore Billion-edge Network Graphs 607
Measuring Community 609
Everybody's Talkin' at Me: The Twitter Reply Graph 609
Symmetric Links 612
Community Extraction 613

A. Installing Apache Hadoop 617
B. Cloudera's Distribution Including Apache Hadoop 623
C. Preparing the NCDC Weather Data 625
Index 629


Hadoop got its s web search engi handful of comr route became cit having with Nut< as a part of Nutc

We managed to ~ to handle the Wt moreover, that t1

Around that timt We split off the d of Yahoo!, Hado

In 2006, Tom Wi excellent article l in clear prose. I ~ to read as his pre

From the beginn: for the project. U in tweaking the s anyone to use.

Initially, Tom sg ices. Then he mfj MapReduce API work. In all case role of Hadoop c· Management Co

Tom is now are~ he's an expert in easier to use and