16
Apache Hadoop and HBase in the Real World Joey Echeverria @fwiffo #novahug

Hadoop and h base in the real world

Embed Size (px)

Citation preview

Page 1: Hadoop and h base in the real world

Apache Hadoop and HBase in the Real World

Joey Echeverria

@fwiffo

#novahug

Page 2: Hadoop and h base in the real world

2Copyright 2011 Cloudera Inc. All rights reserved

The Plug

We're Training!

Developer Training July 25 to 27

Admin Training July 28 to 29

http://www.cloudera.com/training

We’re Hiring!

Solution Architects, Trainers,

Distributed Systems Engineers

http://www.cloudera.com/careers

Page 3: Hadoop and h base in the real world

3Copyright 2011 Cloudera Inc. All rights reserved

1 Minute Hadoop Recap

HDFS

Distributed file system

Optimized for streaming reads and writes

Block-level replication

MapReduce

Distributed processing framework

Reads/writes data in HDFS (typically)

Operates over (key, value) view of data

Page 4: Hadoop and h base in the real world

4Copyright 2011 Cloudera Inc. All rights reserved

Where does HBase come in?

Google

Google invented GFS and MapReduce

GFS optimized for streaming reads and writes

BigTable

Google's answer to random read/write workloads

Page 5: Hadoop and h base in the real world

5Copyright 2011 Cloudera Inc. All rights reserved

HBase: BigTable-like storage (for Hadoop)

Page 6: Hadoop and h base in the real world

6Copyright 2011 Cloudera Inc. All rights reserved

What is HBase?

Key/value column family store

Data stored in HDFS

ZooKeeper for coordination

Access model is get/put/del

Plus range scans and versions

Page 7: Hadoop and h base in the real world

Image courtsey Lars George, Licensed uner Creative Commons Attribution-Noncommercial

7Copyright 2011 Cloudera Inc. All rights reserved

Architecture

Page 8: Hadoop and h base in the real world

8Copyright 2011 Cloudera Inc. All rights reserved

Tables and Column Families

Static part of the schema

Column families also form locality groups

One Store per family

Multiple HFiles per Store

Tables split into regions

Continuous range of row keys

Unit of distribution

Automatically split

Pre-split for performance

Page 9: Hadoop and h base in the real world

9Copyright 2011 Cloudera Inc. All rights reserved

Why use HBase?

Variable schema in each record

Collections of data for each key

Atomic control of per-key data

Row access to each column family

Page 10: Hadoop and h base in the real world

10Copyright 2011 Cloudera Inc. All rights reserved

HBase Applications

“Smart Data, at Scale, made Easy”

http://www.lilyproject.org

“Distributed, scalable Time Series Database (TSDB)”

http://opentsdb.net

Page 11: Hadoop and h base in the real world

11Copyright 2011 Cloudera Inc. All rights reserved

Real-time ad optimizations

Capturing impressions and serving ads

HBase front-end – to serve models (via memcached)

HBase back-end – to serve pixels and capture cookies

MapReduce to compute models between the two

Page 12: Hadoop and h base in the real world

12Copyright 2011 Cloudera Inc. All rights reserved

Click stream sessionization

Key on userid and time

Seperate table for significant events (e.g. purchase)

Load data using HBase importtsv tool

Sessionization performed by simple scans

Page 13: Hadoop and h base in the real world

13Copyright 2011 Cloudera Inc. All rights reserved

Mozilla - Soccorro

When Firefox crashes, where do reports go?

The Mozilla team gathers those crashes in HBase

Crashes varry widely and change format often

Processors take each individually and parse it out

http://code.google.com/p/socorro

Page 14: Hadoop and h base in the real world

14Copyright 2011 Cloudera Inc. All rights reserved

Navteq

Location based content serving

All served out of HBase, location makes a great key

Content is variable – Maps, POI, User Data

Preprocessing is all done via MR jobs

Page 15: Hadoop and h base in the real world

15Copyright 2011 Cloudera Inc. All rights reserved

Cloudera

Gathers data about customer clusters

Each customer node is a key with Avro values

Easy to browse, quick to find issues on Nodes

Dump to HDFS and process with Pig

Page 16: Hadoop and h base in the real world

16Copyright 2011 Cloudera Inc. All rights reserved