
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Data Platforms


DESCRIPTION

Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.

Transcript

Page 1: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Data Platforms

Jeffrey Breen, Director, Think Big Academy

November 2014

Page 2: Outline


• Introduction

• Hadoop and NoSQL: What? Where? Why? When?

• Document-Oriented NoSQL and Hadoop

• Example: Add Statefulness

• Example: Analytics Store

• Example: Secondary Index

− Caution: contains code

• MongoDB Connector for Hadoop


Page 3: Delivering Business Value Through Big Data



• Leading Provider of Big Data Solutions & Support

• Exclusive Focus on Big Data Tools, Technologies, and Techniques

• Onshore Team-Based Engineering and Data Science Methodology

• Prebuilt, Proven Components to Accelerate Delivery & Lower Risk

Page 4: We Accelerate Your Time to Value



• Agile Methodology

• Experiment-Driven Short Sprints with Quick Release Cycles

• Breaking Down Business and IT Barriers

• Discrete Projects with Beginning and End

• Early Releases to Validate ROI and Ensure Long Term Success

(Diagram: Data Engineers + Data Scientists + Business Goals → Innovation and Value)

Page 5: Jeffrey Breen


Director, Think Big Academy

Principal Consultant and Hands-on Architect

IT guy, Data guy, Open Source guy

Pilot and Airplane Geek

Twitter: @JeffreyBreen

[email protected]


Page 6: Hadoop and NoSQL


• Not “either-or”

− When together? Where? For what?

• Hadoop

− Not a database
− Low cost storage with fault tolerance
− Batch-oriented analytics (MapReduce, Hive, Pig)
− Not good for random access and/or updates

• NoSQL

− Real databases with CRUD
− Optimized for fast, random access
− Many shapes and sizes (key-value, tabular, graph, document-oriented)


Page 7: Reference Architecture



Page 8: Document-Oriented NoSQL with Hadoop


• Advantages

− Simple but flexible data model
− Field-level indexing for fast querying
− Easy and open APIs and data exchange formats

• Examples

1. Add Statefulness. Preserve state between jobs and other stateless operations.

2. Analytics Store. Provide high performance destination for calculations and metrics.

3. Secondary Indexing. Add low-latency querying and access for high-latency data stores like HDFS.


Page 9: Example: Add Statefulness


Overview

- Sometimes you just need a fast and safe place to store data between jobs, applications, iterations

Scenarios

- Data extraction jobs
- Ingestion processing status
- Broadcasting “last best” parameters in machine learning, genetic algorithms, and other model fitting

{
  "process": "db-extractor",
  "system": "database1",
  "tables": {
    "table1": { "columns": ["ts"], "values": ["2014-03-25 03:15:23"] },
    "table2": { "columns": ["client_id"], "values": ["43110221"] }
  }
}

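As a minimal sketch (not from the original slides), a job could read and update a state document like the one above with pymongo; assumes pymongo 3+ and an illustrative etl.job_state database/collection:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
state = client.etl.job_state  # one state document per (process, system) pair

def get_high_water_mark(process, system, table):
    # Look up the marker saved by the previous run, if any
    doc = state.find_one({"process": process, "system": system})
    return doc.get("tables", {}).get(table) if doc else None

def save_high_water_mark(process, system, table, columns, values):
    # Upsert the new marker so the next (otherwise stateless) run can resume
    state.update_one(
        {"process": process, "system": system},
        {"$set": {"tables." + table: {"columns": columns, "values": values}}},
        upsert=True)

last = get_high_water_mark("db-extractor", "database1", "table1")
# ... extract only rows newer than last["values"][0] ...
save_high_water_mark("db-extractor", "database1", "table1",
                     ["ts"], ["2014-03-25 03:15:23"])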

Page 10: Example: Analytics Store


• Great place to store aggregates and other calculated metrics

• Can be populated from batch or streaming analytics

• Great for serving live dashboards and reporting


{
  "metric": "session-length",
  "visitor": "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}",
  "visit-start": "2014-03-25 03:15:23",
  "data": { "value": 245.3, "units": "seconds" }
}
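A minimal sketch (not from the original slides) of both sides of this pattern with pymongo 3+; the clickstream.metrics namespace is illustrative:

from pymongo import MongoClient

metrics = MongoClient("mongodb://localhost:27017").clickstream.metrics

# Writer side: a batch or streaming job publishes the computed metric
metrics.update_one(
    {"metric": "session-length",
     "visitor": "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}"},
    {"$set": {"visit-start": "2014-03-25 03:15:23",
              "data": {"value": 245.3, "units": "seconds"}}},
    upsert=True)

# Reader side: a dashboard or report fetches values with an ordinary query
for doc in metrics.find({"metric": "session-length"}).limit(10):
    print(doc["visitor"], doc["data"]["value"], doc["data"]["units"])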

Page 11: Example: Secondary Indexing


• HDFS is optimized for scans; seeks are very expensive

• As in relational databases, secondary indexes can be created on specific elements

• Hive even has indexing built in, but keeps the results on HDFS (still not optimized for seeks)

• Solution: Use separate NoSQL database for secondary indexes


Page 12: Sample Clickstream Data


• Sample Omniture clickstream files are available from Hortonworks

− 420,000+ page views over 15 days
− https://s3.amazonaws.com/hw-sandbox/tutorial8/RefineDemoData.zip

• Example records combine web page and visitor information, including geocoding:

1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} U en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 0 ABC 0 120 ABC 0

1331434006 2012-03-10 18:46:46 2850864012585216412 6917530841728651042 FAS-2.8-AS3 N 0 24.6.122.234 1 0 10 http://www.acme.com/SH55126545/VD55177927 {52B4FFFE-606A-1C2B-77E7-F62057879CC8} U en-us 574 0 0 U U Y 0 0 304 comcast.net 10/2/2012 18:17:59 6 480 45 2 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1 71 0 37 2 0 los gatos usa 807 ca 0 0 0 0 0 KGO 0 120 KGO


Page 13: Time-Partitioned Data


• Time is a very common dimension on which to organize data

• Great for processing incoming data and for filtering any time-based queries…

• …but can complicate other access patterns

/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=1/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=2/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=3/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=4/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=5/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=6/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=7/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=8/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=9/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0
[…]


Hive partitions correspond to directories on HDFS

Page 14: Welcome to the Long Tail


Top 10 ≃ Bottom 2000

Distribution of geographic locations detected in clickstream data:

> sum(subset(df, rank <= 10)$count)
[1] 36986
> sum(subset(df, rank > max(df$rank) - 2000)$count)
[1] 33971

In this sample clickstream data set, the top 10 cities account for more traffic than the bottom 2,000 combined

Optimizations are usually designed for the most common cases

- “Biggest bang for the buck” due to size, frequency, etc.

- What are the chances that the optimizations you pick to handle the most common cases work well for the long tail?

- What if a new business opportunity depends on the long tail?



Page 15: Secondary Indexing in Hive


• Hive has built-in facilities to index data

create index location on table omniture_daily(city, state, country)
as 'COMPACT' with deferred rebuild;

alter index location on omniture_daily rebuild;

• Index stores pointers to locations of each found record (path, file, and byte offset)

• However, resulting index is partitioned the same way as the underlying table


Page 16: Exporting Hive Data as JSON


• Hive can easily read/write JSON data via a SerDe:

− https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master

add jar json-serde-1.1.9.2-Hive13-jar-with-dependencies.jar;

create table json_export (
  city string,
  country string,
  state string,
  bucketname string,
  offsets array<bigint>,
  year int,
  month int,
  day int
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

insert into table json_export select * from default__omniture_daily_location__;


Column parsing is determined by Hive SerDe classes, while reading and writing the underlying files is handled by Hadoop's InputFormat and OutputFormat classes.

Page 17: Sample Index Entry


Hive indices contain physical location of original data, including byte offsets:

{
  "city": "taunton",
  "state": "ma",
  "country": "usa",
  "bucketname": "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
  "offsets": [ 4748045, 3522685 ],
  "year": 2012,
  "month": 3,
  "day": 10
}


Page 18: Exporting Index Data to Mongo


• Since our Hive index data is now stored on HDFS as JSON, it’s very easy to load into Mongo directly.

• Don’t do this in production, but that’s what makes simple examples so much fun:


$ hadoop fs -text /apps/hive/warehouse/json_export/000000_0 | \
    mongoimport --host localhost --db clickstream --collection locidx

connected to: localhost
Sat Sep 27 10:30:22.325 100 16/second
Sat Sep 27 10:30:24.448 check 9 12262
Sat Sep 27 10:30:24.449 imported 12262 objects
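For something a little closer to production, here is a minimal pymongo 3+ sketch (not from the original slides) that streams the same export, inserts in batches, and indexes the fields queried on the next slide; the batch size is arbitrary:

import json
import subprocess
from pymongo import MongoClient

locidx = MongoClient("mongodb://localhost:27017").clickstream.locidx

# Stream the JSON export off HDFS, same as the `hadoop fs -text` above
proc = subprocess.Popen(
    ["hadoop", "fs", "-text", "/apps/hive/warehouse/json_export/000000_0"],
    stdout=subprocess.PIPE)

batch = []
for line in proc.stdout:
    batch.append(json.loads(line))
    if len(batch) == 1000:  # insert in chunks to bound memory
        locidx.insert_many(batch)
        batch = []
if batch:
    locidx.insert_many(batch)

# Index the lookup fields so {state, city} queries avoid a collection scan
locidx.create_index([("state", 1), ("city", 1)])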

Page 19: Querying the Index in Mongo


$ mongo localhost
MongoDB shell version: 2.4.6
connecting to: localhost
> use clickstream;
switched to db clickstream
> db.locidx.find( {'state':'ma', 'city':'taunton'} );
{ "_id" : ObjectId("5426f42e6a6b0b1939528f80"),
  "bucketname" : "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
  "offsets" : [ 4748045, 3522685 ], "month" : 3, "state" : "ma", "year" : 2012, "day" : 10, "country" : "usa", "city" : "taunton" }


The bucketname field identifies the specific file on HDFS containing the records of interest; the offsets array gives the byte offsets of those records within that file.

Page 20: Using the Index Data to Retrieve the Original Data


$ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=3522685&length=615'; echo

1331431385 2012-03-10 18:03:05 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 20 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC

$ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=4748045&length=615'; echo

1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC

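Putting the two halves together, a minimal Python sketch (not from the original slides) that queries the Mongo index and then reads each record over WebHDFS; the fixed 615-byte read length mirrors the curl examples above, and a more careful client would read a larger chunk and split at the first newline:

import requests
from pymongo import MongoClient

locidx = MongoClient("mongodb://localhost:27017").clickstream.locidx
WEBHDFS = "http://sandbox.hortonworks.com:50070/webhdfs/v1"

for entry in locidx.find({"state": "ma", "city": "taunton"}):
    # bucketname looks like hdfs://host:port/path; keep only the path part
    path = "/" + entry["bucketname"].split("/", 3)[3]
    for offset in entry["offsets"]:
        r = requests.get(WEBHDFS + path,
                         params={"op": "OPEN", "offset": offset, "length": 615},
                         allow_redirects=True)  # like curl -L: follow to the datanode
        print(r.text)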

Page 21: So what’s the right way to do it?


Check out the MongoDB Connector for Hadoop

• Available at https://github.com/mongodb/mongo-hadoop

• Contains a “storage engine” to connect Hive directly to MongoDB for live querying

• Provides a Hive SerDe for direct access to static BSON files (i.e., backup files)

• Allows Hadoop Streaming jobs (Python, Perl, R, etc.) access to Mongo data (see the sketch after this list)

• And more

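As a hypothetical illustration of the streaming support, here is a mapper modeled on the Python examples shipped with mongo-hadoop (the pymongo_hadoop package); treat the exact API as an assumption and check the repository's streaming documentation before relying on it:

# Assumed API: pymongo_hadoop ships alongside the mongo-hadoop connector
from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Each input document is one BSON record; count page views per city
    for doc in documents:
        yield {"_id": doc.get("city", "unknown"), "count": 1}

BSONMapper(mapper)  # wires the generator into Hadoop Streaming's stdin/stdout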

Page 22: Think Big. Start Smart. Scale Fast.

DATA SCIENTISTS | DATA ARCHITECTS | DATA SOLUTIONS

Work with the Leading Innovator in Big Data