MongoDB and Hadoop: Driving Business Insights

DESCRIPTION

MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other, enabling complex analyses and greater intelligence. We will then dive into the MongoDB Connector for Hadoop, show how it can be applied to deliver new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.

MongoDB and Hadoop

Luke Lovett

Software Engineer, MongoDB

Agenda

• Complementary Approaches to Data

• MongoDB & Hadoop Use Cases

• MongoDB Connector Overview and Features

• Demo

Complementary Approaches to Data

Operational: MongoDB

[Diagram: spectrum of enterprise use cases spanning operational and analytical workloads: real-time analytics, product/asset catalogs, security & fraud, Internet of Things, mobile apps, customer data management, single view, social, churn analysis, recommenders, warehouse & ETL, risk modeling, trade surveillance, predictive analytics, ad targeting, sentiment analysis]

MongoDB

• Store and read data frequently

• Easy administration

• Built-in analytical tools

– aggregation framework (see the sketch below)

– JavaScript MapReduce

– Geo/text indexes
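
To make the aggregation framework concrete, here is a minimal sketch (not part of the original deck) using the MongoDB Java driver (3.7+). The connection string, the test.orders namespace, and the customerId/total fields are hypothetical.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;

public class AggregationSketch {
    public static void main(String[] args) {
        // Hypothetical connection string and namespace
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("test").getCollection("orders");

            // Count orders of at least 100 per customer, largest counts first
            orders.aggregate(Arrays.asList(
                    Aggregates.match(Filters.gte("total", 100)),
                    Aggregates.group("$customerId", Accumulators.sum("orders", 1)),
                    Aggregates.sort(Sorts.descending("orders"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}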

Analytical: Hadoop

[Diagram: the same use-case spectrum, repeated for the analytical side]

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• Terabyte and Petabyte datasets

• Data warehousing

• Advanced analytics

Operational vs. Analytical: Lifecycle

[Diagram: the same use-case spectrum, shown across the operational-to-analytical data lifecycle]

MongoDB & Hadoop Use Cases

Batch Aggregation

[Diagram: applications powered by MongoDB and analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]

● Need more than MongoDB aggregation

● Need offline processing

● Results sent back to MongoDB

● Can be left as BSON on HDFS for further analysis

Commerce

[Diagram: commerce applications powered by MongoDB and analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]

• Products & Inventory
• Recommended products
• Customer profile
• Session management
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

Fraud Detection

[Diagram: online payments processing writes payments to MongoDB; the MongoDB Connector for Hadoop feeds nightly fraud-modeling analysis in Hadoop, together with 3rd-party data sources; results flow back to a MongoDB results cache that the fraud detection service queries (query only)]

MongoDB Connector for Hadoop

Connector Overview

[Diagram: the Hadoop Connector links MongoDB (single node, replica set, or sharded cluster) and BSON files to Hadoop (MapReduce, Hive, Pig, Spark), with data stored in HDFS / S3 as text or BSON files; supported distributions include Apache Hadoop, Cloudera CDH, Hortonworks HDP, and Amazon EMR]

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS:

• Dynamic queries work with the most recent data, but put load on the operational database
• Snapshots move load to Hadoop and add predictable load to MongoDB

Connector Operation

1. Split according to the given InputFormat
   – many options available for reading from a live cluster
   – configure key pattern and split strategy
2. Write splits file
3. Output to BSON files or a live MongoDB deployment
   – BSON file splits written automatically for future tasks
   – MongoDB insertions are round-robin across output collections

Getting Splits

• Split on a sharded cluster
  – Split by chunk
  – Split by shard
• Splits on a replica set or standalone
  – splitVector command
• BSON files
  – specify max docs
  – split per input file

[Diagram: a sharded MongoDB cluster with a mongos router, config servers, and three shards, each holding several chunks]

MapReduce Configuration

• MongoDB input
  – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  – mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output
  – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  – mongo.output.uri = mongodb://mydb:27017/db1.collection2
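
To show how these properties fit into a job, here is a minimal sketch (not from the original deck) of a MapReduce job that counts documents per category and writes the counts back to MongoDB. The URIs and the "category" field are placeholders; the mapper and reducer signatures mirror those used by the connector's example jobs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class MongoCategoryCountJob {

    // Documents arrive as (_id, BSONObject); "category" is a hypothetical field.
    public static class CategoryMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, BSONObject doc, Context context)
                throws IOException, InterruptedException {
            Object category = doc.get("category");
            if (category != null) {
                context.write(new Text(category.toString()), ONE);
            }
        }
    }

    // Each reducer output becomes a document in the output collection.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, BSONWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new BSONWritable(new BasicBSONObject("count", sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "count-by-category");
        job.setJarByClass(MongoCategoryCountJob.class);
        // Jobs built on the connector's MongoTool can instead pick these classes
        // up from the mongo.job.* properties listed above.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapperClass(CategoryMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}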

MapReduce Configuration

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  – mapred.input.dir = hdfs:///tmp/database.bson
  – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  – mapred.output.dir = hdfs:///tmp/output.bson

Spark Usage

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using the SparkContext Hadoop file API
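
Putting those bullets together, a minimal Spark sketch in Java (not the demo application from the talk) might look like the following. Host names and collection names are placeholders, and the path passed to saveAsNewAPIHadoopFile is unused by MongoOutputFormat.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class SparkMongoSketch {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("mongo-spark-sketch"));

        // Input side: a Configuration carrying the collection URI for MongoInputFormat
        Configuration inputConf = new Configuration();
        inputConf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");

        // Each record is (_id, document): key Object, value BSONObject
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
            inputConf, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("documents read: " + docs.count());

        // Output side: a second Configuration pointing at the target collection;
        // the file path argument is ignored by MongoOutputFormat
        Configuration outputConf = new Configuration();
        outputConf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        docs.saveAsNewAPIHadoopFile("file:///unused", Object.class, BSONObject.class,
            MongoOutputFormat.class, outputConf);

        sc.stop();
    }
}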

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping' = '_id,name,age')
TBLPROPERTIES('mongo.uri' = 'mongodb://host:27017/test.users')

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONSerDe
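
Once defined this way, the table can be queried like any other Hive table, for example over JDBC. Here is a minimal sketch (not part of the original deck), assuming a HiveServer2 instance on localhost:10000 with no authentication.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveMongoQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and credentials are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // mongo_users is the table declared above, backed by MongoDB
             ResultSet rs = stmt.executeQuery(
                 "SELECT name, age FROM mongo_users WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getInt("age"));
            }
        }
    }
}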

Hive Support

MongoDB                              Hive
Primitive type (int, String, etc.)   Primitive type (int, float, etc.)
Document                             Row
Sub-document                         Struct, Map, or exploded field
Array                                Array or exploded field

● Types are given by the schema

● May use structs to project fields out of documents and ease access

● Can explode nested fields to make them top-level: {"customer": {"name": "Bart"}} can be accessed as "customer.name"

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson'
      USING com.mongodb.hadoop.pig.BSONStorage;

Pig Mappings

MongoDB                              Pig
Primitive type (int, String, etc.)   Primitive type (int, chararray, etc.)
Document                             Tuple (schema given)
Document                             Tuple containing a Map (no schema)
Sub-document                         Map
Array                                Bag

● Organize and prune documents by specifying a schema

● Access full document in a Map without needing a schema

Demo!

Questions?