MongoDB and Hadoop: Driving Business Insights

DESCRIPTION

MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other, enabling complex analyses and greater intelligence. We will then dive into the MongoDB Connector for Hadoop, show how it can be applied to deliver new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.

MongoDB and Hadoop

Luke Lovett

Software Engineer, MongoDB

Agenda

• Complementary Approaches to Data

• MongoDB & Hadoop Use Cases

• MongoDB Connector Overview and Features

• Demo

Complementary Approaches to Data

Operational: MongoDB

[Diagram: spectrum of enterprise use cases spanning operational and analytical workloads: real-time analytics, product/asset catalogs, security & fraud, Internet of Things, mobile apps, customer data management, single view, social, churn analysis, recommenders, warehouse & ETL, risk modeling, trade surveillance, predictive analytics, ad targeting, sentiment analysis]

MongoDB

• Store and read data frequently

• Easy administration

• Built-in analytical tools

– aggregation framework (see the sketch below)

– JavaScript MapReduce

– Geo/text indexes
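
To make the aggregation framework concrete, here is a minimal sketch (not part of the original deck) using the MongoDB Java driver (3.7+). The connection string, the test.orders namespace, and the customerId/total fields are hypothetical.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;

public class AggregationSketch {
    public static void main(String[] args) {
        // Hypothetical connection string and namespace
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("test").getCollection("orders");

            // Count orders of at least 100 per customer, largest counts first
            orders.aggregate(Arrays.asList(
                    Aggregates.match(Filters.gte("total", 100)),
                    Aggregates.group("$customerId", Accumulators.sum("orders", 1)),
                    Aggregates.sort(Sorts.descending("orders"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}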

Analytical: Hadoop

[Diagram: the same use-case spectrum, repeated for the analytical side]

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• Terabyte and Petabyte datasets

• Data warehousing

• Advanced analytics

Operational vs. Analytical: Lifecycle

[Diagram: the same use-case spectrum, shown across the operational-to-analytical data lifecycle]

MongoDB & Hadoop Use Cases

Batch Aggregation

[Diagram: applications powered by MongoDB and analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]

● Need more than MongoDB aggregation

● Need offline processing

● Results sent back to MongoDB

● Can be left as BSON on HDFS for further analysis

Commerce

[Diagram: commerce applications powered by MongoDB and analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]

• Products & Inventory
• Recommended products
• Customer profile
• Session management
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

Fraud Detection

[Diagram: online payments processing writes payments to MongoDB; the MongoDB Connector for Hadoop feeds nightly fraud-modeling analysis in Hadoop, together with 3rd-party data sources; results flow back to a MongoDB results cache that the fraud detection service queries (query only)]

MongoDB Connector for Hadoop

Connector Overview

[Diagram: the Hadoop Connector links MongoDB (single node, replica set, or sharded cluster) and BSON files to Hadoop (MapReduce, Hive, Pig, Spark), with data stored in HDFS / S3 as text or BSON files; supported distributions include Apache Hadoop, Cloudera CDH, Hortonworks HDP, and Amazon EMR]

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS:

• Dynamic queries work with the most recent data, but put load on the operational database
• Snapshots move load to Hadoop and add predictable load to MongoDB

Connector Operation

1. Split according to the given InputFormat
   – many options available for reading from a live cluster
   – configure key pattern and split strategy
2. Write splits file
3. Output to BSON files or a live MongoDB deployment
   – BSON file splits written automatically for future tasks
   – MongoDB insertions are round-robin across output collections

Getting Splits

• Split on a sharded cluster
  – Split by chunk
  – Split by shard
• Splits on a replica set or standalone
  – splitVector command
• BSON files
  – specify max docs
  – split per input file

[Diagram: a sharded MongoDB cluster with a mongos router, config servers, and three shards, each holding several chunks]

MapReduce Configuration

• MongoDB input
  – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  – mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output
  – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  – mongo.output.uri = mongodb://mydb:27017/db1.collection2
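
To show how these properties fit into a job, here is a minimal sketch (not from the original deck) of a MapReduce job that counts documents per category and writes the counts back to MongoDB. The URIs and the "category" field are placeholders; the mapper and reducer signatures mirror those used by the connector's example jobs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class MongoCategoryCountJob {

    // Documents arrive as (_id, BSONObject); "category" is a hypothetical field.
    public static class CategoryMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, BSONObject doc, Context context)
                throws IOException, InterruptedException {
            Object category = doc.get("category");
            if (category != null) {
                context.write(new Text(category.toString()), ONE);
            }
        }
    }

    // Each reducer output becomes a document in the output collection.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, BSONWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new BSONWritable(new BasicBSONObject("count", sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "count-by-category");
        job.setJarByClass(MongoCategoryCountJob.class);
        // Jobs built on the connector's MongoTool can instead pick these classes
        // up from the mongo.job.* properties listed above.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapperClass(CategoryMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}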

MapReduce Configuration

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  – mapred.input.dir = hdfs:///tmp/database.bson
  – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  – mapred.output.dir = hdfs:///tmp/output.bson

Spark Usage

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using the SparkContext Hadoop file API
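
Putting those bullets together, a minimal Spark sketch in Java (not the demo application from the talk) might look like the following. Host names and collection names are placeholders, and the path passed to saveAsNewAPIHadoopFile is unused by MongoOutputFormat.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class SparkMongoSketch {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("mongo-spark-sketch"));

        // Input side: a Configuration carrying the collection URI for MongoInputFormat
        Configuration inputConf = new Configuration();
        inputConf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");

        // Each record is (_id, document): key Object, value BSONObject
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
            inputConf, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("documents read: " + docs.count());

        // Output side: a second Configuration pointing at the target collection;
        // the file path argument is ignored by MongoOutputFormat
        Configuration outputConf = new Configuration();
        outputConf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        docs.saveAsNewAPIHadoopFile("file:///unused", Object.class, BSONObject.class,
            MongoOutputFormat.class, outputConf);

        sc.stop();
    }
}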

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping' = '_id,name,age')
TBLPROPERTIES('mongo.uri' = 'mongodb://host:27017/test.users')

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONSerDe
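
Once defined this way, the table can be queried like any other Hive table, for example over JDBC. Here is a minimal sketch (not part of the original deck), assuming a HiveServer2 instance on localhost:10000 with no authentication.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveMongoQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and credentials are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // mongo_users is the table declared above, backed by MongoDB
             ResultSet rs = stmt.executeQuery(
                 "SELECT name, age FROM mongo_users WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getInt("age"));
            }
        }
    }
}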

Hive Support

MongoDB                              Hive
Primitive type (int, String, etc.)   Primitive type (int, float, etc.)
Document                             Row
Sub-document                         Struct, Map, or exploded field
Array                                Array or exploded field

● Types are given by the schema

● May use structs to project fields out of documents and ease access

● Can explode nested fields to make them top-level: {"customer": {"name": "Bart"}} can be accessed as "customer.name"

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson'
      USING com.mongodb.hadoop.pig.BSONStorage;

Pig Mappings

MongoDB                              Pig
Primitive type (int, String, etc.)   Primitive type (int, chararray, etc.)
Document                             Tuple (schema given)
Document                             Tuple containing a Map (no schema)
Sub-document                         Map
Array                                Bag

● Organize and prune documents by specifying a schema

● Access full document in a Map without needing a schema

Demo!

Questions?