MongoDB and Hadoop: Driving Business Insights

Software Engineer, MongoDB

Justin Lee

#MongoDB DC

Agenda

•  Evolving Data Landscape

•  MongoDB & Hadoop Use Cases

•  MongoDB Connector Features

•  Demo

Evolving Data Landscape

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

•  Terabyte and petabyte datasets

•  Data warehousing

•  Advanced analytics

Enterprise IT Stack

Applications: CRM, ERP, Collaboration, Mobile, BI

Data Management (Operational, Analytical): RDBMS, EDW

Infrastructure: OS & Virtualization, Compute, Storage, Network

Management & Monitoring

Security & Auditing

Operational vs. Analytical: Enrichment

Applications, Interactions Warehouse, Analytics

Operational: MongoDB

First-level Analytics

Product/Asset Catalogs

Security & Fraud

Internet of Things

Mobile Apps

Customer Data Mgmt

Single View

Social

Churn Analysis

Recommender

Warehouse & ETL

Risk Modeling

Trade Surveillance

Predictive Analytics

Ad Targeting

Sentiment Analysis

Analytical: Hadoop





Operational vs. Analytical: Lifecycle



Security & Fraud, Internet of Things, Single View, Social, Trade Surveillance, Predictive Analytics


MongoDB & Hadoop Use Cases

Commerce

Applications powered by MongoDB

•  Products & Inventory

•  Recommended products

•  Customer profile

•  Session management

Analysis powered by Hadoop

•  Elastic pricing

•  Recommendation models

•  Predictive analytics

•  Clickstream history

MongoDB Connector for Hadoop

Insurance

Applications powered by MongoDB

•  Customer profiles

•  Insurance policies

•  Session data

•  Call center data

Analysis powered by Hadoop

•  Customer action analysis

•  Churn analysis

•  Churn prediction

•  Policy rates

MongoDB Connector for Hadoop

Fraud Detection

Payments

Fraud modeling

Nightly Analysis

MongoDB Connector for Hadoop

Results Cache

Online payments processing

3rd Party Data Sources

Fraud Detection

query only

query only

MongoDB Connector for Hadoop

Data

Read/Write MongoDB

Read/Write BSON

Tools

MapReduce

Pig

Hive

Spark

Platforms

Apache Hadoop

Cloudera CDH

Hortonworks HDP

Amazon EMR

Connector Overview

Connector Features and Functionality

•  Computes splits to read data
    –  Single Node, Replica Sets, Sharded Clusters

•  Mappings for Pig and Hive
    –  MongoDB as a standard data source/destination

•  Support for
    –  Filtering data with MongoDB queries
    –  Authentication
    –  Reading from Replica Set tags
    –  Appending to existing collections
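Filtering with a MongoDB query can be sketched as a set of job properties. This is a minimal sketch: it assumes the filter is passed through the connector's mongo.input.query property as a JSON string (worth checking against the connector documentation for your version), and the rating field and threshold are hypothetical.

```python
import json


def filtered_input_conf(uri, query):
    """Build job properties that read only documents matching `query`."""
    return {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": uri,
        # Assumption: the connector expects the filter as a JSON string.
        "mongo.input.query": json.dumps(query),
    }


if __name__ == "__main__":
    conf = filtered_input_conf(
        "mongodb://mydb:27017/db1.collection1",
        {"rating": {"$gte": 4}},  # hypothetical field and threshold
    )
    for name, value in conf.items():
        print(name, "=", value)
```

Pushing the filter into the query means only matching documents leave MongoDB, which keeps the extra load on the operational database smaller.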

MapReduce Configuration

•  MongoDB input
    –  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
    –  mongo.input.uri = mongodb://mydb:27017/db1.collection1

•  MongoDB output
    –  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
    –  mongo.output.uri = mongodb://mydb:27017/db1.collection2

•  BSON input/output
    –  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
    –  mapred.input.dir = hdfs:///tmp/database.bson
    –  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
    –  mapred.output.dir = hdfs:///tmp/output.bson
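The MongoDB input/output properties above can be written out as a standard Hadoop configuration file. The sketch below is one way to generate it, not part of the connector; the helper name to_hadoop_xml is hypothetical and the host, database, and collection names are the slide's placeholders.

```python
import xml.etree.ElementTree as ET

# Property and class names are taken from the slide; URIs are placeholders.
MONGO_JOB_CONF = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}


def to_hadoop_xml(conf):
    """Render a dict as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in conf.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hadoop_xml(MONGO_JOB_CONF))
```

Keeping the properties in one dict makes it easy to swap the MongoDB URIs for the BSON file settings when a job should read a snapshot instead.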

Pig Mappings

•  Input: BSONLoader and MongoLoader

    data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;

•  Output: BSONStorage and MongoInsertStorage

    STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

•  Access collections as Hive tables

•  Use with MongoStorageHandler or BSONStorageHandler

Spark Usage

•  Use with MapReduce input/output formats

•  Create Configuration objects with input/output formats and data URI

•  Load/save data using SparkContext Hadoop file API
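The three steps above can be sketched in PySpark. This is a hedged sketch, not the demo's code: it assumes a PySpark release that provides SparkContext.newAPIHadoopRDD, that the connector and MongoDB Java driver jars are already on the Spark classpath, and that Text/MapWritable are suitable key and value classes. The helper names are hypothetical.

```python
def mongo_input_conf(uri):
    """Hadoop configuration entries for reading a MongoDB collection."""
    return {"mongo.input.uri": uri}


def load_collection(sc, uri):
    """Return an RDD of (key, document) pairs read through MongoInputFormat.

    `sc` is an existing pyspark.SparkContext created by the caller.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        # Assumed key/value classes; adjust if your connector version differs.
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=mongo_input_conf(uri),
    )
```

A caller would pass its own context, for example load_collection(sc, "mongodb://mydb:27017/db1.collection1"), and then apply ordinary RDD transformations to the result.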

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS

•  Dynamic queries to MongoDB
    –  Work with the most recent data
    –  Put load on the operational database

•  BSON snapshots in HDFS
    –  Move load to Hadoop
    –  Add predictable load to MongoDB

Demo

MovieWeb

MovieWeb Components

•  MovieLens dataset
    –  10M ratings, 10K movies, 70K users

•  Python web app to browse movies, recommendations
    –  Flask, PyMongo

•  Spark app computes recommendations
    –  MLlib collaborative filter

•  Predicted ratings are exposed in web app
    –  New predictions collection

MovieWeb Web Application

•  Browse
    –  Top movies by ratings count
    –  Top genres by movie count

•  Log in to
    –  See My Ratings
    –  Rate movies

•  What's missing?
    –  Movies You May Like
    –  Recommendations
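"Top movies by ratings count" maps naturally onto an aggregation pipeline. The sketch below is an illustration rather than the demo's code: it assumes PyMongo 3 or later (where aggregate returns a cursor), and the movieid field name and the helper names are hypothetical.

```python
def top_movies_pipeline(limit=10, movie_field="movieid"):
    """Count ratings per movie, then keep the most-rated ones."""
    return [
        {"$group": {"_id": "$" + movie_field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def top_movies(ratings_collection, limit=10):
    """Run the pipeline against a pymongo Collection of rating documents."""
    return list(ratings_collection.aggregate(top_movies_pipeline(limit)))
```

The web app would call top_movies with its ratings collection and render the returned documents, each holding a movie id in _id and its rating count in count.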

Spark Recommender

•  Apache Hadoop 2.3.0
    –  HDFS and YARN

•  Spark 1.0
    –  Execute within YARN
    –  Assign executor resources

•  Data
    –  From HDFS, MongoDB
    –  To MongoDB

Snapshot database as BSON

Store BSON in HDFS

Read BSON into Spark app

Train model from existing ratings

Create user-movie pairings

Predict ratings for all pairings

Write predictions to MongoDB collection

Web application exposes recommendations

Repeat the process weekly

MovieWeb Workflow
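The pairing and prediction steps of the workflow can be sketched in plain Python. This is a small-scale illustration only: the demo uses Spark and an MLlib model, whereas here `predict` is any callable standing in for the trained model, and the userid, movieid, and rating field names are hypothetical.

```python
from itertools import product


def all_pairs(ratings):
    """Yield every (user, movie) pairing seen in the ratings data."""
    users = sorted({r["userid"] for r in ratings})
    movies = sorted({r["movieid"] for r in ratings})
    return product(users, movies)


def prediction_docs(pairs, predict):
    """Turn pairings into documents for a predictions collection."""
    return [
        {"userid": u, "movieid": m, "rating": predict(u, m)}
        for u, m in pairs
    ]


if __name__ == "__main__":
    ratings = [
        {"userid": 1, "movieid": 10, "rating": 4.0},
        {"userid": 1, "movieid": 20, "rating": 3.0},
        {"userid": 2, "movieid": 10, "rating": 5.0},
    ]
    # Stand-in for a trained model: always predicts the mean rating.
    mean = sum(r["rating"] for r in ratings) / len(ratings)
    for doc in prediction_docs(all_pairs(ratings), lambda u, m: mean):
        print(doc)
```

With the dataset sizes quoted earlier (70K users, 10K movies), the full pairing is about 700 million combinations, which is why this step runs as a distributed Spark job rather than in the web application.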

$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.4.0.jar

$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-1.3.0.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar

Execution

Questions?

•  MongoDB Connector for Hadoop
    –  http://github.com/mongodb/mongo-hadoop

•  Getting Started with MongoDB and Hadoop
    –  http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

•  MongoDB-Spark Demo
    –  http://github.com/crcsmnky/mongodb-spark-demo
