
MongoDB and Hadoop: Driving Business Insights


DESCRIPTION

MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will look in depth at how the two technologies complement and enrich each other through complex analyses and greater intelligence. We will then dive into the MongoDB Connector for Hadoop, show how it enables new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.

Citation preview

Page 1: MongoDB and Hadoop: Driving Business Insights

MongoDB and Hadoop Driving Business Insights

Software Engineer, MongoDB

Justin Lee

#MongoDB DC

Page 2: MongoDB and Hadoop: Driving Business Insights

Agenda

•  Evolving Data Landscape

•  MongoDB & Hadoop Use Cases

•  MongoDB Connector Features

•  Demo

Page 3: MongoDB and Hadoop: Driving Business Insights

Evolving Data Landscape

Page 4: MongoDB and Hadoop: Driving Business Insights

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

•  Terabyte and petabyte datasets

•  Data warehousing

•  Advanced analytics

Page 5: MongoDB and Hadoop: Driving Business Insights

Enterprise IT Stack

[Diagram: enterprise IT stack. Layers, top to bottom: Applications (CRM, ERP, Collaboration, Mobile, BI); Data Management, Operational and Analytical (RDBMS, EDW); Infrastructure (OS & Virtualization, Compute, Storage, Network). Alongside all layers: Management & Monitoring; Security & Auditing.]

Page 6: MongoDB and Hadoop: Driving Business Insights

Operational vs. Analytical: Enrichment

Applications, Interactions Warehouse, Analytics

Page 7: MongoDB and Hadoop: Driving Business Insights

Operational: MongoDB

•  First-level Analytics

•  Product/Asset Catalogs

•  Security & Fraud

•  Internet of Things

•  Mobile Apps

•  Customer Data Mgmt

•  Single View

•  Social

•  Churn Analysis

•  Recommender

•  Warehouse & ETL

•  Risk Modeling

•  Trade Surveillance

•  Predictive Analytics

•  Ad Targeting

•  Sentiment Analysis

Page 8: MongoDB and Hadoop: Driving Business Insights

Analytical: Hadoop

•  First-level Analytics

•  Product/Asset Catalogs

•  Security & Fraud

•  Internet of Things

•  Mobile Apps

•  Customer Data Mgmt

•  Single View

•  Social

•  Churn Analysis

•  Recommender

•  Warehouse & ETL

•  Risk Modeling

•  Trade Surveillance

•  Predictive Analytics

•  Ad Targeting

•  Sentiment Analysis

Page 9: MongoDB and Hadoop: Driving Business Insights

Operational vs. Analytical: Lifecycle

•  First-level Analytics

•  Product/Asset Catalogs

•  Security & Fraud

•  Internet of Things

•  Mobile Apps

•  Customer Data Mgmt

•  Single View

•  Social

•  Churn Analysis

•  Recommender

•  Warehouse & ETL

•  Risk Modeling

•  Trade Surveillance

•  Predictive Analytics

•  Ad Targeting

•  Sentiment Analysis

Page 10: MongoDB and Hadoop: Driving Business Insights

MongoDB & Hadoop Use Cases

Page 11: MongoDB and Hadoop: Driving Business Insights

Commerce

Applications powered by MongoDB

•  Products & Inventory

•  Recommended products

•  Customer profile

•  Session management

Analysis powered by Hadoop

•  Elastic pricing

•  Recommendation models

•  Predictive analytics

•  Clickstream history

The two sides are linked by the MongoDB Connector for Hadoop.

Page 12: MongoDB and Hadoop: Driving Business Insights

Insurance

Applications powered by MongoDB

•  Customer profiles

•  Insurance policies

•  Session data

•  Call center data

Analysis powered by Hadoop

•  Customer action analysis

•  Churn analysis

•  Churn prediction

•  Policy rates

The two sides are linked by the MongoDB Connector for Hadoop.

Page 13: MongoDB and Hadoop: Driving Business Insights

Fraud Detection

[Architecture diagram. Labeled components: Online payments processing; Payments; Fraud Detection; Results Cache; 3rd Party Data Sources; Nightly Analysis; Fraud modeling; MongoDB Connector for Hadoop. Two connections are marked "query only".]

Page 14: MongoDB and Hadoop: Driving Business Insights

MongoDB Connector for Hadoop

Page 15: MongoDB and Hadoop: Driving Business Insights

Connector Overview

Data

•  Read/Write MongoDB

•  Read/Write BSON

Tools

•  MapReduce

•  Pig

•  Hive

•  Spark

Platforms

•  Apache Hadoop

•  Cloudera CDH

•  Hortonworks HDP

•  Amazon EMR

Page 16: MongoDB and Hadoop: Driving Business Insights

Connector Features and Functionality

•  Computes splits to read data
   –  Single Node, Replica Sets, Sharded Clusters

•  Mappings for Pig and Hive
   –  MongoDB as a standard data source/destination

•  Support for
   –  Filtering data with MongoDB queries
   –  Authentication
   –  Reading from Replica Set tags
   –  Appending to existing collections
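The first and third bullets above (computing splits, filtering with MongoDB queries) can be pictured with a small sketch. This is not the connector's code: `range_splits` is a hypothetical helper, the `_id` boundary values are made up, and expressing each split as a `$gte`/`$lt` range on `_id` is an assumption chosen to keep the idea visible. It also assumes the user's filter does not itself constrain `_id`.

```python
from typing import Dict, List, Optional


def range_splits(split_points: List, base_query: Optional[Dict] = None) -> List[Dict]:
    """Turn sorted _id boundaries into one query filter per input split.

    Hypothetical helper for illustration only. Each returned dict is a
    MongoDB-style filter covering a half-open _id range, merged with the
    user's own filter so every split reads only matching documents.
    """
    base = dict(base_query or {})
    # None marks an open end: the first split has no lower bound,
    # the last split has no upper bound.
    bounds = [None, *split_points, None]
    splits = []
    for lo, hi in zip(bounds, bounds[1:]):
        id_range = {}
        if lo is not None:
            id_range["$gte"] = lo
        if hi is not None:
            id_range["$lt"] = hi
        query = dict(base)
        if id_range:
            query["_id"] = id_range
        splits.append(query)
    return splits


if __name__ == "__main__":
    for split in range_splits([100, 200], {"status": "active"}):
        print(split)
```

Each filter could be handed to a separate mapper, so the collection is read in parallel and no document is read twice.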

Page 17: MongoDB and Hadoop: Driving Business Insights

MapReduce Configuration

•  MongoDB input
   –  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
   –  mongo.input.uri = mongodb://mydb:27017/db1.collection1

•  MongoDB output
   –  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
   –  mongo.output.uri = mongodb://mydb:27017/db1.collection2

•  BSON input/output
   –  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
   –  mapred.input.dir = hdfs:///tmp/database.bson
   –  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
   –  mapred.output.dir = hdfs:///tmp/output.bson
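As a rough illustration of how these properties fit together, the sketch below assembles them for one job. `mongo_job_conf` and `as_d_flags` are hypothetical helpers, not part of the connector. The property names come from the slide, and the BSON format classes are written with the `com.mongodb.hadoop` package used by the connector's other classes. Rendering the result as Hadoop generic `-D` options is one possible way to pass it to a job, assuming the job's driver parses generic options.

```python
from typing import Dict, List, Optional

MONGO_INPUT = "com.mongodb.hadoop.MongoInputFormat"
MONGO_OUTPUT = "com.mongodb.hadoop.MongoOutputFormat"
BSON_INPUT = "com.mongodb.hadoop.BSONFileInputFormat"
BSON_OUTPUT = "com.mongodb.hadoop.BSONFileOutputFormat"


def mongo_job_conf(
    input_uri: Optional[str] = None,
    output_uri: Optional[str] = None,
    bson_input_dir: Optional[str] = None,
    bson_output_dir: Optional[str] = None,
) -> Dict[str, str]:
    """Build job properties for exactly one input and one output.

    Hypothetical helper: a live MongoDB collection is addressed by URI,
    a BSON snapshot by its HDFS path.
    """
    if (input_uri is None) == (bson_input_dir is None):
        raise ValueError("choose exactly one input: MongoDB URI or BSON path")
    if (output_uri is None) == (bson_output_dir is None):
        raise ValueError("choose exactly one output: MongoDB URI or BSON path")

    conf: Dict[str, str] = {}
    if input_uri is not None:
        conf["mongo.job.input.format"] = MONGO_INPUT
        conf["mongo.input.uri"] = input_uri
    else:
        conf["mongo.job.input.format"] = BSON_INPUT
        conf["mapred.input.dir"] = bson_input_dir
    if output_uri is not None:
        conf["mongo.job.output.format"] = MONGO_OUTPUT
        conf["mongo.output.uri"] = output_uri
    else:
        conf["mongo.job.output.format"] = BSON_OUTPUT
        conf["mapred.output.dir"] = bson_output_dir
    return conf


def as_d_flags(conf: Dict[str, str]) -> List[str]:
    """Render the properties as -Dkey=value arguments, sorted by key."""
    return [f"-D{key}={value}" for key, value in sorted(conf.items())]


if __name__ == "__main__":
    conf = mongo_job_conf(
        input_uri="mongodb://mydb:27017/db1.collection1",
        bson_output_dir="hdfs:///tmp/output.bson",
    )
    print("\n".join(as_d_flags(conf)))
```

Mixing the two modes is allowed on purpose: reading a live collection and writing a BSON file, or the reverse, only changes which pair of properties is set on each side.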

Page 18: MongoDB and Hadoop: Driving Business Insights

Pig Mappings

•  Input: BSONLoader and MongoLoader

       data = LOAD 'mongodb://mydb:27017/db.collection'
              USING com.mongodb.hadoop.pig.MongoLoader;

•  Output: BSONStorage and MongoInsertStorage

       STORE records INTO 'hdfs:///output.bson'
              USING com.mongodb.hadoop.pig.BSONStorage;

Page 19: MongoDB and Hadoop: Driving Business Insights

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)

STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"

WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")

TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

•  Access collections as Hive tables

•  Use with MongoStorageHandler or BSONStorageHandler  

Page 20: MongoDB and Hadoop: Driving Business Insights

Spark Usage

•  Use with MapReduce input/output formats

•  Create Configuration objects with input/output formats and data URI

•  Load/save data using SparkContext Hadoop file API
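The three bullets above might look like this from PySpark. This is a sketch under assumptions: it needs a PySpark release that exposes `newAPIHadoopRDD` and `saveAsNewAPIHadoopFile`, with the connector and MongoDB driver jars on the classpath. The `Text`/`MapWritable` key and value classes and the placeholder output path are also assumptions. `load_mongo_rdd` and `save_mongo_rdd` are hypothetical helper names, and the talk's own demo is a JVM application, so this is not the demo's code.

```python
def load_mongo_rdd(sc, input_uri):
    """Read a MongoDB collection as an RDD through the connector's input format.

    `sc` is an existing SparkContext; `input_uri` has the same form as
    mongo.input.uri, e.g. mongodb://mydb:27017/db1.collection1.
    """
    conf = {"mongo.input.uri": input_uri}
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",          # assumed key class
        valueClass="org.apache.hadoop.io.MapWritable",  # assumed value class
        conf=conf,
    )


def save_mongo_rdd(rdd, output_uri):
    """Write an RDD of (key, document) pairs to a MongoDB collection.

    The Hadoop file API requires a path argument; the destination is
    taken from mongo.output.uri, so the path here is a placeholder.
    """
    rdd.saveAsNewAPIHadoopFile(
        path="file:///this-is-unused",
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf={"mongo.output.uri": output_uri},
    )
```

Swapping the format classes for their BSON file counterparts and setting `mapred.input.dir` instead of a URI would, under the same assumptions, read a BSON snapshot from HDFS rather than a live collection.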

Page 21: MongoDB and Hadoop: Driving Business Insights

Data Movement

Dynamic queries with most recent data

Puts load on operational database

Snapshots move load to Hadoop

Snapshots add predictable load to MongoDB

Dynamic queries to MongoDB vs. BSON snapshots in HDFS

Page 22: MongoDB and Hadoop: Driving Business Insights

Demo

Page 23: MongoDB and Hadoop: Driving Business Insights

MovieWeb

Page 24: MongoDB and Hadoop: Driving Business Insights

MovieWeb Components

•  MovieLens dataset
   –  10M ratings, 10K movies, 70K users

•  Python web app to browse movies, recommendations
   –  Flask, PyMongo

•  Spark app computes recommendations
   –  MLlib collaborative filter

•  Predicted ratings are exposed in web app
   –  New predictions collection

Page 25: MongoDB and Hadoop: Driving Business Insights

MovieWeb Web Application

•  Browse
   –  Top movies by ratings count
   –  Top genres by movie count

•  Log in to
   –  See My Ratings
   –  Rate movies

•  What’s missing?
   –  Movies You May Like
   –  Recommendations

Page 26: MongoDB and Hadoop: Driving Business Insights

Spark Recommender

•  Apache Hadoop 2.3.0
   –  HDFS and YARN

•  Spark 1.0
   –  Execute within YARN
   –  Assign executor resources

•  Data
   –  From HDFS, MongoDB
   –  To MongoDB

Page 27: MongoDB and Hadoop: Driving Business Insights

MovieWeb Workflow

1.  Snapshot database as BSON

2.  Store BSON in HDFS

3.  Read BSON into Spark app

4.  Train model from existing ratings

5.  Create user-movie pairings

6.  Predict ratings for all pairings

7.  Write predictions to MongoDB collection

8.  Web application exposes recommendations

9.  Repeat the process weekly
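The middle of this workflow (train, pair, predict) can be sketched on a single machine in plain Python. To keep the sketch dependency-free, the demo's MLlib collaborative filter is swapped for a much simpler stand-in: a baseline that predicts the global mean plus a per-user and a per-movie offset. The function names and the `userid`, `movieid`, and `rating` field names are hypothetical, not taken from the demo.

```python
from collections import defaultdict
from itertools import product


def train_baseline(ratings):
    """Fit global mean plus average per-user and per-movie deviations.

    `ratings` is a list of (user_id, movie_id, rating) tuples. This is a
    stand-in for the collaborative filter, not an equivalent of it.
    """
    mean = sum(r for _, _, r in ratings) / len(ratings)
    user_dev, movie_dev = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        user_dev[user].append(rating - mean)
        movie_dev[movie].append(rating - mean)
    return {
        "mean": mean,
        "user_bias": {u: sum(d) / len(d) for u, d in user_dev.items()},
        "movie_bias": {m: sum(d) / len(d) for m, d in movie_dev.items()},
    }


def predict(model, user, movie):
    """Predicted rating; unknown users or movies fall back to a zero offset."""
    return (
        model["mean"]
        + model["user_bias"].get(user, 0.0)
        + model["movie_bias"].get(movie, 0.0)
    )


def build_predictions(ratings):
    """Train, create every user-movie pairing, and predict each one.

    Returns documents shaped for insertion into a predictions collection.
    """
    ratings = list(ratings)
    model = train_baseline(ratings)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    return [
        {"userid": u, "movieid": m, "rating": predict(model, u, m)}
        for u, m in product(users, movies)
    ]


if __name__ == "__main__":
    sample = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0)]
    for doc in build_predictions(sample):
        print(doc)
```

Scoring every pairing grows as users times movies, which is one reason the real job runs on a cluster and is repeated as a batch rather than computed per request.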

Page 28: MongoDB and Hadoop: Driving Business Insights

Execution

$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.4.0.jar

$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-1.3.0.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar

Page 29: MongoDB and Hadoop: Driving Business Insights

Questions?

•  MongoDB Connector for Hadoop
   –  http://github.com/mongodb/mongo-hadoop

•  Getting Started with MongoDB and Hadoop
   –  http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

•  MongoDB-Spark Demo
   –  http://github.com/crcsmnky/mongodb-spark-demo