MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will look in depth at how the two technologies complement each other, enabling more complex analyses and greater intelligence. We will then dive into the MongoDB Connector for Hadoop, show how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.
MongoDB and Hadoop Driving Business Insights
Software Engineer, MongoDB
Justin Lee
#MongoDB DC
Agenda
• Evolving Data Landscape
• MongoDB & Hadoop Use Cases
• MongoDB Connector Features
• Demo
Evolving Data Landscape
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
Enterprise IT Stack
[Diagram: layers of the enterprise IT stack]
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management (Operational, Analytical): RDBMS, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Management & Monitoring
• Security & Auditing
Operational vs. Analytical: Enrichment
• Operational: Applications, Interactions • Analytical: Warehouse, Analytics
Operational: MongoDB
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps • Customer Data Mgmt
• Single View • Social
• Churn Analysis • Recommender
• Warehouse & ETL • Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting • Sentiment Analysis
Analytical: Hadoop
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps • Customer Data Mgmt
• Single View • Social
• Churn Analysis • Recommender
• Warehouse & ETL • Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting • Sentiment Analysis
Operational vs. Analytical: Lifecycle
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps • Customer Data Mgmt
• Single View • Social
• Churn Analysis • Recommender
• Warehouse & ETL • Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting • Sentiment Analysis
MongoDB & Hadoop Use Cases
Commerce
Applications powered by MongoDB
• Products & Inventory • Recommended products • Customer profile • Session management
Analysis powered by Hadoop
• Elastic pricing • Recommendation models • Predictive analytics • Clickstream history
MongoDB Connector for Hadoop
Insurance
Applications powered by MongoDB
• Customer profiles • Insurance policies • Session data • Call center data
Analysis powered by Hadoop
• Customer action analysis • Churn analysis • Churn prediction • Policy rates
MongoDB Connector for Hadoop
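Fraud Detection
[Diagram: fraud detection architecture; components listed below]
• Online payments processing
• Payments
• Fraud Detection
• Results Cache
• 3rd Party Data Sources
• Fraud modeling
• Nightly Analysis
• MongoDB Connector for Hadoop
• Connection labels: query only (on two connections)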
MongoDB Connector for Hadoop
Connector Overview
• Data – Read/Write MongoDB – Read/Write BSON
• Tools – MapReduce – Pig – Hive – Spark
• Platforms – Apache Hadoop – Cloudera CDH – Hortonworks HDP – Amazon EMR
Connector Features and Functionality
• Computes splits to read data – Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive – MongoDB as a standard data source/destination
• Support for – Filtering data with MongoDB queries – Authentication – Reading from Replica Set tags – Appending to existing collections
MapReduce Configuration
• MongoDB input
  – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  – mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
  – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  – mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
  – mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  – mapred.input.dir = hdfs:///tmp/database.bson
  – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  – mapred.output.dir = hdfs:///tmp/output.bson
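As a rough illustration of how the MongoDB input and output properties above fit together, the following Python sketch collects them in a dictionary and renders them as generic `-D key=value` Hadoop options. The helper names `mongo_job_properties` and `as_hadoop_options` are hypothetical, and passing properties with `-D` assumes the job's driver parses generic options (for example through Hadoop's `ToolRunner`).

```python
# Hypothetical helpers: they only assemble the property names shown on the
# slide; they do not talk to Hadoop or MongoDB.

def mongo_job_properties(input_uri, output_uri):
    """Return the MapReduce properties for MongoDB input and output."""
    return {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": input_uri,
        "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
        "mongo.output.uri": output_uri,
    }


def as_hadoop_options(properties):
    """Render properties as a flat argument list: ["-D", "key=value", ...]."""
    options = []
    for key, value in sorted(properties.items()):
        options.extend(["-D", f"{key}={value}"])
    return options


props = mongo_job_properties(
    "mongodb://mydb:27017/db1.collection1",
    "mongodb://mydb:27017/db1.collection2",
)
print(" ".join(as_hadoop_options(props)))
```

The same properties could equally be set in a job configuration file; the dictionary form is used here only because it is easy to inspect.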
Pig Mappings
• Input: BSONLoader and MongoLoader
  data = LOAD 'mongodb://mydb:27017/db.collection'
      using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
  STORE records INTO 'hdfs:///output.bson'
      using com.mongodb.hadoop.pig.BSONStorage;
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping' = '{"id":"_id","name":"name","age":"age"}')
TBLPROPERTIES('mongo.uri' = 'mongodb://host:27017/test.users');
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
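The three points above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the demo's own code: it assumes a PySpark version that provides `SparkContext.newAPIHadoopRDD`, that the connector and MongoDB Java driver jars are already on the Spark classpath, and that `Text`/`MapWritable` are suitable key and value classes. The function names are hypothetical.

```python
def mongo_input_conf(uri):
    """Hadoop configuration (as a plain dict) pointing the input format at a collection."""
    return {"mongo.input.uri": uri}


def load_collection(sc, uri):
    """Load a MongoDB collection as an RDD of (key, document) pairs.

    `sc` is an existing pyspark.SparkContext created by the caller.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",          # assumed key class
        valueClass="org.apache.hadoop.io.MapWritable",  # assumed value class
        conf=mongo_input_conf(uri),
    )
```

A caller would pass its own context and URI, for example `load_collection(sc, "mongodb://host:27017/test.users")`.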
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries – Work with the most recent data – Put load on the operational database
• Snapshots – Move load to Hadoop – Add predictable load to MongoDB
Demo
MovieWeb
MovieWeb Components
• MovieLens dataset – 10M ratings, 10K movies, 70K users
• Python web app to browse movies, recommendations – Flask, PyMongo
• Spark app computes recommendations – MLlib collaborative filter
• Predicted ratings are exposed in web app – New predictions collection
MovieWeb Web Application
• Browse – Top movies by ratings count – Top genres by movie count
• Log in to – See My Ratings – Rate movies
• What’s missing? – Movies You May Like – Recommendations
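The missing "Movies You May Like" view amounts to reading the highest predicted ratings for the logged-in user from the predictions collection. A minimal sketch of that lookup, assuming PyMongo's `find`/`sort`/`limit` cursor chain; the field names `userid`, `movieid`, and `rating` and the function name are hypothetical, since the deck does not show the collection's schema.

```python
def top_recommendations(predictions, user_id, limit=10):
    """Return (movie id, predicted rating) pairs for one user, best first.

    `predictions` is a pymongo Collection, or any object offering the same
    find().sort().limit() chain. Field names are assumptions.
    """
    cursor = (
        predictions.find({"userid": user_id})
        .sort("rating", -1)  # -1 sorts descending (pymongo.DESCENDING)
        .limit(limit)
    )
    return [(doc["movieid"], doc["rating"]) for doc in cursor]
```

In a Flask view, the collection would come from a `pymongo.MongoClient` connection, and the returned pairs would be joined with movie titles before rendering.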
Spark Recommender
• Apache Hadoop 2.3.0 – HDFS and YARN
• Spark 1.0 – Execute within YARN – Assign executor resources
• Data – From HDFS, MongoDB – To MongoDB
MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
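The "create user-movie pairings" step of this workflow can be illustrated in plain Python. This is a small in-memory sketch rather than the demo's Spark code, and it assumes that only pairs without an existing rating need a prediction; the function name `unrated_pairings` is hypothetical.

```python
from itertools import product


def unrated_pairings(ratings):
    """Yield every (user, movie) pair that has no rating yet.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples.
    """
    rated = {(user, movie) for user, movie, _ in ratings}
    users = sorted({user for user, _ in rated})
    movies = sorted({movie for _, movie in rated})
    # Cartesian product of all users and movies, minus the pairs already rated.
    for pair in product(users, movies):
        if pair not in rated:
            yield pair


ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (3, 30, 2.0)]
print(list(unrated_pairings(ratings)))
# → [(1, 30), (2, 20), (2, 30), (3, 10), (3, 20)]
```

At the scale of the MovieLens dataset the full product is far too large for a single process, which is why the demo performs this step inside Spark.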
Execution
$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.4.0.jar
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-1.3.0.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar
Questions?
• MongoDB Connector for Hadoop – http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop – http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo – http://github.com/crcsmnky/mongodb-spark-demo