MongoDB and Hadoop: Driving Business Insights

DESCRIPTION

MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.

Page 1: Mongo db and hadoop   driving business insights - final

MongoDB and Hadoop

Luke Lovett

Software Engineer, MongoDB

Agenda

• Complementary Approaches to Data

• MongoDB & Hadoop Use Cases

• MongoDB Connector Overview and Features

• Demo

Complementary Approaches to Data

Operational: MongoDB

Real-Time Analytics

Product/Asset Catalogs

Security & Fraud

Internet of Things

Mobile Apps

Customer Data Mgmt

Single View

Social

Churn Analysis

Recommender

Warehouse & ETL

Risk Modeling

Trade Surveillance

Predictive Analytics

Ad Targeting

Sentiment Analysis

MongoDB

• Store and read data frequently

• Easy administration

• Built-in analytical tools

– aggregation framework

– JavaScript MapReduce

– Geo/text indexes

Analytical: Hadoop

Real-Time Analytics

Product/Asset Catalogs

Security & Fraud

Internet of Things

Mobile Apps

Customer Data Mgmt

Single View

Social

Churn Analysis

Recommender

Warehouse & ETL

Risk Modeling

Trade Surveillance

Predictive Analytics

Ad Targeting

Sentiment Analysis

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• Terabyte and Petabyte datasets

• Data warehousing

• Advanced analytics

Operational vs. Analytical: Lifecycle

Real-Time Analytics

Product/Asset Catalogs

Security & Fraud

Internet of Things

Mobile Apps

Customer Data Mgmt

Single View

Social

Churn Analysis

Recommender

Warehouse & ETL

Risk Modeling

Trade Surveillance

Predictive Analytics

Ad Targeting

Sentiment Analysis

MongoDB & Hadoop Use Cases

Batch Aggregation

● Need more than MongoDB aggregation

● Need offline processing

● Results sent back to MongoDB

● Can be left as BSON on HDFS for further analysis

Diagram: applications powered by MongoDB and analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop.

Commerce

Applications powered by MongoDB

• Products & Inventory

• Recommended products

• Customer profile

• Session management

Analysis powered by Hadoop

• Elastic pricing

• Recommendation models

• Predictive analytics

• Clickstream history

Diagram: the two sides are linked by the MongoDB Connector for Hadoop.

Fraud Detection

Diagram labels:

Payments

Fraud modeling

Nightly Analysis

MongoDB Connector for Hadoop

Results Cache

Online payments processing

3rd Party Data Sources

Fraud Detection

query only (shown twice)

MongoDB Connector for Hadoop

Connector Overview

Hadoop: MapReduce, Hive, Pig, Spark

Apache Hadoop / Cloudera CDH / Hortonworks HDP / Amazon EMR

HDFS / S3: Text Files, BSON Files

Hadoop Connector

MongoDB: Single Node, Replica Set, Cluster

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS

Dynamic queries with most recent data

Puts load on operational database

Snapshots move load to Hadoop

Snapshots add predictable load to MongoDB

Connector Operation

1. Split according to given InputFormat

- many options available for reading from live cluster

- configure key pattern, split strategy

2. Write splits file

3. Output to BSON file or live MongoDB

- BSON file splits written automatically for future tasks

- MongoDB insertion round-robin across collections
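
The round-robin output step can be pictured with a small sketch. This is not the connector's code: the function name, the target collection names, and the documents are all hypothetical, and the sketch only shows what "round-robin across collections" means for a stream of output documents.

```python
from itertools import cycle


def round_robin_assign(documents, targets):
    """Pair each output document with the next target, cycling through targets in order."""
    if not targets:
        raise ValueError("at least one output target is required")
    target_cycle = cycle(targets)
    return [(next(target_cycle), doc) for doc in documents]


# Hypothetical output collections and documents, for illustration only.
targets = ["db1.out_a", "db1.out_b"]
documents = [{"_id": i} for i in range(5)]

assignments = round_robin_assign(documents, targets)
for target, doc in assignments:
    print(target, doc)
```

With two targets, consecutive documents alternate between them, so the write load is spread evenly regardless of document content.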

Getting Splits

• Split on a sharded cluster

– Split by chunk

– Split by shard

• Splits on replica set/standalone

– splitVector command

• BSON files

– specify max docs

– split per input file

Diagram: config servers, mongos, and three shards holding three chunks each, alongside the MongoDB Connector for Hadoop.
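
The split strategies listed above can be compared with a toy model. This is a sketch under stated assumptions, not the connector's implementation: the chunk metadata, shard names, and function names are hypothetical, and the "max docs" helper splits by document index purely for illustration.

```python
from collections import defaultdict

# Hypothetical chunk metadata: (min_key, max_key, shard_name).
CHUNKS = [
    (0, 100, "shard0"),
    (100, 200, "shard1"),
    (200, 300, "shard0"),
    (300, 400, "shard2"),
]


def splits_by_chunk(chunks):
    """One input split per chunk, each covering that chunk's key range."""
    return [{"shard": shard, "ranges": [(lo, hi)]} for lo, hi, shard in chunks]


def splits_by_shard(chunks):
    """One input split per shard, covering every chunk that shard holds."""
    ranges = defaultdict(list)
    for lo, hi, shard in chunks:
        ranges[shard].append((lo, hi))
    return [{"shard": shard, "ranges": r} for shard, r in sorted(ranges.items())]


def splits_by_max_docs(doc_count, max_docs):
    """Cut doc_count documents into [start, end) index ranges of at most max_docs."""
    if max_docs <= 0:
        raise ValueError("max_docs must be positive")
    return [(start, min(start + max_docs, doc_count))
            for start in range(0, doc_count, max_docs)]


print(len(splits_by_chunk(CHUNKS)), "splits by chunk")
print(len(splits_by_shard(CHUNKS)), "splits by shard")
print(splits_by_max_docs(10, 4))
```

Splitting by chunk yields more, smaller splits and therefore more parallel tasks; splitting by shard yields fewer, larger splits.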

MapReduce Configuration

• MongoDB input

– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat

– mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output

– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat

– mongo.output.uri = mongodb://mydb:27017/db1.collection2
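
The input and output URIs above name the host, port, database, and collection in one string. A minimal sketch of how such a URI breaks down, using only the Python standard library; the helper name `parse_mongo_uri` is hypothetical and the sketch ignores credentials, multiple hosts, and query options.

```python
from urllib.parse import urlsplit


def parse_mongo_uri(uri):
    """Split a mongodb://host:port/database.collection URI into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "mongodb":
        raise ValueError(f"not a mongodb URI: {uri!r}")
    namespace = parts.path.lstrip("/")
    # Collection names may contain dots, so split on the first dot only.
    database, sep, collection = namespace.partition(".")
    if not (database and sep and collection):
        raise ValueError(f"expected database.collection in {uri!r}")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "collection": collection,
    }


parsed = parse_mongo_uri("mongodb://mydb:27017/db1.collection1")
print(parsed)
```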

MapReduce Configuration

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat

– mapred.input.dir = hdfs:///tmp/database.bson

– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat

– mapred.output.dir = hdfs:///tmp/output.bson

Spark Usage

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using SparkContext Hadoop file API
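
As a rough illustration of the second bullet, the settings a Spark job needs can be collected in plain dictionaries. This is a sketch, not the demo's code: the helper name `spark_io_settings` and the dictionary layout are hypothetical, and the class names are written with the `com.mongodb.hadoop` package used by the Hive and Pig examples in these slides. PySpark's `SparkContext.newAPIHadoopRDD` takes an input format class name and a `conf` dictionary, so values like these could be passed to it; the key and value class names it also requires are left out here.

```python
# Class names are assumed to live in the connector's com.mongodb.hadoop package.
INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
OUTPUT_FORMAT = "com.mongodb.hadoop.MongoOutputFormat"


def spark_io_settings(input_uri, output_uri):
    """Bundle input/output format names with the URI properties each side needs."""
    return {
        "input": {"format": INPUT_FORMAT, "conf": {"mongo.input.uri": input_uri}},
        "output": {"format": OUTPUT_FORMAT, "conf": {"mongo.output.uri": output_uri}},
    }


settings = spark_io_settings(
    "mongodb://mydb:27017/db1.collection1",
    "mongodb://mydb:27017/db1.collection2",
)
print(settings["input"]["conf"])
print(settings["output"]["conf"])
```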

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONSerDe

Hive Support

MongoDB                              Hive
Primitive type (int, String, etc.)   Primitive type (int, float, etc.)
Document                             Row
Sub-document                         Struct, Map, or exploded field
Array                                Array or exploded field

● Types given by schema

● May use structs to project fields out of documents and ease access

● Can explode nested fields to make them top-level: {"customer": {"name": "Bart"}} can be accessed with "customer.name".
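
To make "exploding" concrete, here is a minimal sketch that flattens a nested document into dotted top-level names. It only illustrates the idea; the function name `explode_fields` is hypothetical and this is not how the Hive storage handler is implemented.

```python
def explode_fields(document, prefix=""):
    """Flatten nested sub-documents into top-level keys joined by dots."""
    flat = {}
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the sub-document, carrying the dotted path along.
            flat.update(explode_fields(value, path))
        else:
            flat[path] = value
    return flat


row = explode_fields({"customer": {"name": "Bart"}, "total": 3})
print(row)
```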

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson'
    USING com.mongodb.hadoop.pig.BSONStorage;

Pig Mappings

MongoDB                              Pig
Primitive type (int, String, etc.)   Primitive type (int, chararray, etc.)
Document                             Tuple (schema given)
Document                             Tuple containing a Map (no schema)
Sub-document                         Map
Array                                Bag

● Organize and prune documents by specifying a schema

● Access full document in a Map without needing a schema
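
The difference between loading with and without a schema can be sketched in a few lines. This models the mapping in the table above with Python tuples and dicts; the function name `to_pig_tuple` and the sample document are hypothetical, and the sketch is not the loader's actual code.

```python
def to_pig_tuple(document, schema=None):
    """Model how a document becomes a tuple, with or without a schema."""
    if schema is None:
        # No schema: a one-field tuple holding the whole document as a map.
        return (dict(document),)
    # Schema given: keep only the named fields, in schema order.
    return tuple(document.get(field) for field in schema)


# Hypothetical document, for illustration only.
doc = {"_id": 1, "name": "Bart", "tags": ["a", "b"], "address": {"city": "Springfield"}}

with_schema = to_pig_tuple(doc, schema=["name", "tags"])
without_schema = to_pig_tuple(doc)
print(with_schema)
print(without_schema)
```

Supplying a schema prunes the document to the listed fields, while omitting it keeps every field reachable through the map.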

Demo!

Questions?