1
Operational Intelligence with MongoDB
Edouard Servan-Schreiber, Ph.D., Director for Solution Architecture
October 11th 2012
2
The goal
Real Time Analytics Engine
Data Source, Data Source, Data Source
3
Sample Customers
4
Solution goals
High write volume
• Lots of data sources
• Lots of data from each source
Dynamic queries
• Users can drill down into data
Fast queries
• Lots of clients
• High request rate
Minimize delay between collection & query
• How long before an event appears in a report?
5
Systems Architecture
Data Sources
• Asynchronous writes
• Upserts avoid unnecessary reads
• Writes buffered in RAM and flushed to disk in bulk
• Spread writes over multiple shards
6
Sample data
Original Event Data
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
As BSON
doc = {
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}
Insert to MongoDB
db.logs.insert( doc )
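A minimal Python sketch of the conversion step, turning the raw log line into a document shaped like the one above. The regular expression, LOG_PATTERN and parse_log_line are illustrative assumptions, not part of the original deck; _id is left out because MongoDB generates it on insert.

```python
import re
from datetime import datetime, timezone

# Assumed pattern for the Apache combined log format shown above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Turn one log line into a dict shaped like the BSON document."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognized log line: %r" % line)
    doc = m.groupdict()
    # Parse the timestamp with its UTC offset, then normalize to UTC.
    local = datetime.strptime(doc["time"], "%d/%b/%Y:%H:%M:%S %z")
    doc["time"] = local.astimezone(timezone.utc)
    return doc

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')
doc = parse_log_line(line)
```

The -0700 offset is why 13:55:36 in the log becomes 20:55:36Z in the stored document.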
7
Dynamic Queries
Find all logs for a URL
db.logs.find( { 'path' : '/index.html' } )
Find all logs for a time range
db.logs.find( { 'time' : { '$gte' : new Date(2012,0), '$lt' : new Date(2012,1) } } );
Find all logs for a host over a range of dates
db.logs.find( { 'host' : '127.0.0.1', 'time' : { '$gte' : new Date(2012,0), '$lt' : new Date(2012,1) } } );
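The same filters can be built from Python as plain dictionaries. A small sketch, assuming a hypothetical helper named month_range_filter; note that JavaScript months are zero-based (new Date(2012,0) is January) while Python's datetime months start at 1.

```python
from datetime import datetime

def month_range_filter(year, month, host=None):
    """Build a filter for one calendar month, optionally limited to one host."""
    start = datetime(year, month, 1)
    # Roll over to January of the next year after December.
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    query = {"time": {"$gte": start, "$lt": end}}
    if host is not None:
        query["host"] = host
    return query

january = month_range_filter(2012, 1, host="127.0.0.1")
```

With a PyMongo collection handle, the resulting dictionary would be passed to find() unchanged. The half-open range ($gte start, $lt end) keeps an event at midnight on February 1st out of the January results.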
8
Three Approaches
• Aggregation Framework for on-demand rollups
• Map/Reduce Framework for background rollups
• Pre-Aggregation for real-time reporting
9
Aggregation Framework (New in version 2.2!)
Requests per day by URL
db.logs.aggregate( [
    { '$match': {
        'time': { '$gte': new Date(2012,0), '$lt': new Date(2012,1) } } },
    { '$project': {
        'path': 1,
        'date': {
            'y': { '$year': '$time' },
            'm': { '$month': '$time' },
            'd': { '$dayOfMonth': '$time' } } } },
    { '$group': {
        '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' },
        'hits': { '$sum': 1 } } }
] )
Pipeline operators: $project, $match, $limit, $skip, $unwind, $group, $sort
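To make the three stages of the requests-per-day pipeline concrete, here is a plain-Python sketch of what they compute over an in-memory list. It is an illustration only, not how the server executes the pipeline; requests_per_day_by_url and the sample logs are made up for the example.

```python
from collections import Counter
from datetime import datetime

def requests_per_day_by_url(logs, start, end):
    """Count hits per (path, year, month, day) for logs with start <= time < end."""
    hits = Counter()
    for log in logs:
        t = log["time"]
        if not (start <= t < end):                   # $match
            continue
        key = (log["path"], t.year, t.month, t.day)  # $project
        hits[key] += 1                               # $group with $sum: 1
    return hits

logs = [
    {"path": "/index.html", "time": datetime(2012, 1, 1, 9, 30)},
    {"path": "/index.html", "time": datetime(2012, 1, 1, 17, 5)},
    {"path": "/index.html", "time": datetime(2012, 1, 2, 8, 0)},
    {"path": "/about.html", "time": datetime(2012, 2, 3, 8, 0)},  # outside the range
]
daily = requests_per_day_by_url(logs, datetime(2012, 1, 1), datetime(2012, 2, 1))
```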
10
Aggregation Framework
{
    'ok': 1,
    'result': [
        { '_id': { 'p': '/index.html', 'y': 2012, 'm': 1, 'd': 1 }, 'hits': 124 },
        { '_id': { 'p': '/index.html', 'y': 2012, 'm': 1, 'd': 2 }, 'hits': 245 },
        { '_id': { 'p': '/index.html', 'y': 2012, 'm': 1, 'd': 3 }, 'hits': 322 },
        { '_id': { 'p': '/index.html', 'y': 2012, 'm': 1, 'd': 4 }, 'hits': 175 },
        { '_id': { 'p': '/index.html', 'y': 2012, 'm': 1, 'd': 5 }, 'hits': 94 }
    ]
}
11
Map Reduce – Map Phase
Generate hourly rollups from log data
var map = function() {
    var key = {
        p: this.path,
        d: new Date(
            this.ts.getFullYear(), this.ts.getMonth(), this.ts.getDate(),
            this.ts.getHours(), 0, 0, 0)
    };
    emit( key, { hits: 1 } );
}
12
Map Reduce – Reduce Phase
Generate hourly rollups from log data
var reduce = function(key, values) {
    var r = { hits: 0 };
    values.forEach(function(v) {
        r.hits += v.hits;
    });
    return r;
}
13
Map Reduce
Generate hourly rollups from log data
cutoff = new Date(2012,0,1)
query = { 'ts': { '$gte': last_run, '$lt': cutoff } }
db.logs.mapReduce( map, reduce, { 'query': query, 'out': { 'reduce' : 'stats.hourly' } } )
last_run = cutoff
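A plain-Python sketch of what an incremental run does, including the 'reduce' output mode, which re-reduces new results against what is already stored. This is a simulation over in-memory data, not the server's implementation; run_map_reduce, map_log and reduce_hits are illustrative names. The sketch uses a half-open window (last_run <= ts < cutoff) so an event stamped exactly at a cutoff is counted once.

```python
from collections import defaultdict
from datetime import datetime

def map_log(log):
    """Map phase: emit (path, hour bucket) -> {hits: 1}."""
    hour = log["ts"].replace(minute=0, second=0, microsecond=0)
    return (log["path"], hour), {"hits": 1}

def reduce_hits(key, values):
    """Reduce phase: sum the partial hit counts for one key."""
    return {"hits": sum(v["hits"] for v in values)}

def run_map_reduce(logs, last_run, cutoff, out):
    """Process logs with last_run <= ts < cutoff and fold results into out."""
    grouped = defaultdict(list)
    for log in logs:
        if last_run <= log["ts"] < cutoff:
            key, value = map_log(log)
            grouped[key].append(value)
    for key, values in grouped.items():
        new = reduce_hits(key, values)
        # 'reduce' output mode: re-reduce against whatever is already stored.
        out[key] = reduce_hits(key, [out[key], new]) if key in out else new
    return out

stats = {}
logs = [
    {"path": "/index.html", "ts": datetime(2012, 1, 1, 0, 15)},
    {"path": "/index.html", "ts": datetime(2012, 1, 1, 0, 45)},
    {"path": "/index.html", "ts": datetime(2012, 1, 1, 1, 5)},
]
# Two consecutive runs; the second picks up where the first stopped.
run_map_reduce(logs, datetime(2012, 1, 1, 0, 0), datetime(2012, 1, 1, 0, 30), stats)
run_map_reduce(logs, datetime(2012, 1, 1, 0, 30), datetime(2012, 1, 1, 2, 0), stats)
```

Re-reducing only works because the reduce function returns the same shape it consumes ({hits: n}), so a stored result can be fed back in as one more value.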
14
Map Reduce Output
> db.stats.hourly.find()
{ '_id': { 'p': '/index.html', 'd': ISODate("2012-01-01T00:00:00Z") }, 'value': { 'hits': 124 } }
{ '_id': { 'p': '/index.html', 'd': ISODate("2012-01-01T01:00:00Z") }, 'value': { 'hits': 245 } }
{ '_id': { 'p': '/index.html', 'd': ISODate("2012-01-01T02:00:00Z") }, 'value': { 'hits': 322 } }
{ '_id': { 'p': '/index.html', 'd': ISODate("2012-01-01T03:00:00Z") }, 'value': { 'hits': 175 } }
... More ...
15
Chained Map Reduce
Collection 1: Raw Logs
Map Reduce (runs every hour)
Collection 2: Hourly Stats
Map Reduce (runs every day)
Collection 3: Daily Stats
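The second stage of the chain reads the hourly output rather than the raw logs. A plain-Python sketch of that daily rollup, assuming hourly stats keyed by (path, hour) as in the map function; roll_up_daily and the sample values are illustrative.

```python
from collections import defaultdict
from datetime import datetime

def roll_up_daily(hourly_stats):
    """Fold {(path, hour): {'hits': n}} into {(path, day): {'hits': n}}."""
    daily = defaultdict(lambda: {"hits": 0})
    for (path, hour), value in hourly_stats.items():
        day = hour.replace(hour=0)  # truncate the hour bucket to midnight
        daily[(path, day)]["hits"] += value["hits"]
    return dict(daily)

hourly_stats = {
    ("/index.html", datetime(2012, 1, 1, 0)): {"hits": 124},
    ("/index.html", datetime(2012, 1, 1, 1)): {"hits": 245},
    ("/index.html", datetime(2012, 1, 2, 0)): {"hits": 10},
}
daily_stats = roll_up_daily(hourly_stats)
```

Each stage touches far fewer documents than the one before it: at most 24 hourly documents per URL per day instead of every raw log entry.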
16
Pre-Aggregation
Data for URL / Date
{
    _id: "20101010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif" },
    daily: 5468426,
    hourly: { "0": 227850, "1": 210231, ... "23": 20457 },
    minute: { "0": 3612, "1": 3241, ... "1439": 2819 }
}
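WARNING: entries in a BSON document or array are not random-access in MongoDB. Reaching a late entry such as "1439" means scanning past every entry before it.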
17
Pre-Aggregation
Data for URL / Date
{
    _id: "20101010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif" },
    daily: 5468426,
    minute: {
        "0": { "0": 3612, "1": 3241, ... "59": 2130 },
        "1": { ... },
        ...
        "23": { ... }
    }
}
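A rough worked comparison of the two layouts, assuming keys are stored in ascending order and a lookup scans them one by one. The functions count how many keys are passed over before the target is reached; both names are illustrative.

```python
def flat_skips(hour, minute):
    """Keys scanned past in a flat 'minute' document keyed "0".."1439"."""
    return hour * 60 + minute

def nested_skips(hour, minute):
    """Keys scanned past when minutes are grouped under their hour."""
    return hour + minute

worst_flat = flat_skips(23, 59)      # last minute of the day: 1439 keys
worst_nested = nested_skips(23, 59)  # 23 hour keys, then 59 minute keys: 82
```

Under these assumptions, nesting minutes by hour cuts the worst case from 1439 skipped keys to 82.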
18
Pre-Aggregation
Data for URL / Date
from datetime import datetime, time

id_daily = dt_utc.strftime('%Y%m%d/') + site + page
hour = dt_utc.hour
minute = dt_utc.minute

# Get a datetime that only includes date info
d = datetime.combine(dt_utc.date(), time.min)
query = {
    '_id': id_daily,
    'metadata': { 'date': d, 'site': site, 'page': page } }
update = { '$inc': {
    'hourly.%d' % (hour,): 1,
    'minute.%d.%d' % (hour, minute): 1 } }
db.stats.daily.update(query, update, upsert=True)
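To show what the upsert with $inc does to a document, here is an in-memory simulation using a plain dictionary as the store. It mimics the behavior (create the document and any missing counters on first hit, then increment), not the server's implementation; inc_path, record_hit and the sample id are illustrative.

```python
def inc_path(doc, dotted_path, amount=1):
    """Increment a nested counter, creating missing levels (like $inc)."""
    *parents, leaf = dotted_path.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = node.get(leaf, 0) + amount

def record_hit(store, id_daily, metadata, hour, minute):
    """Upsert: create the daily document on first sight, then bump counters."""
    doc = store.setdefault(id_daily, {"_id": id_daily, "metadata": metadata})
    inc_path(doc, "hourly.%d" % hour)
    inc_path(doc, "minute.%d.%d" % (hour, minute))
    return doc

store = {}
meta = {"site": "site-1", "page": "/apache_pb.gif"}
record_hit(store, "20001010/site-1/apache_pb.gif", meta, 20, 55)
record_hit(store, "20001010/site-1/apache_pb.gif", meta, 20, 55)
record_hit(store, "20001010/site-1/apache_pb.gif", meta, 20, 56)
```

No read happens before the write: the first hit of the day creates the document, and every later hit only increments counters in place.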
19
Reporting
JavaScript Charting
20
Apache Hadoop
Log Aggregation with MongoDB as sink
More complex aggregations or integration with tools like Mahout
21
Q&A