MongoDB Berlin Aggregation

Aggregation with MongoDB and introducing the new aggregation framework... think Unix pipes for JSON data!

Aggregation

New framework in MongoDB

Alvin Richards

Technical Director, EMEA

alvin@10gen.com

@jonnyeight

1

What problem are we solving?

• Map/Reduce can be used for aggregation…
  • Currently being used for totaling, averaging, etc.
• Map/Reduce is a big hammer
  • Simpler tasks should be easier
• Shouldn’t need to write JavaScript
  • Avoid the overhead of the JavaScript engine
• We’re seeing requests for help in handling complex documents
  • Select only matching subdocuments or arrays

2

How will we solve the problem?

• New aggregation framework
  • Declarative framework (no JavaScript)
  • Describe a chain of operations to apply
  • Expression evaluation
    • Return computed values
  • Framework: new operations added easily
  • C++ implementation

3

Aggregation - Pipelines

• Aggregation requests specify a pipeline
• A pipeline is a series of operations
• Members of a collection are passed through a pipeline to produce a result
  • e.g. ps -ef | grep -i mongod
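To make the Unix-pipe analogy concrete, here is a minimal sketch in plain JavaScript (not MongoDB code). It treats a pipeline as a list of functions that each take an array of documents and return a new array. The helper `runPipeline`, the stage `grepMongod`, and the process list are all hypothetical names and data made up for illustration.

```javascript
// Illustration only: plain JavaScript, not the MongoDB implementation.
// Each stage takes an array of documents and returns a new array.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical process list standing in for the output of `ps -ef`.
const processes = [
  { pid: 101, cmd: "/usr/bin/mongod --dbpath /data" },
  { pid: 102, cmd: "/usr/sbin/sshd" },
  { pid: 103, cmd: "/usr/bin/MONGOD --port 27018" },
];

// Stage equivalent to `grep -i mongod`: keep case-insensitive matches.
const grepMongod = (docs) => docs.filter((d) => /mongod/i.test(d.cmd));

const matchingPids = runPipeline(processes, [grepMongod]).map((d) => d.pid);
console.log(matchingPids);
```

Adding another stage is just a matter of appending another function to the array, which is the property the pipeline model relies on.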

4

Example - twitter

{
   "_id" : ObjectId("4f47b268fb1c80e141e9888c"),
   "user" : {
       "friends_count" : 73,
       "location" : "Brazil",
       "screen_name" : "Bia_cunha1",
       "name" : "Beatriz Helena Cunha",
       "followers_count" : 102
   }
}

• Find the number of followers and the number of friends by location

5

Example - twitter

db.tweets.aggregate(
    {$match:
        {"user.friends_count":   { $gt: 0 },
         "user.followers_count": { $gt: 0 }
        }
    },
    {$project:
        { location:  "$user.location",
          friends:   "$user.friends_count",
          followers: "$user.followers_count"
        }
    },
    {$group:
        { _id:       "$location",
          friends:   {$sum: "$friends"},
          followers: {$sum: "$followers"}
        }
    }
);

6

Example - twitter

db.tweets.aggregate(
    // Predicate
    {$match:
        {"user.friends_count":   { $gt: 0 },
         "user.followers_count": { $gt: 0 }
        }
    },
    // Parts of the document you want to project
    {$project:
        { location:  "$user.location",
          friends:   "$user.friends_count",
          followers: "$user.followers_count"
        }
    },
    // Function to apply to the result set
    {$group:
        { _id:       "$location",
          friends:   {$sum: "$friends"},
          followers: {$sum: "$followers"}
        }
    }
);

9

Example - twitter

{
   "result" : [
     {
       "_id" : "Far Far Away",
       "friends" : 344,
       "followers" : 789
     },
     ...
   ],
   "ok" : 1
}
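To see what the three stages of the twitter example do to the data, the sketch below walks the same steps by hand in plain JavaScript, without a database. The four sample tweets are hypothetical values made up for illustration, and the code is not how the server implements these stages.

```javascript
// Hypothetical sample tweets; the values are made up for illustration.
const tweets = [
  { user: { location: "Brazil", friends_count: 73, followers_count: 102 } },
  { user: { location: "Brazil", friends_count: 10, followers_count: 5 } },
  { user: { location: "Berlin", friends_count: 0, followers_count: 40 } },
  { user: { location: "Berlin", friends_count: 7, followers_count: 3 } },
];

// $match: keep tweets whose user has at least one friend and one follower.
const matched = tweets.filter(
  (t) => t.user.friends_count > 0 && t.user.followers_count > 0
);

// $project: pull the nested fields up to the top level under new names.
const projected = matched.map((t) => ({
  location: t.user.location,
  friends: t.user.friends_count,
  followers: t.user.followers_count,
}));

// $group: one bucket per location, summing friends and followers.
const buckets = new Map();
for (const doc of projected) {
  const bucket =
    buckets.get(doc.location) ||
    { _id: doc.location, friends: 0, followers: 0 };
  bucket.friends += doc.friends;
  bucket.followers += doc.followers;
  buckets.set(doc.location, bucket);
}
const byLocation = [...buckets.values()];
console.log(byLocation);
```

The third sample tweet is dropped by the predicate, so it never contributes to the Berlin totals.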

10

Pipeline Operations

• $match
  • Uses a query predicate (like .find({…})) as a filter
• $project
  • Uses a sample document to determine the shape of the result (similar to .find()’s optional argument)
  • This can include computed values
• $unwind
  • Hands out array elements one at a time
• $group
  • Aggregates items into buckets defined by a key

11

Pipeline Operations (continued)

• $sort
  • Sort documents
• $limit
  • Only allow the specified number of documents to pass
• $skip
  • Skip over the specified number of documents
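The sort, skip, and limit operations combine naturally for paging through ordered results. The following is a minimal in-memory sketch in plain JavaScript. The helpers `sortBy`, `skip`, and `limit` and the sample scores are hypothetical, and only mimic the behaviour described on the slide.

```javascript
// Hypothetical sample documents.
const players = [
  { name: "a", score: 5 },
  { name: "b", score: 9 },
  { name: "c", score: 1 },
  { name: "d", score: 7 },
];

// Each helper returns a stage: a function from documents to documents.
const sortBy = (key, direction) => (docs) =>
  [...docs].sort((x, y) => direction * (x[key] - y[key]));
const skip = (n) => (docs) => docs.slice(n);
const limit = (n) => (docs) => docs.slice(0, n);

// Sort by score descending, skip the top entry, then let two documents pass.
const stages = [sortBy("score", -1), skip(1), limit(2)];
const page = stages.reduce((docs, stage) => stage(docs), players);
console.log(page.map((p) => p.name));
```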

12

Computed Expressions

• Available in $project operations
• Prefix expression language
  • {$add: ["$field1", "$field2"]}
  • {$ifNull: ["$field1", "$field2"]}
  • Nesting: {$add: ["$field1", {$ifNull: ["$field2", "$field3"]}]}
• Other functions…
  • $divide, $mod, $multiply
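A prefix expression language is easy to evaluate recursively: the operator comes first, and its arguments are themselves expressions. The sketch below is a toy evaluator in plain JavaScript for a handful of the operators named above. It is an illustration under simplifying assumptions (top-level field names only, no type checking), not the server's C++ implementation, and `evaluate` is a hypothetical helper.

```javascript
// Toy evaluator for a few prefix expressions. Illustration only.
function evaluate(expr, doc) {
  // A string starting with "$" is a reference to a top-level field.
  if (typeof expr === "string" && expr.startsWith("$")) {
    const value = doc[expr.slice(1)];
    return value === undefined ? null : value;
  }
  // Anything that is not an object is a literal value.
  if (expr === null || typeof expr !== "object") return expr;
  // An operator expression has a single key, e.g. { $add: [a, b] }.
  const [op] = Object.keys(expr);
  const args = expr[op].map((arg) => evaluate(arg, doc));
  switch (op) {
    case "$add": return args.reduce((sum, n) => sum + n, 0);
    case "$multiply": return args.reduce((product, n) => product * n, 1);
    case "$divide": return args[0] / args[1];
    case "$mod": return args[0] % args[1];
    case "$ifNull": return args[0] === null ? args[1] : args[0];
    default: throw new Error("unsupported operator: " + op);
  }
}

// field2 is missing, so $ifNull falls back to field3.
const sample = { field1: 2, field3: 40 };
const nested = { $add: ["$field1", { $ifNull: ["$field2", "$field3"] }] };
const total = evaluate(nested, sample);
console.log(total);
```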

13

Computed Expressions

• String functions
  • $toUpper, $toLower, $substr
• Date field extraction
  • $year, $month, $dayOfMonth, $hour...
• Date arithmetic
• $ifNull
• Ternary conditional
  • Return one of two values based on a predicate
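As a rough guide to what these functions compute, the sketch below defines plain JavaScript stand-ins for a few of them. The ternary conditional is shown under the name `$cond`, and dates are read in UTC. The `created` field and its value are hypothetical, and these stand-ins are not the server's implementation.

```javascript
// Plain JavaScript stand-ins for a few expression functions.
const ops = {
  $toUpper: (s) => s.toUpperCase(),
  $toLower: (s) => s.toLowerCase(),
  // Substring from a start offset, for a given length.
  $substr: (s, start, length) => s.slice(start, start + length),
  $year: (d) => d.getUTCFullYear(),
  $month: (d) => d.getUTCMonth() + 1, // JavaScript months start at 0
  $hour: (d) => d.getUTCHours(),
  // Ternary conditional: pick one of two values based on a predicate.
  $cond: (test, ifTrue, ifFalse) => (test ? ifTrue : ifFalse),
};

// Hypothetical document; `created` is made up for illustration.
const tweet = {
  screen_name: "Bia_cunha1",
  followers_count: 102,
  created: new Date(Date.UTC(2012, 2, 20, 14, 30)),
};

const summary = {
  upper: ops.$toUpper(tweet.screen_name),
  prefix: ops.$substr(tweet.screen_name, 0, 3),
  year: ops.$year(tweet.created),
  month: ops.$month(tweet.created),
  hour: ops.$hour(tweet.created),
  tier: ops.$cond(tweet.followers_count > 100, "popular", "regular"),
};
console.log(summary);
```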

14

Projections

• $project can reshape results
  • Include or exclude fields
  • Computed fields
    • Arithmetic expressions
  • Pull fields from nested documents to the top
  • Push fields from the top down into new virtual documents
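The reshaping options above can be sketched as a $project stage next to a hand-written equivalent. In the sketch below, `projectStage` is an ordinary JavaScript object in the shape the shell would accept, and `reshape` shows the intended effect on one document. The `text` field and the output names `connections` and `content.body` are hypothetical.

```javascript
// A $project specification written as a plain object (names are illustrative).
const projectStage = {
  $project: {
    screen_name: "$user.screen_name", // pull a nested field to the top
    connections: {                     // computed field (arithmetic)
      $add: ["$user.friends_count", "$user.followers_count"],
    },
    content: { body: "$text" },        // push a top-level field down
  },
};

// Hand-written equivalent of the reshaping, for one document.
const reshape = (doc) => ({
  _id: doc._id,
  screen_name: doc.user.screen_name,
  connections: doc.user.friends_count + doc.user.followers_count,
  content: { body: doc.text },
});

// Hypothetical input; `text` is made up for illustration.
const input = {
  _id: 1,
  text: "hello",
  user: { screen_name: "Bia_cunha1", friends_count: 73, followers_count: 102 },
};
const reshaped = reshape(input);
console.log(reshaped);
```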

15

Unwinding

• $unwind can “stream” arrays
  • Array values are doled out one at a time in the context of their surrounding documents
  • Makes it possible to filter out elements before returning
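Unwinding can be pictured as emitting one copy of the surrounding document per array element. The sketch below is a minimal in-memory version in plain JavaScript. The `unwind` helper and the sample posts are hypothetical, and the code only illustrates the behaviour described above.

```javascript
// Emit one copy of each document per element of the named array field.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    for (const element of doc[field] || []) {
      // The copy keeps every other field; the array is replaced by one element.
      out.push({ ...doc, [field]: element });
    }
  }
  return out;
}

// Hypothetical sample documents.
const posts = [
  { _id: 1, title: "a", tags: ["x", "y"] },
  { _id: 2, title: "b", tags: [] },
];

const unwound = unwind(posts, "tags");
// Once unwound, individual elements can be filtered like any other field.
const onlyY = unwound.filter((d) => d.tags === "y");
console.log(unwound, onlyY);
```

In this sketch a document with an empty array produces no output at all.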

16

Grouping

• $group aggregation expressions
  • Define a grouping key as the _id of the result
  • Total grouped column values: $sum
  • Average grouped column values: $avg
  • Collect grouped column values in an array or set: $push, $addToSet
  • Other functions
    • $min, $max, $first, $last
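The grouping functions differ only in how they fold the values that land in a bucket. The sketch below computes all of them for one value field in plain JavaScript. The `group` helper, the output field names, and the sample sales data are hypothetical, and this is not the server's implementation.

```javascript
// Bucket documents by a key field, then fold one value field several ways.
function group(docs, keyField, valueField) {
  const buckets = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(doc[valueField]);
  }
  return [...buckets].map(([key, values]) => {
    const total = values.reduce((a, b) => a + b, 0);
    return {
      _id: key,                        // the grouping key
      sum: total,                      // like $sum
      avg: total / values.length,      // like $avg
      push: values,                    // like $push: keeps duplicates
      addToSet: [...new Set(values)],  // like $addToSet: unique values
      min: Math.min(...values),        // like $min
      max: Math.max(...values),        // like $max
      first: values[0],                // like $first
      last: values[values.length - 1], // like $last
    };
  });
}

// Hypothetical sample documents.
const sales = [
  { region: "north", amount: 10 },
  { region: "south", amount: 4 },
  { region: "north", amount: 10 },
  { region: "north", amount: 40 },
];
const grouped = group(sales, "region", "amount");
console.log(grouped);
```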

17

Sorting

• $sort can sort documents
  • Sort specifications are the same as today, e.g., $sort:{ key1: 1, key2: -1, …}
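A sort specification reads as a list of keys in priority order, with 1 for ascending and -1 for descending. The sketch below turns such a specification into a comparator in plain JavaScript. The `sortBySpec` helper and the sample data are hypothetical, and it handles only values that JavaScript can compare with `<` and `>`.

```javascript
// Sort a copy of the documents according to a { key: 1 | -1 } specification.
function sortBySpec(docs, spec) {
  const keys = Object.entries(spec); // keys are applied in the order written
  return [...docs].sort((a, b) => {
    for (const [key, direction] of keys) {
      if (a[key] < b[key]) return -direction;
      if (a[key] > b[key]) return direction;
      // Equal on this key: fall through to the next key.
    }
    return 0;
  });
}

// Hypothetical sample documents.
const people = [
  { city: "Berlin", age: 30 },
  { city: "Athens", age: 25 },
  { city: "Berlin", age: 41 },
];
const sorted = sortBySpec(people, { city: 1, age: -1 });
console.log(sorted);
```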

18

Demo

Demo files are at https://gist.github.com/2036709

19

Usage Tips

• Use $match in a pipeline as early as possible
  • The query optimizer can then be used to choose an index and avoid scanning the entire collection
• Use $sort in a pipeline as early as possible
  • The query optimizer can sometimes be used to choose an index to scan instead of sorting the result
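As an illustration of the ordering advice, the sketch below writes a pipeline with the filter and the sort placed ahead of the reshaping stage. The pipeline is a plain JavaScript array in the shape an aggregation request would use. The field names come from the twitter example, while the choice of stages and values is hypothetical.

```javascript
// Hypothetical pipeline: filter first, then sort, then trim and reshape.
const pipeline = [
  { $match: { "user.location": "Brazil" } },      // first: can use an index
  { $sort: { "user.followers_count": -1 } },      // early: may use an index
  { $limit: 10 },
  { $project: { name: "$user.name", followers: "$user.followers_count" } },
];

// Print the stage order to make the layout easy to check.
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> "));
```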

20

Driver Support

• Initial version is a command
  • For any language, build a JSON database object, and execute the command
  • { aggregate : <collection>, pipeline : […] }
• Beware of result size limit of 16MB
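Because the initial version is a plain command, a driver only needs to assemble a document of the shape shown above. The sketch below builds one in JavaScript. The `buildAggregateCommand` helper is hypothetical, and the collection name and pipeline are taken from the twitter example.

```javascript
// Build the command document: collection name plus the list of stages.
function buildAggregateCommand(collection, pipeline) {
  return { aggregate: collection, pipeline: pipeline };
}

const command = buildAggregateCommand("tweets", [
  { $match: { "user.followers_count": { $gt: 0 } } },
  { $group: { _id: "$user.location", followers: { $sum: "$user.followers_count" } } },
]);

// In the mongo shell this document could be sent with db.runCommand(command).
console.log(JSON.stringify(command, null, 2));
```

The reply is a single document with the output under "result", which is why the 16MB limit applies to the whole result.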

21

When is this being released?

• Now!
  • 2.1.0 - unstable
  • 2.2.0 - stable (soon)

22

Sharding support

• Initial release supports sharding
• Mongos analyzes pipeline
  • Forwards operations up to $group or $sort to shards
  • Combines shard server results and returns them
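One way to picture the analysis step is as a split of the pipeline at the first $group or $sort. The sketch below does exactly that in plain JavaScript. It is a simplified illustration and not how mongos is implemented: it reads "up to" as exclusive, which is an assumption, and the real router may also push part of the $group or $sort work to the shards. The `splitForShards` helper is hypothetical.

```javascript
// Split a pipeline at the first $group or $sort stage (simplified model).
function splitForShards(pipeline) {
  const index = pipeline.findIndex(
    (stage) => "$group" in stage || "$sort" in stage
  );
  if (index === -1) {
    // No $group or $sort: in this model every stage can run on the shards.
    return { shardPart: pipeline, mergePart: [] };
  }
  return {
    shardPart: pipeline.slice(0, index), // sent to each shard
    mergePart: pipeline.slice(index),    // applied to the combined results
  };
}

const split = splitForShards([
  { $match: { "user.followers_count": { $gt: 0 } } },
  { $project: { location: "$user.location" } },
  { $group: { _id: "$location" } },
  { $sort: { _id: 1 } },
]);
console.log(split.shardPart.length, split.mergePart.length);
```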

23

Pipeline Operations – Future

• $out
  • Saves the document stream to a collection
  • Similar to M/R $out, but with sharded output
  • Functions like a tee, so that intermediate results can be saved

24

Documentation, Bug Reports

• http://www.mongodb.org/display/DOCS/Aggregation+Framework
• https://jira.mongodb.org/browse/SERVER/component/10840

25

@mongodb

Conferences, appearances, and meetups
http://www.10gen.com/events

http://bit.ly/mongoE
Facebook | Twitter | LinkedIn

http://linkd.in/joinmongo

download at mongodb.org

alvin@10gen.com

26
