Building a custom time series db - Colin Hemmings at #DOXLON

DESCRIPTION

Colin talks about how he architected and built a high-performance time series database from the ground up at Dataloop.io, handling hundreds of thousands of metrics per second. One of the objectives was to provide real-time graphing and alerting. If you're 'rolling your own' metrics, are interested in Node.js or highly scalable architectures, and like listening to plenty of war stories, you should enjoy this talk. Video: http://youtu.be/vx6Ms5TNtqo DevOps Exchange Meetup Group: http://bit.ly/doxlonmeetup

www.dataloop.io | @dataloopio | info@dataloop.io

Colin Hemmings | Architect

Time-series Datastore on Riak

• Collection

• Storage

• Analytics

Architecture

Just stick it in a database, right?

The Storage Problem

Past Solutions

TempoDB - the phantom menace

Past Solutions

MongoDB - return of the Jedi

Riak - Our New Hope

• Scales

• Ops Friendly

• Actually works

• No random JVM crashes here

Objectives

• Handle the load

• Semi-arbitrary queries

• Data retention windows

• Low latency

Data structure

• Resolution/rollup based queries

• Minimum 24 hours at 1 second resolution

• Second, minute and hour resolution

Data structure

• 86,400 data points per resolution

• 1 second -> 24 hour retention

• 1 minute -> 60 day retention

• 1 hour -> 10 year retention
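
The retention figures above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 10-year window of 3,650 days, and the names `RETENTION` and `points_retained` are made up for this example rather than taken from the talk.

```python
from datetime import timedelta

# Resolution -> (step between points, retention window), as listed on the slide.
RETENTION = {
    "second": (timedelta(seconds=1), timedelta(hours=24)),
    "minute": (timedelta(minutes=1), timedelta(days=60)),
    "hour": (timedelta(hours=1), timedelta(days=3650)),  # assumption: 10 years = 3,650 days
}


def points_retained(step: timedelta, window: timedelta) -> int:
    """Number of points one metric keeps at a given resolution."""
    return window // step


points = {name: points_retained(step, window) for name, (step, window) in RETENTION.items()}
total_per_metric = sum(points.values())

for name, count in points.items():
    print(f"{name}: {count:,} points")
print(f"per metric: {total_per_metric:,} points")
```

Under these assumptions the second and minute tiers each hold exactly 86,400 points and the hour tier slightly more, which is where the round figure of about 250k points per metric comes from.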

Data structure

• per metric -> 250k data points

• 1,000 metrics per host -> 250M data points

• 300 hosts per user -> 75B data points

• 1,000 customers -> 75T data points!

Simple Riak Storage

• Timestamp keyed object per metric value

• 2i and MapReduce are too slow

• Especially across millions of keys

• Writes would soon cripple our Riak cluster

Intelligent Riak Storage

• Units of storage: time based data blocks

• Compute keys

• Mutable data windows
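
One way to picture computed keys for time-based blocks is sketched below. The key layout (metric, resolution, block boundary) follows the keys shown in the query example in these slides; the block spans in `BLOCK_SPAN` and the function names are assumptions for illustration, since the slides do not say how large a block is.

```python
# Assumed block spans in seconds per resolution; the slides do not give them.
BLOCK_SPAN = {
    "second": 3600,      # one hour of 1-second points per block
    "minute": 86400,     # one day of 1-minute points per block
    "hour": 30 * 86400,  # thirty days of 1-hour points per block
}


def block_start(ts: int, resolution: str) -> int:
    """Round a Unix timestamp down to the start of its block."""
    span = BLOCK_SPAN[resolution]
    return ts - ts % span


def block_key(metric: str, resolution: str, ts: int) -> str:
    """Compute the key of the block holding the point at `ts`."""
    return f"{metric}_{resolution}_{block_start(ts, resolution)}"


print(block_key("cpu", "second", 7265))
```

Because the key is a pure function of the metric name, the resolution and the timestamp, both writers and readers can derive it without any index lookup.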

Query

Get cpu metrics for host A for period t1-t4 at 1 second resolution

• Pull the correct blocks from riak, based on block boundaries

• GET /buckets/host_a/keys/cpu_second_t1b

• GET /buckets/host_a/keys/cpu_second_t2b

• GET /buckets/host_a/keys/cpu_second_t3b

• GET /buckets/host_a/keys/cpu_second_t4b
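
As a rough sketch of how the list of GETs above could be derived from a query range, the function below walks the block boundaries that overlap the range. The one-hour `BLOCK_SPAN` and the name `block_paths` are assumptions for illustration; only the `/buckets/<host>/keys/<key>` path shape is taken from the slide.

```python
BLOCK_SPAN = 3600  # assumed block span in seconds; not stated in the slides


def block_paths(host: str, metric: str, resolution: str, start: int, end: int) -> list[str]:
    """Return one GET path per block overlapping the inclusive range [start, end]."""
    first = start - start % BLOCK_SPAN  # boundary of the block containing `start`
    return [
        f"/buckets/{host}/keys/{metric}_{resolution}_{boundary}"
        for boundary in range(first, end + 1, BLOCK_SPAN)
    ]


for path in block_paths("host_a", "cpu", "second", 3000, 12000):
    print("GET", path)
```

With these example timestamps the range touches four blocks, so four keys are fetched directly and no key listing, 2i lookup or MapReduce job is involved.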

Query

• Filter points outside of our query range

• Aggregate all the data points

• Perform further operations for more complex queries
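
The filter and aggregate steps might look like the sketch below. It assumes each fetched block has already been decoded into a mapping of timestamp to value, which is a guess at the block format rather than something the slides specify, and `filter_points` and `aggregate` are hypothetical names.

```python
from statistics import mean
from typing import Callable, Optional


def filter_points(blocks: list[dict[int, float]], start: int, end: int) -> list[tuple[int, float]]:
    """Merge blocks and drop points outside the inclusive range [start, end]."""
    points = [
        (ts, value)
        for block in blocks
        for ts, value in block.items()
        if start <= ts <= end
    ]
    return sorted(points)


def aggregate(points: list[tuple[int, float]], fn: Callable = mean) -> Optional[float]:
    """Reduce the filtered values with `fn`; return None when nothing is in range."""
    values = [value for _, value in points]
    return fn(values) if values else None


blocks = [{0: 1.0, 1800: 2.0}, {3600: 3.0, 5400: 5.0}]
in_range = filter_points(blocks, 1800, 3600)
print(in_range, aggregate(in_range))
```

Blocks are fetched whole, so the first and last block usually carry points that fall outside the requested range; trimming them in the application tier is part of the extra work this design pushes out of the database.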

Expiring

• Cleanup worker

• Removes keys that fall outside the retention window

• Host-keyed buckets make it easier to clear all data for a host or an account
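
A cleanup worker along these lines could decide which block keys to delete as sketched below. The retention windows come from the slides (taking 10 years as 3,650 days); the block spans, the key parsing and the function names are assumptions for illustration, and the actual deletes against Riak are left out.

```python
# Retention per resolution in seconds, from the slides (10 years taken as 3,650 days).
RETENTION = {"second": 86400, "minute": 60 * 86400, "hour": 3650 * 86400}
# Assumed block spans in seconds; the slides do not give them.
BLOCK_SPAN = {"second": 3600, "minute": 86400, "hour": 30 * 86400}


def is_expired(key: str, now: int) -> bool:
    """True when every point in the block is older than the retention window."""
    _metric, resolution, start = key.rsplit("_", 2)
    block_end = int(start) + BLOCK_SPAN[resolution]
    return block_end <= now - RETENTION[resolution]


def expired_keys(keys: list[str], now: int) -> list[str]:
    """Select the keys a cleanup pass would delete."""
    return [key for key in keys if is_expired(key, now)]


candidates = ["cpu_second_7200", "cpu_second_10800", "disk_io_minute_0"]
print(expired_keys(candidates, now=100000))
```

A block is only removed once its newest point has aged out, so a query near the edge of the retention window still finds the block it needs.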

Our cluster

• Riak 2.0

• 5 nodes on LevelDB

• Each with 2 x 500GB striped SSDs

• Average 1ms GET and PUT latencies

Comments

• Awesome, especially for ops

• A bit more work in the application tier

• Always compute keys; avoid 2i and MapReduce

• Looking forward to using the new data types
