How to create Treasure Data #dotsbigdata
Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015

Page 1: How to create Treasure Data #dotsbigdata

Masahiro Nakagawa
August 1, 2015

BigData All Stars 2015

How to create Treasure Data
#dotsbigdata

Page 2: How to create Treasure Data #dotsbigdata

Who are you?

> Masahiro Nakagawa
> github/twitter: @repeatedly

> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer

> I love OSS :)
  > D language - Phobos committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of Presto Source Code Reading / meetup
  > etc…

Page 3: How to create Treasure Data #dotsbigdata

Company overview

http://www.treasuredata.com/opensource


Page 4: How to create Treasure Data #dotsbigdata

Treasure Data Solution

Ingest → Analyze → Distribute


Page 5: How to create Treasure Data #dotsbigdata

Treasure Data Service

> A simplified cloud analytics infrastructure
  > Customers focus on their business

> SQL interfaces for schema-less data sources
  > Fit for Data Hub / Lake
  > Batch / Low latency / Machine Learning

> Lots of ingestion and integrated solutions
  > Fluentd / Embulk / Data Connector / SDKs
  > Result Output / Prestogres Gateway / BI tools

> Awesome support for time to value

Page 6: How to create Treasure Data #dotsbigdata


Page 7: How to create Treasure Data #dotsbigdata

Plazma - TD’s distributed analytical database

Page 8: How to create Treasure Data #dotsbigdata

Plazma by the numbers

> Streaming import: 45 billion records / day
> Bulk import: 10 billion records / day
> Hive queries: 3+ trillion records / day
  > Machine learning queries with Hivemall are increasing
> Presto queries: 3+ trillion records / day

Page 9: How to create Treasure Data #dotsbigdata

TD’s resource management

> Guarantee and boost compute resources (a minimal sketch of the idea follows)
  > Guarantee stabilizes query performance
  > Boost shares free resources
  > Gets the benefits of multi-tenancy
> Global resource scheduler
  > Manages jobs, resources and priority across users
> Separate storage from compute resources
  > Easy to scale workers
  > We can use S3 / GCS / Azure Storage as a reliable backend
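This is not TD's actual scheduler, just a minimal sketch of the guarantee + boost idea under the assumption that scheduling is done in per-account worker slots: each account always gets its guaranteed slots, and leftover capacity is shared as "boost" only if it is not reserved for other accounts' guarantees.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not TD's real scheduler): guaranteed slots per account,
// plus a shared pool of free slots handed out as "boost".
public class GuaranteeBoostScheduler {
    private final Map<String, Integer> guaranteed;                 // account -> guaranteed slots
    private final Map<String, Integer> running = new HashMap<>();  // account -> slots in use
    private final int totalSlots;

    public GuaranteeBoostScheduler(int totalSlots, Map<String, Integer> guaranteed) {
        this.totalSlots = totalSlots;
        this.guaranteed = guaranteed;
    }

    private int used() {
        return running.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** Returns true if one more task for this account may start now. */
    public synchronized boolean tryAcquire(String account) {
        int mine = running.getOrDefault(account, 0);
        int free = totalSlots - used();
        if (free <= 0) return false;
        // Within the guarantee: always allowed while capacity remains.
        if (mine < guaranteed.getOrDefault(account, 0)) {
            running.put(account, mine + 1);
            return true;
        }
        // Boost: only take slots that no other account still needs for its guarantee.
        int reservedForOthers = 0;
        for (Map.Entry<String, Integer> e : guaranteed.entrySet()) {
            if (!e.getKey().equals(account)) {
                reservedForOthers += Math.max(0, e.getValue() - running.getOrDefault(e.getKey(), 0));
            }
        }
        if (free - reservedForOthers > 0) {
            running.put(account, mine + 1);
            return true;
        }
        return false;
    }

    public synchronized void release(String account) {
        running.merge(account, -1, Integer::sum);
    }
}
```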

Page 10: How to create Treasure Data #dotsbigdata

Data Importing

Page 11: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…

Page 12: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 1 minute
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

Import Queue on MySQL (PerfectQueue):

unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38

Page 13: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

Import Queue on MySQL (PerfectQueue), unique_id is UNIQUE (at-most-once):

unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38
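At-least-once retries on the client side plus an at-most-once UNIQUE constraint on the chunk ID give effectively exactly-once enqueueing. A minimal JDBC sketch of that idea, assuming a hypothetical import_queue table with a unique unique_id column (not PerfectQueue's actual schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class ChunkEnqueuer {
    // Hypothetical table: CREATE TABLE import_queue (
    //   unique_id VARBINARY(16) PRIMARY KEY, payload LONGBLOB, created_at DATETIME)
    private static final String INSERT =
        "INSERT INTO import_queue (unique_id, payload, created_at) VALUES (?, ?, NOW())";

    /**
     * Enqueue a chunk. Clients retry on any failure (at-least-once); the UNIQUE
     * primary key silently drops re-sent chunks (at-most-once), so each chunk is
     * stored exactly once.
     */
    public static boolean enqueue(Connection conn, byte[] uniqueId, byte[] msgpackChunk) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(INSERT)) {
            ps.setBytes(1, uniqueId);
            ps.setBytes(2, msgpackChunk);
            ps.executeUpdate();
            return true;                      // newly enqueued
        } catch (SQLIntegrityConstraintViolationException dup) {
            return false;                     // duplicate chunk: already enqueued, safe to ack
        }
    }
}
```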

Page 14: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3)

✓ HA
✓ Load balancing

Page 15: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3 → Realtime Storage and Archive Storage)

Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL

Page 16: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3 → Realtime Storage and Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Metadata of the records in a file (stored on PostgreSQL):

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Page 17: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage → Merge Worker (MapReduce) → Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Realtime Storage metadata:

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Archive Storage metadata:

file | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

Merge every 1 hour
Retrying + Unique (at-least-once + at-most-once)
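One way to read "Retrying + Unique" for the merge step: the merge job can be re-run safely because the metadata swap (delete the merged realtime entries, insert the single archive entry) happens in one transaction. A hedged JDBC sketch, with hypothetical table and column names rather than Plazma's real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class MergeCommitter {
    /**
     * Atomically replace merged realtime-storage entries with one archive-storage
     * entry. If the merge job crashes and retries, either nothing happened yet or
     * the whole swap happened; partial state is never visible to queries.
     */
    public static void commitMerge(Connection conn, String archivePath,
                                   String rangeStart, String rangeEnd, long records) throws Exception {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement del = conn.prepareStatement(
                 "DELETE FROM realtime_files WHERE index_range && tsrange(?::timestamp, ?::timestamp)");
             PreparedStatement ins = conn.prepareStatement(
                 "INSERT INTO archive_files (path, index_range, records) " +
                 "VALUES (?, tsrange(?::timestamp, ?::timestamp), ?)")) {
            del.setString(1, rangeStart);
            del.setString(2, rangeEnd);
            del.executeUpdate();

            ins.setString(1, archivePath);
            ins.setString(2, rangeStart);
            ins.setString(3, rangeEnd);
            ins.setLong(4, records);
            ins.executeUpdate();

            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```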

Page 18: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Realtime Storage metadata:

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Archive Storage metadata:

file | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

GiST (R-tree) index on the "time" range column of the files.
Read from Archive Storage if merged; otherwise, from Realtime Storage.

Page 19: How to create Treasure Data #dotsbigdata

Data Importing

> Scalable & reliable importing
  > Fluentd buffers data on disk
  > The import queue deduplicates uploaded chunks
  > Workers take the chunks and put them into Realtime Storage

> Instant visibility
  > Imported data is immediately visible to query engines
  > Background workers merge the files every hour

> Metadata
  > The index is built on PostgreSQL using the RANGE type and a GiST index (sketch below)
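A minimal sketch of that indexing scheme, assuming a hypothetical archive_files table (not Plazma's actual schema): a tsrange column holds each file's time range, a GiST index accelerates overlap lookups, and a query's time predicate becomes a range-overlap search.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangeIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/plazma_demo", "demo", "demo")) {
            try (Statement st = conn.createStatement()) {
                // Each row describes one columnar file and the time range it covers.
                st.execute("CREATE TABLE IF NOT EXISTS archive_files (" +
                           "  path text PRIMARY KEY," +
                           "  index_range tsrange NOT NULL," +
                           "  records bigint NOT NULL)");
                // GiST index so range-overlap (&&) lookups don't scan every row.
                st.execute("CREATE INDEX IF NOT EXISTS archive_files_range_idx " +
                           "ON archive_files USING gist (index_range)");
            }
            // Find only the files that can contain rows for the queried hour.
            String sql = "SELECT path, records FROM archive_files " +
                         "WHERE index_range && tsrange(?::timestamp, ?::timestamp)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "2015-12-01 11:00:00");
                ps.setString(2, "2015-12-01 12:00:00");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("path") + " " + rs.getLong("records"));
                    }
                }
            }
        }
    }
}
```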

Page 20: How to create Treasure Data #dotsbigdata

Data processing

Page 21: How to create Treasure Data #dotsbigdata

Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL.

File covering [2015-12-01 10:00:00, 2015-12-01 11:00:00]:

time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …

File covering [2015-12-01 11:00:00, 2015-12-01 12:00:00]:

time                | code  | method
2015-12-01 11:10:09 | 200   | GET
2015-12-01 11:21:45 | 200   | GET
2015-12-01 11:38:59 | 200   | GET
2015-12-01 11:43:37 | 200   | GET
2015-12-01 11:54:52 | "200" | GET
…                   | …     | …

Metadata on PostgreSQL:

path | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

File format: MessagePack Columnar File Format
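The files store rows as MessagePack values, so each record is self-describing and new columns can appear at any time without a schema migration. A tiny msgpack-java (0.7+ core API) sketch of packing and reading such records; the row-at-a-time layout here is illustrative, not the actual columnar MPC1 format:

```java
import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;
import org.msgpack.core.MessageUnpacker;
import org.msgpack.value.Value;
import java.util.Map;

public class MsgpackRowDemo {
    public static void main(String[] args) throws Exception {
        // Pack two records as maps; the second record has an extra "user" column,
        // which is fine because every record carries its own keys.
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(3);
        packer.packString("time");   packer.packLong(1448964156L);
        packer.packString("code");   packer.packInt(200);
        packer.packString("method"); packer.packString("GET");

        packer.packMapHeader(4);
        packer.packString("time");   packer.packLong(1448968249L);
        packer.packString("code");   packer.packString("200");   // even the type may differ
        packer.packString("method"); packer.packString("GET");
        packer.packString("user");   packer.packInt(391);
        packer.close();

        // Unpack and print each record as a generic map.
        MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(packer.toByteArray());
        while (unpacker.hasNext()) {
            Map<Value, Value> record = unpacker.unpackValue().asMapValue().map();
            System.out.println(record);
        }
    }
}
```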

Page 22: How to create Treasure Data #dotsbigdata

(same tables as the previous slide, with the partitioning called out)

time-based partitioning:
  each file covers one time range (e.g. [2015-12-01 10:00:00, 2015-12-01 11:00:00]), tracked by the metadata on PostgreSQL

column-based partitioning:
  inside a file, values are stored as per-column blocks (time, code, method, …)

Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL

Page 23: How to create Treasure Data #dotsbigdata

(same tables as the previous slides)

SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code

The time predicate selects only the files whose index range overlaps it (time-based partitioning), and only the "code" column blocks of those files are read (column-based partitioning). A pruning sketch follows.
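A hedged sketch of that pruning logic (FileEntry, pruneByTime and columnsFor are hypothetical names, not engine internals): filter the file list by time-range overlap, then request only the columns the query needs.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionPruner {
    /** Metadata row: one columnar file and the time range it covers. */
    record FileEntry(String path, LocalDateTime rangeStart, LocalDateTime rangeEnd) {}

    /** Time-based pruning: keep only files whose range can contain matching rows. */
    static List<FileEntry> pruneByTime(List<FileEntry> files,
                                       LocalDateTime queryStart, LocalDateTime queryEnd) {
        return files.stream()
                    .filter(f -> f.rangeStart().isBefore(queryEnd)
                              && f.rangeEnd().isAfter(queryStart))
                    .collect(Collectors.toList());
    }

    /** Column-based pruning: "SELECT code ... GROUP BY code" only needs these columns. */
    static Set<String> columnsFor(String query) {
        // A real engine derives this from the query plan; hard-coded for the example query.
        return Set.of("code");
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
            new FileEntry("f1", LocalDateTime.parse("2015-12-01T10:00:00"), LocalDateTime.parse("2015-12-01T11:00:00")),
            new FileEntry("f2", LocalDateTime.parse("2015-12-01T11:00:00"), LocalDateTime.parse("2015-12-01T12:00:00")));
        // WHERE time >= 2015-12-01 11:00:00  ->  only f2 needs to be opened.
        System.out.println(pruneByTime(files,
            LocalDateTime.parse("2015-12-01T11:00:00"), LocalDateTime.MAX));
        System.out.println(columnsFor("SELECT code, COUNT(1) FROM logs ... GROUP BY code"));
    }
}
```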

Page 24: How to create Treasure Data #dotsbigdata

Handling Eventual Consistency

1. Write data / metadata first
   > At this point, the data is not visible yet
2. Check whether the data is available
   > GET, GET, GET…
3. The data becomes visible
   > Queries now include the imported data!

e.g. Netflix's approach: https://github.com/Netflix/s3mper
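A minimal sketch of step 2, assuming the AWS SDK for Java v1 (bucket and key names are placeholders): after the upload, poll until the object is readable, then flip the "visible" flag in the metadata store.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class VisibilityChecker {
    /**
     * With S3's eventual consistency (as of 2015), a GET right after an upload
     * could still miss the object. Poll until it is readable before exposing it
     * to query engines (the metadata update itself is not shown here).
     */
    public static boolean waitUntilVisible(AmazonS3 s3, String bucket, String key,
                                           int maxAttempts, long sleepMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (s3.doesObjectExist(bucket, key)) {
                return true;                    // safe to make visible
            }
            Thread.sleep(sleepMillis);          // GET, GET, GET…
        }
        return false;                           // give up; let the caller retry the upload
    }

    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        boolean visible = waitUntilVisible(s3, "example-bucket", "path/to/file.mpc", 30, 1000L);
        System.out.println("visible = " + visible);
    }
}
```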

Page 25: How to create Treasure Data #dotsbigdata

Hide network cost

> Open a lot of connections to the object storage
  > Use ranged GETs with columnar offsets (sketch below)
  > Improves scan performance for partitioned data

> Detect recoverable errors
  > We have error lists for fault tolerance

> Stall checker
  > Watches the progress of reading data
  > If processing time reaches a threshold, re-connect to the object storage and re-read the data
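A hedged sketch of the ranged-GET idea with the AWS SDK for Java v1: fetch just one column block by its byte offset, so several blocks can be pulled over parallel connections. The offset/length pair is a placeholder for values read from the file header.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class ColumnBlockFetcher {
    /**
     * Read one column block out of a columnar file using an HTTP Range request,
     * instead of downloading the whole object. The [offset, offset+length) pair
     * would come from the file header that lists column-block offsets.
     */
    public static byte[] fetchBlock(AmazonS3 s3, String bucket, String key,
                                    long offset, long length) throws Exception {
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(offset, offset + length - 1);   // Range is inclusive on both ends
        try (S3Object obj = s3.getObject(req)) {
            return IOUtils.toByteArray(obj.getObjectContent());
        }
    }
}
```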

Page 26: How to create Treasure Data #dotsbigdata

Optimizing Scan Performance

• Fully utilize the network bandwidth from S3
• TD Presto becomes CPU-bottlenecked

(diagram: TableScanOperator reading MPC1 files from S3 / Riak CS)

> TableScanOperator holds the S3 file list and table schema
> Request Queue: a priority queue with a max-connections limit
> Buffers have a size limit; allocated buffers are reused (release(Buffer))
> MPC1 file layout: Header, Column Block 0 (column names), Column Block 1 … Column Block i … Column Block m
> HeaderReader calls back to HeaderParser, which parses the MPC file header: column block offsets and column names
> ColumnBlockReader issues column block requests (S3 reads) and prepares MessageUnpackers
> S3 reads do decompression and use msgpack-java v0.7
> The engine pulls records from the MessageUnpackers
> Retry GET requests on (retry sketch below):
  - 500 (internal error)
  - 503 (slow down)
  - 404 (not found)
  - eventual consistency
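A hedged sketch of that retry policy with the AWS SDK for Java v1: the status-code classification mirrors the list above, and the backoff parameters are made up.

```java
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RetryingReader {
    /** 500 = internal error, 503 = slow down, 404 = object not yet visible (eventual consistency). */
    private static boolean isRetryable(AmazonServiceException e) {
        int status = e.getStatusCode();
        return status == 500 || status == 503 || status == 404;
    }

    public static S3Object getWithRetry(AmazonS3 s3, GetObjectRequest req,
                                        int maxAttempts) throws InterruptedException {
        long backoffMillis = 200;
        for (int attempt = 1; ; attempt++) {
            try {
                return s3.getObject(req);
            } catch (AmazonServiceException e) {
                if (attempt >= maxAttempts || !isRetryable(e)) {
                    throw e;                       // give up: surface the error to the caller
                }
                Thread.sleep(backoffMillis);       // back off, then retry the same GET
                backoffMillis = Math.min(backoffMillis * 2, 10_000);
            }
        }
    }
}
```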

Page 27: How to create Treasure Data #dotsbigdata

Recoverable errors

> Error types
  > User error
    > Syntax error, semantic error
  > Insufficient resources
    > Exceeded task memory size
  > Internal failure
    > I/O error of S3 / Riak CS
    > Worker failure
    > etc.

We can retry the internal-failure patterns (see the classification sketch below).
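A minimal sketch of that classification (hypothetical names; the real job workers carry richer error metadata):

```java
public class ErrorClassifier {
    enum ErrorType { USER_ERROR, INSUFFICIENT_RESOURCE, INTERNAL_FAILURE }

    /**
     * Only internal failures (storage I/O errors, worker crashes, …) are worth an
     * automatic retry: a syntax error or an out-of-memory task will fail the same
     * way again, so those are reported back to the user instead.
     */
    static boolean shouldRetry(ErrorType type) {
        switch (type) {
            case INTERNAL_FAILURE:      return true;
            case USER_ERROR:            // e.g. syntax error, semantic error
            case INSUFFICIENT_RESOURCE: // e.g. exceeded task memory size
            default:                    return false;
        }
    }
}
```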


Page 29: How to create Treasure Data #dotsbigdata

Presto retry on Internal Errors

> Queries succeed eventually
(chart omitted; y-axis on a log scale)

Page 30: How to create Treasure Data #dotsbigdata

time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …

user | time                | code  | method
391  | 2015-12-01 11:10:09 | 200   | GET
482  | 2015-12-01 11:21:45 | 200   | GET
573  | 2015-12-01 11:38:59 | 200   | GET
664  | 2015-12-01 11:43:37 | 200   | GET
755  | 2015-12-01 11:54:52 | "200" | GET
…    | …                   | …     | …

Page 31: How to create Treasure Data #dotsbigdata

(same tables as the previous slide: newer files have an extra "user" column and a string "200" value)

MessagePack Columnar File Format is schema-less
✓ Instant schema change

SQL is schema-full
✓ SQL doesn't work without a schema

→ Schema-on-Read

Page 32: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage (schema-less) → Query Engine: Hive, Pig, Presto (schema-full))

{"user":54, "name":"plazma", "value":"120", "host":"local"}

Schema-on-Read

Page 33: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage (schema-less) → Schema → Query Engine: Hive, Pig, Presto (schema-full))

Schema-on-Read

{"user":54, "name":"plazma", "value":"120", "host":"local"}

CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);

| user  | 54       |
| name  | "plazma" |
| value | 120      |
| host  | NULL     |

Page 34: How to create Treasure Data #dotsbigdata

(same as the previous slide: the schema is applied at read time; "value":"120" is cast to INT 120, while "host":"local" cannot be cast to INT and becomes NULL)
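A small sketch of that read-time casting behaviour (a hypothetical helper, not Hive/Presto internals): apply the declared column types to a schema-less record, yielding NULL when a value cannot be coerced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaOnRead {
    /** Coerce a raw value to INT at read time; return null instead of failing. */
    static Integer readAsInt(Object raw) {
        if (raw == null) return null;
        if (raw instanceof Number n) return n.intValue();
        try {
            return Integer.parseInt(raw.toString());   // "120" -> 120
        } catch (NumberFormatException e) {
            return null;                               // "local" -> NULL
        }
    }

    public static void main(String[] args) {
        // Schema-less record as stored:
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("user", 54);
        record.put("name", "plazma");
        record.put("value", "120");
        record.put("host", "local");

        // Declared schema: user INT, name STRING, value INT, host INT
        System.out.println("user  = " + readAsInt(record.get("user")));   // 54
        System.out.println("name  = " + record.get("name"));              // plazma
        System.out.println("value = " + readAsInt(record.get("value")));  // 120
        System.out.println("host  = " + readAsInt(record.get("host")));   // null
    }
}
```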

Page 35: How to create Treasure Data #dotsbigdata

Streaming logging layer

Reliable forwarding

Pluggable architecture

http://fluentd.org/

Page 36: How to create Treasure Data #dotsbigdata

Bulk loading

Parallel processing

Pluggable architecture

http://embulk.org/

Page 37: How to create Treasure Data #dotsbigdata

Hadoop

> Distributed computing framework
> Consists of many components…

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

Page 38: How to create Treasure Data #dotsbigdata

Presto

> A distributed SQL query engine for interactive data analysis against GBs to PBs of data
> Open sourced by Facebook
> https://github.com/facebook/presto

Page 39: How to create Treasure Data #dotsbigdata

Conclusion

> Build a scalable data analytics platform on the cloud
  > Separate resources and storage
  > Loosely-coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
  > Use an existing component or create a new one?
  > Stick to the basics!
> If you're tired, please use Treasure Data ;)

Page 40: How to create Treasure Data #dotsbigdata

https://jobs.lever.co/treasure-data

Cloud service for the entire data pipeline.