How to create Treasure Data #dotsbigdata
Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015

Page 1: How to create Treasure Data #dotsbigdata

Masahiro Nakagawa
August 1, 2015

BigData All Stars 2015

How to create Treasure Data
#dotsbigdata

Page 2: How to create Treasure Data #dotsbigdata

Who are you?

> Masahiro Nakagawa
> github/twitter: @repeatedly

> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer

> I love OSS :)
  > D language - Phobos committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of Presto Source Code Reading / meetup
  > etc…

Page 3: How to create Treasure Data #dotsbigdata

Company overview

http://www.treasuredata.com/opensource


Page 4: How to create Treasure Data #dotsbigdata

Treasure Data Solution

Ingest → Analyze → Distribute


Page 5: How to create Treasure Data #dotsbigdata

Treasure Data Service

> A simplified cloud analytics infrastructure
  > Customers focus on their business

> SQL interfaces for schema-less data sources
  > Fit for Data Hub / Lake
  > Batch / Low latency / Machine Learning

> Lots of ingestion and integrated solutions
  > Fluentd / Embulk / Data Connector / SDKs
  > Result Output / Prestogres Gateway / BI tools

> Awesome support for time to value

Page 6: How to create Treasure Data #dotsbigdata


Page 7: How to create Treasure Data #dotsbigdata

Plazma - TD’s distributed analytical database

Page 8: How to create Treasure Data #dotsbigdata

Plazma by the numbers

> Streaming import: 45 billion records / day
> Bulk import: 10 billion records / day
> Hive queries: 3+ trillion records / day
  > Machine learning queries with Hivemall are increasing
> Presto queries: 3+ trillion records / day

Page 9: How to create Treasure Data #dotsbigdata

TD’s resource management

> Guarantee and boost compute resources (a minimal sketch of the idea follows)
  > Guarantee stabilizes query performance
  > Boost shares free resources
  > Gets the benefits of multi-tenancy
> Global resource scheduler
  > Manages jobs, resources and priority across users
> Separate storage from compute resources
  > Easy to scale workers
  > We can use S3 / GCS / Azure Storage as a reliable backend
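This is not TD's actual scheduler, just a minimal sketch of the guarantee + boost idea under the assumption that scheduling is done in per-account worker slots: each account always gets its guaranteed slots, and leftover capacity is shared as "boost" only if it is not reserved for other accounts' guarantees.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not TD's real scheduler): guaranteed slots per account,
// plus a shared pool of free slots handed out as "boost".
public class GuaranteeBoostScheduler {
    private final Map<String, Integer> guaranteed;                 // account -> guaranteed slots
    private final Map<String, Integer> running = new HashMap<>();  // account -> slots in use
    private final int totalSlots;

    public GuaranteeBoostScheduler(int totalSlots, Map<String, Integer> guaranteed) {
        this.totalSlots = totalSlots;
        this.guaranteed = guaranteed;
    }

    private int used() {
        return running.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** Returns true if one more task for this account may start now. */
    public synchronized boolean tryAcquire(String account) {
        int mine = running.getOrDefault(account, 0);
        int free = totalSlots - used();
        if (free <= 0) return false;
        // Within the guarantee: always allowed while capacity remains.
        if (mine < guaranteed.getOrDefault(account, 0)) {
            running.put(account, mine + 1);
            return true;
        }
        // Boost: only take slots that no other account still needs for its guarantee.
        int reservedForOthers = 0;
        for (Map.Entry<String, Integer> e : guaranteed.entrySet()) {
            if (!e.getKey().equals(account)) {
                reservedForOthers += Math.max(0, e.getValue() - running.getOrDefault(e.getKey(), 0));
            }
        }
        if (free - reservedForOthers > 0) {
            running.put(account, mine + 1);
            return true;
        }
        return false;
    }

    public synchronized void release(String account) {
        running.merge(account, -1, Integer::sum);
    }
}
```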

Page 10: How to create Treasure Data #dotsbigdata

Data Importing

Page 11: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…

Page 12: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 1 minute
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

Import Queue on MySQL (PerfectQueue):

unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38

Page 13: How to create Treasure Data #dotsbigdata

(diagram: td-agent / fluentd → API Server → Import Queue (MySQL / PerfectQueue) → Import Worker)

td-agent / fluentd:
✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON. but fast and small."

Import Queue on MySQL (PerfectQueue), unique_id is UNIQUE (at-most-once):

unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38
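At-least-once retries on the client side plus an at-most-once UNIQUE constraint on the chunk ID give effectively exactly-once enqueueing. A minimal JDBC sketch of that idea, assuming a hypothetical import_queue table with a unique unique_id column (not PerfectQueue's actual schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class ChunkEnqueuer {
    // Hypothetical table: CREATE TABLE import_queue (
    //   unique_id VARBINARY(16) PRIMARY KEY, payload LONGBLOB, created_at DATETIME)
    private static final String INSERT =
        "INSERT INTO import_queue (unique_id, payload, created_at) VALUES (?, ?, NOW())";

    /**
     * Enqueue a chunk. Clients retry on any failure (at-least-once); the UNIQUE
     * primary key silently drops re-sent chunks (at-most-once), so each chunk is
     * stored exactly once.
     */
    public static boolean enqueue(Connection conn, byte[] uniqueId, byte[] msgpackChunk) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(INSERT)) {
            ps.setBytes(1, uniqueId);
            ps.setBytes(2, msgpackChunk);
            ps.executeUpdate();
            return true;                      // newly enqueued
        } catch (SQLIntegrityConstraintViolationException dup) {
            return false;                     // duplicate chunk: already enqueued, safe to ack
        }
    }
}
```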

Page 14: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3)

✓ HA
✓ Load balancing

Page 15: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3 → Realtime Storage and Archive Storage)

Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL

Page 16: How to create Treasure Data #dotsbigdata

(diagram: Import Queue → Import Worker × 3 → Realtime Storage and Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Metadata of the records in a file (stored on PostgreSQL):

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Page 17: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage → Merge Worker (MapReduce) → Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Realtime Storage metadata:

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Archive Storage metadata:

file | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

Merge every 1 hour
Retrying + Unique (at-least-once + at-most-once)
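One way to read "Retrying + Unique" for the merge step: the merge job can be re-run safely because the metadata swap (delete the merged realtime entries, insert the single archive entry) happens in one transaction. A hedged JDBC sketch, with hypothetical table and column names rather than Plazma's real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class MergeCommitter {
    /**
     * Atomically replace merged realtime-storage entries with one archive-storage
     * entry. If the merge job crashes and retries, either nothing happened yet or
     * the whole swap happened; partial state is never visible to queries.
     */
    public static void commitMerge(Connection conn, String archivePath,
                                   String rangeStart, String rangeEnd, long records) throws Exception {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement del = conn.prepareStatement(
                 "DELETE FROM realtime_files WHERE index_range && tsrange(?::timestamp, ?::timestamp)");
             PreparedStatement ins = conn.prepareStatement(
                 "INSERT INTO archive_files (path, index_range, records) " +
                 "VALUES (?, tsrange(?::timestamp, ?::timestamp), ?)")) {
            del.setString(1, rangeStart);
            del.setString(2, rangeEnd);
            del.executeUpdate();

            ins.setString(1, archivePath);
            ins.setString(2, rangeStart);
            ins.setString(3, rangeEnd);
            ins.setLong(4, records);
            ins.executeUpdate();

            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```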

Page 18: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL)

Realtime Storage metadata:

uploaded time    | file | index range                                | records
2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …    | …                                          | …

Archive Storage metadata:

file | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

GiST (R-tree) index on the "time" range column of the files.
Read from Archive Storage if merged; otherwise, from Realtime Storage.

Page 19: How to create Treasure Data #dotsbigdata

Data Importing

> Scalable & reliable importing
  > Fluentd buffers data on disk
  > The import queue deduplicates uploaded chunks
  > Workers take the chunks and put them into Realtime Storage

> Instant visibility
  > Imported data is immediately visible to query engines
  > Background workers merge the files every hour

> Metadata
  > The index is built on PostgreSQL using the RANGE type and a GiST index (sketch below)
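A minimal sketch of that indexing scheme, assuming a hypothetical archive_files table (not Plazma's actual schema): a tsrange column holds each file's time range, a GiST index accelerates overlap lookups, and a query's time predicate becomes a range-overlap search.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangeIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/plazma_demo", "demo", "demo")) {
            try (Statement st = conn.createStatement()) {
                // Each row describes one columnar file and the time range it covers.
                st.execute("CREATE TABLE IF NOT EXISTS archive_files (" +
                           "  path text PRIMARY KEY," +
                           "  index_range tsrange NOT NULL," +
                           "  records bigint NOT NULL)");
                // GiST index so range-overlap (&&) lookups don't scan every row.
                st.execute("CREATE INDEX IF NOT EXISTS archive_files_range_idx " +
                           "ON archive_files USING gist (index_range)");
            }
            // Find only the files that can contain rows for the queried hour.
            String sql = "SELECT path, records FROM archive_files " +
                         "WHERE index_range && tsrange(?::timestamp, ?::timestamp)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "2015-12-01 11:00:00");
                ps.setString(2, "2015-12-01 12:00:00");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("path") + " " + rs.getLong("records"));
                    }
                }
            }
        }
    }
}
```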

Page 20: How to create Treasure Data #dotsbigdata

Data processing

Page 21: How to create Treasure Data #dotsbigdata

Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL.

File covering [2015-12-01 10:00:00, 2015-12-01 11:00:00]:

time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …

File covering [2015-12-01 11:00:00, 2015-12-01 12:00:00]:

time                | code  | method
2015-12-01 11:10:09 | 200   | GET
2015-12-01 11:21:45 | 200   | GET
2015-12-01 11:38:59 | 200   | GET
2015-12-01 11:43:37 | 200   | GET
2015-12-01 11:54:52 | "200" | GET
…                   | …     | …

Metadata on PostgreSQL:

path | index range                                | records
     | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
     | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …

File format: MessagePack Columnar File Format
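The files store rows as MessagePack values, so each record is self-describing and new columns can appear at any time without a schema migration. A tiny msgpack-java (0.7+ core API) sketch of packing and reading such records; the row-at-a-time layout here is illustrative, not the actual columnar MPC1 format:

```java
import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;
import org.msgpack.core.MessageUnpacker;
import org.msgpack.value.Value;
import java.util.Map;

public class MsgpackRowDemo {
    public static void main(String[] args) throws Exception {
        // Pack two records as maps; the second record has an extra "user" column,
        // which is fine because every record carries its own keys.
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(3);
        packer.packString("time");   packer.packLong(1448964156L);
        packer.packString("code");   packer.packInt(200);
        packer.packString("method"); packer.packString("GET");

        packer.packMapHeader(4);
        packer.packString("time");   packer.packLong(1448968249L);
        packer.packString("code");   packer.packString("200");   // even the type may differ
        packer.packString("method"); packer.packString("GET");
        packer.packString("user");   packer.packInt(391);
        packer.close();

        // Unpack and print each record as a generic map.
        MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(packer.toByteArray());
        while (unpacker.hasNext()) {
            Map<Value, Value> record = unpacker.unpackValue().asMapValue().map();
            System.out.println(record);
        }
    }
}
```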

Page 22: How to create Treasure Data #dotsbigdata

(same tables as the previous slide, with the partitioning called out)

time-based partitioning:
  each file covers one time range (e.g. [2015-12-01 10:00:00, 2015-12-01 11:00:00]), tracked by the metadata on PostgreSQL

column-based partitioning:
  inside a file, values are stored as per-column blocks (time, code, method, …)

Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL

Page 23: How to create Treasure Data #dotsbigdata

(same tables as the previous slides)

SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code

The time predicate selects only the files whose index range overlaps it (time-based partitioning), and only the "code" column blocks of those files are read (column-based partitioning). A pruning sketch follows.
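A hedged sketch of that pruning logic (FileEntry, pruneByTime and columnsFor are hypothetical names, not engine internals): filter the file list by time-range overlap, then request only the columns the query needs.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionPruner {
    /** Metadata row: one columnar file and the time range it covers. */
    record FileEntry(String path, LocalDateTime rangeStart, LocalDateTime rangeEnd) {}

    /** Time-based pruning: keep only files whose range can contain matching rows. */
    static List<FileEntry> pruneByTime(List<FileEntry> files,
                                       LocalDateTime queryStart, LocalDateTime queryEnd) {
        return files.stream()
                    .filter(f -> f.rangeStart().isBefore(queryEnd)
                              && f.rangeEnd().isAfter(queryStart))
                    .collect(Collectors.toList());
    }

    /** Column-based pruning: "SELECT code ... GROUP BY code" only needs these columns. */
    static Set<String> columnsFor(String query) {
        // A real engine derives this from the query plan; hard-coded for the example query.
        return Set.of("code");
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
            new FileEntry("f1", LocalDateTime.parse("2015-12-01T10:00:00"), LocalDateTime.parse("2015-12-01T11:00:00")),
            new FileEntry("f2", LocalDateTime.parse("2015-12-01T11:00:00"), LocalDateTime.parse("2015-12-01T12:00:00")));
        // WHERE time >= 2015-12-01 11:00:00  ->  only f2 needs to be opened.
        System.out.println(pruneByTime(files,
            LocalDateTime.parse("2015-12-01T11:00:00"), LocalDateTime.MAX));
        System.out.println(columnsFor("SELECT code, COUNT(1) FROM logs ... GROUP BY code"));
    }
}
```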

Page 24: How to create Treasure Data #dotsbigdata

Handling Eventual Consistency

1. Write data / metadata first
   > At this point, the data is not visible yet
2. Check whether the data is available
   > GET, GET, GET…
3. The data becomes visible
   > Queries now include the imported data!

e.g. Netflix's approach: https://github.com/Netflix/s3mper
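A minimal sketch of step 2, assuming the AWS SDK for Java v1 (bucket and key names are placeholders): after the upload, poll until the object is readable, then flip the "visible" flag in the metadata store.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class VisibilityChecker {
    /**
     * With S3's eventual consistency (as of 2015), a GET right after an upload
     * could still miss the object. Poll until it is readable before exposing it
     * to query engines (the metadata update itself is not shown here).
     */
    public static boolean waitUntilVisible(AmazonS3 s3, String bucket, String key,
                                           int maxAttempts, long sleepMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (s3.doesObjectExist(bucket, key)) {
                return true;                    // safe to make visible
            }
            Thread.sleep(sleepMillis);          // GET, GET, GET…
        }
        return false;                           // give up; let the caller retry the upload
    }

    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        boolean visible = waitUntilVisible(s3, "example-bucket", "path/to/file.mpc", 30, 1000L);
        System.out.println("visible = " + visible);
    }
}
```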

Page 25: How to create Treasure Data #dotsbigdata

Hide network cost

> Open a lot of connections to the object storage
  > Use ranged GETs with columnar offsets (sketch below)
  > Improves scan performance for partitioned data

> Detect recoverable errors
  > We have error lists for fault tolerance

> Stall checker
  > Watches the progress of reading data
  > If processing time reaches a threshold, re-connect to the object storage and re-read the data
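A hedged sketch of the ranged-GET idea with the AWS SDK for Java v1: fetch just one column block by its byte offset, so several blocks can be pulled over parallel connections. The offset/length pair is a placeholder for values read from the file header.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class ColumnBlockFetcher {
    /**
     * Read one column block out of a columnar file using an HTTP Range request,
     * instead of downloading the whole object. The [offset, offset+length) pair
     * would come from the file header that lists column-block offsets.
     */
    public static byte[] fetchBlock(AmazonS3 s3, String bucket, String key,
                                    long offset, long length) throws Exception {
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(offset, offset + length - 1);   // Range is inclusive on both ends
        try (S3Object obj = s3.getObject(req)) {
            return IOUtils.toByteArray(obj.getObjectContent());
        }
    }
}
```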

Page 26: How to create Treasure Data #dotsbigdata

Optimizing Scan Performance

• Fully utilize the network bandwidth from S3
• TD Presto becomes CPU-bottlenecked

(diagram: TableScanOperator reading MPC1 files from S3 / Riak CS)

> TableScanOperator holds the S3 file list and table schema
> Request Queue: a priority queue with a max-connections limit
> Buffers have a size limit; allocated buffers are reused (release(Buffer))
> MPC1 file layout: Header, Column Block 0 (column names), Column Block 1 … Column Block i … Column Block m
> HeaderReader calls back to HeaderParser, which parses the MPC file header: column block offsets and column names
> ColumnBlockReader issues column block requests (S3 reads) and prepares MessageUnpackers
> S3 reads do decompression and use msgpack-java v0.7
> The engine pulls records from the MessageUnpackers
> Retry GET requests on (retry sketch below):
  - 500 (internal error)
  - 503 (slow down)
  - 404 (not found)
  - eventual consistency
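A hedged sketch of that retry policy with the AWS SDK for Java v1: the status-code classification mirrors the list above, and the backoff parameters are made up.

```java
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RetryingReader {
    /** 500 = internal error, 503 = slow down, 404 = object not yet visible (eventual consistency). */
    private static boolean isRetryable(AmazonServiceException e) {
        int status = e.getStatusCode();
        return status == 500 || status == 503 || status == 404;
    }

    public static S3Object getWithRetry(AmazonS3 s3, GetObjectRequest req,
                                        int maxAttempts) throws InterruptedException {
        long backoffMillis = 200;
        for (int attempt = 1; ; attempt++) {
            try {
                return s3.getObject(req);
            } catch (AmazonServiceException e) {
                if (attempt >= maxAttempts || !isRetryable(e)) {
                    throw e;                       // give up: surface the error to the caller
                }
                Thread.sleep(backoffMillis);       // back off, then retry the same GET
                backoffMillis = Math.min(backoffMillis * 2, 10_000);
            }
        }
    }
}
```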

Page 27: How to create Treasure Data #dotsbigdata

Recoverable errors

> Error types
  > User error
    > Syntax error, semantic error
  > Insufficient resources
    > Exceeded task memory size
  > Internal failure
    > I/O error of S3 / Riak CS
    > Worker failure
    > etc.

We can retry the internal-failure patterns (see the classification sketch below).
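A minimal sketch of that classification (hypothetical names; the real job workers carry richer error metadata):

```java
public class ErrorClassifier {
    enum ErrorType { USER_ERROR, INSUFFICIENT_RESOURCE, INTERNAL_FAILURE }

    /**
     * Only internal failures (storage I/O errors, worker crashes, …) are worth an
     * automatic retry: a syntax error or an out-of-memory task will fail the same
     * way again, so those are reported back to the user instead.
     */
    static boolean shouldRetry(ErrorType type) {
        switch (type) {
            case INTERNAL_FAILURE:      return true;
            case USER_ERROR:            // e.g. syntax error, semantic error
            case INSUFFICIENT_RESOURCE: // e.g. exceeded task memory size
            default:                    return false;
        }
    }
}
```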


Page 29: How to create Treasure Data #dotsbigdata

Presto retry on Internal Errors

> Queries succeed eventually
(chart omitted; y-axis on a log scale)

Page 30: How to create Treasure Data #dotsbigdata

time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …

user | time                | code  | method
391  | 2015-12-01 11:10:09 | 200   | GET
482  | 2015-12-01 11:21:45 | 200   | GET
573  | 2015-12-01 11:38:59 | 200   | GET
664  | 2015-12-01 11:43:37 | 200   | GET
755  | 2015-12-01 11:54:52 | "200" | GET
…    | …                   | …     | …

Page 31: How to create Treasure Data #dotsbigdata

(same tables as the previous slide: newer files have an extra "user" column and a string "200" value)

MessagePack Columnar File Format is schema-less
✓ Instant schema change

SQL is schema-full
✓ SQL doesn't work without a schema

→ Schema-on-Read

Page 32: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage (schema-less) → Query Engine: Hive, Pig, Presto (schema-full))

{"user":54, "name":"plazma", "value":"120", "host":"local"}

Schema-on-Read

Page 33: How to create Treasure Data #dotsbigdata

(diagram: Realtime Storage and Archive Storage (schema-less) → Schema → Query Engine: Hive, Pig, Presto (schema-full))

Schema-on-Read

{"user":54, "name":"plazma", "value":"120", "host":"local"}

CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);

| user  | 54       |
| name  | "plazma" |
| value | 120      |
| host  | NULL     |

Page 34: How to create Treasure Data #dotsbigdata

(same as the previous slide: the schema is applied at read time; "value":"120" is cast to INT 120, while "host":"local" cannot be cast to INT and becomes NULL)
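A small sketch of that read-time casting behaviour (a hypothetical helper, not Hive/Presto internals): apply the declared column types to a schema-less record, yielding NULL when a value cannot be coerced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaOnRead {
    /** Coerce a raw value to INT at read time; return null instead of failing. */
    static Integer readAsInt(Object raw) {
        if (raw == null) return null;
        if (raw instanceof Number n) return n.intValue();
        try {
            return Integer.parseInt(raw.toString());   // "120" -> 120
        } catch (NumberFormatException e) {
            return null;                               // "local" -> NULL
        }
    }

    public static void main(String[] args) {
        // Schema-less record as stored:
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("user", 54);
        record.put("name", "plazma");
        record.put("value", "120");
        record.put("host", "local");

        // Declared schema: user INT, name STRING, value INT, host INT
        System.out.println("user  = " + readAsInt(record.get("user")));   // 54
        System.out.println("name  = " + record.get("name"));              // plazma
        System.out.println("value = " + readAsInt(record.get("value")));  // 120
        System.out.println("host  = " + readAsInt(record.get("host")));   // null
    }
}
```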

Page 35: How to create Treasure Data #dotsbigdata

Streaming logging layer

Reliable forwarding

Pluggable architecture

http://fluentd.org/

Page 36: How to create Treasure Data #dotsbigdata

Bulk loading

Parallel processing

Pluggable architecture

http://embulk.org/

Page 37: How to create Treasure Data #dotsbigdata

Hadoop

> Distributed computing framework
> Consists of many components…

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

Page 38: How to create Treasure Data #dotsbigdata

Presto

> A distributed SQL query engine for interactive data analysis against GBs to PBs of data
> Open sourced by Facebook
> https://github.com/facebook/presto

Page 39: How to create Treasure Data #dotsbigdata

Conclusion

> Build a scalable data analytics platform on the cloud
  > Separate resources and storage
  > Loosely-coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
  > Use an existing component or create a new one?
  > Stick to the basics!
> If you're tired, please use Treasure Data ;)

Page 40: How to create Treasure Data #dotsbigdata

https://jobs.lever.co/treasure-data

Cloud service for the entire data pipeline.