Masahiro Nakagawa
August 1, 2015
BigData All Stars 2015

How to create Treasure Data
#dotsbigdata
Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer
> I love OSS :)
  > D language - Phobos committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of Presto Source Code Reading / meetup
  > etc…
Company overview
http://www.treasuredata.com/opensource
Treasure Data Solution
Ingest → Analyze → Distribute
Treasure Data Service
> A simplified cloud analytics infrastructure
  > Customers focus on their business
> SQL interfaces for schema-less data sources
  > Fits the Data Hub / Data Lake pattern
  > Batch / low latency / machine learning
> Lots of ingestion and integration solutions
  > Fluentd / Embulk / Data Connector / SDKs
  > Result Output / Prestogres Gateway / BI tools
> Awesome support for time to value
Plazma - TD’s distributed analytical database
Plazma by the numbers

> Streaming import
  > 45 billion records / day
> Bulk import
  > 10 billion records / day
> Hive queries
  > 3+ trillion records / day
  > Machine-learning queries (Hivemall) are increasing
> Presto queries
  > 3+ trillion records / day
TD’s resource management
> Guarantee and boost compute resources (see the sketch below)
  > Guarantees stabilize query performance
  > Boosts share otherwise-idle resources
  > Gets the benefits of multi-tenancy
> Global resource scheduler
  > Manages jobs, resources, and priorities across users
> Separate storage from compute resources
  > Easy to scale workers
  > S3 / GCS / Azure Storage can serve as the reliable backend
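To make the guarantee-and-boost split concrete, here is a minimal Python sketch. Everything in it (function name, slot counts, the greedy boost policy) is invented for illustration; the deck does not describe the real scheduler's algorithm.

# Invented illustration of "guarantee + boost" slot allocation, not TD's
# actual scheduler: every account first receives up to its guaranteed
# slots; leftover capacity is shared out as boost.
def allocate(total_slots: int, guarantees: dict, demand: dict) -> dict:
    # Guarantee phase: stable performance regardless of other tenants.
    granted = {a: min(g, demand.get(a, 0)) for a, g in guarantees.items()}
    free = total_slots - sum(granted.values())
    # Boost phase: idle slots go to whoever still has queued work.
    for a in sorted(demand, key=lambda a: demand[a] - granted.get(a, 0), reverse=True):
        extra = min(free, demand[a] - granted.get(a, 0))
        granted[a] = granted.get(a, 0) + extra
        free -= extra
    return granted

print(allocate(10, {"a": 4, "b": 4}, {"a": 6, "b": 2}))
# {'a': 6, 'b': 2} -- "a" is boosted with "b"'s idle guaranteed slots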
Data Importing
Import Queue

td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker

✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

MessagePack: "It's like JSON, but fast and small."

unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…
MySQL (PerfectQueue) records each uploaded chunk's unique ID:

unique_id         time
375828ce5510cadb  2015-12-01 10:47
2024cffb9510cadc  2015-12-01 11:09
1b8d6a600510cadd  2015-12-01 11:21
1f06c0aa510caddb  2015-12-01 11:38
A UNIQUE constraint on unique_id turns duplicate uploads into no-ops (at-most-once).
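A minimal, self-contained sketch of this delivery guarantee, with sqlite3 standing in for MySQL/PerfectQueue and all table and function names invented:

# At-least-once upload + at-most-once enqueue = effectively-once import.
import sqlite3
import uuid

import msgpack  # pip install msgpack

db = sqlite3.connect(":memory:")  # stand-in for MySQL (PerfectQueue)
db.execute("CREATE TABLE chunks (unique_id TEXT PRIMARY KEY, payload BLOB)")

def make_chunk(records):
    """Client side: pack buffered records and attach a stable unique ID."""
    payload = b"".join(msgpack.packb(r) for r in records)
    return uuid.uuid4().hex, payload

def enqueue(unique_id, payload):
    """Server side: the UNIQUE key turns re-uploads into no-ops."""
    db.execute("INSERT OR IGNORE INTO chunks VALUES (?, ?)", (unique_id, payload))
    db.commit()

chunk_id, payload = make_chunk([{"time": 1426047906, "uid": 1}])
enqueue(chunk_id, payload)  # first delivery
enqueue(chunk_id, payload)  # client retry (at-least-once) -> deduplicated
assert db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0] == 1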
Import Queue → Import Worker × 3

✓ HA
✓ Load balancing
Realtime Storage

Import Queue → Import Worker × 3 → Realtime Storage

Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL
Metadata of the records in a file (stored on PostgreSQL):

uploaded time     file index range                             records
2015-03-08 10:47  [2015-12-01 10:47:11, 2015-12-01 10:48:13]   3
2015-03-08 11:09  [2015-12-01 11:09:32, 2015-12-01 11:10:35]   25
2015-03-08 11:38  [2015-12-01 11:38:43, 2015-12-01 11:40:49]   14
…                 …                                            …
Archive Storage

A Merge Worker (MapReduce) merges Realtime Storage files into Archive Storage every hour, with retrying + unique commits (at-least-once + at-most-once); a toy sketch follows the table:
file index range                             records
[2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
[2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
…                                            …
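A toy model of the merge invariant (pure Python, invented names; the real worker is a MapReduce job): the merge itself may be retried freely, while committing under a deterministic hourly key keeps the result unique.

# at-least-once (retryable) + at-most-once (deterministic commit key).
realtime = {  # chunk_id -> (hour bucket, records)
    "c1": ("2015-12-01 10", [{"time": "2015-12-01 10:47:11"}]),
    "c2": ("2015-12-01 10", [{"time": "2015-12-01 10:48:02"}]),
}
archive = {}  # hour bucket -> merged, time-sorted records

def merge_hour(hour: str) -> None:
    chunks = [recs for h, recs in realtime.values() if h == hour]
    merged = sorted((r for recs in chunks for r in recs), key=lambda r: r["time"])
    archive[hour] = merged  # deterministic key: a retry rewrites the same entry

merge_hour("2015-12-01 10")
merge_hour("2015-12-01 10")  # a retried merge is harmless
assert len(archive["2015-12-01 10"]) == 2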
A GiST (R-tree) index is built on the "time" ranges of the files.
Queries read merged data from Archive Storage; anything not yet merged is read from Realtime Storage.
Data Importing
> Scalable & reliable importing
  > Fluentd buffers data on disk
  > The import queue deduplicates uploaded chunks
  > Workers take the chunks and put them into Realtime Storage
> Instant visibility
  > Imported data is immediately visible to the query engines
  > Background workers merge the files every hour
> Metadata
  > The index is built on PostgreSQL using the RANGE type and a GiST index (see the sketch below)
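The metadata index could look roughly like this. Only the RANGE type + GiST combination is confirmed by the deck; the table name, columns, and connection string below are assumptions, and the snippet expects a reachable PostgreSQL database.

import psycopg2  # pip install psycopg2-binary

DDL = """
CREATE TABLE IF NOT EXISTS archive_files (
    path        text PRIMARY KEY,
    index_range tsrange NOT NULL,  -- [min(time), max(time)) of the file's records
    records     bigint NOT NULL
);
-- GiST index makes range-overlap scans fast
CREATE INDEX IF NOT EXISTS archive_files_range_idx
    ON archive_files USING gist (index_range);
"""

FIND_FILES = """
SELECT path FROM archive_files
WHERE index_range && tsrange(%s, %s)  -- files overlapping the query window
"""

with psycopg2.connect("dbname=plazma") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(FIND_FILES, ("2015-12-01 11:00:00", "2015-12-01 12:00:00"))
    paths = [p for (p,) in cur.fetchall()]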
Data processing
Log records, shown as two hourly files:

time                 code   method
2015-12-01 10:02:36  200    GET
2015-12-01 10:22:09  404    GET
2015-12-01 10:36:45  200    GET
2015-12-01 10:49:21  200    POST
…                    …      …

time                 code   method
2015-12-01 11:10:09  200    GET
2015-12-01 11:21:45  200    GET
2015-12-01 11:38:59  200    GET
2015-12-01 11:43:37  200    GET
2015-12-01 11:54:52  "200"  GET
…                    …      …

Archive Storage
Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL

path  index range                                  records
…     [2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
…     [2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
…     …                                            …
MessagePack Columnar File Format

> time-based partitioning: records are split into hourly files
> column-based partitioning: within each file, values are stored column by column
SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code
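Because every file carries its time range in the metadata, whole files can be skipped before any S3 read happens. A tiny illustration with the values from the table above (file names invented):

# Only files whose index range overlaps the WHERE clause are fetched.
files = [
    ("file_10", "2015-12-01 10:00:00", "2015-12-01 11:00:00"),
    ("file_11", "2015-12-01 11:00:00", "2015-12-01 12:00:00"),
]
t0 = "2015-12-01 11:00:00"  # WHERE time >= '2015-12-01 11:00:00'
needed = [name for name, lo, hi in files if hi > t0]  # ISO strings sort correctly
print(needed)  # ['file_11'] -- file_10 is never read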
Handling Eventual Consistency

1. Write the data / metadata first
   > At this point, the data is not visible
2. Check whether the data is available (sketched below)
   > GET, GET, GET…
3. The data becomes visible
   > Queries now include the imported data!

Ex. Netflix's approach to the same problem:
> https://github.com/Netflix/s3mper
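A sketch of step 2's polling loop using boto3 (the put_object/head_object calls are real; the bucket, key, timeout, and function name are placeholders):

import time

import boto3  # pip install boto3

s3 = boto3.client("s3")

def put_and_wait(bucket: str, key: str, body: bytes, timeout: float = 60.0) -> None:
    """Write first, then poll until the object is readable (GET, GET, GET...)."""
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return  # visible -> now safe to make the data queryable
        except s3.exceptions.ClientError:
            time.sleep(1.0)  # not consistent yet; keep checking
    raise TimeoutError(f"{key} still not visible after {timeout}s")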
Hide network cost

> Open a lot of connections to the object storage
  > Use range requests with columnar offsets
  > Improves scan performance for partitioned data
> Detect recoverable errors
  > We keep error lists for fault tolerance
> Stall checker (sketched below)
  > Watches the progress of reading data
  > If processing time reaches a threshold, re-connect to the object storage and re-read the data
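A sketch combining the range request and the stall checker with boto3; the Range header is real S3 behavior, while the timeout threshold, retry count, and function name are assumptions:

import boto3
from botocore.config import Config
from botocore.exceptions import ReadTimeoutError

# read_timeout acts as the "stall" threshold: if the object storage stops
# sending data, the read aborts and we re-connect and re-read the range.
s3 = boto3.client("s3", config=Config(read_timeout=10, retries={"max_attempts": 0}))

def read_column_block(bucket: str, key: str, offset: int, length: int,
                      max_attempts: int = 3) -> bytes:
    """Fetch one column block by byte range (columnar offset -> Range header)."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = s3.get_object(
                Bucket=bucket, Key=key,
                Range=f"bytes={offset}-{offset + length - 1}",
            )
            return resp["Body"].read()
        except ReadTimeoutError:
            if attempt == max_attempts:
                raise  # stalled too many times; surface the error
    raise AssertionError("unreachable")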
Optimizing Scan Performance

• Fully utilize the network bandwidth from S3
• TD Presto then becomes CPU-bound

MPC1 file layout:
  Header (column names)
  Column Block 0
  Column Block 1
  …
  Column Block i
  …
  Column Block m

Scan pipeline:
  TableScanOperator
    • holds the S3 file list and table schema; issues the header request
  HeaderReader
    • reads the header; callback to HeaderParser
  HeaderParser
    • parses the MPC file header: column block offsets, column names
  ColumnBlockReader
    • prepares column block requests from the offsets
  Request Queue (modeled below)
    • priority queue with a max-connections limit
  S3 / Riak CS reads
    • decompression; msgpack-java v0.7
    • retry GET request on:
      - 500 (internal error)
      - 503 (slow down)
      - 404 (not found)
      - eventual consistency
  MessageUnpacker (× n)
    • pulls records from the buffers
  Buffer management
    • buffer size limit; release(Buffer) returns buffers for reuse
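The Request Queue box could be modeled like this (all names invented; the real implementation lives inside TD's Presto reader): a priority queue feeding a fixed number of connection slots.

import itertools
import queue
import threading

MAX_CONNECTIONS = 8                 # "max connections limit"
tickets = itertools.count()         # tie-breaker so queue entries stay comparable
requests = queue.PriorityQueue()    # (priority, ticket, request)
slots = threading.Semaphore(MAX_CONNECTIONS)

def submit(priority: int, request) -> None:
    requests.put((priority, next(tickets), request))

def fetch(request) -> None:
    """Placeholder for the ranged column-block GET sketched earlier."""

def worker() -> None:
    while True:
        _, _, request = requests.get()
        with slots:       # never more than MAX_CONNECTIONS in-flight reads
            fetch(request)
        requests.task_done()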
Recoverable errors

> Error types
  > User error
    > Syntax error, semantic error
  > Insufficient resource
    > Exceeded task memory size
  > Internal failure
    > I/O errors of S3 / Riak CS
    > Worker failure
    > etc.
> We can retry the internal-failure patterns, as sketched below
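In code, this classification amounts to retrying only the internal-failure bucket. The exception names below are invented for the sketch:

class UserError(Exception): ...             # syntax/semantic error
class InsufficientResource(Exception): ...  # exceeded task memory size
class InternalFailure(Exception): ...       # S3/Riak CS I/O error, dead worker

def run_with_retry(task, max_attempts: int = 5):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except (UserError, InsufficientResource):
            raise                     # retrying can never fix these
        except InternalFailure:
            if attempt == max_attempts:
                raise                 # transient failure persisted; give up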
Presto retry on internal errors

> Queries succeed eventually
(chart omitted: query counts per retry, log scale)
The same log data before and after a schema change: a "user" column appears, and one record carries code as the string "200":

time                 code   method
2015-12-01 10:02:36  200    GET
2015-12-01 10:22:09  404    GET
2015-12-01 10:36:45  200    GET
2015-12-01 10:49:21  200    POST
…                    …      …

user  time                 code   method
391   2015-12-01 11:10:09  200    GET
482   2015-12-01 11:21:45  200    GET
573   2015-12-01 11:38:59  200    GET
664   2015-12-01 11:43:37  200    GET
755   2015-12-01 11:54:52  "200"  GET
…     …                    …      …
MessagePack Columnar File Format is schema-less
✓ Instant schema change

SQL is schema-full
✓ SQL doesn't work without a schema
Schema-on-Read
Schema-less storage, schema-full engine:

Realtime Storage / Archive Storage (schema-less)
  ↓ schema applied at read time
Query Engine: Hive, Pig, Presto (schema-full)

Stored record (schema-less):
{"user":54, "name":"plazma", "value":"120", "host":"local"}

Declared schema:
CREATE TABLE events (
  user INT,
  name STRING,
  value INT,
  host INT
);

Row seen by the query engine:
| user  | 54       |
| name  | "plazma" |
| value | 120      |
| host  | NULL     |
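A minimal model of that read path in plain Python (the real conversion happens inside the query engines; treating Python constructors as SQL types is a simplification):

# The stored record stays schema-less; the declared schema is applied
# while reading. Values that don't fit the column type become NULL.
record = {"user": 54, "name": "plazma", "value": "120", "host": "local"}
schema = [("user", int), ("name", str), ("value", int), ("host", int)]

def read_row(record: dict, schema) -> dict:
    row = {}
    for column, sql_type in schema:
        value = record.get(column)
        try:
            row[column] = sql_type(value) if value is not None else None
        except (TypeError, ValueError):
            row[column] = None  # e.g. "local" is not an INT -> NULL
    return row

print(read_row(record, schema))
# {'user': 54, 'name': 'plazma', 'value': 120, 'host': None}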
Fluentd

> Streaming logging layer
> Reliable forwarding
> Pluggable architecture

http://fluentd.org/
Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
Presto

> A distributed SQL query engine for interactive data analysis against GBs to PBs of data
> Open sourced by Facebook
> https://github.com/facebook/presto
Conclusion

> Build a scalable data analytics platform in the cloud
  > Separate compute resources from storage
  > Loosely coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
  > Use an existing component or create a new one?
  > Stick to the basics!
> If you're tired of building it yourself, please use Treasure Data ;)
Cloud service for the entire data pipeline.

We're hiring: https://jobs.lever.co/treasure-data