Embulk at Treasure Data
Satoshi Akama, Dec. 15, 2015
Embulk meetup #2


About me…

Satoshi Akama

Embulk plugins
・embulk-output-bigquery
・embulk-input-gcs
・embulk-input-azure_blob_storage
・embulk-output-azure_blob_storage

Treasure Data Inc.

Software Engineer (Java/Scala/Ruby)

github.com/sakama/

@oreradio

We provide hosted Embulk

Data Connector (import): MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, Salesforce, Marketo, …etc

Streaming Import

Result Output (export): MySQL, PostgreSQL, Redshift, BigQuery, …etc

“Data loading” should not be the customer’s work unless they’re developing ETL tools.

Treasure Data as a Datahub

You can create a data pipeline easily

Various formatted data
・Log
・Sensor data (IoT)

Schema-less (Treasure Data)

Some data store (schema-full)

・Visualize
・Digital marketing

Data Connector (Import) - CUI: guess/preview/import

$ td connector:guess seed.yml -o load.yml

$ td connector:preview load.yml

$ td connector:issue load.yml --database td_sample_db \
    --table td_sample_table

Scheduled execution

$ td connector:create \
    daily_import \
    "10 5 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at
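The schedule argument of td connector:create is a five-field cron expression. As a small illustration only (the helper name describe_cron and the field labels are made up here, not part of the td CLI), the fields of "10 5 * * *" can be spelled out like this:

```python
# Hypothetical helper: names the five fields of a standard cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map a five-field cron expression to its named fields."""
    values = expression.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(values)}")
    return dict(zip(CRON_FIELDS, values))


if __name__ == "__main__":
    # Minute 10 of hour 5, every day of every month: one run per day.
    print(describe_cron("10 5 * * *"))
```

Read this way, the daily_import schedule fires once a day at 05:10. The slide does not say which time zone applies.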

A GUI will come in the near future

Result Output(Output) - GUI/CUI

We are using…

Unchanged OSS Embulk / Embulk plugins

We will use our changes in our service after…

Sending a pull request to OSS Embulk

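“I feel that the model often seen in open source software, where the basic features are released for free and left to the community while software with added features is sold for a fee, is actually not working all that well.” - “‘I really like the mechanism by which Fluentd gets the business going.’ | Think IT” https://thinkit.co.jp/story/2015/07/17/6232

“There are various development styles even within open source software, but in the case of fluentd, Treasure Data, the company I belong to, backs it fully. For now, I think this development style, ‘backed by a company, but developed in the open,’ fits best.” - “Wanting to master databases rather than OSes or languages: fluentd’s new features and the ambitions of Treasure Data’s Furuhashi, as heard by a GREE engineer (2/3)” - @IT http://www.atmarkit.co.jp/ait/articles/1310/07/news010_2.html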

Process to use Embulk plugins at TD

(1) Add features / fix for Local Executor
(2) Fix for MapReduce Executor
(3) Write unit test
(4) Write integration test
(5) Release as “Data Connector” or “Result Output”

Send pull request to OSS Embulk or Embulk plugins

Sorry, this is closed source code

Process to use Embulk plugins at TD (1)


Add features / fix for Local Executor

・Add some features
  e.g. add various authentication methods
・Add some fixes
  e.g. add retry logic, fix error handling
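As a rough illustration of the "add retry logic" kind of fix, here is a minimal sketch of retrying a transient failure with exponential backoff. It is not code from any Embulk plugin: RetryableError, with_retry, and the parameter defaults are all hypothetical.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, HTTP 5xx, ...)."""


def with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    Any exception other than RetryableError propagates immediately, so
    permanent errors (bad config, bad credentials) fail fast.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and surface the last transient error
            # Backoff doubles each attempt, plus up to 10% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Separating retryable from non-retryable errors is the part that overlaps with "fix error handling": retrying a permanent error only delays the failure.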

Process to use Embulk plugins at TD (2)

Fix for MapReduce Executor


Handling of file paths: the MapReduce executor cannot read local file paths (such as a private key file)

Fix authorization logic if needed: the transaction() and open() methods run on different instances
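Both points can be pictured with a small simulation. This is not Embulk's Java plugin API: only the method names transaction() and open() come from the slide, while the function signatures, the dict fields, and the JSON hand-off are assumptions made for illustration.

```python
import json
import os
import tempfile


def transaction(config):
    """Runs once on the machine that submits the job.

    The private key is read here, because the path in `config` exists only
    on this machine. The key content, not the path, goes into the task.
    """
    with open(config["private_key_path"]) as f:
        private_key = f.read()
    return {"bucket": config["bucket"], "private_key": private_key}


def open_on_worker(serialized_task):
    """Runs on a different instance that sees only the serialized task.

    Authorization is redone from the task data instead of reusing a client
    object built in transaction(), which would not exist on this instance.
    """
    task = json.loads(serialized_task)
    return {"bucket": task["bucket"], "authorized_with": task["private_key"]}


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".pem", delete=False) as f:
        f.write("dummy-key")
    task = transaction({"bucket": "sample", "private_key_path": f.name})
    os.remove(f.name)  # simulate the worker having no access to the file
    print(open_on_worker(json.dumps(task)))
```

The design point is that anything open() needs must survive serialization, so file contents and credentials travel in the task rather than as paths or live objects.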

Process to use Embulk plugins at TD (3)


Write unit test

Need 80% coverage: by internal rules, we can’t deploy without 80% unit test coverage.

Writing unit tests for an Embulk plugin is difficult, e.g. when it has to connect to a cloud service…
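One common way around the cloud-service dependency is to pass the client in and replace it with a stub in tests. The sketch below shows the idea with Python's unittest.mock. BlobUploader and its put_object call are hypothetical and do not correspond to any real SDK or Embulk plugin.

```python
import unittest
from unittest import mock


class BlobUploader:
    """Hypothetical output step that depends on a cloud storage client."""

    def __init__(self, client):
        self.client = client

    def upload(self, bucket, name, data):
        if not data:
            raise ValueError("refusing to upload an empty file")
        self.client.put_object(bucket, name, data)
        return len(data)


class BlobUploaderTest(unittest.TestCase):
    def test_upload_calls_client_once(self):
        client = mock.Mock()  # stands in for the real cloud SDK client
        uploader = BlobUploader(client)
        self.assertEqual(uploader.upload("b", "a.csv", b"1,2\n"), 4)
        client.put_object.assert_called_once_with("b", "a.csv", b"1,2\n")

    def test_empty_data_is_rejected_before_any_network_call(self):
        client = mock.Mock()
        with self.assertRaises(ValueError):
            BlobUploader(client).upload("b", "a.csv", b"")
        client.put_object.assert_not_called()


if __name__ == "__main__":
    unittest.main()
```

Tests like these cover the plugin's own logic without network access. The real service call still needs an integration test.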

Process to use Embulk plugins at TD (4)


Write integration test for Treasure Data service

e.g.
(1) Import data into TD
(2) Send a query to Presto or Hive
(3) Check the result against a local file
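Step (3) of the flow above can be sketched as a comparison between the rows a query returned and an expected CSV file. The import and query steps are left out because they depend on the service. The function names and the order-insensitive comparison are assumptions, not TD's actual test code.

```python
import csv
import io


def load_expected(csv_text):
    """Parse the local expected-result file into a sorted list of rows."""
    return sorted(tuple(row) for row in csv.reader(io.StringIO(csv_text)))


def check_result(query_rows, expected_csv_text):
    """Compare query rows with the local file, ignoring row order.

    Values are compared as strings, since CSV carries no type information.
    Returns (missing, unexpected); both are empty when the results match.
    """
    actual = sorted(tuple(str(v) for v in row) for row in query_rows)
    expected = load_expected(expected_csv_text)
    missing = [r for r in expected if r not in actual]
    unexpected = [r for r in actual if r not in expected]
    return missing, unexpected
```

Returning the missing and unexpected rows, rather than a bare boolean, makes a failed run easier to diagnose.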

Process to use Embulk plugins at TD (5)


Release as “Data Connector” or ”Result Output”

We hope for a win-win relationship

Embulk community: core development, plugin development

Use at TD

Use in your own environment

Contribute

Embulk Execution Platform at Treasure Data

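[Diagram] Components and flow, as far as the labels show:

・Clients: Web Console, td commands (td connector:issue, td guess config.yml…)
・Requests and responses pass through the Load Balancer to the TD API (API servers) and on to the Bulkload API (API servers)
・Jobs are enqueued to Perfect Queue and dequeued by a TD worker (worker process)
・The worker submits the job (retrying if needed) and executes it with the MapReduce or Local Executor
・guess/preview appears as a separate label from the queued jobs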

TD API / Bulkload API

guess/preview is processed on different API servers.

・guess/preview: HTTP request/response via the Load Balancer, TD API (API servers), and Bulkload API (API servers). guess/preview needs a quick response.
・data import: queuing. The job is enqueued to Perfect Queue.
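The split between the two paths can be pictured with a toy model: guess/preview is answered inside the request, while an import is only enqueued and later picked up by a worker that retries on failure. Everything here is a stand-in. queue.Queue replaces the real job queue, and handle_request, worker_step, and the response dicts are hypothetical.

```python
import queue

job_queue = queue.Queue()  # stand-in for the real job queue


def handle_request(kind, config):
    """API side: answer guess/preview inline, enqueue imports."""
    if kind in ("guess", "preview"):
        # Needs a quick response, so it is processed within the request.
        return {"status": "done", "result": f"{kind} of {config}"}
    if kind == "import":
        job_queue.put(config)
        return {"status": "queued"}
    raise ValueError(f"unknown request kind: {kind}")


def worker_step(execute, max_attempts=3):
    """Worker side: dequeue one job and execute it, retrying on failure."""
    config = job_queue.get()
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(config)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # out of attempts: report the failure
```

Queuing lets a long import outlive the HTTP request and be retried. The inline path keeps guess/preview latency low.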

Problems

Stability of integration tests

・Many plugins × many test cases × frequent execution sometimes causes failures.

Execution time of integration tests

・Many plugins × many test cases causes long execution time :)