Embulk - The Evolving Bulk Data Loader

Sadayuki Furuhashi Founder & Software Architect

Embulk Meetup Tokyo #2

A little about me…

Sadayuki Furuhashi, Founder & Software Architect
github: @frsyuki

Fluentd - Unified log collection infrastructure

Embulk - Plugin-based parallel ETL

What’s Embulk?

> An open-source parallel bulk data loader
    > loads records from “A” to “B”
> using plugins
    > for various kinds of “A” and “B”
> to make data integration easy.
    > which was very painful…

Storage, RDBMS, NoSQL, Cloud Service,

broken records, transactions (idempotency),

performance, …

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL

> 1. First attempt → fails
> 2. Write a script to clean the records
• Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “\N” → “”
• many more cleanings…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?

Example: load a 10GB CSV file to PostgreSQL

> 6. OK, the script worked.
> 7. Register it in cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequences to U+FFFD
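The cleaning steps listed in this example can be sketched in a few lines of Python. This is only an illustration of the kind of hand-written script the slides describe, not part of Embulk; the function name `clean_field` is hypothetical, and it assumes each field arrives as raw bytes.

```python
from datetime import datetime, timezone


def clean_field(raw: bytes) -> str:
    """Normalize one raw CSV field using the conversions from the example."""
    # Replace invalid UTF-8 byte sequences with U+FFFD instead of failing.
    text = raw.decode("utf-8", errors="replace")
    if text == r"\N":        # NULL marker becomes an empty string
        return ""
    if text == "Inf":        # spelling PostgreSQL accepts for infinity
        return "Infinity"
    try:
        ts = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return text          # not an ISO 8601 UTC timestamp: keep as-is
    ts = ts.replace(tzinfo=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")
```

Every new kind of broken record needs another branch in a script like this, which is the maintenance burden the example is pointing at.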

Example: load 10GB CSV × 720 files

> Most scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
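The one-month figure is simple arithmetic: 720 files × 1 hour = 720 hours = 30 days. A small sketch makes the effect of parallelism visible, assuming (optimistically) that files are independent and that throughput scales linearly with the number of workers; `total_days` and its parameters are hypothetical names.

```python
import math


def total_days(files: int, hours_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock days to load all files with a fixed worker count."""
    # Each round loads up to `workers` files at once.
    rounds = math.ceil(files / workers)
    return rounds * hours_per_file / 24


sequential = total_days(720, 1)            # one file at a time
parallel = total_days(720, 1, workers=8)   # eight files at a time
```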

A lot of integration effort for each storage and format:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …

The problems:

> Data cleaning (normalization)
    > How to normalize broken records?
> Error handling
    > How to remove broken records?
> Idempotent retrying
    > How to retry without duplicated loading?
> Performance optimization
    > How to optimize the code or parallelize?

Embulk

[Diagram: input plugins → bulk load → output plugins, connecting Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, and Cassandra]

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming

Embulk’s Plugin Architecture

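[Diagram, built up over three slides: Embulk Core hosts an executor plugin and a chain of input, filter, and output plugins. For file-based input, the input side is split into FileInput, Decoder, and Parser plugins, with Guess plugins alongside. For file-based output, the output side is split into Formatter, Encoder, and FileOutput plugins.]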

Execution overview

Transaction (runs on a single thread) → taskCount → Tasks (run on multiple threads or machines)

{ taskIndex: 0, task: {…} }

{ taskIndex: 2, task: {…} }

Parallel execution

Threads

Task queue

run tasks in parallel

(embulk-executor-local-thread)
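The idea of pulling tasks from a queue and running them on a pool of threads can be sketched as follows. This is a minimal illustration using Python's standard thread pool, not the actual embulk-executor-local-thread code; `run_tasks`, `max_threads`, and the `run` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_count, task, run, max_threads=4):
    """Run run(task_index, task) for every index on a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # The pool's internal queue plays the role of the task queue.
        futures = [pool.submit(run, index, task) for index in range(task_count)]
        # Collect results in task order; an exception in a task is re-raised here.
        return [future.result() for future in futures]
```

For example, `run_tasks(3, {"path": "data"}, lambda i, t: (i, t["path"]))` runs three tasks that share one task configuration and differ only in their index.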

Distributed execution

Map tasks

Task queue

run tasks on Hadoop

(embulk-executor-mapreduce)

Distributed execution (w/ partitioning)

Map - Shuffle - Reduce

Task queue

run tasks on Hadoop

(embulk-executor-mapreduce)

Transaction control

fileInput.transaction {
  parser.transaction {
    filters.transaction {
      formatter.transaction {
        fileOutput.transaction {
          executor.transaction {
            …
          }
        }
      }
    }
  }
}

file input plugin

parser plugin

filter plugins

formatter plugin

file output plugin

executor plugin

Task Task

Task configuration

fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}

taskCount is decided by input

schema is decided by input, and may be modified by filters
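The nested transaction structure can be modeled with plain callbacks: each plugin's transaction computes its task configuration and passes it to a control function that runs the next plugin's transaction inside it. The sketch below is a simplified model, not Embulk's API. All function names, file names, and column names are hypothetical, and the formatter and file output plugins are collapsed into a single output step to keep it short.

```python
from concurrent.futures import ThreadPoolExecutor


def file_input_transaction(control):
    files = ["a.csv", "b.csv", "c.csv"]            # hypothetical input files
    return control({"files": files}, len(files))   # taskCount is decided by input


def parser_transaction(control):
    return control({"type": "csv"}, ["id", "name"])  # schema is decided by input


def filters_transaction(schema, control):
    # A filter may modify the schema; this one drops the "name" column.
    return control([{"drop": "name"}], [c for c in schema if c != "name"])


def output_transaction(control):
    return control({"type": "stdout"})


def run(task_index, task):
    path = task["file_input"]["files"][task_index]
    return f"task {task_index}: {path} -> columns {task['schema']}"


def transaction():
    def with_input(file_input_task, task_count):
        def with_parser(parser_task, schema):
            def with_filters(filter_tasks, filtered_schema):
                def with_output(output_task):
                    # All plugin tasks are bundled into one shared task object.
                    task = {
                        "file_input": file_input_task,
                        "parser": parser_task,
                        "filters": filter_tasks,
                        "output": output_task,
                        "schema": filtered_schema,
                    }
                    with ThreadPoolExecutor() as pool:
                        return list(pool.map(lambda i: run(i, task), range(task_count)))
                return output_transaction(with_output)
            return filters_transaction(schema, with_filters)
        return parser_transaction(with_parser)
    return file_input_transaction(with_input)
```

Because every transaction wraps the ones inside it, each plugin gets a chance to commit or clean up after all tasks have finished, which is what makes this shape useful for transaction control.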

Task execution

fileInput.open()

parser.run(fileInput, pageOutput)

formatter.open(fileOutput)

fileOutput.open()

[Diagram: within each task, data flows through the file input plugin, parser plugin, filter plugins, formatter plugin, and file output plugin.]

Type conversion

Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with time zone, …

→ converted by the input plugin (parser plugin if input is file-based)

Embulk type system: boolean, double, string, timestamp

→ converted by the output plugin (formatter plugin if output is file-based)

Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
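The two-step conversion can be pictured as two lookup tables, one owned by the input side and one by the output side, with Embulk's types in the middle. The mappings below are illustrative assumptions only and are not taken from any real plugin; they cover just the types named on this slide.

```python
# Assumed mapping applied by an input plugin (PostgreSQL type -> Embulk type).
POSTGRES_TO_EMBULK = {
    "boolean": "boolean",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
}

# Assumed mapping applied by an output plugin (Embulk type -> Elasticsearch type).
EMBULK_TO_ELASTICSEARCH = {
    "boolean": "boolean",
    "double": "double",
    "string": "string",
    "timestamp": "string",  # assumption: timestamps are written as formatted text
}


def convert_type(postgres_type: str) -> str:
    """Map a source column type to a destination type via the Embulk type."""
    embulk_type = POSTGRES_TO_EMBULK[postgres_type]   # KeyError if unsupported
    return EMBULK_TO_ELASTICSEARCH[embulk_type]
```

With a shared intermediate type system, each plugin only needs to know its own side of the mapping, so N inputs and M outputs need N + M converters rather than N × M.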

What’s added since the first release?

• v0.3
    • Resuming
    • Filter plugin type

• v0.4
    • Plugin template generator
    • Incremental execution (ConfigDiff)
    • Isolated ClassLoaders for Java plugins
    • Polyglot command launcher

What’s added since the first release?

• v0.6
    • Executor plugin type
    • Liquid template engine

• v0.7
    • EmbulkEmbed & Embulk::Runner
    • Plugin bundle (embulk-mkbundle)
    • JRuby 9000
    • Gradle v2.6

Resuming

• Retries a failed transaction without retrying everything.

• Skips successful tasks by using information stored in a file by the previous transaction.

• embulk run config.yml -r resume-state.yml
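The resume idea can be sketched as a loop that records each successful task in a state file and skips recorded tasks on the next attempt. This is a simplified model, not Embulk's implementation: the function name `run_with_resume` is hypothetical, and the state is stored as JSON here only to keep the example within the standard library.

```python
import json
import os


def run_with_resume(task_count, run, state_path):
    """Run tasks 0..task_count-1, skipping those a previous attempt finished."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    executed = []
    for index in range(task_count):
        if index in done:
            continue          # finished by a previous attempt
        run(index)            # may raise; the state file keeps earlier successes
        done.add(index)
        executed.append(index)
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
    os.remove(state_path)     # the whole transaction succeeded
    return executed
```

If task 1 fails on the first attempt, the state file already lists task 0, so a second call with the same state path runs only tasks 1 and 2.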

Filter plugin type

• Filters rows out, filters columns out, or enriches the data.

• 18 plugins released.

Plugin template generator

• Generates a template of a plugin.

• Generated code is ready to compile.

> You modify & compile it to do your work.

• embulk new <category> <name>

Incremental execution

• Stores the last file name or row in a file, and the next execution starts from there.

• Use case: sync new files on S3 to Elasticsearch every day.

• embulk run config.yml -o next-config.yml
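Incremental execution amounts to carrying a small piece of state from one run into the configuration of the next. The sketch below models that with a dictionary; the function name `incremental_run` and the `last_path` key are assumptions chosen for illustration, and file names are assumed to sort in arrival order.

```python
def incremental_run(available_files, config):
    """Return the files to load now and the config for the next run."""
    last = config.get("last_path")
    # Only files that sort after the last loaded one are new.
    new_files = sorted(f for f in available_files if last is None or f > last)
    next_config = dict(config)
    if new_files:
        next_config["last_path"] = new_files[-1]
    return new_files, next_config
```

A first call with an empty config loads everything and records the last file name; feeding the returned config into the next call loads only the files that arrived in between.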

Isolated ClassLoaders for Java plugins

• Embulk can load multiple versions of the same library for different Java plugins.

Plugin Version Conflicts

[Diagram: on one Java Runtime under Embulk Core, embulk-input-s3.jar uses aws-sdk.jar v1.9 while embulk-output-redshift.jar uses aws-sdk.jar v1.10. Version conflicts!]

Multiple Classloaders in JVM

[Diagram: on one Java Runtime under Embulk Core, Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.9 and Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.10. Isolated environments.]

Polyglot launcher script

• embulk.jar is a jar file.

• embulk.jar is a shell script.

• embulk.jar is a bat script.

• It sets JVM options to improve performance.

• ./embulk run abc

Executor plugin type

• embulk-executor-mapreduce executes tasks in a distributed environment.

Liquid template engine

• A config file can include variables.

EmbulkEmbed & Embulk::Runner

• Embed embulk in an application.

Plugin bundle

• Uses fixed versions of plugins.

• embulk mkbundle my-project

• embulk run -b my-project config.yml

Gradle v2.6

• Continuous compiling.

• “embulk migrate .” upgrades the Gradle version of your plugin project.

• ./gradlew -t build

Future plan

• v0.8
    • JSON type (issue #306)
    • Error plugin type (#27, #124)
    • More (or less) concurrency for output (#231)

• v0.9
    • More Guess (#242, #235)
    • Multiple jobs using a single config file (#167)
