Page 1: Embulk - 進化するバルクデータローダ

Embulk - The Evolving Bulk Data Loader

Sadayuki Furuhashi Founder & Software Architect

Embulk Meetup Tokyo #2

Page 2

A little about me…

Sadayuki Furuhashi
github: @frsyuki

Fluentd - Unified log collection infrastructure

Embulk - Plugin-based parallel ETL

Founder & Software Architect

Page 3

What’s Embulk?

> An open-source parallel bulk data loader
  > loads records from “A” to “B”

> using plugins
  > for various kinds of “A” and “B” (Storage, RDBMS, NoSQL, Cloud Service, etc.)

> to make data integration easy.
  > which was very painful… (broken records, transactions (idempotency), performance, …)

Page 4

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
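The kind of cleaning script described in these steps might look like the following minimal Python sketch. It applies only the three conversions named on the slide; the function names `clean_value` and `clean_csv` are hypothetical, and a real script would need many more rules.

```python
import csv
import io
import re

# Matches timestamps such as 2015-01-27T19:05:00Z (ISO 8601, UTC, second precision).
ISO_UTC = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z$")


def clean_value(value: str) -> str:
    """Apply the three conversions listed on the slide to one CSV field."""
    if value == r"\N":   # NULL marker -> empty string
        return ""
    if value == "Inf":   # "Inf" -> "Infinity"
        return "Infinity"
    match = ISO_UTC.match(value)
    if match:            # "...T...Z" -> "YYYY-MM-DD HH:MM:SS UTC"
        return f"{match.group(1)} {match.group(2)} UTC"
    return value


def clean_csv(text: str) -> str:
    """Clean every field of a CSV document and return the cleaned CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([clean_value(v) for v in row])
    return out.getvalue()


cleaned = clean_csv("2015-01-27T19:05:00Z,\\N,Inf\n")
```

Every new failure adds another branch to `clean_value`, which is the maintenance burden the slide is pointing at.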

Page 5

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it with cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequence to U+FFFD
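As an illustration of that last fix, Python's built-in decoder can substitute U+FFFD for bytes that are not valid UTF-8. This is a sketch of the technique only, not of how Embulk handles it; `decode_lossy` is a hypothetical helper name.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, replacing each invalid byte sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# 0xFF can never appear in valid UTF-8, so it becomes the replacement character.
text = decode_lossy(b"abc\xffdef")
```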

Page 6

The pains of bulk data loading

Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)

A lot of integration effort for each storage and format:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …

Page 7

The problems:

> Data cleaning (normalization)
  > How to normalize broken records?

> Error handling
  > How to remove broken records?

> Idempotent retrying
  > How to retry without duplicated loading?

> Performance optimization
  > How to optimize the code or parallelize?
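One common answer to the idempotent-retry question is to load inside a single transaction, so that a failed attempt leaves nothing behind. The sketch below shows this with SQLite from the Python standard library; the table name `events` and the helper `load_atomically` are assumptions made for illustration, and this is not a description of how Embulk's output plugins are implemented.

```python
import sqlite3


def load_atomically(conn: sqlite3.Connection, rows) -> None:
    """Insert all rows in one transaction; on any error nothing is committed."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO events (id, value) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL NOT NULL)")

bad_rows = [(1, 1.0), (2, None)]   # second row violates NOT NULL
good_rows = [(1, 1.0), (2, 2.0)]   # same data after cleaning

try:
    load_atomically(conn, bad_rows)
except sqlite3.IntegrityError:
    pass  # the first attempt failed, and row 1 was rolled back with it

load_atomically(conn, good_rows)   # the retry loads each row exactly once
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Without the transaction, row 1 would survive the failed attempt and the retry would either fail on the duplicate key or load it twice.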

Page 8

[Diagram: Embulk, in the center, bulk-loads data between systems through plugins on both sides. Systems shown: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis.]

✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming

Page 9

Embulk’s Plugin Architecture

[Diagram: Embulk Core with its plugin types: Input, Filter (one or more), Output, Executor, and Guess.]

Page 10

Embulk’s Plugin Architecture

[Diagram: Embulk Core with the Input plugin split into FileInput, Decoder, and Parser plugins. Other plugin types shown: Filter (one or more), Output, Executor, and Guess.]

Page 11

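Embulk’s Plugin Architecture

[Diagram: Embulk Core with the Input plugin split into FileInput, Decoder, and Parser plugins, and the Output plugin split into Formatter, Encoder, and FileOutput plugins. Other plugin types shown: Filter (one or more), Executor, and Guess.]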

Page 12

Execution overview

[Diagram: a Transaction, which runs on a single thread, creates taskCount Tasks, which run on multiple threads (or machines). Each Task receives its index and the shared task definition: { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }.]

Page 13

Parallel execution

[Diagram: Tasks wait in a task queue; threads take them from the queue and run tasks in parallel.]

(embulk-executor-local-thread)
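The idea of a task queue drained by a pool of threads can be sketched with Python's standard thread pool. This illustrates the pattern only, not the code of embulk-executor-local-thread; `run_task`, `run_all`, and the shape of the `task` dictionary are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index: int, task: dict) -> str:
    """Process one unit of work, identified by its index into the shared task."""
    path = task["files"][task_index]
    return f"loaded {path}"


def run_all(task: dict, task_count: int, max_threads: int = 4) -> list:
    """Queue task_count tasks and run them on up to max_threads threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, i, task) for i in range(task_count)]
        # Collect results in task order; result() re-raises any task failure.
        return [f.result() for f in futures]


task = {"files": ["a.csv", "b.csv", "c.csv"]}
results = run_all(task, task_count=len(task["files"]))
```

Each task sees the same `task` definition and differs only by `task_index`, which matches the `{ taskIndex, task }` pairs on the execution overview slide.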

Page 14

Distributed execution

[Diagram: Tasks wait in a task queue and run on Hadoop as map tasks.]

(embulk-executor-mapreduce)

Page 15

Distributed execution (w/ partitioning)

[Diagram: Tasks wait in a task queue and run on Hadoop as a Map - Shuffle - Reduce job.]

(embulk-executor-mapreduce)

Page 16

Transaction control

    fileInput.transaction {                ← file input plugin
      parser.transaction {                 ← parser plugin
        filters.transaction {              ← filter plugins
          formatter.transaction {          ← formatter plugin
            fileOutput.transaction {       ← file output plugin
              executor.transaction {       ← executor plugin
                …                          ← tasks run here
              }
            }
          }
        }
      }
    }

Page 17

Task configuration

    fileInput.transaction { fileInputTask, taskCount →
      parser.transaction { parserTask, schema →
        filters.transaction { filterTasks, schema →
          formatter.transaction { formatterTask →
            fileOutput.transaction { fileOutputTask →
              executor.transaction { →
                task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask, }
                taskCount.times.inParallel { taskIndex →
                  run(taskIndex, task)
                }
              }
            }
          }
        }
      }
    }

• taskCount is decided by input
• schema is decided by input, and may be modified by filters
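The nesting above is a chain of callbacks: each plugin computes its own task configuration and hands it to the next level through a control function. The Python sketch below imitates that shape for three of the six levels (input, parser, filter). All function names and config keys are hypothetical, and the formatter, output, and executor levels are omitted for brevity.

```python
def file_input_transaction(config, control):
    files = config["files"]
    control({"files": files}, len(files))  # taskCount is decided by input


def parser_transaction(config, control):
    # The schema is decided on the input side.
    control({"delimiter": config["delimiter"]}, list(config["columns"]))


def filter_transaction(config, schema, control):
    dropped = config["drop"]
    # A filter may modify the schema; here it removes columns.
    control({"drop": dropped}, [c for c in schema if c not in dropped])


def plan(config):
    """Run the nested transactions and return what the innermost level sees."""
    planned = {}

    def with_input(input_task, task_count):
        def with_parser(parser_task, schema):
            def with_filter(filter_task, out_schema):
                planned["task"] = {
                    "input": input_task,
                    "parser": parser_task,
                    "filter": filter_task,
                }
                planned["task_count"] = task_count
                planned["schema"] = out_schema

            filter_transaction(config, schema, with_filter)

        parser_transaction(config, with_parser)

    file_input_transaction(config, with_input)
    return planned


config = {
    "files": ["a.csv", "b.csv"],
    "delimiter": ",",
    "columns": ["id", "name", "tmp"],
    "drop": ["tmp"],
}
planned = plan(config)
```

Because every level stays on the call stack until the innermost block returns, each plugin can run setup before calling `control` and commit or clean up afterwards.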

Page 18

Task execution

[Diagram: inside each Task, data flows through the plugins below.]

• file input plugin: fileInput.open()
• parser plugin: parser.run(fileInput, pageOutput)
• filter plugins
• formatter plugin: formatter.open(fileOutput)
• file output plugin: fileOutput.open()

Page 19

Type conversion

Input type system → Embulk type system → Output type system

• Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with time zone, …
• Embulk type system: boolean, long, double, string, timestamp
• Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …

Input → Embulk conversion: input plugin (parser plugin if input is file-based)
Embulk → Output conversion: output plugin (formatter plugin if output is file-based)
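To make the input-side conversion concrete, here is a sketch of a lookup table from the PostgreSQL types listed above to the five Embulk types. The specific pairings and the fallback to `string` are assumptions for illustration, and the actual PostgreSQL input plugin may choose differently.

```python
# Assumed mapping from PostgreSQL column types to Embulk's five types.
POSTGRES_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with time zone": "timestamp",
}


def to_embulk_type(source_type: str) -> str:
    """Return the Embulk type for a PostgreSQL type, defaulting to string."""
    return POSTGRES_TO_EMBULK.get(source_type.lower(), "string")


schema = {name: to_embulk_type(t) for name, t in
          [("id", "bigint"), ("score", "double precision"), ("day", "DATE")]}
```

An output plugin would need the reverse step, choosing a destination type for each of the five Embulk types.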

Page 20

What’s added since the first release?

• v0.3
  • Resuming
  • Filter plugin type

• v0.4
  • Plugin template generator
  • Incremental execution (ConfigDiff)
  • Isolated ClassLoaders for Java plugins
  • Polyglot command launcher

Page 21

What’s added since the first release?

• v0.6
  • Executor plugin type
  • Liquid template engine

• v0.7
  • EmbulkEmbed & Embulk::Runner
  • Plugin bundle (embulk-mkbundle)
  • JRuby 9000
  • Gradle v2.6

Page 22

Resuming

• Retries a failed transaction without retrying everything.

• Skips successful tasks by using information stored in a file by the previous transaction.

• embulk run config.yml -r resume-state.yml
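The resume mechanism can be pictured as a small state file listing the tasks that already succeeded. The sketch below shows the idea in Python; it stores the state as JSON to stay within the standard library, and both the file layout and the `run_with_resume` helper are hypothetical rather than Embulk's actual resume-state format.

```python
import json
import os
import tempfile


def run_with_resume(task_count, run_task, state_path):
    """Run tasks 0..task_count-1, skipping those recorded as done in state_path."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    try:
        for i in range(task_count):
            if i in done:
                continue  # succeeded in a previous attempt
            run_task(i)
            done.add(i)
    finally:
        # Persist progress even when a task raised, so the next run can resume.
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
    return done


attempts = []
fail_once = {1}


def flaky_task(i):
    attempts.append(i)
    if i in fail_once:
        fail_once.discard(i)
        raise RuntimeError(f"task {i} failed")


state_path = os.path.join(tempfile.mkdtemp(), "resume-state.json")
try:
    run_with_resume(3, flaky_task, state_path)   # task 1 fails
except RuntimeError:
    pass
finished = run_with_resume(3, flaky_task, state_path)  # skips task 0
```

In this run task 0 executes once, task 1 twice, and task 2 once, so the retry repeats only the work that had not succeeded.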

Page 23

Filter plugin type

• Filters rows out, filters columns out, or enriches the data.
• 18 plugins released.

Page 24

Plugin template generator

• Generates a template of a plugin.
• Generated code is ready to compile as is.
  > You modify & compile it to do your work.

• embulk new <category> <name>

Page 25

Incremental execution

• Stores the last file name or row in a file, and the next execution starts from there.

• Use case: sync new files on S3 to Elasticsearch every day.

• embulk run config.yml -o next-config.yml
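A minimal sketch of the incremental idea, assuming files are named so that newer files sort after older ones: each run processes only the files after the remembered one and returns the value to remember next time. The `select_new_files` helper and the `last_path` key are used here for illustration and are not taken from the slides.

```python
def select_new_files(all_files, last_path=None):
    """Return files that sort after last_path, plus the state for the next run."""
    new_files = sorted(f for f in all_files if last_path is None or f > last_path)
    next_last = new_files[-1] if new_files else last_path
    return new_files, {"last_path": next_last}


# Day 1: nothing remembered yet, so every file is loaded.
day1, diff1 = select_new_files(["2015-01-01.csv", "2015-01-02.csv"])

# Day 2: only the file that appeared since the previous run is loaded.
day2, diff2 = select_new_files(
    ["2015-01-01.csv", "2015-01-02.csv", "2015-01-03.csv"],
    last_path=diff1["last_path"],
)
```

The returned dictionary plays the role of the next-config.yml written by the command above: it is merged into the configuration of the following run.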

Page 26

Isolated ClassLoaders for Java plugins

• Embulk can load multiple versions of Java plugins.

Page 27

Plugin Version Conflicts

[Diagram: Embulk Core runs on the Java Runtime and loads two plugins: embulk-input-s3.jar, which uses aws-sdk.jar v1.9, and embulk-output-redshift.jar, which uses aws-sdk.jar v1.10. Version conflicts!]

Page 28

Multiple Classloaders in JVM

[Diagram: Embulk Core runs on the Java Runtime and loads each plugin in an isolated environment. Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.9; Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.10.]

Page 29

Polyglot launcher script

• embulk.jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.

• ./embulk run abc

Page 30

Executor plugin type

• embulk-executor-mapreduce executes tasks in a distributed environment.

Page 31

Liquid template engine

• A config file can include variables.

Page 32

EmbulkEmbed & Embulk::Runner

• Embed embulk in an application.

Page 33

Plugin bundle

• Uses fixed versions of plugins.

• embulk mkbundle my-project
• embulk run -b my-project config.yml

Page 34

Gradle v2.6

• Continuous compiling.
• “embulk migrate .” upgrades the Gradle version of your plugin project.
• ./gradlew -t build

Page 35

Future plan

• v0.8
  • JSON type (issue #306)
  • Error plugin type (#27, #124)
  • More (or less) concurrency for output (#231)

• v0.9
  • More Guess (#242, #235)
  • Multiple jobs using a single config file (#167)