Embulk - The Evolving Bulk Data Loader

Sadayuki Furuhashi Founder & Software Architect

Embulk Meetup Tokyo #2

A little about me…

Sadayuki Furuhashi (github: @frsyuki), Founder & Software Architect

Fluentd - Unified log collection infrastructure

Embulk - Plugin-based parallel ETL

What’s Embulk?

> An open-source parallel bulk data loader
> loads records from “A” to “B”
  (“A” and “B”: storage, RDBMS, NoSQL, cloud services, etc.)

> using plugins
> for various kinds of “A” and “B”

> to make data integration easy,
> which was very painful…
  (broken records, transactions (idempotency), performance, …)

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many more cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?

The pains of bulk data loading

Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it with cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequences to U+FFFD
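The cleaning steps in this example can be sketched in Python as follows. This only illustrates what such a hand-written script ends up doing; it is not Embulk code, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def clean_timestamp(value: str) -> str:
    """Rewrite '2015-01-27T19:05:00Z' style timestamps as 'YYYY-MM-DD hh:mm:ss UTC'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S UTC")


def clean_field(value: str) -> str:
    """Apply the ad-hoc fixes found one failed load at a time."""
    if value == r"\N":    # NULL marker from the example
        return ""
    if value == "Inf":    # spelling from the example that the load rejected
        return "Infinity"
    return value


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, replacing each invalid byte sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

Every new kind of broken record adds another branch to a script like this, which is the maintenance cost the slides describe.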

The pains of bulk data loading

Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
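The arithmetic behind the "1 month" figure can be checked with a short calculation. The worker count of 8 and the assumption of perfect scaling are hypothetical, chosen only to show how parallelism changes the total.

```python
import math


def total_hours(file_count: int, hours_per_file: float, workers: int = 1) -> float:
    """Wall-clock hours when files are processed `workers` at a time."""
    batches = math.ceil(file_count / workers)
    return batches * hours_per_file


sequential_hours = total_hours(720, 1)             # 720 hours
sequential_days = sequential_hours / 24            # 30 days, about one month
parallel_hours = total_hours(720, 1, workers=8)    # 90 hours with 8 workers
```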

A lot of integration effort for each format and storage:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …

The problems:

> Data cleaning (normalization)
  > How to normalize broken records?
> Error handling
  > How to remove broken records?
> Idempotent retrying
  > How to retry without duplicated loading?
> Performance optimization
  > How to optimize the code or parallelize?
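One common way to make a retry safe is to load all records inside a single transaction, so a failed attempt leaves nothing behind. The sketch below uses Python's built-in sqlite3 module as a stand-in database; the table name, columns, and validation rule are hypothetical, and this is not how Embulk itself implements transactions.

```python
import sqlite3


def load_all_or_nothing(conn: sqlite3.Connection, rows) -> bool:
    """Insert every row in one transaction; roll back everything on a bad row."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for record_id, name in rows:
                if name is None:
                    raise ValueError(f"broken record: id={record_id}")
                conn.execute(
                    "INSERT INTO users (id, name) VALUES (?, ?)", (record_id, name)
                )
    except ValueError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

first_try = load_all_or_nothing(conn, [(1, "a"), (2, None), (3, "c")])  # fails at id=2
retry = load_all_or_nothing(conn, [(1, "a"), (2, "b"), (3, "c")])       # cleaned input
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Because the first attempt is rolled back, the retry does not load record 1 twice.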

[Diagram: Embulk bulk-loads data between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis, using input plugins on one side and output plugins on the other.]

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming

Embulk’s Plugin Architecture

[Diagram, built up over three slides: Embulk Core connects the plugin types. Input side: an Input plugin or, for file-based input, FileInput, Decoder, and Parser plugins, supported by Guess plugins. Middle: Filter plugins. Output side: an Output plugin or, for file-based output, Formatter, Encoder, and FileOutput plugins. An Executor plugin runs the tasks.]

Execution overview

[Diagram: the transaction runs on a single thread and produces taskCount tasks, each identified as { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }. The tasks run on multiple threads (or machines).]

Parallel execution

[Diagram: tasks wait in a task queue; threads take tasks from the queue and run them in parallel (embulk-executor-local-thread).]
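The task-queue model can be sketched with Python's standard thread pool. This is an analogy for the local-thread executor, not its implementation; run_task and the task contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index: int, task: dict) -> str:
    """Stand-in for one task; a real task would read, convert, and write records."""
    return f"{task['name']}-{task_index} done"


def run_in_parallel(task_count: int, task: dict, max_threads: int = 4) -> list:
    """Queue task_count tasks and let a fixed pool of threads drain the queue."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, index, task) for index in range(task_count)]
        # Collect results in task order, regardless of completion order.
        return [future.result() for future in futures]


results = run_in_parallel(task_count=4, task={"name": "load"})
```

Each task receives the same task configuration plus its own index, which matches the { taskIndex, task } pairs in the execution overview.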

Distributed execution

[Diagram: tasks wait in a task queue and run on Hadoop as map tasks (embulk-executor-mapreduce).]

Distributed execution (w/ partitioning)

[Diagram: with partitioning, tasks from the task queue run on Hadoop as Map - Shuffle - Reduce (embulk-executor-mapreduce).]

Transaction control

    fileInput.transaction {             ← file input plugin
      parser.transaction {              ← parser plugin
        filters.transaction {           ← filter plugins
          formatter.transaction {       ← formatter plugin
            fileOutput.transaction {    ← file output plugin
              executor.transaction {    ← executor plugin
                …                       (Task, Task)
              }
            }
          }
        }
      }
    }

Task configuration

    fileInput.transaction { fileInputTask, taskCount →
      parser.transaction { parserTask, schema →
        filters.transaction { filterTasks, schema →
          formatter.transaction { formatterTask →
            fileOutput.transaction { fileOutputTask →
              executor.transaction { →
                task = {
                  fileInputTask,
                  parserTask,
                  filterTasks,
                  formatterTask,
                  fileOutputTask,
                }
                taskCount.times.inParallel { taskIndex →
                  run(taskIndex, task)
                }
              }
            }
          }
        }
      }
    }

taskCount is decided by the input.

schema is decided by the input, and may be modified by filters.
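The nested-transaction pattern can be sketched with plain callbacks. For brevity the sketch covers only the input, parser, filter, and executor levels; all function names, file names, and columns are hypothetical and do not reflect Embulk's actual plugin API.

```python
def file_input_transaction(control):
    files = ["a.csv", "b.csv", "c.csv"]            # hypothetical input files
    return control({"files": files}, len(files))    # taskCount is decided by the input


def parser_transaction(control):
    schema = ["id", "name", "created_at"]           # schema is decided by the input side
    return control({"format": "csv"}, schema)


def filter_transaction(schema, control):
    new_schema = [c for c in schema if c != "created_at"]  # a filter may modify the schema
    return control({"drop": ["created_at"]}, new_schema)


def executor_transaction(task_count, task, run):
    # A real executor would run these in parallel; sequential here for clarity.
    return [run(index, task) for index in range(task_count)]


def run(task_index, task):
    return (task_index, task["input"]["files"][task_index], task["schema"])


def transaction():
    return file_input_transaction(lambda input_task, task_count:
        parser_transaction(lambda parser_task, schema:
            filter_transaction(schema, lambda filter_task, out_schema:
                executor_transaction(
                    task_count,
                    {"input": input_task, "parser": parser_task,
                     "filter": filter_task, "schema": out_schema},
                    run))))


results = transaction()
```

Each level hands its task configuration inward, so the innermost level sees the complete task, and each outer level regains control after the tasks finish.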

Task execution

[Diagram: data flow within each task: file input plugin (fileInput.open()) → parser plugin (parser.run(fileInput, pageOutput)) → filter plugins → formatter plugin (formatter.open(fileOutput)) → file output plugin (fileOutput.open()).]
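The per-task data flow can be sketched as a chain of Python generators, one per plugin role. The record layout, the filter rule, and the function names are hypothetical; the sketch only shows how records stream through the stages.

```python
import io


def file_input(text: str):
    """File input: open the source as a stream of raw lines."""
    return io.StringIO(text)


def parser(stream):
    """Parser: turn raw comma-separated lines into records."""
    for line in stream:
        name, amount = line.rstrip("\n").split(",")
        yield {"name": name, "amount": int(amount)}


def keep_positive(records):
    """Filter: drop records with a non-positive amount."""
    return (record for record in records if record["amount"] > 0)


def formatter(records):
    """Formatter: turn records into tab-separated output lines."""
    for record in records:
        yield f"{record['name']}\t{record['amount']}\n"


def file_output(lines) -> str:
    """File output: write the formatted lines to the destination."""
    out = io.StringIO()
    out.writelines(lines)
    return out.getvalue()


result = file_output(formatter(keep_positive(parser(file_input("a,1\nb,0\nc,3\n")))))
```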

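Type conversion

Input type system → Embulk type system → Output type system

Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …

Embulk type system: boolean, long, double, string, timestamp

Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …

Input types are converted to Embulk types by the input plugin (or by the parser plugin if the input is file-based).

Embulk types are converted to output types by the output plugin (or by the formatter plugin if the output is file-based).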
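The two-step conversion can be sketched as a pair of lookup tables. The mappings are illustrative guesses, not the behavior of any real plugin; in particular, mapping timestamp to an Elasticsearch date type is an assumption, since that type is not listed on the slide.

```python
# Illustrative mappings only; real plugins decide these conversions themselves.
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
}

EMBULK_TO_ELASTICSEARCH = {
    "boolean": "boolean",
    "long": "long",
    "double": "double",
    "string": "string",
    "timestamp": "date",  # assumption: Elasticsearch date type, not on the slide
}


def convert_schema(columns, mapping):
    """Map each (name, type) column through one type-system boundary."""
    return [(name, mapping[type_name]) for name, type_name in columns]


source = [("id", "bigint"), ("name", "varchar"), ("created", "date")]
embulk_schema = convert_schema(source, POSTGRESQL_TO_EMBULK)
output_schema = convert_schema(embulk_schema, EMBULK_TO_ELASTICSEARCH)
```

Keeping a small intermediate type system means each plugin only needs to convert to or from five types rather than knowing every other system.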

What’s added since the first release?

• v0.3
  • Resuming
  • Filter plugin type

• v0.4
  • Plugin template generator
  • Incremental execution (ConfigDiff)
  • Isolated ClassLoaders for Java plugins
  • Polyglot command launcher

What’s added since the first release?

• v0.6
  • Executor plugin type
  • Liquid template engine

• v0.7
  • EmbulkEmbed & Embulk::Runner
  • Plugin bundle (embulk-mkbundle)
  • JRuby 9000
  • Gradle v2.6

Resuming

• Retries a failed transaction without retrying everything.

• Skips successful tasks by using information stored in a file by the previous transaction.

• embulk run config.yml -r resume-state.yml
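The idea of skipping successful tasks can be sketched as follows. The JSON state file and its "done" key are hypothetical stand-ins; the slides do not show the actual contents of resume-state.yml.

```python
import json
import os
import tempfile


def run_with_resume(task_count, run_task, state_path):
    """Run tasks, skipping those the state file records as already successful."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    executed, failed = [], False
    for index in range(task_count):
        if index in done:
            continue  # succeeded in a previous attempt
        try:
            run_task(index)
        except Exception:
            failed = True
            continue
        done.add(index)
        executed.append(index)
    with open(state_path, "w") as f:
        json.dump({"done": sorted(done)}, f)
    return executed, failed


state_path = os.path.join(tempfile.mkdtemp(), "resume-state.json")
attempts = {"count": 0}


def flaky(index):
    # Task 1 fails on its first attempt only.
    if index == 1 and attempts["count"] == 0:
        attempts["count"] += 1
        raise RuntimeError("temporary failure")


first = run_with_resume(3, flaky, state_path)   # tasks 0 and 2 succeed, task 1 fails
second = run_with_resume(3, flaky, state_path)  # only task 1 is retried
```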

Filter plugin type

• Filters rows out, filters columns out, or enriches the data.

• 18 plugins released.

Plugin template generator

• Generates a template of a plugin.

• The generated code is ready to compile.

> You modify & compile it to do your work.

• embulk new <category> <name>

Incremental execution

• Store last file name or row in a file, and next execution starts from there.

• Use case: sync new files on S3 to Elasticsearch every day.

• embulk run config.yml -o next-config.yml
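The "start from where the last run stopped" idea can be sketched for file names. The sketch assumes file names sort in arrival order; the function name, the last_path variable, and the file names are chosen for illustration only.

```python
def select_new_files(all_files, last_path=None):
    """Return the files that sort after last_path, plus the next last_path to store."""
    new_files = sorted(f for f in all_files if last_path is None or f > last_path)
    next_last_path = new_files[-1] if new_files else last_path
    return new_files, next_last_path


# Day 1: no stored state, so every file is loaded.
day1, state = select_new_files(["logs/2015-01-01.csv", "logs/2015-01-02.csv"])

# Day 2: only the file that appeared after the stored position is loaded.
day2, state = select_new_files(
    ["logs/2015-01-01.csv", "logs/2015-01-02.csv", "logs/2015-01-03.csv"], state
)
```

The stored position plays the role of the next-config.yml file: the output of one run becomes part of the input of the next.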

Isolated ClassLoaders for Java plugins

• Embulk can load Java plugins in isolation, even when they depend on different versions of the same library.

Plugin Version Conflicts

[Diagram: on a single Java Runtime, Embulk Core loads embulk-input-s3.jar, which uses aws-sdk.jar v1.9, and embulk-output-redshift.jar, which uses aws-sdk.jar v1.10. Version conflicts!]

Multiple Classloaders in JVM

[Diagram: on the same Java Runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 in Class Loader 1, and embulk-output-redshift.jar with aws-sdk.jar v1.10 in Class Loader 2. The two plugins run in isolated environments.]

Polyglot launcher script

• embulk.jar is a jar file.

• embulk.jar is a shell script.

• embulk.jar is a bat script.

• It sets JVM options to improve performance.

• ./embulk run abc

Executor plugin type

• embulk-executor-mapreduce executes tasks in a distributed environment.

Liquid template engine

• A config file can include variables.
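The effect of variables in a config file can be sketched with a minimal placeholder renderer. This is a simplified stand-in, not the Liquid engine: it handles only {{ name }} placeholders, and the config keys and variable name are hypothetical.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace {{ name }} placeholders; raise KeyError for an undefined variable."""
    def substitute(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", substitute, template)


template = "in:\n  type: file\n  path_prefix: {{ input_dir }}/sample_\n"
config = render(template, {"input_dir": "/data/csv"})
```

The same template can then be reused across environments by changing only the variable values.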

EmbulkEmbed & Embulk::Runner

• Embed embulk in an application.

Plugin bundle

• Uses fixed versions of plugins.

• embulk mkbundle my-project

• embulk run -b my-project config.yml

Gradle v2.6

• Continuous compiling.

• “embulk migrate .” upgrades the Gradle version of your plugin project.

• ./gradlew -t build

Future plan

• v0.8
  • JSON type (issue #306)
  • Error plugin type (#27, #124)
  • More (or less) concurrency for output (#231)

• v0.9
  • More Guess (#242, #235)
  • Multiple jobs using a single config file (#167)
