Treasure Data, Inc. Founder & Software Architect Sadayuki Furuhashi Embulk Internals


Page 1: Embulk internals

Treasure Data, Inc. Founder & Software Architect

Sadayuki Furuhashi

Embulk Internals

Page 2: Embulk internals

Execution overview

Transaction: runs on a single thread and yields taskCount.

Task: runs on multiple threads (or machines). Each task receives its own index together with the task, e.g. { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }.
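The split between the single-threaded transaction and the per-index tasks can be sketched as follows. This is a minimal Python model, not Embulk's Java API: `transaction`, `run_task`, `execute`, and the contents of the task are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def transaction():
    """Runs once, on a single thread: decides the task and the task count."""
    task = {"path_prefix": "data/part-"}  # hypothetical task configuration
    task_count = 3                        # hypothetical count
    return task, task_count


def run_task(task_index, task):
    """Runs once per task index, possibly on another thread (or machine)."""
    return {"taskIndex": task_index, "input": f"{task['path_prefix']}{task_index}"}


def execute():
    task, task_count = transaction()
    with ThreadPoolExecutor(max_workers=task_count) as pool:
        # Every task receives the same task plus its own index.
        reports = list(pool.map(lambda i: run_task(i, task), range(task_count)))
    return reports
```

Calling `execute()` returns one report per task index, in index order, because `pool.map` preserves the order of its inputs.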

Page 3: Embulk internals

Parallel execution

Task queue: Task, Task, Task, Task

Threads: run tasks in parallel (embulk-executor-local-thread)
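A task queue drained by a fixed set of threads might look like the sketch below. It illustrates the general pattern only and is not the code of embulk-executor-local-thread; `run_in_parallel` and its parameters are hypothetical.

```python
import queue
import threading


def run_in_parallel(task_count, run, num_threads=2):
    """Put task indexes on a queue and let worker threads drain it."""
    tasks = queue.Queue()
    for task_index in range(task_count):
        tasks.put(task_index)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_index = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread is done
            result = run(task_index)
            with lock:
                results[task_index] = result

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_in_parallel(4, lambda i: i * 10)` runs four tasks on two threads and collects their results by task index.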

Page 4: Embulk internals

Distributed execution

Task queue: Task, Task, Task, Task

Map tasks: run tasks on Hadoop (embulk-executor-mapreduce)

Page 5: Embulk internals

Distributed execution (w/ partitioning)

Task queue: Task, Task, Task, Task

Map - Shuffle - Reduce: run tasks on Hadoop (embulk-executor-mapreduce)
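The slide does not say how embulk-executor-mapreduce assigns work to each phase, so the sketch below only shows the generic map, shuffle, reduce pattern with partitioning by a key. All function names and the partition key are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, key_of):
    """Map: tag each input record with its partition key."""
    return [(key_of(record), record) for record in records]


def shuffle_phase(pairs):
    """Shuffle: group records that share a partition key."""
    partitions = defaultdict(list)
    for key, record in pairs:
        partitions[key].append(record)
    return dict(partitions)


def reduce_phase(partitions, write):
    """Reduce: run one output step per partition."""
    return {key: write(key, records) for key, records in partitions.items()}
```

Partitioning records by a date column, for instance, would send every record with the same date to the same reduce step, so each output holds exactly one date.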

Page 6: Embulk internals

Transaction control

fileInput.transaction {              ← file input plugin
  parser.transaction {               ← parser plugin
    filters.transaction {            ← filter plugins
      formatter.transaction {        ← formatter plugin
        fileOutput.transaction {     ← file output plugin
          executor.transaction {     ← executor plugin
            …                        ← Task, Task
          }
        }
      }
    }
  }
}
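The nesting means each plugin's transaction begins before, and ends after, everything nested inside it. A minimal Python model of that ordering is sketched below; the `Plugin` class and `run_nested` helper are hypothetical and do not mirror Embulk's Java interfaces.

```python
class Plugin:
    """Hypothetical plugin: transaction() wraps whatever runs inside it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def transaction(self, control):
        self.log.append(f"begin {self.name}")
        try:
            return control()  # the next plugin's transaction runs here
        finally:
            self.log.append(f"end {self.name}")


def run_nested(plugins, body):
    """Nest plugins[0].transaction { plugins[1].transaction { ... body() } }."""
    def nest(index):
        if index == len(plugins):
            return body()
        return plugins[index].transaction(lambda: nest(index + 1))
    return nest(0)
```

Because the outermost transaction is the last to finish, the input plugin in this model gets to observe the outcome of every stage nested inside it.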

Page 7: Embulk internals

Task configuration

fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}

taskCount is decided by input

schema is decided by input, and may be modified by filters
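How the task count and schema flow inward through the nested transactions can be modeled roughly as below. The sketch is Python rather than Embulk's Java API, it omits the formatter, file output, and executor stages for brevity, and every function name, file name, and column is a hypothetical example.

```python
def file_input_transaction(control):
    # The input decides how many tasks there are (here, one per file).
    files = ["a.csv", "b.csv"]  # hypothetical input files
    return control({"files": files}, len(files))


def parser_transaction(control):
    # The input side (the parser, for file-based input) decides the schema.
    schema = [("id", "long"), ("name", "string")]  # hypothetical schema
    return control({"format": "csv"}, schema)


def filter_transaction(schema, control):
    # A filter may modify the schema; this one appends a column.
    out_schema = schema + [("loaded_at", "timestamp")]
    return control({"add": "loaded_at"}, out_schema)


def configure():
    """Collect each plugin's task into one task object, as on the slide."""
    return file_input_transaction(lambda file_input_task, task_count:
        parser_transaction(lambda parser_task, schema:
            filter_transaction(schema, lambda filter_task, out_schema: {
                "task": {
                    "fileInputTask": file_input_task,
                    "parserTask": parser_task,
                    "filterTasks": [filter_task],
                },
                "taskCount": task_count,
                "schema": out_schema,
            })))
```

In this model `configure()` yields a task count of 2 (one per input file) and a three-column schema, the last column having been added by the filter.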

Page 8: Embulk internals

Task execution

file input plugin: fileInput.open()
parser plugin: parser.run(fileInput, pageOutput)
filter plugins
formatter plugin: formatter.open(fileOutput)
file output plugin: fileOutput.open()

Each task (… Task, Task …) runs this pipeline.
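One task's pipeline can be sketched as a chain of stages that push records downstream. The Python below is an illustrative model under stated assumptions: the comma-separated input, the doubling filter, the tab-separated output, and all class and function names are hypothetical, not Embulk's actual interfaces.

```python
import io


def file_input_open(text):
    """File input: exposes raw data (here, an in-memory text buffer)."""
    return io.StringIO(text)


def parser_run(file_input, page_output):
    """Parser: turns raw lines into records and pushes them downstream."""
    for line in file_input:
        name, count = line.rstrip("\n").split(",")
        page_output.add({"name": name, "count": int(count)})
    page_output.finish()


class FilterOutput:
    """Filter: transforms each record, then forwards it to the next stage."""

    def __init__(self, next_output):
        self.next_output = next_output

    def add(self, record):
        self.next_output.add({**record, "count": record["count"] * 2})

    def finish(self):
        self.next_output.finish()


class FormatterOutput:
    """Formatter: serializes records into the file output."""

    def __init__(self, file_output):
        self.file_output = file_output

    def add(self, record):
        self.file_output.write(f"{record['name']}\t{record['count']}\n")

    def finish(self):
        self.file_output.flush()


def run_task(text):
    file_output = io.StringIO()               # stands in for fileOutput.open()
    formatter = FormatterOutput(file_output)  # stands in for formatter.open(fileOutput)
    page_output = FilterOutput(formatter)     # filters sit between parser and formatter
    parser_run(file_input_open(text), page_output)
    return file_output.getvalue()
```

The parser only sees a `page_output` with `add` and `finish`, so filters can be stacked in front of the formatter without the parser knowing about them.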

Page 9: Embulk internals

Type conversion

Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …

Embulk type system: boolean, long, double, string, timestamp

Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …

Input type system → Embulk type system: input plugin (parser plugin if input is file-based)

Embulk type system → Output type system: output plugin (formatter plugin if output is file-based)
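The two conversion steps can be pictured as two lookup tables applied to a schema. The mappings below are illustrative assumptions only (the slide does not give them, and real plugins define their own conversions); `convert_schema` is a hypothetical helper.

```python
# Assumed input-side mapping: source types (PostgreSQL-like) to Embulk types.
INPUT_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
}

# Assumed output-side mapping: Embulk types to destination types.
EMBULK_TO_OUTPUT = {
    "boolean": "boolean",
    "long": "long",
    "double": "double",
    "string": "string",
    "timestamp": "string",  # assumption: timestamps are written as formatted strings
}


def convert_schema(columns, mapping):
    """Translate each (name, type) column through the given type mapping."""
    return [(name, mapping[type_name]) for name, type_name in columns]
```

Because every input type is first narrowed to one of the five Embulk types, an output plugin in this model only needs to handle those five, whatever the source was.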