Embulk Internals
Sadayuki Furuhashi
Founder & Software Architect, Treasure Data, Inc.
Execution overview
Transaction: runs on a single thread and decides taskCount.
Tasks: run on multiple threads (or machines). Each task run receives its index and the shared task:
  { taskIndex: 0, task: {…} }
  { taskIndex: 2, task: {…} }
Parallel execution
Tasks wait in a task queue; threads take them from the queue and run them in parallel (embulk-executor-local-thread).
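The idea of a task queue drained by a pool of threads can be sketched with Python's standard `concurrent.futures`. This is only an analogy for the local thread executor, not its implementation; `run_task`, `run_in_parallel`, and `max_threads` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index, task):
    # Hypothetical task body: a real task would run the plugin pipeline.
    return {"taskIndex": task_index, "processed": task["name"]}


def run_in_parallel(task_count, task, max_threads=4):
    # The pool's internal work queue plays the role of the task queue.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, i, task) for i in range(task_count)]
        # Collect results in task-index order, whatever order they finished in.
        return [f.result() for f in futures]


reports = run_in_parallel(task_count=4, task={"name": "example"})
```

Every task run receives the same `task` and a distinct `taskIndex`, matching the execution overview.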
Distributed execution
Tasks wait in a task queue and run on Hadoop as map tasks (embulk-executor-mapreduce).
Distributed execution (w/ partitioning)
Tasks wait in a task queue and run on Hadoop as a Map - Shuffle - Reduce job (embulk-executor-mapreduce).
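The Map - Shuffle - Reduce flow with partitioning can be illustrated in memory. This is a toy model under assumptions, not the MapReduce executor: partitioning by a `date` field and collecting values per partition are invented for the example.

```python
from collections import defaultdict


def map_phase(records, key_fn):
    # Map: tag each record with the partition key it belongs to.
    return [(key_fn(r), r) for r in records]


def shuffle_phase(pairs):
    # Shuffle: group records so that each key lands in exactly one partition.
    partitions = defaultdict(list)
    for key, record in pairs:
        partitions[key].append(record)
    return dict(partitions)


def reduce_phase(partitions):
    # Reduce: one reduce task per partition; here it just collects the values.
    return {key: [r["v"] for r in records] for key, records in partitions.items()}


records = [
    {"date": "2015-01-01", "v": 1},
    {"date": "2015-01-02", "v": 2},
    {"date": "2015-01-01", "v": 3},
]
output = reduce_phase(shuffle_phase(map_phase(records, lambda r: r["date"])))
```

After the shuffle, all records with the same key are handled by the same reduce task, which is what allows output to be written per partition.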
Transaction control
Each plugin's transaction wraps the next one:

  fileInput.transaction {                (file input plugin)
    parser.transaction {                 (parser plugin)
      filters.transaction {              (filter plugins)
        formatter.transaction {          (formatter plugin)
          fileOutput.transaction {       (file output plugin)
            executor.transaction {       (executor plugin)
              …
            }
          }
        }
      }
    }
  }
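The nesting above fixes the order in which plugins begin and end their transactions. The following is a minimal Python sketch of that ordering only; `make_transaction`, `wrap`, and `run_nested` are hypothetical helpers for illustration, not Embulk's API.

```python
def make_transaction(name, log):
    """Return a transaction function that records when it begins and ends."""
    def transaction(control):
        log.append(f"begin {name}")
        result = control()          # hand control to the next (inner) plugin
        log.append(f"end {name}")
        return result
    return transaction


def wrap(transaction, inner):
    """Bind one transaction around an inner callable."""
    return lambda: transaction(inner)


def run_nested(names, body, log):
    # Build the nesting from the inside out so that names[0] is outermost.
    control = body
    for name in reversed(names):
        control = wrap(make_transaction(name, log), control)
    return control()


log = []
order = ["fileInput", "parser", "filters", "formatter", "fileOutput", "executor"]
run_nested(order, lambda: log.append("run tasks"), log)
```

In this sketch the file input plugin begins first and ends last, and the tasks run only once every plugin has entered its transaction.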
Task configuration

  fileInput.transaction { fileInputTask, taskCount →
    parser.transaction { parserTask, schema →
      filters.transaction { filterTasks, schema →
        formatter.transaction { formatterTask →
          fileOutput.transaction { fileOutputTask →
            executor.transaction { →
              task = {
                fileInputTask,
                parserTask,
                filterTasks,
                formatterTask,
                fileOutputTask,
              }
              taskCount.times.inParallel { taskIndex →
                run(taskIndex, task)
              }
            }
          }
        }
      }
    }
  }
taskCount is decided by the input.
schema is decided by the input and may be modified by filters.
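How taskCount and schema flow through the nested transactions can be sketched in Python as below. This is an illustrative model under stated assumptions: the config keys (`paths`, `columns`, `drop`), the one-task-per-file rule, and the column-dropping filter are invented for the example, and the formatter, file output, and executor levels are omitted for brevity.

```python
def configure(config):
    """Collect per-plugin tasks, taskCount, and schema through nested callbacks."""

    def file_input_tx(control):
        # Assumption: one task per input file, so the input decides taskCount.
        file_input_task = {"paths": list(config["paths"])}
        return control(file_input_task, len(config["paths"]))

    def parser_tx(control):
        # The input side (here, the parser) decides the initial schema.
        return control({"type": "csv"}, list(config["columns"]))

    def filters_tx(schema, control):
        # A filter may modify the schema; this hypothetical one drops columns.
        out_schema = [c for c in schema if c not in config["drop"]]
        return control([{"drop": list(config["drop"])}], out_schema)

    return file_input_tx(lambda file_input_task, task_count:
        parser_tx(lambda parser_task, schema:
            filters_tx(schema, lambda filter_tasks, out_schema: {
                "taskCount": task_count,
                "schema": out_schema,
                "task": {
                    "fileInputTask": file_input_task,
                    "parserTask": parser_task,
                    "filterTasks": filter_tasks,
                },
            })))


def run(task_index, task):
    # Hypothetical task body: report which file this task index would read.
    return task["fileInputTask"]["paths"][task_index]


plan = configure({
    "paths": ["a.csv", "b.csv", "c.csv"],
    "columns": ["id", "name", "tmp"],
    "drop": ["tmp"],
})
results = [run(i, plan["task"]) for i in range(plan["taskCount"])]
```

The same `task` value is passed to every run; only `taskIndex` differs between them.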
Task execution
Within each task, the plugins are connected as a pipeline:

  file input plugin:   fileInput.open()
  parser plugin:       parser.run(fileInput, pageOutput)
  filter plugins
  formatter plugin:    formatter.open(fileOutput)
  file output plugin:  fileOutput.open()
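The pipeline inside one task can be modelled with plain Python callables. This is a rough sketch, not Embulk's plugin interfaces: the function names mirror the calls on the slide, while the CSV-to-TSV behaviour and the upper-casing filter are assumptions made up for the example.

```python
import io


def file_input_open(chunks):
    # Stands in for fileInput.open(): yields raw byte chunks.
    yield from chunks


def parser_run(file_input, page_output):
    # Stands in for parser.run(fileInput, pageOutput): bytes in, records out.
    text = b"".join(file_input).decode("utf-8")
    for line in text.splitlines():
        page_output(line.split(","))


def make_filter(fn, next_output):
    # A filter receives each record, transforms it, and passes it downstream.
    def page_output(record):
        next_output(fn(record))
    return page_output


def formatter_open(file_output):
    # Stands in for formatter.open(fileOutput): records in, text out.
    def page_output(record):
        file_output.write("\t".join(record) + "\n")
    return page_output


file_output = io.StringIO()                      # stands in for fileOutput.open()
formatter = formatter_open(file_output)
upper = make_filter(lambda r: [v.upper() for v in r], formatter)
parser_run(file_input_open([b"a,b\nc,", b"d\n"]), upper)
result = file_output.getvalue()
```

The parser pulls bytes from the file input and pushes records downstream, so a record reaches the file output as soon as the parser emits it.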
Type conversion

Input type system → Embulk type system → Output type system

Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
  Converted to Embulk types by the input plugin (parser plugin if input is file-based).

Embulk type system: boolean, long, double, string, timestamp
  Converted to output types by the output plugin (formatter plugin if output is file-based).

Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
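As a rough illustration of the input side of this conversion, the sketch below maps the PostgreSQL-style type names listed above onto the five Embulk types. The mapping table is an assumption chosen for the example, not the behaviour of any particular plugin, and `to_embulk_schema` is a hypothetical helper.

```python
EMBULK_TYPES = {"boolean", "long", "double", "string", "timestamp"}

# Assumed mapping for illustration only; a real plugin defines its own rules.
POSTGRES_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_schema(columns):
    """Convert (name, source_type) pairs into (name, embulk_type) pairs."""
    schema = []
    for name, source_type in columns:
        embulk_type = POSTGRES_TO_EMBULK.get(source_type)
        if embulk_type is None:
            raise ValueError(f"no Embulk type assumed for {source_type!r}")
        schema.append((name, embulk_type))
    return schema


schema = to_embulk_schema([
    ("id", "bigint"),
    ("name", "varchar"),
    ("score", "double precision"),
    ("created_at", "timestamp with zone"),
])
```

Because the Embulk type system is small, several source types collapse into one Embulk type; an output plugin then expands in the other direction for its own destination.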