Apache Drill Design proposal from
OpenDremel team
Camuel Gilyadov & Constantine Peresypkin,
Email: [email protected]
OpenDremel Story: 2010
• Camuel Gilyadov started a Dremel implementation, named
OpenDremel, in summer 2010.
• David Gruzman joined the effort a few months later,
followed by Constantine Peresypkin.
• There wasn’t a comprehensive design or architecture.
The goal was to get the hierarchical-columnar transformation
working smoothly and in strict accordance with the
Dremel paper. We have published several working
implementations under the Apache License.
• Hong San was hired as the first full-timer to speed up
development. The Metaxa milestone was set.
OpenDremel Story: 2011
• The early OpenDremel design proved too naive, mainly due to
Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch
under the new name Dazo. With Dazo, a query plan is an arbitrary
piece of executable native code, produced by a Java frontend.
• From then on we took inspiration from BigQuery rather than
from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology, both to
isolate queries and to meter resource consumption. The new
sandbox was named ZeroVM.
• For storage we decided to use OpenStack Swift.
OpenDremel Story: 2012
• With four people full-time and several others part-time, we still
don’t have a fully integrated version, but we are satisfied
with what we have achieved and convinced that the
decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in
itself, revolutionizing the BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be
useful to it.
Design Tenet #1
• Apache Drill must support multi-tenant semantics
internally rather than being run inside guest VMs.
• It should be inspired by BigQuery and not only by
Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed
hourly) just to be able to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part
and an untrusted part. The trusted part must be kept to an
absolute minimum and must be peer-reviewed, secured,
audited and metered.
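The split between a small trusted part and untrusted tenant code can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `Usage`, `BudgetExceeded` and `run_metered` are hypothetical, the untrusted job is modelled as a plain function over bytes, and no real isolation is performed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Usage:
    """What the trusted part records for billing (hypothetical fields)."""
    bytes_in: int = 0
    bytes_out: int = 0
    cpu_seconds: float = 0.0


class BudgetExceeded(Exception):
    """Raised by the trusted part before any untrusted code runs."""


def run_metered(job: Callable[[bytes], bytes], data: bytes,
                max_bytes_in: int) -> Tuple[bytes, Usage]:
    """Trusted wrapper: check the budget, run the untrusted job, record usage."""
    if len(data) > max_bytes_in:
        raise BudgetExceeded(
            f"input of {len(data)} bytes exceeds budget of {max_bytes_in}")
    start = time.process_time()
    result = job(data)  # untrusted code; a real system would sandbox this call
    elapsed = time.process_time() - start
    return result, Usage(len(data), len(result), elapsed)


out, usage = run_metered(lambda b: b.upper(), b"abc", max_bytes_in=10)
print(out, usage.bytes_in, usage.bytes_out)  # b'ABC' 3 3
```

The point of the sketch is only that the metering and budget check live in a few lines that can be audited, while the job itself is never trusted.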
Design Tenet #2
• Apache Drill must be extremely flexible and
customizable.
• The schema-on-read concept must be supported.
It must be possible to embed imperative,
high-performance parser code in the query.
• SQL is no longer enough. It must be easy to add new
query languages as plug-ins or as user-defined functions
(UDFs).
• Additionally, various data formats must be supported,
such as column stores, row stores, PAX, RCFile, etc.
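Schema-on-read with a parser embedded in the query can be sketched as follows. This is a toy model under stated assumptions: the log format, the function names `parse_access_log` and `count_where`, and the record shape are all hypothetical, and Python stands in for the high-performance native code the proposal has in mind.

```python
from typing import Callable, Dict, Iterator

Record = Dict[str, str]
Parser = Callable[[bytes], Iterator[Record]]


def parse_access_log(raw: bytes) -> Iterator[Record]:
    """Tenant-supplied parser: one 'user path status' record per line."""
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        user, path, status = line.split()
        yield {"user": user, "path": path, "status": status}


def count_where(raw: bytes, parser: Parser, field: str, value: str) -> int:
    """Apply the schema at read time, then run a simple filter-and-count."""
    return sum(1 for rec in parser(raw) if rec.get(field) == value)


# The stored data is just raw bytes; the schema arrives with the query.
raw_blob = b"alice /index 200\nbob /login 500\nalice /cart 200\n"
print(count_where(raw_blob, parse_access_log, "status", "200"))  # prints 2
```

A different data format would need only a different parser function passed along with the query; nothing about the stored bytes changes.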
Design Tenet #2 (cont.)
• We suggest that the query plan format be relaxed to
arbitrary distributed executable code and the data
format relaxed to an arbitrary opaque BLOB.
• This way new query languages and new data formats
could be easily supported without changing the backend.
• As an added benefit, the backend becomes a generic,
lightweight, homogeneous compute-storage cloud.
• Such an approach exhibits good separation of control.
The cloud operator controls and bills for the generic
infrastructure, and the query engine is left completely in
the control of the tenant/user.
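The idea of a backend that sees only opaque jobs and opaque blobs can be sketched in a few lines of Python. This is an illustration under assumptions, not the Dazo design: `GenericBackend`, `count_lines` and `sum_counts` are hypothetical names, the job is modelled as a function over bytes rather than native code, and everything runs in one process.

```python
from typing import Callable, Dict, List

Job = Callable[[bytes], bytes]          # stand-in for arbitrary executable code
Merge = Callable[[List[bytes]], bytes]  # tenant-supplied combiner of partials


class GenericBackend:
    """Toy compute-storage backend: stores opaque blobs, runs opaque jobs."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, name: str, blob: bytes) -> None:
        self._blobs[name] = blob

    def run(self, job: Job, merge: Merge, names: List[str]) -> bytes:
        # The backend never inspects blob contents or the logic of the job.
        partials = [job(self._blobs[n]) for n in names]
        return merge(partials)


# Tenant side: a "query plan" for a newline-delimited format.
def count_lines(blob: bytes) -> bytes:
    return str(blob.count(b"\n")).encode()


def sum_counts(partials: List[bytes]) -> bytes:
    return str(sum(int(p) for p in partials)).encode()


backend = GenericBackend()
backend.put("part-0", b"a\nb\n")
backend.put("part-1", b"c\n")
print(backend.run(count_lines, sum_counts, ["part-0", "part-1"]))  # b'3'
```

Supporting another query language or data format would mean shipping a different job and combiner; the backend class stays untouched, which is the separation of control described above.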
Design Tenet #3
• Apache Drill requests/queries must be hyper-elastic,
meaning able to exploit the compute capacity of
thousands of servers for a short duration of just a few
seconds. No resources may be kept spinning per user
between queries or when idle.
• Traditional VMs are too heavyweight for that.
Container approaches such as OpenVZ and LXC are
not secure enough in a multi-tenancy context.
• We suggest making sandboxing pluggable and
supporting ZeroVM (developed for OpenDremel) and
LXC (fine for private clouds) to begin with.
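Pluggable sandboxing could look roughly like the registry below. This is a sketch under assumptions: the `Sandbox` interface, `register` and `make_sandbox` are hypothetical names, not an actual Drill, ZeroVM or LXC API, and the only plug-in shown performs no isolation at all.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class Sandbox(ABC):
    """Hypothetical plug-in interface for running one job on one input."""

    @abstractmethod
    def run(self, job: Callable[[bytes], bytes], data: bytes) -> bytes:
        ...


_REGISTRY: Dict[str, Type[Sandbox]] = {}


def register(name: str):
    """Class decorator that makes a sandbox selectable by name."""
    def decorator(cls: Type[Sandbox]) -> Type[Sandbox]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("inprocess")
class InProcessSandbox(Sandbox):
    """No isolation; for tests only. A ZeroVM or LXC plug-in would instead
    start a fresh, disposable instance for each request."""

    def run(self, job: Callable[[bytes], bytes], data: bytes) -> bytes:
        return job(data)


def make_sandbox(name: str) -> Sandbox:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown sandbox {name!r}; available: {sorted(_REGISTRY)}"
        ) from None


sandbox = make_sandbox("inprocess")
print(sandbox.run(lambda b: b[::-1], b"drill"))  # b'llird'
```

With this shape, choosing between sandboxes becomes a configuration value, and the rest of the system depends only on the `run` method.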
Design Tenet #4
• Apache Drill must be efficient.
• Value-per-byte is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• We found Java inefficient for general number
crunching (such as data compression). The main
problem with Java is that GC overhead is unavoidable
for the whole data corpus being scanned. We went so
far as to keep all data in byte arrays and auto-generate
transformation code; it still underperformed, and
code complexity went through the roof.
Suggested Architecture
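[Architecture diagram]
• Browser / Client: sends the Query.
• Single-Tenant Frontend, running inside a traditional guest VM:
Query Compiler on the JVM; emits a custom executable job.
• Multi-Tenant Backend: scale-out object store and in-situ
compute; receives the custom executable job.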
OpenDremel/Dazo
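[Architecture diagram annotated with current project status]
• Client (sends the Query): two separate unfinished jQuery
apps and a command-line app, with no particular codenames.
• Frontend (Query Compiler on the JVM): we call it Metaxa
(for historic reasons). BQL parser and an unfinished compiler
based on Apache Velocity.
• Backend (receives the custom executable job): we call it
Zwift (Swift + ZeroVM). Alpha quality.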
What is Swift?
“Swift is a highly available, distributed,
eventually consistent object/blob store.
Organizations can use Swift to store
lots of data efficiently, safely, and
cheaply.”
Haven’t got it?
Swift is THE open-source
equivalent of
Amazon S3
What is ZeroVM?
Highly secure, low-overhead, low-latency container-style
virtualization based on the Google Native Client project. The
critical security code is taken verbatim from the Chrome
browser project and is therefore as secure as the Chrome
browser. More info: http://ZeroVM.org and
http://news.ycombinator.com/item?id=3746222
ZeroVM highlights
1. Disposable VM per request
2. HyperElasticity per request
3. Embeddable into everything
4. High-performance (x86/ARM)
5. Erlang-inspired clustering
6. Written in pure C, no dependencies
Haven’t got it?
ZeroVM is to virtualization
what
SQLite is to databases
Where is the code?
• OpenDremel (1st generation design):
– http://code.google.com/p/dremel/source/browse?repo=dremel
– http://code.google.com/p/dremel/source/browse?repo=metaxa
• Dazo (2nd generation design):
– https://github.com/Dazo-org