
Technical Overview of Apache Drill by Jacques Nadeau

A deep dive into Apache Drill, given by Jacques Nadeau in May 2013.

Page 1: Technical Overview of Apache Drill by Jacques Nadeau

Technical Overview
Jacques Nadeau, [email protected]
May 22, 2013

Page 2: Technical Overview of Apache Drill by Jacques Nadeau

Basic Process

[Diagram: a query arrives at any one of several Drillbits; each Drillbit runs next to DFS/HBase storage and a Distributed Cache, with ZooKeeper coordinating the cluster]

1. Query comes to any Drillbit.

2. Drillbit generates an execution plan based on affinity.

3. Fragments are farmed out to individual nodes.

4. Data is returned to the driving node.

Page 3: Technical Overview of Apache Drill by Jacques Nadeau

Core Modules within a Drillbit

[Diagram: core modules within a Drillbit: the SQL Parser and Optimizer turn a Logical Plan into a Physical Plan, which Execution runs through the Storage Engine Interface (DFS Engine, HBase Engine), alongside an RPC Endpoint and the Distributed Cache]

Page 4: Technical Overview of Apache Drill by Jacques Nadeau

Query States

SQL: what we want to do (analyst friendly)

Logical Plan: what we want to do (language agnostic, computer friendly)

Physical Plan: how we want to do it (the best way we can tell)

Execution Plan (fragments): where we want to do it

Page 5: Technical Overview of Apache Drill by Jacques Nadeau

SQL

SELECT
  t.cf1.name as name,
  SUM(t.cf1.sales) as total_sales
FROM m7://cluster1/sales t
GROUP BY name
ORDER BY total_sales DESC
LIMIT 10;

Page 6: Technical Overview of Apache Drill by Jacques Nadeau

Logical Plan: API/Format using JSON

Designed to be as easy as possible for language implementers to utilize, with sugared syntax such as the sequence meta-operator

Not constrained to SQL-specific paradigms: also supports complex data type operators such as collapse and expand

Allows late typing

sequence: [
  { op: scan, storageengine: m7, selection: {table: sales}},
  { op: project, projections: [
      {ref: name, expr: cf1.name},
      {ref: sales, expr: cf1.sales}]},
  { op: segment, ref: by_name, exprs: [name]},
  { op: collapsingaggregate, target: by_name, carryovers: [name],
    aggregations: [{ref: total_sales, expr: sum(sales)}]},
  { op: order, ordering: [{order: desc, expr: total_sales}]},
  { op: store, storageengine: screen}]

Page 7: Technical Overview of Apache Drill by Jacques Nadeau

Physical Plan

Insert points of parallelization where the optimizer thinks they are necessary. If we thought the cardinality of name would be high, we might use sort > range-merge-exchange > streaming aggregate > sort > range-merge-exchange instead of the simpler hash-random-exchange > sorting-hash-aggregate.

Pick the right version of each operator. For example, here we have picked the sorting hash aggregate: since a hash aggregate is already a blocking operator, doing the sort simultaneously allows us to avoid materializing an intermediate state.

Apply projection and other push-down rules into capable operators. Note that the projection is gone, applied directly to the m7scan operator.

{ @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf1.sales]}
{ @id: 2, pop: hash-random-exchange, input: 1, expr: 1}
{ @id: 3, pop: sorting-hash-aggregate, input: 2,
  grouping: 1, aggr: [sum(2)], carry: [1], sort: ~aggr[0]}
{ @id: 4, pop: screen, input: 3}

Page 8: Technical Overview of Apache Drill by Jacques Nadeau

Execution Plan

Break the plan into major fragments

Determine the quantity of parallelization for each task based on estimated costs as well as the maximum parallelization for each fragment (file size for now)

Collect up endpoint affinity for each particular HasAffinity operator

Assign particular nodes based on affinity, load and topology

Generate minor versions of each fragment for individual execution

FragmentId (see the sketch below): Major = a portion of the dataflow; Minor = a particular parallel version of that execution (1 or more)
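
A minimal sketch of what such a two-part identifier could look like in plain Java; the class and field names are illustrative assumptions, not Drill's actual types:

    // Hypothetical fragment identifier: which portion of the dataflow (major)
    // and which parallel instance of that portion (minor). Illustrative only.
    public final class FragmentId {
        private final int majorId;  // portion of the dataflow
        private final int minorId;  // particular parallel version (0..n-1)

        public FragmentId(int majorId, int minorId) {
            this.majorId = majorId;
            this.minorId = minorId;
        }

        public int getMajorId() { return majorId; }
        public int getMinorId() { return minorId; }

        @Override
        public String toString() { return majorId + ":" + minorId; }
    }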

Page 9: Technical Overview of Apache Drill by Jacques Nadeau

Execution Plan, cont’d

Each execution plan has:

One root fragment (runs on the driving node)

Leaf fragments (the first tasks to run)

Intermediate fragments (won't start until they receive data from their children)

In the case where the query output is routed to storage, the root operator will often receive metadata to present rather than data

[Diagram: fragment tree; the Root fragment at the top, Intermediate fragments beneath it, and Leaf fragments at the bottom]

Page 10: Technical Overview of Apache Drill by Jacques Nadeau

Example Fragments

Leaf Fragment 1
{ pop : "hash-partition-sender", @id : 1,
  child : { pop : "mock-scan", @id : 2,
    url : "http://apache.org",
    entries : [ { id : 1, records : 4000 } ] },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] }

Leaf Fragment 2
{ pop : "hash-partition-sender", @id : 1,
  child : { pop : "mock-scan", @id : 2,
    url : "http://apache.org",
    entries : [ { id : 1, records : 4000 }, { id : 2, records : 4000 } ] },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] }

Root Fragment
{ pop : "screen", @id : 1,
  child : { pop : "random-receiver", @id : 2,
    providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ] } }

Intermediate Fragment
{ pop : "single-sender", @id : 1,
  child : { pop : "mock-store", @id : 2,
    child : { pop : "filter", @id : 3,
      child : { pop : "random-receiver", @id : 4,
        providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=", "Cglsb2NhbGhvc3QY0gk=" ] },
      expr : " ('b') > (5) " } },
  destinations : [ "Cglsb2NhbGhvc3QYqRI=" ] }

Page 11: Technical Overview of Apache Drill by Jacques Nadeau

Execution Flow

[Diagram: a query flows from the Drill Client to the UserServer and on to the Query Foreman, which invokes the Parser, Optimizer and Execution Planner and communicates with other Drillbits via BitCom]

Page 12: Technical Overview of Apache Drill by Jacques Nadeau

SQL Parser

Leverage Optiq

Add support for the "any" type

Add support for nested and repeated[] references

Add transformation rules to convert from the SQL AST to logical plan syntax

Page 13: Technical Overview of Apache Drill by Jacques Nadeau

Optimizer

Convert logical plans to physical plans

Very much TBD

Likely to leverage Optiq

The hardest problem in the system, especially given the lack of statistics

Probably not parallel

Page 14: Technical Overview of Apache Drill by Jacques Nadeau

Execution Planner

Each scan operator provides a maximum width of parallelization based on the number of read entries (similar to splits)

The parallelization width decision is based on a simple disk cost (size); see the sketch below

Affinity orders the location of fragment assignment

Storage, Scan and Exchange operators are informed of the actual endpoint assignments so they can re-decide their entries (splits)
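
As a rough illustration of that width decision, a sketch under stated assumptions; the threshold constant and the method names are invented for this example:

    // Cap a scan's parallelization by its read entries (splits) and by a
    // simple size-based disk cost. Constant and names are assumptions.
    public class WidthPlanner {
        // Assumed target amount of data handled per minor fragment.
        private static final long BYTES_PER_FRAGMENT = 256L * 1024 * 1024;

        public static int decideWidth(int readEntries, long totalBytes) {
            int widthByCost = (int) Math.max(1, totalBytes / BYTES_PER_FRAGMENT);
            return Math.min(readEntries, widthByCost);  // never exceed available splits
        }
    }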

Page 15: Technical Overview of Apache Drill by Jacques Nadeau

Grittier

Page 16: Technical Overview of Apache Drill by Jacques Nadeau

Execution Engine

Single JVM per Drillbit

Small heap space for object management

Small set of network event threads to manage socket operations

Callbacks for each message sent

Messages contain a header and a collection of native byte buffers

Designed to minimize copies and ser/de costs

Query setup and fragment runners are managed via processing queues and thread pools (sketched below)
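
The queue-and-pool pattern from the last bullet, reduced to a generic Java sketch; this is an illustration, not Drill's actual executor wiring:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Fragment runners as plain Runnables on a bounded pool: submissions
    // queue up until a worker thread is free. Generic illustration only.
    public class FragmentExecutor {
        private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        public void submit(Runnable fragmentRunner) {
            pool.execute(fragmentRunner);
        }

        public void shutdown() {
            pool.shutdown();
        }
    }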

Page 17: Technical Overview of Apache Drill by Jacques Nadeau

Data

Records are broken into batches

Batches contain a schema and a collection of fields

Each field has a particular type (e.g. smallint)

Fields (a.k.a. columns) are stored in ValueVectors

ValueVectors are façades over byte buffers (see the sketch below)

The in-memory structure of each ValueVector is well defined and language agnostic

ValueVectors are defined based on the width and nature of the underlying data: RepeatMap, Fixed1, Fixed2, Fixed4, Fixed8, Fixed12, Fixed16, Bit, FixedLen, VarLen1, VarLen2, VarLen4

There are three sub value vector types: optional (nullable), required, and repeated
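
To make the façade idea concrete, a toy sketch of a fixed-width vector (4 bytes per value, in the spirit of Fixed4) over a direct byte buffer; the class name and layout details are assumptions for illustration:

    import java.nio.ByteBuffer;

    // A fixed-width "vector" as a facade over an off-heap buffer: value i
    // lives at byte offset i * 4, so the layout is well defined and
    // language agnostic.
    public class Fixed4Vector {
        private static final int WIDTH = 4;  // bytes per value
        private final ByteBuffer data;

        public Fixed4Vector(int valueCount) {
            this.data = ByteBuffer.allocateDirect(valueCount * WIDTH);
        }

        public void set(int index, int value) { data.putInt(index * WIDTH, value); }
        public int get(int index)             { return data.getInt(index * WIDTH); }
    }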

Page 18: Technical Overview of Apache Drill by Jacques Nadeau

Execution Paradigm

We will have a large number of operators

Each operator works on a batch of records at a time

A loose goal is that batches are roughly a single core's L2 cache in size

Each batch of records carries a schema

An operator is responsible for reconfiguring itself if a new schema arrives (or rejecting the record batch if the schema is disallowed)

Most operators are the combination of a set of static operations along with the evaluation of query-specific expressions

Runtime-compiled operators are the combination of a pre-compiled template and a runtime-compiled set of expressions (sketched below)

Exchange operators are converted into Senders and Receivers when the execution plan is materialized

Each operator must support consumption of a SelectionVector, a partial materialization of a filter
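
A sketch of that template-plus-compiled-expression split: the loop below is the pre-compiled part, while doEval() stands in for the query-specific expression that would be compiled at runtime (with Janino, per the Technologies slide). The names are assumptions, not Drill's operator classes:

    // Pre-compiled filter template. Only doEval(), generated from the
    // query's filter expression, changes per query; the batch loop is shared.
    public abstract class FilterTemplate {
        protected abstract boolean doEval(int recordIndex);  // runtime-compiled part

        // Writes surviving record indexes into a selection vector and returns
        // how many passed, rather than copying the records themselves.
        public int filterBatch(int recordCount, int[] selectionVector) {
            int outCount = 0;
            for (int i = 0; i < recordCount; i++) {
                if (doEval(i)) {
                    selectionVector[outCount++] = i;
                }
            }
            return outCount;
        }
    }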

Page 19: Technical Overview of Apache Drill by Jacques Nadeau

Storage Engine

Input and output are done through storage engines (plus the specialized screen storage operator)

A storage engine is responsible for providing metadata and statistics about the data

A storage engine exposes a set of optimizer (plan rewrite) rules to support things such as predicate pushdown

A storage engine provides one or more storage-engine-specific scan operators that can support affinity exposure and task splitting; these are generated based on a StorageEngine-specific configuration

The primary interfaces are RecordReader and RecordWriter (a reader sketch follows below). RecordReaders are responsible for converting stored data into Drill's canonical ValueVector format a batch at a time, and for providing the schema of each record batch

Our initial storage engines will be for DFS and HBase
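
A hedged sketch of the reader side of that contract; the real Drill interface differs in its details (output mutators, allocators and so on), so treat the signatures below as illustrative:

    // Illustrative reader contract: fill ValueVectors one batch at a time
    // and expose the schema of each batch. Signatures are assumptions.
    public interface RecordReader {
        /** Allocate output vectors and prepare the underlying storage. */
        void setup() throws Exception;

        /** Read the next batch into the output vectors; return the record
         *  count, or 0 when the reader is exhausted. */
        int next();

        /** Release buffers and close the underlying storage. */
        void cleanup();
    }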

Page 20: Technical Overview of Apache Drill by Jacques Nadeau

Messages

The Foreman drives the query

The Foreman saves intermediate fragments to the distributed cache

The Foreman sends leaf fragments directly to execution nodes

Executing fragments push record batches to their fragment's destination nodes

When a destination node receives the first batch for a new query, it retrieves its appropriate fragment from the distributed cache, sets up the required framework, then waits until its start requirement is met:

A fragment is evaluated for the number of different sending streams that are required before the query can actually be scheduled, based on each exchange's "supportsOutOfOrder" capability

When the IncomingBatchHandler recognizes that its start criteria has been reached, it begins (see the sketch below)

In the meantime, the destination node will buffer (potentially to disk)

Fragment status messages are pushed back to the Foreman directly from individual nodes

A single failure status causes the Foreman to cancel all other parts of the query

Page 21: Technical Overview of Apache Drill by Jacques Nadeau

Scheduling

The plan is to leverage the concepts inside Sparrow

The reality is that receiver-side buffering and pre-assigned execution locations mean that this is very much up in the air right now

Page 22: Technical Overview of Apache Drill by Jacques Nadeau

Operation/Configuration

A Drillbit is a single JVM

Extension is done by building against an API and generating a jar file that includes a drill-module.conf file with information about where that module needs to be inserted

All configuration is done via a JSON-like configuration metaphor that supports complex types (example below)

Node discovery/service registry is done through ZooKeeper

Metrics are collected utilizing the Yammer metrics module
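
Since the Technologies slide names Typesafe HOCON, configuration access presumably follows the standard Typesafe Config pattern below; the key name is a hypothetical example, not a documented Drill setting:

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    // Load HOCON configuration (reference.conf merged with application
    // overrides) and read a value. The key shown is hypothetical.
    public class ConfigExample {
        public static void main(String[] args) {
            Config config = ConfigFactory.load();
            String zk = config.getString("drill.exec.zk.connect");  // hypothetical key
            System.out.println("service registry at: " + zk);
        }
    }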

Page 23: Technical Overview of Apache Drill by Jacques Nadeau

User Interfaces

Drill provides DrillClient, which encapsulates endpoint discovery; supports logical and physical plan submission, query cancellation and query status; and supports streaming return of results

Drill will provide a JDBC driver which converts JDBC calls into DrillClient communication (see the sketch below). Currently SQL parsing is done client side, an artifact of the current state of Optiq; we need to slim down the JDBC driver and push that work remotely

In time, a REST proxy for DrillClient will be added
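
Once the JDBC driver exists, client code should be ordinary JDBC, along these lines; the connection URL format and the queried table/columns (reusing the sales example from earlier slides) are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Plain JDBC against Drill; only the URL is Drill-specific (and assumed here).
    public class JdbcExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT name, total_sales FROM sales LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }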

Page 24: Technical Overview of Apache Drill by Jacques Nadeau

Technologies

Jackson for JSON SerDe for metadata

Typesafe HOCON for configuration and module management

Netty4 as the core RPC engine, protobuf for communication

Vanilla Java, LArray and Netty ByteBuf for off-heap large data structure help

Hazelcast for distributed cache

Curator on top of ZooKeeper for service registry

Optiq for SQL parsing and cost optimization

Parquet (probably) as the 'native' format

Janino for expression compilation

ASM for bytecode manipulation

Yammer Metrics for metrics

Guava extensively

Carrot HPPC for primitive collections