Swiss Big Data User Group - Introduction to Apache Drill

1

Introduc)on to Apache Drill

Michael Hausenblas, Chief Data Engineer EMEA, MapR 6th Swiss Big Data User Group MeeAng, Zurich, 2013-‐03-‐25

2 2

Kudos to hJp://cmx.io/

3

Workloads •  Batch processing (MapReduce)

•  Light-‐weight OLTP (HBase, Cassandra, etc.)

•  Stream processing (Storm, S4)

•  Search (Solr, ElasAcsearch)

•  Interac)ve, ad-‐hoc query and analysis (?)

4

Impala

InteracAve Query at Scale

low-‐latency

5

Use Case I

•  Jane, a markeAng analyst •  Determine target segments •  Data from different sources

6

Use Case II

•  LogisAcs – supplier status •  Queries – How many shipments from supplier X? – How many shipments in region Y?

SUPPLIER_ID NAME REGION

ACM ACME Corp US

GAL GotALot Inc US

BAP Bits and Pieces Ltd Europe

ZUP Zu Pli Asia

{ "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …

7

Today’s SoluAons •  RDBMS-‐focused –  ETL data from MongoDB and Hadoop –  Query data using SQL

•  MapReduce-‐focused –  ETL from RDBMS and MongoDB –  Use Hive, etc.

8

Requirements

•  Support for different data sources •  Support for different query interfaces •  Low-‐latency/real-‐Ame •  Ad-‐hoc queries •  Scalable, reliable

9

Google’s Dremel

hJp://research.google.com/pubs/pub36632.html

10

Apache Drill Overview

•  Inspired by Google’s Dremel •  Standard SQL 2003 support •  Other QL possible •  Plug-‐able data sources •  Support for nested data •  Schema is opAonal •  Community driven, open, 100’s involved

11

Apache Drill Overview

12

High-‐level Architecture

13

High-‐level Architecture •  Each node: Drillbit -‐ maximize data locality •  Co-‐ordinaAon, query planning, execuAon, etc, are distributed •  By default Drillbits hold all roles •  Any node can act as endpoint for a query

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

14

High-‐level Architecture •  Zookeeper for ephemeral cluster membership info •  Distributed cache (Hazelcast) for metadata, locality

informaAon, etc.

Zookeeper

Distributed Cache

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Distributed Cache Distributed Cache Distributed Cache

15

High-‐level Architecture •  Origina)ng Drillbit acts as foreman, manages query execuAon,

scheduling, locality informaAon, etc. •  Streaming data communica)on avoiding SerDe

Zookeeper

Distributed Cache

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Distributed Cache Distributed Cache Distributed Cache

16

Principled Query ExecuAon

Source Query Parser

Logical Plan OpAmizer

Physical Plan ExecuAon

SQL 2003 DrQL MongoQL DSL

scanner API topology query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

parser API

17

Drillbit Modules

DFS Engine

HBase Engine

RPC Endpoint

SQL

HiveQL

Pig

Parser

Distributed Cache

Logical Plan

Physical Plan

OpAmizer

Storage En

gine

Interface Scheduler

Foreman

Operators Mongo

18

Key Features

•  Full SQL 2003 •  Nested data •  OpAonal schema •  Extensibility points

19

Full SQL – ANSI SQL 2003 •  SQL-‐like is oken not enough •  IntegraAon with exisAng tools –  Datameer, Tableau, Excel, SAP Crystal Reports –  Use standard ODBC/JDBC driver

20

Nested Data

•  Nested data becoming prevalent –  JSON/BSON, XML, ProtoBuf, Avro –  Some data sources support it naAvely (MongoDB, etc.)

•  FlaJening nested data is error-‐prone •  Extension to ANSI SQL 2003

21

OpAonal Schema •  Many data sources don’t have rigid schemas –  Schema changes rapidly –  Different schema per record (e.g. HBase)

•  Supports queries against unknown schema •  User can define schema or via discovery

22

Extensibility Points

•  Source query – parser API •  Custom operators, UDF – logical plan •  OpAmizer •  Data sources and formats – scanner API

Source Query Parser

Logical Plan OpAmizer

Physical Plan ExecuAon

23

… and Hadoop? •  HDFS can be a data source

•  Complementary use cases …

•  … use Apache Drill –  Find record with specified condiAon –  AggregaAon under dynamic condiAons

•  … use MapReduce –  Data mining with mulAple iteraAons –  ETL

23 hJps://cloud.google.com/files/BigQueryTechnicalWP.pdf

24

Example

hJps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

{ "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [

{ "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" },

…

data source: donuts.json

query:[ { op:"sequence", do:[

{ op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },

…

logical plan: simple_plan.json

result: out.json

{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }

25

Status

•  Heavy development by mulAple organizaAons

•  Available – Logical plan (ADSP) – Reference interpreter – Basic SQL parser – Basic demo – Basic HBase back-‐end

26

Status

March/April •  Larger SQL syntax •  Physical plan •  In-‐memory compressed data interfaces •  Distributed execuAon focused on large cluster high performance sort, aggregaAon and join

27

ContribuAng •  Dremel-‐inspired columnar format: TwiJer’s Parquet and

Hive’s ORC file

•  IntegraAon with Hive metastore (?)

•  DRILL-‐13 Storage Engine: Define Java Interface

•  DRILL-‐15 Build HBase storage engine implementaAon

28

ContribuAng •  DRILL-‐48 RPC interface for query submission and physical plan

execuAon

•  DRILL-‐53 Setup cluster configuraAon and membership mgmt system –  ZK for coordinaAon –  Helix for parAAon and resource assignment (?)

•  Further schedule –  Alpha Q2 –  Beta Q3

29

Kudos to …

•  Julian Hyde, Pentaho •  Timothy Chen, Microsok •  Chris Merrick, RJMetrics •  David Alves, UT AusAn •  Sree Vaadi, SSS/NGData •  Jacques Nadeau, MapR •  Ted Dunning, MapR

30

Engage! •  Follow @ApacheDrill on TwiJer

•  Sign up at mailing lists (user | dev) hJp://incubator.apache.org/drill/mailing-‐lists.html

•  Learn where and how to contribute hJps://cwiki.apache.org/confluence/display/DRILL/ContribuAng

•  Keep an eye on hJp://drill-‐user.org/

Technology

Swiss Big Data User Group - Introduction to Apache Drill