30
Introduc)on to Apache Drill Michael Hausenblas, Chief Data Engineer EMEA, MapR 6 th Swiss Big Data User Group MeeAng, Zurich, 20130325

Swiss Big Data User Group - Introduction to Apache Drill

Embed Size (px)

DESCRIPTION

An introduction to Apache Drill given by MapR Chief Data Engineer, Michael Hausenblas at the 6th Swiss Big Data User Group Meeting. Zurich, 2013-03-25

Citation preview

Page 1: Swiss Big Data User Group - Introduction to Apache Drill

1  

Introduc)on  to  Apache  Drill  

Michael  Hausenblas,  Chief  Data  Engineer  EMEA,  MapR  6th  Swiss  Big  Data  User  Group  MeeAng,  Zurich,  2013-­‐03-­‐25  

Page 2: Swiss Big Data User Group - Introduction to Apache Drill

2  2  

Kudos  to  hJp://cmx.io/    

Page 3: Swiss Big Data User Group - Introduction to Apache Drill

3  

Workloads  •  Batch  processing  (MapReduce)  

•  Light-­‐weight  OLTP  (HBase,  Cassandra,  etc.)  

•  Stream  processing  (Storm,  S4)  

•  Search  (Solr,  ElasAcsearch)  

•  Interac)ve,  ad-­‐hoc  query  and  analysis  (?)  

Page 4: Swiss Big Data User Group - Introduction to Apache Drill

4  

Impala

InteracAve  Query  at  Scale  

low-­‐latency  

Page 5: Swiss Big Data User Group - Introduction to Apache Drill

5  

Use  Case  I  

•  Jane,  a  markeAng  analyst  •  Determine  target  segments  •  Data  from  different  sources    

Page 6: Swiss Big Data User Group - Introduction to Apache Drill

6  

Use  Case  II  

•  LogisAcs  –  supplier  status  •  Queries  – How  many  shipments  from  supplier  X?  – How  many  shipments  in  region  Y?  

SUPPLIER_ID   NAME   REGION  

ACM   ACME  Corp   US  

GAL   GotALot  Inc   US  

BAP   Bits  and  Pieces  Ltd   Europe  

ZUP   Zu  Pli   Asia  

{ "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …

Page 7: Swiss Big Data User Group - Introduction to Apache Drill

7  

Today’s  SoluAons  •  RDBMS-­‐focused  –  ETL  data  from  MongoDB  and  Hadoop  –  Query  data  using  SQL  

•  MapReduce-­‐focused  –  ETL  from  RDBMS  and  MongoDB  –  Use  Hive,  etc.  

Page 8: Swiss Big Data User Group - Introduction to Apache Drill

8  

Requirements  

•  Support  for  different  data  sources  •  Support  for  different  query  interfaces  •  Low-­‐latency/real-­‐Ame  •  Ad-­‐hoc  queries  •  Scalable,  reliable  

Page 9: Swiss Big Data User Group - Introduction to Apache Drill

9  

Google’s  Dremel  

hJp://research.google.com/pubs/pub36632.html    

Page 10: Swiss Big Data User Group - Introduction to Apache Drill

10  

Apache  Drill  Overview  

•  Inspired  by  Google’s  Dremel  •  Standard    SQL  2003  support  •  Other  QL  possible  •  Plug-­‐able  data  sources  •  Support  for  nested  data  •  Schema  is  opAonal  •  Community  driven,  open,  100’s  involved  

Page 11: Swiss Big Data User Group - Introduction to Apache Drill

11  

Apache  Drill  Overview  

Page 12: Swiss Big Data User Group - Introduction to Apache Drill

12  

High-­‐level  Architecture  

Page 13: Swiss Big Data User Group - Introduction to Apache Drill

13  

High-­‐level  Architecture  •  Each  node:  Drillbit  -­‐  maximize  data  locality  •  Co-­‐ordinaAon,  query  planning,  execuAon,  etc,  are  distributed  •  By  default  Drillbits  hold  all  roles  •  Any  node  can  act  as  endpoint  for  a  query  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Page 14: Swiss Big Data User Group - Introduction to Apache Drill

14  

High-­‐level  Architecture  •  Zookeeper  for  ephemeral  cluster  membership  info  •  Distributed  cache  (Hazelcast)  for  metadata,  locality  

informaAon,  etc.  

Zookeeper  

Distributed  Cache  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Distributed  Cache   Distributed  Cache   Distributed  Cache  

Page 15: Swiss Big Data User Group - Introduction to Apache Drill

15  

High-­‐level  Architecture  •  Origina)ng  Drillbit  acts  as  foreman,  manages  query  execuAon,  

scheduling,  locality  informaAon,  etc.  •  Streaming  data  communica)on  avoiding  SerDe  

Zookeeper  

Distributed  Cache  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Distributed  Cache   Distributed  Cache   Distributed  Cache  

Page 16: Swiss Big Data User Group - Introduction to Apache Drill

16  

Principled  Query  ExecuAon  

Source  Query   Parser  

Logical  Plan   OpAmizer  

Physical  Plan   ExecuAon  

SQL  2003    DrQL  MongoQL  DSL  

scanner  API  topology  query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

parser  API  

Page 17: Swiss Big Data User Group - Introduction to Apache Drill

17  

Drillbit  Modules  

DFS  Engine  

HBase  Engine  

RPC  Endpoint  

SQL  

HiveQL  

Pig  

Parser  

Distributed  Cache  

Logical  Plan  

Physical  Plan  

OpAmizer  

Storage  En

gine

 Interface  Scheduler  

Foreman  

Operators  Mongo  

Page 18: Swiss Big Data User Group - Introduction to Apache Drill

18  

Key  Features  

•  Full  SQL  2003  •  Nested  data  •  OpAonal  schema  •  Extensibility  points  

Page 19: Swiss Big Data User Group - Introduction to Apache Drill

19  

Full  SQL  –  ANSI  SQL  2003  •  SQL-­‐like  is  oken  not  enough  •  IntegraAon  with  exisAng  tools  –  Datameer,  Tableau,  Excel,  SAP  Crystal  Reports  –  Use  standard  ODBC/JDBC  driver  

Page 20: Swiss Big Data User Group - Introduction to Apache Drill

20  

Nested  Data  

•  Nested  data  becoming  prevalent  –  JSON/BSON,  XML,  ProtoBuf,  Avro  –  Some  data  sources  support  it  naAvely  (MongoDB,  etc.)  

•  FlaJening  nested  data  is  error-­‐prone  •  Extension  to  ANSI  SQL  2003  

Page 21: Swiss Big Data User Group - Introduction to Apache Drill

21  

OpAonal  Schema  •  Many  data  sources  don’t  have  rigid  schemas  –  Schema  changes  rapidly  –  Different  schema  per  record  (e.g.  HBase)  

•  Supports  queries  against  unknown  schema  •  User  can  define  schema  or  via  discovery  

Page 22: Swiss Big Data User Group - Introduction to Apache Drill

22  

Extensibility  Points  

•  Source  query  –  parser  API  •  Custom  operators,  UDF  –  logical  plan  •  OpAmizer  •  Data  sources  and  formats  –  scanner  API  

Source  Query   Parser  

Logical  Plan   OpAmizer  

Physical  Plan   ExecuAon  

Page 23: Swiss Big Data User Group - Introduction to Apache Drill

23  

…  and  Hadoop?  •  HDFS  can  be  a  data  source  

•  Complementary  use  cases  …  

•  …  use  Apache  Drill  –  Find  record  with  specified  condiAon  –  AggregaAon  under  dynamic  condiAons  

•  …  use  MapReduce  –  Data  mining  with  mulAple  iteraAons  –  ETL  

23  hJps://cloud.google.com/files/BigQueryTechnicalWP.pdf    

Page 24: Swiss Big Data User Group - Introduction to Apache Drill

24  

Example  

hJps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo    

{ "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [

{ "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" },

data  source:  donuts.json  

query:[ { op:"sequence", do:[

{ op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },

logical  plan:  simple_plan.json  

result:  out.json  

{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }

Page 25: Swiss Big Data User Group - Introduction to Apache Drill

25  

Status  

•  Heavy  development  by  mulAple  organizaAons  

•  Available  – Logical  plan  (ADSP)  – Reference  interpreter  – Basic  SQL  parser    – Basic  demo  – Basic  HBase  back-­‐end  

Page 26: Swiss Big Data User Group - Introduction to Apache Drill

26  

Status  

March/April    •  Larger  SQL  syntax  •  Physical  plan  •  In-­‐memory  compressed  data  interfaces  •  Distributed  execuAon  focused  on  large  cluster  high  performance  sort,  aggregaAon  and  join  

Page 27: Swiss Big Data User Group - Introduction to Apache Drill

27  

ContribuAng  •  Dremel-­‐inspired  columnar  format:  TwiJer’s  Parquet    and  

Hive’s  ORC  file  

•  IntegraAon  with  Hive  metastore  (?)  

•  DRILL-­‐13  Storage  Engine:  Define  Java  Interface  

•  DRILL-­‐15  Build  HBase  storage  engine  implementaAon  

Page 28: Swiss Big Data User Group - Introduction to Apache Drill

28  

ContribuAng  •  DRILL-­‐48  RPC  interface  for  query  submission  and  physical  plan  

execuAon  

•  DRILL-­‐53  Setup  cluster  configuraAon  and  membership  mgmt  system  –  ZK  for  coordinaAon  –  Helix  for  parAAon  and  resource  assignment  (?)  

•  Further  schedule  –  Alpha  Q2  –  Beta  Q3  

Page 29: Swiss Big Data User Group - Introduction to Apache Drill

29  

Kudos  to  …  

•  Julian  Hyde,  Pentaho    •  Timothy  Chen,  Microsok  •  Chris  Merrick,  RJMetrics    •  David  Alves,  UT  AusAn  •  Sree  Vaadi,  SSS/NGData  •  Jacques  Nadeau,  MapR  •  Ted  Dunning,  MapR  

Page 30: Swiss Big Data User Group - Introduction to Apache Drill

30  

Engage!  •  Follow  @ApacheDrill  on  TwiJer  

•  Sign  up  at  mailing  lists  (user  |  dev)    hJp://incubator.apache.org/drill/mailing-­‐lists.html    

•  Learn  where  and  how  to  contribute  hJps://cwiki.apache.org/confluence/display/DRILL/ContribuAng    

•  Keep  an  eye  on  hJp://drill-­‐user.org/