An introduction to Apache Drill given by MapR Chief Data Engineer, Michael Hausenblas at the 6th Swiss Big Data User Group Meeting. Zurich, 2013-03-25
1
Introduction to Apache Drill
Michael Hausenblas, Chief Data Engineer EMEA, MapR
6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25
2
Kudos to http://cmx.io/
3
Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)
4
Impala
Interactive Query at Scale
low-latency
5
Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
6
Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{ "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
{ "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
…
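The two questions on this slide need the supplier table and the shipment documents to be combined. The following is a minimal Python sketch of that combination, not Drill itself: the function names are made up for illustration, and only the two shipment records visible on the slide are used.

```python
import json
from collections import Counter

# Supplier reference table from the slide (SUPPLIER_ID -> name, region).
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment documents shown on the slide, wrapped in a JSON array.
SHIPMENTS_JSON = """
[
  {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
   "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
   "description": "hope you enjoy it"}
]
"""


def shipments_per_supplier(shipments):
    """Count shipments per supplier id (hypothetical helper)."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Join each shipment to its supplier row, then count per region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


shipments = json.loads(SHIPMENTS_JSON)
by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, SUPPLIERS)

print("Shipments from ACM:", by_supplier["ACM"])
print("Shipments in Europe:", by_region["Europe"])
```

The region query is the interesting one: it needs a join between a relational table and semi-structured documents, which is the situation the following slides address.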
7
Today's Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.
8
Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
9
Google’s Dremel
http://research.google.com/pubs/pub36632.html
10
Apache Drill Overview
• Inspired by Google's Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved
11
Apache Drill Overview
12
High-level Architecture
13
High-level Architecture
• Each node runs a Drillbit to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default, Drillbits hold all roles
• Any node can act as the endpoint for a query
[Diagram: four nodes, each running a Drillbit next to a storage process]
14
High-level Architecture
• ZooKeeper for ephemeral cluster membership information
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: ZooKeeper plus four nodes, each running a Drillbit with a distributed cache next to a storage process]
15
High-level Architecture
• Originating Drillbit acts as foreman and manages query execution, scheduling, locality information, etc.
• Streaming data communication, avoiding SerDe
[Diagram: ZooKeeper plus four nodes, each running a Drillbit with a distributed cache next to a storage process]
16
Principled Query Execution
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
• Source query languages: SQL 2003, DrQL, MongoQL, DSL
• Diagram labels: parser API, scanner API, topology
• Logical plan fragment:
  query: [ { @id: "log", op: "sequence", do: [
    { op: "scan", source: "logs" },
    { op: "filter", condition: "x > 3" },
    …
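To make the logical plan fragment concrete, here is a minimal Python sketch of an interpreter for a `sequence` of `scan` and `filter` operators. It is an illustration under assumptions, not Drill's reference interpreter: the in-memory `SOURCES` table, the operator function names, and the tiny `field symbol literal` condition grammar are all invented for this example.

```python
import operator

# Hypothetical in-memory stand-in for the "logs" source named in the plan.
SOURCES = {"logs": [{"x": 1}, {"x": 3}, {"x": 5}, {"x": 8}]}

# Comparison symbols supported by this sketch's condition grammar.
COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def scan(_rows, op):
    """Ignore upstream rows and read every record from the named source."""
    return iter(SOURCES[op["source"]])


def filter_rows(rows, op):
    """Keep rows matching a condition of the form '<field> <symbol> <number>'."""
    field, symbol, literal = op["condition"].split()
    compare = COMPARATORS[symbol]
    threshold = float(literal)
    return (row for row in rows if compare(row[field], threshold))


# Operator registry: plan "op" name -> implementation.
OPERATORS = {"scan": scan, "filter": filter_rows}


def execute(plan):
    """Run the operators of a 'sequence' plan, each feeding the next."""
    rows = iter(())
    for op in plan["do"]:
        rows = OPERATORS[op["op"]](rows, op)
    return list(rows)


# The two operators visible on the slide, written as a Python dict.
plan = {
    "@id": "log",
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

result = execute(plan)
print(result)
```

Because operators are looked up in a registry and chained as iterators, adding a new operator only means registering one more function, which is the kind of extension point the later slides describe.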
17
Drillbit Modules
[Diagram: Drillbit modules]
• RPC Endpoint
• Parsers: SQL, HiveQL, Pig
• Logical Plan, Optimizer, Physical Plan
• Foreman, Scheduler, Operators
• Distributed Cache
• Storage Engine Interface: DFS Engine, HBase Engine, Mongo
18
Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
19
Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver
20
Nested Data
• Nested data is becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
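One way to see why flattening is error-prone is to flatten the donut record used later in the deck. The sketch below is plain Python with a hypothetical `flatten` helper; it shows that every parent field gets duplicated once per nested element, so a naive sum over the flat rows double-counts.

```python
def flatten(record, list_path):
    """Yield one flat row per element of the nested list found at list_path.

    Parent fields are copied into every row, which is exactly what makes
    aggregates over the flattened rows easy to get wrong.
    """
    node = record
    for key in list_path:
        node = node[key]
    parent = {k: v for k, v in record.items() if k != list_path[0]}
    prefix = ".".join(list_path)
    for item in node:
        row = dict(parent)
        for key, value in item.items():
            row[f"{prefix}.{key}"] = value
        yield row


# The donut record as far as it is visible on the example slide.
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}

rows = list(flatten(donut, ["batters", "batter"]))
for row in rows:
    print(row)

# One donut became two rows, so summing ppu over rows counts it twice.
naive_ppu_total = sum(row["ppu"] for row in rows)
print("naive ppu total:", naive_ppu_total)
```

Querying the nested structure directly avoids this duplication, which is the motivation for extending SQL rather than flattening up front.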
21
Optional Schema
• Many data sources don't have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
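Schema discovery can be pictured as scanning records and collecting which fields and value types actually occur. The following Python sketch is a simplified illustration, not how Drill implements discovery; the `discover_schema` helper and the sample records (including the `priority` and `region` fields) are made up for this example.

```python
from collections import defaultdict


def discover_schema(records):
    """Map each field name to the sorted list of value type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical records whose shape differs from one record to the next.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "region": "Asia"},
]

schema = discover_schema(records)
for field, types in schema.items():
    print(field, types)
```

A field such as `shipment` that shows up with more than one type is the case a rigid, declared-up-front schema cannot express without an ETL step.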
22
Extensibility Points
• Source query – parser API
• Custom operators, UDF – logical plan
• Optimizer
• Data sources and formats – scanner API
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
23
… and Hadoop?
• HDFS can be a data source
• Complementary use cases …
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
24
Example
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

data source: donuts.json
{ "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [
  { "id": "1001", "type": "Regular" },
  { "id": "1002", "type": "Chocolate" },
  …

logical plan: simple_plan.json
query: [ { op: "sequence", do: [
  { op: "scan", ref: "donuts", source: "local-logs", selection: { data: "activity" } },
  { op: "filter", expr: "donuts.ppu < 2.00" },
  …

result: out.json
{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
{ "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
{ "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
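The slide elides the rest of the plan, so the aggregation steps are not known from the deck. What can be checked from the visible parts is that every result row passes the plan's filter (`ppu < 2.00`) and that, in each row, `sales` equals `quantity × ppu`. The Python sketch below performs only that check; reading `sales` as revenue per price point is an assumption, not something the slide states.

```python
import json
import math

# The three result rows exactly as shown for out.json, one JSON object per line.
RESULT_LINES = """\
{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
{ "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
{ "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
"""

rows = [json.loads(line) for line in RESULT_LINES.splitlines()]

# Every row should satisfy the filter expression visible in the plan.
passes_filter = [row["ppu"] < 2.00 for row in rows]

# Assumed relationship (not stated on the slide): sales = quantity * ppu.
# math.isclose absorbs floating-point rounding in the multiplication.
sales_consistent = [
    math.isclose(row["quantity"] * row["ppu"], row["sales"], rel_tol=1e-9)
    for row in rows
]

for row, ok_filter, ok_sales in zip(rows, passes_filter, sales_consistent):
    print(row["ppu"], ok_filter, ok_sales)
```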
25
Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end
26
Status
March/April
• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution, focused on high-performance sort, aggregation and join on large clusters
27
Contributing
• Dremel-inspired columnar formats: Twitter's Parquet and Hive's ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation
28
Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Set up cluster configuration and membership management system
  – ZK for coordination
  – Helix for partition and resource assignment (?)
• Further schedule
  – Alpha Q2
  – Beta Q3
29
Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
30
Engage!
• Follow @ApacheDrill on Twitter
• Sign up for the mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
• Learn where and how to contribute: https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Keep an eye on http://drill-user.org/