An Apache Drill status update given by Michael Hausenblas, MapR's Chief Data Engineer EMEA (2013-04-19)
Apache Drill status
Michael Hausenblas, Chief Data Engineer EMEA, MapR
HUG Munich, 2013-04-19
Kudos to http://cmx.io/
Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)
Impala
Interactive Query at Scale
low-latency
Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?
SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia
{ "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
{ "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
…
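The two queries from Use Case II can be sketched in a few lines of Python, as a way to see why they need data from two differently shaped sources. This is only an illustration using the rows and documents shown on the slide; the function names are made up here and it is not how Drill executes such a query.

```python
import json
from collections import Counter

# Supplier table from the slide (SUPPLIER_ID, NAME, REGION).
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment documents shown on the slide, as JSON text.
SHIPMENTS_JSON = """
[
  {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
   "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
   "description": "hope you enjoy it"}
]
"""


def shipments_per_supplier(shipments):
    """Count shipments per supplier ID (hypothetical helper)."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Join each shipment to its supplier row and count per region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


shipments = json.loads(SHIPMENTS_JSON)
by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, SUPPLIERS)
print(by_supplier["ACM"], by_region["Europe"])
```

The region query is the interesting one: it only works after joining the JSON documents with the relational supplier table.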
Today’s Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.
Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
Google’s Dremel*
*) http://research.google.com/pubs/pub36632.html
Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved
High-level Architecture
• Each node runs a Drillbit to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default, Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each running a Drillbit next to a storage process]
High-level Architecture
• ZooKeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/Zk and four nodes, each with a storage process, a Drillbit, and a distributed cache]
High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
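To make the foreman's use of locality information more concrete, here is a minimal sketch of locality-aware assignment. Everything in it is an assumption for illustration: the block names, node names, and the `assign_fragments` function are hypothetical, and the least-loaded heuristic is not taken from Drill's actual scheduler.

```python
from collections import defaultdict

# Hypothetical block placement: which nodes hold a replica of each block.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-d"],
    "block-4": ["node-a", "node-d"],
}


def assign_fragments(block_locations, drillbits):
    """Assign each block to the least-loaded live Drillbit holding a replica.

    Falls back to any live Drillbit (a remote read) when no replica is local.
    """
    load = defaultdict(int)
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        local = [node for node in replicas if node in drillbits]
        candidates = local or sorted(drillbits)
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


drillbits = {"node-a", "node-b", "node-c", "node-d"}
assignment = assign_fragments(BLOCK_LOCATIONS, drillbits)
for block, node in sorted(assignment.items()):
    print(block, "->", node)
```

In this toy setup, every block is read by a Drillbit on a node that already stores it, which is the effect the "maximize data locality" bullet describes.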
Principled Query Execution
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
Source query languages: SQL 2003, DrQL, MongoQL, DSL
Diagram labels: parser API, scanner API, topology
Logical plan (excerpt):
query: [ { @id: "log", op: "sequence", do: [
  { op: "scan", source: "logs" },
  { op: "filter", condition: "x > 3" },
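The logical plan excerpt is a sequence of operators, each consuming the output of the previous one. A toy interpreter for just the two operators shown (`scan` and `filter`) might look like the sketch below. The sample data, the `SOURCES` registry, and the tiny condition parser are assumptions made for this example; Drill's reference interpreter is not implemented this way.

```python
# Hypothetical in-memory data sources, keyed by the "source" name in the plan.
SOURCES = {
    "logs": [{"x": 1}, {"x": 4}, {"x": 7}, {"x": 3}],
}

# The two operators visible in the plan excerpt, as Python dicts.
PLAN = [
    {"op": "scan", "source": "logs"},
    {"op": "filter", "condition": "x > 3"},
]


def parse_condition(condition):
    """Parse a '<field> <operator> <number>' condition into a predicate."""
    field, operator, literal = condition.split()
    compare = {
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
        "==": lambda a, b: a == b,
    }[operator]
    value = float(literal)
    return lambda record: compare(record[field], value)


def run_sequence(plan, sources):
    """Run the operators in order, streaming records between them."""
    records = iter(())
    for step in plan:
        if step["op"] == "scan":
            records = iter(sources[step["source"]])
        elif step["op"] == "filter":
            records = filter(parse_condition(step["condition"]), records)
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return list(records)


result = run_sequence(PLAN, SOURCES)
print(result)
```

Because the plan is plain data, any front end that can emit this structure could reuse the same execution path, which is the point of separating the source query from the logical plan.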
Drillbit Modules
[Diagram components: RPC Endpoint; SQL, HiveQL, Pig; Parser; Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Distributed Cache; Storage Engine Interface; DFS Engine, HBase Engine, Mongo]
Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver
Nested Data
• Nested data becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
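One way to see why flattening is error-prone: turning a nested record into flat rows repeats the parent fields once per nested element, so naive aggregates over the flat rows can double count. The sketch below uses only the fields visible in the donut example later in this deck; the `flatten_batters` helper is hypothetical.

```python
# A record shaped like the donut example later in this deck
# (only the fields shown there are used).
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {"batter": [
        {"id": "1001", "type": "Regular"},
        {"id": "1002", "type": "Chocolate"},
    ]},
}


def flatten_batters(record):
    """Emit one flat row per nested batter, repeating the parent fields."""
    for batter in record["batters"]["batter"]:
        yield {
            "id": record["id"],
            "type": record["type"],
            "ppu": record["ppu"],
            "batter_id": batter["id"],
            "batter_type": batter["type"],
        }


rows = list(flatten_batters(donut))

# Pitfall: the parent's ppu now appears once per batter, so summing it
# over the flat rows counts the same donut twice.
naive_total = sum(row["ppu"] for row in rows)
nested_total = donut["ppu"]
print(len(rows), naive_total, nested_total)
```

Querying the nested structure directly avoids this kind of accidental duplication.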
Optional Schema
• Many data sources don’t have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or discovered
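Schema discovery can be pictured as scanning records and noting which fields and value types actually occur. The following is a rough sketch under that assumption; the sample records and the `discover_schema` function are invented for illustration and do not reflect Drill's discovery mechanism.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records whose fields and types differ from record to record.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "region": "Asia"},
]

schema = discover_schema(records)

# Fields that are missing from at least one record.
optional = {field for field in schema if any(field not in r for r in records)}

for field in sorted(schema):
    print(field, sorted(schema[field]), "optional" if field in optional else "")
```

A field that shows more than one type, or that is missing from some records, is exactly the case a rigid up-front schema would reject.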
Extensibility Points
• Source query → parser API
• Custom operators, UDF → logical plan
• Serving tree, CF, topology → physical plan/optimizer
• Data sources & formats → scanner API
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
… and Hadoop?
• HDFS can be a data source
• Complementary use cases*
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Example
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
{ "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [
{ "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" },
…
data source: donuts.json
query:[ { op:"sequence", do:[
{ op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },
…
logical plan: simple_plan.json
result: out.json
{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
{ "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
{ "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end
Status
April 2013
• Extend SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution
Contributing
• Learn where and how to contribute: https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Jira, Git, Apache build and test tools
• Preparing for dependencies
  – Hazelcast
  – Netflix Curator
Contributing
General contributions appreciated:
• Supersonic (?)
• Test data & test queries
• Use case scenarios (textual desc./SQL queries)
• Documentation
Contributing
• Dremel-inspired columnar format
  – Twitter’s Parquet
  – Hive’s ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation
Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
• Further schedule
  – Alpha Q2
  – Beta Q3
Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
Engage!
• Follow @ApacheDrill on Twitter
• Sign up at mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 18:00 CET
• Keep an eye on http://drill-user.org/