Gera Shegalov @PJUG, Jan 15, 2013

Apache Drill @ PJUG, Jan 15, 2013


DESCRIPTION

Apache Drill is a new Apache Incubator project. Its goal is to provide a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel technology, it aims to process trillions of records in seconds. We will cover the goals of Apache Drill, its use cases, and how it relates to Hadoop, MongoDB and other large-scale distributed systems. We'll also talk about details of the architecture, points of extensibility, data flow and our first query languages (DrQL and SQL).



/home/gera: whoami

■ Saarland University
■ 1st intern in Immortal DB @ Microsoft Research
■ JMS, RDBMS HA @ Oracle
■ Hadoop MapReduce / Hadoop Core
■ Founding member of Apache Drill


■ Open enterprise-grade distribution for Hadoop
  ● Easy, dependable and fast
  ● Open source with standards-based extensions
■ MapR is deployed at 1000’s of companies
  ● From small Internet startups to Fortune 100
■ MapR customers analyze massive amounts of data:
  ● Hundreds of billions of events daily
  ● 90% of the world’s Internet population monthly
  ● $1 trillion in retail purchases annually
■ MapR in the Cloud:
  ● partnered with Google: Hadoop on Google Compute Engine
  ● partnered with Amazon: M3/M5 options for Elastic Map Reduce


Agenda

■ What?
  ● What exactly does Drill do?
■ Why?
  ● Why do we need Apache Drill?
■ Who?
  ● Who is doing this?
■ How?
  ● How does Drill work inside?
■ Conclusion
  ● How can you help?
  ● Where can you find out more?


Apache Drill Overview

■ Drill overview
  ● Low latency interactive queries
  ● Standard ANSI SQL support
  ● Domain Specific Languages / Your own QL
■ Open-Source
  ● Apache Incubator
  ● 100’s involved across US and Europe
  ● Community consensus on API, functionality


Big Data Processing

                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
Data volume           TBs to PBs          GBs to PBs                Continuous stream
Programming model     MapReduce           Queries                   DAG
Users                 Developers          Analysts and developers   Developers
Google project        MapReduce           Dremel
Open source project   Hadoop MapReduce    Apache Drill              Storm and S4


Latency Matters

■ Ad-hoc analysis with interactive tools

■ Real-time dashboards

■ Event/trend detection and analysis
  ● Network intrusions
  ● Fraud
  ● Failures


Nested Query Languages

■ DrQL
  ● SQL-like query language for nested data
  ● Compatible with Google BigQuery/Dremel
  ● BigQuery applications should work with Drill
  ● Designed to support efficient column-based processing
  ● No record assembly during query processing
■ Mongo Query Language
  ● {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

■ Other languages/programming models can plug in
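
To make the Mongo-style query on this slide concrete, here is a minimal Python sketch of what `{$query: {x: 3, y: "abc"}, $orderby: {x: 1}}` asks for: keep the documents whose fields equal the given values, then sort ascending by `x`. The function name `run_mongo_style_query` and the sample documents are assumptions made for illustration. This is not how Drill or MongoDB implement it, and only equality matches and simple sort keys are handled.

```python
from typing import Any, Dict, List


def run_mongo_style_query(docs: List[Dict[str, Any]], query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Evaluate a tiny subset of the Mongo query form: equality $query plus $orderby."""
    criteria = query.get("$query", {})
    order = query.get("$orderby", {})

    # Keep only documents whose fields equal every requested value.
    matched = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

    # Apply sort keys from last to first so the first key has the highest priority
    # (Python's sort is stable). 1 means ascending, -1 means descending.
    for field, direction in reversed(list(order.items())):
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched


# Hypothetical sample documents.
docs = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 1, "y": "abc", "z": 9},
    {"x": 3, "y": "abc", "z": 1},
    {"x": 3, "y": "xyz", "z": 5},
]
result = run_mongo_style_query(docs, {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}})
print(result)
```

A pluggable query language front end would translate such a document into the same operators (filter, sort) that a DrQL or SQL query produces.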


Nested Data Model

■ The data model in Dremel is Protocol Buffers
  ● Nested
  ● Schema
■ Apache Drill is designed to support multiple data models
  ● Schema: Protocol Buffers, Apache Avro, …
  ● Schema-less: JSON, BSON, …
■ Flat records are supported as a special case of nested data
  ● CSV, TSV, …

JSON:

{ "name": "Srivas", "gender": "Male", "followers": 100 }
{ "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }

Avro IDL:

enum Gender { MALE, FEMALE }

record User {
  string name;
  Gender gender;
  long followers;
}
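
The claim that flat records are a special case of nested data can be sketched in Python. The helper `leaf_paths` below is a hypothetical name, not a Drill API: it lists the dotted column paths of a record. A schema-less JSON record and a CSV row go through the same function, and the CSV row simply never nests below the top level.

```python
import csv
import io
import json
from typing import Any, Iterator, Tuple


def leaf_paths(value: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted path, leaf value) pairs for a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # Repeated fields: every element contributes values to the same path.
        for child in value:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, value


# Schema-less JSON: the second record carries an extra "zip" field.
json_lines = '''{"name": "Srivas", "gender": "Male", "followers": 100}
{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'''
json_records = [json.loads(line) for line in json_lines.splitlines()]

# A CSV row is the same kind of record, just with no nesting below the top level.
csv_text = "name,gender,followers\nSrivas,Male,100\n"
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

for record in json_records + csv_records:
    print(sorted(path for path, _ in leaf_paths(record)))
```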


Extensibility

■ Nested query languages
  ● Pluggable model
  ● DrQL
  ● Mongo Query Language
  ● Cascading
■ Distributed execution engine
  ● Extensible model (e.g., Dryad)
  ● Low-latency
  ● Fault tolerant
■ Nested data formats
  ● Pluggable model
  ● Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
  ● Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
■ Scalable data sources
  ● Pluggable model
  ● Hadoop
  ● HBase
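
One way to picture a "pluggable model" for data formats is a registry that maps a format name to a reader. The Python sketch below is only an illustration of that idea. `FORMAT_READERS`, `register_format` and `scan` are hypothetical names, not part of Drill's API.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator

Record = Dict[str, object]
Reader = Callable[[str], Iterator[Record]]

# Hypothetical registry: format name -> function that turns raw text into records.
FORMAT_READERS: Dict[str, Reader] = {}


def register_format(name: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a reader into the registry under the given name."""
    def decorator(reader: Reader) -> Reader:
        FORMAT_READERS[name] = reader
        return reader
    return decorator


@register_format("json")
def read_json_lines(text: str) -> Iterator[Record]:
    # Schema-less: each non-empty line is one self-describing JSON record.
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


@register_format("csv")
def read_csv(text: str) -> Iterator[Record]:
    # Flat records: the header row supplies the field names.
    yield from csv.DictReader(io.StringIO(text))


def scan(format_name: str, text: str) -> Iterator[Record]:
    """Look up the reader by name, so the caller never depends on a concrete format."""
    return FORMAT_READERS[format_name](text)


print(list(scan("json", '{"name": "Srivas", "followers": 100}')))
print(list(scan("csv", "name,followers\nRaina,200\n")))
```

Adding a new format then means registering one more reader; nothing that calls `scan` has to change.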


Design Principles

Flexible
● Pluggable query languages
● Extensible execution engine
● Pluggable data formats
  ● Column-based and row-based
  ● Schema and schema-less
● Pluggable data sources
  ● N(ot)O(nly) Hadoop

Easy
● Unzip and run
● Zero configuration
● Reverse DNS not needed
● IP addresses can change
● Clear and concise log messages

Dependable
● No SPOF
● Instant recovery from crashes

Fast
● Minimum Java core
● C/C++ core with Java support
  ● Google C++ style guide
● Min latency and max throughput (limited only by hardware)


Architecture


Execution Engine

■ Operator layer is serialization-aware
  ● Processes individual records
■ Execution layer is not serialization-aware
  ● Processes batches of records (blobs/JSON trees)
  ● Responsible for communication, dependencies and fault tolerance
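
The split between the two layers can be sketched as follows. This is an assumption-laden illustration, not Drill's code. JSON stands in for the wire format, and `encode_batch`, `filter_operator` and `execution_layer` are hypothetical names. The execution layer only moves opaque byte blobs, while the operator decodes a batch and looks at each record.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def encode_batch(records: List[Record]) -> bytes:
    """Serialize a batch; JSON is used here only as a stand-in wire format."""
    return json.dumps(records).encode("utf-8")


def filter_operator(batch: bytes, predicate: Callable[[Record], bool]) -> bytes:
    """Operator layer: serialization-aware, so it decodes the batch and inspects each record."""
    records = json.loads(batch.decode("utf-8"))
    return encode_batch([r for r in records if predicate(r)])


def execution_layer(batches: Iterable[bytes], operator: Callable[[bytes], bytes]) -> Iterator[bytes]:
    """Execution layer: not serialization-aware, it only routes opaque blobs to operators."""
    for blob in batches:
        yield operator(blob)  # the blob is never decoded at this level


batches = [
    encode_batch([{"ppu": 0.55}, {"ppu": 1.25}]),
    encode_batch([{"ppu": 0.69}]),
]
out = list(execution_layer(batches, lambda b: filter_operator(b, lambda r: r["ppu"] < 1.00)))
print([json.loads(b) for b in out])
```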


DrQL Example

local-logs = donuts.json:

{
  "id": "0003",
  "type": "donut",
  "name": "Old Fashioned",
  "ppu": 0.55,
  "sales": 300,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" }
    ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}

SELECT ppu,
       typeCount = COUNT(*) OVER PARTITION BY ppu,
       quantity = SUM(sales) OVER PARTITION BY ppu,
       sales = SUM(ppu*sales) OVER PARTITION BY ppu
FROM local-logs donuts
WHERE donuts.ppu < 1.00
ORDER BY donuts.ppu DESC;
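
What the query computes can be sketched in plain Python, assuming that `OVER PARTITION BY ppu` behaves like a SQL window function: every surviving row is annotated with the aggregates of its own `ppu` partition. Only the "Old Fashioned" row comes from the slide. The other rows and the function name `run_query` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Only the first row comes from the slide; the others are made-up sample data.
donuts = [
    {"name": "Old Fashioned", "ppu": 0.55, "sales": 300},
    {"name": "Cake", "ppu": 0.55, "sales": 35},
    {"name": "Raised", "ppu": 0.69, "sales": 145},
    {"name": "Filled", "ppu": 1.00, "sales": 14},
]


def run_query(rows: List[Dict]) -> List[Dict]:
    # WHERE donuts.ppu < 1.00
    kept = [r for r in rows if r["ppu"] < 1.00]

    # PARTITION BY ppu: collect the rows of each partition.
    partitions = defaultdict(list)
    for r in kept:
        partitions[r["ppu"]].append(r)

    # Each output row carries the aggregates of its own partition.
    out = []
    for r in kept:
        part = partitions[r["ppu"]]
        out.append({
            "ppu": r["ppu"],
            "typeCount": len(part),                              # COUNT(*)
            "quantity": sum(p["sales"] for p in part),           # SUM(sales)
            "sales": sum(p["ppu"] * p["sales"] for p in part),   # SUM(ppu*sales)
        })

    # ORDER BY donuts.ppu DESC
    return sorted(out, key=lambda o: o["ppu"], reverse=True)


for row in run_query(donuts):
    print(row)
```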


Query Components

■ User Query (DrQL) components:
  ● SELECT
  ● FROM
  ● WHERE
  ● GROUP BY
  ● HAVING
  ● (JOIN)
■ Logical operators:
  ● Scan
  ● Filter
  ● Aggregate
  ● (Join)


Logical Plan


Logical Plan Syntax: Operators & Expressions

query: [
  { op: "sequence",
    do: [
      { op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"} },
      { op: "transform",
        transforms: [
          { ref: "donuts.quantity", expr: "donuts.sales" }
        ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      ---
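
A toy interpreter shows how a plan of this shape could be executed as a chain of operators. It is a minimal Python sketch under stated assumptions, not the reference interpreter: `SOURCES` is a hypothetical in-memory stand-in for `local-logs`, the scan's `memo` and `selection` fields are ignored, and expressions are limited to a bare reference or `<reference> <op> <number>`.

```python
import operator
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]

# Hypothetical in-memory stand-in for the "local-logs" source.
SOURCES = {
    "local-logs": [
        {"name": "Old Fashioned", "ppu": 0.55, "sales": 300},
        {"name": "Filled", "ppu": 1.25, "sales": 14},
    ]
}

COMPARATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def get_path(record: Record, path: str) -> Any:
    """Resolve a dotted reference such as 'donuts.ppu' inside a record."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def set_path(record: Record, path: str, value: Any) -> None:
    """Store a value under a dotted reference, creating intermediate maps."""
    *parents, leaf = path.split(".")
    target = record
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value


def evaluate(expr: str, record: Record) -> Any:
    """Evaluate either a bare reference or '<reference> <op> <number>'."""
    tokens = expr.split()
    if len(tokens) == 1:
        return get_path(record, tokens[0])
    left, op, right = tokens
    return COMPARATORS[op](get_path(record, left), float(right))


def apply_op(op: Dict[str, Any], stream: Iterable[Record]) -> List[Record]:
    if op["op"] == "scan":
        # Each source row is wrapped under the name given by "ref".
        return [{op["ref"]: dict(row)} for row in SOURCES[op["source"]]]
    if op["op"] == "transform":
        out = []
        for record in stream:
            for t in op["transforms"]:
                set_path(record, t["ref"], evaluate(t["expr"], record))
            out.append(record)
        return out
    if op["op"] == "filter":
        return [r for r in stream if evaluate(op["expr"], r)]
    raise ValueError(f"unsupported operator: {op['op']}")


def run_sequence(ops: List[Dict[str, Any]]) -> Iterator[Record]:
    """Chain the operators of a 'sequence' so each one consumes the previous output."""
    stream: Iterable[Record] = []
    for op in ops:
        stream = apply_op(op, stream)
    return iter(stream)


plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform", "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
    {"op": "filter", "expr": "donuts.ppu < 1.00"},
]
print(list(run_sequence(plan)))
```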


Logical Streaming Example

{ @id: <refnum>,
  op: "window-frame",
  input: <input>,
  keys: [ <name>, ... ],
  ref: <name>,
  before: 2,
  after: here
}

Input records:

0 1 2 3 4

Resulting window frames:

0
0 1
0 1 2
1 2 3
2 3 4
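
The framing rule of the `window-frame` operator can be sketched in a few lines of Python. The assumption here is that `before: 2, after: here` means "up to two preceding records plus the current one". The `keys` (partitioning) part of the operator is left out, and `window_frames` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def window_frames(records: Sequence[T], before: int, after: int = 0) -> List[List[T]]:
    """For each record, return the frame made of up to `before` earlier records,
    the record itself, and up to `after` later records."""
    frames = []
    for i in range(len(records)):
        start = max(0, i - before)
        end = min(len(records), i + after + 1)
        frames.append(list(records[start:end]))
    return frames


for frame in window_frames([0, 1, 2, 3, 4], before=2):
    print(frame)
```

Each input record yields exactly one frame, so a downstream aggregate can compute running values such as a moving sum per frame.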


Representing a DAG

{ @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [ <name>, ... ],
  aggregations: [
    { ref: <name>, expr: <aggexpr> }, ...
  ]
}
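
Because every operator has an `@id` and names its `input` by id, a plan is a graph rather than a list. The Python sketch below evaluates such a graph by following `input` edges and caching results. It rests on assumptions: the `values` operator is a made-up source node, Python callables stand in for `<aggexpr>` strings, and only the `simple` aggregate type is handled.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]

# Hypothetical plan: node 18 produces rows, node 19 aggregates them.
PLAN: Dict[int, Dict[str, Any]] = {
    18: {"op": "values", "rows": [
        {"ppu": 0.55, "sales": 300},
        {"ppu": 0.55, "sales": 35},
        {"ppu": 0.69, "sales": 145},
    ]},
    19: {"op": "aggregate", "input": 18, "type": "simple", "keys": ["ppu"],
         "aggregations": [{"ref": "quantity", "expr": lambda rows: sum(r["sales"] for r in rows)}]},
}


def evaluate(node_id: int, plan: Dict[int, Dict[str, Any]], cache: Dict[int, List[Record]]) -> List[Record]:
    """Evaluate a node after its input; the cache lets shared inputs run only once."""
    if node_id in cache:
        return cache[node_id]
    node = plan[node_id]
    if node["op"] == "values":
        result = list(node["rows"])
    elif node["op"] == "aggregate":
        rows = evaluate(node["input"], plan, cache)  # follow the edge to the upstream node
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in node["keys"])].append(row)
        result = []
        for key, members in groups.items():
            out = dict(zip(node["keys"], key))
            for agg in node["aggregations"]:
                out[agg["ref"]] = agg["expr"](members)
            result.append(out)
    else:
        raise ValueError(f"unsupported operator: {node['op']}")
    cache[node_id] = result
    return result


print(evaluate(19, PLAN, {}))
```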


Multiple Inputs

{ @id: 25,
  op: "cogroup",
  groupings: [
    { ref: 23, expr: "id" },
    { ref: 24, expr: "id" }
  ]
}
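
A cogroup takes two inputs and, for each key value, collects the matching rows from both sides. The Python sketch below illustrates that general idea (as known from Pig-style cogroup). The sample rows standing in for the outputs of operators 23 and 24 are hypothetical, and this is not Drill's implementation.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Tuple

Record = Dict[str, Any]
Groups = Dict[Hashable, Tuple[List[Record], List[Record]]]


def cogroup(left: List[Record], right: List[Record], left_expr: str, right_expr: str) -> Groups:
    """Group two inputs by key; each key maps to (matching left rows, matching right rows)."""
    groups: Groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_expr]][0].append(row)
    for row in right:
        groups[row[right_expr]][1].append(row)
    return dict(groups)


# Hypothetical outputs of operators 23 and 24.
donuts = [{"id": "0003", "name": "Old Fashioned"}, {"id": "0004", "name": "Filled"}]
orders = [{"id": "0003", "qty": 12}, {"id": "0003", "qty": 6}, {"id": "0009", "qty": 1}]

for key, (from_23, from_24) in cogroup(donuts, orders, "id", "id").items():
    print(key, from_23, from_24)
```

Keys present on only one side keep an empty list for the other side, which is what lets joins and set operations be built on top of a cogroup.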


Physical Scan Operators

                         Scan with schema                          Scan without schema
Operator output          Protocol Buffers                          JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel)   JSON
                         RecordIO (row-based protobuf)             HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI)             Json(data URI)
                         RecordIO(proto URI, data URI)             HBase(table name)


Hadoop Integration

■ Hadoop data sources
  ● Hadoop FileSystem API (HDFS/MapR-FS)
  ● HBase
■ Hadoop data formats
  ● Apache Avro
  ● RCFile

■ MapReduce-based tools to create column-based formats

■ Table registry in HCatalog

■ Run long-running services in YARN


Where is Drill now?

■ API Definition

■ Reference Implementation for Logical Plan Interpreter
  ● 1:1 mapping logical/physical op
  ● Single JVM

■ Demo


Contribute!

■ Participate in Design discussions: JIRA, ML, Wiki, Google Doc!

■ Write a parser for your favorite QL / Domain-Specific Language

■ Write Storage Engine API implementations
  ● HDFS, HBase, relational, XML DB
■ Write Physical Operators
  ● scan-hbase, scan-cassandra, scan-mongo
  ● scan-jdbc, scan-odbc, scan-jms (browse topic/queue), scan-*
  ● combined functionality operators: group-aggregate, ...
  ● sort-merge-join, hash-join, index-lookup-join

■ Etc...


Thanks, Q&A

■ Download these slides
  ● http://www.mapr.com/company/events/pjug-1-15-2013
■ Join the project
  ● [email protected]
  ● #apachedrill
■ Contact me:
  ● [email protected]
■ Join MapR
  ● [email protected]