Apache Drill is a new open source Apache Incubator project for interactive analysis of large-scale datasets, inspired by Google's Dremel. It enables users to query terabytes of data in seconds. Apache Drill supports a broad range of data formats, including Protocol Buffers, Avro and JSON, and leverages Hadoop and HBase as data sources. Drill's primary query language, DrQL, is compatible with Google BigQuery. In this talk we provide an overview of the Drill project, including its design goals and architecture. Presenter: Jason Frantz, Software Architect, MapR Technologies
1. Apache Drill: Interactive Analysis of Large-Scale Datasets
   Jason Frantz, Architect, MapR
2. My Background
   - Caltech
   - Clustrix
   - MapR
   - Founding member of Apache Drill
3. MapR Technologies
   - The open enterprise-grade distribution for Hadoop
     - Easy, dependable and fast
     - Open source with standards-based extensions
   - MapR is deployed at 1000s of companies
     - From small Internet startups to the world's largest enterprises
   - MapR customers analyze massive amounts of data:
     - Hundreds of billions of events daily
     - 90% of the world's Internet population monthly
     - $1 trillion in retail purchases annually
   - MapR has partnered with Google to provide Hadoop on Google Compute Engine
4. Latency Matters
   - Ad-hoc analysis with interactive tools
   - Real-time dashboards
   - Event/trend detection and analysis
     - Network intrusions
     - Fraud
     - Failures
5. Big Data Processing

                         Batch processing    Interactive analysis      Stream processing
   Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
   Data volume           TBs to PBs          GBs to PBs                Continuous stream
   Programming model     MapReduce           Queries                   DAG
   Users                 Developers          Analysts and developers   Developers
   Google project        MapReduce           Dremel                    -
   Open source project   Hadoop MapReduce    -                         Storm and S4

   Introducing Apache Drill
6. GOOGLE DREMEL
7. Google Dremel
   - Interactive analysis of large-scale datasets
     - Trillion records at interactive speeds
     - Complementary to MapReduce
     - Used by thousands of Google employees
     - Paper published at VLDB 2010
       - Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
   - Model
     - Nested data model with schema
       - Most data at Google is stored/transferred in Protocol Buffers
       - Normalization (to relational) is prohibitive
     - SQL-like query language with nested data support
   - Implementation
     - Column-based storage and processing
     - In-situ data access (GFS and Bigtable)
     - Tree architecture as in Web search (and databases)
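To illustrate why column-based storage helps interactive queries, the Python sketch below pivots row-oriented records into one list per field, so an aggregate touches only the column it needs. The field names and the `to_columns` helper are hypothetical, and the sketch handles flat records only; Dremel's encoding of nested and repeated fields (repetition and definition levels) is deliberately left out.

```python
from typing import Any, Dict, List

def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per field (missing values become None)."""
    fields = sorted({key for row in rows for key in row})
    return {f: [row.get(f) for row in rows] for f in fields}

# Hypothetical sample rows, not data from the talk.
rows = [
    {"doc_id": 10, "url": "http://A", "clicks": 7},
    {"doc_id": 20, "url": "http://B", "clicks": 3},
]
columns = to_columns(rows)

# An aggregate over one field reads a single column and ignores the others.
total_clicks = sum(columns["clicks"])
print(total_clicks)
```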
8. Google BigQuery
   - Hosted Dremel (Dremel as a Service)
   - CLI (bq) and Web UI
   - Import data from Google Cloud Storage or local files
     - Files must be in CSV format
     - Nested data not supported [yet] except built-in datasets
     - Schema definition required
9. APACHE DRILL
10. Architecture
   - Only the execution engine knows the physical attributes of the cluster
     - # nodes, hardware, file locations, ...
   - Public interfaces enable extensibility
     - Developers can build parsers for new query languages
     - Developers can provide an execution plan directly
   - Each level of the plan has a human-readable representation
     - Facilitates debugging and unit testing
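As a rough illustration of a plan level with a human-readable form, the Python sketch below models a plan as a tree of operator nodes and renders it as indented text that a unit test can compare against. `PlanNode`, `render` and the operator labels are hypothetical names chosen for the sketch; this is not Drill's actual plan format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    """One operator in a hypothetical plan tree (not Drill's real plan format)."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)

def render(node: PlanNode, depth: int = 0) -> str:
    """Return an indented, human-readable dump of the plan tree."""
    line = "  " * depth + node.op + (f" [{node.detail}]" if node.detail else "")
    return "\n".join([line] + [render(child, depth + 1) for child in node.children])

plan = PlanNode("Aggregate", "COUNT(*)", [
    PlanNode("Filter", "DocId < 20", [
        PlanNode("Scan", "table t"),
    ]),
])
plan_text = render(plan)
print(plan_text)
```

Because the rendering is plain text, a test can assert on the exact lines of a plan without starting a cluster.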
11. Architecture (2)
12. Execution Engine Layers
   - Drill execution engine has two layers
     - Operator layer is serialization-aware
       - Processes individual records
     - Execution layer is not serialization-aware
       - Processes batches of records (blobs)
       - Responsible for communication, dependencies and fault tolerance
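The split between the two layers can be sketched in Python as follows. The execution layer here only moves opaque byte blobs, while the operator decodes a blob and works on individual records. The JSON-lines encoding and all function names are assumptions made for the sketch; communication, dependency tracking and fault tolerance are omitted.

```python
import json
from typing import Callable, Iterable, Iterator, List

def encode_batch(records: Iterable[dict]) -> bytes:
    """Serialize records into one opaque blob (JSON lines, chosen only for this sketch)."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")

def filter_operator(blob: bytes, predicate: Callable[[dict], bool]) -> bytes:
    """Operator layer: serialization-aware, decodes the blob and handles single records."""
    records = (json.loads(line) for line in blob.decode("utf-8").splitlines())
    return encode_batch(r for r in records if predicate(r))

def run_stage(batches: List[bytes], operator: Callable[[bytes], bytes]) -> Iterator[bytes]:
    """Execution layer: passes blobs to an operator without looking inside them."""
    for blob in batches:
        yield operator(blob)

batches = [
    encode_batch([{"id": 1, "ok": True}, {"id": 2, "ok": False}]),
    encode_batch([{"id": 3, "ok": True}]),
]
out = list(run_stage(batches, lambda b: filter_operator(b, lambda r: r["ok"])))
kept_ids = [json.loads(line)["id"] for blob in out for line in blob.decode("utf-8").splitlines()]
print(kept_ids)
```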
13. Data Flow
14. Nested Query Languages
   - DrQL
     - SQL-like query language for nested data
     - Compatible with Google BigQuery/Dremel
       - BigQuery applications should work with Drill
     - Designed to support efficient column-based processing
       - No record assembly during query processing
   - Mongo Query Language
     - {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
   - Other languages/programming models can plug in
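To make the Mongo-style query on this slide concrete, here is a minimal Python sketch that evaluates such a query document against in-memory records. It supports only equality matches in `$query` and ascending or descending keys in `$orderby`; `run_query` and the sample documents are hypothetical, and this is neither MongoDB's nor Drill's implementation.

```python
from typing import Any, Dict, List

def run_query(docs: List[Dict[str, Any]], spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Apply equality matches from $query, then sort by $orderby (1 = asc, -1 = desc)."""
    query = spec.get("$query", {})
    matched = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    # Sort by the last key first; Python's sort is stable, so the first key wins overall.
    for key, direction in reversed(list(spec.get("$orderby", {}).items())):
        matched.sort(key=lambda d: d[key], reverse=(direction == -1))
    return matched

# Hypothetical sample documents.
docs = [
    {"_id": 1, "x": 3, "y": "abc"},
    {"_id": 2, "x": 3, "y": "zzz"},
    {"_id": 3, "x": 5, "y": "abc"},
    {"_id": 4, "x": 3, "y": "abc"},
]
result = run_query(docs, {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}})
result_ids = [d["_id"] for d in result]
print(result_ids)
```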
15. Nested Data Model
   - The data model in Dremel is Protocol Buffers
     - Nested
     - Schema
   - Apache Drill is designed to support multiple data models
     - Schema: Protocol Buffers, Apache Avro, ...
     - Schema-less: JSON, BSON, ...
   - Flat records are supported as a special case of nested data
     - CSV, TSV, ...

   Avro IDL:

       enum Gender {
         MALE, FEMALE
       }

       record User {
         string name;
         Gender gender;
         long followers;
       }

   JSON:

       {
         "name": "Srivas",
         "gender": "Male",
         "followers": 100
       }
       {
         "name": "Raina",
         "gender": "Female",
         "followers": 200,
         "zip": "94305"
       }
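One way to see flat records as a special case of nested data is to map every record to dotted field paths: a flat record maps to itself, and a nested record gains one path per leaf. The Python sketch below does this for nested objects only (repeated fields are not handled). The `flatten` helper and the nested `address` record are hypothetical; the two flat records are the JSON examples from the slide.

```python
from typing import Any, Dict

def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Map a nested record to dotted field paths; a flat record maps to itself."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

srivas = {"name": "Srivas", "gender": "Male", "followers": 100}
raina = {"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}
nested = {"name": "Example", "address": {"zip": "94305"}}  # hypothetical nested record

# Schema-less input: the set of fields can differ from record to record.
extra_fields = set(flatten(raina)) - set(flatten(srivas))
print(flatten(nested), extra_fields)
```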
16. DrQL Example

       SELECT DocId AS Id,
              COUNT(Name.Language.Code) WITHIN Name AS Cnt,
              Name.Url + ',' + Name.Language.Code AS Str
       FROM t
       WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

   * Example from the Dremel paper
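To show what the WITHIN aggregation in this query computes, the Python sketch below emulates the query on a single nested record held as a dictionary. The sample record is modeled on the example document in the Dremel paper, and the output layout (a `Cnt` and a list of `Str` values per `Name`) is a simplification chosen for the sketch. This is not how Drill or Dremel execute the query, since both operate on columns without assembling records.

```python
import re
from typing import Any, Dict, List, Optional

def run_example(doc: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Emulate the example query on one nested record (simplified semantics)."""
    if not doc["DocId"] < 20:
        return None
    names: List[Dict[str, Any]] = []
    for name in doc.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): drop Name entries without a matching Url.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names.append({
            "Cnt": len(codes),                           # COUNT(...) WITHIN Name
            "Str": [f"{url},{code}" for code in codes],  # Name.Url + ',' + Name.Language.Code
        })
    return {"Id": doc["DocId"], "Name": names}

doc = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}], "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
result = run_example(doc)
print(result)
```

The count is computed once per `Name` entry rather than once per document, which is what `WITHIN Name` expresses.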
17. Query Components
   - Query components: SELECT, FROM, WHERE, GROUP BY, HAVING, (JOIN)
   - Key logical operators: Scan, Filter, Aggregate, (Join)
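The mapping from query components to logical operators can be sketched as a small pipeline in Python, where FROM becomes a scan, WHERE a filter and GROUP BY an aggregate. The operator functions, the table and its field names are hypothetical and stand in for a query such as `SELECT country, SUM(clicks) FROM t WHERE clicks > 0 GROUP BY country`.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]

def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Scan: produce rows from a source (an in-memory list in this sketch)."""
    yield from table

def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter: keep rows that satisfy the WHERE predicate."""
    return (row for row in rows if predicate(row))

def aggregate_sum(rows: Iterable[Row], group_by: str, value: str) -> Dict[Any, int]:
    """Aggregate: SUM(value) per GROUP BY key."""
    totals: Dict[Any, int] = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += int(row[value])
    return dict(totals)

# Hypothetical table t.
t = [
    {"country": "us", "clicks": 3},
    {"country": "gb", "clicks": 0},
    {"country": "us", "clicks": 2},
    {"country": "gb", "clicks": 5},
]
totals = aggregate_sum(filter_rows(scan(t), lambda r: r["clicks"] > 0), "country", "clicks")
print(totals)
```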
18. Extensibility
   - Nested query languages
     - Pluggable model
     - DrQL
     - Mongo Query Language
     - Cascading
   - Distributed execution engine
     - Extensible model (e.g., Dryad)
     - Low-latency
     - Fault tolerant
   - Nested data formats
     - Pluggable model
     - Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
     - Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
   - Scalable data sources
     - Pluggable model
     - Hadoop
     - HBase
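A pluggable model for data formats can be approximated with a registry that maps a format name to a reader function, so a new format is added without touching the code that consumes records. The Python sketch below is one such arrangement; `register_format`, `FORMATS` and `read` are hypothetical names, not Drill interfaces.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

Reader = Callable[[str], Iterator[dict]]
FORMATS: Dict[str, Reader] = {}  # hypothetical registry: format name -> reader

def register_format(name: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a reader into the registry under the given name."""
    def decorator(fn: Reader) -> Reader:
        FORMATS[name] = fn
        return fn
    return decorator

@register_format("json")
def read_json_lines(text: str) -> Iterator[dict]:
    """Schema-less reader: one JSON object per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

@register_format("csv")
def read_csv(text: str) -> Iterator[dict]:
    """Flat-record reader: the header row supplies field names, values stay strings."""
    yield from csv.DictReader(io.StringIO(text))

def read(fmt: str, text: str) -> List[dict]:
    """Look up the reader by format name and return its records."""
    return list(FORMATS[fmt](text))

json_rows = read("json", '{"name": "Srivas", "followers": 100}\n')
csv_rows = read("csv", "name,followers\nRaina,200\n")
print(json_rows, csv_rows)
```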
19. Scan Operators
   - Drill supports multiple data formats by having per-format scan operators
   - Queries involving multiple data formats/sources are supported
   - Fields and predicates can be pushed down into the scan operator
   - Scan operators may have adaptive side-effects (database cracking)
     - Produce ColumnIO from RecordIO
     - Google PowerDrill stores materialized expressions with the data

                            Scan with schema                          Scan without schema
   Operator output          Protocol Buffers                          JSON-like (MessagePack)
   Supported data formats   ColumnIO (column-based protobuf/Dremel)   JSON
                            RecordIO (row-based protobuf)             HBase
                            CSV
   SELECT ... FROM          ColumnIO(proto URI, data URI)             Json(data URI)
                            RecordIO(proto URI, data URI)             HBase(table name)
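Pushing fields and predicates into a scan operator means the scan itself discards unwanted rows and columns, so downstream operators never see them. The Python sketch below shows the idea for a CSV source; `csv_scan` and its parameters are hypothetical, and a real scan operator would additionally avoid decoding the columns it does not need.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Optional

def csv_scan(
    text: str,
    fields: Optional[List[str]] = None,
    predicate: Optional[Callable[[Dict[str, str]], bool]] = None,
) -> Iterator[Dict[str, str]]:
    """Scan CSV text, applying the pushed-down predicate and field list inside the scan."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is not None and not predicate(row):
            continue  # predicate pushdown: the row never leaves the scan
        # Field pushdown: emit only the requested columns.
        yield {f: row[f] for f in fields} if fields else dict(row)

data = "name,gender,followers\nSrivas,Male,100\nRaina,Female,200\n"
popular = list(csv_scan(data, fields=["name"], predicate=lambda r: int(r["followers"]) > 150))
print(popular)
```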
20. Design Principles
   - Flexible
     - Pluggable query languages
     - Extensible execution engine
     - Pluggable data formats
       - Column-based and row-based
       - Schema and schema-less
     - Pluggable data sources
   - Easy
     - Unzip and run
     - Zero configuration
     - Reverse DNS not needed
     - IP addresses can change
     - Clear and concise log messages
   - Dependable
     - No SPOF
     - Instant recovery from crashes
   - Fast
     - C/C++ core with Java support
       - Google C++ style guide
     - Min latency and max throughput (limited only by hardware)
21. Hadoop Integration
   - Hadoop data sources
     - Hadoop FileSystem API (HDFS/MapR-FS)
     - HBase
   - Hadoop data formats
     - Apache Avro
     - RCFile
   - MapReduce-based tools to create column-based formats
   - Table registry in HCatalog
   - Run long-running services in YARN
22. Get Involved!
   - Download these slides: http://www.mapr.com/company/events/bay-area-hug/9-19-2012
   - Join the mailing list: [email protected]
   - Join MapR: [email protected]