Sep 2012 HUG: Apache Drill for Interactive Analysis

DESCRIPTION

Apache Drill is a new open source Apache Incubator project for interactive analysis of large-scale datasets, inspired by Google's Dremel. It enables users to query terabytes of data in seconds. Apache Drill supports a broad range of data formats, including Protocol Buffers, Avro and JSON, and leverages Hadoop and HBase as data sources. Drill's primary query language, DrQL, is compatible with Google BigQuery. In this talk we provide an overview of the Drill project, including its design goals and architecture. Presenter: Jason Frantz, Software Architect, MapR Technologies

  • 1. Apache Drill: Interactive Analysis of Large-Scale Datasets. Jason Frantz, Architect, MapR
  • 2. My Background: Caltech, Clustrix, MapR; founding member of Apache Drill
  • 3. MapR Technologies
    - The open enterprise-grade distribution for Hadoop: easy, dependable, and fast
    - Open source with standards-based extensions
    - MapR is deployed at thousands of companies, from small Internet startups to the world's largest enterprises
    - MapR customers analyze massive amounts of data: hundreds of billions of events daily, 90% of the world's Internet population monthly, $1 trillion in retail purchases annually
    - MapR has partnered with Google to provide Hadoop on Google Compute Engine
  • 4. Latency Matters
    - Ad-hoc analysis with interactive tools
    - Real-time dashboards
    - Event/trend detection and analysis: network intrusions, fraud, failures
  • 5. Big Data Processing
    |                     | Batch processing | Interactive analysis    | Stream processing |
    |---------------------|------------------|-------------------------|-------------------|
    | Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending      |
    | Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream |
    | Programming model   | MapReduce        | Queries                 | DAG               |
    | Users               | Developers       | Analysts and developers | Developers        |
    | Google project      | MapReduce        | Dremel                  |                   |
    | Open source project | Hadoop MapReduce |                         | Storm and S4      |
    Introducing Apache Drill
  • 6. GOOGLE DREMEL
  • 7. Google Dremel
    - Interactive analysis of large-scale datasets: trillion records at interactive speeds
    - Complementary to MapReduce; used by thousands of Google employees
    - Paper published at VLDB 2010; authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
    - Model: nested data model with schema; most data at Google is stored/transferred in Protocol Buffers; normalization (to relational) is prohibitive; SQL-like query language with nested data support
    - Implementation: column-based storage and processing; in-situ data access (GFS and Bigtable); tree architecture as in Web search (and databases)
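To make the column-based storage point above concrete, here is a minimal illustrative sketch (not Dremel's actual striping algorithm, which additionally tracks repetition and definition levels for nested fields): the same records stored row-wise and column-wise, where a query over a single field only touches that field's column.

```python
# Hypothetical illustration of row-based vs column-based layout.
rows = [
    {"DocId": 10, "Url": "http://A"},
    {"DocId": 20, "Url": "http://B"},
    {"DocId": 30, "Url": "https://C"},
]

# Column-based: one contiguous list per field.
columns = {
    "DocId": [r["DocId"] for r in rows],
    "Url":   [r["Url"] for r in rows],
}

# A query over DocId scans only the DocId column;
# no record assembly is needed.
total = sum(v for v in columns["DocId"] if v < 25)
print(total)  # 30
```

In a real columnar engine the per-column data is also compressed and encoded, which is where much of the speedup comes from.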
  • 8. Google BigQuery
    - Hosted Dremel (Dremel as a Service)
    - CLI (bq) and Web UI
    - Import data from Google Cloud Storage or local files; files must be in CSV format
    - Nested data not supported [yet] except built-in datasets
    - Schema definition required
  • 9. APACHE DRILL
  • 10. Architecture
    - Only the execution engine knows the physical attributes of the cluster: # of nodes, hardware, file locations, ...
    - Public interfaces enable extensibility: developers can build parsers for new query languages or provide an execution plan directly
    - Each level of the plan has a human-readable representation, which facilitates debugging and unit testing
  • 11. Architecture (2)
  • 12. Execution Engine Layers
    - The Drill execution engine has two layers
    - The operator layer is serialization-aware and processes individual records
    - The execution layer is not serialization-aware: it processes batches of records (blobs) and is responsible for communication, dependencies, and fault tolerance
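A hedged sketch of that two-layer split (function names are hypothetical, not Drill's API): the execution layer shuttles opaque byte blobs between operators without understanding their contents, while the operator layer decodes a blob and works on individual records.

```python
import json

def execution_layer(batches, operator):
    """Execution layer: moves opaque blobs (record batches) through an
    operator; it never interprets the bytes itself."""
    for blob in batches:
        yield operator(blob)

def filter_operator(blob):
    """Operator layer: serialization-aware; decodes the blob and
    processes individual records."""
    records = json.loads(blob.decode("utf-8"))
    kept = [r for r in records if r["followers"] > 150]
    return json.dumps(kept).encode("utf-8")

batch = json.dumps([
    {"name": "Srivas", "followers": 100},
    {"name": "Raina", "followers": 200},
]).encode("utf-8")

for out in execution_layer([batch], filter_operator):
    print(json.loads(out))  # [{'name': 'Raina', 'followers': 200}]
```

Keeping the execution layer serialization-agnostic is what lets it handle communication and fault tolerance uniformly, regardless of data format.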
  • 13. Data Flow
  • 14. Nested Query Languages
    - DrQL: SQL-like query language for nested data
    - Compatible with Google BigQuery/Dremel: BigQuery applications should work with Drill
    - Designed to support efficient column-based processing: no record assembly during query processing
    - Mongo Query Language: {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
    - Other languages/programming models can plug in
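The Mongo query shown above selects documents where x equals 3 and y equals "abc", sorted ascending by x. A toy evaluator for that $query/$orderby shape, purely to illustrate the semantics (this is not Drill's Mongo plugin):

```python
def run_mongo_style(docs, q):
    """Evaluate a tiny subset of {$query: ..., $orderby: ...}:
    equality predicates plus single-field ascending/descending sort."""
    pred = q.get("$query", {})
    matched = [d for d in docs
               if all(d.get(k) == v for k, v in pred.items())]
    for field, direction in q.get("$orderby", {}).items():
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched

docs = [
    {"x": 3, "y": "abc", "id": 2},
    {"x": 3, "y": "abc", "id": 1},
    {"x": 5, "y": "abc", "id": 3},
]
result = run_mongo_style(docs, {"$query": {"x": 3, "y": "abc"},
                                "$orderby": {"x": 1}})
print(len(result))  # 2
```

A pluggable query-language layer means each frontend (DrQL, Mongo, Cascading) only has to translate into the engine's common logical plan.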
  • 15. Nested Data Model
    - The data model in Dremel is Protocol Buffers: nested, with schema
    - Apache Drill is designed to support multiple data models
    - With schema: Protocol Buffers, Apache Avro, ...
    - Schema-less: JSON, BSON, ...
    - Flat records are supported as a special case of nested data: CSV, TSV, ...
    - Avro IDL example:
        enum Gender {
          MALE, FEMALE
        }

        record User {
          string name;
          Gender gender;
          long followers;
        }
    - JSON example:
        {
          "name": "Srivas",
          "gender": "Male",
          "followers": 100
        }
        {
          "name": "Raina",
          "gender": "Female",
          "followers": 200,
          "zip": "94305"
        }
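The two JSON records above illustrate the schema-less case: Raina carries an extra "zip" field that Srivas lacks. As a sketch of why flat records are just the depth-one special case of the nested model, here is a hypothetical helper (not Drill code) that measures nesting depth:

```python
def max_depth(value):
    """Nesting depth of a JSON-like value; flat records are depth 1."""
    if isinstance(value, dict):
        return 1 + max((max_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((max_depth(v) for v in value), default=0)
    return 0

flat = {"name": "Raina", "gender": "Female",
        "followers": 200, "zip": "94305"}
nested = {"DocId": 20,
          "Name": [{"Url": "http://B", "Language": [{"Code": "en"}]}]}

print(max_depth(flat))    # 1
print(max_depth(nested))  # 3
```

An engine built for arbitrary depth handles CSV/TSV rows for free, since they are simply records whose fields are all scalars.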
  • 16. DrQL Example
        SELECT DocId AS Id,
               COUNT(Name.Language.Code) WITHIN Name AS Cnt,
               Name.Url + ',' + Name.Language.Code AS Str
        FROM t
        WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
    * Example from the Dremel paper
  • 17. Query Components
    - Query components: SELECT, FROM, WHERE, GROUP BY, HAVING, (JOIN)
    - Key logical operators: Scan, Filter, Aggregate, (Join)
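A minimal sketch of how those logical operators compose into a pipeline. This is generator-based pseudo-implementation for illustration, not Drill's operator interfaces:

```python
def scan(table):
    """Scan: produce records from a data source."""
    yield from table

def filter_op(records, pred):
    """Filter: implements the WHERE clause."""
    return (r for r in records if pred(r))

def aggregate(records, key, agg):
    """Aggregate: GROUP BY key, applying agg to each group."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {k: agg(v) for k, v in groups.items()}

table = [
    {"gender": "Male",   "followers": 100},
    {"gender": "Female", "followers": 200},
    {"gender": "Female", "followers": 50},
]

# Roughly: SELECT gender, COUNT(*) FROM t
#          WHERE followers >= 100 GROUP BY gender
result = aggregate(
    filter_op(scan(table), lambda r: r["followers"] >= 100),
    key="gender",
    agg=len,
)
print(result)  # {'Male': 1, 'Female': 1}
```

Because Scan and Filter are generators, records stream through the plan without materializing intermediate tables; only Aggregate has to buffer.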
  • 18. Extensibility
    - Nested query languages: pluggable model; DrQL, Mongo Query Language, Cascading
    - Distributed execution engine: extensible model (e.g., Dryad); low-latency; fault tolerant
    - Nested data formats: pluggable model; column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV); schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
    - Scalable data sources: pluggable model; Hadoop, HBase
  • 19. Scan Operators
    - Drill supports multiple data formats by having per-format scan operators
    - Queries involving multiple data formats/sources are supported
    - Fields and predicates can be pushed down into the scan operator
    - Scan operators may have adaptive side-effects (database cracking): produce ColumnIO from RecordIO; Google PowerDrill stores materialized expressions with the data
    |                        | Scan with schema                                                            | Scan without schema               |
    |------------------------|-----------------------------------------------------------------------------|-----------------------------------|
    | Operator output        | Protocol Buffers                                                            | JSON-like (MessagePack)           |
    | Supported data formats | ColumnIO (column-based protobuf/Dremel), RecordIO (row-based protobuf), CSV | JSON, HBase                       |
    | SELECT ... FROM syntax | ColumnIO(proto URI, data URI), RecordIO(proto URI, data URI)                | Json(data URI), HBase(table name) |
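To make "fields and predicates can be pushed down into the scan operator" concrete, a hedged sketch (names hypothetical, not Drill's scan interface): the scan itself applies the predicate and projects fields, so downstream operators never see rows or columns they would only discard.

```python
def scan_with_pushdown(source, fields=None, predicate=None):
    """A scan operator that applies filtering and projection at the
    source, instead of emitting full records for a downstream
    Filter/Project to throw away."""
    for record in source:
        if predicate is not None and not predicate(record):
            continue  # predicate pushdown: drop the row early
        if fields is not None:
            record = {f: record[f] for f in fields}  # field pushdown
        yield record

source = [
    {"DocId": 10, "Url": "http://A", "Code": "en"},
    {"DocId": 30, "Url": "ftp://C",  "Code": "de"},
    {"DocId": 15, "Url": "http://B", "Code": "fr"},
]

rows = list(scan_with_pushdown(
    source,
    fields=["DocId", "Url"],
    predicate=lambda r: r["Url"].startswith("http") and r["DocId"] < 20,
))
print(rows)  # [{'DocId': 10, 'Url': 'http://A'}, {'DocId': 15, 'Url': 'http://B'}]
```

Pushdown matters most for columnar sources, where skipping a field means never reading its column from disk at all.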
  • 20. Design Principles
    - Flexible: pluggable query languages; extensible execution engine; pluggable data formats (column-based and row-based, schema and schema-less); pluggable data sources
    - Easy: unzip and run; zero configuration; reverse DNS not needed; IP addresses can change; clear and concise log messages
    - Dependable: no SPOF; instant recovery from crashes
    - Fast: C/C++ core with Java support; Google C++ style guide; min latency and max throughput (limited only by hardware)
  • 21. Hadoop Integration
    - Hadoop data sources: Hadoop FileSystem API (HDFS/MapR-FS), HBase
    - Hadoop data formats: Apache Avro, RCFile
    - MapReduce-based tools to create column-based formats
    - Table registry in HCatalog
    - Run long-running services in YARN
  • 22. Get Involved!
    - Download these slides: http://www.mapr.com/company/events/bay-area-hug/9-19-2012
    - Join the mailing list: [email protected]
    - Join MapR: [email protected]