Apache Drill

Embed Size (px)


Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.

Text of Apache Drill

  • 1.Apaches Answer to Low Latency Interactive Query for Big Data February 13, 2013

2. Agenda Apache Drill overview Key features Status and progress Discuss potential use cases and cooperation 3. Big Data Workloads ETL Data mining Blob store Lightweight OLTP on large datasets Index and model generation Web crawling Stream processing Clustering, anomaly detection and classification Interactive analysis 4. Interactive Queries and HadoopCommon Solutions Compile SQL toExport MapReduceExternal tables in anMapReduce results to RDBMS and MPP databasequery the RDBMSEmerging Technologies Impala Real-timeReal-timeSQL basedinteractive queriesinteractive queriesanalytics 5. Example Problem Jane works as anTransaction analyst at an e-information commerce company How does she figure User out good targeting profiles segments for the next marketing campaign? She has some ideasAccess and lots of data logs 6. Solving the Problem with Traditional Systems Use an RDBMS ETL the data from MongoDB and Hadoop into the RDBMS MongoDB data must be flattened, schematized, filtered and aggregated Hadoop data must be filtered and aggregated Query the data using any SQL-based tool Use MapReduce ETL the data from Oracle and MongoDB into Hadoop Work with the MapReduce team to generate the desired analyses Use Hive ETL the data from Oracle and MongoDB into Hadoop MongoDB data must be flattened and schematized But HiveQL is limited, queries take too long and BI tool support is limited 7. WWGDDistributedInteractive BatchNoSQLFile Systemanalysisprocessing GFSBigTableDremel MapReduceHadoopHDFS HBase??? MapReduce Build Apache Drill to provide a true open sourcesolution to interactive analysis of Big Data 8. Apache Drill Overview Interactive analysis of Big Data using standard SQL Fast Low latency queries Columnar execution Inspired by Google Dremel/BigQueryInteractive queries Complement native interfaces andApache Drill Data analystMapReduce/Hive/Pig Reporting Open 100 ms-20 min Community driven open source project Under Apache Software Foundation Modern Standard ANSI SQL:2003 (select/into) Data mining Nested/hierarchical data supportMapReduceModeling Hive Schema is optionalPig Large ETL Supports RDBMS, Hadoop and NoSQL 20 min-20 hr 9. How Does It Work? Drillbits run on each node, designed tomaximize data locality Processing is done outside MapReduceSELECT * FROMparadigm (but possibly within YARN) oracle.transactions, Queries can be fed to any Drillbitmongo.users, hdfs.events Coordination, query LIMIT 1planning, optimization, scheduling, andexecution are distributed 10. Key Features Full SQL (ANSI SQL:2003) Nested data Schema is optional Flexible and extensible architecture 11. Full SQL (ANSI SQL:2003) Drill supports standard ANSI SQL:2003 Correlated subqueries, analytic functions, SQL-like is not enough Use any SQL-based tool with Apache Drill Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, Standard ODBC and JDBC driversClientTableau DrillbitMicroStrategyDrill% ODBC%SQL%Query% Query% DriverDrillbitsDriverParserPlanner Drill% WorkerExcel Drill%Worker SAP%Crystal%Reports 12. Nested Data JSON Nested data is becoming prevalent{ "name": "Homer", JSON, BSON, XML, Protocol Buffers, Avro, etc. "gender": "Male", "followers": 100 The data source may or may not be aware children: [ MongoDB supports nested data natively {name: "Bart"}, {name: "Lisa} A single HBase value could be a JSON document ] (compound nested type)} Google Dremels innovation was efficient columnar storage and querying of nested data Flattening nested data is error-prone and oftenAvro enum Gender {impossible MALE, FEMALE Think about repeated and optional fields at every } levelrecord User { Apache Drill supports nested datastring name; Gender gender; Extensions to ANSI SQL:2003 long followers; } 13. Schema is Optional Many data sources do not have rigid schemas Schemas change rapidly Each record may have a different schema Sparse and wide rows in HBase and Cassandra, MongoDB Apache Drill supports querying against unknown schemas Query any HBase, Cassandra or MongoDB table User can define the schema or let the system discover it automatically System of record may already have schema information Why manage it in a separate system? No need to manage schema evolutionRow KeyCF contentsCF anchor"com.cnn.www"contents:html = ""anchor:my.look.ca = "CNN.com"anchor:cnnsi.com = "CNN""com.foxnews.www"contents:html = ""anchor:en.wikipedia.org = "Fox News" 14. Flexible and Extensible Architecture Apache Drill is designed for extensibility Well-documented APIs and interfaces Data sources and file formats Implement a custom scanner to support a new data source or file format Query languages SQL:2003 is the primary language Implement a custom Parser to support a Domain Specific Language UDFs and UDTFs Optimizers Drill will have a cost-based optimizer Clear surrounding APIs support easy optimizer exploration Operators Custom operators can be implemented Special operators for Mahout (k-means) being designed Operator push-down to data source (RDBMS) 15. How Does Impala Fit In?Impala Strengths Questions Beta currently available Open Source Lite Easy install and setup on top of Doesnt support RDBMS or otherCloudera NoSQLs (beyond Hadoop/HBase) Faster than Hive on some queries Early row materialization increases SQL-like query languagefootprint and reduces performance Limited file format support Query results must fit in memory! Rigid schema is required No support for nested data Compound APIs restrict optimizer progression SQL-like (not SQL)Many important features are coming soon. Architectural foundation is constrained. Nocommunity development. 16. Status: In Progress Heavy active development by multiple organizations Available Logical plan syntax and interpreter Reference interpreter In progress SQL interpreter Storage engine implementations for Accumulo, Cassandra, HBase and various file formats Significant community momentum Over 200 people on the Drill mailing list Over 200 members of the Bay Area Drill User Group Drill meetups across the US and Europe OpenDremel team joined Apache Drill Anticipated schedule: Prototype: Q1 Alpha: Q2 Beta: Q3 17. Why Apache Drill Will Be SuccessfulResourcesCommunityArchitecture Contributors have strong Development done in the Full SQL backgrounds fromopen New data support companies like Oracle, Active contributors from Extensible APIs IBM Netezza, Informatica, multiple companies Full Columnar Execution Clustrix and Pentaho Rapidly growing Beyond Hadoop 18. Questions? What problems can Drill solve for you? Where does it fit in the organization? Which data sources and BI tools are importantto you? 19. Lets Talk! tdunning@maprtech.com tdunning@apache.org @ted_dunning @ApacheDrill Slides at http://bit.ly/YxZq8X See also http://www.mapr.com/support/community-resources/drill 20. APPENDIX 21. Why Not Leverage MapReduce? Scheduling Model Coarse resource model reduces hardware utilization Acquisition of resources typically takes 100s of millis to seconds Barriers Map completion required before shuffle/reduce commencement All maps must complete before reduce can start In chained jobs, one job must finish entirely before the next one can start Persistence and Recoverability Data is persisted to disk between each barrier Serialization and deserialization are required between execution phase