23
© 2014 MapR Technologies 1 ® © 2014 MapR Technologies Tomer Shiran, VP Product Management

Self-Service Data Exploration with Apache Drill

Embed Size (px)

Citation preview

Page 1: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 1

®

© 2014 MapR Technologies

Tomer Shiran, VP Product Management

Page 2: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 2

Empowering “as it happens” businesses by speeding up the

data-to-action cycle

®

Page 3: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 3

Top-Ranked NoSQL

Top-Ranked Hadoop Distribution

Top-Ranked SQL-on-Hadoop Solution

®

Page 4: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 4

Data is doubling in size every two years

Page 5: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 5

SEMI-STRUCTURED DATA

STRUCTURED DATA

1980 2000 2010 1990 2020

Unstructured data will account for more than 80% of the data

collected by organizations

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Total Data S

tored

Page 6: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 6 1980 2000 2010 1990 2020

Fixed schema

DBA controls structure

Dynamic schema (schema-free)

Application controls structure

NON-RELATIONAL DATASTORES RELATIONAL DATABASES

MBs-TBs TBs-PBs Volume

Database

Data Increasingly Stored in Non-Relational Datastores

Structure

Development

Structured Structured, semi-structured and unstructured

Planned (release cycle = months-years) Iterative (release cycle = days-weeks)

Page 7: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 7

SQL in a Non-Relational World

•  SQL •  BI (Tableau, MicroStrategy, etc.) •  Low latency •  Scalability

•  Create and maintain schemas on: –  HDFS (Parquet, JSON, etc.) –  HBase –  …

•  Transform or copy data

2 DON’T WANT WANT

Page 8: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 8

• Data exploration for Hadoop and NoSQL • Low latency at scale • Point-and-query vs. schema-first • Extreme ease of use • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

APACHE DRILL

Agility

•  Explore big data in its native format without IT intervention

Flexibility

•  Analyze semi-structured/nested data coming from NoSQL applications and JSON sources

Plug-and-Play

•  Leverage existing SQL skillsets, BI tools and Apache Hive deployments

Page 9: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 9

Drill is the Top-Ranked SQL-on-Hadoop

Page 10: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 10

Evolution Towards Self-Service Data Exploration

Schema Management and

Transformation

Data Visualization

IT-dependent

IT-dependent

IT-dependent

Self-service

IT-dependent

Self-service

Self-service

Self-service

Traditional BI w/ RDBMS

Self-Service BI w/ RDBMS

“Gen 1” SQL-on-Hadoop

Self-Service Data Exploration

Zero-day analytics

Page 11: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 11

Drill’s Role in the Enterprise Data Architecture

Raw data

•  JSON, CSV, ...

“Optimized” data

•  Parquet, …

Centrally-structured data

•  Schemas in Hive Metastore

Relational data

•  Highly-structured data

Hive, Impala, Spark SQL

Oracle, Teradata

Exploration (known and unknown questions)

Page 12: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 12

Leverage Existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Page 13: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 13

A storage plugin instance -  DFS -  Hbase/MapRDB -  Hive Metastore/HCatalog

A workspace -  Sub-directory -  Hive database -  HBase namespace

A table -  pathnames -  HBase table -  Hive table

Combine Data from Multiple Sources on the Fly

SELECT  *  FROM  dfs.demo.`yelp/business.json`  !

Page 14: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 14 © 2014 MapR Technologies ®

How Drill Achieves Data Agility

Page 15: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 15

Drill’s Data Model is Flexible

JSON BSON

HBase

Parquet/Avro

CSV TSV

Schema-less Fixed schema

Complex

Flat

Flexibility

Flexibility

Name ! Gender ! Age !Michael ! M ! 6 !Jennifer ! F ! 3 !

{ ! name: { ! first: Michael, ! last: Smith ! }, ! hobbies: [ski, soccer], ! district: Los Altos !} !{ ! name: { ! first: Jennifer, ! last: Gates ! }, ! hobbies: [sing], ! preschool: CCLC !} !

RDBMS/SQL-on-Hadoop table

Apache Drill table

Page 16: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 16

Drill Supports Schema Discovery On-The-Fly

•  Fixed schema •  Leverage schema in centralized

repository (Hive Metastore)

•  Fixed schema, evolving schema or schema-less

•  Leverage schema in centralized repository or self-describing data

2 Schema Discovered On-The-Fly Schema Declared In Advance

SCHEMA ON WRITE

SCHEMA BEFORE READ

SCHEMA ON THE FLY

Page 17: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 17

$50M $50M in Free Training

Page 18: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 18

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies

Page 19: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 19 © 2014 MapR Technologies ®

Under the Hood

Page 20: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 20

High Level Architecture

Cluster of commodity servers –  Daemon (drillbit) on each node

ZooKeeper maintains ephemeral cluster membership information –  Drillbit uses ZooKeeper to find other drillbits in the cluster –  Client uses ZooKeeper to find drillbits

Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)

–  Better performance and manageability

Data processing unit is columnar record batches  –  Enables schema flexibility with negligible performance impact

Page 21: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 21

Drill Maximizes Data Locality

Data Source Best Practice HDFS or MapR-FS drillbit on each DataNode HBase or MapR-DB drillbit on each RegionServer MongoDB drillbit on each mongod node (when using replicas, run it on the replica node)

drillbit

DataNode/ RegionServer/

mongod

drillbit

DataNode/ RegionServer/

mongod

drillbit

DataNode/ RegionServer/

mongod

ZooKeeper ZooKeeper

ZooKeeper …

Page 22: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 22

Core Modules within drillbit  

SQL Parser Hive

HBase

Sto

rage

Plu

gins

MongoDB

DFS

Phy

sica

l Pla

n

Execution Lo

gica

l Pla

n Optimizer

Page 23: Self-Service Data Exploration with Apache Drill

®© 2014 MapR Technologies 23

SELECT * Query Execution

drillbit  ZooKeeper

Client (JDBC, ODBC,

REST)

1.  Find drillbits (once per session)

3.  Create logical and physical execution plans 4.  Farm out execution of fragments to cluster

(completely distributed execution)

ZooKeeper ZooKeeper

drillbit  drillbit  

2.  Submit query to drillbit

5.  Return results to client

* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4