Self-Service BI for Big Data Applications Using Apache Drill (Big Data & Analytics Insight at Ba4all, 2015-06-04)

© 2015 MapR Technologies

Self-service BI for big data applications using Apache Drill


MapR Data Platform for Hadoop and NoSQL

[Architecture diagram]

APACHE HADOOP AND OSS ECOSYSTEM

EXECUTION ENGINES
• Batch: Pig, Cascading, Spark, MapReduce v1 & v2, Tez*
• Streaming: Spark Streaming, Storm
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• SQL: Hive, Impala, Spark SQL, Drill

DATA GOVERNANCE AND OPERATIONS
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue
• Security: Sentry, Knox
• Workflow & Data Governance: Oozie, Falcon
• Provisioning & coordination: ZooKeeper, Whirr, Juju, Savannah

MapR Data Platform: MapR-FS, MapR-DB
Management - MCS
Platform attributes: Enterprise-grade, Security, Operational, Performance, Multi-tenancy, Interoperability


Data Is Doubling Every Two Years

Unstructured data will account for more than 80% of the data collected by organizations.

[Chart, 1980-2020: Total Data Stored (STRUCTURED DATA, SEMI-STRUCTURED DATA) compared with IT Resources]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data


Data Increasingly Stored in Non-Relational Datastores

[Timeline: 1980-2020]

RELATIONAL DATABASES
• Volume: GBs-TBs
• Database: Fixed schema; DBA controls structure
• Structure: Structured
• Development: Planned (release cycle = months-years)

NON-RELATIONAL DATASTORES
• Volume: TBs-PBs
• Database: Dynamic / Flexible schema; application controls structure
• Structure: Structured, semi-structured and unstructured
• Development: Iterative (release cycle = days-weeks)


How To Bring SQL Into An Unstructured Future?

Familiarity of SQL
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability

Agility & Flexibility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transform or silos of data


Industry's First Schema-free SQL engine for Big Data


Apache Drill Brings Flexibility & Performance

Access to any data type, any data source

• Relational

• Nested data

• Schema-less

Rapid time to insights

• Query data in-situ

• No Schemas required

• Easy to get started

Integration with existing tools

• ANSI SQL

• BI tool integration

Scale in all dimensions

• TB-PB of scale

• 1000s of users

• 1000s of nodes

Granular Security

• Authentication

• Row/column level controls

• De-centralized
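
To make "query data in-situ" concrete, the sketch below builds the request body that Drill's REST interface accepts for a SQL statement over a raw JSON file. It is a minimal illustration and not part of the original deck: the host and port (`localhost:8047`), the `dfs` storage plugin name, and the file path `/data/clicks.json` are assumptions about a local, default-configured Drill installation.

```python
import json
import urllib.request

# Assumption: a Drillbit running locally with its web port at the default 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical file path; no table or schema is declared before querying it.
CLICKS_SQL = "SELECT t.cust_id, t.device FROM dfs.`/data/clicks.json` t LIMIT 10"


def build_query_payload(sql: str) -> bytes:
    """Encode a SQL statement as the JSON body of a Drill REST query."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")


def run_query(sql: str, url: str = DRILL_URL) -> list:
    """POST the query to Drill and return the result rows (needs a live Drillbit)."""
    request = urllib.request.Request(
        url,
        data=build_query_payload(sql),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["rows"]
```

Calling `run_query(CLICKS_SQL)` requires a reachable Drill instance; `build_query_payload` can be inspected on its own to see what would be sent.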


Extending Self Service to Schema-free data

[Chart: Use cases for BI over time; vertical axis: Agility & Business Value]

• 1980s-1990s: IT-Driven BI (IT-created reports, spreadsheets)
• 2000s: IT-Driven BI plus Self-Service BI (analyst-driven with IT support for ETL)
• Now: IT-Driven BI, Self-Service BI, plus Schema-Free Data Exploration (analyst-driven with no IT dependency)


Enabling “As-It-Happens” Business with Instant Analytics

Governed approach
Hadoop data → Data modeling → Transformation → Data movement (optional) → Users
Total time to insight: weeks to months

Exploratory approach
Hadoop data → Users
Total time to insight: minutes

Drivers of change: New business questions, Source data evolution


Drill’s Role in the Enterprise Data Architecture

Raw data

• JSON, CSV, ...

“Optimized” data

• Parquet, …

Centrally-structured data

• Schemas in Hive Metastore

• Hive, Impala, Spark SQL

Relational data

• Highly-structured data

• Oracle, Teradata

Exploration (known and unknown questions)


Access control that scales

• PAM Authentication + User Impersonation

• Fine-grained row and column level access control with Drill Views; no centralized security repository required

[Diagram: Users → Drill View 1, Drill View 2 → Files, HBase, Hive]


Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins. Permission: Admins.

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins. Permission: Data Scientists.
Not a physical data copy

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Business Analyst View
Owner: Admins. Permission: Business Analysts.

Name | City | State
Dave | San Jose | CA
John | Boulder | CO
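
The two views in this example can be imitated in a few lines of Python to show what each audience ends up seeing. This is only a sketch of the masking and projection logic applied to the two sample rows from the slide; in Drill the same logic would live in view definitions, and the function and field names below are hypothetical.

```python
# The two sample rows from the raw file shown on the slide.
RAW_ROWS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card_number: str) -> str:
    """Keep the first group of digits and replace every later digit with '1'."""
    first, *rest = card_number.split("-")
    return "-".join([first] + ["1" * len(group) for group in rest])


def data_scientist_view(rows):
    """All columns, with the card number masked (column-level masking)."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]


def business_analyst_view(rows):
    """Name, city and state only; the card column is not exposed at all."""
    return [{key: value for key, value in row.items() if key != "card"} for row in rows]
```

Neither function copies or rewrites the raw rows, which mirrors the "not a physical data copy" point: each view is computed from the source whenever it is read.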


Business Benefits

Rapid time-to-value for business analysts:
SQL specialists and BI analysts can query any dataset, including complex nested data, instantly, versus waiting several weeks for data preparation by IT.

Efficiency with easy governance for IT:
IT can avoid unnecessary ETL cycles and schema maintenance activities, but still ensure governance through easy-to-deploy granular access controls.

Accelerated big data adoption for businesses:
Organizations can use the existing and large SQL talent base and tools to rapidly discover new business insights from big data.


Quick Tour: Self-Service Data Exploration with Apache Drill


Data is growing fast and is scattered across various silos:

Website click logs

• JSON files

Product database

• MapR-DB NoSQL

Customers

• CSV files


Apache Drill: SQL in a Non-Relational World

WANT
• ANSI SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• Agility

DON’T WANT
• Create and maintain schemas in advance:
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• Transform, copy, or move data


Closing The Gap Between Different Data Sources Using Drill

Product database (NoSQL: HBase / MapR-DB)
• Prod_id
• Productname
• Category
• Price

Website click logs (JSON)
• Trans_id
• Sess_date
• Cust_id
• Device
• Prod_id
• Purch_flag

Customers (CSV)
• Cust_id
• Customername
• State
• Gender
• Agg_rev
• Age
• Membership
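
The three sources share the keys Prod_id and Cust_id, which is what lets a single query join them. The sketch below imitates such a join in memory to show the shape of the result, here purchase revenue per customer state. Only the field names come from the slide; every row value and the `revenue_by_state` helper are made up for illustration, and in Drill the join would be expressed in SQL against the sources themselves.

```python
from collections import defaultdict

# Hypothetical sample rows; only the field names are taken from the slide.
clicks = [  # website click logs (JSON)
    {"trans_id": 1, "cust_id": 10, "device": "mobile", "prod_id": 100, "purch_flag": True},
    {"trans_id": 2, "cust_id": 11, "device": "desktop", "prod_id": 101, "purch_flag": False},
    {"trans_id": 3, "cust_id": 11, "device": "desktop", "prod_id": 100, "purch_flag": True},
]
products = [  # product database (HBase / MapR-DB)
    {"prod_id": 100, "productname": "Widget", "category": "Tools", "price": 20.0},
    {"prod_id": 101, "productname": "Gadget", "category": "Toys", "price": 35.0},
]
customers = [  # customers (CSV)
    {"cust_id": 10, "customername": "Ann", "state": "CA"},
    {"cust_id": 11, "customername": "Bob", "state": "CO"},
]


def revenue_by_state(clicks, products, customers):
    """Join clicks to products and customers, summing prices of purchases per state."""
    price_by_product = {p["prod_id"]: p["price"] for p in products}
    state_by_customer = {c["cust_id"]: c["state"] for c in customers}
    totals = defaultdict(float)
    for click in clicks:
        if click["purch_flag"]:  # count only clicks that ended in a purchase
            state = state_by_customer[click["cust_id"]]
            totals[state] += price_by_product[click["prod_id"]]
    return dict(totals)
```

The lookup dictionaries play the role of the join keys: each click is matched to one product and one customer without first copying the data into a common store.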


Demo


In lieu of the live demonstration, see the recordings linked below:

• Apache Drill with Tableau (4:28):
https://www.youtube.com/watch?v=EH0_vRTAkyk

• Twitter analytics with Apache Drill and MicroStrategy (5:02):
https://www.youtube.com/watch?v=-gqwgahtc2Y

• Analyzing JSON and Packet Data with SAP Lumira and Apache Drill:
https://www.youtube.com/watch?v=s-fEATDI2wA



Case Studies


Self-Service Data Exploration
Direct access to any data store from familiar tools; ANSI SQL compatible

• Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
• Data stores: Files, Directories, Hive, HBase, …
• Formats: {JSON}, Parquet, Text Files, …


Data Warehouse Offload with Drill & MapR
Ultimately replace existing expensive SQL analytics platform with Hadoop

OBJECTIVES / CHALLENGES
• Mine credit card data and compare consumer shopping habits
• Require internal SQL specialists to gain instant access to data at all times
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company

SOLUTION
• Apache Drill allows interactive analysis on large datasets with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users did not have to learn Pig, HiveQL or any other language and continue to use Tableau and SQuirreL on top of Drill

Business Impact Potential
• Hadoop and Drill dramatically reduce the price point to less than $1,000 / TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill-sets on Hadoop without retraining


Telecom OEM application with Drill & MapR
Leverage Drill’s JSON capabilities to create revenue-generating IoT services

OBJECTIVES / CHALLENGES
• Offer service to mobile operators to proactively monitor and improve their subscriber experience
• Instant availability of data from diverse and disparate sources
• Data is very diverse and dynamic using JSON as the key format
• Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire

SOLUTION
• Apache Drill is being used to build the engine for the interactive experience
• Drill allows SQL queries on incoming JSON structures natively without requiring any centralized schema definitions
• Drill connects to all BI tools using standard ODBC connectors

Business Impact Potential
• Provide new revenue-generating services to mobile operators
• Enable deeper, instant intelligence about the networks and users
• Reduce maintenance costs: no IT intervention required for schema changes
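
Querying JSON without a centralized schema means the column set is discovered from the records themselves. The sketch below is a loose analogy for that idea, not a description of Drill's internals: it reads JSON records whose fields differ, derives the columns from what it finds, and fills gaps with `None`. The sample records and function names are hypothetical.

```python
import json

# Hypothetical incoming records; the second one carries a field the first lacks.
SAMPLE_LINES = [
    '{"cell_id": 1, "signal": -70}',
    '{"cell_id": 2, "signal": -85, "dropped_calls": 3}',
]


def read_records(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def discover_columns(records):
    """Collect field names in first-seen order instead of declaring them up front."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def project(records, columns):
    """Return one tuple per record, using None where a field is absent."""
    return [tuple(record.get(column) for column in columns) for record in records]
```

When a new field such as `dropped_calls` starts to appear, it shows up as a new column on the next read, with no schema change request in between.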


Recap: Apache Drill enables Self Service SQL for Big Data

AGILITY: INSTANT INSIGHTS TO BIG DATA
• Direct queries on self-describing data
• No schemas or ETL required

FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
• Query HBase and other NoSQL stores
• Use SQL to natively operate on complex data types (such as JSON)

FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
• Leverage ANSI SQL skills and BI tools
• Plug-n-play with Hive schema, file formats, UDFs


Learn more and get started with Apache Drill

New to MapR and/or Drill?

– Get started with free MapR On Demand training: https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training-all

– Test drive Drill on cloud with Amazon EMR: https://mapr.orbitera.com/c2m/customer/testDrives/view/471

– Learn how to use Drill with Hadoop using the MapR sandbox: https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill

Ready to play with your data?

– Try out the Apache Drill in 10 minutes guide on your desktop: http://drill.apache.org/docs/drill-in-10-minutes/

– Download Drill for your MapR cluster and start exploration: https://www.mapr.com/products/apache-drill

• Use both with relational and JSON datasets

– Comprehensive tutorials and documentation available: http://drill.apache.org/docs/tutorials-introduction/ and http://drill.apache.org/docs/

Ask questions – [email protected]


Thank You

@mapr maprtech

[email protected]@mapr.com

MapRTechnologies

maprtech

mapr-technologies


Backup Slides


MapR with Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key:

• Number indicates company’s relative strength across all vectors

• Size of ball indicates company’s relative strength along individual vector

“Like other vendors’ offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with.”


SQL technologies available on MapR

Attribute | Drill | Hive | Impala | Spark SQL
Key Use Cases | Self-service Data Exploration; Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows
Data Sources: Files support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO …) | Parquet, JSON, Text, all Hive file formats
Data Sources: HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Same as Hive
Data Sources: Beyond Hadoop | Yes | No | No | Yes
Data Types: Relational | Yes | Yes | Yes | Yes
Data Types: Complex/Nested | Yes | Limited | No | Limited
Metadata: Schema-less / Dynamic schema | Yes | No | No | Limited
Metadata: Hive Metastore | Yes | Yes | Yes | Yes
SQL / BI tools: SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL
SQL / BI tools: Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC
Beyond Memory | Yes | Yes | Yes | Yes
Optimizer | Limited | Limited | Limited | Limited
Platform: Latency | Low | Medium | Low | Low (in-memory) / Medium
Platform: Concurrency | High | Medium | High | Medium


MapR: Best Solution for Customer Success

Premier Investors

High Growth
• 2X Growth In Direct Customers
• 90% Subscription Licenses
• Software Margins
• 140% Dollar-based Net Expansion
• 700+ Customers
• 2X Growth In Annual Subscriptions (ACV)

Best Product

Apache Open Source


Key Reasons for Selecting MapR

Respondents who had prior experience with another Hadoop distribution*

* Apache Hadoop, Cloudera or Hortonworks


MapR Provides the Only Real-Time Distribution

Without MapR:
1. Analytics with 1st generation SQL-on-Hadoop requires ETL and schema creation.
2. Operational apps on HBase/Accumulo must be run in a separate cluster from the analytics cluster.
3. HBase/Accumulo suffer from service disruptions due to compactions, garbage collection, and region splits. All data movement into HDFS forces batch processing.

With MapR:
1. Apache Drill provides immediate self-service data exploration with no waiting on IT.
2. MapR-DB runs in the same cluster as the analytics cluster (Hadoop), to avoid batch data copies across clusters.
3. MapR-DB architecture ensures consistently high responsiveness (low latency). MapR ingests data in real-time via MapR-DB, HDFS API, and NFS.


MapR: The Only Platform Architected For Big, Fast, Reliable

[Architecture diagram with callouts]

APACHE HADOOP AND OSS ECOSYSTEM

EXECUTION ENGINES
• Batch: Pig, Cascading, Spark, MapReduce v1 & v2, Tez
• Streaming: Spark Streaming, Storm
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• SQL: Hive, Impala, Spark SQL, Drill

DATA GOVERNANCE AND OPERATIONS
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue
• Security: Sentry
• Workflow & Data Governance: Oozie
• Provisioning & coordination: ZooKeeper, Juju, Savannah

MapR Data Platform (Random Read/Write)
• MapR-FS (HDFS and NFS APIs)
• MapR-DB (High-Performance NoSQL)

Callouts:
• More efficient use of infrastructure (30-50% lower TCO)
• First new database designed for operational real-time
• Your choice of SQL
• Industry’s only mirroring, point-in-time consistent snapshots
• Trillion files
• 2-11x faster
• Open source projects ‘inherit’ MapR’s platform attributes