© 2015 MapR Technologies
Self-service BI for big data applications using Apache Drill
[Diagram: MapR Data Platform for Hadoop and NoSQL]
APACHE HADOOP AND OSS ECOSYSTEM
EXECUTION ENGINES
• Batch: Pig, Cascading, Spark
• Streaming: Spark Streaming, Storm
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• SQL: Hive, Impala, Spark SQL, Drill
• Also shown: YARN, MapReduce v1 & v2, Tez*
DATA GOVERNANCE AND OPERATIONS
• Security
• Provisioning & coordination: Juju, Savannah
• Workflow & Data Governance
• Data Integration & Access
• Components shown: Sentry, Oozie, ZooKeeper, Sqoop, Knox, Whirr, Falcon, Flume, HttpFS, Hue
MapR Data Platform: MapR-FS, MapR-DB
Management: MCS
Enterprise-grade Security, Operational Performance, Multi-tenancy, Interoperability
Data Is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: Total Data Stored and IT Resources, 1980 to 2020; series: STRUCTURED DATA, SEMI-STRUCTURED DATA]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Data Increasingly Stored in Non-Relational Datastores
[Timeline: 1980, 1990, 2000, 2010, 2020]
| | RELATIONAL DATABASES | NON-RELATIONAL DATASTORES |
| Volume | GBs-TBs | TBs-PBs |
| Database | Fixed schema; DBA controls structure | Dynamic / Flexible schema; Application controls structure |
| Structure | Structured | Structured, semi-structured and unstructured |
| Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks) |
How To Bring SQL Into An Unstructured Future?
Familiarity of SQL
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
Agility & Flexibility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transform or silos of data
Industry's First
Schema-free SQL engine
for Big Data
Apache Drill Brings Flexibility & Performance
Access to any data type, any data source
• Relational
• Nested data
• Schema-less
Rapid time to insights
• Query data in-situ
• No schemas required
• Easy to get started
Integration with existing tools
• ANSI SQL
• BI tool integration
Scale in all dimensions
• TB-PB of scale
• 1000s of users
• 1000s of nodes
Granular Security
• Authentication
• Row/column level controls
• De-centralized
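As a rough illustration of "Query data in-situ", the sketch below composes the kind of SQL statement Drill accepts against a file path, with no table definition created beforehand. The `dfs` name follows Drill's default file-system storage plugin; the file path and the helper name `in_situ_query` are hypothetical, and nothing here connects to a cluster.

```python
def in_situ_query(path: str, limit: int = 10, plugin: str = "dfs") -> str:
    """Build a SELECT that reads a file directly by path (no CREATE TABLE step).

    The path is wrapped in backticks, which is how Drill quotes identifiers.
    """
    if "`" in path:
        raise ValueError("path must not contain backticks")
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {plugin}.`{path}` LIMIT {limit}"


# Hypothetical path, used only to show the shape of the statement.
query = in_situ_query("/data/logs/clicks.json", limit=5)
print(query)
```

The resulting string could be submitted through any ODBC/JDBC client or BI tool; the point is only that the file path stands where a table name would normally go.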
Extending Self Service to Schema-free data
Use cases for BI
[Chart axis: Agility & Business Value over time]
• 1980s-1990s: IT-Driven BI (IT-created reports, spreadsheets)
• 2000s: IT-Driven BI and Self-Service BI (analyst-driven with IT support for ETL)
• Now: IT-Driven BI, Self-Service BI, and Schema-Free Data Exploration (analyst-driven with no IT dependency)
Enabling “As-It-Happens” Business with Instant Analytics
Governed approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data -> Users
Total time to insight: minutes
New business questions
Source data evolution
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured data
• Schemas in Hive Metastore
Relational data
• Highly-structured data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration
(known and unknown questions)
Access control that scales
• PAM Authentication + User Impersonation
• Fine-grained row and column level access control with Drill Views; no centralized security repository required
[Diagram: Users access Drill View 1 and Drill View 2, defined over Files, HBase, and Hive]
Granular security permissions through Drill views
Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins
| Name | City | State | Credit Card # |
| Dave | San Jose | CA | 1374-7914-3865-4817 |
| John | Boulder | CO | 1374-9735-1794-9711 |

Data Scientist View (/views/maskedcards.csv)
Owner: Admins; Permission: Data Scientists
Not a physical data copy
| Name | City | State | Credit Card # |
| Dave | San Jose | CA | 1374-1111-1111-1111 |
| John | Boulder | CO | 1374-1111-1111-1111 |

Business Analyst View
Owner: Admins; Permission: Business Analysts
| Name | City | State |
| Dave | San Jose | CA |
| John | Boulder | CO |
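The effect of the two views on this slide can be sketched in plain Python. This is only an illustration of the masking and projection logic, not how Drill implements views (in Drill the same result would come from view definitions in SQL). The function names are hypothetical, and the header row in the raw CSV is an assumption; the data values are the ones shown on the slide.

```python
import csv
import io

# Raw file contents as shown on the slide; the header row is an assumption.
RAW_CSV = """Name,City,State,Credit Card #
Dave,San Jose,CA,1374-7914-3865-4817
John,Boulder,CO,1374-9735-1794-9711
"""


def read_raw(text):
    """Parse the raw CSV into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def mask_card(number):
    """Keep the first digit group; replace every digit in later groups with '1'."""
    head, *rest = number.split("-")
    return "-".join([head] + ["1" * len(group) for group in rest])


def data_scientist_view(rows):
    """All columns, with the card number masked. Computed on read, not stored."""
    return [{**row, "Credit Card #": mask_card(row["Credit Card #"])} for row in rows]


def business_analyst_view(rows):
    """Only the non-sensitive columns."""
    return [{key: row[key] for key in ("Name", "City", "State")} for row in rows]


rows = read_raw(RAW_CSV)
masked = data_scientist_view(rows)
public = business_analyst_view(rows)
print(masked[0]["Credit Card #"])
print(public[1])
```

Both views are derived from the raw rows each time they are read, which mirrors the "not a physical data copy" point: the raw file stays restricted to admins while each audience sees only its own projection.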
Business Benefits
Rapid time-to-value for business analysts:
SQL specialists and BI analysts can query any dataset—including complex
nested data—instantly, versus waiting several weeks for data preparation by IT.
Efficiency with easy governance for IT:
IT can avoid unnecessary ETL cycles and schema maintenance activities, but
still ensure governance through easy-to-deploy granular access controls.
Accelerated big data adoption for businesses:
Organizations can use the existing and large SQL talent base and tools to
rapidly discover new business insights from big data.
Quick Tour
Self-Service Data Exploration with Apache Drill
Data is growing fast and is scattered across various silos:
Website click logs
• JSON files
Product database
• MapR-DB NoSQL
Customers
• CSV files
Apache Drill: SQL in a Non-Relational World
WANT
• ANSI SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• Agility
DON’T WANT
• Create and maintain schemas in advance:
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• Transform, copy, or move data
Closing The Gap Between Different Datasources using Drill
Product database (NoSQL: HBase / MapR-DB)
• Prod_id
• Productname
• Category
• Price
Website click logs (JSON)
• Trans_id
• Sess_date
• Cust_id
• Device
• Prod_id
• Purch_flag
Customers (CSV)
• Cust_id
• Customername
• State
• Gender
• Agg_rev
• Age
• Membership
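To make the idea of joining these three sources concrete, here is a minimal sketch in plain Python that joins JSON click records, a key-value product table, and CSV customer records on the Cust_id and Prod_id fields listed above. It only illustrates what such a cross-source join computes; Drill would express it as a single SQL join. All sample values are invented for illustration, and the dict standing in for the NoSQL table is an assumption.

```python
import csv
import io
import json

# Hypothetical sample data; field names follow the slide, values are invented.
CLICKS_JSONL = """{"trans_id": 1, "cust_id": "c1", "prod_id": "p1", "purch_flag": true}
{"trans_id": 2, "cust_id": "c2", "prod_id": "p2", "purch_flag": false}
{"trans_id": 3, "cust_id": "c1", "prod_id": "p2", "purch_flag": true}
"""

# Stand-in for the NoSQL product table, keyed by prod_id.
PRODUCTS = {
    "p1": {"productname": "widget", "category": "tools", "price": 10.0},
    "p2": {"productname": "gadget", "category": "toys", "price": 25.0},
}

CUSTOMERS_CSV = """cust_id,customername,state
c1,Ann,CA
c2,Bob,CO
"""


def revenue_by_state(clicks_text, products, customers_text):
    """Sum product prices for purchased clicks, grouped by customer state."""
    customers = {
        row["cust_id"]: row for row in csv.DictReader(io.StringIO(customers_text))
    }
    totals = {}
    for line in clicks_text.splitlines():
        click = json.loads(line)
        if not click["purch_flag"]:
            continue  # only count clicks that ended in a purchase
        state = customers[click["cust_id"]]["state"]
        price = products[click["prod_id"]]["price"]
        totals[state] = totals.get(state, 0.0) + price
    return totals


totals = revenue_by_state(CLICKS_JSONL, PRODUCTS, CUSTOMERS_CSV)
print(totals)
```

The shared keys (Cust_id between clicks and customers, Prod_id between clicks and products) are what make the join possible without first copying the three datasets into one store.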
Demo
In lieu of the live demonstration, please find links below:
• Apache Drill with Tableau (4:28): https://www.youtube.com/watch?v=EH0_vRTAkyk
• Twitter analytics with Apache Drill and MicroStrategy (5:02): https://www.youtube.com/watch?v=-gqwgahtc2Y
• Analyzing JSON and Packet Data with SAP Lumira and Apache Drill: https://www.youtube.com/watch?v=s-fEATDI2wA
Case Studies
Self-Service Data Exploration
Direct access to any data store from familiar tools - ANSI SQL compatible
• Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
• Data stores: Files, Directories, Hive, HBase, …
• Formats: {JSON}, Parquet, Text Files, …
Data Warehouse Offload with Drill & MapR
Ultimately replace existing expensive SQL analytics platform with Hadoop
OBJECTIVES AND CHALLENGES
• Mine credit card data and compare consumer shopping habits
• Require internal SQL specialists to gain instant access to data at all times
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company
SOLUTION
• Apache Drill allows interactive analysis on large datasets with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users did not have to learn Pig, HiveQL or any other language and continue to use Tableau and SQuirreL on top of Drill
Business Impact Potential
• Hadoop and Drill dramatically reduce the price point to less than $1,000 / TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill-sets on Hadoop without retraining
Telecom OEM application with Drill & MapR
Leverage Drill’s JSON capabilities to create revenue-generating IoT services
OBJECTIVES AND CHALLENGES
• Offer service to mobile operators to proactively monitor and improve their subscriber experience
• Instant availability of data from diverse and disparate sources
• Data is very diverse and dynamic using JSON as the key format
• Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire
SOLUTION
• Apache Drill is being used to build the engine for the interactive experience
• Drill allows SQL queries on incoming JSON structures natively without requiring any centralized schema definitions
• Drill connects to all BI tools using standard ODBC connectors
Business Impact Potential
• Provide new revenue-generating services to mobile operators
• Enable deeper, instant intelligence about the networks and users
• Reduce maintenance costs - no IT intervention required for schema changes
Recap: Apache Drill enables Self-Service SQL for Big Data
AGILITY: INSTANT INSIGHTS TO BIG DATA
• Direct queries on self-describing data
• No schemas or ETL required
FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
• Query HBase and other NoSQL stores
• Use SQL to natively operate on complex data types (such as JSON)
FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
• Leverage ANSI SQL skills and BI tools
• Plug-n-play with Hive schema, file formats, UDFs
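The phrase "self-describing data" can be illustrated with a short Python sketch: the column set is discovered from the JSON records as they are read, rather than declared up front, and nested fields are addressed by dotted paths. This is a conceptual sketch only and makes no claim about Drill's internals; the sample records and function names are hypothetical.

```python
import json

# Hypothetical records; note that the two rows do not share the same fields.
RECORDS = """{"id": 1, "user": {"name": "Dave", "geo": {"state": "CA"}}}
{"id": 2, "user": {"name": "John"}, "device": "mobile"}
"""


def flatten(record, prefix=""):
    """Turn nested objects into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def discover_columns(rows):
    """Collect column names in first-seen order; no schema is declared up front."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns


rows = [flatten(json.loads(line)) for line in RECORDS.splitlines()]
columns = discover_columns(rows)
# Fields missing from a record come back as None rather than causing an error.
table = [[row.get(name) for name in columns] for row in rows]
print(columns)
print(table)
```

When a new field such as `device` appears in later records, it simply becomes another column on the next read, which is the property the "no schemas or ETL required" bullet refers to.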
Learn more and get started with Apache Drill
New to MapR and/or Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with Amazon EMR
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your MapR cluster and start exploration
• Use both with relational and JSON datasets
– Comprehensive tutorials and documentation available
Ask questions – [email protected]
Thank You
@mapr maprtech
[email protected]@mapr.com
MapRTechnologies
maprtech
mapr-technologies
Backup Slides
MapR with Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
Like other vendors’ offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with.
SQL technologies available on MapR
| Category | Feature | Drill | Hive | Impala | Spark SQL |
| Key Use Cases | | Self-service Data Exploration; Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows |
| Data Sources | Files support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO …) | Parquet, JSON, Text, all Hive file formats |
| Data Sources | HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Same as Hive |
| Data Sources | Beyond Hadoop | Yes | No | No | Yes |
| Data Types | Relational | Yes | Yes | Yes | Yes |
| Data Types | Complex/Nested | Yes | Limited | No | Limited |
| Metadata | Schema-less / Dynamic schema | Yes | No | No | Limited |
| Metadata | Hive Meta store | Yes | Yes | Yes | Yes |
| SQL / BI tools | SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL |
| SQL / BI tools | Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC |
| | Beyond Memory | Yes | Yes | Yes | Yes |
| | Optimizer | Limited | Limited | Limited | Limited |
| Platform | Latency | Low | Medium | Low | Low (in-memory) / Medium |
| Platform | Concurrency | High | Medium | High | Medium |
MapR: Best Solution for Customer Success
Premier
Investors
High Growth
2X Growth In Direct Customers
90% Subscription Licenses
Software Margins
140% Dollar-based Net Expansion
700+ Customers
2X Growth In Annual Subscriptions (ACV)
Best Product
Apache Open Source
Key Reasons for Selecting MapR
Respondents who had prior experience with another Hadoop distribution*
* Apache Hadoop, Cloudera or Hortonworks
MapR Provides the Only Real-Time Distribution
1. Analytics with 1st generation SQL-on-Hadoop requires ETL and schema creation.
   -> Apache Drill provides immediate self-service data exploration with no waiting on IT.
2. Operational apps on HBase/Accumulo must be run in a separate cluster from the analytics cluster.
   -> MapR-DB runs in the same cluster as the analytics cluster (Hadoop), to avoid batch data copies across clusters.
3. HBase/Accumulo suffer from service disruptions due to compactions, garbage collection, and region splits. All data movement into HDFS forces batch processing.
   -> MapR-DB architecture ensures consistently high responsiveness (low latency). MapR ingests data in real-time via MapR-DB, HDFS API, and NFS.
MapR: The Only Platform Architected For Big, Fast, Reliable
[Diagram: MapR platform stack with callouts]
APACHE HADOOP AND OSS ECOSYSTEM
EXECUTION ENGINES
• Batch: Pig, Cascading, Spark
• Streaming: Spark Streaming, Storm
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• SQL: Hive, Impala, Spark SQL, Drill
• Also shown: YARN, MapReduce v1 & v2, Tez
DATA GOVERNANCE AND OPERATIONS
• Security
• Provisioning & coordination: Juju, Savannah
• Workflow & Data Governance
• Data Integration & Access
• Components shown: Sentry, Oozie, ZooKeeper, Sqoop, Flume, HttpFS, Hue
MapR Data Platform (Random Read/Write)
• MapR-FS (HDFS and NFS APIs)
• MapR-DB (High-Performance NoSQL)
Callouts:
• More efficient use of infrastructure (30-50% lower TCO)
• First new database designed for operational real-time
• Your choice of SQL
• Industry’s only mirroring, point-in-time consistent snapshots
• Trillion files
• 2-11x faster
• Open source projects ‘inherit’ MapR’s platform attributes