© 2014 MapR Technologies
Tomer Shiran, VP Product Management
Empowering “as it happens” businesses by speeding up the data-to-action cycle
Top-Ranked NoSQL
Top-Ranked Hadoop Distribution
Top-Ranked SQL-on-Hadoop Solution
Data is doubling in size every two years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, with series for structured and semi-structured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Data Increasingly Stored in Non-Relational Datastores
RELATIONAL DATABASES vs. NON-RELATIONAL DATASTORES (timeline: 1980 to 2020)
• Volume: relational MBs-TBs; non-relational TBs-PBs
• Structure: relational structured; non-relational structured, semi-structured and unstructured
• Development: relational planned (release cycle = months-years); non-relational iterative (release cycle = days-weeks)
• Database: relational fixed schema, DBA controls structure; non-relational dynamic schema (schema-free), application controls structure
SQL in a Non-Relational World
WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
DON’T WANT
• Create and maintain schemas on:
  – HDFS (Parquet, JSON, etc.)
  – HBase
  – …
• Transform or copy data
APACHE DRILL
• Data exploration for Hadoop and NoSQL
• Low latency at scale
• Point-and-query vs. schema-first
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Agility
• Explore big data in its native format without IT intervention
Flexibility
• Analyze semi-structured/nested data coming from NoSQL applications and JSON sources
Plug-and-Play
• Leverage existing SQL skillsets, BI tools and Apache Hive deployments
Drill is the Top-Ranked SQL-on-Hadoop
Evolution Towards Self-Service Data Exploration
• Traditional BI w/ RDBMS: schema management and transformation IT-dependent; data visualization IT-dependent
• Self-Service BI w/ RDBMS: schema management and transformation IT-dependent; data visualization self-service
• “Gen 1” SQL-on-Hadoop: schema management and transformation IT-dependent; data visualization self-service
• Self-Service Data Exploration: schema management and transformation self-service; data visualization self-service
Zero-day analytics
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured data
• Schemas in Hive Metastore
Relational data
• Highly-structured data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration (known and unknown questions)
Leverage Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
Combine Data from Multiple Sources on the Fly
SELECT * FROM dfs.demo.`yelp/business.json`
• dfs: a storage plugin instance (DFS, HBase/MapR-DB, Hive Metastore/HCatalog)
• demo: a workspace (sub-directory, Hive database, HBase namespace)
• `yelp/business.json`: a table (pathname, HBase table, Hive table)
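The three-part naming in the query above can be illustrated with a small sketch. This is not Drill's parser: the function name `split_table_reference` is hypothetical, and the pattern assumes exactly a plugin, a workspace and a table, with the table optionally wrapped in backticks.

```python
import re


def split_table_reference(ref):
    """Split 'plugin.workspace.table' into its three parts.

    Simplified assumption: exactly three parts, where the table may be
    wrapped in backticks (as it must be when it contains '/' or '.').
    """
    match = re.fullmatch(r"(\w+)\.(\w+)\.(?:`([^`]+)`|(\w+))", ref)
    if match is None:
        raise ValueError(f"unsupported table reference: {ref!r}")
    plugin, workspace, quoted, bare = match.groups()
    return {"plugin": plugin, "workspace": workspace, "table": quoted or bare}


# The reference used on the slide.
parts = split_table_reference("dfs.demo.`yelp/business.json`")
print(parts)
```

For the slide's query, the plugin is `dfs`, the workspace is `demo`, and the table is the file path `yelp/business.json`.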
How Drill Achieves Data Agility
Drill’s Data Model is Flexible
[Chart: data formats placed on two flexibility axes, schema-less to fixed schema and complex to flat: JSON, BSON; HBase; Parquet/Avro; CSV, TSV]
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
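As a rough illustration of discovering a schema from self-describing data, the sketch below walks two nested records (modeled on the Michael and Jennifer example from the data model slide) and collects every field path it sees. It is a toy in plain Python, not Drill's implementation; `discover_schema` is a hypothetical helper name.

```python
def discover_schema(records):
    """Return {dotted.field.path: sorted list of observed Python type names}."""
    schema = {}

    def visit(value, path):
        if isinstance(value, dict):
            # Descend into nested maps, extending the dotted path.
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        else:
            # Leaf value (scalar or list): record the type seen at this path.
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        visit(record, "")
    return {path: sorted(types) for path, types in sorted(schema.items())}


records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(records)
for path, types in schema.items():
    print(path, types)
```

No schema is declared before the data is read: `district` and `preschool` each appear in only one record, yet both end up in the discovered schema.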
$50M in Free Training
Q & A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies
Under the Hood
High Level Architecture
• Cluster of commodity servers
  – Daemon (drillbit) on each node
• ZooKeeper maintains ephemeral cluster membership information
  – Drillbit uses ZooKeeper to find other drillbits in the cluster
  – Client uses ZooKeeper to find drillbits
• Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
  – Better performance and manageability
• Data processing unit is columnar record batches
  – Enables schema flexibility with negligible performance impact
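To make the idea of a columnar record batch concrete, here is a minimal sketch that groups row-style records into fixed-size batches and stores each batch column by column. It only illustrates the concept; the function name, the batch size and the sample rows (including the third record) are assumptions for this example and do not reflect Drill's internal data structures.

```python
def to_column_batches(rows, batch_size):
    """Yield batches shaped as {column_name: [values...]}.

    Each batch derives its own column list from the rows it contains;
    a field missing from a row is stored as None.
    """
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = []
        for row in chunk:
            for name in row:
                if name not in columns:
                    columns.append(name)
        yield {name: [row.get(name) for row in chunk] for name in columns}


# Hypothetical sample rows; the third one has a different set of fields.
rows = [
    {"name": "Michael", "age": 6},
    {"name": "Jennifer", "age": 3},
    {"name": "Sam", "district": "Los Altos"},
]

batches = list(to_column_batches(rows, batch_size=2))
for batch in batches:
    print(batch)
```

Because every batch carries its own column list, the second batch can have different columns from the first without any upfront declaration, which is one way to picture schema flexibility at the batch level.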
Drill Maximizes Data Locality
Data Source: Best Practice
• HDFS or MapR-FS: drillbit on each DataNode
• HBase or MapR-DB: drillbit on each RegionServer
• MongoDB: drillbit on each mongod node (when using replicas, run it on the replica node)
[Diagram: three nodes, each running a drillbit alongside a DataNode/RegionServer/mongod, coordinated by ZooKeeper nodes]
Core Modules within drillbit
[Diagram: SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, and Storage Plugins (Hive, HBase, MongoDB, DFS)]
SELECT * Query Execution
1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
[Diagram: client (JDBC, ODBC, REST), ZooKeeper nodes and drillbits]
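The five numbered steps can be traced in a toy, single-process simulation. Everything here is an illustrative assumption: `ToyRegistry` stands in for ZooKeeper, `ToyDrillbit` for a drillbit, a Python predicate for the query, and fragments run sequentially rather than in parallel. It does not use or model any real Drill API.

```python
class ToyRegistry:
    """Stands in for ZooKeeper: tracks which drillbits are registered."""

    def __init__(self):
        self._members = {}

    def register(self, drillbit):
        self._members[drillbit.name] = drillbit

    def members(self):
        return list(self._members.values())


class ToyDrillbit:
    """Stands in for a drillbit that holds some rows locally."""

    def __init__(self, name, local_rows):
        self.name = name
        self.local_rows = local_rows

    def run_fragment(self, predicate):
        # Each drillbit scans and filters only its own local rows.
        return [row for row in self.local_rows if predicate(row)]

    def submit(self, predicate, registry):
        # Step 3: "plan" the query as one scan-and-filter fragment per drillbit.
        fragments = [(bit, predicate) for bit in registry.members()]
        # Step 4: farm out the fragments (sequentially in this toy).
        partials = [bit.run_fragment(pred) for bit, pred in fragments]
        # Step 5: merge the partial results and return them to the client.
        return [row for part in partials for row in part]


registry = ToyRegistry()
registry.register(ToyDrillbit("bit-1", [{"stars": 5}, {"stars": 2}]))
registry.register(ToyDrillbit("bit-2", [{"stars": 4}]))

# Step 1: the client finds the drillbits through the registry.
drillbits = registry.members()
# Step 2: the client submits the query to one of them.
results = drillbits[0].submit(lambda row: row["stars"] >= 4, registry)
print(results)
```

The drillbit that receives the query does the planning and merging, while every registered drillbit filters its own local rows.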