Analyzing Real-World Data with Apache Drill
Tomer Shiran, VP Product Management, MapR Technologies; Co-Founder, PMC Member and Committer, Apache Drill
November 20, 2014
© 2014 MapR Technologies


Data is doubling in size every two years

IDC estimates that in 2020, there will be 44 zettabytes of data in the world

2011: 1.8 zettabytes
2013: 4.4 zettabytes
2020: 44 zettabytes (estimate)

Source: IDC Digital Universe
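As a rough check of the "doubling every two years" claim against the IDC figures, the implied doubling time can be computed from any two points on the chart. The sketch below assumes steady exponential growth between the two points, which is an assumption of this example and not something the slide states; the function name is illustrative.

```python
import math


def doubling_time_years(start_zb, end_zb, years):
    """Years needed to double, assuming constant exponential growth."""
    return years * math.log(2) / math.log(end_zb / start_zb)


# Figures from the IDC Digital Universe chart (zettabytes).
observed = doubling_time_years(1.8, 4.4, 2013 - 2011)
projected = doubling_time_years(4.4, 44.0, 2020 - 2013)

print(f"2011-2013: doubles every {observed:.2f} years")
print(f"2013-2020: doubles every {projected:.2f} years")
```

Both intervals give a doubling time in the neighborhood of two years, which is consistent with the headline claim.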

Unstructured data will account for more than 80% of the data collected by organizations

[Chart: total data stored, 1980-2020, structured data vs. unstructured data]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

NoSchema Datastores are Capturing this Data

[Timeline: 1980-2020]

|             | RELATIONAL DATABASES                   | “NOSCHEMA” DATASTORES                                          |
|-------------|----------------------------------------|----------------------------------------------------------------|
| Volume      | MBs-GBs                                | TBs-PBs                                                        |
| Database    | Fixed schema; DBA controls structure   | Dynamic schema (schema-free); application controls structure   |
| Structure   | Structured                             | Structured, semi-structured and unstructured                   |
| Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)                         |

SQL in the Big Data World

WANT
•  SQL
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
•  Scalability

DON’T WANT
•  Create and maintain schemas on:
   –  HDFS (Parquet, JSON, etc.)
   –  HBase
   –  MongoDB
•  Transform or copy data

We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores

APACHE DRILL

•  Schema-free scale-out query engine for Hadoop and NoSQL
•  Point-and-query vs. schema-first
•  Low latency
•  Extreme ease of use
•  Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

40+ contributors
150+ years of experience building databases and distributed systems

Evolution Towards Self-Service Data Exploration

|                                  | Traditional BI w/ RDBMS | Self-Service BI w/ RDBMS | SQL-on-Hadoop | Self-Service Data Exploration |
|----------------------------------|-------------------------|--------------------------|---------------|-------------------------------|
| Data Modeling and Transformation | IT-driven               | IT-driven                | IT-driven     | Not needed                    |
| Data Visualization               | IT-driven               | Self-service             | Self-service  | Self-service                  |

Zero-day analytics

Drill’s Data Model is Flexible

[Diagram: data sources placed along two axes of flexibility, schema (fixed schema to schema-less) and structure (flat to complex). Sources shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]

RDBMS/SQL-on-Hadoop table:

    Name      Gender  Age
    Michael   M       6
    Jennifer  F       3

Apache Drill table:

    {
      name: {
        first: Michael,
        last: Smith
      },
      hobbies: [ski, soccer],
      district: Los Altos
    }
    {
      name: {
        first: Jennifer,
        last: Gates
      },
      hobbies: [sing],
      preschool: CCLC
    }
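To make the contrast concrete, the following sketch walks the two records from the Apache Drill table and collects the field paths that actually occur, without any schema declared up front. It is only an illustration of the idea in plain Python; the helper name `field_paths` is hypothetical and this is not how Drill is implemented.

```python
def field_paths(record, prefix=""):
    """Return the set of dotted field paths present in one nested record."""
    paths = set()
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested maps such as name.first / name.last.
            paths |= field_paths(value, prefix=path + ".")
        else:
            paths.add(path)
    return paths


# The two records shown on the slide; note that their fields differ.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

discovered = sorted(set().union(*(field_paths(r) for r in records)))
print(discovered)
```

The two records share `name` and `hobbies` but each carries one field the other lacks, which a fixed relational table could not hold without a schema change.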

Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
•  Fixed schema
•  Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
•  Fixed schema, evolving schema or schema-less
•  Leverage schema in centralized repository or self-describing data

Native JSON

JSON query with Oracle:

    SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER)
    FROM j_purchaseorder;

JSON query with Drill:

    SELECT po_document.AllowPartialShipment
    FROM j_purchaseorder;

Relational databases cannot provide true schema-free JSON support.


Architecture

High Level Architecture

•  Cluster of commodity servers
   –  Daemon (drillbit) on each node
•  No dependency on other execution engines (MapReduce, Spark, Tez)
   –  Better performance and manageability
•  ZooKeeper maintains ephemeral cluster membership information
   –  drillbit uses ZooKeeper to find other drillbits in the cluster
   –  Client uses ZooKeeper to find drillbits
•  Data processing unit is columnar record batches
   –  Enables schema flexibility with negligible performance impact
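The idea of a columnar record batch can be sketched in a few lines: a group of rows is stored as one list per field, and a field that appears only in later rows is simply added as a new column. This is a toy model written for illustration (the function name and the use of `None` for missing values are assumptions of the sketch), not Drill's internal representation.

```python
def to_columnar_batch(rows):
    """Turn a list of row dicts into a dict of equal-length column lists."""
    columns = {}
    for index, row in enumerate(rows):
        for key, value in row.items():
            # A field seen for the first time is back-filled with None
            # for all earlier rows, then receives the current value.
            columns.setdefault(key, [None] * index).append(value)
        for key, values in columns.items():
            if key not in row:
                # Known field missing from this row: pad with None.
                values.append(None)
    return columns


rows = [
    {"name": "Michael", "district": "Los Altos"},
    {"name": "Jennifer", "preschool": "CCLC"},
]
batch = to_columnar_batch(rows)
print(batch)
```

Because each batch carries its own set of columns, a change in the fields between batches does not require a table-wide schema.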

Drill Maximizes Data Locality

| Data Source      | Best Practice                                                                    |
|------------------|----------------------------------------------------------------------------------|
| HDFS or MapR-FS  | drillbit on each DataNode                                                        |
| HBase or MapR-DB | drillbit on each RegionServer                                                    |
| MongoDB          | drillbit on each mongod node (when using replicas, run it on the replica node)   |

[Diagram: a drillbit co-located with each DataNode/RegionServer/mongod, alongside a ZooKeeper ensemble]

SELECT* Query Execution

[Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, and drillbits]

1.  Find drillbits (once per session)
2.  Submit query to drillbit
3.  Create logical and physical execution plans
4.  Farm out execution of fragments to cluster (completely distributed execution)
5.  Return results to client

* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4

Core Modules within drillbit

•  RPC Endpoint
•  SQL Parser
•  Optimizer
•  Logical Plan
•  Physical Plan
•  Execution
•  Distributed Cache
•  Storage Plugins: Hive, HBase, MongoDB, DFS


Example: Analyzing Real-World Data

Demo Plan

1.  Run Drill
2.  Configure DFS and MongoDB storage plugins
3.  Explore the data
    –  Basics
    –  Complex data
    –  Views


Run Drill

Run Drill in Embedded Mode (sqlline)

    $ tar xf apache-drill-0.7.0.tar.gz
    $ cd apache-drill-0.7.0
    $ bin/sqlline -u jdbc:drill:zk=local
    > SELECT *
      FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json`
      LIMIT 1;
    +---------------+------------+--------------+------------+------------+
    | yelping_since |   votes    | review_count |    name    |  user_id   |
    +---------------+------------+--------------+------------+------------+
    | 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee        | qtrmBGNqCvupHMHL_bKFgQ |

•  drillbit (Drill daemon) starts automatically in embedded mode
•  No ZooKeeper in embedded mode (hence zk=local)
•  Can’t use BI clients (JDBC/ODBC) in embedded mode

You can now access the Web UI: http://localhost:8047

Or Run Drill in Distributed Mode…

•  Make sure ZooKeeper (zkServer) is running:

    $ zkServer start

•  Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
•  Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
•  Start drillbit:

    $ bin/drillbit.sh start

•  Access the Web UI: http://localhost:8047
•  Connect a client to the cluster (e.g., sqlline):

    $ bin/sqlline -u jdbc:drill:zk=localhost:2181

•  Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
•  If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
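The connection-string rules above are easy to get wrong by hand, so a small helper can assemble them. The sketch below is hypothetical glue code, not part of Drill: the function names are made up, the `/drill/<clustername>` suffix follows the form shown on the slide, and joining several ZooKeeper hosts with commas is an assumption. The second function is a scripted stand-in for the `telnet localhost 2181` check.

```python
import socket


def drill_jdbc_url(zk_hosts, cluster_name=None, zk_root="drill"):
    """Build a Drill JDBC URL of the form shown on the slide."""
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        # Needed only when several Drill clusters share one ensemble.
        url += f"/{zk_root}/{cluster_name}"
    return url


def zookeeper_reachable(host="localhost", port=2181, timeout=2.0):
    """Return True if a TCP connection to the ZooKeeper port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


single = drill_jdbc_url(["localhost:2181"])
named = drill_jdbc_url(["localhost:2181"], cluster_name="mycluster")
print(single)
print(named)
```

Calling `zookeeper_reachable()` before starting a drillbit gives the same yes/no answer as the telnet check; `mycluster` is a placeholder for the cluster name defined in conf/drill-override.conf.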


Configure Storage Plugins


Enable MongoDB Storage Plugin

Define Workspaces in the DFS Storage Plugin


Explore the Data: Basics

Inventory: DFS Files

    {
      "votes": {"funny": 0, "useful": 2, "cool": 1},
      "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
      "review_id": "15SdjuK7DmYqUAj6rjGowg",
      "stars": 5,
      "date": "2007-05-17",
      "text": "dr. goldberg offers everything ...",
      "type": "review",
      "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
    }

Inventory: MongoDB Collections

    $ mongo
    MongoDB shell version: 2.6.5
    > show databases;
    admin  (empty)
    local  0.078GB
    yelp   0.453GB
    > use yelp
    > db.users.findOne()
    {
      "_id" : ObjectId("54566cdf3237149de181a92a"),
      "yelping_since" : "2012-02",
      "votes" : {
        "funny" : 1,
        "useful" : 5,
        "cool" : 0
      },
      "review_count" : 6,
      "name" : "Lee",
      "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
      "friends" : [ ]
    }

Let’s Go!

    > SELECT *
      FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
      WHERE stars = 1
      LIMIT 1;
    +------------+------------+------------+------------+------------+------------+------------+-------------+
    |   votes    |  user_id   | review_id  |   stars    |    date    |    text    |    type    | business_id |
    +------------+------------+------------+------------+------------+------------+------------+-------------+
    | {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1          | 2013-04-19 | I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review     | vcNAWiLM4dR7D2nwwJ7nCA |
    +------------+------------+------------+------------+------------+------------+------------+-------------+

Using Storage Plugins and Workspaces

    > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
    > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
    > SELECT * FROM mongo.yelp.users LIMIT 1;
    > USE mongo.yelp;
    > SELECT * FROM users LIMIT 1;

A table name has three parts: storage plugin, workspace, and table (for dfs, the path relative to the workspace).

| Storage Plugin | Workspace | Table                      |
|----------------|-----------|----------------------------|
| dfs            | Path      | Path relative to workspace |
| mongo          | Database  | Collection                 |
| hive           | Database  | Table                      |
| hbase          | Namespace | Table                      |
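The three-part naming scheme can be illustrated by splitting a qualified name into its parts. The parser below is a hypothetical helper written for this example, handling only the two shapes used in the queries above (a backtick-quoted path or a bare identifier as the last part); it is not Drill's own name resolution.

```python
import re

# plugin.workspace.`quoted/path` or plugin.workspace.table
TABLE_NAME = re.compile(
    r"^(?P<plugin>\w+)\.(?P<workspace>\w+)\.(?:`(?P<path>[^`]+)`|(?P<table>\w+))$"
)


def split_table_name(name):
    """Split a qualified name into (storage plugin, workspace, table)."""
    match = TABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a three-part table name: {name!r}")
    table = match.group("path") or match.group("table")
    return match.group("plugin"), match.group("workspace"), table


print(split_table_name("dfs.demo.`yelp/review.json`"))
print(split_table_name("mongo.yelp.users"))
```

After `USE mongo.yelp;` the first two parts become the session default, which is why the last query on the slide can refer to `users` alone.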

Most Common User Names (MongoDB)

    > SELECT name, count(*) AS users
      FROM mongo.yelp.users
      GROUP BY name
      ORDER BY users DESC LIMIT 10;
    +------------+------------+
    |    name    |   users    |
    +------------+------------+
    | David      | 2453       |
    | John       | 2378       |
    | Michael    | 2322       |
    | Chris      | 2202       |
    | Mike       | 2037       |
    | Jennifer   | 1867       |
    | Jessica    | 1463       |
    | Jason      | 1457       |
    | Michelle   | 1439       |
    | Brian      | 1436       |
    +------------+------------+

Cities with the Most Businesses

    > SELECT state, city, count(*) AS businesses
      FROM dfs.demo.`/yelp/business.json`
      GROUP BY state, city
      ORDER BY businesses DESC LIMIT 10;
    +------------+------------+-------------+
    |   state    |    city    | businesses  |
    +------------+------------+-------------+
    | NV         | Las Vegas  | 12021       |
    | AZ         | Phoenix    | 7499        |
    | AZ         | Scottsdale | 3605        |
    | EDH        | Edinburgh  | 2804        |
    | AZ         | Mesa       | 2041        |
    | AZ         | Tempe      | 2025        |
    | NV         | Henderson  | 1914        |
    | AZ         | Chandler   | 1637        |
    | WI         | Madison    | 1630        |
    | AZ         | Glendale   | 1196        |
    +------------+------------+-------------+


Explore the Data: Complex Data

business.json (1)

    {
      "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
      "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
      "hours": {
        "Monday": {"close": "23:00", "open": "07:00"},
        "Tuesday": {"close": "23:00", "open": "07:00"},
        "Friday": {"close": "00:00", "open": "07:00"},
        "Wednesday": {"close": "23:00", "open": "07:00"},
        "Thursday": {"close": "23:00", "open": "07:00"},
        "Sunday": {"close": "23:00", "open": "07:00"},
        "Saturday": {"close": "00:00", "open": "07:00"}
      },
      "open": true,
      "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
      "city": "Las Vegas",
      "review_count": 4084,
      "name": "Mon Ami Gabi",
      "neighborhoods": ["The Strip"],
      "longitude": -115.172588519464,

business.json (2)

      "state": "NV",
      "stars": 4.0,
      "attributes": {
        "Alcohol": "full_bar",
        "Noise Level": "average",
        "Has TV": false,
        "Attire": "casual",
        "Ambience": {
          "romantic": true,
          "intimate": false,
          "touristy": false,
          "hipster": false,
          "classy": true,
          "trendy": false,
          "casual": false
        },
        "Good For": {"dessert": false, "latenight": false, "lunch": false,
                     "dinner": true, "breakfast": false, "brunch": false}
      }
    }

Which Places Are Open Right Now (22:00)?

    > SELECT name, b.hours
      FROM dfs.demo.`yelp/business.json` b
      WHERE b.hours.Saturday.`open` < '22:00' AND
            b.hours.Saturday.`close` > '22:00'
      LIMIT 2;
    +------------+------------+
    |    name    |   hours    |
    +------------+------------+
    | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
    | Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
    +------------+------------+
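The WHERE clause works because zero-padded HH:MM strings sort in the same order as the times they represent. The sketch below reproduces that comparison in Python and also shows a limitation worth knowing: a business that closes at or after midnight (close "00:00") fails the simple test even though it is open at 22:00. The function names and the midnight-aware variant are illustrative additions, not part of the demo.

```python
def is_open_simple(hours, now):
    """Same string comparison as the query's WHERE clause."""
    return hours["open"] < now and hours["close"] > now


def is_open_overnight_aware(hours, now):
    """Variant that treats close <= open as closing after midnight."""
    opens, closes = hours["open"], hours["close"]
    if closes <= opens:
        return now >= opens or now < closes
    return opens <= now < closes


# Saturday hours taken from the slides.
chang_jiang = {"close": "22:30", "open": "11:00"}
mon_ami_gabi = {"close": "00:00", "open": "07:00"}

print(is_open_simple(chang_jiang, "22:00"))
print(is_open_simple(mon_ami_gabi, "22:00"))
print(is_open_overnight_aware(mon_ami_gabi, "22:00"))
```

For the demo this does not matter much, but a production query would need an extra condition for hours that wrap past midnight.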

It’s 10pm in Vegas and I Want Good Hummus!

    > SELECT name, stars, b.hours.Friday, categories
      FROM dfs.demo.`yelp/business.json` b
      WHERE b.hours.Friday.`open` < '22:00' AND
            b.hours.Friday.`close` > '22:00' AND
            REPEATED_CONTAINS(categories, 'Mediterranean') AND
            city = 'Las Vegas'
      ORDER BY stars DESC
      LIMIT 2;
    +------------+------------+------------+------------+
    |    name    |   stars    |   EXPR$2   | categories |
    +------------+------------+------------+------------+
    | Olives     | 4.0        | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
    | Marrakech Moroccan Restaurant | 4.0        | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
    +------------+------------+------------+------------+
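In this query REPEATED_CONTAINS acts as a filter on an array-valued column. A rough Python analogue, treating it as a plain membership test (a simplification assumed for this sketch), looks like this; the three sample rows reuse names and categories that appear elsewhere in the slides.

```python
def repeated_contains(values, keyword):
    """True if the keyword is one of the elements of the array."""
    return keyword in values


businesses = [
    {"name": "Olives",
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant",
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
    {"name": "Mon Ami Gabi",
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
]

matches = [b["name"] for b in businesses
           if repeated_contains(b["categories"], "Mediterranean")]
print(matches)
```

The point is that the array is tested in place; no join table or pre-flattening is needed to ask whether a value occurs in it.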

Flatten Repeated Values

    > SELECT name, categories
      FROM dfs.demo.`yelp/business.json` LIMIT 3;
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
    | Pine Cone Restaurant | ["Restaurants"] |
    | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
    +------------+------------+

    > SELECT name, FLATTEN(categories) AS categories
      FROM dfs.demo.`yelp/business.json` LIMIT 5;
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | Eric Goldberg, MD | Doctors    |
    | Eric Goldberg, MD | Health & Medical |
    | Pine Cone Restaurant | Restaurants |
    | Deforest Family Restaurant | American (Traditional) |
    | Deforest Family Restaurant | Restaurants |
    +------------+------------+
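What FLATTEN does to each row can be mimicked with a short generator: every element of the array produces one output row, and the other columns are repeated. This is an illustrative Python equivalent using the three businesses from the first result set, not Drill's implementation.

```python
def flatten(rows, field):
    """Yield one copy of each row per element of its array-valued field."""
    for row in rows:
        for value in row[field]:
            yield {**row, field: value}


rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = [(r["name"], r["categories"]) for r in flatten(rows, "categories")]
for name, category in flat:
    print(f"{name} | {category}")
```

Three input rows become five output rows, matching the second result set on the slide.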

Most and Least Common Business Categories

    > SELECT category, count(*) AS businesses
      FROM (SELECT name, FLATTEN(categories) AS category
            FROM dfs.demo.`yelp/business.json`) c
      GROUP BY category ORDER BY businesses DESC;
    +------------+------------+
    |  category  | businesses |
    +------------+------------+
    | Restaurants | 14303      |
    …
    | Australian | 1          |
    | Boat Dealers | 1          |
    | Firewood   | 1          |
    +------------+------------+
    715 rows selected (3.439 seconds)

    > SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
    +------------+------------+
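The first query is a flatten followed by a group-and-count. The same two steps can be sketched with `collections.Counter` on a three-row sample taken from the earlier FLATTEN slide; the counts below therefore describe only this sample, not the full Yelp data set.

```python
from collections import Counter

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# Inner query: one (business, category) pair per array element.
# Outer query: count the pairs per category, most common first.
category_counts = Counter(
    category
    for business in businesses
    for category in business["categories"]
)

for category, count in category_counts.most_common():
    print(f"{category} | {count}")
```

The subquery is needed in SQL because the grouping has to happen on the flattened values, not on the original arrays.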


Explore the Data: Views

Create a View for Name-Gender Mapping

    > CREATE VIEW dfs.tmp.`names` AS
        SELECT columns[0] AS name, columns[4] AS gender
        FROM dfs.demo.`names.csv`;
    > USE dfs.tmp;
    > CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
    > SELECT * FROM dfs.tmp.names WHERE name = 'John';
    +------------+------------+
    |    name    |   gender   |
    +------------+------------+
    | John       | Male       |
    +------------+------------+

[Screenshot: names.csv, with columns[0] and columns[4] highlighted]
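The view gives names to positional CSV fields: a header-less CSV file is exposed as a single array called `columns`, and the view picks elements 0 and 4. A rough Python equivalent is shown below. The sample file contents are hypothetical (the slide does not show the real rows of names.csv); only the column positions follow the view definition.

```python
import csv
import io

# Hypothetical stand-in for names.csv: name in field 0, gender in field 4.
SAMPLE_CSV = "John,a,b,c,Male\nJennifer,a,b,c,Female\n"


def read_names(text):
    """Project fields 0 and 4 of each CSV row into named columns."""
    return [
        {"name": columns[0], "gender": columns[4]}
        for columns in csv.reader(io.StringIO(text))
    ]


names = read_names(SAMPLE_CSV)
print([row for row in names if row["name"] == "John"])
```

Defining the projection once in a view means later queries can join on `name` and `gender` without repeating the column indexes.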

Most Common Names (and their Genders) on Yelp

    > SELECT u.name, n.gender, count(*) AS number
      FROM mongo.yelp.users u, dfs.tmp.names n
      WHERE u.name = n.name
      GROUP BY u.name, n.gender
      ORDER BY number DESC LIMIT 10;
    +------------+------------+------------+
    |    name    |   gender   |   number   |
    +------------+------------+------------+
    | David      | Male       | 2453       |
    | John       | Male       | 2378       |
    | Michael    | Male       | 2322       |
    | Chris      | Unknown    | 2202       |
    | Mike       | Male       | 2037       |
    | Jennifer   | Female     | 1867       |
    | Jessica    | Female     | 1463       |
    | Jason      | Male       | 1457       |
    | Michelle   | Female     | 1439       |
    | Brian      | Male       | 1436       |
    +------------+------------+------------+

Who Rates Higher – Men or Women?

    > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
      FROM mongo.yelp.users u, dfs.tmp.names n
      WHERE u.name = n.name
      GROUP BY n.gender;
    +------------+------------+------------+
    |   gender   |   users    |   stars    |
    +------------+------------+------------+
    | Female     | 103684     | 3.77       |
    | Male       | 97430      | 3.696      |
    | Unknown    | 18409      | 3.727      |
    +------------+------------+------------+

Who Writes More – Men or Women?

It takes a 3-way join to find out…

    > SELECT n.gender, round(avg(length(r.text))) AS review_length
      FROM dfs.demo.`yelp/review.json` r,
           mongo.yelp.users u,
           dfs.tmp.names n
      WHERE u.name = n.name AND r.user_id = u.user_id
      GROUP BY n.gender;
    +------------+---------------+
    |   gender   | review_length |
    +------------+---------------+
    | Male       | 665           |
    | Female     | 730           |
    | Unknown    | 711           |
    +------------+---------------+
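The shape of the three-way join (reviews from a JSON file, users from MongoDB, genders from a CSV-backed view) can be sketched with in-memory lookups. All sample rows below are made up for illustration, and the sketch assumes each user_id and each name occurs once in its lookup table; it shows the logic of the query, not how Drill executes it.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the three sources.
reviews = [
    {"user_id": "u1", "text": "abcd"},
    {"user_id": "u1", "text": "ab"},
    {"user_id": "u2", "text": "abcdef"},
]
users = [{"user_id": "u1", "name": "John"}, {"user_id": "u2", "name": "Jennifer"}]
names = [{"name": "John", "gender": "Male"}, {"name": "Jennifer", "gender": "Female"}]

# Build lookup tables for the two join conditions.
name_by_user = {u["user_id"]: u["name"] for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

# Join each review to its user and gender, then group review lengths.
lengths = defaultdict(list)
for review in reviews:
    name = name_by_user.get(review["user_id"])
    gender = gender_by_name.get(name)
    if gender is not None:  # inner join: drop rows without a match
        lengths[gender].append(len(review["text"]))

review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```

In the Drill query the same two join conditions appear in the WHERE clause, and the sources can live in different systems without first being copied into one store.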


Drill Tweets (@ApacheDrill)

Thank You

•  Learn: incubator.apache.org/drill/
•  Download: incubator.apache.org/drill/download/
•  Ask questions: [email protected]
•  Contact me: [email protected]

Tomer Shiran, VP Product Management
[email protected]

MapR on social media: @mapr, maprtech, MapRTechnologies, mapr-technologies