40
Data cubes in Apache Hive Amareshwari Sriramadasu Jaideep Dhok

Datacubes in Apache Hive at ApacheCon

  • Upload
    amarsri

  • View
    128

  • Download
    4

Embed Size (px)

DESCRIPTION

Introduces OLAP cubes in Apache Hive and unified analytics platform Grill

Citation preview

Page 1: Datacubes in Apache Hive at ApacheCon

Data cubes in Apache Hive

Amareshwari SriramadasuJaideep Dhok

Page 2: Datacubes in Apache Hive at ApacheCon

Amareshwari

Sriramadasu

•Engineer at Inmobi

•Working in Hadoop and eco systems since 2007

•Apache Hadoop – PMC

•Apache Hive– Committer

Page 3: Datacubes in Apache Hive at ApacheCon

JaideepDhok

•Engineer at Inmobi

•Working in distributed systems and Hadoop eco system since 2007

•Apache Hive and Hadoop Contributor

Page 4: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Apache Hive

Queries on data cubes with examples

Grill

Agenda

Page 5: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Hive

Queries on data cubes with examples

Grill

Agenda

Page 6: Datacubes in Apache Hive at ApacheCon

Global Mobile technology company enabling- Developers & Publishers to monetize

- Advertisers to engage and acquire users

@ Scale

About InMobi

Page 7: Datacubes in Apache Hive at ApacheCon

Digital advertising – Intro

Courtesy: http://www.liesdamnedlies.com/

Owns & Sells Real estate on digital inventory

Has reach to users

Wants to target Users

Brings money

Market place

Consumer

Page 8: Datacubes in Apache Hive at ApacheCon

Hadoop @ InMobi – Factual Reporting & Analytics

130 TB Hadoop warehouse

5 TB SQL warehouse

Page 9: Datacubes in Apache Hive at ApacheCon

Users• Advertisers and publishers

• Regional officers

• Account managers

• Executive team

• Business analysts

• Product analysts

• Data scientists

• Engineering systems

• Developers

Use cases

Page 10: Datacubes in Apache Hive at ApacheCon

Sharing reports/data with customer (account level) 

Understand trends through data & exploration (analysis)

Debugging / Postmortem of issues (troubleshooting) 

Sizing & Estimation (Ex: inventory, reach) 

Summary of Product lines, Geographies, Network (Ex: Rev by Geo) 

Sales/Revenue Targets vs Actuals 

Tracking campaign performance 

Tracking any metrics on REAL- TIME basis 

Use cases

Page 11: Datacubes in Apache Hive at ApacheCon

Categorize the use cases• Batch queries

• Adhoc queries

• Interactive queries

• Scheduled reports

• Infer insights through ML algorithms

Use cases

Page 12: Datacubes in Apache Hive at ApacheCon

Adhoc querying system Dashboard system

Customer facing system

Analytics systems at Inmobi

Page 13: Datacubes in Apache Hive at ApacheCon

•Disparate user experience

•Disparate data storage systems causing inability to scale

•Schema management

•Not leveraging community around

Problems

Page 14: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Hive

Queries on data cubes with examples

Grill

Agenda

Page 15: Datacubes in Apache Hive at ApacheCon

Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like batch queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community

Data warehouse features like facts, dimensionsLogical table associated with multiple physical storagesInteractive queriesScheduling queriesUIW

hat

does

Hiv

e

pro

vid

e

Wh

at is m

issing in

H

ive

Apache Hive to the rescue

Page 16: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Apache Hive

Queries on data cubes with examples

Grill

Agenda

Page 17: Datacubes in Apache Hive at ApacheCon

Data Layout – Fact data

Aggrk : measures (mak <= ma(k-1)), dimensions (dak <

da(k-1))

…..

Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw data : measures (mr), dimensions(dr)

Other side of pyramid is aggregated at timed dimension

Page 18: Datacubes in Apache Hive at ApacheCon

Data Layout – snowflake

Aggrk : measures (mak <= ma(k-1)), dimensions (dak <

da(k-1))

…..

Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw data : measures (mr), dimensions(dr)

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

Page 19: Datacubes in Apache Hive at ApacheCon

Data Model

Cube Storage Fact Table

Physical Fact tables

Dimension Table

Physical Dimension

tables

Page 20: Datacubes in Apache Hive at ApacheCon

Data Model - Cube

Dimension• Simple Dimension• Referenced Dimension• Hierarchical Dimension• Expression Dimension• Timed dimension

Measure• Column Measure • Expression Measure

Cube

Measures Dimensions

Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/

Page 21: Datacubes in Apache Hive at ApacheCon

Data Model – Storage

Storage

Name

End point

PropertiesEx : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2

Page 22: Datacubes in Apache Hive at ApacheCon

Data Model – Fact Table

Fact table

Cube

Fact table

Storage

Fact Table

Columns

Cube that it belongsStorages on which it is present and the associated update periods

Page 23: Datacubes in Apache Hive at ApacheCon

Data Model – Dimension table

Dimension Table

Columns

Dimension referencesStorages on which it is present and associated snapshot dump period, if any.

Cube

Dimension table

Dimension table

Dimension table

Storage

Page 24: Datacubes in Apache Hive at ApacheCon

Data Model – Storage tables and partitions

Storage table

Belongs to fact/dimensionAssociated storage descriptor

Partitioned by columnsNaming convention – storage name followed by fact/dimension namePartition can override its storage descriptor

• Fact storage table

Fact table

• Dimension storage table

Dimension table

Page 25: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Hive

Queries on data cubes with examples

Grill

Agenda

Page 26: Datacubes in Apache Hive at ApacheCon

CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]

cube_table_reference:cube_table_factor

| join_tablejoin_table:

cube_table_reference JOIN cube_table_factor [join_condition]

| cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]

cube_table_factor:cube_name [alias]

| ( cube_table_reference )join_condition:

ON equality_expression ( AND equality_expression )*equality_expression:

expression = expressioncolOrder: ( ASC | DESC )

colList : colName colOrder? (',' colName colOrder?)*

Queries on Data cubes

Page 27: Datacubes in Apache Hive at ApacheCon

Resolve candidate tables and storages

Automatically resolve joins

Resolve aggregates and groupby expressions

Allows SQL over Cube QL

Queries can span multiple storages

Accepts multi time range queries

All Hive QL features

Query features

Page 28: Datacubes in Apache Hive at ApacheCon

• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100

• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100

cube select name, stateid from citytable limit 100

Example query

Page 29: Datacubes in Apache Hive at ApacheCon

Example query

• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest')

• GROUP BY(citytable.name)

cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)

Page 30: Datacubes in Apache Hive at ApacheCon

Available in Hive• Data warehouse

features like facts, dimensions

• Logical table associated with multiple physical storages

Available in Grill • Pluggable execution engine

for HQL• Query history, caching• Scheduling queries

Where is it available

Page 31: Datacubes in Apache Hive at ApacheCon

Background

Why Apache Hive

Data cubes in Hive

Queries on data cubes with examples

Grill

Agenda

Page 32: Datacubes in Apache Hive at ApacheCon

Unified analytics platform at Inmobi

• Supports multiple execution engines• Supports multiple storages

Provides analytics on the system

Provides query history

Grill

Page 33: Datacubes in Apache Hive at ApacheCon

Grill Architecture

Page 34: Datacubes in Apache Hive at ApacheCon

Implements an interface• execute• explain• executeAsynchronously• fetchResults• Specify all storages it can support

Pluggable execution engine

Page 35: Datacubes in Apache Hive at ApacheCon

Cube QL query

Rewrite query for available execution engine’s supported storages

Get cost of the rewritten query from each execution engine

Pick up execution engine with least cost and fire the query

Cube query with multiple execution engines

Page 36: Datacubes in Apache Hive at ApacheCon

Grill Server

Page 37: Datacubes in Apache Hive at ApacheCon

Grill server – code status

Page 38: Datacubes in Apache Hive at ApacheCon

• Number of queries - 700 to 900 per day

• Number of dimension tables - 125

• Number of fact tables – 24

• Number cubes – 15

• Size of the data

• Total size – 136 TB

• Dimension data – 400 MB compressed per hour

• Raw data - 1.2 TB per day

• Aggregated facts- 53GB per day

Data ware house statistics

Page 39: Datacubes in Apache Hive at ApacheCon

Jira reference

• https://issues.apache.org/jira/browse/HIVE-4115

Github source for Hive

• https://github.com/InMobi/hive

Github source for grill

• https://github.com/InMobi/grill

Documentation

• http://inmobi.github.io/grill

References

Page 40: Datacubes in Apache Hive at ApacheCon

Thank you!• [email protected]

rg• [email protected]

m