DWH OVER HADOOP

Veracity think bugdata #2 6.7.2015


THE BASICS

COLUMNAR FORMATS (ORC/PARQUET)

- Projection Push Down
- Predicate Push Down
- Excellent Compression Ratios
- Column Indices
- Max/Avg/Min values
- Rows must be batched to benefit from these optimizations
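To make the two push-down ideas concrete, here is a minimal sketch of a toy columnar layout in plain Python. It is not the ORC or Parquet implementation; `RowGroup`, `make_row_group` and `scan` are hypothetical names invented for this illustration. It only shows why per-batch min/max statistics let a reader skip whole batches, and why only the requested columns need to be touched.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A batch of rows stored column by column, with min/max stats per column."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def make_row_group(rows):
    """Pivot a batch of row dicts into columns and record min/max per column."""
    names = rows[0].keys()
    columns = {name: [row[name] for row in rows] for name in names}
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return RowGroup(columns, stats)


def scan(row_groups, select, where_col, lower_bound):
    """Return the `select` columns of rows where where_col >= lower_bound.

    Also returns how many row groups actually had to be read.
    """
    results, groups_read = [], 0
    for group in row_groups:
        _, col_max = group.stats[where_col]
        if col_max < lower_bound:
            continue  # predicate push down: the stats prove no row can match
        groups_read += 1
        predicate_values = group.columns[where_col]
        # projection push down: only the requested columns are touched
        projected = [group.columns[name] for name in select]
        for i, value in enumerate(predicate_values):
            if value >= lower_bound:
                results.append(tuple(col[i] for col in projected))
    return results, groups_read


# Two batches of rows, sorted by day so that the stats are selective.
rows = [{"day": d, "user": f"u{d}", "clicks": d * 10} for d in range(1, 9)]
groups = [make_row_group(rows[:4]), make_row_group(rows[4:])]
matches, groups_read = scan(groups, ["user", "clicks"], "day", 7)
```

With the sample data only the second batch is read, which is also why rows have to be batched: statistics over a handful of rows filter out very little.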

PARQUET

- Strongly endorsed by Cloudera
- One of the few formats Impala supports (and the best suited to it)
- Also supported by Hive, Spark, Tajo, Drill & Presto
- Speaking from my own personal experience, a bit more expensive to generate

ORC

- Endorsed by Hortonworks
- Best suited to Presto
- Spark support was recently introduced

QUERYING ENGINES

HIVE

- Hive provides a SQL-like interface, called HiveQL, for accessing the data (files)
- The HQL is translated into M/R code and executed immediately
- Batch oriented
- Fault tolerant and thus reliable
- Not a DB! Does not support updates & deletes and has no transactions (or does it?)
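As a rough illustration of what "translated into M/R code" means, the sketch below hand-translates one query, `SELECT country, COUNT(*) FROM visits GROUP BY country`, into map, shuffle and reduce steps in plain Python. The `visits` table and all function names are made up for this example, and a real Hive plan is considerably more involved (multiple stages, combiners, serialization).

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit (group key, 1) for each input record."""
    yield record["country"], 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: aggregate the values of one key (COUNT(*) is a sum of ones)."""
    return key, sum(values)


def run_group_by_count(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


visits = [{"country": "IL"}, {"country": "US"}, {"country": "IL"}]
counts = run_group_by_count(visits)
```

Every phase reads and writes complete data sets, which hints at why this model is reliable for batch work but slow for interactive queries.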

LOW LATENCY SQL

- Map-Reduce can be compared to a tractor: it's very strong and can plow a field better than any other vehicle, but it's also very slow.
- As prices of memory dropped, a demand emerged to better utilize it for faster response times.

CLOUDERA IMPALA

- Written in C++
- Utilizes Hive's metadata
- Very fast
- Not fault tolerant
- Doesn't support custom data formats
- Doesn't support complex data types (maps/arrays/structs)
- A bit complicated to set up on non-CDH distributions

FACEBOOK PRESTO

- Can connect to:
  - Cassandra
  - Hive
  - JMX Sources
  - Postgres & MySQL
- Allows cross engine joins
- Used in Facebook to serve online dashboards
- Easy to set up

SPARK SQL

- Not affiliated with any Hadoop vendor
- Supports all of the optimized file formats (ORC/Parquet/Avro)
- Can auto-discover schema
- Aims to provide second/sub-second latency
- Still not very mature

THE USUAL DATA FLOW

Collect -> Store -> Convert -> Select

- The data latency conflict: lots of fragmented small files, or big optimized files with big latency
- Processing effort involved in the conversion process should be minimized
- Example..

A BETTER DATA FLOW

Collec-tor-vert -> Select

- Convert the data as it is being collected where possible
- Or convert the data as it is being stored (streaming), but without losing optimizations
- How can this be achieved?
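One common way to convert while storing without ending up with thousands of tiny files is to buffer rows and flush them as a batch once the buffer is big enough or old enough. The sketch below is only an illustration of that trade-off; `BatchingWriter` and its parameters are hypothetical and do not belong to any of the tools mentioned in this deck. For simplicity the age limit is checked only when a row arrives, whereas a real writer would probably use a timer.

```python
import time


class BatchingWriter:
    """Buffers incoming rows and flushes them as one batch (hypothetical helper)."""

    def __init__(self, flush, max_rows=1000, max_age_seconds=5.0, clock=time.monotonic):
        self.flush = flush                    # callback that writes one optimized batch
        self.max_rows = max_rows              # bigger batches -> better file layout
        self.max_age_seconds = max_age_seconds  # smaller age -> lower data latency
        self.clock = clock
        self.buffer = []
        self.first_row_at = None

    def write(self, row):
        """Add one row and flush if the batch is full or has waited too long."""
        if not self.buffer:
            self.first_row_at = self.clock()
        self.buffer.append(row)
        too_big = len(self.buffer) >= self.max_rows
        too_old = self.clock() - self.first_row_at >= self.max_age_seconds
        if too_big or too_old:
            self.close()

    def close(self):
        """Flush whatever is buffered, for example on shutdown."""
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []


batches = []
writer = BatchingWriter(batches.append, max_rows=3, max_age_seconds=60)
for event_id in range(7):
    writer.write(event_id)
writer.close()
```

The two limits are exactly the conflict from the previous slide: `max_rows` pushes towards big optimized files, `max_age_seconds` caps how stale the data may get.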

SQOOP

- Import data from RDBMS into Hadoop
- Create Java classes and Hive tables on import
- Export data back to RDBMS
- Runs a "Map Only" job to perform the task
- Supports incremental imports
- Now supports importing directly as Parquet
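The idea behind an incremental import can be sketched without Sqoop at all: remember the highest value of a monotonically growing column and, on the next run, fetch only rows above it. Sqoop's append mode exposes this through `--incremental append`, `--check-column` and `--last-value`; the Python below is not Sqoop code, just an illustration using the standard library's `sqlite3`, with a made-up `orders` table and a hypothetical `incremental_import` helper.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch only rows whose check column exceeds the last imported value.

    Returns the new rows and the value to remember for the next run.
    """
    # Identifiers cannot be bound as SQL parameters; they are trusted in this sketch.
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    names = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if rows:
        last_value = rows[-1][names.index(check_column)]
    return rows, last_value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b")])

# First run imports everything, the second run only what arrived since.
first_batch, checkpoint = incremental_import(conn, "orders", "id", 0)
conn.execute("INSERT INTO orders VALUES (3, 'c')")
second_batch, checkpoint = incremental_import(conn, "orders", "id", checkpoint)
```

This only covers appended rows; picking up updated rows needs a last-modified timestamp column instead of a growing id.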

HIVE & ACID

- Recently a conceptual change has been introduced into Hive: CRUD with ACID transactions.
- It is not meant to replace your OLTP, but rather to supply a better data modification mechanism for a subset of the data.
- Explanation on how it works
- Demo simple insert
- Still requires M/R :(
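The slide leaves the explanation to the talk. As far as I understand the design, files on HDFS are never modified in place: each transaction writes its changes into a separate delta next to an immutable base, and readers merge base and deltas at query time. The sketch below models only that merge-on-read idea in plain Python; the data structures and the `read_table` name are invented for the illustration and say nothing about Hive's actual file layout.

```python
def read_table(base, deltas):
    """Merge-on-read: apply deltas, in transaction order, on top of the base.

    base:   {row_id: row}
    deltas: list of (txn_id, [(op, row_id, row_or_None), ...])
    """
    rows = dict(base)  # the base itself is never modified
    for _txn_id, changes in sorted(deltas, key=lambda delta: delta[0]):
        for op, row_id, row in changes:
            if op == "delete":
                rows.pop(row_id, None)
            else:  # "insert" and "update" both store the newest version
                rows[row_id] = row
    return rows


base = {1: {"name": "ann", "city": "TLV"}, 2: {"name": "bob", "city": "NYC"}}
deltas = [
    (11, [("update", 2, {"name": "bob", "city": "SFO"})]),
    (10, [("insert", 3, {"name": "cat", "city": "LON"})]),
    (12, [("delete", 1, None)]),
]
snapshot = read_table(base, deltas)
```

The more deltas pile up, the more work every read has to do, which is where compaction comes in.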

HIVE & STREAMING INGEST

- With the new ACID capabilities it is now possible to continuously insert data into Hive
- Data appears almost immediately
- Data is optimized in a columnar format
- Data is compacted by different triggers
- Code snippet
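To give a feel for "compacted by different triggers", here is a small decision function, loosely modeled on two kinds of threshold: too many delta files triggers a minor compaction (deltas merged into one), and deltas that are large relative to the base trigger a major compaction (everything rewritten into a new base). I believe Hive exposes comparable settings as `hive.compactor.delta.num.threshold` and `hive.compactor.delta.pct.threshold`, but the function name, parameters and default values below are illustrative assumptions, not Hive's code.

```python
def choose_compaction(base_bytes, delta_sizes, max_deltas=10, max_delta_ratio=0.1):
    """Decide which compaction, if any, a partition needs.

    Returns "major" (rewrite base + deltas into a new base),
    "minor" (merge the deltas into a single delta) or None.
    """
    no_base = base_bytes == 0
    delta_bytes = sum(delta_sizes)
    # Trigger 1: the deltas have grown large compared to the base.
    if not no_base and delta_bytes / base_bytes > max_delta_ratio:
        return "major"
    # Trigger 2: too many delta files; without a base, one has to be created.
    if len(delta_sizes) > max_deltas:
        return "major" if no_base else "minor"
    return None


decision_few_small = choose_compaction(1000, [10, 10])
decision_big_delta = choose_compaction(1000, [200])
decision_many_deltas = choose_compaction(100_000, [1] * 11)
```

Continuous streaming inserts produce many small deltas, so in practice it is the second trigger that fires most often.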

FLUME

- Distributed
- Durable
- Scalable
- Fault tolerant
- Serves for ingestion and basic pre-processing of the data
- Composed of source -> channel -> sink (Draw Architecture)
- Utilizes Hive's ACID capabilities to instantly stream data into Hive - demo
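In place of the architecture drawing, the source -> channel -> sink chain can be sketched in a few lines of plain Python. This is not Flume's API; `Channel`, `source` and `sink` are hypothetical stand-ins that only show the roles: the source pre-processes and enqueues events, the channel buffers them, and the sink removes events from the channel only after a batch has been written successfully.

```python
from collections import deque


class Channel:
    """Buffer between source and sink; events stay until a sink confirms them."""

    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take_batch(self, size):
        """Peek at up to `size` events without removing them."""
        return [self.events[i] for i in range(min(size, len(self.events)))]

    def commit(self, count):
        """Remove events that the sink has safely written."""
        for _ in range(count):
            self.events.popleft()


def source(lines, channel):
    """Source: turn raw lines into events, with basic pre-processing."""
    for line in lines:
        line = line.strip()
        if line:  # drop empty lines
            channel.put({"body": line.lower()})


def sink(channel, write, batch_size=2):
    """Sink: drain the channel in batches; commit only after a successful write."""
    while channel.events:
        batch = channel.take_batch(batch_size)
        write(batch)  # if this raises, the events remain in the channel
        channel.commit(len(batch))


channel = Channel()
delivered = []
source(["A\n", "\n", "b", "C "], channel)
sink(channel, delivered.append)
```

Here `delivered.append` stands in for the real destination; in the demo setup that role would be played by a sink streaming into Hive.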