
SQL on Hadoop

Today's agenda

• Introduction
• Hive – the first SQL approach
• Data ingestion and data formats
• Impala – MPP SQL

Why SQL?

• Data warehousing…
• Structured data
  – organization of the data
  – optimized data access

• Declarative data processing
  – No need to have developer skills, but…
  – Portable – universal language
  – We are lazy

• SQL drivers supported
  – No need for a Hadoop client installation
  – Easier integration with the current systems
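To make the "declarative data processing" point concrete, here is a minimal sketch that computes the same aggregate twice: once with a hand-written loop and once with a SQL query. It assumes Python's built-in `sqlite3` as a stand-in for a Hadoop SQL engine (purely so the example is self-contained); the `sales` table and its rows are hypothetical sample data.

```python
import sqlite3

# Hypothetical sample data: (country, amount) sales records.
ROWS = [("CH", 10.0), ("FR", 5.0), ("CH", 2.5), ("PL", 7.0), ("FR", 1.0)]


def totals_procedural(records):
    """Hand-written aggregation: the developer spells out HOW to compute it."""
    totals = {}
    for country, amount in records:
        totals[country] = totals.get(country, 0.0) + amount
    return dict(sorted(totals.items()))


def totals_declarative(records):
    """SQL aggregation: the query states WHAT is wanted; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real SQL engine
    conn.execute("CREATE TABLE sales (country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    result = conn.execute(
        "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return dict(result)
```

Both functions return the same per-country totals. The SQL version would stay unchanged if the engine underneath were swapped, which is the portability argument made above.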

Why not SQL?

• It is not an RDBMS!
  – joins of big tables should be avoided
  – no indexes by default
  – no primary keys and constraints

• Not suited for OLTP
  – no locks
  – no transactions

• Write once – read many
• Additional data structuring during data shipping (ETL) needed
• Not all problems can be solved with SQL


SQL on Hadoop

[Diagram: Hadoop ecosystem components]

• HDFS – Hadoop Distributed File System
• HBase – NoSQL columnar store
• YARN – Cluster resource manager
• MapReduce
• Hive – SQL
• Pig – Scripting
• Flume – Log data collector
• Sqoop – Data exchange with RDBMS
• Oozie – Workflow manager
• Mahout – Machine learning
• Zookeeper – Coordination
• Impala – SQL
• Spark – Large scale data processing

SQL on Hadoop

[Diagram: generic SQL-on-Hadoop architecture]

• Client – exchanges SQL and data with the SQL master node
• SQL master node – JDBC/ODBC server and SQL engine
• Metadata – tables definition lookup
• Cluster nodes – each with an Executor on top of HDFS
• YARN
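The architecture on this slide (a master node that looks up table definitions and hands work to executors sitting next to the data) can be illustrated with a toy scatter-gather aggregation. This is only a sketch under stated assumptions: `METADATA`, `NODE_BLOCKS`, `executor_partial_sum` and `master_sum_by` are hypothetical names, the data is made up, and the executors run sequentially in one process rather than in parallel across a cluster.

```python
# Hypothetical metadata store: table name -> column names
# (stands in for the "tables definition lookup").
METADATA = {"sales": ["country", "amount"]}

# Hypothetical data blocks, one list of rows per cluster node
# (stands in for the HDFS blocks stored locally on each node).
NODE_BLOCKS = [
    [("CH", 10.0), ("FR", 5.0)],
    [("CH", 2.5)],
    [("PL", 7.0), ("FR", 1.0)],
]


def executor_partial_sum(block, key_idx, value_idx):
    """Runs on a cluster node: aggregate only the rows stored locally."""
    partial = {}
    for row in block:
        key = row[key_idx]
        partial[key] = partial.get(key, 0.0) + row[value_idx]
    return partial


def master_sum_by(table, key_col, value_col):
    """Runs on the master node: resolve columns, fan out, merge partial results."""
    columns = METADATA[table]  # table definition lookup
    key_idx = columns.index(key_col)
    value_idx = columns.index(value_col)
    merged = {}
    for block in NODE_BLOCKS:  # a real engine would dispatch these in parallel
        partial = executor_partial_sum(block, key_idx, value_idx)
        for key, value in partial.items():
            merged[key] = merged.get(key, 0.0) + value
    return merged
```

Calling `master_sum_by("sales", "country", "amount")` plays the role of `SELECT country, SUM(amount) FROM sales GROUP BY country`: only small partial results travel to the master, while the raw rows stay on the nodes that store them.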

There are other exotic animals…

• Purely on Hadoop
  – Stinger.next/Hive on Tez (improved MR executions, ACID, etc.)
  – Presto (graph based processing, multiple data sources)
  – SparkSQL (Spark based)

• See Greg Rahn's slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop

Summary

• SQL on Hadoop is not for OLTP!
• …but for data warehousing workloads
• …ad-hoc queries

• Enforces semi-structuring of the data

• Does not enforce the use of a particular data format on HDFS
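The last two points can be read together: the table definition imposes structure, while the files underneath may be stored in different formats. A minimal sketch of that idea, assuming a hypothetical `SCHEMA` that is applied only when the data is read, with CSV and JSON-lines text standing in for files on HDFS:

```python
import csv
import io
import json

# Hypothetical table definition: column name -> Python type, applied only at read time.
SCHEMA = {"id": int, "name": str, "score": float}

# The same two rows stored in two different formats (made-up sample data).
CSV_TEXT = "1,alice,3.5\n2,bob,4.0\n"
JSON_TEXT = (
    '{"id": 1, "name": "alice", "score": 3.5}\n'
    '{"id": 2, "name": "bob", "score": 4.0}\n'
)


def read_csv(text):
    """Interpret comma-separated text using the table definition."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Columns are matched by position, then cast to the declared type.
        rows.append(
            {col: cast(value) for (col, cast), value in zip(SCHEMA.items(), record)}
        )
    return rows


def read_json_lines(text):
    """Interpret one JSON object per line using the same table definition."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            # Columns are matched by name, then cast to the declared type.
            rows.append({col: cast(obj[col]) for col, cast in SCHEMA.items()})
    return rows
```

Both readers yield identical rows, so a query written against the table definition does not need to know which format the underlying files use.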