SQL on Hadoop
Todays agenda
• Introduction• Hive – the first SQL approach• Data ingestion and data formats• Impala – MPP SQL
Why SQL?
• Data warehousing…• Structured data– organization of the data– optimized data access
• Declarative data processing– No need to have developer skills, but…– Portable – universal language– We are lazy
• SQL drivers supported– No need of Hadoop client installation– Easier integration with the current systems
Why not SQL
• It is not RDBMS!– big tables joins should by avoided– no indexes by default– no primary keys and constraints
• Not suited for OLTP– no locks– no transactions
• write once – read many• Additional data structuring during data shipping
(ETL) needed • Not all problems can be solved with SQL
5
SQL on Hadoop
HDFSHadoop Distributed File System
Hba
seN
oSql
col
umna
r sto
re
YARNCluster resource manager
MapReduce
Hiv
eSQ
L
Pig
Scrip
ting
Flum
eLo
g da
ta c
olle
ctor Sq
oop
Dat
a ex
chan
ge w
ith R
DBM
S
Ooz
ieW
orkfl
ow m
anag
er
Mah
out
Mac
hine
lear
ning
Zook
eepe
rCo
ordi
natio
n Impa
laSQ
L
Spar
kLa
rge
scal
e da
ta p
roce
esin
g
SQL on Hadoop
Client
Metadata
SQL master node
JDBC/ODBC server
SQL engine
HDFS
Executor
Cluster Node
HDFS
Executor
Cluster Node
HDFS
Executor
Cluster Node
HDFS
Executor
SQL
Data
Tables definition lookup
YARN
There are others exotic animals…
• Purely on Hadoop– Stinger.next/Hive on Tez (improved MR executions, ACID,
etc)– Presto (graph based processing, multiple data sources)– SparkSQL (Spark based)
• See Greg Rhan slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop
Summary
• SQL on Hadoop is not for OLTP!• …but for data warehousing workloads• … ad-hoc queries
• Enforces semi-structuring of the data
• Does not enforce using certain data format on HDFS