Using BigBench to compare Hive and Spark
Nicolas Poggi, Alejandro Montero
April 2017


Outline

1. Intro to BSC and ALOJA
2. BigBench
3. Sequential tests
   1. Data scales
4. Concurrency tests
5. Summary

Barcelona Supercomputing Center (BSC)
• Spanish national supercomputing center, 22 years of history in:
  • Computer Architecture, networking and distributed systems research
• Based at BarcelonaTech University (UPC)
• Large ongoing life science computational projects
• Prominent body of research activity around Hadoop
  • 2008-2013: SLA Adaptive Scheduler, Accelerators, Locality Awareness, Performance Management. 7 publications
  • 2013-Present: Cost-efficient upcoming Big Data architectures (ALOJA). 8+ publications

ALOJA: towards cost-effective Big Data
• Research project for automating characterization and optimization of Big Data deployments
• Open source Benchmarking-to-Insights platform and tools
• Largest Big Data public repository (70,000+ jobs)
• Community collaboration with industry and academia
http://aloja.bsc.es
[Figure: Big Data Benchmarking; Online Repository; Web / ML Analytics]

Benchmarking and BigBench

The need for a new benchmark standard
• A benchmark captures the solution to a problem and guides decision making
• Database-related benchmark standards
  • Transactional (OLTP): TPC-C and TPC-E
  • Decision Support (DSS/OLAP): TPC-H and TPC-DS
• And for Big Data analytics properties?
  • 3 Vs, ML, M/R
• Benchmark uses:
  • System tuning and debugging
  • Spread and broad Big Data ecosystem
  • Set common rules
  • Vendor comparison
  • Transparency across the industry

What is BigBench (TPCx-BB [1])?
• End-to-end application level benchmark
  • result of many years of collaboration
  • industry and academia
• Covers most Big Data Analytical properties (3Vs)
• Based on a retailer company (extension of TPC-DS)
[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf

BigBench history

BigBench use cases and process overview
• 30 business use cases covering:
  • Merchandising
  • Pricing Optimization
  • Product Return
  • Customers
  • ...
• Implementation resulted in:
  • 14 Declarative queries (SQL)
  • 7 Queries with Natural Language Processing
  • 4 Queries with data preprocessing with MapReduce jobs
  • 5 Queries with Machine Learning post processing

BigBench v1.2 – Reference Implementation
[Diagram, layers and components:]
Machine Learning: Mahout ML, Custom Spark MLlib
SQL Engine: Hive, Spark SQL
Execution Engine: MapReduce, Tez, Spark (YARN)
Table Metastore: Hive Metastore
Filesystem: HDFS

Benchmarked systems:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark_MLlib
• Hive + Tez + Mahout
• Hive + Tez + Spark_MLlib
• Spark SQL + Mahout
• Spark SQL + Spark_MLlib
• Spark 2 SQL + Mahout
Work in progress:
• Hive 2
• Spark 2 SQL + Spark_MLlib

The cluster (I) – HDInsight PaaS
Model: D4v2
# Head nodes: 2
# Working nodes: 4
# Zookeeper nodes: 3
CPU: Intel(R) Xeon(R) CPU E5-2673 v3, 8 x 2.4 GHz cores
RAM: 28 GB
HDFS: Remote
Software: HortonWorks Data Platform 2.5

Non-conditional map join conversion with small tables smaller than 319 MB

Software configuration
Mapper/Reducer/Tez memory: 1536 MB
Mapper/Reducer/Tez Heap Space: 1024 MB
Mapper/Reducer/Tez Cores: 1
Hive MapJoins: Yes
Spark executors: 4
Spark executor memory: 4608 MB
Spark executor Cores: 3
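The cluster (I) settings listed on this slide correspond roughly to standard Hadoop, Hive, and Spark configuration properties. The sketch below is only an illustration of that correspondence, not the authors' configuration files: the property names are real keys, but which keys were actually set is an assumption, as is reading "319 MB" as binary megabytes. The helpers `hive_flags` and `spark_flags` are hypothetical.

```python
# Hypothetical mapping of the slide's settings to configuration properties.
# The keys are standard Hadoop/Hive/Spark property names; which of them the
# authors actually set is an assumption.

MB = 1024 * 1024  # assumption: the slide's "MB" means binary megabytes

CLUSTER_I_SETTINGS = {
    # Mapper/Reducer/Tez memory: 1536 MB
    "mapreduce.map.memory.mb": "1536",
    "mapreduce.reduce.memory.mb": "1536",
    "hive.tez.container.size": "1536",
    # Mapper/Reducer heap space: 1024 MB
    "mapreduce.map.java.opts": "-Xmx1024m",
    "mapreduce.reduce.java.opts": "-Xmx1024m",
    # Mapper/Reducer cores: 1
    "mapreduce.map.cpu.vcores": "1",
    "mapreduce.reduce.cpu.vcores": "1",
    # Hive MapJoins: Yes, non-conditional conversion below 319 MB
    "hive.auto.convert.join": "true",
    "hive.auto.convert.join.noconditionaltask": "true",
    "hive.auto.convert.join.noconditionaltask.size": str(319 * MB),
    # Spark executors: 4, with 4608 MB and 3 cores each
    "spark.executor.instances": "4",
    "spark.executor.memory": "4608m",
    "spark.executor.cores": "3",
}


def hive_flags(settings):
    """Render the non-Spark settings as `--hiveconf key=value` arguments."""
    return [f"--hiveconf {k}={v}" for k, v in settings.items()
            if not k.startswith("spark.")]


def spark_flags(settings):
    """Render the Spark settings as `--conf key=value` arguments."""
    return [f"--conf {k}={v}" for k, v in settings.items()
            if k.startswith("spark.")]
```

For example, `spark_flags(CLUSTER_I_SETTINGS)` yields the three `--conf` arguments that could be passed to `spark-submit` to request 4 executors of 4608 MB and 3 cores each.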

Sequential runs (power)
Queries 1-30
Average of three executions of 100 GB Scale Factor

BigBench workload – power test
[Diagram: Data Generation (HDFS), Load to Hive Metastore (Hive), then Query 1, Query 2, …, Query 30 in sequence]
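The power test runs the 30 queries one after another, and the results in this deck are reported as the average of three executions. A minimal sketch of that measurement loop follows. It is not the TPCx-BB driver: `run_query` is a hypothetical callable that executes one query by number, and repeating the whole sequence three times (rather than each query three times in a row) is an assumption. Data generation and loading are left out.

```python
import time
from statistics import mean


def run_power_test(run_query, query_ids=range(1, 31), repetitions=3):
    """Run the queries sequentially and return the mean seconds per query.

    `run_query` is a hypothetical callable taking a query number; in a real
    setup it would submit the query to Hive or Spark SQL and block until done.
    """
    timings = {q: [] for q in query_ids}
    for _ in range(repetitions):
        for q in query_ids:  # one query at a time, in order 1..30
            start = time.perf_counter()
            run_query(q)
            timings[q].append(time.perf_counter() - start)
    # Average the repetitions for each query
    return {q: mean(samples) for q, samples in timings.items()}
```

Passing a stub such as `lambda q: None` for `run_query` is enough to exercise the loop without a cluster.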

Pure QL
Average of three executions using 100 GB Scale Factor

Query 12 CPU behavior
[Figure panels: Tez; Spark 1.6.2; Spark 2.0.2]
Average of three executions using 100 GB Scale Factor

Custom Reducers
Average of three executions using 100 GB Scale Factor

Query 2 CPU behavior
[Figure panels: Tez; Spark 1.6.2; Spark 2.0.2]
Average of three executions using 100 GB Scale Factor

Natural Language Processing
Average of three executions using 100 GB Scale Factor

Query 27 CPU behavior
[Figure panels: Tez; Spark 1.6.2; Spark 2.0.2]
Average of three executions using 100 GB Scale Factor

Machine Learning
Average of three executions using 100 GB Scale Factor

Query 5 CPU behavior
[Figure panels: Tez + Mahout; Tez + Spark_MLlib]
Average of three executions using 100 GB Scale Factor

Aggregated Results
Average of three executions using 100 GB Scale Factor

Scaling from 1 GB to 1 TB
Log scales

Concurrency runs (throughput)
2, 4, 8 parallel streams

BigBench workload – Throughput test
[Diagram: Data Generation (HDFS), Load Data (Hive), then parallel query streams, each with its own query order]
Stream: Query 15, Query 21, …, Query 16
Stream: Query 12, Query 18, …, Query 22
Stream: Query 16, Query 30, …, Query 29
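In the throughput test, several streams run at the same time and each stream goes through the queries in a different order, as the diagram shows. The sketch below illustrates the idea with threads. It is a simplification under stated assumptions: the per-stream orders here come from a seeded random shuffle, whereas the benchmark defines its own orderings, and `run_query` is a hypothetical callable that executes one query by number.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def stream_orders(n_streams, query_ids=range(1, 31), seed=0):
    """Return one query order per stream.

    Assumption: a seeded shuffle stands in for the benchmark's own
    per-stream query orderings.
    """
    rng = random.Random(seed)
    orders = []
    for _ in range(n_streams):
        order = list(query_ids)
        rng.shuffle(order)
        orders.append(order)
    return orders


def run_throughput_test(run_query, n_streams):
    """Run `n_streams` streams concurrently; return queries completed per stream."""
    orders = stream_orders(n_streams)

    def run_stream(order):
        for q in order:  # within a stream, queries still run one at a time
            run_query(q)
        return len(order)

    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        return list(pool.map(run_stream, orders))
```

Calling `run_throughput_test(run_query, n)` for `n` in 2, 4, and 8 mirrors the stream counts used in these runs.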

The cluster (II) – HDInsight PaaS
Model: HDInsight D4v3
# Head nodes: 2
# Working nodes: 7
# Zookeeper nodes: 3
CPU: Intel(R) Xeon(R) CPU E5-2673 v3, 8 x 2.4 GHz cores
RAM: 28 GB; 55 GB (Headnode)
HDFS: Remote
Software: HortonWorks Data Platform 2.5

Software configuration
Mapper/Reducer/Tez memory: 1536 MB
Mapper/Reducer/Tez Heap Space: 1024 MB
Mapper/Reducer/Tez Cores: 1
Hive MapJoins: Yes
Spark executors: 9
Spark executor memory: 4608 MB
Spark executor Cores: 3

Non-conditional map join conversion with small tables smaller than 319 MB
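The two cluster slides differ mainly in the number of working nodes (4 vs 7) and Spark executors (4 vs 9). The short calculation below makes the resulting Spark footprint explicit. It assumes the CPU and RAM rows (8 cores, 28 GB) describe each working node, which the slides do not state outright; the `Cluster` class is a hypothetical helper for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    worker_nodes: int
    cores_per_node: int      # assumption: the CPU row applies per working node
    ram_gb_per_node: int     # assumption: the RAM row applies per working node
    spark_executors: int
    executor_cores: int
    executor_memory_mb: int

    def total_worker_cores(self):
        return self.worker_nodes * self.cores_per_node

    def total_worker_ram_gb(self):
        return self.worker_nodes * self.ram_gb_per_node

    def spark_cores(self):
        return self.spark_executors * self.executor_cores

    def spark_memory_gb(self):
        return self.spark_executors * self.executor_memory_mb / 1024


# Values taken from the two cluster slides
CLUSTER_I = Cluster("I", 4, 8, 28, 4, 3, 4608)
CLUSTER_II = Cluster("II", 7, 8, 28, 9, 3, 4608)

for c in (CLUSTER_I, CLUSTER_II):
    print(
        f"Cluster {c.name}: Spark uses {c.spark_cores()} of "
        f"{c.total_worker_cores()} worker cores and "
        f"{c.spark_memory_gb():.1f} of {c.total_worker_ram_gb()} GB worker RAM"
    )
```

Under these assumptions, Spark executors are allotted 12 of 32 worker cores on cluster (I) and 27 of 56 on cluster (II); executor memory overhead and the driver are not counted.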

Spark vs Hive + Tez in throughput tests