
Performance Evaluation of Cloudera Impala GA


Page 1: Performance Evaluation of Cloudera Impala GA

Performance Evaluation of Cloudera Impala 1.0

May 1, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA (@sudabon)

Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

Page 2: Performance Evaluation of Cloudera Impala GA

Cloudera Impala GA was released!!

• Support for a subset of ANSI-92 SQL
  – CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries
• Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
• Support for a variety of data formats:
  – Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed)
  – text (uncompressed or LZO-compressed)
  – Parquet (Snappy or uncompressed)
• Support for all CDH4 64-bit packages:
  – RHEL 6.2/5.7, Ubuntu, Debian, SLES
• Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
• Kerberos authentication and MR/Impala resource isolation
• etc.

Page 3: Performance Evaluation of Cloudera Impala GA

Our System Environment

• Installed using Cloudera Manager Free Edition 4.5.2
• All servers are connected with 1Gbps Ethernet through an L2 switch
• Master: 3 servers
  – Active NameNode
  – Stand-by NameNode
  – JobTracker, statestored
• Slave: 11 servers, each running
  – DataNode, TaskTracker, Impalad

Page 4: Performance Evaluation of Cloudera Impala GA

Our “wimpy” Server Specification

• CPU
  – Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
  – 4GB
• Disk
  – 7,200 rpm SATA mechanical Hard Disk Drive x 1
• OS
  – CentOS 6.2

Page 5: Performance Evaluation of Cloudera Impala GA

Benchmark

• Use CDH4.2.1 + Impala version 1.0
• Use hivebench from the open-source benchmark tool “HiBench”
  – https://github.com/hibench
• Modified the datasets to 1/10 scale
  – The default configuration generates a table with 1 billion rows
• Modified the query
  – Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combined several storage formats with several compression methods
  – TextFile, SequenceFile, RCFile, ParquetFile
  – No compression, Gzip, Snappy
• Comparison by job (query) latency
• Average job latency over 5 measurements
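The "average job latency over 5 measurements" metric can be sketched as below. This is a minimal illustration rather than the harness used for these slides: `average_latency` and the `run_query` callable are hypothetical names, and the stand-in workload only exists so the sketch runs without a cluster.

```python
import statistics
import time
from typing import Callable


def average_latency(run_query: Callable[[], object], runs: int = 5) -> float:
    """Call run_query `runs` times and return the mean wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submits the query and blocks until it finishes
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


# Stand-in workload (an assumption); replace it with a real query submission.
mean_sec = average_latency(lambda: sum(range(100_000)), runs=5)
print(f"average latency: {mean_sec:.6f} sec")
```

In a real run, `run_query` would wrap whatever client submits the query (for example the command-line shell) and waits for the result.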

Page 6: Performance Evaluation of Cloudera Impala GA

Modified Datasets

• Uservisits table
  – 100 million rows
  – 16,895 MB as TextFile
  – Table Definitions
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int
• Rankings table
  – 12 million rows
  – 744 MB as TextFile
  – Table Definitions
    • pageURL string
    • pageRank int
    • avgDuration int
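As a rough sanity check on the dataset sizes, the sketch below derives the 1/10-scale row count and the average row width of each table. The row counts and sizes come from the slide; reading "MB" as 10**6 bytes is an assumption, and the helper name `bytes_per_row` is made up for this illustration.

```python
# Figures from the slide. "MB" is assumed to mean 10**6 bytes.
DEFAULT_USERVISITS_ROWS = 1_000_000_000  # default HiBench configuration
SCALE_DIVISOR = 10                       # datasets were reduced to 1/10 scale

DATASETS = {
    "uservisits": {"rows": 100_000_000, "size_mb": 16_895},
    "rankings": {"rows": 12_000_000, "size_mb": 744},
}


def bytes_per_row(rows: int, size_mb: float) -> float:
    """Average TextFile bytes per row, under the MB = 10**6 bytes assumption."""
    return size_mb * 1_000_000 / rows


scaled_rows = DEFAULT_USERVISITS_ROWS // SCALE_DIVISOR
print(f"uservisits rows at 1/10 scale: {scaled_rows:,}")
for name, d in DATASETS.items():
    print(f"{name}: about {bytes_per_row(d['rows'], d['size_mb']):.0f} bytes per row")
```

Under that assumption a Uservisits row averages about 169 bytes and a Rankings row about 62 bytes.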

Page 7: Performance Evaluation of Cloudera Impala GA

Modified Query

SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM rankings_t R
JOIN (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_t UV
  WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
    AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
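The WHERE clause of the query keeps visits dated from 1999-01-01 through 2000-01-01, both ends included. The sketch below mirrors that predicate in Python, assuming `datediff(enddate, startdate)` returns the number of days from the second argument to the first, as Hive documents it. The Python function names are illustrative only.

```python
from datetime import date


def datediff(end: str, start: str) -> int:
    """Days from `start` to `end`, mirroring datediff(enddate, startdate)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def in_benchmark_range(visit_date: str) -> bool:
    """Same predicate as the query's WHERE clause."""
    return (datediff(visit_date, "1999-01-01") >= 0
            and datediff(visit_date, "2000-01-01") <= 0)


for d in ("1998-12-31", "1999-01-01", "1999-06-15", "2000-01-01", "2000-01-02"):
    print(d, in_benchmark_range(d))
```

Because visitDate is stored as a string, the filter relies on the dates being in a format that datediff can parse.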

Page 8: Performance Evaluation of Cloudera Impala GA

Benchmark Result (Hive)
cited from “Performance evaluation of Cloudera impala 0.6 beta...”

Bar chart: Avg. Job Latency [sec], axis from 0 to 250

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      235.843
SequenceFile     Gzip          227.883
SequenceFile     Snappy        213.616
RCFile           Gzip          234.289
RCFile           Snappy        197.894

Page 9: Performance Evaluation of Cloudera Impala GA

Benchmark Result (Impala)

Bar chart: Avg. Job Latency [sec], axis from 0 to 250

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      36.61
SequenceFile     Gzip          29.736
SequenceFile     Snappy        24.024
RCFile           Gzip          26.083
RCFile           Snappy        19.586
ParquetFile      Snappy        16.2
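To put the two result charts side by side, the sketch below divides each Hive latency by the Impala latency for the same storage format and compression. The pairing of values to labels follows the order in which they appear in the charts, which is an assumption (only the Parquet + Snappy figure is confirmed elsewhere in the slides). The Hive figures are cited from the earlier 0.6 beta evaluation, so treat the ratios as indicative.

```python
# Latencies in seconds, paired with labels in chart order (an assumption).
HIVE = {
    ("TextFile", "No Comp."): 235.843,
    ("SequenceFile", "Gzip"): 227.883,
    ("SequenceFile", "Snappy"): 213.616,
    ("RCFile", "Gzip"): 234.289,
    ("RCFile", "Snappy"): 197.894,
}
IMPALA = {
    ("TextFile", "No Comp."): 36.61,
    ("SequenceFile", "Gzip"): 29.736,
    ("SequenceFile", "Snappy"): 24.024,
    ("RCFile", "Gzip"): 26.083,
    ("RCFile", "Snappy"): 19.586,
    ("ParquetFile", "Snappy"): 16.2,  # no Hive counterpart in the slides
}


def speedups(hive: dict, impala: dict) -> dict:
    """Hive latency divided by Impala latency for combinations present in both."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


for (fmt, comp), ratio in speedups(HIVE, IMPALA).items():
    print(f"{fmt:<13} {comp:<9} {ratio:5.2f}x")
```

Under that pairing, Impala comes out roughly 6 to 10 times faster than Hive on the combinations both engines ran.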

Page 10: Performance Evaluation of Cloudera Impala GA

Additional Experiments

• Exchange the order of the JOINed tables, as below

SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_ps UV
  WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
    AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
JOIN rankings_ps R
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;

• Result
  – Parquet compressed as Snappy: 34.374 sec
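With the tables swapped, the query took 34.374 sec against 16.2 sec for the original order, roughly 2.1 times as long. The slides do not explain the gap. One common explanation, given here purely as an assumption, is that a hash join builds its in-memory table from one input (the right-hand one in this toy) and streams the other through it, so the choice of build side changes memory use and cost. The sketch below is a generic hash join, not Impala's implementation, and the sample rows are made up.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Inner join: build a hash table on `right`, then probe it with `left`.

    Returns (joined_rows, build_rows), where build_rows is the number of rows
    that had to be held in the in-memory table.
    """
    table = defaultdict(list)
    for row in right:
        table[row[right_key]].append(row)
    joined = [{**l, **r} for l in left for r in table.get(l[left_key], [])]
    return joined, len(right)


# Made-up sample rows shaped like the benchmark tables.
rankings = [{"pageURL": "a", "pageRank": 1}, {"pageURL": "b", "pageRank": 2}]
visits = [
    {"destURL": "a", "adRevenue": 1.0},
    {"destURL": "a", "adRevenue": 2.0},
    {"destURL": "c", "adRevenue": 3.0},
    {"destURL": "d", "adRevenue": 4.0},
]

rows_1, build_1 = hash_join(rankings, visits, "pageURL", "destURL")
rows_2, build_2 = hash_join(visits, rankings, "destURL", "pageURL")
print(len(rows_1), build_1)  # same matches, larger build side
print(len(rows_2), build_2)  # same matches, smaller build side
```

Both orders return the same matches; only the size of the build side differs, which is why it is worth measuring both orders on real data.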

Page 11: Performance Evaluation of Cloudera Impala GA

Conclusion

• Parquet + Snappy is the fastest
• Specifically,
  – ParquetFile compressed as Snappy: 16.2 sec
• Need to take care with the order of JOINed tables
• Hopes for future extensions
  – UDF support
  – Window functions
  – etc.

Page 12: Performance Evaluation of Cloudera Impala GA

Let's try it out in your environment!!
Thanks!