MapReduce Extension Topics in Data Science Cheng Ren, Lixing Lian

MapReduce Extension - Brown University · cs.brown.edu/courses/cs195w/slides/mrextensions.pdf


Outline

• Scientific Workloads
  - Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads
• Iterative extension
  - HaLoop: Efficient Iterative Data Processing on Large Clusters
• Adaptive Indexes
  - Only Aggressive Elephants are Fast Elephants (HAIL)

Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads


HaLoop: Efficient Iterative Data Processing on Large Clusters

Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst

Outline

• Motivation
• Examples that cannot be executed perfectly
• Architecture
• Caching ideas

Motivation

• MapReduce can't express recursion/iteration
• Lots of interesting programs need loops
  - graph algorithms
  - clustering
  - machine learning
  - recursive queries (CTEs, Datalog, WITH clause)
• Dominant solution: use a driver program outside of MapReduce
• Hypothesis: making MapReduce loop-aware affords optimization
  - lays a foundation for scalable implementations of recursive languages
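The "driver program outside of MapReduce" pattern can be sketched as below. This is a single-process illustration, not Hadoop code: `run_mapreduce_job`, `driver`, and the toy halving job are hypothetical names standing in for real job submission.

```python
from collections import defaultdict

def run_mapreduce_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: group values by key
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

def driver(initial, mapper, reducer, converged, max_iterations=50):
    """External driver: the loop lives outside MapReduce, one job per pass."""
    current = initial
    for iteration in range(1, max_iterations + 1):
        result = run_mapreduce_job(current, mapper, reducer)
        if converged(current, result):         # termination check done by the driver
            return result, iteration
        current = result
    return current, max_iterations

# Toy loop body (an assumption for illustration): halve every value until nothing changes
def halve_mapper(record):
    key, value = record
    yield key, value // 2

def first_reducer(key, values):
    yield key, values[0]

final, jobs_launched = driver([("a", 8), ("b", 3)], halve_mapper, first_reducer,
                              converged=lambda old, new: old == new)
```

Every pass launches a fresh job that knows nothing about the previous one, which is the gap a loop-aware MapReduce is meant to close.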

Example 1: PageRank

PageRank Implementation on MapReduce

What's the problem?

L and Count are loop invariants, but
1. They are loaded on each iteration
2. They are shuffled on each iteration
3. Also, the fixpoint is evaluated as a separate MapReduce job per iteration
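A rough sketch of why this hurts, assuming a simplified PageRank without a damping factor. The tiny graph, `pagerank_iteration`, and the `stats` counter are illustrative assumptions; the counter only models the invariant tuples that a naive implementation would re-shuffle.

```python
from collections import defaultdict

# Loop-invariant linkage table L(source, dest); Count (out-degree) derives from it
L = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

def pagerank_iteration(R, L, stats):
    """One naive iteration: join R with L on source, then sum rank per destination."""
    # Job 1 (join): both R and the invariant L are mapped and shuffled by source URL
    by_source = defaultdict(lambda: {"rank": 0.0, "dests": []})
    for url, rank in R.items():
        by_source[url]["rank"] = rank
    for source, dest in L:
        by_source[source]["dests"].append(dest)
        stats["invariant_tuples_shuffled"] += 1   # L is re-sent on every iteration
    # Job 2 (aggregate): sum the contributions arriving at each destination
    new_R = defaultdict(float)
    for group in by_source.values():
        for dest in group["dests"]:
            new_R[dest] += group["rank"] / len(group["dests"])
    return dict(new_R)

stats = {"invariant_tuples_shuffled": 0}
R = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
for _ in range(10):
    R = pagerank_iteration(R, L, stats)
```

The cost of moving L grows linearly with the number of iterations even though L never changes.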

Example 2: Transitive Closure

Transitive Closure on MapReduce

What's the problem?

Friend is loop invariant, but
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
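The loop can be sketched in memory as a join step followed by a duplicate-elimination step. The function name, the sample Friend relation, and the `friend_loads` counter are assumptions for illustration; the counter marks each point where a plain MapReduce job would re-read and re-shuffle Friend.

```python
def transitive_closure(friend):
    """Iterative closure: each pass re-joins the newest pairs with Friend."""
    closure = set(friend)
    frontier = set(friend)
    friend_loads = 0
    while frontier:
        friend_loads += 1                       # Friend is re-read and re-shuffled here
        # Join step: (x, y) in frontier with (y, z) in Friend gives (x, z)
        joined = {(x, z) for (x, y) in frontier for (y2, z) in friend if y == y2}
        # Dedup step: keep only pairs not discovered in earlier iterations
        frontier = joined - closure
        closure |= frontier
    return closure, friend_loads

friend = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
closure, friend_loads = transitive_closure(friend)
```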

Push loops into MapReduce!

• Architecture
• Cache loop-invariant data
• Programming Model

HaLoop Architecture

Inter-iteration caching

RI: Reducer Input Cache

• Provides:
  - Access to loop-invariant data without map/shuffle
• Data for:
  - Reducer function
• Assumes:
  1. Static partitioning (implies: no new nodes)
  2. Deterministic mapper implementation
• PageRank
  - Avoid loading and shuffling the web graph at every iteration
• Transitive Closure
  - Avoid loading and shuffling the friends graph at every iteration
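A minimal sketch of the idea, not the HaLoop implementation: `ReducerInputCache`, `iterate`, and the byte-sum partitioner are hypothetical. The invariant data is shuffled once on the first iteration; afterwards only the changing ranks move.

```python
from collections import defaultdict

class ReducerInputCache:
    """Hypothetical per-reducer cache of loop-invariant reducer input."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = [defaultdict(list) for _ in range(num_partitions)]
        self.filled = False

    def partition_of(self, key):
        # Static partitioning: a key must map to the same reducer on every iteration
        return sum(key.encode()) % self.num_partitions

    def fill(self, invariant_pairs):
        """First iteration only: shuffle the invariant data once, then keep it."""
        for key, value in invariant_pairs:
            self.partitions[self.partition_of(key)][key].append(value)
        self.filled = True
        return len(invariant_pairs)            # tuples shuffled to build the cache

    def lookup(self, key):
        return self.partitions[self.partition_of(key)].get(key, [])

def iterate(ranks, links, cache):
    """One PageRank-style pass; returns (new_ranks, invariant tuples shuffled)."""
    shuffled = 0 if cache.filled else cache.fill(links)
    new_ranks = defaultdict(float)
    for url, rank in ranks.items():            # only the variant data moves
        dests = cache.lookup(url)
        for dest in dests:
            new_ranks[dest] += rank / len(dests)
    return dict(new_ranks), shuffled

links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
cache = ReducerInputCache(num_partitions=2)
ranks = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
shuffled_per_iteration = []
for _ in range(3):
    ranks, shuffled = iterate(ranks, links, cache)
    shuffled_per_iteration.append(shuffled)
```

The two assumptions on the slide show up directly: `partition_of` must not change between iterations, and the mapper must emit the same invariant tuples each time, otherwise the cached copy would be stale.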

RO: Reducer Output Cache

• Provides:
  - Distributed access to output of previous iterations
• Used by:
  - Fixpoint evaluation
• Assumes:
  1. Partitioning constant across iterations
  2. Reducer output key functionally determines reducer input key
• PageRank
  - Allows distributed fixpoint evaluation
  - Obviates extra MapReduce job
• Transitive Closure
  - No help
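Distributed fixpoint evaluation might look like the sketch below, where each reducer partition compares its new output against its own cached previous output. `ReducerOutputCache`, `reached_fixpoint`, and the absolute-difference distance are illustrative assumptions.

```python
class ReducerOutputCache:
    """Hypothetical cache of one reducer partition's output from the last iteration."""
    def __init__(self):
        self.previous = {}

    def local_distance(self, current):
        """Compare this partition's new output with its cached previous output."""
        keys = set(current) | set(self.previous)
        distance = sum(abs(current.get(k, 0.0) - self.previous.get(k, 0.0)) for k in keys)
        self.previous = dict(current)          # becomes the cache for the next pass
        return distance

def reached_fixpoint(partition_outputs, caches, threshold):
    """Sum per-partition distances; no extra job has to read both iterations."""
    total = sum(cache.local_distance(out) for cache, out in zip(caches, partition_outputs))
    return total <= threshold

caches = [ReducerOutputCache(), ReducerOutputCache()]
first = reached_fixpoint([{"a": 0.5}, {"b": 0.5}], caches, threshold=0.01)
second = reached_fixpoint([{"a": 0.5}, {"b": 0.5}], caches, threshold=0.01)
```

This only works if a key always lands in the same partition, which is what the two assumptions on the slide guarantee.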

MI: Mapper Input Cache

• Provides:
  - Access to non-local mapper input on later iterations
• Data for:
  - Map function
• Assumes:
  - Mapper input does not change
- Avoids non-local data reads on iterations > 0

Programming Model

• Mapper/reducer stay the same!
• Touch points
  - Input/Output: for each <iteration, step>
  - Cache filter: which tuple to cache?
  - Distance function: optional
• Nested job containing child jobs as loop body
• Minimize extra programming efforts
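The touch points could be gathered into a job description along these lines. This is a hypothetical sketch and not the actual HaLoop API: `Step`, `LoopJob`, `plan`, and the path scheme are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    """One child job of the loop body; mapper and reducer are ordinary functions."""
    mapper: Callable
    reducer: Callable

@dataclass
class LoopJob:
    """Hypothetical loop-aware job description (not the actual HaLoop API)."""
    steps: List[Step]                                  # loop body, run in order
    input_for: Callable[[int, int], str]               # (iteration, step) -> input path
    output_for: Callable[[int, int], str]              # (iteration, step) -> output path
    cache_filter: Callable[[tuple], bool] = lambda record: False  # which tuples to cache
    distance: Optional[Callable] = None                # optional fixpoint measure
    max_iterations: int = 10

    def plan(self, iterations):
        """List the (iteration, step, input, output) tuples a scheduler would run."""
        return [(i, s, self.input_for(i, s), self.output_for(i, s))
                for i in range(iterations) for s in range(len(self.steps))]

def output_for(i, s):
    return f"out/iter{i}/step{s}"

def input_for(i, s):
    if s > 0:
        return output_for(i, s - 1)            # read the previous step's output
    if i > 0:
        return output_for(i - 1, 1)            # read the last step of the previous pass
    return "input/ranks"                       # initial input (made-up path)

def identity_mapper(record):
    return [record]

def identity_reducer(key, values):
    return values

job = LoopJob(steps=[Step(identity_mapper, identity_reducer),
                     Step(identity_mapper, identity_reducer)],
              input_for=input_for, output_for=output_for,
              cache_filter=lambda record: record[0] == "L")
plan = job.plan(2)
```

The mapper and reducer functions are untouched; only the wiring around them carries the loop.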

Conclusions

• Relatively simple changes to MapReduce/Hadoop can support iterative/recursive programs
  - TaskTracker (cache management)
  - Scheduler (cache awareness)
  - Programming model (multi-step loop bodies, cache control)
• Optimizations
  - Caching reducer input realizes the largest gain
  - Good to eliminate extra MapReduce step for termination checks
  - Mapper input cache benefit inconclusive; need a busier cluster

Only Aggressive Elephants are Fast Elephants

Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad

Outline

• Motivation
• Comparison between Hadoop and HAIL
• Upload pipeline
• Query pipeline

Bob

• Bob analyzes a large web log by applying filter conditions (source IP, web address).
• He uses a sequence of different filter conditions, each one triggering a new MapReduce job.
• He is not exactly sure what he is looking for.
• “Let’s see what I am going to encounter on the way.”

Bob

• This kind of use case illustrates an exploratory usage of Hadoop MapReduce.
• It is a major use case of Hadoop MapReduce.
  - One major problem: slow query runtimes.
  - Time is dominated by the I/O for reading all input data.

Hadoop Aggressive Indexing Library

HDFS + MapReduce vs. HAIL + MapReduce

HDFS + MapReduce

HDFS

[Figure labels: horizontal partitions; HDFS blocks, 64 MB (default); datanodes]

Allows two failovers

MapReduce

map(row) -> set of (ikey, value)

map(docID, document) -> set of (term, docID)
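The second signature is the map side of an inverted index. A small in-memory sketch follows; the reduce side and the helper names `map_document`, `reduce_term`, and `build_inverted_index` are assumptions, since the slide only gives the map signature.

```python
from collections import defaultdict

def map_document(doc_id, document):
    """map(docID, document) -> set of (term, docID)."""
    return {(term, doc_id) for term in document.lower().split()}

def reduce_term(term, doc_ids):
    """Collect the sorted posting list for one term (assumed reduce side)."""
    return term, sorted(set(doc_ids))

def build_inverted_index(documents):
    postings = defaultdict(list)
    for doc_id, document in documents.items():
        for term, d in map_document(doc_id, document):
            postings[term].append(d)            # shuffle: group docIDs by term
    return dict(reduce_term(t, ds) for t, ds in postings.items())

index = build_inverted_index({1: "fast elephants", 2: "aggressive elephants"})
```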

HAIL + MapReduce

HAIL

[Figure labels: horizontal partitions; HDFS blocks, 64 MB (default)]

HAIL

1. Convert the input file into binary PAX
2. Create a series of different sort orders
3. Create multiple clustered indexes
- If indexes cannot help, fall back to standard Hadoop scanning.
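The three steps and the fallback can be sketched as follows, under heavy simplification: Python lists stand in for the binary PAX layout, a sorted column plus binary search stands in for a clustered index, and only two replicas are built. All function names and the sample log rows are hypothetical.

```python
from bisect import bisect_left, bisect_right

def to_pax(rows, columns):
    """Row-wise records -> PAX-style layout: one list per column inside the block."""
    return {col: [row[i] for row in rows] for i, col in enumerate(columns)}

def build_replica(rows, columns, sort_column):
    """One replica: rows sorted on sort_column, then stored column-wise."""
    key = columns.index(sort_column)
    ordered = sorted(rows, key=lambda r: r[key])
    return {"sorted_on": sort_column, "pax": to_pax(ordered, columns)}

def query(replicas, columns, column, low, high):
    """Return rows with low <= row[column] <= high, using an index when one fits."""
    for replica in replicas:
        if replica["sorted_on"] == column:
            values = replica["pax"][column]
            hits = range(bisect_left(values, low), bisect_right(values, high))
            used = "index"
            break
    else:
        # No replica is sorted on this column: fall back to a full scan
        replica = replicas[0]
        values = replica["pax"][column]
        hits = [i for i, v in enumerate(values) if low <= v <= high]
        used = "scan"
    rows = [tuple(replica["pax"][c][i] for c in columns) for i in hits]
    return sorted(rows), used

columns = ["source_ip", "url", "duration"]
rows = [("10.0.0.2", "/b", 7), ("10.0.0.1", "/c", 3), ("10.0.0.3", "/a", 5)]
replicas = [build_replica(rows, columns, c) for c in ("source_ip", "url")]
by_ip, how_ip = query(replicas, columns, "source_ip", "10.0.0.1", "10.0.0.2")
by_duration, how_duration = query(replicas, columns, "duration", 4, 9)
```

Because each replica holds the same rows in a different order, a filter on either indexed column finds a replica that answers it with a range read.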

HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.

HAIL Upload Pipeline

HAIL Upload Pipeline

Why Clustered Indexes?
- Unclustered indexes are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals.
- Clustered indexes do not have that problem. Whatever the selectivity, we read the clustered index and scan the qualifying blocks.
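A back-of-the-envelope model of the difference, counting how many blocks must be read. The even spreading of matches for the unclustered case is an assumed worst case, and `blocks_touched` and `compare` are illustrative names.

```python
def blocks_touched(row_positions, rows_per_block):
    """Number of distinct blocks that must be read to fetch the given row positions."""
    return len({pos // rows_per_block for pos in row_positions})

def compare(num_rows, rows_per_block, matches):
    """Blocks read for `matches` qualifying rows under both index types."""
    # Clustered: data is sorted on the key, so qualifying rows are adjacent
    clustered = blocks_touched(range(matches), rows_per_block)
    # Unclustered (assumed worst case): qualifying rows are spread evenly over the file
    stride = num_rows // matches
    unclustered = blocks_touched(range(0, stride * matches, stride), rows_per_block)
    return clustered, unclustered

selective = compare(num_rows=10_000, rows_per_block=100, matches=5)
non_selective = compare(num_rows=10_000, rows_per_block=100, matches=2_000)
```

In this model the unclustered index touches every block once the query stops being selective, and it does so with random reads, while the clustered layout reads one contiguous run.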

HAIL Upload Pipeline

Semantics of an ACK for a packet of a block are changed
  from “packet received, validated, and flushed”
  to “packet received and validated”.

In parallel to forwarding and reassembling packets, each datanode sorts the data, creates indexes, and forms a HAIL block.

HAIL Upload Pipeline

Enrich the HDFS namenode to schedule map tasks close to replicas having a suitable index
• Dir_rep mapping: (blockID, datanode) → HAILBlockReplicaInfo
• HAILBlockReplicaInfo contains detailed information about the types of available indexes
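A sketch of how such a mapping could drive task placement. The single `indexed_column` field, the `pick_datanode` helper, and the datanode names are assumptions; the slide only says that the replica info describes the available index types.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class HAILBlockReplicaInfo:
    """Per-replica index metadata; the single field here is an illustrative assumption."""
    indexed_column: Optional[str]   # column of the clustered index, None if unindexed

# Dir_rep: (blockID, datanode) -> HAILBlockReplicaInfo
DirRep = Dict[Tuple[int, str], HAILBlockReplicaInfo]

def pick_datanode(dir_rep: DirRep, block_id: int, filter_column: str) -> Tuple[str, bool]:
    """Choose where to run the map task for a block; prefer a replica whose index fits."""
    candidates = sorted(((node, info) for (b, node), info in dir_rep.items() if b == block_id),
                        key=lambda c: c[0])
    if not candidates:
        raise KeyError(f"no replica registered for block {block_id}")
    for node, info in candidates:
        if info.indexed_column == filter_column:
            return node, True                  # an index scan is possible on this node
    return candidates[0][0], False             # no suitable index: full scan

dir_rep = {
    (42, "dn1"): HAILBlockReplicaInfo("source_ip"),
    (42, "dn2"): HAILBlockReplicaInfo("url"),
    (42, "dn3"): HAILBlockReplicaInfo("duration"),
}
with_index = pick_datanode(dir_rep, 42, "url")
without_index = pick_datanode(dir_rep, 42, "agent")
```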

HAIL Query Pipeline

Upload Time

Query Times

Individual Jobs: Weblog

Fast Indexing and Fast Querying

Questions? Ask Bob.