Bridging Big Data and Data Science Using Scalable Workflows | WorDS.sdsc.edu
ILKAY ALTINTAS, Ph.D. ([email protected])
Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego




Page 1: Bridging Big Data and Data Science Using Scalable Workflows

Bridging Big Data and Data Science Using Scalable Workflows

WorDS.sdsc.edu

ILKAY ALTINTAS, Ph.D. ([email protected])

Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego

Page 2: Bridging Big Data and Data Science Using Scalable Workflows

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education

• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”

(Photos: SDSC in 1985 and today)

Page 3: Bridging Big Data and Data Science Using Scalable Workflows

     

Workflows for Data Science Center

• Scientific Workflow Automation Technologies Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services

Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.

Page 4: Bridging Big Data and Data Science Using Scalable Workflows

Why Data Science Workflows?

“You've got to think about big things while you're doing small things, so that all the small things go in the right direction.” – Alvin Toffler

use cases => purpose and value

Page 5: Bridging Big Data and Data Science Using Scalable Workflows

So, what is a workflow?

Shop → Prepare → Cook → Store

Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors

Page 6: Bridging Big Data and Data Science Using Scalable Workflows

Let’s make pasta this evening!

Shop → Prepare → Cook → Store

(Timings from the slide diagram: 30 minutes, 30 minutes, 15 minutes, 3 minutes, 15 minutes, 3 minutes)

Page 7: Bridging Big Data and Data Science Using Scalable Workflows

How to Cook Everything Fast

“How to Cook Everything Fast is a book of kitchen innovations. Time management—the essential principle of fast cooking—is woven into revolutionary recipes that do the thinking for you. You’ll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read—and let the recipes guide you quickly and easily toward a delicious result.”

Image and quote source: amazon.com

Page 8: Bridging Big Data and Data Science Using Scalable Workflows

What if you have more than one cook?

Page 9: Bridging Big Data and Data Science Using Scalable Workflows

MAP

• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie

Page 10: Bridging Big Data and Data Science Using Scalable Workflows

REDUCE

• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, using veggie type as the key
• Output: a bowl of veggies per veggie kind
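The chop-and-combine picture on these two slides maps directly onto MapReduce. A minimal sketch in plain Python (the data and UDF names are invented for illustration; no Hadoop involved):

```python
from collections import defaultdict

def map_chop(veggie):
    """UDF for the map phase: chop one veggie into pieces."""
    kind, count = veggie
    return [(kind, "piece") for _ in range(count)]

def reduce_combine(mapped):
    """UDF for the reduce phase: gather pieces keyed by veggie kind."""
    bowls = defaultdict(list)
    for kind, piece in mapped:
        bowls[kind].append(piece)
    return dict(bowls)

veggies = [("carrot", 3), ("onion", 2), ("celery", 4)]
# Map: each cook chops independently, so this loop could run in parallel.
mapped = [pair for v in veggies for pair in map_chop(v)]
# Reduce: combine by key, yielding one bowl per veggie kind.
bowls = reduce_combine(mapped)
```

The shuffle that a real MapReduce engine performs between the phases is implicit here in the grouping done by `reduce_combine`.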

Page 11: Bridging Big Data and Data Science Using Scalable Workflows

Thanksgiving dinner preparation: more planning and tasks?

Menu Item       | Preparation Time | Cooking Time | Cooling Time
Turkey          | 30 minutes       | 4 hours      | 15 minutes
Veggies         | 30 minutes       | 45 minutes   | None
Cranberry Sauce | 5 minutes        | 30 minutes   | 2 hours
Soup            | 20 minutes       | 30 minutes   | None
Pie             | 30 minutes       | 5 minutes    | 1 day

• When do you start cooking?
• In what order do you cook?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
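The planning questions above can be made concrete. A hedged sketch, assuming the goal is simply that every dish is ready at dinner time (it ignores oven contention and who cooks what):

```python
from datetime import datetime, timedelta

# Menu from the table above: dish -> (prep, cook, cool) in minutes.
MENU = {
    "Turkey":          (30, 240, 15),
    "Veggies":         (30, 45, 0),
    "Cranberry Sauce": (5, 30, 120),
    "Soup":            (20, 30, 0),
    "Pie":             (30, 5, 24 * 60),
}

def start_times(dinner):
    """Latest start time per dish so everything is ready at `dinner`."""
    return {
        dish: dinner - timedelta(minutes=sum(steps))
        for dish, steps in MENU.items()
    }

dinner = datetime(2015, 11, 26, 18, 0)  # hypothetical dinner time
plan = start_times(dinner)
# The pie's day-long cooling forces it onto the previous day; the rest
# can run in parallel that afternoon, staffing permitting.
```

This answers "when do you start?" per dish; ordering and parallelism then come from comparing the start times, exactly the scheduling a workflow engine automates.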

Page 12: Bridging Big Data and Data Science Using Scalable Workflows

Data Science Workflows: Programmable and Reproducible Scalability

• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize

Examples:
• Real-Time Hazards Management: wifire.ucsd.edu
• Data-Parallel Bioinformatics: bioKepler.org
• Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu

kepler-project.org | WorDS.sdsc.edu

Page 13: Bridging Big Data and Data Science Using Scalable Workflows

Why scalable and reproducible data science?

Page 14: Bridging Big Data and Data Science Using Scalable Workflows

The Big Picture is Supporting the Scientist

From “Napkin Drawings” to Executable Workflows: Conceptual SWF → Executable SWF

Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS

Page 15: Bridging Big Data and Data Science Using Scalable Workflows

The Big Picture is Supporting the Data Scientist

From “Napkin Drawings” to Executable Workflows: Conceptual SWF → Executable SWF

SBNL workflow (Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning):
• Quality Evaluation & Data Partitioning of Big Data
• Data Quality Evaluation
• Local Learner and Local Ensemble Learning
• Master Learner and Master Ensemble Learning
• Final BN Structure

Page 16: Bridging Big Data and Data Science Using Scalable Workflows

Kepler is a Scientific Workflow System

• Ptolemy II: a laboratory for investigating design
• KEPLER: a problem-solving environment for scientific workflows; KEPLER = “Ptolemy II + X” for scientific workflows
• A cross-project collaboration, initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework

www.kepler-project.org

Page 17: Bridging Big Data and Data Science Using Scalable Workflows

A Toolbox with Many Tools

• Data: search, database access, IO operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance

Need expertise to identify which tool to use, when, and how! Computation models are required to schedule and optimize execution!

Page 18: Bridging Big Data and Data Science Using Scalable Workflows

So, how does this relate to data science and big data?

Page 19: Bridging Big Data and Data Science Using Scalable Workflows

Workflows integrate data science building blocks!

Toolboxes with many tools for:
• data access,
• analysis,
• execution,
• fault tolerance,
• provenance tracking,
• reporting,
• ...

(Spanning Business Analysis and Operations Research. Adapted from: B. Tierney, 2013)

Page 20: Bridging Big Data and Data Science Using Scalable Workflows

Data Scientist Skill Set

http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/

Page 21: Bridging Big Data and Data Science Using Scalable Workflows

Unicorn?

http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

Page 22: Bridging Big Data and Data Science Using Scalable Workflows

Solution: Scale Your Data Scientists

Standardize the data science process, not the tools!

Standardized processes enable data scientists to communicate with business and programming partners.

Also, what these definitions really mean is “computational and data scientists”.

Page 23: Bridging Big Data and Data Science Using Scalable Workflows

Conceptualizing a Computational Data Science Workflow

Page 24: Bridging Big Data and Data Science Using Scalable Workflows

1: Start with the Workflow as a Blackbox

• Treat the whole workflow as a blackbox
  – What is the use case/application?
  – What is the science question this workflow is solving?
  – What is the input data?
  – What are the expected outcomes?
• Give the workflow a title based on the initial assessment!

Diagram: Input data → My workflow (f) → Outputs
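A minimal sketch of this blackbox view in Python (the names and composition scheme here are illustrative, not Kepler's API): the whole workflow is one function f from input data to outputs, titled for the question it answers.

```python
from typing import Callable, Sequence

Step = Callable[[object], object]

def blackbox(title: str, steps: Sequence[Step]) -> Step:
    """Compose steps into one named function: input data -> outputs."""
    def f(data):
        for step in steps:
            data = step(data)
        return data
    f.__name__ = title  # the workflow's title from the initial assessment
    return f

# Hypothetical use: at this stage only the inputs, the expected
# outcomes, and the title matter; the internal steps are stand-ins.
estimate = blackbox("estimate_average_genome_size",
                    [lambda d: d.lower(), lambda d: len(d)])
result = estimate("ACGT" * 10)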

Page 25: Bridging Big Data and Data Science Using Scalable Workflows

2: Conceptualization of Scientific Steps

Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS

Bake Turkey: …, Prepare, Cook, …
Bake Pie: …, Cook, Chill, …
Make Side Dishes: …, Make Cranberry Sauce, Cut Veggies, Prepare Stuffing, …

Page 26: Bridging Big Data and Data Science Using Scalable Workflows

3: Treat Each Step Like a Workflow (until you reach an atomic functional step)

SHOP: Find data, Access data, Acquire data, Move data
PREPARE: Clean data, Integrate data, Subset data, Pre-process data
COOK: Analyze data, Process data
STORE: Interpret results, Summarize results, Visualize results, Post-process results

Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?

=> configurable automated analysis
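One way to read "configurable automated analysis": each atomic step exposes parameters, and the workflow fixes only the order. A hypothetical Python sketch (step names and parameters invented for illustration):

```python
from functools import partial

def subset(data, keep):
    """PREPARE-style step: keep only the requested fields of each record."""
    return [{k: rec[k] for k in keep} for rec in data]

def analyze(data, field):
    """COOK-style step: a stand-in analysis function (here, an average)."""
    values = [rec[field] for rec in data]
    return sum(values) / len(values)

# Parameters customize each step; the pipeline itself stays fixed, so the
# same analysis can be re-run with different settings and stays reproducible.
pipeline = [partial(subset, keep=["size"]), partial(analyze, field="size")]

data = [{"size": 10, "name": "a"}, {"size": 30, "name": "b"}]
for step in pipeline:
    data = step(data)
# data now holds the analysis result
```

Swapping `keep` or `field` answers the "what are the parameters to customize each step?" question without touching the step implementations.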

Page 27: Bridging Big Data and Data Science Using Scalable Workflows

4: Start Building Each Step, Including the Alternatives

• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages

Cycle: Build → Explore → Scale → Report

Page 28: Bridging Big Data and Data Science Using Scalable Workflows

Running on Heterogeneous Computing Resources
(Execution of models where they run most efficiently)

• Local: NBCR Cluster Resources
• NSF/DOE: TeraScale Resources (XSEDE): Gordon, Trestles, Stampede, Lonestar
• Private Cluster: User-Owned Resources

Different models have different computing architecture needs! e.g., memory-intensive, compute-intensive, I/O-intensive

Page 29: Bridging Big Data and Data Science Using Scalable Workflows

5: Save and Share Reports and Final Products with your Team

• The data scientist is in the middle, bridging the gap between business and development => so the data scientist defines the business value and the steps to achieve the results as a workflow
• Developers/computer scientists use their favorite tools to implement the methods in the workflow
• The process is kept sharable, standardized, scalable and accountable

Page 30: Bridging Big Data and Data Science Using Scalable Workflows

WorDS: Simple and Scalable Big Data Solutions using Workflows

Focus on the use case, not the technology!

• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science

Page 31: Bridging Big Data and Data Science Using Scalable Workflows

Using Big Data Computing in Bioinformatics
(Improving Programmability, Scalability and Reproducibility)

biokepler.org

Page 32: Bridging Big Data and Data Science Using Scalable Workflows

A coordinated ecosystem of biological and technological packages for bioinformatics!

Stack (top to bottom):
• Gateways and other user environments
• bioKepler, with the Kepler and Provenance Framework
• BioLinux, Galaxy, CloVR, Hadoop
• Cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE

www.bioKepler.org

Page 33: Bridging Big Data and Data Science Using Scalable Workflows

Status of bioActors

500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.

Page 34: Bridging Big Data and Data Science Using Scalable Workflows

Using Workflows and Cyberinfrastructure for Wildfire Resilience
(A Scalable Data-Driven Monitoring and Dynamic Prediction Approach)

wifire.ucsd.edu

Page 35: Bridging Big Data and Data Science Using Scalable Workflows

Fire is Part of the Natural Ecology … but requires Monitoring, Prediction and Resilience

• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the last century
• Changes in rainfall, wind, seasons, and thus wildfires, are potentially induced by climate change
• Better prevention, prediction and maintenance of wildfires is needed

Photo of Harris Fire (2007) by former Fire Captain Bill Clayton

Disaster management of (ongoing) wildfires heavily relies on understanding their Direction and Rate of Spread (RoS).

Page 36: Bridging Big Data and Data Science Using Scalable Workflows

Fire Data Today

Decision making for wildfire fighting and disaster management is based on heterogeneous data: satellite data, wildfire perimeter, wind, vegetation, terrain.

Photograph by Mark Thiessen

Page 37: Bridging Big Data and Data Science Using Scalable Workflows

What is lacking in disaster management today is a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers … before, during and after a firestorm.

Page 38: Bridging Big Data and Data Science Using Scalable Workflows

A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)

Development of “cyberinfrastructure” for “analysis of large-dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire.

Page 39: Bridging Big Data and Data Science Using Scalable Workflows

Data to Modeling in WIFIRE

Real-time remote sensor data → modeling, data assimilation and dynamic wildfire behavior prediction

Page 40: Bridging Big Data and Data Science Using Scalable Workflows

WIFIRE System Integration

System integration of sensor data, data assimilation, dynamic models, and fire direction and RoS predictions (computations) is based on Scientific and Engineering Workflows (Kepler):
• Visual programming
• Scalable parallel execution
• Standardized data interfaces
• Reuse and reproducibility

Page 41: Bridging Big Data and Data Science Using Scalable Workflows

Training and Consulting Services in the WorDS Center

• Ongoing programs for workflow bootcamps and hackathons
• Technology briefings for industrial partners
• Industry labs for undergraduate student researchers
• Consulting projects on workflow technologies

Page 42: Bridging Big Data and Data Science Using Scalable Workflows

To Sum Up

• Workflows and provenance are well adopted in scientific data science infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  – Many diverse environments and requirements
  – Need to orchestrate at a higher level
  – Higher-level programming components for each domain
• Lots of future challenges:
  – Optimized execution on heterogeneous platforms
  – Increasing reuse within and across application domains
  – Querying and integration of workflow provenance data

Page 43: Bridging Big Data and Data Science Using Scalable Workflows

Questions?

WorDS Director: Ilkay Altintas, Ph.D.
Email: [email protected]