©Eagle Genomics Ltd.
Pistoia Alliance Sequence Squeeze
Using a competition model to spur development of novel open-source algorithms
Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)
BOSC, July 2012

Holland R - Pistoia Alliance Sequence Squeeze


DESCRIPTION

Presentation at BOSC2012 by Holland R - Pistoia Alliance Sequence Squeeze



Page 2: Holland R - Pistoia Alliance Sequence Squeeze

Order of Service

• What/who is the Pistoia Alliance?
• What is/was Sequence Squeeze?
• Who won, how, and why?
• Why did Pistoia do this?
• Why is this good for BOSC delegates?
• Will it happen again?

July  14,  2012   2  Pistoia  Alliance  Sequence  Squeeze  

Page 3: Holland R - Pistoia Alliance Sequence Squeeze

What/who is the Pistoia Alliance?

Page 4: Holland R - Pistoia Alliance Sequence Squeeze

Who is Pistoia?

• The Pistoia Alliance is
  – global
  – not-for-profit
  – precompetitive alliance
  – life science companies, vendors, publishers, and academic groups
  – aims to lower barriers to innovation
  – by improving the interoperability of R&D business processes.
• We differ from standards groups because
  – we bring together the key constituents to identify the root causes that lead to R&D inefficiencies
  – develop best practices and technology pilots to overcome common obstacles.

Page 5: Holland R - Pistoia Alliance Sequence Squeeze

What is/was Sequence Squeeze?

Page 6: Holland R - Pistoia Alliance Sequence Squeeze

The NGS problem

• Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
• There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.

Page 7: Holland R - Pistoia Alliance Sequence Squeeze

What was Sequence Squeeze?

• Contest to find a better FASTQ compression algorithm
  – easiest format for ranking entries in an automated setting.
• Open source, non-restrictive licence required for entries
  – benefit the whole community.
• Entries tested on an extract of the 1000 Genomes data stored in AWS.
• Prize fund of US$15,000 to the best algorithm submitted before the closing date of 15 March 2012.
• Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012
  – more on that story later.
• Organised and administered by Eagle under contract to Pistoia.

Page 8: Holland R - Pistoia Alliance Sequence Squeeze

Who entered?

• 108 distinct entries.
• But all these from only 12 entrants!
  – some entrants were groups or consortia but most were individuals.
• Public leaderboard encouraged fiercer competition.
• Entrants seemingly driven to outdo their competitors.

Page 9: Holland R - Pistoia Alliance Sequence Squeeze

Who judged?

• Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
• Nick Lynch – President of the Pistoia Alliance (2009-11).
• Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
• Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.

Page 10: Holland R - Pistoia Alliance Sequence Squeeze

Who won, how, and why?

Page 11: Holland R - Pistoia Alliance Sequence Squeeze

What were the results?

• Entrants were judged by
  – compression ratio
  – compression time and memory
  – decompression time and memory
  – accuracy (lossiness – 100% target)
  – manual review for code quality, scalability, and other factors.
• The same three people showed up at the top of every category
  – in a different order
  – with different versions of their entries.
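The automated judging criteria above can be approximated with a small benchmark. The following is only a sketch under stated assumptions: it is not the contest's test harness, the FASTQ data is synthetic (the helper make_fastq is hypothetical), the codecs are Python standard-library ones rather than contest entries, and memory usage is not measured. It assumes "compression ratio" means compressed size as a percentage of original size, so smaller is better.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build a synthetic FASTQ payload (hypothetical data, not the contest set)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 quality characters for scores 0..40.
        qual = "".join(chr(33 + rng.randint(0, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(name, compress, decompress, data):
    """Measure ratio, timings and losslessness for one codec."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    return {
        "name": name,
        "ratio_pct": 100.0 * len(packed) / len(data),  # smaller is better
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
        "lossless": restored == data,  # the 100% accuracy target
    }


# Standard-library codecs stand in for contest entries.
CODECS = [
    ("zlib", zlib.compress, zlib.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]

if __name__ == "__main__":
    data = make_fastq()
    rows = [benchmark(n, c, d, data) for n, c, d in CODECS]
    # A simple leaderboard ordered by compression ratio.
    for row in sorted(rows, key=lambda r: r["ratio_pct"]):
        print(
            f"{row['name']:5s} ratio={row['ratio_pct']:6.2f}% "
            f"comp={row['compress_s']:.3f}s decomp={row['decompress_s']:.3f}s "
            f"lossless={row['lossless']}"
        )
```

Sorting by a different key (for example compress_s) gives a different ordering, which is consistent with the slide's point that the top entrants appeared in a different order in each category.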

Page 12: Holland R - Pistoia Alliance Sequence Squeeze

Who won, and why?

• James Bonfield won overall
  – majority of top places in each category
  – using various versions of his entry
  – forming a suite of suitable tools.
• 11.41% compression ratio (test data ~6GB)
  – or 109.90 seconds compression time
  – or 100.91 seconds decompression time
  – or 35.76MB compression memory usage
  – or 16.01MB decompression memory usage
  – but not all at once!
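To put the headline figure in concrete terms, here is a short worked calculation. It assumes the ratio is compressed size divided by original size, and it treats the approximate "~6GB" test set as exactly 6 GB, so the results are rough. The function names are illustrative only.

```python
def compressed_size_gb(original_gb, ratio_pct):
    """Size after compression, given ratio = compressed / original (in percent)."""
    return original_gb * ratio_pct / 100.0


def reduction_factor(ratio_pct):
    """How many times smaller the compressed data is than the original."""
    return 100.0 / ratio_pct


if __name__ == "__main__":
    # 6.0 GB is an assumed round figure for the "~6GB" test data.
    print(f"{compressed_size_gb(6.0, 11.41):.2f} GB")  # prints "0.68 GB"
    print(f"{reduction_factor(11.41):.2f}x smaller")   # prints "8.76x smaller"
```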

Page 13: Holland R - Pistoia Alliance Sequence Squeeze

Implications of winning entry

• The approach is very simple – essentially:
  – convert the FASTQ to BAM alignments against a reference genome, preserving quality scores.
  – compress the BAM files.
• Many other entries followed the same pattern:
  – convert to some other format then compress using standard techniques.
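The general pattern of converting to another representation and then applying a standard compressor can be sketched as follows. This is an illustration only, not the winning entry or any contest code, and it does not perform BAM alignment. The "other format" here is simply three separate streams (read names, bases, qualities), each compressed with bz2. It assumes four-line FASTQ records with a bare "+" separator line and a trailing newline; the function names are hypothetical.

```python
import bz2


def split_streams(fastq_text):
    """Separate a 4-line-per-record FASTQ string into names, bases and qualities."""
    lines = fastq_text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty FASTQ with 4 lines per record")
    if any(sep != "+" for sep in lines[2::4]):
        # A '+' line that repeats the read name would not round-trip here.
        raise ValueError("this sketch only supports bare '+' separator lines")
    return lines[0::4], lines[1::4], lines[3::4]


def pack(fastq_text):
    """Compress each stream on its own so similar symbols sit together."""
    return tuple(
        bz2.compress("\n".join(stream).encode("ascii"))
        for stream in split_streams(fastq_text)
    )


def unpack(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    names, seqs, quals = (
        bz2.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(f"{n}\n{s}\n+\n{q}\n" for n, s, q in zip(names, seqs, quals))


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
    blobs = pack(sample)
    # Lossless round trip: the contest's 100% accuracy target.
    print(unpack(blobs) == sample)
```

Keeping the streams separate lets a general-purpose compressor model each one independently; whether that beats compressing the raw file depends on the data and is not something this sketch demonstrates.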

Page 14: Holland R - Pistoia Alliance Sequence Squeeze

Other interesting results

• Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
• Even vanilla paq wasn’t too bad.
• Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
• If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be
  – highly compressed, very slow access,
  – or less compressed, relatively fast access.
• It’s quite hard to beat bzip2.

Page 15: Holland R - Pistoia Alliance Sequence Squeeze

And unexpected benefits

James Bonfield donated his entire prize fund – US$15,000 – to charity.
50% to the Wellcome Trust Sanger Institute.
50% to the British Heart Foundation.

David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.

Page 16: Holland R - Pistoia Alliance Sequence Squeeze

Publication

• Formal paper being written at the moment by James Bonfield
  – in collaboration with close-second Matt Mahoney
  – and judge Nick Lynch
  – and the authors of other significant entries.
• Source code of ALL entries is available at www.sequencesqueeze.org
  – all under BSD licence
  – all hosted at SourceForge or similar
  – click entry names to be taken to download page.
• Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog
  – search for articles with the tag ‘compression algorithms’.

Page 17: Holland R - Pistoia Alliance Sequence Squeeze

Why did Pistoia do this?

Page 18: Holland R - Pistoia Alliance Sequence Squeeze

Why did Pistoia do this?

• Encouraging innovation through prize-backed contests.
• Open innovation model allows industry to state its requirements
  – then let the free market decide how to deliver something that satisfies these.

Page 19: Holland R - Pistoia Alliance Sequence Squeeze

Why did Pistoia do this?

• Typical bioinformatics open-source hackers do things because they enjoy them
  – but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
• James’ charity donation is a great example of this
  – he wasn’t in it for the money
  – but the prize fund created a tangible goal to aim at.
• Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry
  – contest was AWS-based
  – entries had to be submitted as S3 buckets.

Page 20: Holland R - Pistoia Alliance Sequence Squeeze

Why did Pistoia do this?

• Leaderboard encouraged competition
  – one-upmanship
  – innovation.
• Does not discourage collaboration
  – James and Matt both discussed their entries with the data compression community at encode.ru

Page 21: Holland R - Pistoia Alliance Sequence Squeeze

Why did Pistoia do this?

• BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
• Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.

Page 22: Holland R - Pistoia Alliance Sequence Squeeze

Why is this good for BOSC delegates?

Page 23: Holland R - Pistoia Alliance Sequence Squeeze

Why is this good for BOSC delegates?

• If the entries had been closed/commercial then only organisations willing to pay to license/buy the resulting products would benefit.
• But this way the entire community benefits from results, for free, without restriction.
• Beneficiaries include big pharma and other large corporations that commissioned the contest
  – but also all universities
  – all non-profits
  – all small businesses in biotech
  – and everyone else involved in NGS work.
• Pistoia is about pre-competitive alliance
  – there is no reason to make the Alliance’s output exclusive
  – they are there to develop and share ideas, not to build an empire.

Page 24: Holland R - Pistoia Alliance Sequence Squeeze

Will it happen again?

Page 25: Holland R - Pistoia Alliance Sequence Squeeze

Will it happen again?

• Pleased with outcome and level of interest.
• So, yes.
• Goal is to run two such contests a year.
• But, your community needs you!
  – we need a topic/subject/idea that can be rationally/objectively judged/ranked
  – and that is relevant to the research activities of life science companies and other Pistoia members.
• Ideas can be sent to Pistoia Ops team c/o [email protected]

Page 26: Holland R - Pistoia Alliance Sequence Squeeze

Credits

• Pistoia Alliance for the idea and funding.
• Eagle for organising and administering.
• All contestants for entering.
• 1000 Genomes for the test data.
• AWS for sponsoring participants.
• BOSC/OBF for accepting this talk.

Page 27: Holland R - Pistoia Alliance Sequence Squeeze

[email protected] (ideas to: [email protected])
+44 (0)1223 654481 x3

www.pistoiaalliance.org
www.sequencesqueeze.org
www.eaglegenomics.com
facebook.com/eaglegenomics
blog.eaglegenomics.com
www.pistoiaalliance.org/blog

@eaglegen
@sequencesqueeze
@pistoiaalliance

Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd. Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom.