
Comparative Genomics and Visualisation - Part 2


Slides from a Comparative Genomics and Visualisation course (part 2) presented at the University of Dundee, 11th March 2014. Other materials are available at GitHub (https://github.com/widdowquinn/Teaching)


Leighton Pritchard

Part 2
• Part 1
  ◦ Experimental Comparative Genomics
  ◦ Bulk and Whole Genome Comparisons
• Genome Features
• Who let the -logues out?
• Finishing The Hat

Genome Features
• Genes:
  ◦ translation start
  ◦ introns
  ◦ exons
  ◦ translation stop
  ◦ translation terminator
• ncRNA:
  ◦ tRNA – transfer RNA
  ◦ rRNA – ribosomal RNA
  ◦ CRISPRs – bacterial and archaeal defence (genome editing)
  ◦ many other classes (including enhancers)

Genome Features
• Regulatory sites
  ◦ Transcription start site (TSS)
  ◦ RNA polymerase binding sites
  ◦ Transcription Factor Binding Sites (TFBS)
  ◦ Core, proximal and distal promoter regions
• Repetitive Regions and Mobile Elements
  ◦ Tandem repeats
  ◦ (retro-)transposable elements
    ▪ Alu has ≈50,000 active copies in human genome
  ◦ Phage inclusion (bacteria/archaea)
[Figure: human v mouse comparison]
Pennacchio & Rubin (2001) Nat. Rev. Genet. doi:10.1038/35052548

Genome Feature Identification
• Gene Finding:
  1. Empirical (evidence-based) methods:
    ▪ Inference from known protein/cDNA/mRNA/EST sequence
    ▪ Inference from mapped RNA reads
  2. Ab initio methods:
    ▪ Identification of sequences associated with gene features:
      - TSS, CpG islands, Shine-Dalgarno sequence, stop codons, etc.
  3. Inference from genome comparisons/conservation
Liang et al. (2009) Genome Res. doi:10.1101/gr.088997.108
Brent (2007) Nat. Biotech. doi:10.1038/nbt0807-883
Korf (2004) BMC Bioinf. doi:10.1186/1471-2105-5-59
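
As a rough illustration of the ab initio idea, the sketch below scans the forward strand of a DNA string for open reading frames that start at ATG and end at an in-frame stop codon. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this example only. Real gene callers use trained statistical models, look at both strands, and weigh further signals such as the Shine-Dalgarno sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons and includes the stop codon
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


# Toy sequence (hypothetical): one ORF, ATG AAA TTT GGG TAA, in frame 2
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

Raising `min_codons` is the simplest way to suppress short spurious ORFs, at the cost of missing genuinely short genes.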

Genome Feature Identification
• Finding Regulatory Elements (short, degenerate):
  1. Empirical (evidence-based) methods:
    ▪ Inference from protein-DNA binding experiments
    ▪ Inference from coexpression
  2. Ab initio methods:
    ▪ Identification of regulatory motifs (profile/other methods):
      - TATA, sigma-factor binding sites, etc.
    ▪ statistical overrepresentation
    ▪ Identification from sequence properties
  3. Inference from sequence conservation/genome comparisons
Zhang et al. (2011) BMC Bioinf. doi:10.1186/1471-2105-12-238
Kilic et al. (2013) Nucl. Acids Res. doi:10.1093/nar/gkt1123
Vavouri & Elgar (2005) Curr. Op. Genet. Devel. doi:10.1016/j.gde.2005.05.002
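
Because regulatory motifs are short and degenerate, a simple way to search for one is to expand an IUPAC consensus string into a regular expression. The sketch below does this with the standard IUPAC nucleotide codes. The helper names `iupac_to_regex` and `find_motif`, the TATA-like consensus and the test sequence are all illustrative assumptions. Profile (position weight matrix) methods are the more usual choice in practice.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_motif(seq, consensus):
    """Return 0-based start positions of matches, overlapping ones included."""
    # Wrapping the pattern in a lookahead reports overlapping occurrences
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(seq.upper())]


# Hypothetical sequence searched with an illustrative TATA-like consensus
print(find_motif("GGTATAAATCCTATATATGG", "TATAWAW"))
```

A short degenerate pattern like this matches by chance very often, which is why the slide lists statistical overrepresentation and conservation as additional evidence.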

Genome Feature Identification
• All prediction methods result in errors
• All experiments have error
• Genome comparisons can help correct errors
• [OPTIONAL ACTIVITY] – useful for exercise
  ◦ predict_CDS.md (Markdown)
• Other options for prokaryotic genecalling:
  ◦ Glimmer (http://ccb.jhu.edu/software/glimmer/index.shtml)
  ◦ GeneMarkS (http://opal.biology.gatech.edu/)
  ◦ RAST (http://rast.nmpdr.org/)
  ◦ BASys (https://www.basys.ca/), etc.
• Options for eukaryotic genecalling:
  ◦ GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/)
  ◦ GeneMarkES (http://opal.biology.gatech.edu/gmseuk.html)
  ◦ Augustus (http://augustus.gobics.de/), etc.

Who Let The -logues Out?
Evolutionary relationships of genome features can be complex. We require precise terms to describe relationships between genome features.

Comparing Gene Features
• Given gene annotations for more than one genome, how can we organise and understand relationships?
  ◦ Functional similarity (analogy)
  ◦ Evolutionary common origin (homology, orthology, etc.)
  ◦ Evolutionary/functional/family relationships (paralogy)
Terms first suggested by Fitch (1970) Syst. Zool. doi:10.2307/2412448

Attack of the -logues
• Technical terms describing evolutionary relationships
• Homologues: elements that are similar because they share a common ancestor (NOTE: There are NOT degrees of homology!)
• Analogues: elements that are (functionally?) similar, possibly through convergent evolution and not by sharing common ancestry
• Orthologues: homologues that diverged through speciation
• Paralogues: homologues that diverged through duplication within the same genome
  ◦ (also co-orthologues, xenologues, etc.)

Attack of the -logues
[Figure: timeline showing an ancestral genome feature within a genome.]

Attack of the -logues
[Figure: over time, a speciation event splits ancestor:ftA into species1:ftA and species2:ftA; the two descendants are labelled orthologues.]
• Orthologues: homologues that diverged through speciation

Attack of the -logues
[Figure: over time, a duplication within one genome turns ancestral copy:ftA into copy 1:ftA and copy 2:ftA’; the two copies are labelled paralogues.]
• Paralogues: homologues that diverged through duplication within the same genome

Attack of the -logues
[Figure: over time, speciation splits ancestor:ftA into species1:ftA and species2:ftA; a later duplication in species 1 gives species1:ftA’ and species1:ftA alongside species2:ftA. Relationship labels on the figure: orthologues, out-paralogues, in-paralogues.]

Attack of the -logues
[Figure: over time, speciation splits ancestor:ftA into species1:ftA and species2:ftA; later duplications in both species give species1:ftA’, species1:ftA, species2:ftA and species2:ftA’. Relationship labels on the figure: in-paralogues (within each species), out-paralogues, orthologues.]
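
One way to make the definitions concrete is to read each relationship off a gene tree whose internal nodes are labelled with the event that created them. Two genes are orthologues if their last common ancestor is a speciation node, and paralogues if it is a duplication node. The sketch below assumes a hypothetical tree encoding, nested tuples of `(event, left, right)` with gene names as leaves, and assumes both genes are distinct leaves of the tree. It is an illustration of the definitions, not how any particular tool works.

```python
def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Classify two genes by the event at their last common ancestor."""
    event, left, right = tree
    for child in (left, right):
        if {gene_a, gene_b}.issubset(leaves(child)):
            # Both genes sit under one child: the common ancestor is deeper
            return relationship(child, gene_a, gene_b)
    # The genes lie on different sides, so this node is their common ancestor
    return "orthologues" if event == "speciation" else "paralogues"


# Speciation first, then a duplication in each species (hypothetical names)
late_duplication = (
    "speciation",
    ("duplication", "species1:A", "species1:A'"),
    ("duplication", "species2:A", "species2:A'"),
)
# Duplication first, then speciation of each copy
early_duplication = (
    "duplication",
    ("speciation", "species1:A", "species2:A"),
    ("speciation", "species1:A'", "species2:A'"),
)

print(relationship(late_duplication, "species1:A", "species1:A'"))
print(relationship(late_duplication, "species1:A", "species2:A'"))
print(relationship(early_duplication, "species1:A", "species2:A'"))
```

In the first tree the within-species pairs are in-paralogues and every cross-species pair is a (co-)orthologue; in the second tree the cross-species pair A and A' predates the speciation and is an out-paralogue pair. Telling in- from out-paralogues needs a reference speciation event, which this sketch does not model.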

Attack of the -logues
• BUT: biology is not well-behaved: relationships can be difficult to infer
  ◦ Gene loss occurs
  ◦ Homologues can diverge – sometimes very widely: hard to recognise
  ◦ Reconstructed evolutionary trees for speciation events may not be robust
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030

Attack of the -logues
[Figure: the same history of speciation and duplication (historical events) from ancestor:ftA, followed by extensive divergence. For the contemporary sequences species1:ftA?, species1:ftA and species2:ftA? it is no longer clear which pairs are in-paralogues or (co-)orthologues, and which are out-paralogues/co-orthologues.]
Current classifications of orthology/paralogy are inferences

Attack of the -logues
• BUT: biology is not well-behaved: relationships can be difficult to infer
  ◦ Gene loss occurs
  ◦ Homologues can diverge – sometimes very widely: hard to recognise
  ◦ Reconstructed evolutionary trees for speciation events may not be robust
• Some resources and tools ‘bend’ definitions, e.g. Ensembl Compara and OrthoMCL.
http://www.ensembl.org/info/genome/compara/homology_method.html
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030

Note on “Orthology”
• Frequently abused/misused as a term
• “Orthology” is an evolutionary relationship, often bent into service as a functional descriptor
• Strictly defined only for two species or clades!
  ◦ (cf. OrthoMCL, etc.)
• Orthology is not transitive (A is orthologue of C and B is orthologue of C does not imply A is an orthologue of B)
  ◦ (cf. EnsemblCompara definitions)
Storm & Sonnhammer (2002) Bioinformatics. doi:10.1093/bioinformatics/18.1.92

Ensembl Compara definitions
• within_species_paralog: same-species paralogue (in-paralogue)
• ortholog_one2one: orthologue
• ortholog_one2many: orthologue/paralogue relationship
• ortholog_many2many: orthologue/paralogue relationship
Vilella et al. (2009) Genome Res. doi:10.1101/gr.073585.107
NOTE: the taxonomy may not always be correct…

“The Ortholog Conjecture”
Without duplication, a gene is unlikely to change its basic function, because this would lead to loss of the original function, and this would be harmful.

Problems with the Ortholog Conjecture
• Nehrt et al. (2011) say:
  ◦ Paralogues better predictor of function than orthologues
    ▪ ∴ conjecture is false!
  ◦ Cellular context better for protein function inference
  ◦ Function defined from Gene Ontology (GO)
Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
Chen et al. (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784

Problems with the Ortholog Conjecture
• But do we understand function well enough to test the conjecture?
• Chen et al. (2012) say: “No”
  ◦ “examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems […] make the current GO inappropriate for testing the ortholog conjecture”
  ◦ Expression levels are more similar between orthologues than between paralogues (but is this “function”…?)
Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
Chen et al. (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784

Finding “Orthologues”
The process of finding evolutionary (and/or functional) equivalents of genes across two or more organisms’ genomes.

Why are “orthologues” so important?
• Orthology formalises the concept of corresponding genes across multiple organisms.
  ◦ Evolutionary
  ◦ Functional? (“The Ortholog Conjecture”)
• Applications in:
  ◦ Comparative genomics
  ◦ Functional genomics
  ◦ Phylogenetics, …
• Many (>35) databases attempt to describe orthologous relationships
  ◦ http://questfororthologs.org/orthology_databases
Dessimoz (2011) Brief. Bioinf. doi:10.1093/bib/bbr057

How to find orthologues?
• Many published methods and databases:
  ◦ Pairwise between two genomes:
    ▪ RBBH (aka BBH, RBH, etc.), RSD, InParanoid, RoundUp
  ◦ Multi-genome
    ▪ Graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid
    ▪ Tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT
• Methods may apply different (or refined) definitions of orthology, paralogy, etc.
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755
Trachana et al. (2011) Bioessays doi:10.1002/bies.201100062
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030

Pairwise approaches
• S1, S2 are the gene sequence sets from two organisms
• Compare S1 to S2, and identify the most similar pairs of sequences: these are “orthologues” (or “putative orthologues”).
• Many similarity measures possible (which threshold: E-value, bit score, coverage…?):
  ◦ Reciprocal best BLAST hit (RBBH) – used by e.g. InParanoid
  ◦ Reciprocal smallest difference (RSD) – used by e.g. RoundUp
  ◦ and so on…
• Can be extended to multi-organism clusters by graph-based approaches
Östlund et al. (2009) Nuc. Acids Res. doi:10.1093/nar/gkp931
DeLuca et al. (2012) Bioinf. doi:10.1093/bioinformatics/bts006

Reciprocal Best BLAST Hits
• S1, S2 are the gene sequence sets from two organisms
• BLASTP:
  ◦ Query=S1, Subject=S2
  ◦ Query=S2, Subject=S1
• Optionally filter BLAST hits (e.g. on %identity and %coverage)
• Find all pairs of sequences {GS1n, GS2n} in S1, S2 where GS1n is the best BLAST match to GS2n and GS2n is the best BLAST match to GS1n.
[Figure: a pair of sequences that are each other’s best hit (✔), contrasted with a pair where the best hit in one direction is only the 2nd best hit in the other (✘).]
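
The RBBH procedure above can be sketched in a few lines once the two BLAST searches have been reduced to lists of hits. The code below is a minimal illustration: it assumes each hit is a `(query, subject, score)` tuple (for example a bit score parsed from tabular BLAST output), and the function names and toy identifiers are made up for this example. Filtering on %identity or %coverage would happen before these functions are called.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples. On tied scores the
    first hit seen is kept, which is an arbitrary choice for this sketch.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return sorted (gene_1, gene_2) pairs that are each other's best hit."""
    forward = best_hits(hits_1_vs_2)
    reverse = best_hits(hits_2_vs_1)
    return sorted(
        (gene_1, gene_2)
        for gene_1, gene_2 in forward.items()
        if reverse.get(gene_2) == gene_1
    )


# Toy data (hypothetical identifiers and scores)
s1_vs_s2 = [("a1", "b1", 200), ("a1", "b2", 150),
            ("a2", "b2", 90), ("a3", "b2", 120)]
s2_vs_s1 = [("b1", "a1", 198), ("b2", "a3", 118), ("b2", "a2", 88)]

print(reciprocal_best_hits(s1_vs_s2, s2_vs_s1))
```

Here `a2` has `b2` as its best hit, but `b2` prefers `a3`, so the pair `(a2, b2)` is rejected: this is the ✘ case in the figure.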

Reciprocal Best BLAST Hits
• Advantages:
  ◦ quick
  ◦ easy
  ◦ performs surprisingly well (see later…)
• Disadvantages:
  ◦ misses paralogues
  ◦ not good at identifying gene families or *-to-many relationships without more detailed analysis.
  ◦ no strong theoretical/phylogenetic basis.

COG
• COG (Clusters of Orthologous Groups; now POG, KOG, eggNOG etc.)
• Graph extension of RBBH to clusters of mutual RBBH
  ◦ “Any group of at least three proteins from different genomes, more similar to each other than any other proteins from those genomes, are an orthologous family.”
  ◦ Conduct RBBH
  ◦ Collapse paralogues
  ◦ Detect “triangles”
  ◦ Merge triangles having common side
  ◦ Manual curation
• Databases have many outparalogues
Tatusov et al. (2000) Nucl. Acids Res. doi:10.1093/nar/28.1.33
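
The "detect triangles, then merge triangles with a common side" steps can be illustrated on a small graph of RBBH pairs. The sketch below is a simplified reading of those two steps only: it assumes the RBBH edges are already computed, does not collapse paralogues or check that proteins come from different genomes, and uses a brute-force triangle search that would be far too slow for real data. The function name and protein labels are hypothetical.

```python
from itertools import combinations


def cog_clusters(edges):
    """Group proteins by merging RBBH triangles that share a side."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # A triangle is three proteins that are all pairwise reciprocal best hits
    triangles = [
        frozenset(trio)
        for trio in combinations(sorted(neighbours), 3)
        if all(y in neighbours[x] for x, y in combinations(trio, 2))
    ]

    # Union-find over triangles: two triangles join if they share a side
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:  # two shared corners
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())


# Toy RBBH edges, labelled genome:protein (hypothetical)
rbbh_edges = [
    ("A:1", "B:1"), ("B:1", "C:1"), ("A:1", "C:1"),  # triangle
    ("B:1", "D:1"), ("C:1", "D:1"),                  # shares side B:1-C:1
    ("A:2", "B:2"),                                  # lone pair, no triangle
    ("A:3", "B:3"), ("B:3", "C:3"), ("A:3", "C:3"),  # separate triangle
]
print(cog_clusters(rbbh_edges))
```

The lone pair `A:2`/`B:2` never forms a triangle, so it is left out of every cluster, matching the "at least three proteins" requirement quoted above.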

MCL
• MCL constructs a network from all-vs-all BLAST results
• Then applies matrix operations: expansion and inflation
• Iterative expansion and inflation until network convergence
Enright et al. (2002) Nucl. Acids Res. doi:10.1093/nar/30.7.1575
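
To show what expansion and inflation do, here is a minimal pure-Python sketch of the Markov clustering loop on a small adjacency matrix. It makes several simplifying assumptions: a fixed number of iterations instead of a convergence test, added self-loops, an inflation parameter of 2, and a naive read-out of clusters from the non-zero rows. It is a teaching sketch, not the `mcl` program itself, and the function names are chosen for this example.

```python
def normalise(matrix):
    """Scale each column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] for c in range(n)] for r in range(n)]


def expand(matrix):
    """Expansion: square the matrix (flow along two-step walks)."""
    n = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]


def inflate(matrix, power):
    """Inflation: raise each entry to a power, then renormalise columns."""
    return normalise([[value ** power for value in row] for row in matrix])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster node indices of a (weighted) adjacency matrix."""
    n = len(adjacency)
    # Self-loops keep every column sum positive and damp oscillation
    matrix = normalise([[adjacency[r][c] + (1.0 if r == c else 0.0)
                         for c in range(n)] for r in range(n)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each row that retains flow lists the members of one cluster
    clusters = {tuple(c for c in range(n) if row[c] > 1e-6) for row in matrix}
    return sorted(cluster for cluster in clusters if cluster)


# Two disconnected groups of mutually similar sequences (toy data)
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(adjacency))
```

Higher inflation values strengthen strong flows relative to weak ones more aggressively and tend to give smaller, tighter clusters.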

MCL
[Figure: an input network passes through alternating Expansion and Inflation steps (→ … →) to give the final Clustering.]

OrthoMCL
• http://orthomcl.org/orthomcl/
  1. Defines potential inparalogue, orthologue and co-orthologue pairs (using RBBH! – see algorithm description in papers directory)
  2. Applies MCL to cluster inparalogue, orthologue, co-orthologue pairs
• Output clusters include both orthologues and paralogues
Li et al. (2003) Genome Res. doi:10.1101/gr.1224503

Notes of Caution
• BLAST-based orthology methods (e.g. RBBH, InParanoid, COG) are fast!
• But they have some drawbacks:
  ◦ No guarantee that sequence matches are transitive (A may match B at a domain differently than B matches C)
  ◦ No evolutionary distance model
  ◦ Multiple domain matches are not accounted for
• These methods find similar sequences, then make assumptions based on similarity and number of matches. They do not detect orthologues directly!
• Tree-based methods incorporate:
  ◦ Evolutionary distance
  ◦ Direct orthologue detection

Finding “Orthologues”
• Pairwise analysis: RBBH
  ◦ [ACTIVITY]
    ▪ find_rbbh.ipynb (iPython notebook)
• Multi-organism analysis: MCL
  ◦ [ACTIVITY]
    ▪ mcl_orthologues/README.md (Markdown)
    ▪ mcl_orthologues.ipynb (iPython notebook)

Other Methods
• Synteny-based:
  ◦ Homologene (NCBI):
    ▪ http://www.ncbi.nlm.nih.gov/homologene
• Manual curation:
  ◦ Mouse Genome Database (MGD):
    ▪ http://www.informatics.jax.org/homology.shtml
• Tree-based:
  ◦ EnsemblCompara (EMBL-EBI):
    ▪ http://www.ensembl.org/info/genome/compara/index.html
  ◦ TreeFam (EMBL-EBI):
    ▪ http://www.treefam.org/
  ◦ OrthologID:
    ▪ http://nypg.bio.nyu.edu/orthologid/

Evaluating Orthologue Predictions
Which method works best? (and what do we mean by “best” anyway?)

Evaluating Predictions
• Works the same way for all prediction tools
  1. Define a “validation set” (gold standard), unseen by the prediction tool
  2. Make predictions with the tool
  3. Evaluate confusion matrix and performance statistics
    ◦ Sensitivity
    ◦ Specificity
    ◦ Accuracy

Standard:      +ve    -ve
Predict +ve    TP     FP
Predict -ve    FN     TN

False positive rate          FP/(FP+TN)
False negative rate          FN/(TP+FN)
Sensitivity                  TP/(TP+FN)
Specificity                  TN/(FP+TN)
False discovery rate (FDR)   FP/(FP+TP)
Accuracy                     (TP+TN)/(TP+TN+FP+FN)
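
The statistics in the table follow directly from the four confusion-matrix counts. The sketch below derives the counts from sets of predicted and gold-standard orthologue pairs and then applies the formulas as written above. The function names and the example numbers are illustrative assumptions, and the code does not guard against empty denominators.

```python
def confusion_counts(predicted, gold, universe):
    """Return (TP, FP, FN, TN) for sets of candidate pairs.

    universe: every pair that could have been predicted.
    """
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, fn, tn


def performance(tp, fp, fn, tn):
    """Compute the statistics from the table above."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_discovery_rate": fp / (fp + tp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Made-up counts, purely to exercise the formulas
stats = performance(tp=80, fp=10, fn=20, tn=90)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and the false negative rate always sum to 1, as do specificity and the false positive rate, so reporting one of each pair is enough.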

Evaluating Orthologue Predictions
• Take advantage of prokaryotic operon structure: conserved syntenic triplets likely to be orthologous
  ◦ Idea: If the outer pair in a syntenic triplet are orthologous, the middle gene is likely to be, too.
  ◦ Middle genes are orthologue “gold standard”
• Do RBBH reliably identify middle genes from syntenic triplets?
Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100
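
A simplified reading of the syntenic-triplet idea can be sketched as follows: walk along one genome three genes at a time, and whenever the counterparts of the two outer genes flank exactly one gene in the other genome, record the two middle genes as a gold-standard pair. This is an assumption-laden illustration (single linear chromosome, one counterpart per gene, hypothetical function and gene names), not the published protocol.

```python
def middle_gene_pairs(order_1, order_2, outer_map):
    """Return (middle_1, middle_2) pairs from conserved syntenic triplets.

    order_1, order_2: gene identifiers in chromosomal order.
    outer_map: dict mapping genome-1 genes to their genome-2 counterparts.
    """
    position_2 = {gene: index for index, gene in enumerate(order_2)}
    pairs = []
    for left, middle, right in zip(order_1, order_1[1:], order_1[2:]):
        i = position_2.get(outer_map.get(left))
        j = position_2.get(outer_map.get(right))
        if i is None or j is None:
            continue  # one of the outer genes has no counterpart
        if abs(i - j) == 2:  # exactly one gene lies between the counterparts
            pairs.append((middle, order_2[(i + j) // 2]))
    return pairs


# Toy gene orders and counterpart map (hypothetical)
genome_1 = ["a1", "a2", "a3", "a4"]
genome_2 = ["b1", "b2", "b3", "b9"]
counterparts = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b9"}

gold = middle_gene_pairs(genome_1, genome_2, counterparts)
print(gold)

# Fraction of gold-standard middle pairs that the same map also recovers
recovered = sum(1 for m1, m2 in gold if counterparts.get(m1) == m2)
print(recovered / len(gold))
```

Using `abs(i - j)` means a triplet that is inverted in the second genome still counts as conserved.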

Evaluating Orthologue Predictions
• Two well-characterised genomes compared against 573 prokaryotes
• Identified RBBH (with permissive BLAST settings)
• “Overwhelming majority” of middle genes (counterparts) are BBH
• 88-99% of BBH are in syntenic triplets
• Therefore, RBBH reliably finds orthologues
Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100

Evaluating Orthologue Predictions
• Four orthologue prediction algorithms:
  ◦ RBBH (and cRBH)
  ◦ RSD (and cRSD)
  ◦ MultiParanoid
  ◦ OrthoMCL
• Tested against 2,723 curated orthologues from six Saccharomycetes
• Rated by:
  ◦ Sensitivity: TP/(TP+FN) – what proportion of orthologues are found
  ◦ Specificity: TN/(TN+FP) – how well are non-orthologues excluded
  ◦ Accuracy: (TP+TN)/(TP+TN+FP+FN) – general measure of performance
  ◦ FDR: FP/(FP+TP) – what proportion of predictions are incorrect
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755

Evaluating Orthologue Predictions
• Four orthologue prediction algorithms:
  ◦ RBBH (cRBH)
  ◦ RSD (cRSD)
  ◦ MultiParanoid
  ◦ OrthoMCL
• cRBH most accurate and specific, with lowest FDR
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755

Evaluating Orthologue Predictions
• Tests of several methods on a number of literature-based benchmarks for:
  ◦ Correct branching of phylogeny
  ◦ Grouping by function
    ▪ GO similarity
    ▪ EC number
    ▪ Expression level
    ▪ Gene Neighbourhood
Altenhoff & Dessimoz (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000262

Evaluating Orthologue Predictions
Altenhoff & Dessimoz (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000262

Evaluating Orthologue Predictions
• 70 gene family test, multiple evolutionary scenarios
• Tested databases with associated algorithms:
Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062

Evaluating Orthologue Predictions
• 70 gene family test set, multiple evolutionary scenarios
• All methods/dbs have strong scope for improvement.
• OrthoMCL poor performer, TreeFam & eggNOG do best
Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062

Orthologue Prediction Performance
• Performance varies by choice of method and interpretation of “orthology”
• Biggest influence is genome annotation quality
• Relative performance varies with benchmark choice
• (clustering) RBBH outperforms more complex algorithms under many circumstances

Selection Pressures
Signs of selection pressure identifiable by comparative genomics

Selection Pressures
• Defining core groups of genes by “orthology” allows analysis of those groups:
  ◦ Synteny/collocation
  ◦ Gene neighbourhood changes (e.g. genome expansion)
  ◦ The pangenome: core and accessory genomes
• and sequences in those groups:
  ◦ Multiple alignment
  ◦ Domain detection
  ◦ Identification of functional sites
  ◦ Inference of evolutionary pressures

Synteny
• Selective pressures depend on gene (product) function
• Genes involving physically or functionally-interacting proteins tend to evolve under similar selective constraints
• Particularly in bacteria, this leads to co-expression as regulons and collocation in operons
• Collocation (and coregulation) may be identified by comparative genomics
• (This is also true when considering regulatory or metabolic networks, similarly to genome organisation)
Alvarez-Ponce et al. (2011) Genome Biol. Evol. doi:10.1093/gbe/evq084

Synteny
• Many tools/packages/services for synteny detection, e.g.
  ◦ SyMAP
    ▪ http://www.agcol.arizona.edu/software/symap/
  ◦ i-ADHoRe
    ▪ http://bioinformatics.psb.ugent.be/software/details/i--ADHoRe
  ◦ MCScan, Cyntenator, etc.
Soderlund et al. (2011) Nucl. Acids. Res. doi:10.1093/nar/gkr123
Proost et al. (2011) Nucl. Acids Res. doi:10.1093/nar/gkr955

i-ADHoRe
• Algorithm:
  1. Combine tandem repeats of genes/gene sets
  2. Make gene homology matrix (GHM): identify collinear regions (diagonals) for first genome pair
  3. Convert these to profiles
  4. Use GG2 algorithm to align profiles
  5. Search next genome with profiles, splitting them where necessary
  6. Iterate until complete
• Gives genome-scale multiple alignments of blocks of genes
Proost et al. (2011) Nucl. Acids Res. doi:10.1093/nar/gkr955

i-ADHoRe
• [ACTIVITY]
  ◦ i-ADHoRe/README.md (Markdown)
  ◦ i-ADHoRe.ipynb (iPython notebook)

Genome Expansion
• Mobile/repeat elements reproduce and expand during evolution
• Generates sequence “laboratory” for variation and experiment
• e.g. Phytophthora infestans effector protein expansion and arms race
Haas et al. (2009) Nature. doi:10.1038/nature08358

Genome Expansion
• Mobile elements (MEs) are large and carry genes with them.
• Regions rich in MEs have larger gaps between consecutive genes
• Effector proteins are found preferentially in regions with large gaps, and also show increased rates of evolutionary divergence.
• “Two-speed genome” associated with adaptability to new hosts/escape from evolutionary “bottleneck”
Haas et al. (2009) Nature. doi:10.1038/nature08358

The Pangenome
• The gene complement of a set of organisms (e.g. species group) is the pangenome, defined by the union of two gene sets:
  ◦ Core genes: genes present in all examples (define common species characteristics)
  ◦ Accessory genes: genes only present in a subset of examples (relevant to adaptation of individuals)
• Definition depends on composition of organism set
• Core genome hypothesis:
  ◦ “The core genome is the primary cohesive unit defining a bacterial species.”
• Online tools available, e.g.
  ◦ Panseq (http://lfz.corefacility.ca/panseq/)
Laing et al. (2010) BMC Bioinf. doi:10.1186/1471-2105-11-461
Lefébure et al. (2010) Genome Biol. Evol. doi:10.1093/gbe/evq048
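
With each genome reduced to the set of orthologue groups it contains, the core and accessory genomes are plain set operations. The sketch below assumes a hypothetical dictionary of genome name to group identifiers and at least one genome in the input; the strain names and group labels are invented for illustration.

```python
def pangenome(gene_sets):
    """Split the pangenome of several genomes into core and accessory genes.

    gene_sets: dict mapping genome name to a set of orthologue-group ids.
    """
    all_sets = list(gene_sets.values())
    pan = set().union(*all_sets)           # present in at least one genome
    core = set.intersection(*all_sets)     # present in every genome
    return {"pan": pan, "core": core, "accessory": pan - core}


# Toy presence data (hypothetical strains and orthologue groups)
strains = {
    "strain_1": {"og1", "og2", "og3", "og4"},
    "strain_2": {"og1", "og2", "og3", "og5"},
    "strain_3": {"og1", "og2", "og6"},
}
result = pangenome(strains)
print(sorted(result["core"]))
print(sorted(result["accessory"]))
```

Adding or removing a genome changes both sets, which is the sense in which the definition depends on the composition of the organism set.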

Defining a species’ core genome
• “Orthologue groups” with a representative in (nearly) every member of the set
• But we only have a sample of the species, not every member…
• …so use rarefaction curves to estimate core genome size.
  1. Randomly order organisms, and count number of ‘core’ and ‘new’ genes seen with each new genome addition.
  2. Repeat until you have a reasonable estimate of error/no new genes found
Lefébure et al. (2010) Genome Biol. Evol. doi:10.1093/gbe/evq048
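
The two rarefaction steps above can be sketched as a small resampling loop: shuffle the genomes, add them one at a time while tracking the running core (intersection) and pangenome (union) sizes, and average over many random orders. The function name, the fixed number of permutations and the seeded random generator are assumptions made for reproducibility in this example. The sketch returns means only; a fuller treatment would also report the spread across permutations.

```python
import random


def rarefaction(gene_sets, permutations=100, seed=0):
    """Mean core and pangenome sizes as genomes are added in random order.

    gene_sets: dict mapping genome name to a set of orthologue-group ids.
    Returns (core_curve, pan_curve), one value per number of genomes added.
    """
    rng = random.Random(seed)
    names = list(gene_sets)
    core_totals = [0] * len(names)
    pan_totals = [0] * len(names)
    for _ in range(permutations):
        rng.shuffle(names)
        core = set(gene_sets[names[0]])
        pan = set(gene_sets[names[0]])
        for k, name in enumerate(names):
            core &= gene_sets[name]   # genes still shared by all genomes so far
            pan |= gene_sets[name]    # genes seen in any genome so far
            core_totals[k] += len(core)
            pan_totals[k] += len(pan)
    return ([total / permutations for total in core_totals],
            [total / permutations for total in pan_totals])


# Toy presence data (hypothetical)
genomes = {
    "g1": {1, 2, 3, 4},
    "g2": {1, 2, 3, 5},
    "g3": {1, 2, 6},
}
core_curve, pan_curve = rarefaction(genomes, permutations=20)
print(core_curve)
print(pan_curve)
```

The number of 'new' genes contributed by each added genome is the step-to-step increase in the pangenome curve; a core curve that has flattened suggests the sample is large enough to estimate core genome size.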

Directional Selection
• Several statistical tests for directional selection, e.g.
  ◦ QTL sign
  ◦ Ka/Ks (dN/dS) ratio test – most commonly applied
  ◦ Relative rate test
• Ka/Ks ratio:
  ◦ Ka (or dN): number of non-synonymous substitutions per non-synonymous site
  ◦ Ks (or dS): number of synonymous substitutions per synonymous site
  ◦ Ka/Ks > 1 ⇒ positive selection; Ka/Ks < 1 ⇒ stabilising selection
  ◦ Several methods/tools for calculation
    ▪ PAML (http://abacus.gene.ucl.ac.uk/software/paml.html)
    ▪ SeqinR (http://cran.r-project.org/web/packages/seqinr/index.html)

Genome-Wide Positive Selection
Lefébure & Stanhope (2009) Genome Res. doi:10.1101/gr.089250.108
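
To make the distinction behind Ka and Ks concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous or non-synonymous using the standard genetic code, and interprets a given Ka/Ks pair using the thresholds stated earlier. It deliberately stops short of a real Ka/Ks estimate: it does not normalise by the number of synonymous and non-synonymous sites or correct for multiple substitutions, which tools such as PAML do. The function names and sequences are illustrative, sequences are assumed to be ungapped, in frame and upper case, and treating a ratio of exactly 1 as neutral is an added convention not stated in the slides.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_differences(seq_1, seq_2):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    synonymous = non_synonymous = 0
    for pos in range(0, min(len(seq_1), len(seq_2)) - 2, 3):
        codon_1, codon_2 = seq_1[pos:pos + 3], seq_2[pos:pos + 3]
        if codon_1 == codon_2:
            continue
        if CODON_TABLE[codon_1] == CODON_TABLE[codon_2]:
            synonymous += 1      # same amino acid encoded
        else:
            non_synonymous += 1  # amino acid changed
    return synonymous, non_synonymous


def interpret_ratio(ka, ks):
    """Apply the Ka/Ks thresholds to already-estimated Ka and Ks values."""
    if ks == 0:
        return "undefined"
    ratio = ka / ks
    if ratio > 1:
        return "positive selection"
    if ratio == 1:
        return "neutral"
    return "stabilising selection"


# M K L G versus M K L D: one silent change (AAA/AAG), one replacement (GGT/GAT)
print(count_codon_differences("ATGAAACTGGGT", "ATGAAGCTGGAT"))
print(interpret_ratio(ka=0.02, ks=0.25))
```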

An Analysis Output
• Class comparison: animal-pathogenic (APE) vs plant-associated bacteria (PAB)
• Presence of horizontally-acquired islands (HAI)
• Genes with greater similarity to PAB than APE
Toth et al. (2006) Annu. Rev. Phytopath. doi:10.1146/annurev.phyto.44.070505.143444

Things I Didn’t Get To
• Genome-Wide Association Studies (GWAS):
  ◦ Try http://genenetwork.org/ to play with some data
• Prediction of regulatory elements, e.g.
  ◦ Kellis et al. (2003) Nature doi:10.1038/nature01644
  ◦ King et al. (2007) Genome Res. doi:10.1101/gr.5592107
  ◦ Chaivorapol et al. (2008) BMC Bioinf. doi:10.1186/1471-2105-9-455
  ◦ CompMOBY: http://genome.ucsf.edu/compmoby/
• Detection of Horizontal/Lateral Gene Transfer (HGT/LGT), e.g.
  ◦ Tsirigos & Rigoutsos (2005) Nucl. Acids. Res. doi:10.1093/nar/gki187
• Phylogenomics, e.g.
  ◦ Delsuc et al. (2005) Nat. Rev. Genet. doi:10.1038/nrg1603

Finishing The Hat
Some of the things I hope you have taken away from the lectures/activities

Take-Home Messages
• Comparative genomics is a powerful set of techniques for:
  ◦ Understanding and identifying evolutionary processes and mechanisms
  ◦ Reconstructing detailed evolutionary history of a set of organisms
  ◦ Identifying and understanding common genomic features of organisms
  ◦ Providing hypotheses about gene function for experimental investigation
• A huge amount of data is available to work with
  ◦ And it’s only going to get much, much larger
• Results feed into many areas of study:
  ◦ Medicine and health
  ◦ Agriculture and food security
  ◦ Basic biology in all fields
  ◦ Systems and synthetic biology

Take-Home Messages
• Comparative genomics is essentially based around comparisons
  ◦ What is similar between two genomes? What is different?
• Comparative genomics is evolutionary genomics
• Large datasets benefit from visualisation for effective interpretation
  ◦ Much scope for improvement in visualisation
• Tools with the same purpose give different output
  ◦ BLAST vs MUMmer
  ◦ RBBH vs MCL
  ◦ Choice of application matters for correctness and interpretation! – understand what the application does, and its limits.

Take-Home Messages
• Comparative genomics is
  ◦ Fun
  ◦ Indoor work, in the warm and dry
  ◦ Not a job that involves heavy lifting

Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.