Page 1: 2013 july 25 systems biology rna seq v2

Cancer Systems Biology: RNA-Seq and Differential Expression Analysis

Taking Advantage of a Measurement Revolution

July 25, 2013
Anne Deslattes Mays

Wellstein/Riegel Laboratory
Mentor: Anton Wellstein, MD, PhD

7/25/13 Wellstein/Riegel Laboratory

Page 2: 2013 july 25 systems biology rna seq v2

Talk Outline

• On the Shoulders of Giants
• Sequencing Timeline
• RNA-Seq for Everyone
• RNA-Sequencing Details
• Differential Expression Analysis
• Causality
• Cancer Therapeutics Example
• Ask Bigger Questions – Sequencing Everything

Page 3: 2013 july 25 systems biology rna seq v2

Rosalind Franklin “pioneered use of x-rays to create images of unorganized matter – such as large biological molecules – not just single crystals”

http://www.pbs.org/wgbh/aso/databank/entries/bofran.html

“Franklin made equipment adjustments to produce an extremely fine beam of x-rays. She extracted finer DNA fibers than ever before and arranged them in parallel bundles. Studied fibers’ reactions to humid conditions. … allowed her to discover crucial keys to DNA’s structure…. Wilkins shared this with Watson & Crick at Cambridge without her knowledge…”

Page 4: 2013 july 25 systems biology rna seq v2

Sequencing Timeline

[Figure: sequencing timeline; labels not recoverable]

Page 5: 2013 july 25 systems biology rna seq v2

Human Sequencing Timeline

[Figure: human sequencing timeline; labels not recoverable]

Key Technical Advances: the Celera human sequence was done in one location, on the largest supercomputer in private hands at that time

Page 13: 2013 july 25 systems biology rna seq v2

Cancer Systems Biology: Taking advantage of a measurement revolution

Declining sequencing costs, decreasing computing costs. How do you leverage all this data?

GEO May 25, 2012

GEO June 25, 2013

Page 14: 2013 july 25 systems biology rna seq v2

Here is an example RNA-Seq workflow

• Experimental Design
• Sample Collection
• Quality Control, Read Trimming
• Differential Analysis
• Transcript Identification
• Pathway Analysis
• Feature Discovery
• Sequencing

Page 15: 2013 july 25 systems biology rna seq v2

[Figures, slides 15-21] Source: http://rnaseq.uoregon.edu/index.html

Page 22: 2013 july 25 systems biology rna seq v2

Replicates: Type I and Type II errors

Page 23: 2013 july 25 systems biology rna seq v2

Detecting Signal vs. Noise

Page 25: 2013 july 25 systems biology rna seq v2

What is the goal of the sequencing experiment?

Page 28: 2013 july 25 systems biology rna seq v2

Before library construction, RNA quality must be assessed

Before Library Construction
1. Most vendors and cores will assess the quality of the RNA before sequencing
2. Important to determine before sequencing begins

Garbage in == garbage out

Page 29: 2013 july 25 systems biology rna seq v2

RNA-Seq

Page 30: 2013 july 25 systems biology rna seq v2

Three steps to get to a fresh sequence with the Illumina Genome Sequence Analyzer

• Library generation
• Cluster generation
• Sequencing

Page 31: 2013 july 25 systems biology rna seq v2

Library Construction: Messenger RNAs are poly-A selected from total RNA and fragmented, and cDNA is synthesized

Before Library Construction
1. Poly-A selection (total RNA -> mRNA)
2. mRNA fragmentation
3. First strand synthesis (stop here to maintain strand specificity)
4. Second strand synthesis

Other techniques (rRNA depletion)
1. Ribo-Zero
2. RiboMinus

Page 32: 2013 july 25 systems biology rna seq v2

Library Construction: End repair and adenylation result in adapter-ligation-ready constructs

cDNA (single or double stranded)
1. cDNA is blunt-end repaired and phosphorylated (B)
2. An A base is added to prepare for indexed adapter ligation (C)

Page 33: 2013 july 25 systems biology rna seq v2

Library Construction: Adapter ligation results in cluster-generation-ready constructs

Index adapter ligation and product ready for amplification on the cBot or the Cluster Station
1. Strand-specific tags are added to the A base: ligate index adapter (D)
2. Denature and amplify for final product (E)

Page 34: 2013 july 25 systems biology rna seq v2

Cluster Generation: In the Illumina cBot system, single molecules are isothermally amplified in a flow cell to prepare them for sequencing

Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell
1. Oligo lawn
2. Oligos hybridize to the adapters that had been ligated to the library fragments, which flow through the cell

Page 35: 2013 july 25 systems biology rna seq v2

Cluster Generation: Bound fragments are extended to make copies, and reverse strands are cleaved and washed away

Bridge amplification results in hundreds of millions of unique clusters
1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications
2. Reverse strands are cleaved and washed away
3. Ends are blocked
4. Sequencing primer is hybridized to the DNA template
5. Libraries are ready for sequencing

Page 36: 2013 july 25 systems biology rna seq v2

Sequencing: Hundreds of millions of clusters are sequenced simultaneously

Four fluorescently labeled, reversibly terminated nucleotides
1. Each base competes for addition
2. Natural competition ensures highest accuracy
3. After each round of synthesis, clusters are excited by a laser, emitting a color that identifies the newly added base
4. Fluorescent label and blocking group are removed, allowing for addition of the next nucleotide
5. Proprietary (Illumina) chemistry reads a base in each cycle
6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence

Page 37: 2013 july 25 systems biology rna seq v2

There are other ways to inquire about the transcriptome

• Array-based technologies
  – Affymetrix
  – Agilent
  – Known genes and hybridization protocols
• Microarray
  – 20,000+ array experiments on a single platform
  – Edge effects
  – False positives / false negatives
• Bead-based arrays
• Tiling arrays
• SAGE

Page 38: 2013 july 25 systems biology rna seq v2

What is unique about RNA-Seq?

• Allows you to discover and profile the entire transcriptome of any organism
• No probes or primers to design
• Novel transcripts
• Novel isoforms
• Alternative splice sites
• Rare transcripts
• cSNPs – all of this in one experiment

Page 39: 2013 july 25 systems biology rna seq v2

After sequencing, we must then perform RNA-Seq data analysis

After sequencing…
1. Quality control: trim your reads
2. Count reads
   • Align to genome
   • Align to transcriptome
3. Interpret data
   • Statistical tests (differential expression analysis)
   • Visualization (mapped reads)
   • Pathway analysis

Not so simple: big data, big compute requirements
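The "trim your reads" step can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding and a hypothetical cutoff of Q20, and it only trims low-quality bases from the 3' end. Tools such as Trimmomatic use more elaborate strategies, so this is an illustration of the idea, not their algorithm.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred quality is below min_q.

    seq    -- read sequence
    qual   -- FASTQ quality string (assumed Phred+33 encoded)
    min_q  -- hypothetical quality cutoff, chosen for illustration
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # Under Phred+33, 'I' encodes Q40 and '#' encodes Q2.
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

Reads that become very short after trimming are usually dropped as well; that filter is left out here to keep the sketch small.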

Page 42: 2013 july 25 systems biology rna seq v2

RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html

Step 1: align-reads. Gsnap is used to align reads to the genome sequence.

Processes
• FastQC
• Trimmomatic
• gsnap
• samtools

Files and graphics
• FASTQ PE* reads
• Trimmed PE* reads
• Quality control consensus per-read-length graphs
• Reference genome assembly (WGS)
• Existing gene models (gtf files w/ tss ids)*
• Gene models mapped to reference
• Gsnap-aligned BAM files
• Gsnap.CoordSorted.bam

Notes
• Tss ids = transcription start site ids, in a gtf file format
• PE = paired end
• The gene models that are built with the pasa pipeline can be input to tophat

Legend
• Unshaded rectangle: code to be run (a process)
• Shaded rectangle: a file or a graphic, which may be an input and/or an output
• Dark rectangle: a file that can be displayed as a track in crop-pedia

Page 43: 2013 july 25 systems biology rna seq v2

RNA Alternative Splicing: Why you need gapped aligners

Page 44: 2013 july 25 systems biology rna seq v2

RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html

Step 2: assemble-reads. Trinity is used to assemble the RNA-Seq reads in each partition. This can be done in a massively parallel manner, typically requires little RAM compared to whole de novo RNA-Seq assemblies, and can be executed on standard hardware. The first step (prep_rnaseq_alignments_for_genome_assisted_assembly.pl) partitions the reads according to covered regions.

Flow (files in parentheses)
1. (Gsnap.CoordSorted.bam) -> prep_rnaseq_alignments_for_genome_assisted_assembly.pl
2. find Dir_* -name "*reads" > read_files.list -> (read_files.list)
3. GG_write_trinity_cmds.pl -> (Trinity_GG.cmds)
4. ParaFly
5. find Dir_* -name "*inity.fasta" -exec cat {} + | inchworm_accession_incrementer.pl > Trinity_GG.fasta -> (Trinity_GG.fasta)

Page 45: 2013 july 25 systems biology rna seq v2

RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html

Steps 3 and 4: align-transcripts and assemble-transcript alignments

Flow: Trinity_GG.fasta (file) -> Launch_PASA_pipeline.pl (process) -> output files:
• Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.gtf
• Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.bed
• Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.gff3
• Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.fasta

Page 46: 2013 july 25 systems biology rna seq v2

RNASeq flow chart – Step 5: Tuxedo suite. Using the output of the Trinity genome-guided assembly and the pasa and keygene annotation pipelines, call the Tuxedo suite (in parallel with calling the abundance estimator RSEM).

Flow: Gff3 (gene model) -> Gff3togtf (convert to gtf format) -> Gtf (gene model) -> tophat (calls Bowtie2) -> Junctions.bed, Accepted.hits.sam

Page 47: 2013 july 25 systems biology rna seq v2

RNASeq Quantitation and Differential Analysis

Trinity genome guided assembly
• Inputs: Trinity.fasta, trimmed PE (paired-end) reads
• Abundance estimation: RSEM
• Output: RSEM.isoform.results
• Quantitation (matrix file with counts per isoform)
• Model building/differential analysis: limma (model design/contrast matrix building), randomForest, pcAlg, Genie3.R (DREAM4)

Tuxedo suite
• Inputs: Accepted.hits.sam, transcripts .gtf/.gff*
• cuffdiff2
• Outputs: isoforms.fpkm_tracking, genes.fpkm_tracking, cds.fpkm_tracking, tss_groups.fpkm_tracking, isoform_exp.diff, gene_exp.diff, tss_group_exp.diff, cds_exp.diff
• Counts and read group tracking files are also created

* Transcript annotation file produced by cufflinks, cuffcompare or other source

Page 48: 2013 july 25 systems biology rna seq v2

How much RNA-sequencing data, how much computation power, and where do you go to compute?

How much RNA-sequencing data?
1. 20 million paired-end reads ~ 2 GB of data
2. 100 million paired-end reads ~ 10 GB of data

How much computation power?
1. The more memory and processors, the less time it takes to compute
2. If you outsource the analysis, you will still need to store the results somewhere

Where to compute?
• Amazon Web Services: S3 storage, EC2 (Elastic Compute Cloud) on-demand computational facility
• Georgetown University High Performance Computer Core: matrix.georgetown.edu
• UPENN Galaxy services
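The two data-size figures above imply roughly 0.1 GB per million paired-end reads. A small Python sketch of that rule of thumb follows; the linear scaling and the 12-sample example are assumptions for illustration, and real file sizes depend on read length and compression.

```python
# Rule of thumb taken from the slide: 20 million paired-end reads ~ 2 GB.
GB_PER_MILLION_PAIRS = 2 / 20


def estimated_gb(read_pairs):
    """Estimate storage in GB for a number of paired-end reads (linear scaling assumed)."""
    return read_pairs / 1e6 * GB_PER_MILLION_PAIRS


if __name__ == "__main__":
    # Hypothetical experiment: 12 samples at 50 million read pairs each.
    print(estimated_gb(12 * 50e6))
```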

Page 49: 2013 july 25 systems biology rna seq v2

A growing number of tools enable RNA-Seq analysis

Page 50: 2013 july 25 systems biology rna seq v2

What percentage of reads are covered? What percentage of reads are mapped?

3’ bias on transcript reads
1. 60-80% of reads are mapped
2. The highest percentage of mapped reads is at the 3’ end
3. Reads need to be quality trimmed

Mapping tools are biased toward exons of known genes
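The mapped percentage can be computed directly from an alignment file. Below is a minimal Python sketch, assuming SAM text input, where bit 0x4 of the FLAG field marks an unmapped read and bits 0x100 and 0x800 mark secondary and supplementary records. The function name and the example records are hypothetical.

```python
def percent_mapped(sam_lines):
    """Return the percentage of primary SAM records that are mapped.

    A record counts as mapped when bit 0x4 (segment unmapped) of its FLAG is
    not set. Secondary (0x100) and supplementary (0x800) records are skipped
    so that each read is counted once.
    """
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


if __name__ == "__main__":
    # Made-up records: two mapped, one unmapped, one secondary alignment.
    records = [
        "@HD\tVN:1.4",
        "r1\t0\tchr21\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tchr21\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r4\t256\tchr21\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(round(percent_mapped(records), 1))
```

For paired-end data each mate is its own record, so this counts mates rather than pairs.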

Page 51: 2013 july 25 systems biology rna seq v2

Galaxy is a web-based tool committed to enabling researchers (for more than just RNA-Seq)

Page 53: 2013 july 25 systems biology rna seq v2

How to visualize mapped results?

• UCSC Genome Browser (Gbrowse)
• Integrated Genome Browser (IGB)
• Integrative Genomics Viewer (IGV)

Many shared formats, reading many of the outputs generated by the programs, and the ability to generate one's own tracks

Page 54: 2013 july 25 systems biology rna seq v2

[Figure: UCSC Genome Browser view, hg19, chr21:23,600,000-23,650,000 (50 kb scale). Tracks include C7 Random and C7 Targeted reads, UCSC Genes, RefSeq Genes, human mRNAs from GenBank, spliced human ESTs, SwitchGear Genomics transcription start sites, ENCODE tracks (Transcription Factor ChIP-seq, Layered H3K27Ac, Digital DNaseI Hypersensitivity Clusters, ChIA-PET, DNA methylation, 5C chromatin interactions), ORegAnno regulatory elements, Vertebrate Multiz Alignment & Conservation (46 species), and dbSNP 137.]

Page 57: 2013 july 25 systems biology rna seq v2

What do RNA-Seq reads look like for GAPDH?

Repeat-masked, BLAT-aligned reads (allowing 1/2 mismatched bases), viewed in IGB 6.7.2

Page 58: 2013 july 25 systems biology rna seq v2

RNA-Seq Differential Expression Analysis

Page 59: 2013 july 25 systems biology rna seq v2

What does GAPDH look like in terms of quantitation?

("-" marks a cell with no value)

         TOTAL BM                              HPP
Gene     RPKM     3SEQ Counts   BLAT Reads     RPKM     3SEQ Counts   BLAT Reads
CD34     0.7      340           230            8        8             14
BST1     19.7     5374          -              31       31            -
CD133    0.2      173           176            16       16            33
THY1     0        7             -              4        4             -
A12      -        -             1              -        -             0
A5       -        -             0              -        -             0
ALK      0        9             24             0        0             3
B9       -        -             0              -        -             0
C1       -        -             0              -        -             0
C2       -        -             0              -        -             0
C7       -        -             0              -        -             0
E7       -        -             0              -        -             0
E9       -        -             2              -        -             0
F6       -        -             0              -        -             0
G12      -        -             0              -        -             0
GAPDH    3013.2   727831        356289         120.8    5559          2670
H3       -        -             0              -        -             0

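For GAPDH (TOTAL BM vs. HPP): BLAT raw read count ratio (356289 / 2670 ≈ 133) ≈ 3SEQ count ratio (727831 / 5559 ≈ 131), i.e. about 130 to 1; RPKM ratio ≈ 24.9 (3013.2 / 120.8)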
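The GAPDH ratios can be rechecked from the table values with a few lines of Python. The dictionary keys and function name are made up for illustration; the numbers are the GAPDH row as given.

```python
# GAPDH values copied from the table above.
gapdh = {
    "TOTAL BM": {"rpkm": 3013.2, "seq3_counts": 727831, "blat_reads": 356289},
    "HPP": {"rpkm": 120.8, "seq3_counts": 5559, "blat_reads": 2670},
}


def ratio(measure):
    """Ratio of TOTAL BM to HPP for one measure."""
    return gapdh["TOTAL BM"][measure] / gapdh["HPP"][measure]


if __name__ == "__main__":
    for measure in ("seq3_counts", "blat_reads", "rpkm"):
        print(f"{measure}: {ratio(measure):.1f}")
```

The two raw-count ratios agree (about 131 and 133), while the RPKM ratio computed from the table is about 24.9. Since RPKM divides by each library's total mapped reads, a smaller RPKM ratio is what one would expect if the TOTAL BM library is considerably deeper than the HPP library.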

Page 60: 2013 july 25 systems biology rna seq v2

RNA-Seq Quantification Challenge: A problem that exists with RNA-Seq data and doesn’t exist with array data: longer transcripts produce more reads than shorter transcripts

One solution to account for this is RPKM (FPKM is used by Cufflinks):

RPKM = 10^9 × C / (N × L), that is, the read fraction C/N scaled per kilobase of exon and per million mapped reads

C (gene) = the number of mappable reads that fall onto a gene's exons
N = total number of mappable reads in the experiment
L (gene) = the sum of the exons in base pairs

Wold (2008)

RPKM = reads per kilobase per million mapped reads
CPM = counts per million
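A minimal Python sketch of the RPKM and CPM definitions, with the formula read as 10^9 × C / (N × L). The function names, parameter names and example numbers are made up for illustration.

```python
def rpkm(gene_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads: 1e9 * C / (N * L)."""
    return 1e9 * gene_reads / (total_mapped_reads * exon_length_bp)


def cpm(gene_reads, total_mapped_reads):
    """Counts per million mapped reads: 1e6 * C / N (no length correction)."""
    return 1e6 * gene_reads / total_mapped_reads


if __name__ == "__main__":
    n = 20_000_000  # hypothetical library size
    # A gene twice as long with twice the reads gets the same RPKM but twice the CPM.
    print(rpkm(500, n, 2_000), cpm(500, n))
    print(rpkm(1_000, n, 4_000), cpm(1_000, n))
```

This shows the length correction at work: CPM doubles for the longer gene, while RPKM stays the same.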

Page 61: 2013 july 25 systems biology rna seq v2

RNA-Seq Quantification Challenge: The DESeq method uses the geometric mean of counts in all samples

DESeq method:
1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.
2. To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample. Now you have, for each gene, an estimate of the depth ratio.
3. Take the median of all the quotients to get the relative depth of the library.

The 'estimateSizeFactors' function of the DESeq package does this calculation.
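The calculation described above can be sketched in plain Python. This is an illustration of the median-of-ratios idea under stated assumptions, not the DESeq implementation: the function name and the toy counts are made up, and genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts -- dict mapping gene name to a list of raw counts, one per sample
              (same sample order for every gene).
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue  # geometric mean would be zero, so the quotient is undefined
        # Reference sample value for this gene: geometric mean across samples.
        geo_mean = math.exp(sum(math.log(c) for c in gene_counts) / n_samples)
        for j, c in enumerate(gene_counts):
            ratios[j].append(c / geo_mean)
    # Relative depth of each library: median of its per-gene quotients.
    return [median(r) for r in ratios]


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced twice as deeply as sample 1.
    toy = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100], "g4": [0, 5]}
    print(size_factors(toy))
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale before they are compared.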

Page 62: 2013 july 25 systems biology rna seq v2

DESeq: an R package that works with raw counts to determine genes differentially expressed across samples

• Simon Anders

Page 66: 2013 july 25 systems biology rna seq v2

Given a list of differentially expressed genes, enrichment analysis should now be performed

• Enrichment analysis lets the researcher leverage documented experiments that provide evidence for genes' roles in pathways and functions, which helps in interpreting the results and significance of their own experiments
• DAVID
  – Gene ontology
  – Functional ontology
• REVIGO
  – Output of DAVID may be placed in REVIGO for further interpretation and statistical exploration of the significance of discovered sets of genes
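One common way to score enrichment is a hypergeometric tail probability: how likely is it to see at least this many annotated genes in the differentially expressed list by chance? The Python sketch below shows that calculation with made-up numbers. DAVID's own statistic (the EASE score) is a modified Fisher exact test, so values from this sketch will not match DAVID's output exactly.

```python
from math import comb


def enrichment_p_value(n_genome, n_in_set, n_de, n_overlap):
    """Hypergeometric upper tail P(X >= n_overlap).

    n_genome  -- number of genes in the background
    n_in_set  -- background genes annotated to the pathway or term
    n_de      -- number of differentially expressed genes
    n_overlap -- differentially expressed genes that carry the annotation
    """
    upper = min(n_in_set, n_de)
    total = comb(n_genome, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_genome - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, 150 in the pathway,
    # 400 differentially expressed, 12 of them in the pathway.
    print(enrichment_p_value(20_000, 150, 400, 12))
```

When many terms are tested, the resulting p-values need a multiple-testing correction before they are interpreted.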

Page 67: 2013 july 25 systems biology rna seq v2

Using differentially expressed genes, biological pathways should be explored

• Differentially expressed genes are put into programs such as Pathway Studio or Ingenuity
• Shortest path programs
• Canonical pathway analysis
• Enables a researcher to reverse engineer the pathways expressed in the course of a healthy response versus a diseased response
• Ideally a pathway reveals the observed phenotype, connecting the gene expression program with the phenotype: genotype -> gene expression program -> phenotype
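The idea behind a shortest-path search can be sketched with a breadth-first search over an interaction network. The network, the node names and the function below are hypothetical, and this is not how Pathway Studio or Ingenuity are implemented; it only shows what "shortest path" means on an unweighted graph.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest path as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical directed interaction network.
    network = {
        "receptor": ["kinaseA", "kinaseB"],
        "kinaseA": ["tf1"],
        "kinaseB": ["tf1"],
        "tf1": ["target_gene"],
    }
    print(shortest_path(network, "receptor", "target_gene"))
```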

Page 68: 2013 july 25 systems biology rna seq v2

RNA-Sequencing: What is it good for?

• Transcript annotation
  – Mutation identification
  – Isoform determination
  – Alternative splice variation
• Differential gene expression
  – Phenotypically segregating experiments
  – Allows us to get at the "how" when looking at the response of an organism, within a particular cell population, to events
  – Good and careful design will allow us to unfold the dynamics of this response and identify targets for altering disease responses to improve one's chances of surviving

Page 70: 2013 july 25 systems biology rna seq v2

http://bayes.cs.ucla.edu/home.htm

Page 74: 2013 july 25 systems biology rna seq v2

Acknowledgements

Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Marcel Schmidt
Dr. Elena Tassi
The entire Wellstein/Riegel laboratory: Elena, Virginie, Ghada, Ivana, Eveline, Khalid, Eric

My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)

High Performance Core Group: Steve Moore, especially Woonki Chung
Amazon Cloud Services
Dr. Ann Loraine, UNC, IGB developer
Brian Haas, author of the Trinity suite

Page 75: 2013 july 25 systems biology rna seq v2

Some Resources

• http://rnaseq.uoregon.edu/index.html
• http://dx.doi.org/10.1038/npre.2010.4282.1 (DESeq)
• http://galaxy.psu.edu/
• http://seqanswers.com/
• http://www.broadinstitute.org/igv/
• http://bioviz.org/igb/index.html
• http://www.illumina.com
• http://www.otogenetics.com
• http://www.dnanexus.com
• http://bioconductor.org/packages/2.12/bioc/html/limma.html
• http://trinityrnaseq.sourceforge.net/
• http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html
• http://cufflinks.cbcb.umd.edu/
• http://brb.nci.nih.gov/BRB-ArrayTools.html
• http://www.modernatx.com/

Page 76: 2013 july 25 systems biology rna seq v2

Systems Biology History (Wikipedia)

• Systems biology roots found in
  – Quantitative modeling of enzyme kinetics
  – Mathematical modeling of population growth
  – Simulations to study neurophysiology
  – Control theory and cybernetics
• Theorists
  – Ludwig von Bertalanffy: General Systems Theory
  – Alan Lloyd Hodgkin and Andrew Fielding Huxley: constructed a mathematical model that explained the action potential propagating along the axon of a neuron
  – Denis Noble: first computer model of the heart pacemaker

Page 77: 2013 july 25 systems biology rna seq v2

Scientific knowledge is limited (and advanced) by the limits (and advancements) of measurement

• Ilya Shmulevich, Genomic Signal Processing: “Validity of the model involves observation and measurement, scientific knowledge is limited by the limits of measurement”
• Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”