Laura Riva & Mattia Pelizzola. NGS raw data analysis: Introduction to Next-Generation Sequencing (NGS) data analysis

NGS raw data analysis · 2013. 3. 4.



  • 1. Description of sequencing platforms

  • Illumina sequencing

    Library preparation: DNA fragmentation and adapter ligation

    Automated cluster generation: attachment to the flow cell, then cluster generation by bridge amplification of DNA fragments

    Sequencing: single-base extension with incorporation of fluorescently labeled nucleotides

  • Illumina output

    Sequence files, quality scores

  • Platform              | 454 Ti Roche™  | Illumina HiSeq™ 2000   | ABI 5500 (SOLiD)
    Amplification         | Emulsion PCR   | Bridge PCR             | Emulsion PCR
    Sequencing reaction   | Pyrosequencing | Reversible terminators | Ligation-based sequencing
    Paired ends/sep       | Yes/3 kb       | Yes/200 bp             | Yes/3 kb
    Read length           | 400 bp         | 100 bp                 | 75 bp

    Advantages: Short run times. Longer reads improve mapping in repetitive regions. Ability to detect large structural variations.

    The most popular platform

  • The Illumina HiSeq

    Stacey Gabriel

  • Libraries, lanes, and flowcells

    Each reaction produces a unique library of DNA fragments for sequencing.

    Each NGS machine processes a single flowcell containing several independent lanes during a single sequencing run.

    [Figure: Illumina flowcell, with the lanes labeled]

  • Applications of Next-Generation Sequencing

  • Basic data analysis concepts:

    • Raw data analysis = image processing and base calling
    • Understanding FASTQ
    • Understanding quality scores
    • Quality control and read cleaning
    • Alignment to the reference genome
    • Understanding SAM/BAM formats
    • Samtools
    • Bedtools

  • Understanding FASTQ file format

    FASTQ is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high-throughput sequencing instruments.

    Each record consists of four lines:

    1. Line 1 begins with a '@' character and is followed by a sequence identifier.
    2. Line 2 contains the raw sequence letters.
    3. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier.
    4. Line 4 encodes the quality values for the sequence in line 2 and must contain the same number of symbols as there are letters in the sequence.
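The four-line record layout described above can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # raw sequence letters
    quality: str     # one quality character per sequence letter


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumes four lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = "@read1\nGATTACA\n+\nIIIIHHG\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, records[0].sequence)
```

The length check mirrors rule 4: a record whose quality line is shorter or longer than its sequence line is rejected.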

  • Understanding quality scores

    FASTQ files contain quality information. The Phred quality score (Q) is logarithmically related to the base-calling error probability (P):

    Q = -10 log10(P)

    where P is the probability that the corresponding base call is incorrect.

  • Phred quality scores

    Each Phred quality score (Q) is represented by a single ASCII character (one byte, not one bit). ASCII stands for American Standard Code for Information Interchange; an ASCII code is the numerical representation of a character such as 'a' or '@', or of a control action. The first 32 ASCII codes (0 to 31) are control characters and code 32 is the space, so the Sanger encoding starts at 33.

    If the ASCII character is 'B' and the file is in Sanger format, then Q = 66 - 33 = 33, so:

    33 = -10 log10(p)
    -3.3 = log10(p)
    p = 10^(-3.3) ≈ 0.0005, i.e. a 0.05% chance of an incorrect base.
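The worked example for 'B' can be checked with a few lines of code. This is a sketch assuming the Sanger offset of 33; the function names are hypothetical.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one quality character into a Phred score (Sanger offset assumed)."""
    return ord(char) - offset


def error_probability(q: float) -> float:
    """Invert Q = -10 log10(P) to get the base-calling error probability."""
    return 10 ** (-q / 10)


q = phred_from_char("B")   # ord('B') is 66, so Q = 33
p = error_probability(q)   # 10 ** -3.3
print(q, p)
```

Other encodings use a different offset, which is why the FASTX commands later in the recitation pass `-Q 33` explicitly.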

  • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    +
    BBBBCCCC?

  • Quality control and read cleaning

    • Before alignment it is sometimes necessary to preprocess the FASTA/FASTQ files to obtain better mapping results. Quality control checks reveal problems in the data that you should be aware of before doing any further analysis.

    • FastQC: quality control checks on raw sequence data coming from high-throughput sequencing pipelines (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).

    • The FASTX-Toolkit performs some of these preprocessing tasks (http://hannonlab.cshl.edu/fastx_toolkit/). Two of its many useful tools are:

      – FASTQ Quality Filter: filters sequences based on quality
      – FASTQ Quality Trimmer: trims (cuts) sequences based on quality

  • FastQC: quality controls of sequencing files I

    PER BASE SEQUENCE QUALITY

    [Figure: FastQC per-base quality box plot along the read ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC. Annotations mark the 90th percentile, median, mean and 10th percentile of the data; background bands mark very good, reasonable and poor quality.]

    A very good quality sequence

  • FastQC: quality controls of sequencing files II

    DUPLICATION LEVELS

    [Figure: FastQC duplication-levels plot. One axis: how many times is the sequence repeated? Other axis: which portion of the sequences has that characteristic?]

    Duplication is computed on the first 200,000 reads, which gives an overall impression of the duplication levels in the whole file.

    A very good quality sequence

  • General steps of data analysis:

    – Quality control and read cleaning
    – Alignment to the genome

    There are many aligners, including many short-read aligners.

  • Alignment strategy:

    • Mapping: quickly identify candidate hits on the reference genome
    • Alignment and report: score the alignment
    • Important features:
      – Some software uses the base quality scores to evaluate alignments, other software does not
      – For all aligners there is a trade-off between performance and accuracy
      – Gapped or ungapped alignment
      – Important parameters:
        • Maximum number of mismatches
        • Reporting unique hits or multiple hits

  • BWA

    • Uses the Burrows-Wheeler indexing algorithm to speed up alignment
    • Fast, with moderate memory usage
    • Works for different sequencing platforms

  • The Sequence Alignment/Map (SAM) format:

    • Format for the storage of sequence alignments and their mapping coordinates
    • Supports different sequencing platforms

  • SAM  file  format  

  • BAM headers: an optional (but essential) part of a BAM file

    @HD VN:1.0 GO:none SO:coordinate
    @SQ SN:chrM LN:16571
    @SQ SN:chr1 LN:247249719
    @SQ SN:chr2 LN:242951149
    [cut for clarity]
    @SQ SN:chr9 LN:140273252
    @SQ SN:chr10 LN:135374737
    @SQ SN:chr11 LN:134452384
    [cut for clarity]
    @SQ SN:chr22 LN:49691432
    @SQ SN:chrX LN:154913754
    @SQ SN:chrY LN:57772954
    @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI
    @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI
    @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI
    @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI
    @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI
    @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI
    @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI
    @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI
    @PG ID:BWA VN:0.5.7 CL:tk
    @PG ID:GATK TableRecalibration VN:1.0.2864
    20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33

    Annotations:

    • @HD. Required: standard header.
    • @SQ. Essential: contigs of the aligned reference sequence. Should be in karyotypic order.
    • @RG. Essential: read groups. Carry platform (PL), library (LB), and sample (SM) information. Each read is associated with a read group.
    • @PG. Useful: data processing tools applied to the reads.


  • Mapping quality scores

    Mapping quality scores quantify the probability that a read is misplaced (P, the probability that the alignment is wrong). The mapping quality (Q) is this error probability expressed on the Phred scale.

    The calculation of mapping qualities considers all of the following factors:

    • The repeat structure of the reference. Reads falling in repetitive regions have very low mapping quality.
    • The base quality of the read. Low quality means the observed read sequence is possibly wrong, and a wrong sequence may lead to a wrong alignment.
    • The sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
    • Paired end or not. Reads mapped in pairs are more likely to be correct.

  • CIGAR string

    • M: match/mismatch
    • I: insertion
    • D: deletion
    • S: soft clip
    • H: hard clip
    • P: padding
    • N: skip
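To make the operation codes concrete, here is a minimal sketch that splits a CIGAR string into (length, operation) pairs and computes how many reference and read bases the alignment covers. The function names are hypothetical. The sets of operations that consume reference or read bases follow the SAM specification, which also defines the '=' and 'X' operations not listed on the slide.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM specification: which operations advance along the
# reference and which advance along the read (query) sequence.
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string such as '35M' into [(35, 'M')]."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than valid operations.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (hard clips excluded)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


# Made-up CIGAR: 3 soft-clipped bases, an insertion and a deletion.
print(reference_span("3S30M2I10M1D5M"), query_length("3S30M2I10M1D5M"))
```

For an ungapped alignment such as `35M`, both values are simply 35.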

  • SAMtools

    • Library and software package for manipulating SAM/BAM files
    • SAM → BAM conversion (samtools view)
    • samtools view -f INT file.bam
      Only output alignments with all bits in INT present in the FLAG field.
    • samtools view -F INT file.bam
      Skip alignments with any bit in INT present in the FLAG field [default: 0].
    • Creates sorted and indexed BAM files from SAM files (samtools sort / samtools index)
    • Removing PCR duplicates (samtools rmdup)
    • Merging alignments (samtools merge)
    • Visualization of alignments from BAM files
    • SNP calling and short indel detection

    References:
    1. http://samtools.sourceforge.net/
    2. http://samtools.sourceforge.net/samtools.shtml
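The `-f` and `-F` options test individual bits of the FLAG field. The sketch below illustrates that logic; it is not samtools code. The bit values follow the SAM specification, while the short labels and function names are hypothetical.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """List the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic -f (all required bits set) and -F (no excluded bit set)."""
    return (flag & require) == require and (flag & exclude) == 0


print(decode_flag(163))                # FLAG of the read in the BAM header slide
print(passes_filter(4, require=4))     # like -f 4: keeps unmapped reads
print(passes_filter(16, exclude=4))    # like -F 4: keeps mapped reads
```

A FLAG of 163 decomposes into 1 + 2 + 32 + 128, i.e. a properly paired second read whose mate is on the reverse strand.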

  • BEDtools

    The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely used file formats: BED, GFF/GTF, VCF and SAM/BAM.

    • slopBed: adjusts each BED entry by a requested number of base pairs.
    • shuffleBed: randomly permutes the locations of a BED file among a genome.
    • intersectBed (BAM): returns overlaps between two BED/GFF/VCF files.
    • genomeCoverageBed (BAM): creates a "per base" report of genome coverage.
    • subtractBed: removes the portion of an interval that is overlapped by another feature.
    • mergeBed: merges overlapping features into a single feature.

    [Figure: BED format]

    References:
    1. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), pp. 841–842.
    2. http://code.google.com/p/bedtools/downloads/detail?name=BEDTools-User-Manual.v4.pdf
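As an illustration of what a tool like mergeBed computes, the sketch below merges overlapping intervals per chromosome. It is not the BEDTools implementation. It assumes BED-style coordinates (0-based start, end exclusive) and, as an assumption about the default behaviour, also merges intervals that touch end to start; `merge_intervals` is a hypothetical name.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (chrom, start, end) intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the running interval: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up features, for illustration only.
features = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
print(merge_intervals(features))
```

Sorting by start position first is what lets a single pass over each chromosome suffice.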

  • BamTools

    Freely available at http://github.org/pezmaster31/bamtools

  • Picard

    Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both the SAM text format and the SAM binary (BAM) format are supported.

    Freely available at http://picard.sourceforge.net

  • GATK

    Freely available at http://www.broadinstitute.org/gatk/

  • Reads visualization: IGV

    [Figure: IGV screenshot. Labeled elements: reads and fragments; depth of coverage; individual reads aligned to the genome; a clean C/T heterozygote; details about a specific read; first and second read from the same fragment; reference genome. Non-reference bases are colored; reference bases are grey.]

    Reference: http://www.broadinstitute.org/igv/

  • Recitation

    We are going to use a FASTQ file that is available in GEO (http://www.ncbi.nlm.nih.gov/geo/, record GSE24326).

    It is an HSF1 ChIP-seq experiment used to identify the genome-wide sites of HSF1 binding in human adipocytes derived from mesenchymal stem cells.

    HSF1 (heat shock factor protein 1) is a heat-shock transcription factor. Its target genes include major inducible heat shock proteins such as Hsp72. The transcription of heat-shock genes is rapidly induced after temperature stress.

    The known HSF1 motif is: [Figure: HSF1 motif]


  • Quality control

    fastx_quality_stats -Q 33 -i hsf1.fastq -o hsf1_stats.txt

    The output text file has the following fields (one row per column):

    • column = column number (1 to 36 for a 36-cycle Solexa read file)
    • count = number of bases found in this column.
    • min = lowest quality score value found in this column.
    • max = highest quality score value found in this column.
    • sum = sum of quality score values for this column.
    • mean = mean quality score value for this column.
    • Q1 = 1st quartile quality score.
    • med = median quality score.
    • Q3 = 3rd quartile quality score.
    • IQR = inter-quartile range (Q3-Q1).
    • lW = 'Left-Whisker' value (for boxplotting).
    • rW = 'Right-Whisker' value (for boxplotting).
    • A_Count = count of 'A' nucleotides found in this column.
    • C_Count = count of 'C' nucleotides found in this column.
    • G_Count = count of 'G' nucleotides found in this column.
    • T_Count = count of 'T' nucleotides found in this column.
    • N_Count = count of 'N' nucleotides found in this column.
    • max-count = max. number of bases (in all cycles)

  • fastq_quality_boxplot_graph.sh -i hsf1_stats.txt -o hsf1.png

    fastx_artifacts_filter -Q 33 -v -i hsf1.fastq -o hsf1_artifact.fastq

    It removes sequencing artifacts.

    Input: 6313567 reads.
    Output: 6312015 reads.
    discarded 1552 (0%) artifact reads.

  • fastq_quality_trimmer -Q 33 -t 20 -l 28 -v -i hsf1_artifact.fastq -o hsf1_artifact_trim.fastq

    It trims sequences based on quality.

    Minimum Quality Threshold: 20
    Minimum Length: 28
    Input: 6312015 reads.
    Output: 5929465 reads.
    discarded 382550 (6%) too-short reads.

    fastq_quality_filter -Q 33 -v -q 20 -p 80 -i hsf1_artifact_trim.fastq -o hsf1_artifact_trim_qual.fastq

    It filters sequences based on quality.

    Quality cut-off: 20
    Minimum percentage: 80
    Input: 5929465 reads.
    Output: 5655141 reads.
    discarded 274324 (4%) low-quality reads.
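The two steps above can be sketched in a few lines to show what the thresholds mean. This is a simplified reading of the tools' documented options, not the FASTX-Toolkit code: it assumes the trimmer removes bases below the threshold from the end of the read and then discards reads shorter than the minimum length, and that the filter keeps a read when at least the given percentage of its bases reach the quality cut-off. All function names are hypothetical.

```python
def decode_qualities(quality: str, offset: int = 33) -> list:
    """Convert a quality string into Phred scores (offset 33, as with -Q 33)."""
    return [ord(char) - offset for char in quality]


def trim_read(sequence, quality, threshold=20, min_length=28, offset=33):
    """Trim low-quality bases from the read end; return None if too short."""
    scores = decode_qualities(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None  # counted as a discarded, too-short read
    return sequence[:end], quality[:end]


def passes_quality_filter(quality, min_quality=20, min_percent=80, offset=33):
    """True if at least min_percent of the bases have score >= min_quality."""
    scores = decode_qualities(quality, offset)
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    return 100 * good >= min_percent * len(scores)


# Made-up read: 'I' decodes to Q40 and '#' to Q2 with offset 33.
print(trim_read("ACGTACGT", "IIIIII##", threshold=20, min_length=4))
print(passes_quality_filter("IIII#"))
```

With the recitation's settings (`-t 20 -l 28`, then `-q 20 -p 80`), a read survives only if at least 28 bases remain after trimming and at least 80% of them have a score of 20 or more.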

  • • fastx_quality_stats -Q 33 -i hsf1_artifact_trim_qual.fastq -o hsf1_artifact_trim_qual_stats.txt

    • fastq_quality_boxplot_graph.sh -i hsf1_artifact_trim_qual_stats.txt -o hsf1_artifact_trim_qual.png

  • Quality control with FastQC

    /home/lriva/FastQC/fastqc --contaminants /home/lriva/FastQC/Contaminants/contaminant_list.txt hsf1_artifact_trim_qual.fastq

  • BWA

    • A fast program able to align short query sequences (< 200 bp; error rate < 3%) to a target reference genome
    • Workflow:
      – index [-options] (indexes the database)
      – aln [-options] < input query.fastq> (computes the SA coordinates for the input reads)
      – samse [-options] (converts SA coordinates into chromosomal coordinates and generates alignments in SAM format for single reads)

  • Alignment of the reads

    • bwa aln -t 4 -f hsf1.sai /db/bwa/0.6.2/hg19/hg19 hsf1_artifact_trim_qual.fastq

    • bwa samse /db/bwa/0.6.2/hg19/hg19 hsf1.sai hsf1_artifact_trim_qual.fastq | samtools view -ut /db/bwa/0.6.2/hg19/hg19 - | samtools sort - hsf1

    • samtools index hsf1.bam

  • Visualizing alignments, converting from BAM binary to SAM text format

    samtools view -h -o hsf1_stdchr.nodup.sam hsf1_stdchr.nodup.bam

    less hsf1_stdchr.nodup.sam

  • Let's start

    Go to http://bioserver.iit.ieo.eu/~lriva/ and download corsoPoplimiChIPseq.txt

    ssh [email protected]
    ssh user@dpt-n1.bioinfo.ieo.eu
    cd public_html

    You can check all your files at http://bioserver.iit.ieo.eu/~user/

  • Filtering alignments

    Counting and visualizing unmapped reads:
    • samtools view -f 4 -c hsf1.bam
    • samtools view -f 4 hsf1.bam | less

    HWI-EAS413:4:1:3:1708 4 * 0 0 * * 0 0 GGTGCTGGCCAACATGACCTTGCACGGA BCACCCCBCCCBCCBBCCCBB

  • Filtering alignments

    Counting and visualizing mapped reads:
    • samtools view -F 4 -c hsf1.bam
    • samtools view -F 4 hsf1.bam | less

    HWI-EAS413:4:100:367:1780 16 chr1 9999 25 35M * 0 0 GATCTCCCTAACCCTAACCCTAACCCTAACCCTAA ?B;BABBAAC?AC?AC?CBCBCBCCCBCBBCCBCB XT:A:U NM:i:2 XN:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:3A0A30

  • Select only mapped reads with mapping quality >= 1 on chr1 to chr22 and chrX

    samtools view -F 4 -q 1 -bo hsf1_stdchr.bam hsf1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX

  • Prefix trie of string 'GOOGOL'

    [Figure: prefix trie of 'GOOGOL'. Each node is labeled with the SA interval of the string it represents (the root is labeled 0,6); edges are labeled with the characters G, O and L; ^ = start of the string.]

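  • Burrows-Wheeler transform (BWT)

    X = 'googol$' (reference string)
    W = 'go' (query string)
    SA interval of 'go' = [1,2]

    [Figure: the sorted rotations of 'googol$' with the suffix array and B = BWT(X), the BWT string of X; the rows in the SA interval of 'go' are marked with *. String positions: g o o g o l = 0 1 2 3 4 5.]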
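The suffix array and BWT string for this example can be reproduced with a naive construction. This is a sketch for small strings only, since it sorts whole suffixes; real aligners use far more efficient index construction. The function names are hypothetical.

```python
def suffix_array(text: str) -> list:
    """Start positions of the suffixes of text, in lexicographic order.

    Assumes text ends with a unique '$' that sorts before all other symbols.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list) -> str:
    """BWT string: the character preceding each sorted suffix.

    For the suffix starting at 0, text[-1] wraps around to the final '$'.
    """
    return "".join(text[i - 1] for i in sa)


def sa_interval_naive(text: str, sa: list, query: str):
    """Scan the suffix array for suffixes that start with query."""
    ranks = [k for k, pos in enumerate(sa) if text[pos:].startswith(query)]
    return (ranks[0], ranks[-1]) if ranks else None


X = "googol$"
SA = suffix_array(X)
B = bwt_from_suffix_array(X, SA)
print(SA)                               # [6, 3, 0, 5, 2, 4, 1]
print(B)                                # lo$oogg
print(sa_interval_naive(X, SA, "go"))   # (1, 2)
```

The two suffixes that start with 'go' ('gol$' and 'googol$') are adjacent in sorted order, which is why every query corresponds to one contiguous interval of the suffix array.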

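  • Suffix array interval: formal definition

    $\underline{R}(W) = \min\{k : W \text{ is a prefix of } X_{S(k)}\}$
    $\overline{R}(W) = \max\{k : W \text{ is a prefix of } X_{S(k)}\}$

    where S is the suffix array of X and $X_{S(k)}$ is the suffix of X that starts at position S(k).

    If W is an empty string then $[\underline{R}(W), \overline{R}(W)] = [1, n-1]$.

    $[\underline{R}(W), \overline{R}(W)]$ = SA interval of W

    Example: X = 'googol$', W = 'go'

    $\underline{R}(\text{'go'}) = 1$, $\overline{R}(\text{'go'}) = 2$

    SA interval of 'go' = [1,2]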

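  • Exact match: Ferragina-Manzini algorithm

    X = reference string; W = query string
    C(a) = number of symbols smaller than a in X[0, n-2]
    O(a, i) = number of occurrences of a in B[0, i], where B = BWT(X)

    If W is a substring of X:

    $\underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1$
    $\overline{R}(aW) = C(a) + O(a, \overline{R}(W))$

    and $\underline{R}(aW) \le \overline{R}(aW)$ if and only if aW is a substring of X.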

  • Exact match: backward search

    X = 'googol$'
    bwt = 'lo$oogg'
    W = 'ogo'

    i  SA  sorted rotation
    0  6   $googol
    1  3   gol$goo
    2  0   googol$
    3  5   l$googo
    4  2   ogol$go
    5  4   ol$goog
    6  1   oogol$g

    if W ← ' ':
        L ← 1
        R ← length(X)-1
    j ← length(W)-1

    while (L ≤ R and j ≥ 0) do:
        aW ← W[ j:end ]
        if length(aW) …

  • Exact match: backward search (continued)

    X = ATTTGATCTTTTGATC (positions 0 to 15)
    W = GATC
    SA interval = [6,7]

    SA = (16, 13, 5, 0, 15, 7, 12, 4, 14, 6, 11, 3, 10, 2, 9, 1, 8)
    BWT string = CGG$TTTTAATTTTTAC

    i   SA  sorted rotation
    0   16  $ATTTGATCTTTTGATC
    1   13  ATC$ATTTGATCTTTTG
    2   5   ATCTTTTGATC$ATTTG
    3   0   ATTTGATCTTTTGATC$
    4   15  C$ATTTGATCTTTTGAT
    5   7   CTTTTGATC$ATTTGAT
    6   12  GATC$ATTTGATCTTTT  *
    7   4   GATCTTTTGATC$ATTT  *
    8   14  TC$ATTTGATCTTTTGA
    9   6   TCTTTTGATC$ATTTGA
    10  11  TGATC$ATTTGATCTTT
    11  3   TGATCTTTTGATC$ATT
    12  10  TTGATC$ATTTGATCTT
    13  2   TTGATCTTTTGATC$AT
    14  9   TTTGATC$ATTTGATCT
    15  1   TTTGATCTTTTGATC$A
    16  8   TTTTGATC$ATTTGATC
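The backward search can be sketched end to end using the C(a) and O(a, i) quantities defined on the Ferragina-Manzini slide. This is a minimal illustration with naive index construction, not BWA's implementation, and all function names are hypothetical. One deliberate difference from the slides: the search starts from the full interval [0, n-1] rather than [1, n-1], so that an occurrence ending at the last character of X is also found (both copies of GATC in the example above).

```python
def build_index(text: str):
    """Return (suffix array, C, occ) for a text ending in a unique '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    body = text[:-1]  # X[0, n-2], i.e. the text without the final '$'
    # C[a] = number of symbols in the body that are smaller than a
    C = {a: sum(1 for ch in body if ch < a) for a in alphabet}
    # occ[a][i] = occurrences of a in bwt[0:i], so O(a, i) == occ[a][i + 1]
    occ = {a: [0] * (len(bwt) + 1) for a in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if ch == a else 0)
    return sa, C, occ


def backward_search(query: str, text_length: int, C, occ):
    """Return the SA interval (low, high) of query, or None if absent."""
    low, high = 0, text_length - 1  # assumption: start from the full interval
    for a in reversed(query):       # extend the match one symbol to the left
        if a not in C:
            return None
        low = C[a] + occ[a][low] + 1    # C(a) + O(a, low - 1) + 1
        high = C[a] + occ[a][high + 1]  # C(a) + O(a, high)
        if low > high:
            return None
    return low, high


X = "ATTTGATCTTTTGATC$"
sa, C, occ = build_index(X)
interval = backward_search("GATC", len(X), C, occ)
positions = sorted(sa[k] for k in range(interval[0], interval[1] + 1))
print(interval, positions)
```

Each loop iteration costs a constant number of table lookups, so the search time depends on the query length rather than on the reference length.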

  • Inexact  match  algorithm  

  • Basics of NGS technology

    • DNA is fragmented
    • Adaptors are ligated to the fragments
    • Several possible protocols yield an array of PCR colonies (AMPLIFICATION)
    • Enzymatic extension with fluorescently tagged nucleotides (SEQUENCING REACTION)
    • Cyclic readout by imaging the array
    • Align strings to the reference genome

    The following paper describes sequencing technologies in detail: Michael Metzker, 'Sequencing technologies - the next generation', Nat Rev Genet. 2010 Jan;11(1):31-46.

  • Mapping Algorithm (2 mismatches)

    Read: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG

    Seeds: GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG

    [Figure: the seeds are looked up in an indexed table of exactly matching seeds to find exact matches on the genome, followed by an approximate search around the exactly matching seeds.]

  • Mapping Algorithm (2 mismatches)

    • Partition each read into 4 seeds {A, B, C, D}
      – With at most 2 mismatches, at least 2 seeds must map with no mismatches
    • Scan the genome to identify locations where the seeds match exactly
      – 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
      – 6 scans to find all candidates
    • Do approximate matching around the exactly matching seeds
      – Determine all targets for the reads
      – Ins/del can be incorporated
    • The reads are indexed and hashed before scanning the genome
    • Bit operations are used to accelerate mapping
      – Each nt is encoded in 2 bits
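The seed idea can be sketched as follows: splitting a read into 4 seeds guarantees that, with at most 2 mismatches, at least 2 seeds are untouched, and there are 6 seed pairs to try. This is an illustration of the principle only, not any aligner's code; the function names and the A/C/G/T to 0/1/2/3 bit assignment are assumptions.

```python
from itertools import combinations

TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide


def split_into_seeds(read: str, n_seeds: int = 4) -> list:
    """Cut a read into n_seeds consecutive pieces (the last takes any remainder)."""
    seed_len = len(read) // n_seeds
    seeds = [read[i * seed_len:(i + 1) * seed_len] for i in range(n_seeds - 1)]
    seeds.append(read[(n_seeds - 1) * seed_len:])
    return seeds


def seed_pairs(n_seeds: int = 4) -> list:
    """All pairs of seed indices: 6 pairs for 4 seeds."""
    return list(combinations(range(n_seeds), 2))


def encode_2bit(sequence: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | TWO_BIT[base]
    return value


def mismatch_free_seeds(read: str, candidate: str) -> list:
    """Indices of seeds that match exactly, compared as 2-bit integers."""
    pairs = zip(split_into_seeds(read), split_into_seeds(candidate))
    return [i for i, (s, t) in enumerate(pairs)
            if len(s) == len(t) and encode_2bit(s) == encode_2bit(t)]


read = "GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG"  # read shown on the slide
# Hypothetical genome window with 2 mismatches (positions 0 and 20).
window = "A" + read[1:20] + "T" + read[21:]
print(split_into_seeds(read))
print(seed_pairs())
print(mismatch_free_seeds(read, window))
```

Comparing packed integers instead of strings is the kind of bit operation the last bullet refers to: two seeds of equal length are identical exactly when their 2-bit encodings are equal.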