

8/11/16

[Figure: Personal Computer/Local (Terminal, IGV, Window) and SERVER/REMOTE (DB, Data files, Applications and Servers, Compute, Web App/Service), linked by SSH, SCP and WEB]

NGS Data and Sequence Alignment

Manpreet S. Katari
Aug 11, 2016

Outline

• NGS Data
  • FastA
  • FastQ
  • SAM
  • BAM
  • GFF
• Sequence Alignment
  • Global vs Local
  • Dynamic Programming
  • Burrows-Wheeler Algorithm

Important file types

• Sequence files: FASTA, FASTQ
• Alignment files: SAM, BAM
• Annotation files: GFF

Important file types: FASTA

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional).

There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
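The rules above can be sketched as a small reader. This is only an illustration: the function name `parse_fasta` and the toy records are assumptions made for this example, not part of the slides.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new ">" line ends the previous sequence, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line.strip() and identifier is not None:
            # Sequence data may span several lines; join them.
            chunks.append(line.strip())
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Toy input (hypothetical records, shortened for the example).
example = """>chrI example description
CCACACCACACCC
ACACACCCACACA
>seq0
FQTWEEFSRAAEK
"""
records = list(parse_fasta(example.splitlines()))
```

Here `records` holds one tuple per sequence, with the wrapped sequence lines joined into a single string.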

Important file types: FASTA

>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
GCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTAC
CCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTT
ACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTG
CCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACA
CACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATA
CCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCAT
CTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTT
GCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT
CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAAC
TGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTC
CATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCA
CCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTG
GCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAA
TATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACA
CAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTA
GCGT


Important file types: FASTA

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL

Important file types: FASTQ

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

[Figure: FASTQ format record, labelled Read Identifier, Read Sequence, Read Sequence Quality]

FASTQ: Data Format

• FASTQ
  • Text based
  • Encodes sequence calls and quality scores with ASCII characters
  • Stores minimal information about the sequence read
  • 4 lines per sequence
    • Line 1: begins with @; followed by sequence identifier and optional description
    • Line 2: the sequence
    • Line 3: begins with the “+” and is followed by sequence identifiers and description (both are optional)
    • Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
  • http://maq.sourceforge.net/fastq.shtml
  • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
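The four-line layout can be read with a few lines of Python. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences); `parse_fastq` and the two toy reads are invented for the illustration.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        # Line 1 starts with "@", line 3 starts with "+".
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Line 4 carries one quality character per base in line 2.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Hypothetical reads, only for demonstration.
example = "@read1 first read\nGATTACA\n+\nIIIIHHG\n@read2\nACGT\n+read2\n!!!!\n"
reads = list(parse_fastq(example.splitlines()))
```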

Sequence data format

Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90 %
20                     1 in 100                              99 %
30                     1 in 1000                             99.9 %
40                     1 in 10000                            99.99 %
50                     1 in 100000                           99.999 %

Q = Phred quality scores
P = Base-calling error probabilities
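The rows of the table follow the standard Phred relation Q = -10 log10(P). A small sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Reproduce the table: Q -> (probability of error, accuracy in percent).
table = {q: (phred_to_error_prob(q), 100 * (1 - phred_to_error_prob(q)))
         for q in (10, 20, 30, 40, 50)}
```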

Quality scores

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
  33                        59   64       73                             104                   126

S - Sanger         Phred+33,  raw reads typically (0, 40)
X - Solexa         Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+  Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+  Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+  Phred+33,  raw reads typically (0, 41)

Format/Platform    Quality Score Type    ASCII encoding
Sanger             Phred: 0-93           33-126
Solexa             Solexa: -5-62         64-126
Illumina 1.3       Phred: 0-62           64-126
Illumina 1.5       Phred: 0-62           64-126
Illumina 1.8       Phred: 0-62           33-126   *** Sanger format!

Quality score encoding differs among the platforms.

Most analysis tools require Sanger FASTQ quality score encoding.
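As a rough sketch of what the conversion to Sanger encoding involves, the snippet below decodes a quality string and shifts Phred+64 characters to Phred+33. The function names are assumptions made for this example. It applies only to Phred+64 data (Illumina 1.3 to 1.7); Solexa+64 scores use a different scale and would need more than an offset shift.

```python
def decode_quality(qual, offset=33):
    """Turn a quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def phred64_to_phred33(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    if any(ord(ch) < 64 for ch in qual):
        raise ValueError("character below ASCII 64: not Phred+64 data")
    # Same score, offset lowered from 64 to 33.
    return "".join(chr(ord(ch) - 64 + 33) for ch in qual)


sanger = phred64_to_phred33("hhh@")
```

In this toy case `"hhh@"` (scores 40, 40, 40, 0 at offset 64) becomes `"III!"`, which carries the same scores at offset 33.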


FASTQ Quality scores

[Figure: FASTQ quality score encodings, from http://en.wikipedia.org/wiki/FASTQ_format]

Capital letters are good!

SAM (Sequence Alignment/Map)

Alignment data format

• SAM is the output of aligners that map reads to a reference genome
• Tab delimited w/ header section and alignment section
  • Header sections begin with @ (are optional)
  • Alignment section has 11 mandatory fields
• BAM is the binary format of SAM

http://samtools.sourceforge.net/
http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields
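For reference, the 11 mandatory fields named in the SAM specification are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. A minimal sketch of splitting one alignment line into them follows; `SamRecord`, `parse_sam_line` and the example line are hypothetical and exist only for this illustration.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, fields[11:])


# Made-up alignment line, not taken from a real file.
line = "read1\t0\tchrI\t101\t60\t8M\t*\t0\t0\tGCGCCCTA\tIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
```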

Bitwise Flag

0x20 = 16^1*2 + 16^0*0 = 32

What is 77? Find greatest value without going over:
77 - 64 = 13    (0x40)
13 -  8 =  5    (0x8)
 5 -  4 =  1    (0x4)
 1 -  1 =  0    (0x1)

What is 141?

Flag values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
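The subtraction exercise above is the same as testing each bit of the flag. A short sketch, with bit meanings paraphrased from the SAM specification and a helper name (`decode_flag`) chosen for this example:

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
}


def decode_flag(flag):
    """Return the set bits of a SAM flag, largest first."""
    return [bit for bit in sorted(SAM_FLAGS, reverse=True) if flag & bit]


bits_77 = decode_flag(77)
bits_141 = decode_flag(141)
meaning_141 = [SAM_FLAGS[bit] for bit in bits_141]
```

With these definitions, 77 decomposes into 64 + 8 + 4 + 1 and 141 into 128 + 8 + 4 + 1, that is, an unmapped second-in-pair read whose mate is also unmapped.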

CIGAR string

29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M
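A CIGAR string is a run of (length, operation) pairs. The sketch below parses the example above and adds up how many reference and read bases it covers; the helper names are assumptions made for this illustration, and the operation letters are the ones defined in the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    cigar = cigar.replace(" ", "")  # the slide shows the string with spaces
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError("not a valid CIGAR string")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_length(ops):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in ops if op in "MDN=X")


def query_length(ops):
    """Read bases covered: M, I, S, = and X consume the read."""
    return sum(n for n, op in ops if op in "MIS=X")


ops = parse_cigar("29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M")
```

For this string the read contributes 156 bases and, because of the 7 deleted bases, spans 163 bases on the reference.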


SAM format

SAM format gets converted to a BAM file

Annotation Formats

• Mostly tab-delimited files that describe the location of genome features (e.g., genes)
• Also used for displaying annotations on standard genome browsers
• Important for associating alignments with specific genome features
• Descriptions

GFF format

Columns: Seqid, Source, Type, Start, End, Score, Strand, Phase, Attribute (Identifier)
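The nine columns map directly onto a tab split. A minimal sketch, assuming GFF3-style `key=value;key=value` attributes; the function name and the example line are hypothetical.

```python
def parse_gff_line(line):
    """Split one GFF feature line into its nine named columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF feature line has 9 tab-delimited columns")
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols
    # Assumption: attributes are written GFF3-style as key=value pairs.
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attrs,
    }


# Invented feature line for demonstration only.
feature = parse_gff_line("chrI\texample\tgene\t335\t649\t.\t+\t.\tID=gene0001;Name=demo")
```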

Global and local approaches to aligning sequences

GLOBAL: Attempt to “match” and assess similarity between two entire sequences.

LOCAL: Find subsequences of high similarity, and then possibly “stick” (chain, net, thread) together local alignments to obtain an overall comparison of the original sequences.

The second approach is often more meaningful (especially for long sequences of different lengths, like whole genomes).

Two protein or DNA sequences are unlikely to present a straightforward overall “match”, even if they are closely related.

Why? Substitutions are not the only process by which they diverge: insertions, deletions and rearrangements also occur.


Dynamic programming: Pairwise sequence alignment

Create a matrix table for comparing sequences, with one sequence along each axis (size m+1, n+1).

Fill in partial alignment scores until the score for the entire sequence has been calculated:
- Assign a score for each position i,j, progressing from top left to bottom right
- Score rule: take the maximum of the three choices
  1. Take value from left, assign gap penalty (gap along left axis)
  2. Take value from top, assign gap penalty (gap along top axis)
  3. Take value from diagonal above left, assign match/mismatch score
- Repeat until the table is completed.

Use trace-back to obtain the full alignment:
AC--TCG
ACAGTAG
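The fill and trace-back steps can be sketched as a global alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions picked for the example, not values given on the slide, and ties in the trace-back may yield a different but equally scoring alignment from the one shown above.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; a on the left axis, b on top."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill from top left to bottom right, taking the best of three choices.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i][j - 1] + gap,            # from left: gap in a
                score[i - 1][j] + gap,            # from top: gap in b
                score[i - 1][j - 1] + sub(i, j),  # diagonal: match/mismatch
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = global_align("ACTCG", "ACAGTAG")
```

Under these assumed scores the alignment shown on the slide (four matches, one mismatch, two gaps) scores 4 - 1 - 2 = 1, which is also the optimum the table reaches.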

BLAST: heuristic database search using local alignment

Basic Local Alignment Search Tool

A cartoon for BLAST (parameters)

…TLSRDQHAWRLS…

QW (query word), size W=3.

For each QW, use the scoring matrix to form a neighborhood NB = {all words of size W with a score ≥ T=11}:
{(RDQ, 16) (RBQ, 14) … (REQ, 12) … (RDB, 11)}

Find match(es) to words belonging to NB in the subject (target) sequence:
…TLSRDQ HAWRLS…
…RLSREQ HTWRSS…

For each match, use the scoring matrix and gap penalties to produce an HSP (High-scoring Segment Pair), then extend the alignment on both sides, until
• drop is > X, or
• score goes below Smin (minimum hit score)

[Figure: (cumulative) score along the query sequence, with Smin and X marked]
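The "extend until the drop exceeds X" rule can be illustrated with a simplified, ungapped extension to the right of a seed. This sketch swaps the scoring matrix and gap penalties for a plain match/mismatch score (+1/-1), so it shows only the stopping rule, not BLAST's actual scoring; the function name and default values are assumptions.

```python
def extend_right(query, subject, q_start, s_start, match=1, mismatch=-1, x_drop=2):
    """Extend without gaps; stop once the score falls more than x_drop below the best."""
    score = best = 0
    best_len = length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length   # new high point of the cumulative score
        elif best - score > x_drop:
            break                            # drop from the best score exceeds X
    return best, best_len


# Residues following the matched words RDQ / REQ in the cartoon above.
hsp_score, hsp_len = extend_right("HAWRLS", "HTWRSS", 0, 0)
```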

About BLAT (from genome.ucsc.edu)

• “BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more.”
• “It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 20 bases.”
• “BLAT is not BLAST.”
• “DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all overlapping 11-mers stepping by 5 except for those heavily involved in repeats.”
• “The index takes up about 2 gigabytes of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box.”
• “The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment.”

Hash Table (BLAT)
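A toy version of such a k-mer hash table might look like the sketch below. It is not BLAT's implementation: the function names are invented, repeat filtering is left out, and the example uses a tiny k so the result can be checked by eye (the defaults k=11 and step=5 mirror the numbers quoted above).

```python
from collections import defaultdict


def build_kmer_index(genome, k=11, step=5):
    """Map each k-mer, sampled every `step` bases, to its genome positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, step):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(index, query, k=11):
    """Look up every k-mer of the query; return (query_pos, genome_pos) pairs."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, g_pos))
    return hits


index = build_kmer_index("ACGTACGTAC", k=4, step=2)
hits = seed_hits(index, "CGTACG", k=4)
```

Only the genome is sampled with a step; every k-mer of the query is looked up, so a matching region is still likely to hit at least one indexed k-mer.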

Short Read Applications

Finding the alignments is typically the performance bottleneck

[Figure: short reads such as GCGCCCTA, GCCCTATCG and TCGGAAATT aligned along the reference sequence below]

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…

New alignment algorithms must address the requirements and characteristics of NGS reads

• Millions of reads per run (30x of genome coverage)
• Short reads (as short as 36 bp)
• Different types of reads (single-end, paired-end, mate-pair, etc.)
• Base-calling quality factors
• Sequencing errors (~1%)
• Repetitive regions
• Sequencing organism vs. reference genome
• Must adjust to evolving sequencing technologies and data formats


Indexing

• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
• Choice of index is key to performance
  • Suffix tree
  • Suffix array
  • Seed hash tables
  • Many variants, incl. spaced seeds

Suffix Array: Find "ctat" in the reference
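A suffix array lookup can be sketched as a sort followed by two binary searches. The toy reference `acgctatcgctata` is invented for this example (the slide's reference is not recoverable), and the naive construction below is only suitable for small inputs.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


reference = "acgctatcgctata"            # hypothetical toy reference
sa = build_suffix_array(reference)
positions = find_occurrences(reference, sa, "ctat")
```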

NGS Read Alignment: Burrows Wheeler Transformation (BWT)

• Invented by David Wheeler in 1983 (Bell Labs). Published in 1994: Burrows M, Wheeler DJ. “A Block Sorting Lossless Data Compression Algorithm”. Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment Corporation, 1994.
• Originally developed for compressing large files (bzip2, etc.)
• Lossless, fully reversible
• Alignment tools based on BWT: bowtie, BWA, SOAP2, etc.
• Approach:
  • Align reads on the transformed reference genome, using an efficient index (FM index)
  • Solve the simple problem first (align one character) and then build on that solution to solve a slightly harder problem (two characters), etc.
• Results in great speed and efficiency gains (a few gigabytes of RAM for the entire human genome). Other approaches require tens of gigabytes of memory and are much slower.

Text = c t g a a a c t g g t $

• Introduce $ at the end and construct all cyclic permutations of Text

c t g a a a c t g g t $
t g a a a c t g g t $ c
g a a a c t g g t $ c t
a a a c t g g t $ c t g
a a c t g g t $ c t g a
a c t g g t $ c t g a a
c t g g t $ c t g a a a
t g g t $ c t g a a a c
g g t $ c t g a a a c t
g t $ c t g a a a c t g
t $ c t g a a a c t g g
$ c t g a a a c t g g t

Burrows Wheeler Transformation

• Sort rows alphabetically, keeping track of which row went where

Burrows Wheeler Matrix:

$ c t g a a a c t g g t
a a a c t g g t $ c t g
a a c t g g t $ c t g a
a c t g g t $ c t g a a
c t g a a a c t g g t $
c t g g t $ c t g a a a
g a a a c t g g t $ c t
g g t $ c t g a a a c t
g t $ c t g a a a c t g
t $ c t g a a a c t g g
t g a a a c t g g t $ c
t g g t $ c t g a a a c

BWT(Text) = t g a a $ a t t g g c c (the last column of the sorted matrix)
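The construction above (append $, build all rotations, sort, read the last column) fits in a few lines. This is a naive sketch for small strings only; the function name `bwt` is chosen for the example, and real aligners build the transform through a suffix array instead of materializing every rotation.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted cyclic rotations (naive)."""
    if "$" in text:
        raise ValueError("text must not already contain the '$' terminator")
    text += "$"                                   # introduce $ at the end
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)  # last column of the matrix


transformed = bwt("ctgaaactggt")
```

Python orders `$` before the letters, so the sorted rotations match the matrix on the slide and `transformed` is `tgaa$attggcc`.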


Exact Matching with FM Index

• In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q
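The top/bot update can be sketched as a backward search over the BWT of the example text. This is a simplified illustration: occurrence counts are recomputed with `str.count` rather than stored in a sampled table as a real FM index would do, and all names (`fm_index`, `backward_search`, `first_row`) are assumptions for this example.

```python
def fm_index(text):
    """Return (suffix array, BWT string, first-row start of each character)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)      # last column of the sorted matrix
    first_row = {}
    for rank, ch in enumerate(sorted(text)):     # first column is the sorted text
        first_row.setdefault(ch, rank)
    return sa, last, first_row


def backward_search(last, first_row, query):
    """Return (top, bot): rows top..bot-1 begin with query."""
    top, bot = 0, len(last)
    for ch in reversed(query):                   # grow the matched suffix of Q leftwards
        if ch not in first_row:
            return 0, 0
        top = first_row[ch] + last.count(ch, 0, top)
        bot = first_row[ch] + last.count(ch, 0, bot)
        if top >= bot:
            return top, top                      # empty range: no occurrence
    return top, bot


sa, last, first_row = fm_index("ctgaaactggt")
top, bot = backward_search(last, first_row, "ctg")
positions = sorted(sa[top:bot])
```

For Q = ctg the range narrows from the rows starting with g, to those starting with tg, to the two rows starting with ctg, which correspond to text positions 0 and 6.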
