By Zemin Ning & Adam Spargo Informatics Division The Wellcome Trust Sanger Institute The SSAHA2 Application Pack

Page 1:

The SSAHA2 Application Pack

By Zemin Ning & Adam Spargo

Informatics Division

The Wellcome Trust Sanger Institute

Page 2:

SSAHA2

ssaha2: Sequence Alignment

ssahaEST: cDNA/EST Alignment

cross_genome: Genome Alignment

TraceSearch: Trace Alignment

ssahaSNP: SNP/indel detection

ssahaSV: Structural Variation

Page 3:

Exon/Intron Splice Sites

mRNA: 5’-XXXXX------------------------XXXXXXXXX-3’

genomic DNA: 5’-XXXXXGTXXXXXXXXXAXXXXXXXXXXAGXXXXXXXXX-3’ (GT = Donor, A = Branch point, AG = Acceptor)

Introns have conserved splice sites (Donor, Acceptor, Branch point) => Define an intron as a gap with splice signals.

Initially, it was discovered that GT-AG introns are spliced by a spliceosome containing the U1, U2, U4/U6 and U5 snRNPs.

However, real donors vary significantly.
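The definition of an intron as "a gap with splice signals" can be illustrated with a minimal Python sketch. It only checks the canonical GT donor and AG acceptor dinucleotides at the two ends of a candidate gap; the function name and the half-open coordinate convention are assumptions made for this illustration, not part of ssahaEST.

```python
def has_gt_ag_signals(genomic: str, intron_start: int, intron_end: int) -> bool:
    """Check whether genomic[intron_start:intron_end] begins with GT and ends with AG.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Schematic genomic sequence from the slide; X stands for any base.
GENOMIC = "XXXXXGTXXXXXXXXXAXXXXXXXXXXAGXXXXXXXXX"

donor = GENOMIC.find("GT")             # first base of the donor dinucleotide
acceptor_end = GENOMIC.rfind("AG") + 2  # one past the acceptor dinucleotide

# Removing the intron leaves only the exonic bases, as in the mRNA line.
spliced = GENOMIC[:donor] + GENOMIC[acceptor_end:]
```

Since real donors vary significantly, a dinucleotide test like this is only a first filter; weighted models of the whole site are the natural refinement.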

Page 4:

Site Modelling

Weight Matrix Model (WMM):

> Donor
      -3    -2    -1    +1    +2    +3    +4    +5    +6
A   0.32  0.60  0.08  0.00  0.00  0.46  0.72  0.06  0.14
C   0.40  0.13  0.03  0.00  0.00  0.03  0.07  0.05  0.16
G   0.18  0.13  0.81  1.00  0.00  0.48  0.12  0.84  0.23
T   0.10  0.14  0.08  0.00  1.00  0.03  0.09  0.05  0.47

Staden R. (1984) Nucleic Acids Res. 12, 505-19

WMMs are constructed for donor, acceptor and branch sites based on EnsEMBL annotation.
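As a rough illustration of how such a weight matrix can score a candidate donor site, the sketch below sums per-position log-odds using the donor frequencies from the slide. The uniform background of 0.25 per base and the function names are assumptions of this example; the slides do not say how ssahaEST combines the matrix entries.

```python
import math

# Donor frequencies from the slide; columns are positions -3, -2, -1, +1 ... +6.
DONOR_WMM = {
    "A": [0.32, 0.60, 0.08, 0.00, 0.00, 0.46, 0.72, 0.06, 0.14],
    "C": [0.40, 0.13, 0.03, 0.00, 0.00, 0.03, 0.07, 0.05, 0.16],
    "G": [0.18, 0.13, 0.81, 1.00, 0.00, 0.48, 0.12, 0.84, 0.23],
    "T": [0.10, 0.14, 0.08, 0.00, 1.00, 0.03, 0.09, 0.05, 0.47],
}
BACKGROUND = 0.25  # assumption: uniform base composition
SITE_LENGTH = 9


def wmm_log_odds(site: str) -> float:
    """Sum of log2(frequency / background) over the nine donor positions."""
    if len(site) != SITE_LENGTH:
        raise ValueError("a donor site spans positions -3 to +6 (9 bases)")
    score = 0.0
    for pos, base in enumerate(site.upper()):
        freq = DONOR_WMM[base][pos]
        if freq == 0.0:
            return float("-inf")  # base never observed at this position
        score += math.log2(freq / BACKGROUND)
    return score


def best_site() -> str:
    """Most frequent base at each position of the matrix."""
    return "".join(
        max("ACGT", key=lambda b: DONOR_WMM[b][pos]) for pos in range(SITE_LENGTH)
    )
```

With this matrix any site lacking GT at positions +1 and +2 scores minus infinity, which reflects that the matrix was built from GT donors only.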

Page 5:

U2 and U12 Donors

U2 donor logo:

U12 donor logo:

Page 6:

U2 and U12 Branch

U2 branch signal logo:

U12 branch logo:

Page 7:

U2 and U12 Acceptors

U2 acceptor logo:

U12 acceptor logo:

Page 8:

SSAHA2 - “Unaware” of Splice Sites

>tr:ENST00000254959

SSAHA2                           | EnsEMBL                          | Differences
Query     Subject                | Query     Subject                | Query Subject
   1  707 383735 384441 1 100.00 |    1  706 383735 384440 1 100.00 |  0  1  0  1
 704  847 385527 385670 1 100.00 |  707  846 385530 385669 1 100.00 | -3  1 -3  1
 844  942 393167 393265 1 100.00 |  847  940 393170 393263 1 100.00 | -3  2 -3  2
 940 1139 393375 393574 1 100.00 |  941 1139 393376 393574 1 100.00 | -1  0 -1  0
1138 1263 394989 395114 1 100.00 | 1140 1261 394991 395112 1 100.00 | -2  2 -2  2
1261 1435 395201 395375 1 100.00 | 1262 1435 395202 395375 1 100.00 | -1  0 -1  0
1433 1597 396512 396676 1 100.00 | 1436 1596 396515 396675 1 100.00 | -3  1 -3  1
1595 1708 397769 397882 1 100.00 | 1597 1708 397771 397882 1 100.00 | -2  0 -2  0
1708 1889 402956 403137 1 100.00 | 1709 1888 402957 403136 1 100.00 | -1  1 -1  1
1887 2011 404133 404258 1  96.00 | 1889 1987 404135 404233 1 100.00 | -2 24 -2 25
1986 2132 404593 404739 1 100.00 | 1988 2131 404595 404738 1 100.00 | -2  1 -2  1
2131 2212 405993 406074 1 100.00 | 2132 2212 405994 406074 1 100.00 | -1  0 -1  0
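The "Differences" columns are simply the SSAHA2 coordinates minus the EnsEMBL coordinates, exon by exon. A small Python sketch using the first three exons of the table makes this explicit; the tuple layout and function names are assumptions of this example.

```python
from typing import List, Tuple

# (query_start, query_end, subject_start, subject_end), first three exons of the table
Exon = Tuple[int, int, int, int]

SSAHA2_EXONS: List[Exon] = [
    (1, 707, 383735, 384441),
    (704, 847, 385527, 385670),
    (844, 942, 393167, 393265),
]
ENSEMBL_EXONS: List[Exon] = [
    (1, 706, 383735, 384440),
    (707, 846, 385530, 385669),
    (847, 940, 393170, 393263),
]


def boundary_differences(found: List[Exon], annotated: List[Exon]) -> List[Exon]:
    """Coordinate-wise difference (found minus annotated) for each exon pair."""
    return [tuple(f - a for f, a in zip(fe, ae)) for fe, ae in zip(found, annotated)]


def query_overlaps(exons: List[Exon]) -> List[int]:
    """Number of query bases shared by each pair of consecutive exons."""
    return [max(0, prev[1] - nxt[0] + 1) for prev, nxt in zip(exons, exons[1:])]
```

Here `query_overlaps(SSAHA2_EXONS)` is non-zero for every junction, while the annotated exons abut exactly: a splice-unaware aligner extends each match a few bases past the true exon boundary.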

Page 9:

ssahaEST – Adjusted Splice Sites

>tr:ENST00000254959

ssahaEST                         | EnsEMBL                          | Differences
Query     Subject                | Query     Subject                | Query Subject
   1  706 383735 384440 1 100.00 |    1  706 383735 384440 1 100.00 |  0  0  0  0
 707  846 385530 385669 1 100.00 |  707  846 385530 385669 1 100.00 |  0  0  0  0
 847  940 393170 393263 1 100.00 |  847  940 393170 393263 1 100.00 |  0  0  0  0
 941 1139 393376 393574 1 100.00 |  941 1139 393376 393574 1 100.00 |  0  0  0  0
1140 1261 394991 395112 1 100.00 | 1140 1261 394991 395112 1 100.00 |  0  0  0  0
1262 1435 395202 395375 1 100.00 | 1262 1435 395202 395375 1 100.00 |  0  0  0  0
1436 1596 396515 396675 1 100.00 | 1436 1596 396515 396675 1 100.00 |  0  0  0  0
1597 1708 397771 397882 1 100.00 | 1597 1708 397771 397882 1 100.00 |  0  0  0  0
1709 1888 402957 403136 1 100.00 | 1709 1888 402957 403136 1 100.00 |  0  0  0  0
1889 1987 404135 404233 1 100.00 | 1889 1987 404135 404233 1 100.00 |  0  0  0  0
1988 2131 404595 404738 1 100.00 | 1988 2131 404595 404738 1 100.00 |  0  0  0  0
2132 2212 405994 406074 1 100.00 | 2132 2212 405994 406074 1 100.00 |  0  0  0  0

Page 10:
Page 11:

ssahaSNP – Detecting SNPs/indels by Genomic Alignment

Figure (not preserved): reads Read_1, Read_i and Read_m shown against the Reference, with the SNP/indel locus marked.

Current packages: Gap4, POLYBASES, POLYPHRED, PTA, TGICL, autoSNP, miraEST, SeqDoC, etc.

A multiple read alignment can be reconstructed from the individual alignments, because the aligned position of every base in every read is defined relative to a common reference (consensus).
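The idea of rebuilding a multiple alignment from pairwise read-to-reference alignments can be sketched as a simple pileup. The example below assumes gap-free alignments given as (reference start, read sequence) pairs with 0-based coordinates; these conventions and the function names are illustrative only and do not describe the ssahaSNP implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alignment = Tuple[int, str]  # (0-based start on the reference, aligned read bases)


def pileup(alignments: List[Alignment]) -> Dict[int, List[str]]:
    """Group read bases by the reference position they align to."""
    columns: Dict[int, List[str]] = defaultdict(list)
    for ref_start, read in alignments:
        for offset, base in enumerate(read):
            columns[ref_start + offset].append(base)
    return dict(columns)


def candidate_snps(reference: str, alignments: List[Alignment]) -> Dict[int, List[str]]:
    """Reference positions where at least one read base differs from the reference."""
    candidates: Dict[int, List[str]] = {}
    for pos, bases in sorted(pileup(alignments).items()):
        alternatives = sorted({b for b in bases if b != reference[pos]})
        if alternatives:
            candidates[pos] = alternatives
    return candidates


REFERENCE = "ACGTACGTAC"
READS = [(0, "ACGTAC"), (2, "GTTCGT"), (4, "TCGTAC")]
```

In this toy input two of the three reads covering position 4 carry T where the reference has A, so position 4 is the only candidate. Handling indels would require gapped alignments, which this sketch leaves out.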

Page 12:

Neighbourhood Quality Standard (NQS)

(1) The quality value (Q) of the SNP base is ≥ 23, and the Q value of each of the 5 bases on either side of the SNP is ≥ 15.

(2) At least nine of the flanking ten bases match between reads.

(3) The cluster depth is no greater than e.g. 8 reads, on the basis that deeper clusters might comprise a low-copy repeat.

(4) The number of candidate SNPs in a cluster is ≤ 4, on the basis that clusters with more divergent sequences might be composed of low-copy repeats (recently diverged paralogous sequences that are accumulating sequence differences between them).

Mullikin et al. Nature 407, 516 (2000)
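The four NQS criteria translate directly into a filter. The sketch below reads the thresholds as bounds (SNP base Q of at least 23, flanking Q of at least 15, at most 4 candidates per cluster), which is an assumption about the intended inequalities; the function names and argument layout are likewise hypothetical.

```python
from typing import Sequence


def passes_nqs(
    quals: Sequence[int],
    snp_index: int,
    flank_matches: int,
    min_snp_q: int = 23,
    min_flank_q: int = 15,
    flank: int = 5,
    min_flank_matches: int = 9,
) -> bool:
    """Criteria (1) and (2): base qualities around the SNP and flanking matches."""
    if snp_index < flank or snp_index + flank >= len(quals):
        return False  # not enough flanking bases inside the read
    if quals[snp_index] < min_snp_q:
        return False
    left = list(quals[snp_index - flank:snp_index])
    right = list(quals[snp_index + 1:snp_index + flank + 1])
    if any(q < min_flank_q for q in left + right):
        return False
    return flank_matches >= min_flank_matches


def cluster_ok(depth: int, n_candidates: int, max_depth: int = 8, max_candidates: int = 4) -> bool:
    """Criteria (3) and (4): reject clusters that look like low-copy repeats."""
    return depth <= max_depth and n_candidates <= max_candidates
```

A candidate would be reported only when both `passes_nqs` and `cluster_ok` hold; the depth limit of 8 is the "e.g." value from the slide and would normally depend on coverage.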

Page 13:

Output Format of ssahaSNP

Page 14:

Output Format of Parsed SNPs

Page 15:

Output Format of Parsed Indels

Page 16:
Page 17:

ssahaSV - A Computational Method to Detect Structural Variations

Page 18:

Detection of Structural Variations

Figure (not preserved): Sample Reads shown against the Reference Sequence, with cases labelled Deletion, Insertion and VNTR.
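The slides do not spell out the ssahaSV algorithm, so the following is only a generic sketch of how a read that aligns to the reference in two pieces can point to a deletion or an insertion. The segment layout, the `min_size` cut-off and the function name are all assumptions of this example, and VNTRs are not handled.

```python
from typing import Tuple

# (read_start, read_end, ref_start, ref_end), 1-based inclusive, same strand
Segment = Tuple[int, int, int, int]


def classify_split_read(first: Segment, second: Segment, min_size: int = 10) -> str:
    """Compare the unaligned gap on the read with the gap on the reference."""
    read_gap = second[0] - first[1] - 1  # read bases between the two segments
    ref_gap = second[2] - first[3] - 1   # reference bases between the two segments
    if ref_gap - read_gap >= min_size:
        return "deletion"   # reference bases are missing from the sample read
    if read_gap - ref_gap >= min_size:
        return "insertion"  # the sample read carries extra bases
    return "none"
```

For example, two segments that are adjacent on the read but 320 bases apart on the reference would be labelled a deletion in the sample relative to the reference.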

Page 19:

DNA Sources and Reads

Species      Cell lines      Number of reads
Human        HAPMAP 17109          1,841,054
Human        HAPMAP 17119          5,977,374
Human        HAPMAP 11321          4,488,765
Human        HAPMAP 07340          3,728,821
Human        HAPMAP 10470            557,845
Human        Celera HuAA           2,788,046
Human        Celera HuBB          19,397,599
Human        Celera HuCC           1,745,337
Human        Celera HuDD           2,011,152
Human        Celera HuFF           1,507,522
Total Human                       44,043,515
Chimpanzee   Clint                30,838,333
Total Reads                       74,881,848

Page 20:

Length distribution of structural variants with Chimp ancestral data included.

Figure (not preserved): frequency ratio (%), axis 0 to 25, against sequence length of structural variants (bps), axis 10 to 100000. Series: Deletion, VNTRs, Ancestral Deletion, Ancestral VNTRs.

Page 21:

Target Site Duplications - Retrotransposons

Figure (not preserved): two panels of Sample Reads against the Reference, labelled Deletion and VNTR.

Page 22:

Distribution of Target Site Duplication

Figure (not preserved): frequency ratio (%), axis 0 to 10, against length of target site duplications (bps), axis -40 to 40. Series: Deletion, VNTRs.

Page 23:

Computational Validation - NOD (Non-Obese Diabetic) Mouse clone vs Reference Sequence

Figure (not preserved): two plots of the NOD Sequence against the Reference Sequence, labelled Deletion and Insertion.

Page 24:
Page 25:
Page 26:

Experimental validation – PCR Tests

1. Insertion Chr1:237001745

2. Deletion Chr1:56954646-56954968

3. Deletion Chr6:39030177-39030481

4. Insertion Chr13:30790030

Page 27:

Mapping Variants to Ensembl

Type of Variation   Exonic   Intronic   Non-coding   Total
SV_deletion             17        892         1591    2500
SV_insertion             2        897         1459    2358
SV_VNTRs                 8        966         1461    2435

A total of 7,293 structural variants were identified (2,500 deletions, 2,358 insertions and 2,435 VNTRs) using 44 million shotgun reads from 10 different human individuals.

66% of the sequences of structural variants can be masked as retrotransposons.

28% of human variants share the same location with the chimp, i.e. ancestral states.

89% of ancestral deletions are retrotransposons (66% for VNTRs).

38% of variants are located in exon/intron regions.

Conclusion: Mobile transposons are not more active in intragenic regions, as gene coverage of the human genome is also ~38%.
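The 38% figure can be checked directly from the table by counting exonic plus intronic variants over all variants. The short Python sketch below does this arithmetic; the dictionary layout and function names are just one convenient way to hold the numbers.

```python
# Counts from the "Mapping Variants to Ensembl" table.
COUNTS = {
    "SV_deletion": {"exonic": 17, "intronic": 892, "non_coding": 1591},
    "SV_insertion": {"exonic": 2, "intronic": 897, "non_coding": 1459},
    "SV_VNTRs": {"exonic": 8, "intronic": 966, "non_coding": 1461},
}


def row_totals(counts):
    """Total number of variants per variation type."""
    return {name: sum(row.values()) for name, row in counts.items()}


def genic_fraction(counts):
    """Share of all variants that fall in exons or introns."""
    genic = sum(row["exonic"] + row["intronic"] for row in counts.values())
    total = sum(sum(row.values()) for row in counts.values())
    return genic / total


print(row_totals(COUNTS))
print(f"{genic_fraction(COUNTS):.1%} of variants lie in exon/intron regions")
```

The genic count is 2,782 out of 7,293 variants, which rounds to the 38% quoted on the slide.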

Page 28:

Acknowledgements:

Jim Mullikin, two “Tony Cox”es, Nikolar Ivanov, Richard Durbin

The project is funded by the Wellcome Trust.