Alignment and annotation of forensic STR data generated from MPS analysis:
Notes from the sequence workbench
Chris Phillips, Katherine Gettings, Jonathan King, Christophe van Neste,
Walther Parson
Capillary electrophoresis to sequencing of forensic STRs
D12S391: alleles 14 - 27, with X.3 intermediates
D12S391: [AGAT]a [AGAC]b [AGAT]c
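The X.3 intermediate label follows the usual size-allele convention: the number of complete repeats, then a point and the number of leftover nucleotides. A minimal sketch of that arithmetic, assuming a 4-nt motif and a repeat-region length that is already known; the function name `size_allele` is hypothetical and real allele calling also depends on which repeats are conventionally counted.

```python
def size_allele(repeat_region_length: int, motif_length: int = 4) -> str:
    """Return a CE-style allele label from a repeat-region length in nt.

    Assumption: every nucleotide in the region counts towards the allele.
    """
    if repeat_region_length < 0 or motif_length <= 0:
        raise ValueError("lengths must be non-negative and the motif non-empty")
    complete, partial = divmod(repeat_region_length, motif_length)
    # A partial repeat is written after a point, e.g. 19 repeats + 3 nt = 19.3
    return str(complete) if partial == 0 else f"{complete}.{partial}"


print(size_allele(76))  # 19 complete repeats
print(size_allele(79))  # 19 complete repeats plus 3 nt
```

Two sequences of the same length receive the same label here, which is the information that sequencing adds to and CE sizing cannot see.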
Sanger to MPS
[Figure: alleles 15 to 26; 91 alleles; panel labels A to J]
STR sequence allele nomenclature guidelines
• Outlined 8 practical considerations for analysis of forensic STR data generated by MPS
• Stated analysis, export and storage of MPS data should be as full sequence strings (as in EMPOP)
• Recommended sequence alignments use the most up-to-date human genome assembly
• Suggested a framework for variant annotation with a curated sequence template file (S1)
• The published sequence template file S1 will be transferred to an FTP site with a change-log
• MPS STR allele nomenclature needs careful discussions and planning - a broader based project
• Genomic descriptions of novel STRs require systematic checks too: D5S2500 HDplex ≠ AGCU
Example: TPOX [AATG]a, read 5’ to 3’
The repeat motif for each STR marker is listed according to the International Society of Forensic Genetics (ISFG) recommendation that the repeat sequence motif be defined so that the first 5’ nucleotides on the GenBank forward strand define the repeat motif used (56)
[Figure: rs576104845 shown in Data Slicer, with columns Chr, Position, Ref allele, Alt allele]
Callout on the guidelines slide: oversight group
MPS analysis of STRs with the CEPH diversity panel
• Sequenced 27 A-STRs, 24 Y-STRs and 7 X-STRs in the CEPH panel
• Checked genotyping concordance with CE analysis of the same 944 samples made in 2010
• Transferred the reported STR sequences to Excel and aligned under the current human reference sequence at each locus
• Used simple macros to reverse-complement when needed and order sequences alphabetically
• Cross-checked flanking SNPs plus RR and flanking Indels with 1000 Genomes P-III data
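The reverse-complement and alphabetical ordering steps were done with Excel macros; the same two operations can be sketched in a few lines of Python. This is an illustration only, not the authors' macros: the sequences are short placeholders and the flag `reported_on_reverse` is a hypothetical per-locus setting.

```python
# Translation table for the four nucleotides (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_forward_strand(seq: str, reported_on_reverse: bool) -> str:
    """Return the sequence as read on the reference forward strand."""
    return reverse_complement(seq) if reported_on_reverse else seq


# Placeholder sequences, assumed to be reported on the reverse strand
reported = ["TTCCTGCC", "TTCCTTCC"]
forward = sorted(to_forward_strand(s, True) for s in reported)
print(forward)
```

Sorting after the strand flip keeps identical sequence alleles adjacent, which makes it easier to spot variants by eye when they are aligned under the reference.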
[Figure: heterozygosity of the 20 CODIS-A autosomal STRs; CEPH-wide average heterozygosity of 81.2%; axis from -15% to +15%. Series: sized alleles from CE; sequenced alleles from MPS; no sequence variation detected so far. Loci: D1S1656, D2S1338, D12S391, FGA, D18S51, D21S11, D19S433, VWA, D8S1179, D13S317, D7S820, TH01, D16S539, D10S1248, D2S441, D5S818, D22S1045, D3S1358, CSF1PO, TPOX. Highlighted: D21S11, FGA, TH01]
[Figure labels: allele; sequence reads, 5’ to 3’]
Alignment: human genome assemblies follow a set system
The forward strand runs from the first 5’ p-arm nucleotide to the last 3’ q-arm nucleotide of each chromosome
[Figure labels: CRS, rCRS; short p arm, long q arm]
Coordinates such as 14:45818468 map to unique positions
Two main current human genome assemblies
• GRCh37 (hg19), February 2009
• GRCh38 (hg38), December 2013
[Figure: coordinates from 1:1 to the last position of chromosome 22, 22:51304566 in GRCh37 and 22:50818468 in GRCh38; labels CRS, rCRS, short p arm, long q arm]
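Because the same chromosome has a different length in each assembly, a coordinate only has meaning together with the build it was taken from. A minimal sketch, assuming the two chromosome 22 end positions shown on the slide belong to GRCh37 (51304566) and GRCh38 (50818468); the function names are hypothetical.

```python
# Last position of chromosome 22 per assembly (assumed pairing, see lead-in)
CHR22_LENGTH = {"GRCh37": 51304566, "GRCh38": 50818468}


def parse_coordinate(text: str) -> tuple:
    """Split a 'chromosome:position' string into (chromosome, position)."""
    chrom, pos = text.split(":")
    return chrom, int(pos)


def is_valid_chr22_position(text: str, assembly: str) -> bool:
    """True if the coordinate lies on chromosome 22 of the given assembly."""
    chrom, pos = parse_coordinate(text)
    return chrom == "22" and 1 <= pos <= CHR22_LENGTH[assembly]


print(is_valid_chr22_position("22:51304566", "GRCh37"))
print(is_valid_chr22_position("22:51304566", "GRCh38"))
```

The second call fails the bounds check, a simple reminder that coordinates should always be stored with their assembly name.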
Frequency of new assembly builds
• NCBI 34 (hg16), July 2003
• NCBI 35 (hg17), May 2004
• NCBI 36 (hg18), March 2006
• GRCh37 (hg19), February 2009
• GRCh38 (hg38), December 2013
[Figure labels: CRS, rCRS]
Ten A-STR repeat descriptions use reverse strand direction
[Figure: both strands, each labelled 5’ to 3’]
D2S1338 [TGCC]a [TTCC]b - STRbase
D2S1338 [GGAA]a [GGCA]b - reference genome
W. Bär, B. Brinkmann, B. Budowle, A. Carracedo, P. Gill, P. Lincoln, DNA recommendations. Further report of the DNA Commission of the ISFH regarding the use of short tandem repeat systems. International Society for Forensic Haemogenetics, Int. J. Legal Med. 110 (1997) 175–176.
Coding strand
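The two D2S1338 descriptions are the same repeat region read on opposite strands. A minimal sketch of that conversion, assuming the bracketed notation used on these slides; the function `flip_description` is hypothetical and returns motifs only, because the index letters (a, b, ...) are reassigned in reading order on the new strand.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# A bracketed motif with an optional index label, or a bare motif such as "TA"
TOKEN = re.compile(r"\[([ACGT]+)\](\w*)|([ACGT]+)")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def flip_description(description: str) -> list:
    """List the motifs of a description as read on the opposite strand."""
    motifs = []
    for match in TOKEN.finditer(description):
        bracketed, _label, bare = match.groups()
        motifs.append(reverse_complement(bracketed or bare))
    # Reading direction also reverses, so the motif order is flipped
    return motifs[::-1]


print(flip_description("[TGCC]a [TTCC]b"))
```

The result lists GGAA first and GGCA second, matching the reference genome description of D2S1338.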
Both "double" Y-STR loci are tandem inversions
Fragment coordinates: 20801599-20801642, 20842518-20842573, 25931508-25931647, 28030728-28030871
DYF387S1
[AAAG]a [GTAG]b [GAAG]c [AAAG]d GAAG [AAAG]e [GAAG]f [AAAG]g
[CTTT]a [CTTC]b [CTTT]c [CTTC]d [CTTT]e CTTC [CTAC]f [CTTT]g (fragment 2 description changes)
DYS385 a/b
a=[TTTC]p b=[GAAA]q
a=[GAAA]p b=[GAAA]q
DYS385 a/b, with the fragments labelled b and a:
a=[GAAA]p b=[TTTC]q
a=[GAAA]p b=[GAAA]q
GRCh37 and 38 match in 55 of 58 STRs compared so far
DYS437, DYS438, DYS439 show sequence differences in their repeat regions between GRCh37 and GRCh38
1000 Genomes still uses GRCh37 co-ordinates for all sequence and variant data
D18S51 reported sequence (5’ to 3’): repeat sequence + 10 nt
D13S317 reported sequence (5’ to 3’): repeat sequence + 31 nt
D19S433 reported sequence (5’ to 3’): 18 nt + repeat sequence = 48 / 16 nt (repeat sequence of 48 nt, with 18 nt and 16 nt)
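Since the reported sequence can carry extra nucleotides beyond the repeat region, those have to be set aside before repeat sequences are compared. A minimal sketch, assuming the extra nucleotides sit at fixed offsets from the ends of the reported string (for example 10 nt at the 3’ end for D18S51); the function name and the sequences are placeholders, not real flanking sequence.

```python
def repeat_sequence(reported: str, five_prime_nt: int = 0, three_prime_nt: int = 0) -> str:
    """Strip a fixed number of flanking nucleotides from each end."""
    if five_prime_nt < 0 or three_prime_nt < 0:
        raise ValueError("flank lengths must not be negative")
    if five_prime_nt + three_prime_nt > len(reported):
        raise ValueError("flanks are longer than the reported sequence")
    end = len(reported) - three_prime_nt
    return reported[five_prime_nt:end]


# Placeholder: three repeats followed by a made-up 10-nt 3' flank
reported = "AGAA" * 3 + "CCGGTTAACC"
print(repeat_sequence(reported, three_prime_nt=10))
```

Fixed offsets only hold when the flank itself carries no indel, which is one reason to keep the full reported string in storage as well.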
How to deal with fewer or more nucleotides in the analysed repeat region compared to the reference sequence?
[Figure: alleles 15 to 26 aligned against the reference, with - and + marking fewer or more nucleotides]
Towards a ‘lean’ form of STR annotation
Symbols: • - + A C T G
Markers: RR start, RR stop, anchor nt
e.g. D13S317: 205 ‘features’ in a grid of 2,457 (each with 13:x coordinates) = 8%
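The idea of a lean annotation can be sketched as storing only the grid cells that differ from the reference. This is an illustration under stated assumptions, not the published template: a letter is taken to mean a substitution, "-" a deleted nucleotide, and matching cells (the "•" symbol) are simply not stored; insertions ("+") are left out because they have no reference coordinate. The sequences and the start coordinate are made up.

```python
def lean_features(reference: str, observed: str, start: int, chrom: str = "13") -> list:
    """Return (coordinate, symbol) pairs where observed differs from reference.

    Both strings must already be aligned to the same length.
    """
    if len(reference) != len(observed):
        raise ValueError("reference and observed must be aligned to equal length")
    features = []
    for offset, (ref_nt, obs_nt) in enumerate(zip(reference, observed)):
        if obs_nt != ref_nt:
            features.append((f"{chrom}:{start + offset}", obs_nt))
    return features


# Hypothetical aligned pair: one substitution and one deletion
print(lean_features("TATCTATCTA", "TATCTGTC-A", start=100))

# Share of the D13S317 grid that carries a feature (205 of 2,457 cells)
print(f"{205 / 2457:.1%}")
```

With roughly one cell in twelve carrying a feature, a list of differences is much shorter than the full grid while still mapping back to 13:x coordinates.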
D21S11 is the most challenging STR to describe
Reference-based description, RR start to RR stop: [TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f
With the additional motif: [TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f [TATCTA] [TCTA]g
• [TATCTA] [TCTA]N motif not in reference sequence
• ad hoc 11-nt insertion in AFR singleton
• 126 nt of reference sequence > 213 nt of all observed positions + 10 novel nt
• 23 size alleles > 92 sequence alleles (4-fold rise in variation)
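A bracketed description is a summary of a sequence string, and the string can be rebuilt from it once the index letters are replaced by counts. A minimal sketch, assuming numeric counts after each bracket and a count of 1 when none is given; the counts in the example are illustrative only and do not represent the reference allele.

```python
import re

# A bracketed motif with an optional numeric count, or a bare motif such as "TA"
TOKEN = re.compile(r"\[([ACGT]+)\](\d*)|([ACGT]+)")


def expand(description: str) -> str:
    """Expand a bracketed repeat description into its full sequence string."""
    parts = []
    for match in TOKEN.finditer(description):
        motif, count, bare = match.groups()
        parts.append(motif * int(count or 1) if motif else bare)
    return "".join(parts)


# Illustrative counts only
example = "[TCTA]2 [TCTG]1 [TCTA]1 TA [TCTA]1 TCA [TCTA]1 TCCATA [TCTA]2"
sequence = expand(example)
print(sequence, len(sequence))
```

Going the other way, from string to brackets, is where the choices about RR start, RR stop and interspersed motifs have to be made, which is why the full string is the safer thing to store.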
Concluding remarks
• Obtaining the whole sequence string allows more complete annotation of complex Indel and nucleotide re-arrangements at the repeat region as well as the compilation of flanking region variants from a forensic MPS analysis
• A standard, unified framework of sequence alignment based on the human genome reference assembly will make it easier to regulate and evolve a sequence-based STR allele nomenclature system
• Definition of the repeat region start and stop nucleotides in the reference sequence is the key step to ensuring backwards compatibility of existing size-allele nomenclature to the next step of assigning names to sequences
• To facilitate the adoption of standardised alignment and annotation practices for labs collecting population data with MPS, an Excel sequence template file will be published as an FTP file with regular scrutiny, updates and full change-log details. Since publication, 16 annotation changes have been made.
It is important that guidelines for any complex task are precise and easy to follow
Thank you (Gràcies, Graciñas, Eskerrik asko, Gracias)