
Alignment and annotation of forensic STR data generated from MPS analysis:

Notes from the sequence workbench

Chris Phillips, Katherine Gettings, Jonathan King, Christophe van Neste, Walther Parson

Capillary electrophoresis to sequencing of forensic STRs

D12S391: 14 - 27, X.3 intermediate

D12S391: [AGAT]a [AGAC]b [AGAT]c

Sanger to MPS

[Figure: alleles 15 to 26; 91 Alleles; panels A to J]

STR sequence allele nomenclature guidelines

• Outlined 8 practical considerations for analysis of forensic STR data generated by MPS

• Stated analysis, export and storage of MPS data should be as full sequence strings (as in EMPOP)

• Recommended sequence alignments use the most up-to-date human genome assembly

• Suggested a framework for variant annotation with a curated sequence template file (S1)

• The published sequence template file S1 will be transferred to an FTP site with a change-log

• MPS STR allele nomenclature needs careful discussion and planning as a broader-based project

• Genomic descriptions of novel STRs require systematic checks too: D5S2500 in HDplex ≠ D5S2500 in AGCU


[AATG]a

TPOX

The repeat motif for each STR marker is listed according to the International Society for Forensic Genetics (ISFG) recommendation that the repeat sequence motif be defined so that the first 5’-nucleotides on the GenBank forward strand define the repeat motif used (56)

[Figure: TPOX [AATG]a repeat region shown 5’ to 3’ with rs576104845 marked; Data Slicer output with columns Chr, Position, Ref allele, Alt allele; label "oversight group"]

MPS analysis of STRs with the CEPH diversity panel

• Sequenced 27 A-STRs, 24 Y-STRs and 7 X-STRs in the CEPH panel

• Checked genotyping concordance with CE analysis of the same 944 samples made in 2010

• Transferred the reported STR sequences to Excel and aligned under the current human reference sequence at each locus

• Used simple macros to reverse-complement sequences when needed and to order them alphabetically

• Cross-checked flanking SNPs, plus repeat region (RR) and flanking Indels, with 1000 Genomes Phase III data
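
The reverse-complement and alphabetical ordering steps above were done with Excel macros. As a rough equivalent, the following Python sketch shows the same two operations. The function names and the example sequences are hypothetical and are not taken from the CEPH data.

```python
# Complement table for the four DNA bases (upper case only in this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string, e.g. TGCC -> GGCA."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def to_forward_strand(seq: str, reported_on_reverse: bool) -> str:
    """Flip a reported sequence only when it was reported on the reverse strand."""
    return reverse_complement(seq) if reported_on_reverse else seq.upper()


def order_alphabetically(seqs):
    """Return the unique sequence strings in alphabetical order."""
    return sorted(set(seqs))


# Hypothetical reported sequences, for illustration only.
reported = ["TTCCTTCC", "TGCCTTCC"]
forward = [to_forward_strand(s, reported_on_reverse=True) for s in reported]
ordered = order_alphabetically(forward)
print(ordered)
```

Sorting after the strand flip keeps sequences that share the same 5’ start next to each other, which makes visual comparison under the reference sequence easier.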

[Figure: CEPH-wide average Heterozygosity of 81.2%, 20 CODIS-A Autosomal STRs (D1S1656, D2S1338, D12S391, FGA, D18S51, D21S11, D19S433, VWA, D8S1179, D13S317, D7S820, TH01, D16S539, D10S1248, D2S441, D5S818, D22S1045, D3S1358, CSF1PO, TPOX); axis from -15% to 15%; legend: Sized alleles from CE, Sequenced alleles from MPS, No sequence variation detected so far; D21S11, FGA and TH01 labelled separately]

[Figure: allele sequence reads shown 5’ to 3’]

Alignment: human genome assemblies follow a set system

[Figure: chromosome with short p arm and long q arm, shown 5’ to 3’; labels CRS and rCRS]

The forward strand runs from the first 5’ p-arm nucleotide to the last 3’ q-arm nucleotide of each chromosome

Coordinates such as 14:45818468 map to unique positions

Two main current human genome assemblies

GRCh37 (hg19) February 2009: 1:1 to 22:51304566

GRCh38 (hg38) December 2013: 1:1 to 22:50818468

Frequency of new assembly builds

NCBI 34 (hg16) July 2003

NCBI 35 (hg17) May 2004

NCBI 36 (hg18) March 2006

GRCh37 (hg19) February 2009

GRCh38 (hg38) December 2013
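
Because each assembly has its own coordinate system, a position is only meaningful together with the assembly it refers to. The sketch below illustrates this with chromosome 22, assuming that 22:51304566 and 22:50818468 are the last chromosome 22 coordinates in GRCh37 and GRCh38 respectively. The function names are hypothetical.

```python
# Last coordinate of chromosome 22 in each assembly (assumed pairing, see above).
CHR22_END = {"GRCh37": 51304566, "GRCh38": 50818468}


def parse_coordinate(text: str):
    """Split 'chromosome:position' (e.g. '22:50818468') into its two parts."""
    chrom, pos = text.split(":")
    return chrom, int(pos)


def on_chr22(text: str, assembly: str) -> bool:
    """True if the coordinate lies on chromosome 22 within the given assembly."""
    chrom, pos = parse_coordinate(text)
    return chrom == "22" and 1 <= pos <= CHR22_END[assembly]


print(on_chr22("22:51000000", "GRCh37"))  # inside the GRCh37 chromosome
print(on_chr22("22:51000000", "GRCh38"))  # past the GRCh38 chromosome end
```

The same string, 22:51000000, is a valid position in one assembly and not in the other, which is why sequence templates need to state the assembly alongside every coordinate.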

Ten A-STR repeat descriptions use reverse strand direction

[Figure: strand orientation labels 5’ and 3’]

D2S1338 [TGCC]a [TTCC]b - STRbase

D2S1338 [GGAA]a [GGCA]b - reference genome
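
The two D2S1338 descriptions are reverse complements of each other. A minimal sketch of that conversion follows; `flip_description` is a hypothetical helper that only handles bracketed blocks, not unbracketed interruptions. The count labels travel with their blocks, so the result reads [GGAA]b [GGCA]a, which matches the reference genome description once the labels are reassigned in reading order.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_description(description: str) -> str:
    """Reverse-complement a bracketed repeat description such as '[TGCC]a [TTCC]b'.

    Block order is reversed and each motif is reverse-complemented;
    each repeat-count label stays attached to its own block.
    """
    blocks = re.findall(r"\[([ACGT]+)\](\w+)", description)
    flipped = [
        f"[{reverse_complement(motif)}]{label}"
        for motif, label in reversed(blocks)
    ]
    return " ".join(flipped)


print(flip_description("[TGCC]a [TTCC]b"))
```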

W. Bär, B. Brinkmann, B. Budowle, A. Carracedo, P. Gill, P. Lincoln, DNA recommendations. Further report of the DNA Commission of the ISFH regarding the use of short tandem repeat systems. International Society for Forensic Haemogenetics, Int. J. Legal Med. 110 (1997) 175–176.

Coding strand

10

Both "double" Y-STR loci are tandem inversions

20801599-20801642

20842518-20842573

25931508-25931647

28030728-28030871

DYF387S1

[AAAG]a [GTAG]b [GAAG]c [AAAG]d GAAG [AAAG]e [GAAG]f [AAAG]g

[CTTT]a [CTTC]b [CTTT]c [CTTC]d [CTTT]e CTTC [CTAC]f [CTTT]g (fragment 2 description changes)

DYS385 a/b

a=[GAAA]p b=[TTTC]q

a=[GAAA]p b=[GAAA]q

GRCh37 and 38 match in 55 of 58 STRs compared so far

DYS437, DYS438, DYS439 show sequence differences in their repeat regions between GRCh37 and GRCh38

1000 Genomes still uses GRCh37 co-ordinates for all sequence and variant data


D18S51 Reported Sequence (repeat sequence + 10 nt), 5’ to 3’

D13S317 Reported Sequence (repeat sequence + 31 nt), 5’ to 3’

D19S433 Reported Sequence (18 nt + repeat sequence = 48 / 16 nt), 5’ to 3’

D19S433: 18 nt, repeat sequence of 48 nt, and 16 nt
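
When a reported sequence carries extra flanking nucleotides, the repeat sequence can be recovered by slicing off a fixed number of nucleotides at each end. The sketch below assumes the additional nucleotides sit at the 3’ end for the D18S51-style case (+10 nt); that placement is an assumption, and both the helper name `split_reported` and the example string are hypothetical.

```python
def split_reported(seq: str, flank_5: int = 0, flank_3: int = 0):
    """Split a reported sequence into (5' flank, repeat sequence, 3' flank)."""
    if flank_5 + flank_3 > len(seq):
        raise ValueError("flanks are longer than the reported sequence")
    end = len(seq) - flank_3
    return seq[:flank_5], seq[flank_5:end], seq[end:]


# Hypothetical reported string: 3 AGAA repeats followed by 10 flanking nt.
reported = "AGAA" * 3 + "AAAGAGAGAG"
five, repeat, three = split_reported(reported, flank_3=10)
print(repeat, len(repeat))
```

Keeping the flanks as separate strings, instead of discarding them, leaves any flanking variants available for later annotation.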

How to deal with fewer or more nucleotides in the analysed repeat region compared to the reference sequence?

[Figure: alleles 15 to 26, with shorter (-) and longer (+) repeat regions relative to the reference]

Towards a ‘lean’ form of STR annotation

e.g. D13S317

[Figure: annotation grid using the symbols • - + A C T G, with RR start, RR stop and Anchor nt marked]

205 ‘features’ in a grid of 2,457 (each with 13:x coordinates) = 8%
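
One way to read the ‘lean’ idea is that only positions differing from the reference are stored, each tied to a chromosome coordinate. The sketch below works under that assumption. The meaning given to each symbol (‘-’ for a deleted nucleotide, ‘+’ for an inserted one, a base letter for a substitution, ‘•’ for a match that is not stored) is an interpretation of the legend, and the function name, sequences and start coordinate are hypothetical.

```python
def lean_annotation(reference: str, observed: str, chrom: str, start: int):
    """List only the positions where an aligned sequence differs from the reference.

    Both strings must be pre-aligned to equal length, with '-' marking a gap.
    Inserted nucleotides take the coordinate of the preceding reference nucleotide.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences must be aligned to equal length")
    features = []
    position = start - 1  # coordinate of the last reference nucleotide seen
    for ref_nt, obs_nt in zip(reference, observed):
        if ref_nt != "-":
            position += 1
        if ref_nt == obs_nt:
            continue  # a match ('•') is not stored
        if ref_nt == "-":
            symbol = "+" + obs_nt  # insertion relative to the reference
        elif obs_nt == "-":
            symbol = "-"  # deletion relative to the reference
        else:
            symbol = obs_nt  # substitution
        features.append((f"{chrom}:{position}", symbol))
    return features


features = lean_annotation("TATCTA-TCA", "TATCGATTC-", chrom="13", start=100)
print(features)

# Proportion of the grid occupied by features, from the figures on the slide.
print(round(100 * 205 / 2457, 1))
```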

D21S11 is the most challenging STR to describe

[TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f

[TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f [TATCTA] [TCTA]g

(descriptions run from RR start to RR stop)

[TATCTA] [TCTA]N motif not in reference sequence

ad hoc 11-nt insertion in AFR singleton

126 nt of reference sequence > 213 nt of all observed positions + 10 novel nt

23 size alleles > 92 sequence alleles (4-fold rise in variation)
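
To see why one size allele can hide several sequence alleles, it helps to expand a bracketed description into its full sequence string. The following is a minimal sketch: `expand` is a hypothetical helper, and the repeat counts are illustrative values, not counts taken from the slides or from the reference sequence.

```python
import re


def expand(description: str, counts: dict) -> str:
    """Expand a bracketed description into its full sequence string.

    '[TCTA]a' becomes TCTA repeated counts['a'] times; unbracketed tokens
    such as 'TA' or 'TCCATA' are copied once.
    """
    parts = []
    for token in description.split():
        match = re.fullmatch(r"\[([ACGT]+)\](\w+)", token)
        if match:
            motif, label = match.groups()
            parts.append(motif * counts[label])
        else:
            parts.append(token)
    return "".join(parts)


D21S11 = "[TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f"

# Illustrative repeat counts only.
counts = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1, "f": 3}
sequence = expand(D21S11, counts)

# Same total length, different internal structure: one fewer TCTA, one more TCTG.
other = expand(D21S11, {**counts, "a": 1, "b": 2})

print(len(sequence), len(other), sequence == other)
```

Both strings have the same length, so they would be sized identically by CE, but they are distinct sequence alleles.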

Concluding remarks

• Obtaining the whole sequence string allows more complete annotation of complex Indel and nucleotide re-arrangements at the repeat region, as well as the compilation of flanking region variants from a forensic MPS analysis

• A standard, unified framework of sequence alignment based on the human genome reference assembly will make it easier to regulate and evolve a sequence-based STR allele nomenclature system

• Definition of the repeat region start and stop nucleotides in the reference sequence is the key step to ensuring backwards compatibility of existing size-allele nomenclature with the next step of assigning names to sequences

• To facilitate the adoption of standardised alignment and annotation practices for labs collecting population data with MPS, an Excel sequence template file will be published on an FTP site with regular scrutiny, updates and full change-log details. Since publication, 16 annotation changes have been made.
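
The point about repeat region start and stop nucleotides can be illustrated with a simplified sketch: once the two boundaries are fixed, a CE-style size allele follows from the length of the sequence between them. This assumes every nucleotide between RR start and RR stop counts towards the allele, which does not hold for all loci. The function names and example sequences are hypothetical.

```python
def repeat_region(sequence: str, rr_start: int, rr_stop: int) -> str:
    """Slice the repeat region using 1-based, inclusive start and stop positions."""
    return sequence[rr_start - 1:rr_stop]


def size_allele(region: str, motif_length: int = 4) -> str:
    """Name a repeat region by size: whole repeats, plus '.x' for leftover nt."""
    if not region:
        raise ValueError("empty repeat region")
    whole, extra = divmod(len(region), motif_length)
    return f"{whole}.{extra}" if extra else str(whole)


# Hypothetical sequence: 2 nt flank, 11 TATC repeats, 2 nt flank.
seq = "GG" + "TATC" * 11 + "AA"
region = repeat_region(seq, rr_start=3, rr_stop=46)
print(size_allele(region))

# A region with three extra nucleotides gives an intermediate allele.
print(size_allele("TATC" * 9 + "ATC"))
```

Shifting either boundary by a few nucleotides changes the computed allele, which is why the start and stop definitions need to be fixed in the reference sequence before names are assigned to sequences.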

It is important that guidelines for any complex task are precise and easy to follow

Thank you (Gràcies, Graciñas, Eskerrik asko, Gracias)
