Alignment and annotation of forensic STR data generated from MPS analysis: Notes from the sequence workbench Chris Phillips, Katherine Gettings, Jonathan King, Christophe van Neste, Walther Parson

  • Capillary electrophoresis to sequencing of forensic STRs

    D12S391: alleles 14 - 27, plus X.3 intermediate alleles

    Repeat structure: [AGAT]a [AGAC]b [AGAT]c
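
The step from a size allele to a sequence allele can be illustrated with a minimal sketch that collapses a repeat-region string into bracketed [MOTIF]n blocks. The function name `bracket_notation` and both example alleles are hypothetical; the sketch assumes a simple tetranucleotide structure with no partial repeats, as in the D12S391 description above.

```python
from itertools import groupby


def bracket_notation(seq: str, unit: int = 4) -> str:
    """Collapse a repeat-region string into [MOTIF]n blocks."""
    if len(seq) % unit:
        raise ValueError("sequence length is not a multiple of the repeat unit")
    # Cut the sequence into consecutive units, then run-length encode them.
    chunks = [seq[i:i + unit] for i in range(0, len(seq), unit)]
    return " ".join(f"[{motif}]{len(list(run))}" for motif, run in groupby(chunks))


# Two hypothetical alleles with the same length (19 repeats, 76 nt).
allele_1 = "AGAT" * 11 + "AGAC" * 7 + "AGAT"
allele_2 = "AGAT" * 12 + "AGAC" * 6 + "AGAT"

print(bracket_notation(allele_1))  # [AGAT]11 [AGAC]7 [AGAT]1
print(bracket_notation(allele_2))  # [AGAT]12 [AGAC]6 [AGAT]1
```

Both strings would be sized as the same allele by CE, while the sequence descriptions differ.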

  • Sanger to MPS

    [Figure: alleles 15 to 26; 91 alleles; panels A to J]

  • STR sequence allele nomenclature guidelines

    • Outlined 8 practical considerations for analysis of forensic STR data generated by MPS

    • Stated analysis, export and storage of MPS data should be as full sequence strings (as in EMPOP)

    • Recommended sequence alignments use the most up-to-date human genome assembly

    • Suggested a framework for variant annotation with a curated sequence template file (S1)

    • The published sequence template file S1 will be transferred to an FTP site with a change-log

    • MPS STR allele nomenclature needs careful discussions and planning - a broader based project

    • Genomic descriptions of novel STRs require systematic checks too: D5S2500 HDplex ≠ AGCU

  • TPOX

    [AATG]a

    The repeat motif for each STR marker is listed according to the International Society for Forensic Genetics (ISFG) recommendation that the repeat sequence motif be defined so that the first 5’ nucleotides on the GenBank forward strand define the repeat motif used (56)

    [Figure labels: 5’ and 3’ strand ends; rs576104845; Data Slicer, with columns Chr, Position, Ref allele, Alt allele]

  • STR sequence allele nomenclature guidelines

    [The guideline list above, shown again with the added callout: oversight group]

  • MPS analysis of STRs with the CEPH diversity panel

    • Sequenced 27 A-STRs, 24 Y-STRs and 7 X-STRs in the CEPH panel

    • Checked genotyping concordance with the CE analysis of the same 944 samples made in 2010

    • Transferred the reported STR sequences to Excel and aligned them under the current human reference sequence at each locus

    • Used simple macros to reverse-complement when needed and to order sequences alphabetically

    • Cross-checked flanking SNPs, plus repeat region (RR) and flanking Indels, with 1000 Genomes Phase III data
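
The reverse-complement and alphabetical ordering steps were done with Excel macros; a rough Python equivalent is sketched below for readers who prefer a script. The function names and the two example strings are hypothetical, and the sketch assumes plain A/C/G/T input with no ambiguity codes.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_forward_and_sort(seqs, reported_on_reverse: bool):
    """Bring reported sequences onto the forward strand, then order them alphabetically."""
    forward = [reverse_complement(s) if reported_on_reverse else s for s in seqs]
    return sorted(forward)


# Hypothetical reverse-strand reads built from the TGCC and TTCC motifs.
reported = ["TGCCTGCCTTCC", "TGCCTTCCTTCC"]
print(to_forward_and_sort(reported, reported_on_reverse=True))
# ['GGAAGGAAGGCA', 'GGAAGGCAGGCA']
```

Ordering the forward-strand strings alphabetically places sequences that share a 5’ start next to each other, which makes differences easier to spot by eye.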

  • [Figure: 20 CODIS autosomal STRs, comparing sized alleles from CE with sequenced alleles from MPS on an axis from -15% to 15%. CEPH-wide average heterozygosity of 81.2%. Loci as listed: D1S1656, D2S1338, D12S391, FGA, D18S51, D21S11, D19S433, VWA, D8S1179, D13S317, D7S820, TH01, D16S539, D10S1248, D2S441, D5S818, D22S1045, D3S1358, CSF1PO, TPOX. Also labelled: D21S11, FGA, TH01; "No sequence variation detected so far"; "Allele" and "Sequence reads"]

  • Alignment: human genome assemblies follow a set system

    The forward strand starts at the first 5’ p-arm nucleotide and runs to the last 3’ q-arm nucleotide of each chromosome

    Chromosome:position coordinates such as 14:45818468 map to unique positions

    [Diagram labels: 5’, 3’, short p arm, long q arm, CRS, rCRS]

  • Two main current human genome assemblies

    GRCh37 (hg19) February 2009: coordinates run from 1:1 to 22:51304566

    GRCh38 (hg38) December 2013: coordinates run from 1:1 to 22:50818468

    [Diagram labels: 5’, 3’, short p arm, long q arm, CRS, rCRS]

  • Frequency of new assembly builds

    NCBI 34 (hg16) July 2003

    NCBI 35 (hg17) May 2004

    NCBI 36 (hg18) March 2006

    GRCh37 (hg19) February 2009

    GRCh38 (hg38) December 2013

  • Ten A-STR repeat descriptions use reverse strand direction

    D2S1338 [TGCC]a [TTCC]b - STRBase

    D2S1338 [GGAA]a [GGCA]b - reference genome

    W. Bär, B. Brinkmann, B. Budowle, A. Carracedo, P. Gill, P. Lincoln, DNA recommendations. Further report of the DNA Commission of the ISFH regarding the use of short tandem repeat systems. International Society for Forensic Haemogenetics, Int. J. Legal Med. 110 (1997) 175–176.

    [Diagram labels: 5’ and 3’ ends of both strands; callouts: Coding strand, 10]
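
The relationship between the two D2S1338 descriptions can be checked mechanically: reverse the order of the blocks and reverse-complement each motif. The sketch below is an illustration only; `flip_description` is a hypothetical helper, and it assumes descriptions made of whitespace-separated `[MOTIF]label` blocks and plain interrupting nucleotides.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# A token is either a bracketed motif with a count label, or bare nucleotides.
TOKEN = re.compile(r"\[([ACGT]+)\](\w+)|([ACGT]+)")


def revcomp(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def flip_description(desc: str) -> str:
    """Rewrite a bracketed repeat description for the opposite strand."""
    flipped = []
    for token in reversed(desc.split()):
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        if match.group(1):
            flipped.append(f"[{revcomp(match.group(1))}]{match.group(2)}")
        else:
            flipped.append(revcomp(match.group(3)))
    return " ".join(flipped)


print(flip_description("[TGCC]a [TTCC]b"))  # [GGAA]b [GGCA]a
```

The output keeps the original count labels, so it reads [GGAA]b [GGCA]a; the reference genome description on the slide presents the same structure with the labels re-lettered from the 5’ end as [GGAA]a [GGCA]b.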

  • Both ‘double’ Y-STR loci are tandem inversions

    Coordinates: 20801599-20801642, 20842518-20842573, 25931508-25931647, 28030728-28030871

    DYF387S1: [AAAG]a [GTAG]b [GAAG]c [AAAG]d GAAG [AAAG]e [GAAG]f [AAAG]g

    [CTTT]a [CTTC]b [CTTT]c [CTTC]d [CTTT]e CTTC [CTAC]f [CTTT]g (fragment 2 description changes)

    DYS385 a/b

    a=[TTTC]p b=[GAAA]q

    a=[GAAA]p b=[GAAA]q

  • Both ‘double’ Y-STR loci are tandem inversions

    Coordinates: 20801599-20801642, 20842518-20842573, 25931508-25931647, 28030728-28030871

    DYF387S1: [AAAG]a [GTAG]b [GAAG]c [AAAG]d GAAG [AAAG]e [GAAG]f [AAAG]g

    [CTTT]a [CTTC]b [CTTT]c [CTTC]d [CTTT]e CTTC [CTAC]f [CTTT]g (fragment 2 description changes)

    DYS385 a/b

    a=[GAAA]p b=[TTTC]q

    a=[GAAA]p b=[GAAA]q

    [Diagram labels: b, a]
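
As a quick sanity check on the four coordinate ranges listed for these loci, the sketch below computes each span and divides it by a four-nucleotide unit. It assumes the ranges are inclusive at both ends (an assumption, not stated on the slide) and that a four-nucleotide unit applies, as the tetranucleotide motifs suggest; `span_length` is a hypothetical helper.

```python
def span_length(coord_range: str) -> int:
    """Length in nucleotides of an inclusive 'start-end' range."""
    start, end = (int(part) for part in coord_range.split("-"))
    return end - start + 1  # assumes both end points are included


ranges = [
    "20801599-20801642",
    "20842518-20842573",
    "25931508-25931647",
    "28030728-28030871",
]

for coord_range in ranges:
    length = span_length(coord_range)
    units, remainder = divmod(length, 4)
    print(coord_range, length, units, remainder)
# 20801599-20801642 44 11 0
# 20842518-20842573 56 14 0
# 25931508-25931647 140 35 0
# 28030728-28030871 144 36 0
```

Under those assumptions every range is a whole number of four-nucleotide units, and the two members of each pair differ in length.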

  • GRCh37 and 38 match in 55 of 58 STRs compared so far

    DYS437, DYS438, DYS439 show sequence differences in their repeat regions between GRCh37 and GRCh38

    1000 Genomes still uses GRCh37 co-ordinates for all sequence and variant data

  • D18S51 reported sequence (repeat sequence + 10 nt)

    [Figure: reported sequence shown 5’ to 3’, split into the repeat sequence and 10 nt]

  • D13S317 reported sequence (repeat sequence + 31 nt)

    [Figure: reported sequence shown 5’ to 3’, split into the repeat sequence and 31 nt]

  • D19S433 reported sequence (18 nt + repeat sequence = 48 / 16 nt)

    [Figure: reported sequence shown 5’ to 3’, split into 18 nt and a repeat sequence of 48 nt and 16 nt]

  • How to deal with fewer or more nucleotides in the analysed repeat region compared to the reference sequence?

    [Figure: alleles 15 to 26, with alleles 19 to 23 marked against − and + relative to the reference]
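
One simple way to express a length difference is to name the allele from the repeat-region length and report the signed offset from the reference. The sketch below is a simplification: it assumes every nucleotide in the repeat region counts towards the allele number and that the unit is four nucleotides, which does not hold at loci with uncounted interrupting nucleotides. Both function names are hypothetical.

```python
def size_allele(repeat_region_nt: int, unit: int = 4) -> str:
    """Name an allele from its repeat-region length, e.g. 79 nt -> '19.3'."""
    whole, extra = divmod(repeat_region_nt, unit)
    return f"{whole}.{extra}" if extra else str(whole)


def offset_from_reference(observed_nt: int, reference_nt: int) -> str:
    """Signed nucleotide difference between an observed allele and the reference."""
    return f"{observed_nt - reference_nt:+d}"


# Hypothetical lengths: a 76-nt reference repeat region (19 repeats).
print(size_allele(76))                # 19
print(size_allele(79))                # 19.3
print(offset_from_reference(79, 76))  # +3
print(offset_from_reference(72, 76))  # -4
```

The `.3` case corresponds to the X.3 intermediate alleles seen with CE sizing.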

  • Towards a ‘lean’ form of STR annotation

    e.g. D13S317

    Legend: • - + A C T G

    Labels: RR start, RR stop, Anchor nt

    205 ‘features’ in a grid of 2,457 (each with 13:x coordinates) = 8%
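
The idea of a lean annotation, recording only the positions that differ from the reference, can be sketched as follows. This is an illustration under stated assumptions, not the template file's method: the legend is read here as • for identity with the reference, - for a deletion, + for an insertion and a letter for a substitution; the sequences are already aligned with "-" as the gap character; and the start coordinate, sequences and function name are hypothetical.

```python
def lean_features(ref_aln: str, obs_aln: str, chrom: str, start: int):
    """List only the aligned positions where the observed sequence differs."""
    if len(ref_aln) != len(obs_aln):
        raise ValueError("aligned sequences must have equal length")
    features = []
    pos = start - 1  # coordinate of the last reference nucleotide consumed
    for ref_nt, obs_nt in zip(ref_aln, obs_aln):
        if ref_nt != "-":
            pos += 1
        if ref_nt == obs_nt:
            continue  # identical to the reference: nothing to record
        if ref_nt == "-":
            features.append((f"{chrom}:{pos}", "+" + obs_nt))  # inserted after pos
        elif obs_nt == "-":
            features.append((f"{chrom}:{pos}", "-"))           # deleted at pos
        else:
            features.append((f"{chrom}:{pos}", obs_nt))        # substituted at pos
    return features


# Hypothetical aligned pair starting at coordinate 13:100.
print(lean_features("TATC-TATCA", "TATCGTA-CG", "13", 100))
# [('13:103', '+G'), ('13:106', '-'), ('13:108', 'G')]

# Share of the grid that carries a feature, using the slide's figures.
print(f"{205 / 2457:.1%}")  # 8.3%
```

Only three of the ten aligned columns produce an entry in this toy example; the slide's figures give the same picture at scale, with 205 features in a grid of 2,457 positions.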

  • D21S11 is the most challenging STR to describe

    Reference structure: [TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f

    With the additional motif: [TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f [TATCTA] [TCTA]g

    [TATCTA] [TCTA]N motif not in reference sequence

    ad hoc 11-nt insertion in AFR singleton

    126 nt of reference sequence > 213 nt of all observed positions +10 novel nt

    23 size alleles > 92 sequence alleles (4-fold rise in variation)

    [Diagram labels: RR start, RR stop; highlighted: TA, TCCATA]
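
The reference structure of D21S11 can be written as a regular expression, which gives a simple test of whether an observed repeat region fits it. The sketch below is hypothetical: the pattern follows the bracketed description on the slide, the repeat counts in the example are invented, and real data would need flanking sequence trimmed first.

```python
import re

# [TCTA]a [TCTG]b [TCTA]c TA [TCTA]d TCA [TCTA]e TCCATA [TCTA]f
D21S11 = re.compile(
    r"(?P<a>(?:TCTA)+)(?P<b>(?:TCTG)+)(?P<c>(?:TCTA)+)TA"
    r"(?P<d>(?:TCTA)+)TCA(?P<e>(?:TCTA)+)TCCATA(?P<f>(?:TCTA)+)"
)


def describe_d21s11(seq: str):
    """Return the repeat counts a to f, or None if the structure does not fit."""
    match = D21S11.fullmatch(seq)
    if match is None:
        return None
    return {name: len(run) // 4 for name, run in match.groupdict().items()}


# Hypothetical repeat region with invented counts.
fits = ("TCTA" * 4 + "TCTG" * 6 + "TCTA" * 3 + "TA" + "TCTA" * 3
        + "TCA" + "TCTA" * 2 + "TCCATA" + "TCTA" * 11)
print(describe_d21s11(fits))
# {'a': 4, 'b': 6, 'c': 3, 'd': 3, 'e': 2, 'f': 11}

# The same region followed by the extra [TATCTA] [TCTA]g motif does not fit.
print(describe_d21s11(fits + "TATCTA" + "TCTA" * 2))  # None
```

A `None` result flags sequences, such as those carrying the additional [TATCTA] [TCTA] motif, that need positions beyond the reference description.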

  • Concluding remarks

    • Obtaining the whole sequence string allows more complete annotation of complex Indels and nucleotide re-arrangements in the repeat region, as well as the compilation of flanking region variants from a forensic MPS analysis

    • A standard, unified framework of sequence alignment based on the human genome reference assembly will make it easier to regulate and evolve a sequence-based STR allele nomenclature system

    • Definition of the repeat region start and stop nucleotides in the reference sequence is the key step in ensuring backwards compatibility between the existing size-allele nomenclature and the next step of assigning names to sequences

    • To facilitate the adoption of standardised alignment and annotation practices for labs collecting population data with MPS, an Excel sequence template file will be published on an FTP site with regular scrutiny, updates and full change-log details. Since publication, 16 annotation changes have been made.

  • It is important that guidelines for any complex task are precise and easy to follow

  • Gràcies Graciñas Eskerrik asko Gracias
