[2017-05-29] DNASmartTagger

rank  phylum          # of entries
1     Ascomycota      4629 (79%)
2     Basidiomycota   1116 (19%)
3     Mucoromycotina  38 (1%)

Development of DNA sequence tagging tools based on machine learning using public sequence annotation data
[Japanese title: Development of the DNA Smart Tagger based on open sequence annotation data]

Eli Kaminuma (1), Takatomo Fujisawa (1), Fumi Hayashi (1), Tatsuya Nishizawa (2), Yasukazu Nakamura (1), Osamu Ogasawara (1)

(1. Center for Information Biology, National Institute of Genetics; 2. IMSBIO Co., Ltd.)

ABSTRACT Large-scale DNA sequence data generated by next-generation sequencers (NGSs) are published through the International Nucleotide Sequence Database Collaboration (INSDC) in addition to journal reports. Such large-scale open data of DNA sequences and their annotations could serve as basic training material for machine learning models. However, applications of large-scale DNA databases to machine learning models have not yet been developed sufficiently for many practical situations. We suppose that the tools used to register sequences with DNA data banks could use machine learning models to help users select appropriate annotation attributes. In this research, we aim to construct a DNA smart tagging tool, named “DNASmartTagger”, that predicts candidate attribute values of annotation tags for DNA sequences. As a first step, we define a parametrization strategy for DNA sequence-based machine learning models, and we discuss the model performance and application domain of each annotation tag. In the experiments, 4,850 16S rRNA gene sequences in the INSDC BCT division were retrieved to train supervised machine learning models for the /PCR_primers annotation tag. Similarly, 5,831 fungal sequences in the INSDC PLN division were used to construct predictive models for the /altitude annotation tag. Support Vector Machines (SVMs), a supervised learning method, were used to classify the attribute values of the individual tags. To evaluate the models, we computed the area under the ROC curve (AUC) of SVMs that take short 5' end fragments as input. The AUCs for the /PCR_primers and /altitude tags were 0.83 and 0.79, respectively, with fragment lengths of 60 bp and 128 bp. Furthermore, the SVMs for the /altitude tag reached an AUC of 0.87 when the input was the normalized frequency of k-mers, that is, the normalized frequency of occurrence of every possible 5-mer over the full-length sequence rather than over a short fragment. This result may indicate that the k-mer-based approach works more effectively than the short-fragment-based approach.

Acknowledgments: ・ We would like to thank Dr. Jun Mashima, Koji Watanabe, and Prof. Toshihisa Takagi for constructive comments and technical assistance.

・ This work was partially supported by CREST, JST and ROIS management expenses grant.

・ Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.

Publication:

• DNA Data Bank of Japan. Mashima J, Kodama Y, Fujisawa T, Katayama T, Okuda Y, Kaminuma E, Ogasawara O, Okubo K, Nakamura Y, Takagi T. Nucleic Acids Res. 2017 45:D25-D31.

• The International Nucleotide Sequence Database Collaboration. Cochrane G, Karsch-Mizrachi I, Takagi T, International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 2016 44:D48-50.

• DDBJ new system and service refactoring. Ogasawara O, Mashima J, Kodama Y, Kaminuma E, Nakamura Y, Okubo K, Takagi T. Nucleic Acids Res. 2013 41:D25-9.

Result(1) predictive performance of classification models for /PCR_primers attribute values

The problem: high labor costs of manual annotation when submitting sequences to a DNA data bank

Experimental conditions (preparing training datasets)

Future Work

■ Investigating Predictive Performances by Input Types
• Study type (whole genome sequencing, metagenomics, amplicon sequencing, etc.)
• Indirect encoding (without responsive loci) vs. direct encoding (with responsive loci)
• Whole genome markers vs. fragment markers

■ Evaluating Performances for Other Tags by Heterogeneous Divisions
• e.g., divisions that contain /altitude annotations (bacteria, virus, etc.)

■ Designing the DNASmartTagger API System (under development)
• Model retraining function for successive new data releases
• Visualizing multiple tag candidates

Background: Next-Generation DNA Sequencing Produces Large Quantities of Data

Problem: Detailed Manual Annotations Lead to High Labor Costs

# of Released Entries

Year  DDBJ Trad  DDBJ SRA (NGS raw reads)
2016  196M       91K
2015  189M       62K

DBCLS SRA statistics (Nakazato et al., 2013) http://sra.dbcls.jp/

Input: DNA sequence (e.g., INSDC flat-file format)

Accacactggtactgagacacgga
ccagactcctacgggaggcagcag
tgaggaatattggacaatggaggga
actctgatccagccatgccgcgtgca
ggaagactgccctatgggttgtaaac
tgcttttatacaagaagaataagaga
tacgtgtatcttgatgacggtattgtaa
gaataagcaccggctaactccgtgc
cagcagccgcggtaatacggaggg
tgcaagcgttatccggaatcattgggt
ttaaagggtccgtaggcggattaata
agtcagtggtgaaagtctgcagctta
actgtagaattgccattgatactgtta
gtcttgaattattatgaagtagttagaa
tatgtagtgtagcggtgaaatgcata
gatattaca

DNASmartTagger: Machine Learning Models

Utilizing open data for training models
• BioSample: 452 attribute tags
• INSDC: 89 qualifier key tags
• GBIF, etc.

Output: Annotation Tags
[Figure panel: Altitudinal zonation]

DNASmartTagger: a proposed machine learning method for predicting structured DNA sequence annotation

Result(2) predictive performances of classification models for /altitude attribute values

[Figure: DDBJ ANNOTATION HELP pages for the /altitude and /PCR_primers INSDC qualifier tags]

■ Target keys for the pilot studies: INSDC qualifier keys (/altitude, /PCR_primers)

* Sequence annotations in SRA BioSample contain much unstructured content that requires exhaustive data cleansing.

* INSDC sequence annotations are well controlled compared with SRA BioSample annotations.

■ Sequences and annotation data were retrieved with the DDBJ ARSA data retrieval tool.

Summary of the pilot models (classification performance = AUC with cross-validation)

1. TAG: /PCR_primers
   Output variable type: categorical (multilabel)
   # of entries: 4,850
   ML model design: Support Vector Machine (SVM)
   Classification performance:
     5' end fragment (L = 60 bp): 0.83 [37 PrimerFwd models], 0.81 [104 PrimerRev models]
   Data retrieval condition: BCT division, 16S rRNA

2. TAG: /altitude
   Output variable type: continuous → categorical (multilabel)
   # of entries: 5,831
   ML model design: SVM, 3 models
   Classification performance:
     5' end fragment (L = 128 bp): 0.79 [Z1=0.81, Z2=0.73, Z3=0.83]
     5-mer frequency over the full-length sequence (※): 0.87 [Z1=0.91, Z2=0.80, Z3=0.91]
   Data retrieval condition: PLN division, Fungi (keyword)
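The 5-mer frequency input can be illustrated with a minimal sketch. The poster does not give the exact feature construction, so the details here are assumptions: the function name `kmer_frequencies` is hypothetical, counts are taken over overlapping windows, windows containing ambiguous bases are skipped, and counts are normalized by the number of counted windows.

```python
from itertools import product


def kmer_frequencies(seq, k=5, alphabet="acgt"):
    """Return the normalized frequency of every possible k-mer in seq.

    Assumptions (not stated in the poster): overlapping windows,
    windows with non-ACGT characters skipped, normalization by the
    number of counted windows.
    """
    seq = seq.lower()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with ambiguous bases
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


vec = kmer_frequencies("acagagttttcggactgctgacgaccggcgcacgggtg")
print(len(vec))  # 1024 features for k = 5 (4 ** 5)
```

With k = 5 every sequence maps to a vector of the same length regardless of sequence length, which is what lets full-length sequences of different sizes feed one classifier.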

■ Model Performance

■ Fragmentation processing of input sequence for /PCR_primers tag prediction

16S rRNA sequence (5'→3'):
acagagttttcggactgctgacgaccggcgcacgggtgcgtaacgcgtatacaatctaccttttgctaa
gggatagcccagagaaatttggattaatactttatggtatgtatttatggcatcatatatacattaaaggtt
acggcaaaagatgagtatgcgttctattagctagatggtaaggtaacggcttaccatggctacgatag
ataggggccctgagagggggatcccccacactggtactgagacacggaccagactcctacggga
ggcagcagtgaggaatattggacaatggagggaactctgatccagccatgccgcgtgcaggaaga
ctgccctatgggttgtaaactgcttttatacaagaagaataagagatacgtgtatcttgatgacggtattg
taagaataagcaccggctaactccgtgccagcagccgcggtaatacggagggtgcaagcgttatcc
ggaatcattgggtttaaagggtccgtaggcggattaataagtcagtggtgaaagtctgcagcttaactg
tagaattgccattgatactgttagtcttgaattattatgaagtagttagaatatgtagtgtagcggtgaaat
gcatagatattacatagaataccgattgcgaaggcaggctactaataatatattgacgctgatggacg
aaagcgtgggtagcgaacaggattagataccctggtagtccacgccgtaaacgatggtcactagct
gttcggacttcggtctgagtggctaagcgaaagtgataagtgacccacctggggagtacgttcgcaa
gaatgaaact

5' end fragment (60 bp): acagagttttcggactgctgacgaccggcgcacgggtgcgtaacgcgtatacaatctacc
3' end fragment (60 bp): agtggctaagcgaaagtgataagtgacccacctggggagtacgttcgcaagaatgaaact

PrimerFWD candidate: F27 (agagtttgatcmtggctcag)
PrimerREV candidate: 1525R (aaggaggtgwtccarcc)
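The fragmentation step can be sketched as follows. Only the use of fixed-length end fragments comes from the poster; the function names `end_fragments` and `one_hot` are hypothetical, and the one-hot encoding of each base is an assumed way to turn a fragment into a numeric input vector, since the poster does not state how fragments are encoded.

```python
def end_fragments(seq, length=60):
    """Return (5' end fragment, 3' end fragment) of the given length."""
    seq = seq.lower()
    if len(seq) < length:
        raise ValueError("sequence is shorter than the fragment length")
    return seq[:length], seq[-length:]


def one_hot(fragment, alphabet="acgt"):
    """Flatten a fragment into a 0/1 vector with 4 values per base.

    Assumed encoding (not from the poster); ambiguous bases such as
    'n' become four zeros.
    """
    vec = []
    for base in fragment:
        vec.extend(1 if base == a else 0 for a in alphabet)
    return vec


print(end_fragments("acgtacgtac", length=4))  # ('acgt', 'gtac')
print(one_hot("an"))  # [1, 0, 0, 0, 0, 0, 0, 0]
```

A fixed fragment length gives every sequence the same number of features (4 × length under this encoding), at the cost of ignoring everything outside the two ends.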

■ Frequency of entries by sequence length
[Figure; axis label: # of entries by models]

■ Model Performance
[Figure; axis label: AUC]

ZONE attribute value  Altitude         Zone code
ALPINE ZONE           1500 m --        Z3
MONTANE ZONE          800 m -- 1500 m  Z2
LOWLAND ZONE          0 -- 800 m       Z1
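Converting the continuous /altitude value into the three zone codes can be sketched as below. The thresholds come from the zone table; everything else is an assumption for illustration: the function names are hypothetical, values are assumed to look like "330.12 m", boundary values (800 m, 1500 m) are assigned to the higher zone, and negative altitudes are rejected because the table starts at 0.

```python
def parse_altitude_m(value):
    """Parse a value such as '330.12 m' into metres (assumed input format)."""
    number, unit = value.split()
    if unit != "m":
        raise ValueError(f"unexpected unit: {unit!r}")
    return float(number)


def altitude_zone(altitude_m):
    """Map an altitude in metres to a zone code from the zone table.

    Assumption: boundary values belong to the higher zone.
    """
    if altitude_m < 0:
        raise ValueError("altitude below the range covered by the zone table")
    if altitude_m < 800:
        return "Z1"  # LOWLAND ZONE
    if altitude_m < 1500:
        return "Z2"  # MONTANE ZONE
    return "Z3"      # ALPINE ZONE


print(altitude_zone(parse_altitude_m("330.12 m")))  # Z1
print(altitude_zone(parse_altitude_m("1200 m")))    # Z2
```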

■ Length of /altitude retrieved sequences

Unique taxonomy IDs = 3,667

Unique attribute values = 257

Unique PrimerFWD sequences = 107

Unique PrimerREV sequences = 115

■ Classification performance with frequency of entries by PrimerFWD sequence [fragment length = 60]

■ Classification performance by input fragment length (training dataset)

[Figure notes: axis label: fragment length; AUC = Area Under the Curve; 37 PrimerFwd models with # of entries ≥ 20]
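For readers unfamiliar with the metric, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive entry receives the higher score, with ties counted as one half. The function names are hypothetical and this is not the authors' evaluation code. The second helper takes an unweighted mean over per-model AUCs; the overall /altitude values reported on this poster (0.79 and 0.87) are consistent with such a mean of the Z1 to Z3 values, although the poster does not state how the overall figure was obtained.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of correctly ranked (positive, negative) pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def mean_auc(per_model_auc):
    """Unweighted mean over per-model AUC values."""
    return sum(per_model_auc.values()) / len(per_model_auc)


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
print(round(mean_auc({"Z1": 0.81, "Z2": 0.73, "Z3": 0.83}), 2))  # 0.79
print(round(mean_auc({"Z1": 0.91, "Z2": 0.80, "Z3": 0.91}), 2))  # 0.87
```

The pairwise form is quadratic in the number of entries, which is acceptable for a few thousand sequences; a rank-based formulation gives the same value faster.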

■ Top 5 primer sets (annotated only)

rank  primer set                 target loci  # of entries
1     ITS1 - ITS4                ITS          228
2     ITS5 - ITS4                ITS          103
3     ITS5 - NL4                 ITS, LSU     93
4     nu-ssu-0817 - nu-ssu-1536  SSU          76
5     niaD15F - niaD12R          euknr        21

■ Top 3 phyla (see the rank / phylum table at the top of this page)

※ Ref. Fiannaca et al., A k-mer-based barcode DNA classification methodology, AI in Medicine 64:173, 2015