
AACR Poster 040317 - genomenon.com


AN AUTOMATED APPROACH TO IDENTIFYING DISEASE-GENE ASSOCIATIONS FROM THE MEDICAL LITERATURE TO INFORM GENE PANEL DESIGN

Rational approaches to the design of gene panels for next-generation molecular diagnostic assays are made challenging by the amount and complexity of information that must be manually collected and assessed from the medical literature. We have developed a novel text-mining infrastructure that automatically identifies disease-gene-treatment associations from the titles and abstracts of more than 26M scientific publications in PubMed and disease-gene-variant associations from the full text of 3.9M prioritized articles. For this work, curated lists of diseases and genes comprising 11.7K and 50.9K total entries, respectively, were used as initial search parameters to identify high-yield content. Next, custom-designed algorithms scanned the full-text articles using comprehensive variant lists as second-tier search parameters. These lists comprise 602M total entries covering every possible cDNA or protein variant in all gene transcripts, sorted by biological outcome. In total, we identified 909K putative disease-gene associations (24 articles per association on average) and 543K total variants distributed across all genes. This information was organized according to the strength of the association, based on the total number and quality of individual citations and the position of disease-gene and disease-gene-variant key terms within the text.

To demonstrate this approach, we created an NGS panel for Acute Myeloid Leukemia (AML), a disease in which outcome is very poor and well-known mutations and treatment biomarkers are not well characterized outside of cytogenetics. In total, 11K unique variants in 151 genes were found to be associated with AML and ranked according to the number of citations for each. Each variant was cited at least once across 3,865 individual scientific publications. These variants were further classified according to the journals in which they appeared, and the resulting data were manually inspected for accuracy. Additionally, custom search algorithms identified recurrent gene amplifications, gene deletions and gene fusions resulting from translocations, which were also catalogued and incorporated into the final panel design. The final in silico panel design was compared to commercially available panels and included several genes with a clear association to AML and potential clinical significance that are not present on any such panel. In conclusion, we have developed and tested a tool that rapidly and comprehensively interrogates, organizes and displays genomic targets for diseases and has promising applications for clinical panel construction with significantly reduced curation and research time.

Mark J Kiel MD PhD [1], Matthew Schu PhD [2], Steve A Schwartz [1], Victor Weigman PhD [2]

[1] GENOMENON, Ann Arbor MI; [2] Q2 Solutions - EA Genomics, Morrisville NC

Contact us at [email protected]

INTRODUCTION

METHODS

RESULTS + CONCLUSION

Mastermind identified more than 1000 genes associated with AML. The top 500 were selected for further examination and were manually curated so that the final analysis included only those genes whose association with AML was genetic in nature (as opposed to disease-gene associations that inform flow cytometric and immunohistochemical assays). The final list of 151 genes was used to batch-query the Mastermind database, and the evidence was prioritized by citation count and examined for accuracy and cogency of clinical utility.

During the course of content assessment for final candidate selection, the Mastermind user-interface was searched to investigate the content for individual variants, copy number changes or structural alterations to validate clinical significance for diagnostic, prognostic and therapeutic decision-making.

[Figure: Mastermind database assembly workflow. Panel labels: Publications, Content, Curation; PubMed titles/abstracts; article full-text; PDF conversion; disease list; gene list; genetic key terms; comprehensive variant list; key terms (diagnostic, therapeutic, functional, custom, etc.); content prioritization; disease-gene associations; disease-variant associations; quality control + natural language processing. Example terms shown: melanoma, BRAF, p.V600E.]

Database Assembly. Titles and abstracts of every article in PubMed are scanned to determine whether a disease, a gene or a key term relevant to clinical genetics is mentioned. The PDFs of these articles are downloaded and the full text is converted to searchable text. The full text is then scanned using the disease and gene lists described above. When a gene is identified in the full text, a variant search is invoked to identify any variant in that gene mentioned in the text, in any way that an author may describe it: either using Human Genome Variation Society (HGVS) nomenclature or any of dozens of non-standard formats. Scans for additional key terms that may indicate that the data contained in the article are useful for making clinical decisions (such as diagnosis, prognosis or therapy selection) are performed to further categorize and prioritize the data. Once variant and key-term scans are completed, disease-gene and disease-variant association data are passed through a quality control process and natural language processing to organize and store the data in the final database. This process is repeated weekly to remain up to date.
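As a minimal sketch of what recognizing variants in several written forms can look like, the Python below maps a few protein-level spellings (such as D835Y, Asp 835 Tyr, p.D835Y and FLT3_I836V) onto one short form. The pattern, the function name and the choice of output format are illustrative assumptions, not the Mastermind implementation, which is described as handling dozens of formats.

```python
import re

# Standard three-letter to one-letter amino acid codes.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "".join(sorted(AA3_TO_1.values()))
AA3 = "|".join(AA3_TO_1)

# Hypothetical pattern: optional "p." prefix, reference residue, position,
# alternate residue, with optional single spaces between the parts.
PROTEIN_VARIANT = re.compile(
    rf"(?<![A-Za-z0-9])(?:p\.)?(?P<ref>{AA3}|[{AA1}])\s?"
    rf"(?P<pos>\d+)\s?(?P<alt>{AA3}|[{AA1}])(?![A-Za-z0-9])"
)


def normalize_protein_variants(text):
    """Return every protein substitution found in text as 'p.D835Y'-style strings."""
    found = []
    for match in PROTEIN_VARIANT.finditer(text):
        ref = AA3_TO_1.get(match.group("ref"), match.group("ref"))
        alt = AA3_TO_1.get(match.group("alt"), match.group("alt"))
        found.append(f"p.{ref}{match.group('pos')}{alt}")
    return found


print(normalize_protein_variants("FLT3 D835Y, also written Asp 835 Tyr or p.D835Y"))
```

One limitation of this sketch is that a single-letter form such as A2506G can denote either a nucleotide change or an amino acid change, so a real pipeline would need to check each match against the reference sequence of the gene in question.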

Having collected and organized the disease-gene-variant data that comprise Mastermind, we had an opportunity to test the hypothesis that gene panel design could be automated using an evidence-based approach. Specifically, we wished to automate candidate gene selection and organize the evidence for each candidate so that it can be quickly interrogated to manually assess the candidate's validity and clinical significance.

To test this idea, we selected AML given the variety of genes associated with disease development and progression and the complex pathogenetic mechanisms.

First, genes mentioned in association with AML in the titles and abstracts of PubMed were organized in descending order of the number of citations and manually examined to identify those whose association with AML was genetic in nature through mutation, copy number change or structural variation.

This gene list was then used to batch-query the Mastermind database to extract all the variants identified in each gene and organize all the evidence behind each association. This list was then prioritized by number of citations per variant and manually examined for accuracy and validity.
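The prioritization step described above can be sketched as counting distinct citing articles per variant and sorting in descending order. The record layout `(gene, variant, pmid)`, the function name and the sample values are assumptions made for illustration; the PubMed IDs are placeholders, not real citations.

```python
from collections import defaultdict


def rank_variants_by_citations(mentions):
    """Rank (gene, variant) pairs by the number of distinct articles citing them.

    mentions: iterable of (gene, variant, pmid) tuples (hypothetical layout).
    Returns a list of ((gene, variant), citation_count), highest count first.
    """
    pmids_by_variant = defaultdict(set)
    for gene, variant, pmid in mentions:
        # A set ensures an article mentioning a variant twice is counted once.
        pmids_by_variant[(gene, variant)].add(pmid)
    counts = ((key, len(pmids)) for key, pmids in pmids_by_variant.items())
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Placeholder PubMed IDs, used only to show the shape of the input.
example = [
    ("FLT3", "p.D835Y", 101),
    ("FLT3", "p.D835Y", 102),
    ("FLT3", "p.D835Y", 101),
    ("KIT", "p.D816V", 103),
]
print(rank_variants_by_citations(example))
```

The ranked list is only a starting point: as the text notes, each entry is still manually examined for accuracy and validity.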

Similarly, the titles, abstracts and full-text fragments from prioritized articles were examined to identify genes involved in amplifications or deletions using natural language processing and to identify genes involved in putative fusion events by looking for geneA-geneB pairings. These results were also annotated according to the number of citations and the data manually reviewed for accuracy before inclusion on the final gene panel design.

This list was then compared to commercially available gene panels targeted to AML diagnosis to note genes that were not included on any external panels.
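The comparison against external panels amounts to a set difference. The sketch below assumes each panel is available as a set of gene symbols; the panel names and the symbol `GENE_X` are hypothetical placeholders and say nothing about the content of any real commercial panel.

```python
def genes_missing_from_panels(candidate_genes, external_panels):
    """Return candidate genes that appear on none of the external panels.

    candidate_genes: iterable of gene symbols from the in silico design.
    external_panels: dict mapping a panel name to a set of gene symbols.
    """
    # Union of all genes covered by at least one external panel.
    covered = set().union(*external_panels.values())
    return sorted(set(candidate_genes) - covered)


# Hypothetical inputs for illustration only.
panels = {"panel_a": {"FLT3"}, "panel_b": {"KIT", "NPM1"}}
print(genes_missing_from_panels({"FLT3", "KIT", "GENE_X"}, panels))
```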

Content Curation

The challenge of interrogating this amount of information is compounded by the complexity of the data itself and by the infrequent use of standardized nomenclature to describe genetic variants and their associations with clinical scenarios, which can themselves be described in a variety of ways. Moreover, the associations between genetic biomarkers and diseases or other clinical phenomena are often difficult to codify, necessitating close examination of the primary evidence itself.

Translating genomic knowledge from the medical literature into clinical insight to inform gene panel design is a challenging and subjective process.

First, meaningful content was identified based on scans of the titles and abstracts for disease terms and gene names.

The resultant database was queried by disease to identify recurrently associated genes and gene variants. Gene amplifications, deletions and gene fusion events were also identified.

Acute Myeloid Leukemia was selected as a test case due to the complexity of the pathogenetic mechanisms contributing to this disease.

Our novel approach to automated, evidence-based gene panel design eliminates hours of searching and allows for more focused attention on assessing actionability and clinical significance of candidate biomarkers.

Information Required to Inform Rational Panel Design is Captured in the Medical Literature

The most significant barrier to efficient and evidence-based gene panel design for next-generation DNA sequencing assays for use in clinical diagnosis is the challenge of accurately extracting this information out of the scientific literature. Typically this work requires many man-months of manual literature review. Not only is this a painstaking process - it also leads to routine inaccuracies including false positive and false negative candidates selected for final panel composition.

Databases

Comprehensive disease lists were developed using Medical Subject Heading (MeSH) terms. Comprehensive gene lists with synonyms were developed using the HUGO Gene Nomenclature Committee (HGNC) database of human gene names. Ancillary lists of useful clinical category terms were custom-developed.

4M Content

An automated process was developed on a custom informatics framework to identify prioritized content. It scans titles and abstracts for disease or gene names through the PubMed E-utilities (eutils) API and culminates in full-text download and scanning. This process produced an initial corpus of 3.3M full-text articles.
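To make the title-and-abstract scan more concrete, the sketch below builds a request URL for ESearch, the search endpoint of the NCBI E-utilities. The endpoint and the `db`, `term`, `retmax` and `retmode` parameters are part of the public API; the function name and the way the disease and gene terms are combined are assumptions, since the poster does not give the actual queries. No network request is made here.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(disease, gene, retmax=100):
    """Build an ESearch URL for articles naming both a disease and a gene.

    The [Title/Abstract] field tag restricts matching to titles and abstracts.
    How terms are combined here is an illustrative assumption.
    """
    term = f'"{disease}"[Title/Abstract] AND {gene}[Title/Abstract]'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


print(build_esearch_url("acute myeloid leukemia", "FLT3", retmax=5))
```

Fetching the resulting URL returns a list of PubMed IDs, which a pipeline of this kind could then use to prioritize articles for full-text download.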

Hypothetical Variants comprise the backbone of the database used to scan every word of each article to identify variants at the cDNA and protein level.
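One way to picture a list of hypothetical variants is to enumerate every possible single amino acid substitution along a protein sequence. The sketch below does only that; the function name is an assumption, and the actual list is described as also covering cDNA-level variants across all gene transcripts, which this example does not attempt.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def enumerate_missense(protein_seq, start=1):
    """Yield every single-residue substitution for a protein sequence.

    protein_seq: one-letter amino acid string.
    start: position number of the first residue in protein_seq.
    """
    for offset, ref in enumerate(protein_seq):
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield f"p.{ref}{start + offset}{alt}"


# Two residues (Asp, Ile) numbered from 835, matching the FLT3 examples
# p.D835Y and p.I836V that appear on the poster.
variants = list(enumerate_missense("DI", start=835))
print(len(variants), variants[:3])
```

Each position contributes 19 substitutions, which illustrates how an exhaustive list across all transcripts can grow into hundreds of millions of entries.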

Disease-Gene Associations were identified using this data processing architecture along with any associated variants within each of the identified genes.

Figure labels: 909K disease-gene associations; 602M hypothetical variants.

Interested in learning more? www.genomenon.com

26M Title and Abstract Search in PubMed

This challenge is compounded by the difficulty of accessing meaningful content through commonly used search tools such as PubMed, which does not allow for high-throughput content aggregation and lacks necessary capabilities such as full-text search and flexible content-specific search criteria.

Next, automated full-text data processing identified disease-gene-variant associations from prioritized content.


Mastermind was also queried to identify genes described to be involved in amplification or deletion events based on scans of titles and abstracts and sentence fragments from the full-text of prioritized articles. This evidence was organized by citation count and manually examined prior to selection of CNV candidates for the final panel design.

Mastermind identified gene fusion events by scanning for geneA-geneB pairs in the titles, abstracts and full-text sentence fragments of the prioritized literature. The results were manually investigated to eliminate false positive candidates after data prioritization and examination of organized evidence. In total, 14K putative fusion events were identified and curated down to 152 validated fusion events for inclusion on the final panel.
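A geneA-geneB scan of this kind can be sketched with a regular expression that finds two tokens joined by a separator and keeps only pairs in which both tokens are known gene symbols. The pattern, the accepted separators and the function name are assumptions for illustration; the example sentences are invented.

```python
import re
from collections import Counter

# Two uppercase alphanumeric tokens joined by "-", "/", ":" or "::".
PAIR = re.compile(r"\b([A-Z][A-Z0-9]+)[-/:]{1,2}([A-Z][A-Z0-9]+)\b")


def find_fusion_candidates(sentences, gene_symbols):
    """Count geneA-geneB pairs in which both halves are known gene symbols.

    Returns a list of ((geneA, geneB), count), most frequent first.
    """
    genes = set(gene_symbols)
    counts = Counter()
    for sentence in sentences:
        for gene_a, gene_b in PAIR.findall(sentence):
            # Filtering on the gene list drops pairs such as "DNA-RNA".
            if gene_a in genes and gene_b in genes and gene_a != gene_b:
                counts[(gene_a, gene_b)] += 1
    return counts.most_common()


sentences = [
    "The PML-RARA fusion was detected at diagnosis.",
    "PML/RARA transcripts persisted after induction.",
    "RUNX1-RUNX1T1 was found, along with DNA-RNA hybrids.",
]
known_genes = {"PML", "RARA", "RUNX1", "RUNX1T1"}
print(find_fusion_candidates(sentences, known_genes))
```

A pattern this permissive also matches gene pairs that merely appear together, such as co-mutated genes written with a slash, which is consistent with the need for the manual review that reduced 14K putative events to 152.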

[Figure: Approach to automated panel design.
Single nucleotide + indel variants: 24K genes assessed; 500 AML-associated; 151 genes selected; 11K variants identified.
Copy number variants (deletions + amplifications): counts of 1355, 379 and 192 shown with the labels "candidate CNVs", "filtered by gene" and "candidate CNVs"; 51 selected CNVs.
Structural variants + gene fusions: 14K putative fusion events; 273 candidate fusion events; 152 selected fusion events.]

[Figure: Mastermind software interface, from millions of publications to an acute myeloid leukemia query. Columns: Genes, Variants, Matches, High-impact papers. Genes: FLT3, KIT. Variants: p.D835Y, p.I836V, p.I836D, p.N822K, p.D816V. Matched text forms: D835Y, Asp 835 Tyr, c.2503G>T, FLT3_I836V, A2506G, c.2506_2507delAT, c.2446G>T, D816V, p.N822K. PubMed IDs shown: 22663011, 21251612; 20818844, 25706985, 26721323; 16533790, 25477091; 22663011, 21343559, 25265492; 17525723, 24933606; 23051966; 25324352, 26812617; 22683711, 20708155, 25913614; 22753870.]