Using text mining to inform genetic variant interpretation
Karin Verspoor
Department of Computing and Information [email protected]
So you’re a medical doctor …
• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis (whole exome or genome)
• And find a genetic mutation
What does it mean?
Clinical interpretation of variants
Sample Data Flow
[Figure: flowchart of the sample data flow from Patient Sample to Patient Clinical Report, spanning three stages (Wet Lab, Bioinformatics, Clinical Informatics). Steps shown: Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering, Known Variant? (Yes/No), Curation, Variant Normalisation, Document Assembly, Report Editing and Signoff, Publish Report. Data stores: Peter Mac Mutation DB; External and Locus Specific DBs. The legend distinguishes manual from automatic steps.]
Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.
What’s a mutation?
• Genomic variation: alteration in a sequence
– hereditary (germ-line) mutations
– acquired (somatic) mutations
• Examples of variation
– SNP (single nucleotide polymorphism)
– Protein mutation
– insertions, deletions, duplications, inversions, . . .
• Types of variations
– DNA variations that have no adverse effects on our cells and occur frequently in the population are called polymorphisms
– DNA variations that do affect the function of the protein made from a gene and occur less often are called mutations
The Challenge: Interpreting variants
§ Identifying variation is becoming easier; interpreting it remains difficult
• Which changes are due to normal individual variation?
• Which are associated with a phenotype of interest?
Interpreting variation through context
• Analysis of functional significance of variants
– Predicted impact of mutations
– Conservation analysis
– Allele frequencies from large genomic databases
• Existing knowledge captured in structured sources
– UniProt site-specific protein annotations
– The Cancer Genome Atlas genomic characterisation data
– Disease-specific variant databases, e.g. COSMIC and InSiGHT
• Techniques for annotating variants
– Data aggregation from multiple sources
– Data integration and inference to reveal shared pathways
Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (>2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
Why biomedical text mining?
[Figure: Exponential growth in size of PubMed. Publications per year (axis 0 to 1,200,000) plotted against year, 1914 to 2014.]
Structured resources are not enough: Literature is the primary repository of knowledge
[Figure: number of Swiss-Prot proteins (axis 0 to 120,000) missing a FUNCTION comment versus proteins gaining a FUNCTION comment, 1/02 to 1/07.]
“Manual curation is not sufficient for annotation of genomic databases” Baumgartner et al., ISMB 2007
“Our entire understanding of biology and medicine is really contained in the published literature. And since people write in natural language, if you can’t get computers to turn that information into databases and computable information, you’re falling behind.”
-- Russ Altman, MD PhD, Stanford University
Recovery of variants from the literature using text mining
Study:
Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying the importance of supplementary material. Database: The Journal of Biological Databases and Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]
Study: Recall of curated variants through the application of text mining
• Given a curated resource of genetic variants,
• with explicit links to the source literature for each variant,
• and a mutation extraction tool with demonstrated good performance on intrinsic evaluation
… how many variants can text mining recover?
InSiGHT
Gene:
Variant: p.Lys286Gln
Lit. Reference: Takahashi et al 2007
Motivations
• Assess real-world applicability of text mining tools for supporting analysis of genetic variants
• Speed up curation of mutation databases
Two databases
• InSiGHT, Human Variome Project
– MLH1, MSH2, MSH6 and PMS2 linked to Lynch syndrome (germline mutations)
• COSMIC, Sanger Institute
– Somatic mutations linked to cancer
Database   PMIDs associated to mutations   Total mutation count   Average mutations per article   Std Dev
InSiGHT    809                             7022                   8.68                            18.55
COSMIC     7898                            198864                 25.18                           521.18
Literature mutation extraction
• Many tools exist to perform mutation annotation
– MutationMiner, MutationFinder, EMU, tmVar, SETH, ...
• Research shows that they have high precision and recall on MEDLINE abstracts (> 90% F1)
• There are also tools to do named entity extraction of genes, diseases, body parts …
Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi: 10.12688/f1000research.3-18.v2 [PMID:25285203]
How to extract mutations from text?
• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino acid residues).
• e.g., MutationFinder [1] patterns (simplified):
(?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
Gly17Ser
Ser97Pro
• where AminoAcid is:
(CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|TYROSINE)
[1] http://mutationfinder.sourceforge.net/
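The simplified pattern above can be tried out directly. The following is a minimal sketch, not MutationFinder's actual code: it uses only the three-letter residue codes from the slide, and the word boundaries, case-insensitive matching, and the function name `find_point_mutations` are assumptions added for illustration.

```python
import re

# Three-letter residue codes copied from the slide; the full residue names
# are left out to keep the sketch short.
AMINO_ACID = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)

# Assumption: word boundaries and case-insensitive matching are added here
# so that mixed-case mentions such as "Gly17Ser" are found in running text.
POINT_MUTATION = re.compile(
    rf"\b(?P<wt_res>{AMINO_ACID})(?P<pos>[1-9][0-9]*)(?P<mut_res>{AMINO_ACID})\b",
    re.IGNORECASE,
)


def find_point_mutations(text):
    """Return (wild-type residue, position, mutant residue) for each mention."""
    return [
        (m.group("wt_res"), int(m.group("pos")), m.group("mut_res"))
        for m in POINT_MUTATION.finditer(text)
    ]


print(find_point_mutations("Both Gly17Ser and Ser97Pro were reported."))
```

A pattern this narrow misses one-letter codes, full residue names, and free-text descriptions, which is why real tools combine many such patterns.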
Human Genome Variation Society nomenclature (excerpt)
• Pattern-based approach to identifying genetic variants
– dbSNP identifiers and standard HGVS nomenclature (e.g. SETH https://rockt.github.io/SETH)
– natural language expressions of mutations
o This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
o Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
o … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
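For the standardised end of this spectrum, a few regular expressions already go a long way. The sketch below is an illustration only and not how SETH works internally: the three patterns cover dbSNP rs identifiers, simple HGVS coding-DNA substitutions, and simple HGVS protein substitutions, and the function name and the example sentence are invented.

```python
import re

# Assumption: deliberately narrow patterns; real HGVS nomenclature covers
# many more variant types (deletions, insertions, duplications, ...).
RS_ID = re.compile(r"\brs[1-9][0-9]*\b")
HGVS_DNA_SUBSTITUTION = re.compile(r"\bc\.[1-9][0-9]*[ACGT]>[ACGT](?![A-Za-z0-9])")
HGVS_PROTEIN_SUBSTITUTION = re.compile(r"\bp\.[A-Z][a-z]{2}[1-9][0-9]*[A-Z][a-z]{2}\b")


def find_standard_mentions(text):
    """Collect mentions that follow dbSNP or (simple) HGVS conventions."""
    return {
        "dbsnp": RS_ID.findall(text),
        "hgvs_dna": HGVS_DNA_SUBSTITUTION.findall(text),
        "hgvs_protein": HGVS_PROTEIN_SUBSTITUTION.findall(text),
    }


# Invented example sentence, for illustration only.
print(find_standard_mentions("The SNP rs12345 (c.2131C>T) and p.Lys618Ala were observed."))
```

None of these patterns would match the natural language expressions listed above, which need dedicated rules or learned models.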
Extraction of mutations from text
Extractor of Mutations (Kann Lab)
Study text sources
• PubMed
– 22M citations; title and abstract
• PubMed Central
– full text
– 512k available from PMC-Open Access
• Publisher site crawling
– Availability depends on license
– HTML pages can be noisy
o C676T–>Arg226Stop vs. C676TâArg226Stop
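The garbled string in the last example looks like a Unicode arrow whose UTF-8 bytes were read with a single-byte encoding. That is an assumption about the cause, not something the slide states. Under that assumption, the sketch below reproduces the damage and reverses it; the function names are made up for illustration.

```python
def garble(text, wrong_encoding="latin-1"):
    """Simulate a page whose UTF-8 bytes were decoded with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text, wrong_encoding="latin-1"):
    """Undo the mis-decoding when possible; otherwise return the text unchanged."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "C676T\u2192Arg226Stop"   # arrow written as a single Unicode character
noisy = garble(original)              # starts with "C676Tâ", followed by control characters
print(repair(noisy) == original)
```

Running such a repair step before mutation extraction keeps pattern-based tools from missing mentions that differ only in a damaged separator.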
Extraction with EMU over our data
• EMU: Extract mutations from text and link them to co-occurring genes
• Normalize all mutation mentions to HGVS format
– Format used in COSMIC and InSiGHT
• Match {gene, HGVS variant, PMID} to curated data
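The matching step can be pictured as a set intersection over {gene, HGVS variant, PMID} triples. This is a minimal sketch under that assumption; the function name and all records are invented, and the `ignore_gene` flag mirrors the "NG" (no gene) condition reported in the results.

```python
def recall(curated, extracted, ignore_gene=False):
    """Fraction of curated records that also appear among the extracted ones."""

    def key(record):
        gene, variant, pmid = record
        return (variant, pmid) if ignore_gene else (gene, variant, pmid)

    curated_keys = {key(r) for r in curated}
    extracted_keys = {key(r) for r in extracted}
    if not curated_keys:
        return 0.0
    return len(curated_keys & extracted_keys) / len(curated_keys)


# Hypothetical records: (gene, HGVS variant, PMID). None are real database entries.
curated = {
    ("MLH1", "p.Lys618Ala", "111"),
    ("MSH2", "c.482_483del", "222"),
    ("MSH6", "p.Gly17Ser", "333"),
    ("PMS2", "c.10C>T", "444"),
}
extracted = {
    ("MLH1", "p.Lys618Ala", "111"),
    ("MLH1", "c.482_483del", "222"),   # right variant and article, wrong gene
}

print(recall(curated, extracted))                    # strict match
print(recall(curated, extracted, ignore_gene=True))  # "NG" match
```

Ignoring the gene can only raise recall, since it forgives errors made when linking a mutation to its gene.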
Results
Abstracts and Full Text
NG = No Gene (ignoring gene in match)
Common/Cmn = PMIDs in common between database and corpus subset (recall with respect to articles for which mutation entity recogniser had at least one positive extraction)
Set                Cmn art   Match mutation   Recall   Recall NG   Mutations common   Recall Cmn   Recall Cmn NG
COSMIC Abs         2200      1884             0.0095   0.0122      12,940             0.1456       0.1875
COSMIC FT          2071      3656             0.0184   0.0215      104,756            0.0349       0.0408
COSMIC Abs + FT    3738      4754             0.0239   0.0289      114,279            0.0416       0.0503
InSiGHT Abs        195       230              0.0328   0.045       1233               0.1865       0.2562
InSiGHT FT         150       404              0.0575   0.0612      1626               0.2484       0.2644
InSiGHT Abs + FT   295       588              0.0837   0.0961      2657               0.2213       0.254
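The recall figures appear to be plain ratios. Overall recall looks like matched mutations divided by all curated mutations in the database (198864 for COSMIC and 7022 for InSiGHT, from the database summary earlier), and "common" recall looks like matched mutations divided by the mutations curated for the common articles. This reading is inferred from the numbers rather than stated on the slide; the sketch below checks it for the two abstract rows.

```python
def recalls(matched, total_curated, curated_in_common):
    """Return (overall recall, recall over common articles), rounded to 4 places."""
    return (
        round(matched / total_curated, 4),
        round(matched / curated_in_common, 4),
    )


# Values taken from the abstract rows of the results and the database summary.
print("COSMIC Abs ", recalls(1884, 198864, 12940))
print("InSiGHT Abs", recalls(230, 7022, 1233))
```

Both rows reproduce the reported values (0.0095 and 0.1456 for COSMIC, 0.0328 and 0.1865 for InSiGHT), which supports the ratio interpretation.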
High Throughput vs non-High Throughput
Set             Cmn art   Match mutation   Recall   Recall NG   Recall common   Recall Cmn NG
HT abstract     1650      1357             0.0072   0.0096      0.1209          0.1608
HT full text    1545      2719             0.0145   0.0172      0.027           0.0319
HT Abs + FT     2608      3501             0.0187   0.0231      0.032           0.0395
NHT abstract    550       530              0.0461   0.0543      0.3055          0.3597
NHT full text   526       937              0.0815   0.0915      0.235           0.2639
NHT Abs + FT    841       1259             0.109    0.1243      0.2538          0.2895
Group        PMIDs   Count     Average mutation   SD       Mutation recall
COSMIC       7898    198 864   25.18              521.27   100.00%
COSMIC-HT    6266    187 367   29.9               584.82   94.22%
COSMIC-NHT   1632    11 497    7.04               38.05    5.78%
Considering tables and Supplementary material
• Subset from COSMIC and InSiGHT available as PubMed Central Open Access articles
• Supplementary material: MS Word, PDF, MS Excel, PPT, images, …
                     InSiGHT                            COSMIC
Set                  Articles   Matched   Recall (%)    Articles   Matched   Recall (%)
Abstracts            13         1         0.4           563        140       0.41
XML Full Text (FT)   9          20        7.94          487        694       2.05
PDF FT (PDFFT)       4          7         2.78          76         23        0.07
Tables               8          18        7.14          394        466       1.38
FT+PDFFT+Tables      13         44        17.46         563        929       2.75
Supp. Mat.           1          88        34.92         138        17015     50.59
All                  13         115       45.63         563        17896     52.92
Recall still only 50%: Where are the rest?
• Expressed in semi-structured data sources
– do not necessarily follow standard nomenclature more predictably
– data spread unpredictably across columns (Wong et al. 2009)
• Different reference position in text than database
– curator correction or normalized to different build
• Nomenclature variation
– c.482_483delGA vs c.482_483del2
• Linguistic expression of mutations
– deletion of exon 3
– C>T mutation at nucleotide 2131
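The nomenclature-variation case can often be handled by normalising both spellings to one form before matching. The sketch below is one possible approach, not the method used in the study: it reduces a deletion to its position range and drops the trailing bases or length after checking that they agree with the range. The function name and the choice of canonical form are assumptions.

```python
import re

DELETION = re.compile(
    r"^(?P<prefix>[cg]\.)(?P<start>\d+)(?:_(?P<end>\d+))?del(?P<tail>[ACGT]+|\d+)?$"
)


def normalise_deletion(variant):
    """Map 'c.482_483delGA' and 'c.482_483del2' to the same string.

    Returns None when the input is not a simple deletion, or when the stated
    bases or length disagree with the position range.
    """
    m = DELETION.match(variant.strip())
    if m is None:
        return None
    start = int(m.group("start"))
    end = int(m.group("end") or start)
    length = end - start + 1
    tail = m.group("tail")
    if tail is not None:
        stated = int(tail) if tail.isdigit() else len(tail)
        if stated != length:
            return None
    span = f"{start}_{end}" if end != start else str(start)
    return f"{m.group('prefix')}{span}del"


print(normalise_deletion("c.482_483delGA") == normalise_deletion("c.482_483del2"))
```

Free-text forms such as "deletion of exon 3" fall outside this pattern and return None; they need exon coordinates for the gene before they can be mapped to a variant.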
Information in tables (spreadsheets, etc.) is expressed differently than in narrative text
Gene listed in column heading
Non-standard nomenclature “Del exon 7”
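When the gene sits in the column heading, the gene-variant link has to come from the table layout rather than from a sentence. A minimal sketch, assuming a tab-separated table whose first column is a sample identifier and whose remaining headings are gene names; the layout, the sample values, and the function name are all invented for illustration.

```python
import csv
import io


def variants_from_table(tsv_text):
    """Pair each non-empty cell with the gene named in its column heading."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    genes = header[1:]                      # first column holds the sample id
    triples = []
    for row in reader:
        if not row:
            continue
        sample, cells = row[0], row[1:]
        for gene, cell in zip(genes, cells):
            cell = cell.strip()
            if cell and cell != "-":        # assumption: "-" marks "no variant"
                triples.append((sample, gene, cell))
    return triples


# Hypothetical supplementary table.
table = "\n".join([
    "Sample\tMLH1\tMSH2",
    "P1\tc.482_483delGA\t-",
    "P2\t\tDel exon 7",
])
print(variants_from_table(table))
```

The cell values still need normalising afterwards; "Del exon 7" is recovered as a string but is not yet a standard variant description.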
Text mining over semi-structured data?
• Access ?
• Variability (!)
– File formats
– How connected to the main text?
• Semantics (?!)
– How to make sense of the data?
– How to map to standardized nomenclature?
… processing supplementary material will require new strategies. Some technical solutions. Some research.
Extraction of gene-disease-mutation relations
Study:
Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for the Human Variome. BMC Medical Informatics and Decision Making.
Variant interpretation using literature
• Evidence of prior significance of variants
• Evidence of established connection of the variant to specific patient cohorts
• Use alone or in combination with other evidence
• We aim to extract the relations that connect genes, diseases and mutations
• Specific Objective of the work: relation extraction over the Variome Corpus
gene-mutation-disease-phenotype relations
• Variome Annotation Schema
– a schema defining entities and relations of interest to curation of genetic variants
• Variome Corpus
– A corpus of full text articles annotated according to the Variome Annotation Schema
– To be used as training and evaluation data for text mining tools for extracting genetic variation information from the published literature
http://www.opennicta.com.au/home/health/variome
The Variome Corpus
10 full-text publications related to colorectal cancer
Entities                    Relations
Gene                        Gene-has-Mutation
Mutation                    Cohort/Patient-has-Mutation
Disease                     Mutation-relatedto-Disease
Body part                   Disease-relatedto-Gene
Cohort/Patient              Disease-relatedto-BodyPart
Size                        Mutation-has-Size
Age                         Cohort/Patient-has-Age
Gender                      Cohort/Patient-has-Gender
Ethnicity or Geo Location   Cohort/Patient-has-EthnicityLoc
Characteristic              Cohort/Patient-has-Disease
                            Cohort-has-size
Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP. (2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of Biological Databases and Curation, bat019.
§ 43k words
§ Double-annotated
§ IAA varies
§ .88-.92 F for entities
§ Relations much lower; reconciled manually
The Variome Corpus annotation
• Recognise genetic variants
• Named entity recognition for gene names
– Supervised learning for recognizing characteristics and contexts
– Combined with dictionaries to support normalisation
• Associating variants to genes
– Simple co-occurrence
– Combined with sequence verification
– Machine learning for relation classification (PKDE4J)
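Simple co-occurrence, the first association strategy listed, pairs every gene with every variant mentioned in the same sentence. The sketch below assumes a small gene dictionary (the four Lynch syndrome genes named earlier), a narrow HGVS protein pattern, and a naive sentence splitter; it is an illustration, not the system used in the study.

```python
import re

# Assumptions: a tiny gene dictionary and a pattern for simple HGVS protein
# substitutions only; real systems use trained gene recognisers.
GENES = {"MLH1", "MSH2", "MSH6", "PMS2"}
VARIANT = re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def cooccurring_pairs(text):
    """Return every (gene, variant) pair that shares a sentence."""
    pairs = set()
    for sentence in SENTENCE_END.split(text):
        genes = {tok for tok in re.findall(r"\b[A-Z0-9]+\b", sentence) if tok in GENES}
        variants = set(VARIANT.findall(sentence))
        pairs.update((gene, variant) for gene in genes for variant in variants)
    return pairs


text = "The MLH1 p.Lys618Ala variant was found in 17 patients. MSH2 was not sequenced."
print(cooccurring_pairs(text))
```

Co-occurrence over-generates when a sentence names several genes, which is where sequence verification or a relation classifier comes in.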
Extraction of mutation relations from text
Information Extraction, Structuring text
From:
A subset of colorectal tumour DNA samples from 17 patients carrying the p.Lys618Ala variant …
To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7    (colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7    (patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge    (patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62    (patients has p.Lys618Ala)
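The structured output is a standoff annotation: `T` lines give an entity type, character offsets, and covered text, and `R` lines link two entity identifiers. A minimal sketch of reading such lines follows, assuming whitespace-separated fields as they appear on the slide and leaving out the parenthesised glosses; the function names are invented.

```python
def parse_standoff(lines):
    """Split annotation lines into an entity table and a list of relations."""
    entities, relations = {}, []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        ident = parts[0]
        if ident.startswith("T"):      # entity: id, type, start, end, text
            entities[ident] = {
                "type": parts[1],
                "start": int(parts[2]),
                "end": int(parts[3]),
                "text": " ".join(parts[4:]),
            }
        elif ident.startswith("R"):    # relation: id, type, Arg1:<id>, Arg2:<id>
            args = dict(p.split(":", 1) for p in parts[2:4])
            relations.append((parts[1], args["Arg1"], args["Arg2"]))
    return entities, relations


def readable(entities, relations):
    """Render each relation using the text of its two arguments."""
    return [f"{entities[a]['text']} {rel} {entities[b]['text']}" for rel, a, b in relations]


annotation = [
    "T60 body-part 1307 1317 colorectal",
    "T7 disease 1318 1324 tumour",
    "R17_m relatedTo Arg1:T60 Arg2:T7",
    "T61_merge size 1342 1344 17",
    "T24 cohort-patient 1345 1353 patients",
    "R46_2 has Arg1:T24 Arg2:T7",
    "T62 mutation 1367 1378 p.Lys618Ala",
    "R18_m has Arg1:T24 Arg2:T61_merge",
    "R19_m has Arg1:T24 Arg2:T62",
]
entities, relations = parse_standoff(annotation)
print(readable(entities, relations))
```

The rendered strings correspond to the glosses shown in parentheses above, for example "patients has p.Lys618Ala".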
PKDE4J: Yonsei University IE system
• PKDE4J
– Extensible, flexible text mining system for public knowledge discovery
– Entity and relation extraction from unstructured text data
– Extension of Stanford CoreNLP (Manning et al., 2014)
– http://informatics.yonsei.ac.kr/pkde4j
• Differentiation of PKDE4J
– Configurable system
o Dictionary based entity extraction
o Extensible system
o Wide range of relation extraction tasks, supported by an extensible rule engine based on dependency parsing
– Accurate performance
o PKDE4J outperforms many other competing algorithms for both entity and relation extraction
PKDE4J: Yonsei University IE system
• PKDE4J’s two major pipelines
– Entity Extraction: Target entities based on dictionaries by extending Stanford CoreNLP
– Relation Extraction: relationships among entities based on dependency tree based rules
PKDE4J – Named Entity Recognition
• Extension of Stanford CoreNLP
• Three major submodules
– Pre-Processing
o Abbreviation resolution
o Tokenization: Stanford PTBTokenizer
o Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP
o String normalization: Special characters processing
– Dictionary loading
o Flexible configuration (number and format of dictionaries)
o Trie data structure
– Entity annotation
o N-gram matching: Apache Lucene ShingleWrapper
o Approximate string matching: Soft-TFIDF
o Regex NER (Rule-based): Stanford CoreNLP
o Candidate entities filtering: POS filtering, Stopword removal
o Labeling: B/I/O format, Entity type
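Two of the ideas above, loading the dictionary into a trie and labelling matches in B/I/O format, can be sketched in a few lines. PKDE4J itself is a Java system built on Stanford CoreNLP; the Python below is only an illustration of the technique, with an invented dictionary, exact lower-cased token matching, and none of the approximate matching or filtering steps.

```python
def build_trie(dictionary):
    """Build a token-level trie; the key None marks the end of an entry."""
    root = {}
    for phrase, entity_type in dictionary.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[None] = entity_type
    return root


def bio_tag(tokens, trie):
    """Label tokens B-<type>/I-<type>/O using the longest dictionary match."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if None in node:               # a complete entry ends here
                best = (j, node[None])
        if best is None:
            i += 1
            continue
        end, entity_type = best
        labels[i] = f"B-{entity_type}"
        for k in range(i + 1, end):
            labels[k] = f"I-{entity_type}"
        i = end
    return labels


# Hypothetical dictionary entries.
trie = build_trie({"MLH1": "gene", "Lynch syndrome": "disease", "colorectal cancer": "disease"})
tokens = "MLH1 mutations cause Lynch syndrome".split()
print(list(zip(tokens, bio_tag(tokens, trie))))
```

The trie makes lookup cost depend on the length of the matched phrase rather than on the size of the dictionary, which matters when several large dictionaries are loaded.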
PKDE4J – Relation Extraction
• Based on dependency parse (grammatical structure) based rules
• To extract a relation
Step 1: Identify the verbs in a sentence
Category   Number of Verbs   Type         Verb Example
Positive   68                Increase     Lead, Contribute, Rise
                             Transmit     Shift, Move, Migrate
                             Substitute   Supplement, Alter
Negative   54                Decrease     Decline, Diffuse, Down-regulate
                             Remove       Deplete, Abrogate, Disassociate
Neutral    111               Contain      Possess, Constitute, Include
                             Modify       Methylate, Modulate, Normalize
                             Method       Bleach, Centrifuge, Spin
                             Report       Evaluate, Analyze, Examine
Plain      165               Plain        Return, Switch, Balance
PKDE4J - RE
Step 2: Check structure of sentence
• Syntactic rules based on deep parsing
– Dependency tree encodes grammatical relations between words in a sentence.
– The tree denotes syntactic dependencies between two entities.
– Need to spot the portion of parse tree that is useful, pertinent to location of entities in a sentence.
PKDE4J - RE
• Rule Extraction
– Use Strategy design pattern
– Capture predefined rules (17 strategies)
① Verb in dependency path
② No verb in dependency path
③ Detect nominalization
④ Weak nominalization
⑤ Negation
⑥ Tense (active / passive)
⑦ Contain clause
⑧ Clause distance
⑨ Negation clause
⑩ Number intervening entities
⑪ Entities in between
⑫ Surface distance
⑬ Entity counts
⑭ Same head
⑮ Entity order
⑯ Full tree path
⑰ Path length
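To make the Strategy idea concrete, the sketch below implements two of the seventeen strategies ("Verb in dependency path" and "Path length") as interchangeable functions over a hand-built dependency tree. It is an illustration of the pattern, not PKDE4J's code: the `Token` structure, the function names, and the toy parse are assumptions, and a real system would take the tree from a parser.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str    # part-of-speech tag, e.g. "NN" or "VBZ"
    head: int   # index of the governing token; -1 marks the root


def path_to_root(tokens, i):
    path = [i]
    while tokens[i].head != -1:
        i = tokens[i].head
        path.append(i)
    return path


def dependency_path(tokens, a, b):
    """Token indices on the tree path from a to b, through their lowest common ancestor."""
    up_a, up_b = path_to_root(tokens, a), path_to_root(tokens, b)
    common = next(i for i in up_a if i in up_b)
    return up_a[: up_a.index(common) + 1] + up_b[: up_b.index(common)][::-1]


# Each strategy has the same signature, so new ones can be added to the table.
def verb_in_path(tokens, path):
    return any(tokens[i].pos.startswith("VB") for i in path[1:-1])


def path_length(tokens, path):
    return len(path) - 1


STRATEGIES = {"verb_in_dependency_path": verb_in_path, "path_length": path_length}


def extract_features(tokens, a, b):
    path = dependency_path(tokens, a, b)
    return {name: strategy(tokens, path) for name, strategy in STRATEGIES.items()}


# Toy parse of "Mutation causes disease": both nouns attach to the verb.
sentence = [Token("Mutation", "NN", 1), Token("causes", "VBZ", -1), Token("disease", "NN", 1)]
print(extract_features(sentence, 0, 2))
```

Keeping each rule behind a common signature is what lets the rule set grow without changing the code that walks the tree.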
Evaluation: PKDE4J over Variome Corpus
• Experimental set-up
– Data split
– Features?
– 10-fold cross-validation
• Focus on relations: Used gold standard entities
• Baseline co-occurrence system
Results of the evaluation
Relation Extraction results for relations with at least 100 examples in the corpus.
Observations
• By applying text mining we can transform the literature from an unstructured, difficult to use resource, to a structured resource.
• We can build systems that can recognise core biological entities in the published literature.
• With this, the information is more accessible
– Formalised and normalised in a database
– Directly query-able
• and can be used to facilitate more computation:
– Information retrieval in terms of entities
– Predictive modeling and hypothesis generation
Conclusions
• Variants are relatively easy to recognise in the literature, when the recommended nomenclature is followed (so please use it!).
• The relations between variants and other entities are harder to extract, but still we can do a reasonable job.
• There is lots of information that is in ancillary files associated to the literature (with some challenges for automated systems).
The literature can be effectively mined to identify variant-related information to assist biocuration and clinical interpretation of variants.
© Copyright The University of Melbourne 2016