UniProt to MeSH
mapping proteins to disease terminologies
Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey
ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20
July 20 – Bio-Ontologies – ISMB 2007
The role of bioinformatics in biomedical research and future clinical patient care
Health problem in a patient
Bioinformatics:
- Data storage and representation
- Large-scale data generation
- Large-scale data analysis
Basic research:
- What is the mechanism?
- Epidemiological studies
Basic research results stored in databases
Up-to-date knowledge and large-scale results:
- Research direction
- New hypothesis
Drug development, clinical trials
Clinical patient care: doctor prescribes an individualized treatment plan.
Molecular-level decision-support tools:
- Structured knowledge representations
- ‘Filtered’ information on fundamental biological mechanisms and significant
Treatment outcome
Biomedical knowledge: a protein-centric view
Disease: pathology, diagnosis/prognosis, treatment, risk factor
Biological processes: biological pathway/network, protein-protein interaction
Proteins: sequence, function, structure, modifications
Genes: sequence, chromosomal location, regulation, expression
Proteins: sequence, function, structure, modifications
- High-quality manual annotation: protein name, sequence, function, domain, features and references
- 16,702 human proteins
Disease: pathology, diagnosis/prognosis, treatment, risk factor
- Links to 12,603 OMIM entries
- Links to other specialized databases
- 32,921 variants (or polymorphisms)
- >3’000 associated diseases
Biological processes: biological pathway/network, protein-protein interaction
- Pathway annotation
- Protein-protein interaction (DIP, IntAct)
- Protein 2D gel (Swiss-2DPAGE)
References
- Links to >100 other databases
- Over 82’420 journal references
Genes: sequence, chromosomal location, regulation, expression
- Genew, GeneCards, GenAtlas
- Expression data (e.g. CleanEx)
- Genome details: Ensembl
Objective
Increase the accessibility of molecular biology resources to clinical researchers by indexing
UniProtKB/Swiss-Prot with the MeSH terminology
Why UniProtKB/Swiss-Prot?
Most comprehensive warehouse of protein sequences
With a high level of annotation and highly cross-linked with other biological databases.
Includes data on more than 30’000 variants, mostly c-SNPs (coding SNPs) or SAPs (Single Amino-acid Polymorphisms)
More than 3’000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs)
http://beta.uniprot.org/
Disease annotation
UniProtKB/Swiss-Prot entry P35240
Why MeSH?
Controlled vocabulary thesaurus structured in a hierarchy of concepts
Each concept includes a set of terms: synonyms and lexical variants
MeSH is part of the UMLS, and, thus, linked to other medical terminologies
MeSH is used to index the biomedical literature
The structure of MeSH
Mapping procedure
UniProtKB/Swiss-Prot entry, disease comment line: extracted disease name
OMIM: title/alternative titles
For each: exact match, partial match against MeSH
Same descriptor
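The lookup side of this procedure can be sketched as an index from MeSH terms (preferred names and synonyms) to descriptors. This is a minimal sketch, not the authors' implementation: the index entries, the OMIM title strings and the function names `normalize`, `exact_match` and `same_descriptor` are illustrative assumptions, and the partial-match step is left out.

```python
# Toy index: every term (preferred name or synonym) points to its descriptor.
# The entries are illustrative and were not copied from MeSH.
MESH_INDEX = {
    "neurofibromatosis 2": "Neurofibromatosis 2",
    "neurofibromatosis type ii": "Neurofibromatosis 2",
    "epidermolysis bullosa simplex": "Epidermolysis Bullosa Simplex",
}

def normalize(name):
    """Lowercase and collapse whitespace, keeping the word order."""
    return " ".join(name.lower().split())

def exact_match(name, index=MESH_INDEX):
    """Return the descriptor whose term equals the name, or None."""
    return index.get(normalize(name))

def same_descriptor(sp_name, omim_titles, index=MESH_INDEX):
    """Map both sources and report whether they agree on one descriptor."""
    sp_hit = exact_match(sp_name, index)
    omim_hits = {exact_match(t, index) for t in omim_titles} - {None}
    agree = sp_hit is not None and omim_hits == {sp_hit}
    return sp_hit, omim_hits, agree

sp_hit, omim_hits, agree = same_descriptor(
    "Neurofibromatosis 2",                 # name extracted from Swiss-Prot
    ["NEUROFIBROMATOSIS TYPE II", "NF2"],  # hypothetical OMIM titles
)
print(sp_hit, omim_hits, agree)
```

Keeping synonyms in the same index as preferred names means a hit on any lexical variant resolves to one descriptor, which is what makes the agreement check between the two sources possible.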
Disease extraction
Extraction using regular expressions: ‘are the cause of’, ‘involved in’, etc.
MeSH: ‘Neurofibromatosis 2’
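A minimal sketch of this extraction step, using the two trigger phrases named on the slide. The capture rule (take the text after the trigger up to the first bracket, semicolon or period), the function name `extract_disease` and the sample sentences are assumptions made for illustration; the sentences are not copied from UniProtKB/Swiss-Prot entries.

```python
import re

# Trigger phrases from the slide. The capture group is an assumption:
# everything after the trigger up to the first '(', '[', ';' or '.'.
TRIGGERS = ["are the cause of", "involved in"]
PATTERN = re.compile(
    r"(?:%s)\s+(?P<disease>[^(\[;.]+)" % "|".join(re.escape(t) for t in TRIGGERS),
    re.IGNORECASE,
)

def extract_disease(comment):
    """Return the disease name following a trigger phrase, or None."""
    match = PATTERN.search(comment)
    if match is None:
        return None
    return match.group("disease").strip()

# Illustrative disease comment lines.
line_1 = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2) [MIM:101000]."
line_2 = "May be involved in hematopoietic tumors such as b-cell lymphomas."

print(extract_disease(line_1))
print(extract_disease(line_2))
```

The second line shows why a purely pattern-based rule can return more than the disease name: the whole phrase after the trigger is captured, including the "such as" wording.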
Term matching procedure
• Exact matches: same length, same word order, case insensitive
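• Partial matches: calculation of a similarity score between terms, based on the inverse document frequency (IDF) used in information retrieval:
[Equation garbled in extraction. Recoverable elements: a score S for a disease name, built from log(1/freq(cw)) and log(1/freq(ncw)) terms and from size(disease).]
The term with the highest score was chosen.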
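The idea of an IDF-weighted partial match can be illustrated as follows. This is a sketch under stated assumptions, not the formula from the slide: it assumes that cw are the words shared by the disease name and a candidate term, that ncw are the candidate's unshared words, and that the score is the IDF of the shared words minus the IDF of the unshared words, divided by the number of words in the disease name. The vocabulary `TERMS` is a toy list, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(term):
    """Lowercase and split on whitespace, commas and hyphens."""
    return term.lower().replace(",", " ").replace("-", " ").split()

def word_frequencies(terms):
    """Fraction of vocabulary terms in which each word occurs."""
    counts = Counter()
    for term in terms:
        counts.update(set(tokenize(term)))
    return {word: n / len(terms) for word, n in counts.items()}

def similarity(disease, candidate, freq):
    """Assumed score: IDF of shared words minus IDF of the candidate's
    unshared words, divided by the number of words in the disease name."""
    d_words = set(tokenize(disease))
    c_words = set(tokenize(candidate))
    gain = sum(math.log(1.0 / freq[w]) for w in d_words & c_words)
    penalty = sum(math.log(1.0 / freq[w]) for w in c_words - d_words)
    return (gain - penalty) / len(d_words)

def best_match(disease, terms, threshold=0.0):
    """Return (term, score) for the highest-scoring term, or None if the
    best score is below the threshold."""
    freq = word_frequencies(terms)
    score, term = max((similarity(disease, t, freq), t) for t in terms)
    return (term, score) if score >= threshold else None

# Toy vocabulary for illustration only.
TERMS = [
    "Epidermolysis Bullosa Simplex",
    "Epidermolysis Bullosa Dystrophica",
    "Neurofibromatosis 2",
    "Hematologic Neoplasms",
]

print(best_match("epidermolysis bullosa dystrophica, Cockayne-Touraine type", TERMS))
```

Rare words carry more weight than common ones, so a candidate sharing "dystrophica" outranks one that only shares "epidermolysis bullosa"; the threshold lets low-scoring candidates be rejected in favour of precision.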
Benchmark
• Used to evaluate the procedure in terms of recall and precision
• Used to set up a score threshold
[Figure: precision (40% to 100%) plotted against recall (0% to 60%); series: SP, OMIM]
92 disease names from 43 Swiss-Prot entries manually mapped to MeSH terms
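Setting a score threshold from such a benchmark amounts to sweeping candidate thresholds and reading off precision and recall at each one. The sketch below shows the mechanics only; the `toy` scores and the benchmark size of 10 are made up for illustration and are not the benchmark data, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored, total, threshold):
    """scored: list of (score, is_correct) pairs for proposed mappings.
    total: number of disease names in the benchmark."""
    kept = [ok for score, ok in scored if score >= threshold]
    correct = sum(kept)
    precision = correct / len(kept) if kept else 1.0
    recall = correct / total
    return precision, recall

# Made-up scores for illustration only; not the benchmark data.
toy = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]

for threshold in (0.0, 0.55, 0.7):
    p, r = precision_recall(toy, total=10, threshold=threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold discards low-scoring mappings, which tends to raise precision at the cost of recall; the chosen value fixes the trade-off for the fully automated run.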
Results on the Benchmark
92 disease comment lines (82 OMIM)
            Exact match                      Partial match                    Total
            Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP          16 (17%)   16 (17%)   100%       20 (22%)   16 (17%)   80%        36 (39%)   32 (35%)   89%
OMIM        21 (23%)   21 (23%)   100%       21 (23%)   19 (21%)   90%        42 (46%)   40 (43%)   95%
SP ∩ OMIM   10 (11%)   10 (11%)   100%       8 (9%)     8 (9%)     100%       18 (20%)   18 (20%)   100%
SP ∪ OMIM   27 (29%)   27 (29%)   100%       23 (25%)   19 (21%)   83%        50 (54%)   46 (50%)   92%
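The percentages in the SP row can be reproduced from the counts, assuming that "Retrieval" counts the disease names for which a MeSH term was proposed, that "Recall" counts those mapped correctly (both as a share of the 92 benchmark names), and that precision is correct mappings divided by proposed mappings. The helper `pct` is a hypothetical name.

```python
BENCHMARK_SIZE = 92  # disease comment lines in the benchmark

def pct(numerator, denominator):
    """Percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)

# SP row of the table: (retrieved, correct) per match type.
sp = {"exact": (16, 16), "partial": (20, 16), "total": (36, 32)}

for kind, (retrieved, correct) in sp.items():
    print(
        kind,
        f"retrieval={pct(retrieved, BENCHMARK_SIZE)}%",
        f"recall={pct(correct, BENCHMARK_SIZE)}%",
        f"precision={pct(correct, retrieved)}%",
    )
```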
Analysis of the results (1/3)
• Problems due to differences in granularity
Disease: ‘muscle liver brain eye nanism’
MeSH term (manual mapping / automatic mapping): ‘abnormalities, multiple’ / ‘muscle-eye-brain disease’
Analysis of the results (2/3)
• Problems in disease name extraction
Disease (extracted): ‘hematopoietic tumors such as b-cell lymphomas’
MeSH term (manual mapping / automatic mapping): ‘b-cell lymphoma’ / ‘hematologic neoplasms’
Analysis of the results (3/3)
• Problems inherent to the resources
Disease (SP): ‘epidermolysis bullosa simplex, Weber-Cockayne type’
Disease (OMIM alternative title): ‘epidermolysis bullosa dystrophica, Cockayne-Touraine type’
MeSH term (manual mapping / automatic mapping): ‘epidermolysis bullosa dystrophica’ / ‘epidermolysis bullosa simplex’
Results on all Swiss-Prot
3197 disease comment lines (2398 OMIM)
               SP           OMIM         SP ∩ OMIM    SP ∪ OMIM
Exact match    577 (18%)    655 (20%)    354 (11%)    866 (27%)
Partial match  691 (22%)    600 (19%)    317 (10%)    751 (23%)
Total          1268 (40%)   1225 (39%)   844 (26%)    1617 (51%)
Discussion
The mapping system was tuned for high precision to provide a fully automated procedure. But we need to improve the recall by:
• Including NLP techniques in the disease extraction and matching procedures
• Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH)
• Permitting a mapping to several MeSH terms
• Trying to map to other terminologies such as ICD-10 and SNOMED CT
• Using information from the literature, which is indexed with MeSH terms
Work in progress
Benchmark extended to 200 diseases
200 disease comment lines (173 OMIM)
            Exact match                      Partial match                    Total
            Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP          35 (18%)   35 (18%)   100%       54 (27%)   47 (24%)   87%        89 (45%)   82 (41%)   92%
OMIM        40 (20%)   38 (19%)   95%        56 (28%)   48 (24%)   86%        96 (48%)   86 (43%)   90%
SP ∩ OMIM   22 (11%)   22 (11%)   100%       28 (14%)   26 (13%)   93%        62 (31%)   60 (30%)   97%
SP ∪ OMIM   52 (26%)   51 (26%)   98%        65 (33%)   56 (28%)   86%        117 (59%)  107 (54%)  91%
Work in progress
Extract MeSH terms using full text from disease comment lines + references in Swiss-Prot + references in OMIM, and calculate their frequency.
This frequency is used to refine the score for partial matches.
Preliminary results: the recall was successfully increased to 62% without losing precision.
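One hypothetical way to use such a frequency is sketched below. The slides do not say how the frequency enters the score, so the additive rule in `refine`, the `weight` parameter, the function names and the toy annotations and scores are all assumptions made for illustration.

```python
from collections import Counter

def mesh_term_frequency(reference_annotations):
    """Relative frequency of each MeSH term across the indexed references."""
    counts = Counter(t for terms in reference_annotations for t in terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def refine(scores, frequency, weight=1.0):
    """Assumed refinement: add weight * literature frequency to each score."""
    return {t: s + weight * frequency.get(t, 0.0) for t, s in scores.items()}

# Toy data: MeSH terms attached to three references of one entry.
references = [
    ["Epidermolysis Bullosa Simplex", "Keratins"],
    ["Epidermolysis Bullosa Simplex"],
    ["Mutation"],
]
# Toy partial-match scores before refinement.
scores = {
    "Epidermolysis Bullosa Simplex": 0.30,
    "Epidermolysis Bullosa Dystrophica": 0.35,
}

refined = refine(scores, mesh_term_frequency(references))
print(max(refined, key=refined.get))
```

In this toy case the literature evidence reverses the ranking of two close candidates, which is the kind of effect a frequency-based refinement is meant to have.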
Conclusion
We developed a generic terminology mapping procedure which can be used to link various biomedical resources.
Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research.
These results will help improve the interoperability between medical informatics and bioinformatics.