Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)
Hu ZZ¹, Mani I², Liu H³, Vijay-Shanker K⁴, Hermoso V¹, Nikolskaya A¹, Natale DA¹, and Wu CH¹
¹Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057
²Georgetown University, 37th and O Streets, NW, Washington, DC 20057
³University of Maryland at Baltimore County, Baltimore, MD 21250
⁴Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716
PIRSF in DAG View
• PIRSF family hierarchy based on evolutionary relationships
• Standardized PIRSF family names as hierarchical protein ontology
• DAG network structure for PIRSF family classification system
PIRSF-Based Protein Ontology
ABSTRACT
An integrated protein literature mining resource, iProLINK, is developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online BioThesaurus is developed for protein/gene name mapping and to assist with protein named-entity recognition; and a protein ontology based on the PIRSF family classification is developed to complement other ontologies.
As the volume of scientific literature rapidly grows, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data from literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to have accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.
• Literature-Based Curation – Extract Reliable Information from Literature
• Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…
• This will ensure high quality, accurate and up-to-date experimental data for each protein. But it is a major bottleneck!
• Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management
• UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.
The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies for extracting information from biological literature and to develop a protein ontology.
INTRODUCTION
PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research
(http://pir.georgetown.edu)
UniProt – Central international database of protein sequence and function
(http://www.uniprot.org)
Bioinformatics. 2005 Jun 1;21(11):2759-65
High recall for paper retrieval and high precision for information extraction
• UniProtKB site feature annotation
• Proteomics MS data analysis: protein identification
Benchmarking of RLIMS-P
RLIMS-P processing pipeline (figure)
Input: Abstracts, Full-Length Texts
1. Preprocessing: sentence extraction, part-of-speech tagging
2. Entity Recognition: acronym detection, term recognition
3. Phrase Detection: noun and verb group detection, other syntactic structure detection
4. Semantic Type Classification
5. Relation Identification: nominal-level relation, verbal-level relation
6. Post-Processing
Output: Extracted Annotations, Tagged Abstracts
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
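Pattern 1 can be approximated with a regular expression. The following is only a rough sketch, assuming that the agent and theme are single whitespace-free tokens and that the site is a three-letter residue code followed by a position. RLIMS-P itself matches patterns over detected noun and verb groups rather than raw strings, and the names `PATTERN_1` and `match_pattern_1` are hypothetical.

```python
import re

# Hypothetical, simplified stand-in for Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)\s+"                      # AGENT: kinase (one token here)
    r"(?:also\s+)?"                           # optional adverb in the verb group
    r"(?:phosphorylates|phosphorylated)\s+"   # active-voice verb
    r"(?P<theme>\S+)"                         # THEME: substrate (one token here)
    r"(?:\s+(?:in|at)\s+(?P<site>[A-Z][a-z]{2}\s?\d+))?"  # optional SITE
)

def match_pattern_1(sentence):
    """Return agent/theme/site as a dict, or None if the pattern does not match."""
    m = PATTERN_1.search(sentence)
    return m.groupdict() if m else None

result = match_pattern_1("ATR/FRP-1 also phosphorylated p53 in Ser 15")
print(result)
```

For the poster's example sentence this yields ATR/FRP-1 as agent, p53 as theme, and Ser 15 as site. A sentence without a site phrase still matches, with the site left empty.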
http://pir.georgetown.edu/iprolink/
RLIMS-P Rule-based LIterature Mining System for Protein Phosphorylation
Phosphorylation (figure): Enzyme + Substrate → phosphorylated substrate, with a P-group added at the P-site
• <AGENT> Enzyme (kinase catalyzing the phosphorylation), e.g., MAP kinase
• <THEME> Substrate (protein being phosphorylated), e.g., cPLA2
• <SITE> P-Site (amino acid residue being phosphorylated), e.g., Ser505
• Product: phosphorylated-cPLA2 (Ser-P)
RLIMS-P
Protein Phosphorylation Annotation Extraction
• Manual tagging assisted with computational extraction
• Training sets of positive and negative samples
BioThesaurus report: UniProtKB entry P35625
• Tagging guideline versions 1.0 and 2.0
– Generation of domain expert-tagged corpora
– Inter-coder reliability – upper bound of machine tagging
• Dictionary pre-tagging
– F-measure: 0.412 (0.372 Precision, 0.462 Recall)
– Advantages: helps standardize tagging and its extent, reduces tagger fatigue, and improves inter-coder reliability.
• BioThesaurus for pre-tagging
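The F-measure reported for dictionary pre-tagging is consistent with the balanced harmonic mean of the stated precision and recall. The check below is a small sketch that assumes the balanced (beta = 1) form, since the poster does not state the weighting; `f_measure` is an illustrative helper name.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 is balanced F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values reported for dictionary pre-tagging.
f1 = f_measure(0.372, 0.462)
print(round(f1, 3))  # 0.412
```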
BioThesaurus construction (figure)
Source databases
• iProClass
• NCBI: Entrez Gene, RefSeq, GenPept
• UniProt: UniProtKB, UniRef90/50, PIR-PSD
• Genome: FlyBase, WormBase, MGD, SGD, RGD
• Other: HUGO, EC, OMIM
Processing steps
• Name Extraction: Raw Thesaurus
• Name Filtering: Highly Ambiguous, Nonsensical Terms
• Semantic Typing: UMLS
Result: BioThesaurus (UniProtKB Entries: Protein/Gene Names & Synonyms)
Applications:
• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources
BioThesaurus v1.0 (May 2005); m = million
# UniProtKB entries: 1.86m
# Source DB records: 6.6m
# Gene/protein names/terms: 3.6m
Protein Name Tagging
Example 2. Name ambiguity of CLIM1
PIRSF to GO Mapping
• Superimpose GO and PIRSF hierarchies
• Bidirectional display (GO- or PIRSF-centric views)
• Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
– 68% of the PIRSF families and subfamilies map to GO leaf nodes
– 2329 PIRSFs have shared GO leaf nodes
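The two mapping statistics (share of families that reach a GO leaf node, and families that share a leaf node) can be computed from a family-to-GO mapping and the GO parent relations. The sketch below runs on toy data only: every identifier is a placeholder, the function names are hypothetical, and it is not the procedure used to obtain the figures reported here.

```python
from collections import defaultdict

# Toy GO fragment: child term -> parent terms (placeholder IDs, not real GO terms).
go_parents = {
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:B"],
}
# Toy mapping: family -> GO terms it maps to (placeholder family IDs).
family_to_go = {
    "FAM1": ["GO:D"],
    "FAM2": ["GO:D"],
    "FAM3": ["GO:B"],
    "FAM4": ["GO:C"],
}

def leaf_terms(parents):
    """Terms that never appear as a parent of another term."""
    all_terms = set(parents) | {p for ps in parents.values() for p in ps}
    has_child = {p for ps in parents.values() for p in ps}
    return all_terms - has_child

def leaf_mapping_stats(family_to_go, parents):
    """Return (fraction of families mapped to a leaf, families sharing a leaf)."""
    leaves = leaf_terms(parents)
    by_leaf = defaultdict(set)
    for family, terms in family_to_go.items():
        for term in terms:
            if term in leaves:
                by_leaf[term].add(family)
    mapped = set().union(*by_leaf.values()) if by_leaf else set()
    shared = {f for fams in by_leaf.values() if len(fams) > 1 for f in fams}
    return len(mapped) / len(family_to_go), shared

fraction, shared = leaf_mapping_stats(family_to_go, go_parents)
print(fraction, sorted(shared))
```

In the toy data, three of four families map to a leaf term, and FAM1 and FAM2 share the leaf GO:D.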
DynGO viewer
Two cases: analyze GO branches and concepts and identify missing GO nodes
Case I. Nuclear receptor superfamily
Case II. IGF-binding protein superfamily
iProLINK: An integrated protein resource for literature mining
1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology
http://pir.georgetown.edu/iprolink/
Testing and Benchmarking Dataset
• RLIMS-P text mining tool
• Protein dictionaries
• Name tagging guideline
• Protein ontology
Protein Ontology Can Complement GO
Expanding a Node: Identification of GO subtrees that need expansion if GO concepts are too broad
– IGFBP subfamilies
– High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP
GO-centric view
Exploration of Gene and Protein Ontology
PIRSF-centric view
Molecular function
Biological process
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process:
– estrogen receptor binding and
– estrogen receptor signaling pathway
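The linking idea can be sketched in a few lines: terms from different GO sub-ontologies are connected whenever they annotate the same protein family. This is a minimal illustration, assuming annotations are available per family as (sub-ontology, term) pairs. The data layout and the name `cross_links` are assumptions, and only the two example terms come from the poster.

```python
from itertools import product

# GO terms annotated on one family, as (sub-ontology, term) pairs.
# The two terms are the poster's example; the layout is an assumption.
family_annotations = {
    "Estrogen receptor alpha": [
        ("molecular function", "estrogen receptor binding"),
        ("biological process", "estrogen receptor signaling pathway"),
    ],
}

def cross_links(annotations, aspect_a, aspect_b):
    """Pair every term of aspect_a with every term of aspect_b."""
    a_terms = sorted(t for aspect, t in annotations if aspect == aspect_a)
    b_terms = sorted(t for aspect, t in annotations if aspect == aspect_b)
    return list(product(a_terms, b_terms))

links = cross_links(
    family_annotations["Estrogen receptor alpha"],
    "molecular function",
    "biological process",
)
print(links)
```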
Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
H. Liu from University of Maryland Department of Information Systems on protein name recognition and text mining.
K. Vijay-Shanker from University of Delaware Department of Computer and Information Sciences on text mining of protein phosphorylation features.
Summary
• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
• The RLIMS-P text-mining tool extracts protein phosphorylation information from PubMed literature. By coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.
• BioThesaurus can be used for name mapping and to resolve name synonymy and ambiguity.
• The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes and by providing systematic links between the three GO sub-ontologies.
PIRSF classification levels
• Domain Superfamily: one common Pfam domain
• PIRSF Superfamily: 0 or more levels; one or more common domains
• PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
• PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization
Examples (figure), listed by Pfam domain
PF02153: Prephenate dehydrogenase (PDH)
– PIRSF001499: Bifunctional CM/PDH (T-protein)
– PIRSF006786: PDH, feedback inhibition-insensitive
– PIRSF005547: PDH, feedback inhibition-sensitive
PF01817: Chorismate mutase (CM)
– PIRSF017318: CM of AroQ class, eukaryotic type
– PIRSF001501: CM of AroQ class, prokaryotic type
– PIRSF026640: Periplasmic CM
– PIRSF001500: Bifunctional CM/PDT (P-protein)
– PIRSF001499: Bifunctional CM/PDH (T-protein)
PF00219: Insulin-like growth factor binding protein (IGFBP)
– PIRSF001969: IGFBP
– PIRSF500001: IGFBP-1
– …
– PIRSF500006: IGFBP-6
– PIRSF018239: IGFBP-related protein, MAC25 type
PF02735: Ku70/Ku80 beta-barrel domain
– PIRSF800001: Ku70/80 autoantigen
– PIRSF003033: Ku70 autoantigen
– PIRSF016570: Ku80 autoantigen
– PIRSF006493: Ku, prokaryotic type
PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins
Definitions
Basic unit = Homeomorphic Family
Homeomorphic: Full-length similarity, common domain architecture
Network Structure: Flexible number of levels with varying degrees of sequence conservation
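A network (DAG) rather than a strict tree is needed because one family can belong under more than one parent. In the poster's hierarchy figure, for example, the bifunctional CM/PDH family appears under both the PDH and the CM domains. The sketch below encodes a few of those entries as a child-to-parents map. The edge list is a reading of the figure and covers only a subset, and the representation and function names are assumptions, not the format of the PIRSF DAG files.

```python
# child -> parents; a node with several parents makes this a DAG, not a tree.
# Edges are read off the poster's hierarchy figure (subset only).
parents = {
    "PIRSF001499": ["PF02153", "PF01817"],  # Bifunctional CM/PDH (T-protein)
    "PIRSF006786": ["PF02153"],             # PDH, feedback inhibition-insensitive
    "PIRSF005547": ["PF02153"],             # PDH, feedback inhibition-sensitive
    "PIRSF001500": ["PF01817"],             # Bifunctional CM/PDT (P-protein)
    "PIRSF001501": ["PF01817"],             # CM of AroQ class, prokaryotic type
}

def ancestors(node, parents):
    """All nodes reachable by following parent links upward from node."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def multi_parent_nodes(parents):
    """Nodes with more than one parent, i.e. where the structure stops being a tree."""
    return sorted(n for n, ps in parents.items() if len(ps) > 1)

print(sorted(ancestors("PIRSF001499", parents)))
print(multi_parent_nodes(parents))
```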
PIRSF Protein Family Classification
Example 1. Name ambiguity of TIMP3
http://pir.georgetown.edu/iprolink/biothesaurus/
Web-based BioThesaurus
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping
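The three functions can be pictured as lookups over a two-way index between names and database entries. The sketch below is a toy illustration, not the BioThesaurus implementation: the class `MiniThesaurus`, its methods, and all entry IDs and names are placeholders, and name matching is assumed to be case-insensitive.

```python
from collections import defaultdict

class MiniThesaurus:
    """Toy two-way index between names and entry IDs (hypothetical design)."""

    def __init__(self):
        self.entry_names = defaultdict(set)   # entry ID -> names
        self.name_entries = defaultdict(set)  # normalized name -> entry IDs

    @staticmethod
    def _norm(name):
        return name.strip().lower()

    def add(self, entry_id, name):
        self.entry_names[entry_id].add(name)
        self.name_entries[self._norm(name)].add(entry_id)

    def entries(self, name):
        """Underlying ID mapping: entries that carry this name."""
        return set(self.name_entries.get(self._norm(name), set()))

    def synonyms(self, name):
        """Other names attached to any entry that carries this name."""
        names = set()
        for entry_id in self.entries(name):
            names |= self.entry_names[entry_id]
        return {n for n in names if self._norm(n) != self._norm(name)}

    def is_ambiguous(self, name):
        """A name is ambiguous if it maps to more than one entry."""
        return len(self.entries(name)) > 1

# Placeholder data: NAME_B is shared by two different entries.
thesaurus = MiniThesaurus()
thesaurus.add("ENTRY_1", "NAME_A")
thesaurus.add("ENTRY_1", "NAME_B")
thesaurus.add("ENTRY_2", "NAME_B")
thesaurus.add("ENTRY_2", "NAME_C")
print(thesaurus.synonyms("name_a"), thesaurus.is_ambiguous("NAME_B"))
```

Ambiguity of the kind shown for TIMP3 and CLIM1 corresponds to a name that resolves to more than one entry in such an index.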
Online RLIMS-P text-mining tool (version 1.0)
http://pir.georgetown.edu/iprolink/rlimsp/
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence
DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/
Liu et al, 2005, submitted