Text mining for protein and small molecule relations Lars Juhl Jensen EMBL

BCB-Seminar, Charité, Berlin, Germany, January 19, 2006

Page 1: Text mining for protein and small molecule relations

Text mining for protein and small molecule relations

Lars Juhl Jensen

EMBL

Page 2: Text mining for protein and small molecule relations

Why?

Page 3: Text mining for protein and small molecule relations

Overview

• Entity recognition and identification
  – Recognition: find the words that are names of entities
  – Identification: figure out which entities they refer to

• Information extraction
  – Simple statistical co-occurrence methods
  – Natural Language Processing (NLP)

• Text mining
  – Mining text for overlooked relations
  – Discovery of global trends from text alone

Page 4: Text mining for protein and small molecule relations

Entity recognition

• Features
  – Morphological: mixes letters and digits or ends in -ase
  – Context: followed by “protein” or “gene”
  – Grammar: should occur as a noun

• Methodologies
  – Manually crafted rule-based systems
  – Machine learning (SVMs)

• But what can it be used for?
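The morphological and context features listed above could be computed roughly as follows. This is a minimal sketch: the function name and feature keys are hypothetical, and the grammar feature is left out because it would need a part-of-speech tagger.

```python
import re

def surface_features(tokens, i):
    """Return simple recognition features for tokens[i] (names are illustrative)."""
    tok = tokens[i]
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # Morphological: the token mixes letters and digits, e.g. Cdc28
        "mixes_letters_digits": bool(re.search(r"[A-Za-z]", tok) and re.search(r"\d", tok)),
        # Morphological: the token ends in -ase, e.g. kinase
        "ends_in_ase": tok.lower().endswith("ase"),
        # Context: the next token is a cue word
        "followed_by_cue": nxt in {"protein", "gene"},
    }

tokens = "The Cdc28 protein is a kinase".split()
feats = surface_features(tokens, 1)
```

Feature dictionaries like this could feed either a hand-written rule set or a classifier such as an SVM.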

Page 5: Text mining for protein and small molecule relations

Entity identification

• A good synonyms list is the key
  – Combine many sources
  – Curate to eliminate stop words

• Flexible matching to handle orthographic variation
  – Case variation: CDC28, Cdc28, and cdc28
  – Prefixes: myc and c-myc
  – Postfixes: Cdc28 and Cdc28p
  – Spaces and hyphens: cdc28 and cdc-28
  – Latin vs. Greek letters: TNF-alpha and TNFA
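One way to handle these variations is to map every name to a set of lookup keys and treat two names as the same entity when their key sets overlap. The sketch below is an assumption about how this could be done, not the matching scheme actually used; the Greek-letter table is a small illustrative subset and the prefix and postfix rules are deliberately narrow.

```python
import re

GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}  # illustrative subset only

def normalize(name):
    """Lowercase, drop spaces and hyphens, and map spelled-out Greek letters to Latin ones."""
    key = re.sub(r"[\s\-]+", "", name.lower())
    for word, letter in GREEK.items():
        key = key.replace(word, letter)
    return key

def match_keys(name):
    """Return the set of lookup keys under which a name may match."""
    lowered = name.lower()
    keys = {normalize(lowered)}
    # Prefix: a single letter plus hyphen, as in c-myc vs. myc
    m = re.match(r"^[a-z]-(.+)$", lowered)
    if m:
        keys.add(normalize(m.group(1)))
    # Postfix: a trailing "p" after a digit, as in Cdc28p vs. Cdc28
    m = re.match(r"^(.*\d)p$", lowered)
    if m:
        keys.add(normalize(m.group(1)))
    return keys

def same_entity(a, b):
    """True if the two names share at least one lookup key."""
    return bool(match_keys(a) & match_keys(b))
```

Rules this permissive will also create false matches, which is one reason a curated stop word list is needed alongside them.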

Page 6: Text mining for protein and small molecule relations

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Entities identified
  – S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
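A dictionary lookup over the example sentence reproduces this result. The sketch below uses only the four name-to-ORF mappings given on the slide; a real synonyms list would be far larger, and the tokenization here is a simplifying assumption.

```python
import re

# Name-to-identifier mappings taken from the slide; everything else is illustrative.
SYNONYMS = {"clb2": "YPR119W", "cdc28": "YBR160W", "swe1": "YJL187C", "cdc5": "YMR001C"}

def identify(text):
    """Map each recognized name in the text to its identifier."""
    found = {}
    for token in re.findall(r"[A-Za-z0-9]+", text):
        orf = SYNONYMS.get(token.lower())
        if orf:
            found[token] = orf
    return found

sentence = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
            "Swe1 and this modification served as a priming step to promote subsequent "
            "Cdc5-dependent Swe1 hyperphosphorylation and degradation")
entities = identify(sentence)
```

Cdk1 is not reported because it is not in this toy dictionary.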

Page 7: Text mining for protein and small molecule relations

Identification of small molecules

• We have compiled a list of 14 million synonyms for 4 million chemicals
  – This list was compiled from many resources: PubChem, KEGG, ChEBI, and SuperDrug
  – A stop word list was manually curated based on synonyms that occur 2000+ times in Medline

• Searching Medline with this list gives 12.5 million hits in 4.6 million abstracts
  – Precision and recall have not been evaluated yet
  – However, stop word curation has eliminated the most critical errors, so fairly high precision is likely
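The frequency cut-off that selects synonyms for manual review could look like the sketch below. The threshold of 2000 comes from the slide; the function name and the toy hit counts are hypothetical.

```python
from collections import Counter

def stop_word_candidates(synonym_hits, threshold=2000):
    """Return synonyms matched at least `threshold` times, for manual review."""
    counts = Counter(synonym_hits)
    return sorted(s for s, n in counts.items() if n >= threshold)

# Toy data: one entry per match of a synonym in the corpus (invented counts).
hits = ["lead"] * 2500 + ["aspirin"] * 120 + ["water"] * 2000
candidates = stop_word_candidates(hits)
```

The output is only a candidate list: a curator still decides which frequent synonyms are genuine chemical names and which are ordinary words.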

Page 8: Text mining for protein and small molecule relations

Co-occurrence

• Relations are extracted for co-occurring entities
  – Relations are always symmetric
  – The type of relation is not given

• Scoring the relations
  – More co-occurrences: more significant
  – Ubiquitous entities: less significant
  – Same sentence vs. same paragraph

• Simple, good recall, poor precision
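The three scoring criteria can be combined in many ways. The sketch below shows one possibility; the sentence and paragraph weights, the exponent alpha, and the formula itself are assumptions for illustration, not the scheme used in the talk.

```python
from collections import defaultdict
from itertools import combinations

SENTENCE_WEIGHT = 1.0   # assumed weight for same-sentence co-occurrence
PARAGRAPH_WEIGHT = 0.5  # assumed weight for same-paragraph-only co-occurrence

def cooccurrence_scores(paragraphs, alpha=0.6):
    """paragraphs: list of paragraphs, each a list of sentences, each a set of entity names."""
    pair = defaultdict(float)
    for sentences in paragraphs:
        in_sentence = set()
        for ents in sentences:
            in_sentence.update(combinations(sorted(ents), 2))
        all_ents = set().union(*sentences)
        # Each pair is counted once per paragraph, with the higher weight if it shares a sentence.
        for a, b in combinations(sorted(all_ents), 2):
            pair[(a, b)] += SENTENCE_WEIGHT if (a, b) in in_sentence else PARAGRAPH_WEIGHT
    marginal = defaultdict(float)
    for (a, b), w in pair.items():
        marginal[a] += w
        marginal[b] += w
    total = sum(pair.values())
    # Raw weighted count, damped by how ubiquitous the two entities are overall.
    return {
        (a, b): w ** alpha * (w * total / (marginal[a] * marginal[b])) ** (1 - alpha)
        for (a, b), w in pair.items()
    }

toy = [
    [{"A", "B"}, {"C"}],
    [{"A", "B"}],
    [{"A", "D"}],
]
scores = cooccurrence_scores(toy)
```

Pairs are stored as alphabetically sorted tuples, which keeps the relation symmetric and untyped, as on the slide.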

Page 9: Text mining for protein and small molecule relations

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Relations
  – Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
  – Wrong: Clb2–Cdc5 and Cdc28–Cdc5
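Enumerating all pairs among the four entities in the sentence reproduces this split. The entity names and the correct set are taken from the slide; the variable names are illustrative.

```python
from itertools import combinations

entities = ["Clb2", "Cdc28", "Swe1", "Cdc5"]

# Co-occurrence proposes every unordered pair of entities in the sentence.
pairs = {frozenset(p) for p in combinations(entities, 2)}

# Relations judged correct on the slide.
correct = {frozenset(p) for p in [("Clb2", "Cdc28"), ("Clb2", "Swe1"),
                                  ("Cdc28", "Swe1"), ("Cdc5", "Swe1")]}

wrong = pairs - correct
precision = len(pairs & correct) / len(pairs)  # 4 of 6 pairs
```

For this one sentence, four of the six proposed pairs are correct, which illustrates why co-occurrence has good recall but poor precision.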

Page 10: Text mining for protein and small molecule relations

NLP

• Information is extracted based on parsing and interpreting phrases or full sentences
  – Good at extracting specific types of relations
  – Handles directed relations

• Complex, good precision, poor recall

Page 11: Text mining for protein and small molecule relations

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Relations:
  – Complex: Clb2–Cdc28
  – Phosphorylation: Clb2→Swe1, Cdc28→Swe1, and Cdc5→Swe1

Page 12: Text mining for protein and small molecule relations

Syntacto-semantic tagging
  – Part-of-speech
  – Gene and protein names
  – Cue words for entity recognition
  – Cue words for relation extraction

Named entity chunking
A CASS grammar recognizes noun chunks related to gene expression:
[nxgene The GAL4 gene]

Relation chunking
Our CASS grammar also extracts relations between entities:
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
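The bracketed chunk notation above can be read into a nested structure with a small stack-based parser. This sketch only parses the notation shown on the slide; it is not the CASS grammar itself, it assumes well-formed brackets, and the function names are hypothetical.

```python
import re

def parse_chunks(text):
    """Parse '[label word ...]' notation into nested (label, children) tuples."""
    tokens = re.findall(r"\[\w+|\]|[^\s\[\]]+", text)
    stack = [("ROOT", [])]
    for tok in tokens:
        if tok.startswith("["):
            stack.append((tok[1:], []))      # open a new chunk
        elif tok == "]":
            node = stack.pop()               # close the current chunk
            stack[-1][1].append(node)
        else:
            stack[-1][1].append(tok)         # plain word
    return stack[0][1]

def flatten(nodes):
    """Yield the plain words under a list of nodes, in order."""
    for node in nodes:
        if isinstance(node, tuple):
            yield from flatten(node[1])
        else:
            yield node

def labelled(nodes, label):
    """Yield the text of every chunk carrying the given label."""
    for node in nodes:
        if isinstance(node, tuple):
            name, children = node
            if name == label:
                yield " ".join(flatten(children))
            yield from labelled(children, label)

tree = parse_chunks("[nxexpr The expression of [nxgene the cytochrome genes "
                    "[nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]")
names = list(labelled(tree, "nxpg"))
```

From such a tree, the regulated genes (inside nxexpr) and the regulator (the nxpg chunk after "is controlled by") can be read off as a directed relation.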

Page 14: Text mining for protein and small molecule relations

Extraction of relations between protein and small molecules

• Over 650,000 protein–chemical relations were identified using simple co-occurrence on Medline

• Benchmarking on DrugBank
  – Of the 959 protein–drug interactions in DrugBank, literature mining evidence was found for 299
  – 30% recall is thus our current upper limit
  – Precision has not yet been evaluated

• We are also working on adapting our NLP system to extract protein–chemical relations
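The recall figure follows directly from the two counts on the slide. The sketch below shows the calculation; the `recall` helper and its toy input are illustrative, and only the numbers 299 and 959 come from the source.

```python
def recall(mined_pairs, reference_pairs):
    """Fraction of reference pairs that also appear among the mined pairs."""
    reference = set(reference_pairs)
    return len(reference & set(mined_pairs)) / len(reference)

# Counts from the slide: 299 of 959 DrugBank interactions had literature evidence.
drugbank_recall = round(299 / 959, 3)

# Toy check of the helper with invented pairs.
toy_recall = recall({("p1", "d1"), ("p2", "d9")}, {("p1", "d1"), ("p2", "d2")})
```

The exact ratio is about 31%, which the slide rounds to 30%.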

Page 15: Text mining for protein and small molecule relations

Text mining

• New relations can be inferred from published ones
  – This can lead to actual discoveries if no person knows all the facts required for making the inference
  – Combine facts from disconnected literatures

• Global trends can be discovered from literature
  – Although all the detailed data is in the text, people may have missed the big picture
  – Identify significant correlations
  – Find temporal trends
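Inferring new relations from published ones can be sketched as a search for pairs that share an intermediate entity but have no direct link. This is a generic illustration under that assumption, with placeholder entity names, not the method used in the talk.

```python
from collections import defaultdict

def infer_indirect(relations):
    """relations: iterable of undirected (a, b) pairs.

    Returns {(a, c): {b, ...}} for pairs a, c that are not directly related
    but are both related to at least one intermediate b.
    """
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    inferred = defaultdict(set)
    for b, linked in neighbours.items():
        for a in linked:
            for c in linked:
                if a < c and c not in neighbours[a]:
                    inferred[(a, c)].add(b)
    return dict(inferred)

# Placeholder facts, as if drawn from two literatures that do not cite each other.
published = [("drug X", "protein P"), ("protein P", "disease D")]
hypotheses = infer_indirect(published)
```

Each inferred pair is only a hypothesis to be checked, supported by the intermediates listed with it.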

Page 17: Text mining for protein and small molecule relations

Correlations

• “Customers who bought this item also bought …”

• Correlated protein roles in networks
  – Transcription factors are themselves transcriptionally regulated
  – Kinases are themselves phosphorylated
  – Many proteins are both regulated transcriptionally and post-translationally

Page 18: Text mining for protein and small molecule relations

Temporal trends

Page 19: Text mining for protein and small molecule relations

Buzzwords

Page 20: Text mining for protein and small molecule relations

Acknowledgments

• Charité
  – Mathias Dunkel
  – Robert Preißner

• EML Research
  – Jasmin Saric
  – Isabel Rojas

• EMBL Heidelberg
  – Rossitza Ouzounova
  – Peer Bork
  – Rob Russell
  – Reinhard Schneider

Page 21: Text mining for protein and small molecule relations

Thank you!