ChEBI, text mining and ontological best practice

Colin Batchelor
Royal Society of Chemistry
2008-05-19
batchelorc@rsc.org


What is text mining?

Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”

Can ChEBI help?


Overview

Reasoning

ChEBI as dictionary

Regular polysemy in chemistry

Some possible solutions



Reasoning

Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being.

Computers have no real-world knowledge beyond what we tell them.


Logical structure: properties of relations

We only have time to look at transitivity and is_a.

Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.

Relation   Transitive   Symmetric   Reflexive   Anti-symmetric
is_a       Yes          No          Yes         Yes
part_of    Yes          No          Yes         Yes
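As a rough illustration of what the four column headings mean, the following Python sketch checks each property on a small finite relation. The toy hierarchy (propane, alkanes, hydrocarbons) and the helper names are assumptions made for this example; they are not taken from ChEBI or from Smith et al.

```python
def is_transitive(rel):
    # a R b and b R c must imply a R c
    return all((a, d) in rel for a, b in rel for c, d in rel if b == c)

def is_symmetric(rel):
    # a R b must imply b R a
    return all((b, a) in rel for a, b in rel)

def is_reflexive(rel, domain):
    # every element must be related to itself
    return all((x, x) in rel for x in domain)

def is_antisymmetric(rel):
    # a R b and b R a may only hold together when a == b
    return all(a == b or (b, a) not in rel for a, b in rel)

# Hypothetical toy is_a relation, written out in full (reflexive pairs included).
DOMAIN = {"propane", "alkanes", "hydrocarbons"}
IS_A = {(x, x) for x in DOMAIN} | {
    ("propane", "alkanes"),
    ("alkanes", "hydrocarbons"),
    ("propane", "hydrocarbons"),
}

properties = {
    "transitive": is_transitive(IS_A),
    "symmetric": is_symmetric(IS_A),
    "reflexive": is_reflexive(IS_A, DOMAIN),
    "antisymmetric": is_antisymmetric(IS_A),
}
```

For this toy relation the four checks agree with the is_a row of the table.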


ChEBI’s is_a is not transitive (1)

If a relation R is transitive, then:

If a R b and b R c, then a R c.

glutathione is_a cofactor
cofactor is_a biological role

therefore glutathione is_a biological role
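A minimal sketch of the inference a reasoner would draw if is_a is treated as transitive. The two edges are the ones on this slide; the function name `ancestors` and the dictionary layout are assumptions for illustration only.

```python
def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Edges as asserted on the slide: child -> list of parents.
IS_A = {
    "glutathione": ["cofactor"],
    "cofactor": ["biological role"],
}

inferred = ancestors("glutathione", IS_A)
```

A reasoner that trusts transitivity will place "biological role" among the ancestors of glutathione, which is exactly the conclusion the slide title objects to.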


ChEBI’s is_a is not transitive (2)

water is_a amphiprotic solvent
amphiprotic solvent is_a protophilic solvent (*)
protophilic solvent is_a Bronsted base (*)
Bronsted base is_a base
base is_a biological role

therefore water is_a base
therefore water is_a biological role

* how come “protophilic solvent” and “Bronsted base” only have one child each?


ChEBI’s is_a is not transitive (3)

N-hydroxy-L-aspartic acid is_a hydroxamic acids

hydroxamic acids is_a organic functional classes

therefore N-hydroxy-L-aspartic acid is_a organic functional classes


is_a has many meanings!

1. An amount of a compound has a biological role: tris is_a buffer.*

2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*

3. A less-abstract type is an example of a more abstract type: propane is_a alkanes.

4. ?!: metals is_a atoms.*

* Not a property of a lone atom or molecule!


Computers need facts about the world, not about ChEBI curation


ChEBI as dictionary


Evaluating name–structure conversion with ChEBI

ChEBI release 37 (26 September 2007) contains 12688 annotated entities, of which 8486 have InChI strings.

We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.

We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.

The layered structure of the InChI lets us give partial credit for incomplete matches.
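One way to picture partial credit is to compare two InChI strings layer by layer. The sketch below is an assumption about how such a comparison could be written, not the scoring actually used with OSCAR3: the function names, the level labels and the order in which levels are tried are all hypothetical, and the alanine and methane strings are only test data.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # double-bond, tetrahedral, and stereo-type layers

def layers(inchi, drop_stereo=False, drop_fixed_h=False):
    """Split an InChI into its '/'-separated layers, optionally dropping some."""
    body = inchi.partition("=")[2]
    kept = []
    for index, layer in enumerate(body.split("/")):
        is_sublayer = index >= 2  # index 0 is the version, index 1 the formula
        if is_sublayer and drop_fixed_h and layer.startswith("f"):
            break  # the fixed-hydrogen section runs to the end of the string
        if is_sublayer and drop_stereo and layer.startswith(STEREO_PREFIXES):
            continue
        kept.append(layer)
    return tuple(kept)

def match_level(found, expected):
    """Return the strictest comparison under which the two InChIs agree."""
    checks = [
        ("exact", {}),
        ("ignoring fixed hydrogen", {"drop_fixed_h": True}),
        ("ignoring stereo", {"drop_stereo": True}),
        ("ignoring stereo and fixed hydrogen",
         {"drop_stereo": True, "drop_fixed_h": True}),
    ]
    for label, options in checks:
        if layers(found, **options) == layers(expected, **options):
            return label
    return "mismatch"

L_ALA = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALA_NO_STEREO = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
L_ALA_FIXED_H = L_ALA + "/f/h5H"
METHANE = "InChI=1/CH4/h1H4"
```

For example, `match_level(ALA_NO_STEREO, L_ALA)` reports agreement once stereo is ignored, so a name that resolves to the right skeleton but loses its stereochemistry still earns some credit.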


Results: IUPAC names

Measure                                              Count   Share of total
Total                                                8447
Identified as chemical                               8255    97.73%
With InChI (upper bound)                             1810    21.43%
Matching InChI, disregarding fixed hydrogen layer    1734    20.53%
Matching InChI, disregarding stereo                  1176
Matching InChI, exact (lower bound)                  1174    13.90%
Not all of name matched                              1024
Name identified as two or more separate names         974    11.53%


Results: ChEBI names

Measure                                              Count   Share of total
Total                                                8146
Identified as chemical                               7173    88.06%
With InChI (upper bound)                             1036    12.72%
Matching InChI, disregarding fixed hydrogen layer     953    11.70%
Matching InChI, disregarding stereo                   637
Matching InChI, exact (lower bound)                   628     7.71%
Not all of name matched                               764
Name identified as two or more separate names         373     4.58%



Regular polysemy

… where words stand for multiple things in a consistent way.

Examples:
Brand names
Grinding
Figure–ground
Exact–class–part polysemy in chemistry

Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.


Regular polysemy

Brand names: “Learning to buy a Renault and talk to BMW”

Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”
vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”


Regular polysemy

Figure–ground:
Audrey Hepburn painted the door (figure)
Audrey Hepburn walked through the door (ground)
The Incredible Hulk walked through the door (ambiguous)


Methyl, the radical (exact)


Methyl, the group (part)


Can ChEBI handle methyl?

methyl group (CHEBI:32875) YES
methyl radical (CHEBI:29309) YES


Imidazole (exact)


An imidazole (class)


imidazole side-chain/group/ring (part)


Can ChEBI handle imidazole?

imidazoles (CHEBI:24780) YES
imidazole (CHEBI:16069) YES
imidazole ring: not yet
imidazolyl group: not yet


Mapping exact, class and part to entries in ChEBI

Tests:
1. Has InChI: exact
2. Name is plural: class
3. Ends in –yl, “group” or “residue”: part

Test 2 doesn’t work for applications or roles.
Test 3 is brittle.
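The three tests above can be sketched as a small Python function. This is an illustration under assumptions, not the code behind the talk: the function name, the naive plural check, the order in which the tests are applied and the `has_inchi` flags passed in the usage example are all hypothetical.

```python
PART_SUFFIXES = ("yl", " group", " residue")

def classify(name, has_inchi):
    """Guess whether a ChEBI entry is an exact compound, a class or a part."""
    # Test 1: an entry with an InChI is taken to be an exact structure.
    if has_inchi:
        return "exact"
    # Test 2: a plural name is taken to be a class (deliberately naive check).
    if name.endswith("s") and not name.endswith("ss"):
        return "class"
    # Test 3: substituent-style endings are taken to be parts of molecules.
    if name.endswith(PART_SUFFIXES):
        return "part"
    return "unknown"

guesses = {
    "imidazole": classify("imidazole", has_inchi=True),
    "imidazoles": classify("imidazoles", has_inchi=False),
    "methyl group": classify("methyl group", has_inchi=False),
    "detergent": classify("detergent", has_inchi=False),
}
```

The "detergent" case shows the weakness of test 2: application and role names are singular, so string tests alone cannot place them.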

I would much rather use the logical structure of the ontology.


Some possible solutions


Some possible solutions (1)

ChEBI must represent facts about the world rather than about itself.

Examples:
If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.

“organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.


Some possible solutions (2)

ChEBI must distinguish between what is always true and what is only sometimes true.

Example: Replace some is_a relationships with has_biological_role and has_application.
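A minimal sketch of the effect of this change, assuming edges are stored as (subject, relation, object) triples. The triples, function names and data layout are hypothetical; the point is only that the is_a closure stops at roles once the role assertion has its own relation.

```python
# Hypothetical triples after splitting the overloaded is_a.
EDGES = [
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("propane", "is_a", "alkanes"),
]

def targets(term, relation):
    """Direct objects of `term` under one named relation."""
    return {obj for subj, rel, obj in EDGES if subj == term and rel == relation}

def is_a_ancestors(term):
    """Transitive closure over is_a edges only."""
    seen, stack = set(), [term]
    while stack:
        for parent in targets(stack.pop(), "is_a"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def biological_roles(term):
    """Directly asserted roles plus the role classes they fall under."""
    roles = set()
    for role in targets(term, "has_biological_role"):
        roles |= {role} | is_a_ancestors(role)
    return roles
```

With these triples, glutathione has no is_a ancestors that are roles, yet `biological_roles("glutathione")` still recovers both "cofactor" and "biological role".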

We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.


Questions?