Exploring Structure Databases 222nd American Chemical Society National Meeting Herman Skolnik Award Symposium August 28, 2001 Robert W. Snyder MDL Information Systems

Page 1: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

222nd American Chemical Society National Meeting

Herman Skolnik Award Symposium

August 28, 2001

Robert W. Snyder

MDL Information Systems

Page 2: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Agenda

Comparing Reaction Databases

Building a Reaction Knowledge Base

Top 10 Reaction Types

Measuring Unique Transformations in Reaction Databases

Page 3: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

An Interlude with Guenter…

Before Joining MDL

After Joining MDL

Page 4: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Agenda

Comparing Reaction Databases

Building a Reaction Knowledge Base

Top 10 Reaction Types

Measuring Unique Transformations in Reaction Databases

Page 5: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Comparing Reaction Databases

  • Complementary abstracting guidelines?
      • solution-phase: ChemInform Reaction Library
      • solid-phase: SPORE
      • heterocyclic chemistry: CHC
  • Price?
  • Literature references?
  • Overlap of reactions?

Page 6: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

“Duplications Among Reaction Databases”

  • Paper by James Hendrickson and Ling Zhang (JCICS, 2000, 40, 380-383)
  • Analyzed 16 reaction databases
      • including RefLib, CHC, RX-JSM, ORGSYN, CSM
      • total of 1,075,484 reactions
  • Converted reactions to common database format
  • Performed pair-wise duplication checks
  • Results: 2.7% duplication

Page 7: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Limitations of Exact Reaction Comparison

  • Requiring an exact reaction structure match may be too stringent a condition
  • Two reactions may have the same transformation but differ in side groups that don't play a significant role in the reaction

What if we could compare reaction databases by degree of similarity?

Page 8: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Molecule Similarity

[Figure: three structurally similar molecules with different uses, labeled Diagnostic Agent, Antiarthritic, and Anesthetic (structure drawings not recoverable)]

Page 9: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Similarity

[Figure: example reaction schemes compared for similarity (structure drawings not recoverable)]

Page 10: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Tanimoto Coefficient Measures Similarity

$$
T(A,B) = \frac{\sum_J N[A,J]\, N[B,J]}{\sum_J N[A,J]^2 + \sum_J N[B,J]^2 - \sum_J N[A,J]\, N[B,J]}
$$

where A, B are the two structures, and N[A,J] and N[B,J] are the number of occurrences of the Jth fragment in structures A and B
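The coefficient on this slide can be sketched in a few lines of Python. This is a minimal illustration, not MDL's implementation: the function name `tanimoto`, the dictionary representation, and the fragment labels are all assumptions made for the example.

```python
def tanimoto(counts_a, counts_b):
    """Tanimoto coefficient for two mappings of fragment -> occurrence count."""
    # Numerator: sum over fragments J of N[A,J] * N[B,J]
    shared = sum(n * counts_b.get(j, 0) for j, n in counts_a.items())
    # Sums of squared counts for each structure
    sq_a = sum(n * n for n in counts_a.values())
    sq_b = sum(n * n for n in counts_b.values())
    denom = sq_a + sq_b - shared
    # Two empty fingerprints have no defined similarity; return 0.0 by convention
    return shared / denom if denom else 0.0


# Hypothetical fragment counts for two structures
structure_a = {"C=O": 2, "N-H": 1, "aromatic C": 6}
structure_b = {"C=O": 1, "N-H": 1, "aromatic C": 6, "C-I": 1}

print(tanimoto(structure_a, structure_b))  # 39 / 41
```

Identical fingerprints give 1.0, and fingerprints with no fragment in common give 0.0.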

Page 11: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Tanimoto Coefficient Measures Similarity

  • Molecule and reaction keys are used to compute the Tanimoto coefficient
  • Structure keys are binary
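For binary keys the general formula simplifies, because every count is 0 or 1 and so equals its own square. The symbols a, b, and c below are introduced here for illustration and do not appear on the slide:

$$
N[\cdot,J] \in \{0,1\} \;\Rightarrow\; N[\cdot,J]^2 = N[\cdot,J],
\qquad
T(A,B) = \frac{c}{a + b - c}
$$

where a is the number of keys set in A, b the number set in B, and c the number set in both.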

Can we extend this concept to compare similarity of reaction databases?

Page 12: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Classification

  • Consistent assignment of a numerical index (15-digit integer) to a reaction center topology
  • Technology developed by InfoChem GmbH
  • Based on structural environment around the reaction center(s)
  • Can be used as an indicator for reaction type

Page 13: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Three Levels of Classification

[Figure: an example reaction center drawn at the three classification levels, Broad, Medium, and Narrow (structure drawings not recoverable)]

Page 14: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Classification Code Examples

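[Figure: three example reactions with their classification codes 324005931589888, 325741082969498, and 294560435478524; one scheme is annotated 69% (structure drawings not recoverable)]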

Page 15: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Database Similarity Measure

  • Treat reaction classification codes as synthetic methodology keys of the database
  • The more classification codes two databases share, the more similar they are
  • Existence of classification codes is nonbinary
      • there can be many reaction examples in a database with the same classification code
  • We can compute a Tanimoto coefficient between databases using the classification codes as the methodology fingerprint

Page 16: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Tanimoto Coefficient Measures Similarity

$$
T(A,B) = \frac{\sum_J N[A,J]\, N[B,J]}{\sum_J N[A,J]^2 + \sum_J N[B,J]^2 - \sum_J N[A,J]\, N[B,J]}
$$

where A, B are the two databases, and N[A,J] and N[B,J] are the number of occurrences of the Jth classcode in databases A and B
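Applied to whole databases, the same formula yields a pairwise similarity matrix. The sketch below assumes each database is available as a flat list of classcodes, one per reaction; the database names and the short integer codes are hypothetical stand-ins for real 15-digit classcodes.

```python
from collections import Counter
from itertools import combinations


def tanimoto(counts_a, counts_b):
    """Tanimoto coefficient over classcode occurrence counts."""
    shared = sum(n * counts_b.get(j, 0) for j, n in counts_a.items())
    sq_a = sum(n * n for n in counts_a.values())
    sq_b = sum(n * n for n in counts_b.values())
    denom = sq_a + sq_b - shared
    return shared / denom if denom else 0.0


def similarity_matrix(classcodes_by_db):
    """Return {db: {db: similarity}} for every pair of databases."""
    # Count how often each classcode occurs in each database (nonbinary keys)
    profiles = {name: Counter(codes) for name, codes in classcodes_by_db.items()}
    # A non-empty database is identical to itself, so the diagonal is 1.0
    matrix = {name: {name: 1.0} for name in profiles}
    for a, b in combinations(profiles, 2):
        score = tanimoto(profiles[a], profiles[b])
        matrix[a][b] = score
        matrix[b][a] = score  # the measure is symmetric
    return matrix


# Hypothetical toy data: one classcode per reaction
toy = {"DB1": [101, 101, 102], "DB2": [101, 103], "DB3": [104]}
matrix = similarity_matrix(toy)
print(matrix["DB1"]["DB2"])  # 2 / (5 + 2 - 2)
```

Because the counts are nonbinary, a classcode with many examples in both databases contributes more to the score than one that appears once in each.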

Page 17: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Database Similarity Matrices

[Table template: a 7 × 7 similarity matrix for one classification level, with rows and columns Rxn DB 1 through Rxn DB 7. Every diagonal entry is 1.00. An example off-diagonal value A is shown in the Rxn DB 2 and Rxn DB 4 rows, apparently in mirrored positions.]

Page 18: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Database Similarity Measures

BROAD    THEIL  RX-JSM  ORGSYN  CHC   SPORE  REFLIB  CIRX
THEIL    1.00   0.72    0.15    0.50  0.38   0.26    0.06
RX-JSM   0.72   1.00    0.11    0.41  0.31   0.32    0.08
ORGSYN   0.15   0.11    1.00    0.14  0.09   0.03    0.01
CHC      0.50   0.41    0.14    1.00  0.19   0.13    0.04
SPORE    0.38   0.31    0.09    0.19  1.00   0.09    0.03
REFLIB   0.26   0.32    0.03    0.13  0.09   1.00    0.30
CIRX     0.06   0.08    0.01    0.04  0.03   0.30    1.00

Page 23: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Database Similarity Measures

MEDIUM   THEIL  RX-JSM  ORGSYN  CHC   SPORE  REFLIB  CIRX
THEIL    1.00   0.46    0.17    0.30  0.13   0.23    0.05
RX-JSM   0.46   1.00    0.11    0.26  0.14   0.23    0.06
ORGSYN   0.17   0.11    1.00    0.09  0.04   0.03    0.01
CHC      0.30   0.26    0.09    1.00  0.04   0.11    0.03
SPORE    0.13   0.14    0.04    0.04  1.00   0.04    0.02
REFLIB   0.23   0.23    0.03    0.11  0.04   1.00    0.29
CIRX     0.05   0.06    0.01    0.03  0.02   0.29    1.00

Page 25: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Database Similarity Measures

NARROW   THEIL  RX-JSM  ORGSYN  CHC   SPORE  REFLIB  CIRX
THEIL    1.00   0.24    0.14    0.16  0.10   0.28    0.05
RX-JSM   0.24   1.00    0.07    0.12  0.09   0.21    0.06
ORGSYN   0.14   0.07    1.00    0.06  0.03   0.04    0.01
CHC      0.16   0.12    0.06    1.00  0.03   0.09    0.03
SPORE    0.10   0.09    0.03    0.03  1.00   0.05    0.02
REFLIB   0.28   0.21    0.04    0.09  0.05   1.00    0.21
CIRX     0.05   0.06    0.01    0.03  0.02   0.21    1.00

Page 27: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Center Environment vs. Average Similarity

[Figure: chart of Average Database Similarity (y axis, 0.00 to 0.25) against Size of Reaction Center Environment (x axis, 0 to 8). Plotted values in order of appearance: 0.21, 0.14, 0.10, 0.07, 0.05, 0.03, 0.02]
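The first three plotted values (0.21, 0.14, 0.10) appear to be the mean off-diagonal entries of the Broad, Medium, and Narrow matrices from the preceding slides. The slide does not say how the average was taken, so the check below rests on that assumption; the matrix values are copied from the slides, with rows and columns in the order THEIL, RX-JSM, ORGSYN, CHC, SPORE, REFLIB, CIRX.

```python
BROAD = [
    [1.00, 0.72, 0.15, 0.50, 0.38, 0.26, 0.06],
    [0.72, 1.00, 0.11, 0.41, 0.31, 0.32, 0.08],
    [0.15, 0.11, 1.00, 0.14, 0.09, 0.03, 0.01],
    [0.50, 0.41, 0.14, 1.00, 0.19, 0.13, 0.04],
    [0.38, 0.31, 0.09, 0.19, 1.00, 0.09, 0.03],
    [0.26, 0.32, 0.03, 0.13, 0.09, 1.00, 0.30],
    [0.06, 0.08, 0.01, 0.04, 0.03, 0.30, 1.00],
]
MEDIUM = [
    [1.00, 0.46, 0.17, 0.30, 0.13, 0.23, 0.05],
    [0.46, 1.00, 0.11, 0.26, 0.14, 0.23, 0.06],
    [0.17, 0.11, 1.00, 0.09, 0.04, 0.03, 0.01],
    [0.30, 0.26, 0.09, 1.00, 0.04, 0.11, 0.03],
    [0.13, 0.14, 0.04, 0.04, 1.00, 0.04, 0.02],
    [0.23, 0.23, 0.03, 0.11, 0.04, 1.00, 0.29],
    [0.05, 0.06, 0.01, 0.03, 0.02, 0.29, 1.00],
]
NARROW = [
    [1.00, 0.24, 0.14, 0.16, 0.10, 0.28, 0.05],
    [0.24, 1.00, 0.07, 0.12, 0.09, 0.21, 0.06],
    [0.14, 0.07, 1.00, 0.06, 0.03, 0.04, 0.01],
    [0.16, 0.12, 0.06, 1.00, 0.03, 0.09, 0.03],
    [0.10, 0.09, 0.03, 0.03, 1.00, 0.05, 0.02],
    [0.28, 0.21, 0.04, 0.09, 0.05, 1.00, 0.21],
    [0.05, 0.06, 0.01, 0.03, 0.02, 0.21, 1.00],
]


def mean_off_diagonal(matrix):
    """Average the upper-triangle entries (each database pair counted once)."""
    size = len(matrix)
    pairs = [matrix[i][j] for i in range(size) for j in range(i + 1, size)]
    return sum(pairs) / len(pairs)


averages = {
    name: round(mean_off_diagonal(m), 2)
    for name, m in (("BROAD", BROAD), ("MEDIUM", MEDIUM), ("NARROW", NARROW))
}
print(averages)
```

Under this reading the averages fall as the classification level narrows, which matches the downward trend in the chart.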

Page 28: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

An Interlude with Guenter…

Page 29: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Agenda

Comparing Reaction Databases

Building a Reaction Knowledge Base

Top 10 Reaction Types

Measuring Unique Transformations in Reaction Databases

Page 30: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Building a Reaction Knowledge Base

A knowledge base should:
  • provide a single point of entry
  • be based on reaction type
  • link to individual data sources for the full reaction

Can we build a reaction knowledge base on InfoChem reaction classification codes?

Page 31: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Knowledge Base

[Diagram: a central Reaction Knowledge Base linked to three Reaction DBs]
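One way to picture the knowledge base in this diagram is an index keyed by classcode, where each entry points back to the source database that holds the full reaction. The sketch below is an assumption about the data structure, not a description of the actual system; `ReactionRef`, `build_index`, the reaction identifiers, and the short integer codes are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionRef:
    """Pointer to one reaction record in a source database."""
    database: str
    reaction_id: str


def build_index(records):
    """Group (database, reaction_id, classcode) records by classcode."""
    index = defaultdict(list)
    for database, reaction_id, classcode in records:
        index[classcode].append(ReactionRef(database, reaction_id))
    return index


# Hypothetical records; real classcodes are 15-digit integers
records = [
    ("CIRX", "rxn-1", 101),
    ("SPORE", "rxn-7", 101),
    ("ORGSYN", "rxn-3", 102),
]
index = build_index(records)

# Single point of entry: look up a reaction type, then follow links to sources
for ref in index[101]:
    print(ref.database, ref.reaction_id)
```

Keeping only references in the index leaves the full reaction data in the individual databases, consistent with the goal of linking to each source.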

Page 32: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Knowledge Base – the Source

  • ChemInform = 644,947 rxns
  • The Reference Library = 171,110 rxns
  • SPORE = 9,193 rxns
  • CHC = 42,375 rxns
  • ORGSYN = 5,392 rxns
  • RX-JSM = 68,803 rxns
  • THEILHEIMER = 46,467 rxns

988,287 total rxns

Page 33: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Knowledge Base - Classcodes

  • Broad classcodes: 176,702 (5.6:1)
  • Medium classcodes: 373,923 (2.6:1)
  • Narrow classcodes: 447,562 (2.2:1)
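The ratios in parentheses appear to be reactions per distinct classcode, using the 988,287 total reactions in the knowledge base as the numerator. The slide does not state this, so it is an inference, but the arithmetic agrees with all three figures:

$$
\frac{988{,}287}{176{,}702} \approx 5.59,
\qquad
\frac{988{,}287}{373{,}923} \approx 2.64,
\qquad
\frac{988{,}287}{447{,}562} \approx 2.21
$$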

Page 34: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Reaction Knowledge Base - Questions

What are the top reported reaction types?

Which transformations have not been reported on solid-phase?

Are there any transformations that are unique to a database? Can this be used as a database selection criterion?
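The first two questions map onto simple queries over (database, classcode) pairs. The sketch below is illustrative only: the function names, the toy records, and the short integer codes are assumptions, and SPORE stands in for the solid-phase source as described earlier in the deck.

```python
from collections import Counter


def top_reaction_types(records, n=10):
    """Return the n most frequent classcodes as (classcode, count) pairs."""
    counts = Counter(classcode for _, classcode in records)
    return counts.most_common(n)


def missing_from(records, database):
    """Return classcodes seen somewhere in the knowledge base but not in `database`."""
    all_codes = {code for _, code in records}
    seen = {code for db, code in records if db == database}
    return all_codes - seen


# Hypothetical records: one (database, classcode) pair per reaction
records = [
    ("CIRX", 101), ("REFLIB", 101), ("SPORE", 101),
    ("CIRX", 102), ("CHC", 102),
    ("CIRX", 103),
]

print(top_reaction_types(records, n=2))  # most reported reaction types
print(missing_from(records, "SPORE"))    # types not reported on solid phase
```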

Page 35: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Agenda

Comparing Reaction Databases

Building a Reaction Knowledge Base

Top 10 Reaction Types

Measuring Unique Transformations in Reaction Databases

Page 36: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Top 10 Reaction Types (BROAD)

BROAD classcode    Freq.    Per-database counts as extracted (THEIL, RX-JSM, ORGSYN, CHC, SPORE, REFLIB, CIRX; digits run together in the source)
267586484050778    11,035   37046332121262,4806,647
228413385171318     9,324   453427591565771,4215,655
261039242542204     8,030   345340331936938075,293
242815105629999     6,751   3213222322561,6953,702
259123750963135     6,428   7732713152191,1714,238
248372910785489     6,331   1752723270671,2364,095
283117507256719     6,245   1183691264131,1833,954
248507929234704     5,889   82210137489793,982
247321184998100     5,184   16121223261635233,633
257407228603162     4,709   18615527176345843,394

Page 40: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Top 10 Reaction Types (MEDIUM)

MEDIUM classcode   Freq.    Per-database counts as extracted (THEIL, RX-JSM, ORGSYN, CHC, SPORE, REFLIB, CIRX; digits run together in the source)
313125515671298    4,564    136172153171,1842,688
312632986028511    2,717    110962117274331,827
312444602259924    2,477    1071442921313361,429
324501134592978    2,332    54699781761,777
318547512250093    2,185    5310923753151,442
323130024518137    1,463    6648192200184841
298065273835854    1,441    10673206440254817
313146019762509    1,433    757814153299864
255624291255456    1,423    8039121011249950
294529079813567    1,400    10533252845193913

Page 43: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Top 10 Reaction Types (NARROW)

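NARROW classcode   Freq.    Per-database counts as extracted (THEIL, RX-JSM, ORGSYN, CHC, SPORE, REFLIB, CIRX; digits run together in the source)
325741082969498    1,350    9964185340237775
324005931589888    1,070    427819410127642
336777852836899    1,055    23511173140702
294560435478524      944    5130728172625
329002963942591      921    11300042746
318972626250200      836    39226340161520
329038361228014      826    1039431248459
313549094109654      706    63141455108465
336340987165931      691    29251199076405
332262032163809      663    113323068485

Freq. values below 1,000 were separated from the run-together digits on the assumption that the table is sorted by frequency.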

Page 45: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Agenda

Comparing Reaction Databases

Building a Reaction Knowledge Base

Top 10 Reaction Types

Measuring Unique Transformations in Reaction Databases

Page 46: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Unique Transformations (Broad)

  • ChemInform = 84,266 (69%)
  • CHC = 9,589 (50%)
  • The Reference Library = 20,536 (39%)
  • SPORE = 619 (33%)
  • RX-JSM = 9,483 (32%)
  • ORGSYN = 401 (16%)
  • THEILHEIMER = 2 (<1%)

Page 47: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Unique Transformations (Medium)

  • ChemInform = 184,391 (73%)
  • CHC = 20,357 (65%)
  • The Reference Library = 47,107 (46%)
  • SPORE = 1,932 (52%)
  • RX-JSM = 20,433 (40%)
  • ORGSYN = 985 (26%)
  • THEILHEIMER = 2 (<1%)

Page 48: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Unique Transformations (Narrow)

  • ChemInform = 237,744 (77%)
  • CHC = 25,485 (75%)
  • SPORE = 3,180 (67%)
  • The Reference Library = 54,227 (50%)
  • RX-JSM = 22,311 (45%)
  • ORGSYN = 1,066 (32%)
  • THEILHEIMER = 2 (<1%)
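Counts like these can be sketched as a set computation: a classcode is unique to a database when no other database contains it. The slides do not define the percentages; the sketch assumes they are the unique classcodes as a share of that database's own distinct classcodes. The function name, the toy records, and the short integer codes are hypothetical.

```python
from collections import defaultdict


def unique_transformations(records):
    """Map each database to (unique classcode count, share of its distinct classcodes)."""
    dbs_by_code = defaultdict(set)
    codes_by_db = defaultdict(set)
    for database, classcode in records:
        dbs_by_code[classcode].add(database)
        codes_by_db[database].add(classcode)

    result = {}
    for database, codes in codes_by_db.items():
        # A classcode is unique when this is the only database reporting it
        unique = {code for code in codes if len(dbs_by_code[code]) == 1}
        result[database] = (len(unique), len(unique) / len(codes))
    return result


# Hypothetical records: one (database, classcode) pair per reaction
records = [
    ("CIRX", 101), ("REFLIB", 101), ("SPORE", 101),
    ("CIRX", 102), ("CHC", 102),
    ("CIRX", 103),
]
summary = unique_transformations(records)
print(summary["CIRX"])  # one unique classcode out of three distinct ones
```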

Page 49: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Summary

  • InfoChem reaction classification codes can be used to measure similarity between databases
  • A reaction knowledge base can be built using reaction classification and mined for:
      • ranking and linking of most common reaction types
      • identification of reaction types not reported on solid phase
      • contribution of each database to the overall knowledge

Page 50: Exploring Structure Databases - Our Mission | ACS Division of

Exploring Structure Databases

Thank You

Congratulations Guenter!

Thank you for your attention.