Computer Structure Codes (after lectures by Dr. J.M. Barnard) • How do you store chemical structures on computer? • What can you do with them there? • How do the computer systems used in chemical informatics work?

Page 1: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Computer Structure Codes (after lectures by Dr. J.M. Barnard)

• How do you store chemical structures on computer?

• What can you do with them there?

• How do the computer systems used in chemical informatics work?

Page 2: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Representing a chemical structure

• How much information do you want to include?
  – atoms present
  – connections between atoms
    • bond types
  – stereochemical configuration
  – charges
  – isotopes
  – 3D coordinates for atoms

[Structure diagram: tyrosine]

Page 3: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Representing a chemical structure

• How much information do you want to include?
  – atoms present
  – connections between atoms
    • bond types (aromatic ring identification)
  – stereochemical configuration
  – charges
  – isotopes
  – 3D coordinates for atoms

[Structure diagram: tyrosine]

Page 5: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Representing a chemical structure

• How much information do you want to include?
  – atoms present
  – connections between atoms
    • bond types
  – stereochemical configuration
  – charges
  – isotopes
  – 3D coordinates for atoms

[Structure diagram: tyrosine drawn in a charged form, with an NH3+ group]

Page 6: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Representing a chemical structure

• How much information do you want to include?
  – atoms present
  – connections between atoms
    • bond types
  – stereochemical configuration
  – charges
  – isotopes
  – 3D coordinates for atoms

[Structure diagram: tyrosine with a carbon-14 labelled carbon]

Page 7: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

2D structure diagram

• chemists’ “natural language”
• used by most computer systems for display
• shows topology, optionally stereochemistry
• several commonly-used computer programs allow input/editing of structure diagrams
  – ISIS/Draw (MDL): http://www.mdl.com
  – ChemDraw (CambridgeSoft): http://www.cambridgesoft.com/products/
  – GRINS/JavaGRINS (Daylight): http://www.daylight.com/products/javatools.html

Page 8: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

2D structure diagram

• provides 2D pictorial representation of chemical structure
  – display on screen
  – cut/paste/embed in Word document etc.
• inter-convert with other forms for further processing
  – database searching
  – structure analysis
  – property prediction
  – database analysis

Page 9: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Registry Numbers

• unique identifiers for compounds or substances
  – catalog number
• most chemical databases have them
  – Chemical Abstracts
  – Beilstein
  – private compound registries in pharmaceutical companies
• usually just “idiot numbers”
  – no chemical information
• may have hierarchical structure: parent compound, stereoisomer, salt, batch
• need to decide what is a separate compound

Page 10: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Line Notations

• represent structures as a compact linear string of alphanumeric symbols
• easily handled by computer
  – compact storage
  – easily transmitted over a network
• allow rapid manual coding/decoding by trained users
  – much faster for input than using a structure drawing program

Page 11: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Line Notations: SMILES

Simplified Molecular Input Line Entry System

• developed by Dave Weininger (Daylight)

OC(=O)C(N)CC1=CC=C(O)C=C1

[Structure diagram: tyrosine, with the ring-closure atom marked 1]
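To make the reading of such a string concrete, here is a minimal sketch of a parser for a small SMILES subset. It is not Daylight's parser: it assumes only upper-case organic-subset symbols, explicit "=" and "#" bonds, parenthesised branches and single-digit ring closures, which is just enough for the tyrosine string above. The function name parse_smiles is made up for this illustration.

```python
import re

def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    Assumed subset: upper-case organic symbols, '-', '=', '#',
    branches in parentheses, single-digit ring closures.
    No aromatic (lower-case) atoms, charges, isotopes or stereo.
    """
    bond_order = {"-": 1, "=": 2, "#": 3}
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[-=#()]|\d", smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES syntax")
    atoms, bonds = [], []
    prev = None        # atom that the next atom will be bonded to
    stack = []         # atoms at which open branches started
    ring_open = {}     # ring-closure digit -> (atom index, bond order or None)
    pending = None     # explicit bond symbol waiting for its second atom
    for tok in tokens:
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in bond_order:
            pending = bond_order[tok]
        elif tok.isdigit():
            if tok in ring_open:               # second digit closes the ring
                other, order = ring_open.pop(tok)
                bonds.append((other, prev, pending or order or 1))
            else:                              # first digit opens it
                ring_open[tok] = (prev, pending)
            pending = None
        else:                                  # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev = len(atoms) - 1
            pending = None
    return atoms, bonds

atoms, bonds = parse_smiles("OC(=O)C(N)CC1=CC=C(O)C=C1")
```

The result is a zero-based atom list and a list of (atom, atom, bond order) triples, i.e. a simple connection table; hydrogens stay implicit.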

Page 12: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Other line notations

• ROSDAL (Beilstein): Representation Of Structure Diagram Arranged Linearly
    1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (Tripos)
    OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
• Wiswesser Line Notation (WLN) (obsolete)
    QVYZ1R DQ

[Structure diagram: tyrosine with numbered atoms, as used by the ROSDAL string]

Page 13: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Connection Tables (CTs)

• main form of structure representation in computer systems
  – list atoms and bonds (and other data) as a table
• many different formats
  – “internal” CTs (in memory)
    • algorithmic processing
  – “external” CTs (disk files)
    • archival storage
    • data exchange between programs

Page 14: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Internal Connection Table

• usually “redundant”
  – every bond shown twice, once for each atom
• implemented as array of records
• record for each atom might store
  – atomic type
  – hydrogen count
  – formal charge
  – 2D display co-ordinates
  – bonds to neighboring atoms
  – etc.
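One possible shape for such an array of records is sketched below. The field names and the add_bond helper are invented for this illustration; real systems lay out their internal tables in their own ways.

```python
from dataclasses import dataclass, field

@dataclass
class AtomRecord:
    element: str                        # atomic type
    h_count: int = 0                    # implicit hydrogens on this atom
    charge: int = 0                     # formal charge
    xy: tuple = (0.0, 0.0)              # 2D display co-ordinates
    bonds: dict = field(default_factory=dict)   # neighbour index -> bond order

def add_bond(table, i, j, order=1):
    """Record the bond on both atoms, which is what makes the table redundant."""
    table[i].bonds[j] = order
    table[j].bonds[i] = order

# Carboxylic acid group HO-C=O as a three-atom fragment (zero-based indices)
table = [AtomRecord("O", h_count=1), AtomRecord("C"), AtomRecord("O")]
add_bond(table, 0, 1, 1)
add_bond(table, 1, 2, 2)
```

Storing each bond twice costs memory but lets an algorithm list an atom's neighbours without scanning the whole bond list.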

Page 15: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

“Redundant” Connection Table

Atom  Element  H count  Neighbour and bond order (pairs)
 1    O        1         2 1
 2    C        0         1 1    3 2    4 1
 3    O        0         2 2
 4    C        1         2 1    5 1    6 1
 5    N        2         4 1
 6    C        2         4 1    7 1
 7    C        0         6 1    8 2   12 1
 8    C        1         7 2    9 1
 9    C        1         8 1   10 2
10    C        0         9 2   11 1   13 1
11    C        1        10 1   12 2
12    C        1        11 2    7 1
13    O        1        10 1

[Structure diagram: tyrosine with atoms numbered 1 to 13]
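As a small check on what "redundant" means, the tyrosine table on this slide can be transcribed into a dictionary and tested: every bond should appear on both of its atoms with the same order, and keeping only the entries with i < j gives the non-redundant bond list. The names CT, is_consistent and bond_list are made up for this sketch.

```python
# atom number -> (element, H count, {neighbour: bond order}), from the slide
CT = {
    1: ("O", 1, {2: 1}),
    2: ("C", 0, {1: 1, 3: 2, 4: 1}),
    3: ("O", 0, {2: 2}),
    4: ("C", 1, {2: 1, 5: 1, 6: 1}),
    5: ("N", 2, {4: 1}),
    6: ("C", 2, {4: 1, 7: 1}),
    7: ("C", 0, {6: 1, 8: 2, 12: 1}),
    8: ("C", 1, {7: 2, 9: 1}),
    9: ("C", 1, {8: 1, 10: 2}),
    10: ("C", 0, {9: 2, 11: 1, 13: 1}),
    11: ("C", 1, {10: 1, 12: 2}),
    12: ("C", 1, {11: 2, 7: 1}),
    13: ("O", 1, {10: 1}),
}

def is_consistent(ct):
    """True if each bond is listed on both atoms with the same order."""
    return all(ct[j][2].get(i) == order
               for i, (_, _, nbrs) in ct.items()
               for j, order in nbrs.items())

def bond_list(ct):
    """Non-redundant list of (atom, atom, order) with the lower number first."""
    return sorted((i, j, order)
                  for i, (_, _, nbrs) in ct.items()
                  for j, order in nbrs.items() if i < j)
```

For this table the check passes and the bond list has 13 entries, half the 26 neighbour entries stored in the redundant form.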

Page 16: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

MDL Connection Table

• proprietary file format developed by MDL
  – http://www.mdl.com/downloads/latest_releases/index.jsp
• de facto standard for exchange of datasets
• several different flavours and versions
  – Molfile (single molecule)
  – SDfile (set of molecules and data)
  – RGfile (Markush structure)
  – Rxnfile (single reaction)
  – RDfile (set of reactions with data)
• separates atoms, bonds into separate blocks
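The separation into an atom block and a bond block can be seen in a bare-bones writer and reader for the fixed-width V2000 flavour of the Molfile. This is only a sketch under assumptions: the column positions follow the commonly published V2000 layout and should be checked against the CTfile specification, all optional fields are written as zero, and the function names are invented here.

```python
def write_molfile(name, atoms, bonds):
    """atoms: list of (symbol, x, y); bonds: list of (i, j, order), 1-based."""
    lines = [name, "  sketch", ""]                 # three header lines
    # counts line: number of atoms, number of bonds, then fixed filler
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for sym, x, y in atoms:                        # atom block
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {sym:<3}" + " 0" + "  0" * 11)
    for i, j, order in bonds:                      # bond block
        lines.append(f"{i:3d}{j:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"

def read_molfile(text):
    """Read back only symbols, x/y coordinates and bonds."""
    lines = text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [(ln[31:34].strip(), float(ln[0:10]), float(ln[10:20]))
             for ln in lines[4:4 + n_atoms]]
    bonds = [(int(ln[0:3]), int(ln[3:6]), int(ln[6:9]))
             for ln in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return atoms, bonds

# Heavy atoms of ethanol with made-up display coordinates
ethanol_atoms = [("C", 0.0, 0.0), ("C", 1.5, 0.0), ("O", 2.25, 1.3)]
ethanol_bonds = [(1, 2, 1), (2, 3, 1)]
text = write_molfile("ethanol", ethanol_atoms, ethanol_bonds)
```

Unlike the redundant internal table, each bond appears once in the bond block, and atoms are referred to by their position in the atom block.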

Page 17: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Standard Connection Table Formats

• different vendors have proprietary CT formats
• many attempts to establish agreed “standard” formats
  – no real general success
  – different user communities have failed to coordinate efforts
  – some standards exist in restricted areas
• SMILES and MDL CT formats widely used
• most popular programs read/write several different formats

Page 18: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Standard Connection Table Formats

• Standard Molecular Data (SMD) format
  – never gained wide acceptance
• Protein Data Bank (PDB) format
• Crystallographic Information File (CIF)
• Molecular Information File (MIF)
  – developed from SMD and compatible with CIF
• Chemical Exchange Format (CXF)
  – Chemical Abstracts Service
• Chemical Markup Language (CML)
  – for data exchange using the Internet
• INChI (IUPAC/NIST Chemical Identifier)

Page 19: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Conclusions

• There are lots of ways of storing a chemical structure in a computer
  – including different amounts of information
• Most important ones are
  – line notations (e.g. SMILES)
  – connection tables (e.g. MDL Molfile)
  – nomenclature
• Structure diagrams used for input/output

Page 20: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Topological Graph Theory

• branch of mathematics
  – particularly useful in chemical informatics and in computer science generally
• study of “graphs” which consist of
  – a set of “nodes”
  – a set of “edges” joining pairs of nodes

Page 21: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Properties of graphs

• graphs are only about connectivity
  – spatial position of nodes is irrelevant
  – length of edges is irrelevant
  – crossing edges are irrelevant

Page 22: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Structure Diagrams as Graphs

• 2D structure diagrams very like topological graphs
  – atoms correspond to nodes
  – bonds correspond to edges
• terminal hydrogen atoms are not normally shown as separate nodes (“implicit” H)
  – reduces number of nodes by ~50%
  – “hydrogen count” information used to colour neighbouring “heavy atom”
  – separate nodes sometimes used for “special” hydrogens
    • deuterium, tritium
    • hydrogen bonded to more than one other atom
    • hydrogens attached to stereocentres
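The "~50%" figure can be checked on tyrosine, using the hydrogen counts from the redundant connection table shown earlier (atoms 1 to 13 in order). This is a worked example only; the variable names are arbitrary.

```python
# Implicit hydrogen count for each of the 13 heavy atoms of tyrosine
h_counts = [1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1, 1, 1]

heavy_atoms = len(h_counts)                  # nodes kept in the graph
all_atoms = heavy_atoms + sum(h_counts)      # nodes if every H were explicit
reduction = 1 - heavy_atoms / all_atoms      # fraction of nodes saved

print(f"{heavy_atoms} nodes instead of {all_atoms}: {reduction:.0%} fewer")
```

Here 13 nodes replace 24, a saving of roughly 46%, with the hydrogen count kept as a property ("colour") of each heavy atom.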

Page 23: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Advantages of using graphs

• mathematical theory is well understood

• graphs can be easily represented in computers– many useful algorithms are known

• identical graphs <=> identical molecules

• different graphs <=> different molecules

Page 24: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Disadvantages of graphs

• analogy between chemical structures and graphs is not perfect
  – identical graphs <=/=> identical molecules
  – different graphs <=/=> different molecules
• realities of chemical structures cause problems
  – aromaticity
  – stereochemistry
  – tautomerism
  – coordination compounds
  – multi-centre bonds
  – inorganic compounds
  – macromolecules
  – polymers
  – incompletely-defined substances
• many graph algorithms are inherently slow

Page 25: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Aromaticity

• electronic property of certain ring systems, giving enhanced chemical stability
• bonds in aromatic rings have properties that are distinct from single and double bonds
• generally accepted definition is Hückel rule
  – 4n+2 pi-electrons (n is a small integer)
• there are borderline cases
• aromaticity causes problems for computer representation
  – different systems deal with it in different ways
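The electron-count part of the Hückel rule is simple arithmetic, sketched below. It deliberately ignores everything else a real aromaticity perception routine must decide (which ring to examine, planarity, how many pi-electrons each atom contributes); the function name is made up.

```python
def is_huckel(pi_electrons):
    """True if the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

# benzene has 6 pi-electrons, cyclobutadiene 4, cyclooctatetraene 8
for name, count in [("benzene", 6), ("cyclobutadiene", 4), ("cyclooctatetraene", 8)]:
    print(name, is_huckel(count))
```

The hard part in practice is supplying the right electron count, which is where the borderline cases arise.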

Page 26: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Aromaticity problems

• using single and double bonds can give different topological graphs for the same compound

• one solution is to use an aromatic bond type

[Structure diagrams: dibromobenzene drawn with alternative bond arrangements]
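The problem can be reproduced with a toy model of a six-membered ring: the two Kekulé forms give different bond tables for the same compound, while a single aromatic bond type gives identical ones. The ring is numbered 0 to 5 and the helper name is invented for this sketch.

```python
def ring_bonds(first_double):
    """Bond table of a six-membered ring with alternating bond orders.

    first_double is 0 or 1 and selects one of the two Kekulé forms.
    Keys are unordered atom pairs, values are bond orders.
    """
    return {frozenset({i, (i + 1) % 6}): (2 if i % 2 == first_double else 1)
            for i in range(6)}

kekule_a = ring_bonds(0)     # double bonds 0-1, 2-3, 4-5
kekule_b = ring_bonds(1)     # double bonds 1-2, 3-4, 5-0

# Replacing every ring bond by one "aromatic" type removes the difference
aromatic_a = {bond: "aromatic" for bond in kekule_a}
aromatic_b = {bond: "aromatic" for bond in kekule_b}

print(kekule_a == kekule_b, aromatic_a == aromatic_b)
```

If substituents sit on atoms 0 and 1, the bond between them is double in one form and single in the other, so a naive graph comparison would treat the two drawings as different compounds.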

Page 27: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Alternating bonds and aromaticity

• Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds

  – this includes some systems that are not aromatic
  – and omits some that are

[Structure diagrams: example ring systems, including a sulfur-containing ring]

Page 28: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Representing aromaticity

• some systems represent aromaticity as an atom property
  – SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds)
• problem: aromaticity is really a ring property

Aromatic form      Kekulé form
s1cccc1            S1C=CC=C1
Brc1c(Br)cccc1     BrC1=C(Br)C=CC=C1

[Structure diagrams: the sulfur-containing five-membered ring and the dibromobenzene written above]

Page 29: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Tautomerism

• dynamic equilibrium between positional isomers (labile H)
• are they different compounds?
  – answer depends on what you want to do with them
• can use normalised bonds to represent them by a single graph
  – gets mixed up with ring alternating bonds
  – some tautomers may be aromatic, when others are not

[Structure diagrams: tautomeric forms of an N- and O-containing ring, differing in the position of the labile H]

Page 30: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Tautomerism

• tautomerism is a matter of degree
• tautomers can be defined in different ways
    HQ–X=R <=> Q=X–RH
  – only certain elements can be Q, X or R
• keto-enol tautomers are not recognised by Chemical Abstracts
• mono-unsaturated carbon chains are not distinguished by Daylight

[Structure diagrams: pairs of OH and C=O forms]

Page 31: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Structure conventions

sometimes called “business rules”

– some chemical groups can be shown in different but equally valid ways
– conventions are needed to determine which is preferred
– software may be needed to convert to preferred form

[Structure diagrams: an N/O/O group drawn in two alternative ways, one of them with N+]

Page 32: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Stereochemistry

• different compounds with identical connectivity
• same topology, different topography

[Structure diagrams: S-tyrosine and R-tyrosine]

Page 33: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Stereochemistry

• configuration is often unknown
  – or partially known (relative stereochemistry)
  – or you may have a mixture of stereoisomers
    • in which one isomer may occur in enantiomeric excess
• many different descriptors used by chemists
  – wedge (up) and hatched (down) bonds in structure diagrams
  – Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z)
  – text-based descriptors (stereoparent, or optical rotation)

Page 34: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Stereochemistry: up/down bonds

• can be used as additional “colours” for graph edges
  – many connection table formats have special codes for up and down bonds
  – need to know which end of bond is which
• useful for re-generating diagrams for display
• can be used to calculate other stereo descriptors

[Structure diagrams: two structures drawn with up and down bonds]

Page 35: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Up/down bond problems

• different patterns of up/down bonds can show the same stereoisomer
  – different graphs, same molecule
• some patterns of up and down bonds actually convey no useful information about configuration

[Structure diagrams: one stereoisomer drawn with two different up/down bond patterns; a centre carrying Cl, F, CH3 and CH2CH3]

Page 36: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Stereochemistry: CIP designators

• R.S. Cahn, C. Ingold, and V. Prelog,
  – Angewandte Chemie Intl. Ed. in English 1966, 5, 385-551
• one-letter designator for stereocenters
  – based on rules assigning priorities to groups around it
  – tetrahedral carbons (R, S)
  – double bonds (E, Z)
• additional colors for graph nodes or edges
  – useful for distinguishing stereoisomers when absolute configuration is known
  – less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed

Page 37: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Double bond stereo in SMILES

• / and \ used as “directional” single bonds
  – only meaningful when used on both atoms of a double bond
  – several ways of showing same configuration

[Structure diagrams: the two geometric isomers of a double bond carrying Cl and F on one carbon, Br and I on the other]

Cl/C(F)=C(Br)/I     Cl\C(F)=C(Br)/I
Cl\C(F)=C(Br)\I     Cl/C(F)=C(Br)\I
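The "several ways of showing same configuration" point can be illustrated for the simplest case: two SMILES with the same atom order and a single stereo double bond describe the same isomer if their directional marks are either identical or all reversed. The helper below is a hypothetical sketch limited to that case; it is not a general stereo comparison.

```python
# Translation table that swaps the two directional bond symbols
FLIP = str.maketrans("/\\", "\\/")

def same_double_bond_config(a, b):
    """Compare two SMILES that differ only in their directional marks.

    Assumes identical atom order and one stereo double bond; raises
    ValueError if the strings differ anywhere else.
    """
    aligned = len(a) == len(b) and all(
        x == y or (x in "/\\" and y in "/\\") for x, y in zip(a, b))
    if not aligned:
        raise ValueError("strings differ in more than their / and \\ marks")
    return b == a or b == a.translate(FLIP)

print(same_double_bond_config(r"Cl/C(F)=C(Br)/I", r"Cl\C(F)=C(Br)\I"))
print(same_double_bond_config(r"Cl/C(F)=C(Br)/I", r"Cl\C(F)=C(Br)/I"))
```

Reversing only one of the two marks changes the isomer, which is why the four strings on the slide fall into two pairs.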

Page 38: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Other complications

• Organometallic and co-ordination compounds
  – complex stereochemistry
  – special bond types may be needed (dative bonds etc.)
  – ambiguity over covalent/ionic character of bonds
    • “business rules” usually needed
• Inorganic compounds
  – topological representation often not possible
  – composition may not involve integral ratios between elements

Page 39: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Macromolecules

• in principle can represent all atoms, as for small molecules

• some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)

[Structure diagram: peptide chains drawn with amino-acid shortcuts (Asp, His, Val, Cys, Gly, Ala, Arg, Trp, Tyr, Pro) and terminal OH groups]

Page 40: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Macromolecules

• Each shortcut is defined with appropriate attachment points

• ordinary atoms can be mixed with shortcuts

• system can expand shortcuts when needed

[Structure diagram: the Tyr shortcut expanded to its atoms, with attachment points marked * and *"]

Page 41: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Polymers

• special problems are presented because properties of polymer can be affected by polymerisation conditions
  – average number of subunits
  – extent of cross-linking
  – ratio between different subunits
  – random / block sequences of subunits
  – etc.
• Two main approaches
  – monomer representation
  – structural repeating unit (SRU) representation

Page 42: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Incompletely-defined substances

• unknown stereochemistry

• unknown attachment position

• unknown repetition

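[Structure diagrams: examples with undefined stereochemistry, an undefined substituent position, and a repeating unit with count n]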

Page 43: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Markush (“Generic”) structures

– structures with R-groups
– shorthand for describing sets of structures with common features

[Structure diagram: core structure bearing OH, R1 and R2; R1 = Br, I or Cl; R2 = one of several alkyl groups; * marks the attachment point]

Page 44: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Markush structures

– also called “generic” structures
– very important in chemical patents
  • inventor claims whole class of related compounds
– can be used to describe combinatorial libraries
– can be used as queries in database searches
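How a Markush structure stands for a whole set of compounds can be sketched by enumerating R-group combinations. The core SMILES template and the R-group lists below are hypothetical stand-ins, not the ones drawn on the slide, and the braces are just Python format placeholders rather than any real Markush file syntax.

```python
from itertools import product

def enumerate_markush(core, r_groups):
    """Yield one SMILES per combination of R-group choices.

    core is a template with {R1}, {R2}, ... placeholders;
    r_groups maps each placeholder name to its allowed substituents.
    """
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield core.format(**dict(zip(names, choice)))

core = "Oc1ccc({R1})cc1{R2}"                 # hypothetical phenol core
r_groups = {
    "R1": ["Br", "I", "Cl"],                 # halogens
    "R2": ["C", "CC", "CCC"],                # methyl, ethyl, propyl
}
library = list(enumerate_markush(core, r_groups))
```

Three choices at each of two positions give nine specific structures; real patent Markush structures can cover numbers far too large to enumerate, which is why they are usually searched in generic form.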

Page 45: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Canonicalization

• a given chemical structure (or graph) can have many valid and unambiguous representations
  – different order of rows in connection table
  – different order of atoms in SMILES
• for comparison purposes it would be useful to have a single unique or “canonical” representation
• process of converting input representation to canonical form is called “canonicalization” or “canonization”
  – process of applying “rules” (i.e. an algorithm)

Page 46: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Canonicalization

• an obvious approach:
  – generate all possible valid SMILES
  – choose the one that comes first alphabetically
• this would be very slow, but effective, and there is a danger of missing one
  – principle was used for canonicalizing Wiswesser Line Notation
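The same brute-force idea can be sketched on a connection table instead of SMILES: try every renumbering of the atoms and keep the representation that sorts first. This is an analogue of the approach above, not the WLN procedure, and it is only usable for a handful of atoms because the number of renumberings grows as n!.

```python
from itertools import permutations

def canonical_form(labels, bonds):
    """Smallest (labels, bonds) over all renumberings of the atoms.

    labels: element symbol per atom; bonds: (i, j, order), zero-based.
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):          # perm[old] = new number
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = sorted((min(perm[i], perm[j]), max(perm[i], perm[j]), order)
                           for i, j, order in bonds)
        candidate = (tuple(new_labels), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best

# Heavy atoms of acetic acid, numbered in two different ways
form_a = canonical_form(["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
form_b = canonical_form(["O", "C", "O", "C"], [(3, 1, 1), (1, 0, 1), (1, 2, 2)])
# A different compound with the same atoms: a C-C-O-O chain
form_c = canonical_form(["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
```

Both numberings of acetic acid reduce to the same canonical form, while the chain does not, so equality of canonical forms can be used to compare structures.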

Page 47: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Canonicalization

• most methods in use today involve renumbering the atoms in some unique and reproducible way
  – can be used to number rows in connection table
  – can determine order of atoms in SMILES
• normally involve a node labelling technique called “relaxation”
  – example is Morgan’s algorithm (1965)

Page 48: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Symmetry perception

• if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next

• Morgan’s algorithm is thus also useful for identifying symmetry in molecules

Page 49: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Morgan’s algorithm

• Works by taking more of the graph into account at each iteration
  – essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours
• It is not infallible
  – graphs (“isospectral” graphs) are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
• There are many variations on it
  – and several theoretical papers analysing it mathematically
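The relaxation step can be sketched as follows: start each node at its number of neighbours, then repeatedly replace every value by the sum of its neighbours' values. The stopping rule here is one variation among many and is an assumption of this sketch: it runs a fixed number of rounds and keeps the round with the most distinct values, rather than stopping at the first round that fails to improve. Atom and bond types are ignored.

```python
def morgan_values(neighbours, rounds=None):
    """Extended-connectivity values for each node of a graph.

    neighbours maps each node to the list of nodes adjacent to it.
    """
    values = {node: len(adj) for node, adj in neighbours.items()}
    best = dict(values)
    for _ in range(rounds if rounds is not None else len(neighbours)):
        # relaxation: new value = sum of the neighbours' current values
        values = {node: sum(values[m] for m in adj)
                  for node, adj in neighbours.items()}
        if len(set(values.values())) > len(set(best.values())):
            best = dict(values)
    return best

# Heavy-atom graph of tyrosine, numbered as in the redundant connection table
tyrosine = {
    1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5, 6], 5: [4], 6: [4, 7],
    7: [6, 8, 12], 8: [7, 9], 9: [8, 10], 10: [9, 11, 13],
    11: [10, 12], 12: [11, 7], 13: [10],
}
values = morgan_values(tyrosine)
```

Atoms that end with equal values (here 8 and 12, 9 and 11, and the two oxygens 1 and 3, since element and bond types are not considered) are candidates for symmetrical equivalence; the remaining atoms each receive a value of their own.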

Page 50: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Ring perception

• How many rings are there in these structures and which ones are they?
• rings are important features of chemical structures
  – nomenclature generation
  – aromaticity perception
  – synthetic significance
  – fragment descriptor generation

Page 51: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Rings and ring systems

• A ring system is a subgraph in which every edge is part of a cycle
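Both questions have simple graph answers, sketched here with invented function names. An edge is part of a cycle if its two ends are still connected when that edge is ignored, and for a connected graph the number of independent rings is edges minus nodes plus one. The example is the heavy-atom graph of tyrosine from the earlier connection table.

```python
def connected(adj, start, goal, skip):
    """Depth-first search from start to goal that never crosses edge `skip`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in adj[node]:
            if {node, nxt} != skip and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def ring_edges(edges):
    """Edges that lie on at least one cycle."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return [(i, j) for i, j in edges if connected(adj, i, j, {i, j})]

def ring_count(edges):
    """Number of independent rings in a connected graph: edges - nodes + 1."""
    nodes = {n for edge in edges for n in edge}
    return len(edges) - len(nodes) + 1

tyrosine_edges = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8),
                  (8, 9), (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]
```

For tyrosine the ring edges are the six bonds of the benzene ring, which together form its single ring system, and the ring count is 1.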

Page 52: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Which rings to perceive?

• Usually the smallest set of smallest rings (SSSR)
  – two 6-membered rather than one 6- and one 10-membered
  – two 5-membered rather than one 5- and one 6-membered
• But there may be more than one SSSR
  – C-S-C-C-C-C
  – C-C-C-C-O-C
  – C-S-C-C-O-C

[Structure diagram: bridged ring system containing S and O]

Page 53: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Substructure Fragments

• Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap

[Structure diagram: tyrosine]

Page 54: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Fragment codes

– many early chemical information systems were based on identifying fragments of this sort
  • originally the fragments were identified manually
  • and represented on punched cards
– special fragment codes (dictionaries of fragments) were devised for different systems
  • some of these are still in use, though with automated encoding of structures
  • particularly important are the systems for “Markush” structures in patents (e.g. Derwent WPI code)

Page 55: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Fingerprints

• the fragments present in a structure can be represented as a sequence of 0s and 1s
    00010100010101000101010011110100
  – 0 means fragment is not present in structure
  – 1 means fragment is present in structure (perhaps multiple times)
• each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)
• for chemical structures often called structure “fingerprints”
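A dictionary-based fingerprint can be sketched with an ordinary integer as the bitstring. The eight-entry fragment dictionary below is hypothetical, and the set of fragments present is supplied by hand; in a real system it would come from searching the structure graph.

```python
# Hypothetical fragment dictionary: position in the list = bit position
FRAGMENTS = ["OH", "NH2", "COOH", "phenyl", "Br", "C=C", "S", "nitro"]

def fingerprint(present):
    """Set one bit for every dictionary fragment found in the structure."""
    bits = 0
    for position, fragment in enumerate(FRAGMENTS):
        if fragment in present:
            bits |= 1 << position
    return bits

def as_bitstring(bits):
    """Write the fingerprint as 0s and 1s, first dictionary entry leftmost."""
    return "".join("1" if (bits >> p) & 1 else "0"
                   for p in range(len(FRAGMENTS)))

tyrosine = fingerprint({"OH", "NH2", "COOH", "phenyl"})
print(as_bitstring(tyrosine))
```

A fragment that occurs twice (tyrosine has two OH groups) still sets its bit only once, which is the "perhaps multiple times" caveat above.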

Page 56: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Fingerprints

• fingerprints are typically 150-2500 bits long
• where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
  – sometimes several related fragments will “set” the same bit
• disadvantage is that if structure contains few fragments from the dictionary, few or no bits are set
  – can be avoided if “generalised” fragments are used (involving e.g. “any atom”, “any ring bond” types)

Page 57: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

2D structure depiction

• if structures are stored without 2D display coordinates, we need to generate them
  – e.g. SMILES
• “depiction” algorithms are used for this
• identify and lay out ring systems first
  – complications over orientation of some systems
  – Chemical Abstracts stores “standard depictions” of all ring systems it has encountered
• then add side chains, avoiding collisions
  – many features can be added to improve appearance

Page 58: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

3D structure depiction

• much more complicated than 2D
• need to store standard bond lengths and angles
• need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
• need to rotate single bonds to avoid “bumps”
• sophisticated “conformation generation” programs identify low-energy conformers
  – very useful for identifying molecules with the correct shape to fit into biological receptor sites

Page 59: Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Nomenclature generation

• most systematic nomenclature is based on ring systems
  – need to identify/prioritise ring systems first
  – identify standard numbering for system
    • frequently need to store this
  – add side chains and substituents with appropriate locants