Self-contained sequence representation (SCSR): Bridging the gap between bioinformatics and cheminformatics K. T. Taylor, W. L. Chen, B. D. Christie, Joseph L. Durant, D. L. Grier, B. A. Leland, J. G. Nourse Recent Progress in Chemical Structure Representation ACS Boston 2010: CINF Division of Chemical Information August 23, 2010

Self-Contained Sequence Representation (SCSR)

DESCRIPTION

Keith Taylor describes a new hybrid method for representing and searching biologics in chemical databases.

Page 1: Self-Contained Sequence Representation (SCSR)

Self-contained sequence representation (SCSR):

Bridging the gap between bioinformatics and cheminformatics

K. T. Taylor, W. L. Chen, B. D. Christie, Joseph L. Durant, D. L. Grier, B. A. Leland, J. G. Nourse

Recent Progress in Chemical Structure Representation
ACS Boston 2010: CINF Division of Chemical Information

August 23, 2010

Page 2: Self-Contained Sequence Representation (SCSR)

Background

• Biologicals are a significant and growing component of life science company pipelines

• Existing databases of proteins and nucleotides are largely sequence based

• Chemical modifications are generally handled as annotations

• Structure searching is focused on sequence searching
  – BLAST
  – FASTA

• Modifications are searched using text searches

Page 3: Self-Contained Sequence Representation (SCSR)

UniProt - Red Fluorescent Protein

Page 4: Self-Contained Sequence Representation (SCSR)

Residue-based Representation: Alphabet of Life

Jamey D. Marth, "A Unified Vision of the Building Blocks of Life", Nature Cell Biology, Vol. 10, pp. 1015-1016, 2008

Page 5: Self-Contained Sequence Representation (SCSR)

Residue-based Representations

• Natural
  – Capture chemically intuitive features
  – Alphabet of life

• Widespread
  – Have been implemented multiple times

• Can use a set of predefined templates to define residues

• Can use pseudo-atoms to represent residues

Page 6: Self-Contained Sequence Representation (SCSR)

Advantages of Pseudo-Atoms

• Provides significant size reduction
  – ~8x for proteins
  – ~20x for nucleotides
  – ~10x for saccharides

• Converts a large biomolecule into a "small molecule"
  – More efficient storage
  – More efficient searching
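The slide does not say how the ~8x figure for proteins was measured. One plausible reading, sketched below, is the ratio of explicit heavy atoms to pseudo-atoms when each residue collapses to a single node. The function name `size_reduction` and the test peptide are illustrative assumptions; the atom counts are the standard non-hydrogen counts for in-chain residues.

```python
# Heavy (non-hydrogen) atom counts for the 20 standard amino acid
# residues as they occur inside a chain (no terminal oxygen).
HEAVY_ATOMS = {
    "G": 4, "A": 5, "S": 6, "C": 6, "V": 7, "T": 7, "P": 7,
    "L": 8, "I": 8, "N": 8, "D": 8, "M": 8, "Q": 9, "E": 9,
    "K": 9, "H": 10, "F": 11, "R": 11, "Y": 12, "W": 14,
}


def size_reduction(sequence: str) -> float:
    """Explicit heavy atoms divided by pseudo-atoms (one per residue)."""
    explicit = sum(HEAVY_ATOMS[residue] for residue in sequence)
    return explicit / len(sequence)


if __name__ == "__main__":
    # Hypothetical peptide containing each standard residue once.
    print(size_reduction("ACDEFGHIKLMNPQRSTVWY"))
```

This counts atoms only; bonds, coordinates and real residue frequencies would shift the ratio, so it is a rough plausibility check rather than a reproduction of the slide's numbers.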

Page 7: Self-Contained Sequence Representation (SCSR)

Problems with Pseudo-Atoms

• What is the residue definition?
  – Does the cysteine pseudoatom include the sulfur?

• What is the connectivity?
  – Do I read a cyclic peptide clockwise or counter-clockwise?
  – Where is the phosphate bound?
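The cyclic-peptide question can be made concrete with a small sketch. A ring drawn only as residue labels can be read from any start residue and in either direction, so one drawing yields several different strings. The function names below are illustrative and do not come from the presentation.

```python
def rotations(seq: str) -> set[str]:
    """All strings obtained by starting the ring at a different residue."""
    return {seq[i:] + seq[:i] for i in range(len(seq))}


def cyclic_readings(seq: str) -> set[str]:
    """All readings of a ring: every start residue, both directions."""
    return rotations(seq) | rotations(seq[::-1])


if __name__ == "__main__":
    # Hypothetical four-residue ring.
    for reading in sorted(cyclic_readings("ACDE")):
        print(reading)
```

Because a peptide bond has a direction (N to C), the reversed readings describe a chemically different molecule, so a label-only drawing that does not fix the direction is ambiguous.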

Page 8: Self-Contained Sequence Representation (SCSR)

Changing R&D landscape

• Biologics are growing rapidly
  – Average 22% of pipeline, 45%-100% in some companies

• Current definitions are not safely transferable between laboratories

• Increased reliance on partners, strategic alliances and CROs
  – Data integrity is an issue

• Companies seeking one solution for traditional drugs and biologics

Page 9: Self-Contained Sequence Representation (SCSR)

Challenges - Representation

• Bioinformatics (UniProt):
  – Are compact
  – Contain shorthand descriptions of modifications
  – Require agreement on structural entities
  – Structure Activity Relationships are difficult to determine

• Cheminformatics (molfile):
  – Verbose
  – Complete
  – Not subject to interpretation
  – Structure Activity Relationships are straightforward to determine

Page 10: Self-Contained Sequence Representation (SCSR)

The Need

• Reduce the size of the object

• Retain chemical modifications

• Eliminate ambiguity

• Allow similar residues to be found using standard substructure searches

• Store biomolecules and small drug molecules in one database

• Facilitate structure property tables

Page 11: Self-Contained Sequence Representation (SCSR)

Hybrid sequences

• Ambiguous residue labels

Page 15: Self-Contained Sequence Representation (SCSR)

Hybrid sequences

• Ambiguous residue labels

[Figure: chemical structure diagrams with explicit atoms and R/S stereo labels]

Page 16: Self-Contained Sequence Representation (SCSR)

Duplicate sequences

Different sequences

Page 18: Self-Contained Sequence Representation (SCSR)

Duplicate sequences

Same structure

Different sequences

Page 19: Self-Contained Sequence Representation (SCSR)

Post Translational Modifications (PTM) Present Challenges

• Modified Residues
  – Additions to prochiral centers (ubiquitous)
  – Eliminations (serine to didehydroalanine) (lanthionine antibiotics)
  – Stereochemistry changes (L to D conversion of amino acids) (bacterial peptidoglycans)

[Figure: proline and (2S,4R)-4-hydroxyproline]

[Figure: serine and didehydroalanine]

Page 20: Self-Contained Sequence Representation (SCSR)

Challenges in Using Residues

Modifications across Residue Boundaries

• Thiazole/oxazole formation (thiazolylpeptide antibiotics)

• Imidazolinone formation (Green Fluorescent Protein)

[Figure: chemical structures illustrating the two modifications]

Page 21: Self-Contained Sequence Representation (SCSR)

Solution: Use a Hybrid Representation

• Use residues where appropriate

• Use explicit chemistry where appropriate

• Templates define the residues

• Templates capture attachment points

• No ambiguity in cyclic & cross-linked structures

Page 22: Self-Contained Sequence Representation (SCSR)

Compact, Hybrid Representation

• Self Contained Sequence Representation (SCSR)
  – Standard sequence representation
  – A set of unique residue templates (Tgroup)
  – Only the subset of Tgroups that are used in the structure is included
  – Symyx Direct cartridge keeps chemistry and sequence synchronized, enabling support for SSS and BLAST / FASTA
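The idea of storing a sequence together with only the templates it actually uses can be sketched as follows. This is a minimal illustration, not the SCSR file format: the `Template` class, the `LIBRARY` dictionary and the placeholder structure strings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str            # residue name, e.g. "Phe"
    template_class: str  # e.g. "AA" for amino acid
    structure: str       # placeholder for the full connection table


# Hypothetical global template library.
LIBRARY = {
    "Ala": Template("Ala", "AA", "<Ala connection table>"),
    "Gly": Template("Gly", "AA", "<Gly connection table>"),
    "Phe": Template("Phe", "AA", "<Phe connection table>"),
}


def self_contained_record(sequence: list[str],
                          library: dict[str, Template]) -> dict:
    """Bundle a sequence with one copy of each template it uses."""
    used = sorted(set(sequence))
    return {
        "sequence": list(sequence),
        "templates": [library[name] for name in used],
    }


if __name__ == "__main__":
    record = self_contained_record(["Phe", "Phe", "Phe", "Ala"], LIBRARY)
    print([t.name for t in record["templates"]])
```

Three Phe residues share a single stored template, and Gly is left out because the sequence never uses it. The record therefore stays compact and still carries every definition a receiving laboratory needs.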

Page 26: Self-Contained Sequence Representation (SCSR)

Features of Hybrid Representation

• Emphasis is on chemical modifications in the structure

• Chemical complexity is handled using explicit chemistry

• Compression effectively converts large biomolecules into "small molecules"

• Supports Substructure Searching (SSS)

• Substructure searching highlights chemically similar structures

• Fastsearch indexing of areas with explicit chemistry performs well and scales well

Page 27: Self-Contained Sequence Representation (SCSR)

Searching with Accelrys Representation

• Substructure searching finds chemically identical regions, regardless of what they are named

• Fast-Search indexing of areas with explicit chemistry makes searching perform well and scale well

• Complements existing sequence searching tools (BLAST, FASTA, ...)

Page 28: Self-Contained Sequence Representation (SCSR)

SSS Finds Fluorescent Proteins

[Figure: chemical structure diagram]

Page 29: Self-Contained Sequence Representation (SCSR)

SSS Finds Fluorescent Proteins

[Figure: chemical structure diagram]

Page 31: Self-Contained Sequence Representation (SCSR)

Solution: Use a Hybrid Representation

• Use residues where appropriate

• Use explicit chemistry where appropriate

• Templates define the residues

Page 32: Self-Contained Sequence Representation (SCSR)

Solution: Use a Hybrid Representation

• Template classes allow AA/DNA/RNA... in the same structure

• Templates capture attachment point information
  – removes ambiguity in cyclic and crosslinked structures

• Based on V3000 file format
  – optionally include templates in molfile

Page 33: Self-Contained Sequence Representation (SCSR)

A Flexible Framework

• Business rules define when to use:
  – a template
  – explicit chemistry

• More templates, more compression

• More explicit chemistry, more found in structure searches

• Where the templates are stored is user-defined
  – Global storage reduces molfile size
  – In-the-molfile creates a Self-Describing File
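A business rule of the kind described on this slide might look like the sketch below. The rule shown (expand a residue to explicit chemistry when it is flagged as modified or has no template) is an assumed example, not a rule stated in the presentation. `Residue` and `encode` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    code: str
    modified: bool = False


def encode(residues: list[Residue],
           template_names: set[str]) -> list[tuple[str, str]]:
    """Tag each residue as stored by template or as explicit chemistry."""
    encoded = []
    for residue in residues:
        if residue.modified or residue.code not in template_names:
            # Assumed rule: anything unusual stays fully drawn so that
            # structure searches can see it.
            encoded.append(("explicit", residue.code))
        else:
            encoded.append(("template", residue.code))
    return encoded


if __name__ == "__main__":
    chain = [Residue("Gly"), Residue("Ser", modified=True), Residue("Xyz")]
    print(encode(chain, {"Gly", "Ser"}))
```

Adding more names to `template_names` moves residues to the compact side of the trade-off. Marking more residues as modified moves them to the searchable side.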

Page 34: Self-Contained Sequence Representation (SCSR)

Template Example

Label Conventions:

• Al - left backbone (N or 5')

• Br - right backbone (C or 3')

• Cx - non-backbone connection

[Figure: residue template structure with attachment points labeled Al, Cx and Br]
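The Al / Br / Cx labels can be modelled as named attachment points on each template, with every inter-residue bond stating which label it uses at each end. The sketch below is an assumed data model for illustration only. `ResidueTemplate`, `Link` and `check_links` are hypothetical names, and the choice of Cys and Gly as examples is mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueTemplate:
    name: str
    attachment_points: frozenset  # subset of {"Al", "Br", "Cx"}


@dataclass(frozen=True)
class Link:
    res_a: int    # index of the first residue in the sequence
    label_a: str  # attachment label used on the first residue
    res_b: int
    label_b: str


CYS = ResidueTemplate("Cys", frozenset({"Al", "Br", "Cx"}))
GLY = ResidueTemplate("Gly", frozenset({"Al", "Br"}))


def check_links(sequence: list, links: list) -> list:
    """Return the links that use a label the residue's template lacks."""
    bad = []
    for link in links:
        ends = ((link.res_a, link.label_a), (link.res_b, link.label_b))
        if any(label not in sequence[idx].attachment_points
               for idx, label in ends):
            bad.append(link)
    return bad


if __name__ == "__main__":
    peptide = [CYS, GLY, CYS]
    links = [
        Link(0, "Br", 1, "Al"),  # backbone bond, residue 0 to 1
        Link(1, "Br", 2, "Al"),  # backbone bond, residue 1 to 2
        Link(0, "Cx", 2, "Cx"),  # side-chain cross-link
    ]
    print(check_links(peptide, links))
```

Because each bond names its attachment points explicitly, the direction of the chain and the location of a cross-link are recorded rather than inferred from the drawing.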

Page 35: Self-Contained Sequence Representation (SCSR)

What is registered?

• Original residue identity

• Detailed connectivity

• Residue numbering

• Description of modifications

• Sequence annotations

Page 36: Self-Contained Sequence Representation (SCSR)

Representing Cyclic Peptides

Anantin

[Figure: Anantin drawn as single-letter residues with attachment labels Al, Cx and Br and a few explicit atoms]

Page 37: Self-Contained Sequence Representation (SCSR)

Representing Modifications

Bacitracin A

[Figure: Bacitracin A drawn as single-letter residues with part of the structure shown as explicit atoms]

Page 38: Self-Contained Sequence Representation (SCSR)

Non-natural Amino Acids

[Figure: peptide structures containing non-natural amino acids, drawn with explicit atoms]

Page 39: Self-Contained Sequence Representation (SCSR)

Features of the Representation

• Emphasis is on chemical modifications in the structure

• Chemical complexity is handled using explicit chemistry

• Compression effectively converts large biomolecules into "small molecules"

• Supports Structure Searching (SSS, Similarity, Flexmatch)

Page 40: Self-Contained Sequence Representation (SCSR)

Searching

• Substructure searching finds chemically identical regions, regardless of what they are named

• Fastsearch indexing of areas with explicit chemistry makes searching perform well and scale well

• Complements existing sequence searching tools (BLAST, FASTA, ...)

Page 41: Self-Contained Sequence Representation (SCSR)

Hybrid Representation

• Represent large biomolecules, focussing on chemical modifications
  – Templates for unmodified residues
  – Explicit chemistry for modified regions

• Search for chemically modified features in large biomolecules
  – Emphasis on searching modified regions as explicit chemistry

• Complements existing sequence-based systems

• Supported in
  – Symyx Direct 7.0
  – Accelrys Draw 4.0

Page 42: Self-Contained Sequence Representation (SCSR)

Supplementary material

The following slides were not part of the formal presentation.

They reflect what was done in the live demonstration of Accelrys Draw 4.0.

Page 43: Self-Contained Sequence Representation (SCSR)

Draw 4.0 – Interpret UniProt file: human insulin

Page 49: Self-Contained Sequence Representation (SCSR)

Draw 4.0 – Edit Sequence: Modify

Page 53: Self-Contained Sequence Representation (SCSR)

Draw 4.0 – Edit Sequence: Mutate
