Self-Contained Sequence Representation (SCSR)


Keith Taylor describes a new hybrid method for representing and searching biologics in chemical databases.


Self-contained sequence representation (SCSR): Bridging the gap between bioinformatics and cheminformatics

K. T. Taylor, W. L. Chen, B. D. Christie, Joseph L. Durant, D. L. Grier, B. A. Leland, J. G. Nourse

Recent Progress in Chemical Structure Representation

ACS Boston 2010: CINF Division of Chemical Information

August 23, 2010

• Biologicals are a significant and growing component of life science company pipelines

• Existing databases of proteins and nucleotides are largely sequence based

• Chemical modifications are generally handled as annotations

• Structure searching is focused on sequence searching
– BLAST
– FASTA

• Modifications are searched using text searches

Background

UniProt - Red Fluorescent Protein

Jamey D. Marth, "A Unified Vision of the Building Blocks of Life", Nature Cell Biology, Vol 10, pg 1015-1016, 2008

Residue-based Representation: Alphabet of Life

• Natural
– Capture chemically intuitive features
– Alphabet of life

• Widespread
– Been implemented multiple times

• Can use a set of predefined templates to define residues

• Can use pseudo-atoms to represent residues

Residue-based Representations

• Provides significant size reduction
– ~8x for proteins
– ~20x for nucleotides
– ~10x for saccharides

• Converts a large biomolecule into a "small molecule"
– More efficient storage
– More efficient searching

Advantages of Pseudo-Atoms
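The size reduction quoted above for proteins can be checked with a rough count. The sketch below is illustrative and not taken from the presentation: it compares the number of heavy (non-hydrogen) atoms in an all-atom linear peptide with the number of pseudo-atoms, one per residue. The `HEAVY_ATOMS` table and the function name `compression_ratio` are assumptions introduced for this example.

```python
# Heavy (non-hydrogen) atoms per amino acid residue inside a chain:
# backbone N, CA, C, O plus the side chain. Illustrative helper table.
HEAVY_ATOMS = {
    "G": 4, "A": 5, "S": 6, "C": 6, "V": 7, "T": 7, "P": 7,
    "I": 8, "L": 8, "D": 8, "N": 8, "M": 8, "E": 9, "Q": 9,
    "K": 9, "H": 10, "F": 11, "R": 11, "Y": 12, "W": 14,
}


def compression_ratio(sequence: str) -> float:
    """Ratio of all-atom nodes to pseudo-atom nodes for a linear peptide."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # +1 for the extra oxygen of the C-terminal carboxyl group.
    explicit_atoms = sum(HEAVY_ATOMS[res] for res in sequence) + 1
    pseudo_atoms = len(sequence)  # one pseudo-atom per residue
    return explicit_atoms / pseudo_atoms


# Human insulin B chain (30 residues), used only as a sample input.
B_CHAIN = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
print(f"about {compression_ratio(B_CHAIN):.1f}x fewer nodes")
```

The ratio depends on composition: glycine-rich sequences compress less and aromatic-rich sequences compress more.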

• What is the residue definition?
– Does the cysteine pseudo-atom include the sulfur?

• What is the connectivity?
– Do I read a cyclic peptide clockwise or counter-clockwise?
– Where is the phosphate bound?

Problems with Pseudo-Atoms

• Biologics are growing rapidly
– Average 22% of pipeline, 45%-100% in some companies

• Current definitions are not safely transferable between laboratories

• Increased reliance on partners, strategic alliances and CROs
– Data integrity is an issue

• Companies seeking one solution for traditional drugs and biologics

Changing R&D landscape

Challenges - Representation

• Bioinformatics (UniProt):
– Are compact
– Contain shorthand descriptions of modifications
– Require agreement on structural entities
– Structure Activity Relationships are difficult to determine

• Cheminformatics (molfile):
– Verbose
– Complete
– Not subject to interpretation
– Structure Activity Relationships are straightforward to determine

• Reduce the size of the object

• Retain chemical modifications

• Eliminate ambiguity

• Allow similar residues to be found using standard substructure searches

• Store biomolecules and small drug molecules in one database

• Facilitate structure property tables

The Need

Hybrid sequences

• Ambiguous residue labels


[Figure: explicit chemical structure drawings with R/S stereo labels; drawings not recoverable as text]


Duplicate sequences

Same structure, different sequences
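A minimal sketch of why this matters for registration, assuming a head-to-tail cyclic peptide written as a plain one-letter string: every rotation of the string names the same ring, so a text comparison sees different sequences where the chemistry is identical. The function name `canonical_rotation` and the sample peptide are hypothetical and are not part of SCSR.

```python
def canonical_rotation(cyclic_sequence: str) -> str:
    """Pick the lexicographically smallest rotation as a canonical form."""
    n = len(cyclic_sequence)
    if n == 0:
        return ""
    doubled = cyclic_sequence + cyclic_sequence
    # Each window of length n in the doubled string is one rotation.
    return min(doubled[i:i + n] for i in range(n))


first = "GFLAV"   # ring written starting at G
second = "LAVGF"  # the same ring written starting at L

print(first == second)  # False
print(canonical_rotation(first) == canonical_rotation(second))  # True
```

A string-level fix like this only covers simple rings. The approach described in this presentation is instead to record attachment points in the residue templates, so that connectivity is explicit.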

Post-Translational Modifications (PTM) Present Challenges

• Modified Residues
– Additions to prochiral centers (ubiquitous)
– Eliminations (serine to didehydroalanine) (lanthionine antibiotics)
– Stereochemistry changes (L to D conversion of amino acids) (bacterial peptidoglycans)

[Figure: structures of proline and (2S,4R)-4-hydroxyproline, and of serine and didehydroalanine]

Challenges in Using Residues

Modifications across Residue Boundaries

• Thiazole/oxazole formation (thiazolylpeptide antibiotics)

• Imidazolinone formation (Green Fluorescent Protein)

[Figure: structures illustrating thiazole formation and imidazolinone formation across residue boundaries]

Solution: Use a Hybrid Representation

• Use residues where appropriate

• Use explicit chemistry where appropriate

• Templates define the residues

• Templates capture attachment points

• No ambiguity in cyclic & cross-linked structures

• Self-Contained Sequence Representation (SCSR)
– Standard sequence representation
– A set of unique residue templates (Tgroup)
– Only the subset of Tgroups used in the structure is included
– Symyx Direct cartridge keeps chemistry and sequence synchronized, enabling support for SSS and BLAST / FASTA

Compact, Hybrid Representation
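One way to picture the self-contained idea: a record carries its sequence plus only those templates the sequence actually uses. The sketch below is an illustration in Python under stated assumptions, not the Symyx/Accelrys file format. `TEMPLATE_LIBRARY`, `SelfContainedRecord` and `build_record` are hypothetical names, and the SMILES strings for the free amino acids merely stand in for real template definitions.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical global template library: residue code -> structure definition.
TEMPLATE_LIBRARY: Dict[str, str] = {
    "A": "C[C@H](N)C(=O)O",
    "C": "N[C@@H](CS)C(=O)O",
    "G": "NCC(=O)O",
    "S": "N[C@@H](CO)C(=O)O",
}


@dataclass
class SelfContainedRecord:
    sequence: List[str]         # residue codes in order
    templates: Dict[str, str]   # only the templates the sequence references


def build_record(sequence: List[str]) -> SelfContainedRecord:
    """Bundle a sequence with only the templates it references."""
    used = sorted(set(sequence))
    missing = [code for code in used if code not in TEMPLATE_LIBRARY]
    if missing:
        raise KeyError(f"no template for residue(s): {missing}")
    return SelfContainedRecord(
        sequence=list(sequence),
        templates={code: TEMPLATE_LIBRARY[code] for code in used},
    )


record = build_record(["G", "A", "G", "C"])
print(sorted(record.templates))  # ['A', 'C', 'G']
```

The serine template is left out because the sequence never references it, which is what keeps the record compact while still being interpretable on its own.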


Features of Hybrid Representation

• Emphasis is on chemical modifications in the structure

• Chemical complexity is handled using explicit chemistry

• Compression effectively converts large biomolecules into "small molecules"

• Supports Substructure Searching (SSS)

• Substructure searching highlights chemically similar structures

• Fastsearch indexing of areas with explicit chemistry performs well and scales well

Searching with Accelrys Representation

• Substructure searching finds chemically identical regions, regardless of what they are named

• Fast-Search indexing of areas with explicit chemistry makes searching perform well and scale well

• Complements existing sequence searching tools (BLAST, FASTA, ...)
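To make the first bullet concrete, here is a deliberately naive sketch of substructure matching on atom-and-bond graphs. It is not the Symyx Direct or Fastsearch implementation, which the slides do not describe. It ignores bond orders, stereochemistry and indexing, and all names (`find_substructure`, the record labels `Xaa`, `Mod7` and `Gly`) are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Bonds = Set[Tuple[int, int]]
Molecule = Tuple[List[str], Bonds]  # (element symbols, bonded index pairs)


def bonded(bonds: Bonds, a: int, b: int) -> bool:
    """True if atoms a and b share a bond, in either index order."""
    return (a, b) in bonds or (b, a) in bonds


def find_substructure(query: Molecule, target: Molecule) -> Optional[List[int]]:
    """Backtracking search; returns query-to-target atom indices or None."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target

    def extend(mapping: List[int]) -> Optional[List[int]]:
        q = len(mapping)  # next query atom to place
        if q == len(q_atoms):
            return mapping
        for t, element in enumerate(t_atoms):
            if t in mapping or element != q_atoms[q]:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(bonded(t_bonds, t, mapping[p])
                   for p in range(q) if bonded(q_bonds, q, p)):
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None

    return extend([])


# Query fragment: N-C-S chain.
query: Molecule = (["N", "C", "S"], {(0, 1), (1, 2)})

# Records with unrelated labels; two of them contain the fragment.
records: Dict[str, Molecule] = {
    "Xaa": (["C", "N", "C", "S", "C"], {(0, 1), (1, 2), (2, 3), (3, 4)}),
    "Mod7": (["S", "C", "N", "O"], {(0, 1), (1, 2), (1, 3)}),
    "Gly": (["N", "C", "C", "O"], {(0, 1), (1, 2), (2, 3)}),
}

hits = [name for name, mol in records.items()
        if find_substructure(query, mol) is not None]
print(hits)  # ['Xaa', 'Mod7']
```

The hits are found from the explicit atoms and bonds alone; the labels `Xaa` and `Mod7` play no part in the match.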

SSS Finds Fluorescent Proteins

[Figure: chemical structure drawings for the substructure search example; drawings not recoverable as text]


• Template classes allow AA/DNA/RNA... in the same structure

• Templates capture attachment point information
– removes ambiguity in cyclic and crosslinked structures

• Based on V3000 file format
– optionally include templates in molfile

Solution: Use a Hybrid Representation

• Business rules define when to use:
– a template
– explicit chemistry

• More templates, more compression

• More explicit chemistry, more found in structure searches

• Where the templates are stored is user-defined
– Global storage reduces molfile size
– In-the-molfile creates a Self-Describing File

A Flexible Framework
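The trade-off above can be sketched with one possible business rule: keep a residue as a template reference unless it carries a modification, in which case expand it to explicit chemistry. This is an assumed rule for illustration only; the presentation leaves the rules to the user, and `Residue` and `choose_representation` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Residue:
    code: str        # residue label, e.g. "S"
    modified: bool   # True if the residue carries a chemical modification


def choose_representation(residues: List[Residue]) -> List[str]:
    """Apply a simple rule: template unless the residue is modified."""
    return ["explicit" if r.modified else "template" for r in residues]


chain = [Residue("G", False), Residue("S", True), Residue("A", False)]
print(choose_representation(chain))  # ['template', 'explicit', 'template']
```

A stricter rule that also expands the neighbours of a modified residue would expose more atoms to structure searches at the cost of less compression.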

Label Conventions:

Al - left backbone (N or 5')

Br - right backbone (C or 3')

Cx - non-backbone connection

Template Example

[Figure: residue template drawn with attachment points Al, Cx and Br]
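The label conventions can be pictured as a small data structure in which each template maps its attachment labels to specific atoms, so that a bond between two residues always resolves to a definite atom pair. The sketch below is a hypothetical Python model, not the V3000 template syntax. The atom names and the choice of which atom carries each label are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Template:
    name: str
    atoms: List[str]              # atom names within the residue
    attachments: Dict[str, int]   # attachment label -> index into atoms


# Assumed templates: Al on the backbone N, Br on the carbonyl C,
# Cx on the cysteine sulfur.
CYS = Template("Cys", ["N", "CA", "C", "O", "CB", "SG"],
               {"Al": 0, "Br": 2, "Cx": 5})
GLY = Template("Gly", ["N", "CA", "C", "O"], {"Al": 0, "Br": 2})

# A link joins (position, label) on one residue to (position, label) on another.
Link = Tuple[Tuple[int, str], Tuple[int, str]]


def resolve(link: Link, residues: Dict[int, Template]) -> Tuple[str, str]:
    """Return the pair of atom names that a link actually bonds."""
    (pos_a, label_a), (pos_b, label_b) = link
    a, b = residues[pos_a], residues[pos_b]
    return a.atoms[a.attachments[label_a]], b.atoms[b.attachments[label_b]]


residues = {1: CYS, 2: GLY, 3: CYS}
print(resolve(((1, "Br"), (2, "Al")), residues))  # ('C', 'N')
print(resolve(((1, "Cx"), (3, "Cx")), residues))  # ('SG', 'SG')
```

Because the labels are part of the template, a cross-link such as the Cx to Cx bond needs no convention about reading direction.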

• Original residue identity

• Detailed connectivity

• Residue numbering

• Description of modifications

• Sequence annotations

What is registered?

Representing Cyclic Peptides

[Figure: Anantin and Bacitracin A drawn as residue sequences with Al, Cx and Br attachment labels and explicit chemistry for the non-standard parts]

Representing Modifications

Non-natural Amino Acids

[Figure: peptide structure drawings containing non-natural amino acids; drawings not recoverable as text]

• Emphasis is on chemical modifications in the structure

• Chemical complexity is handled using explicit chemistry

• Compression effectively converts large biomolecules into "small molecules"

• Supports Structure Searching (SSS, Similarity, Flexmatch)

Features of the Representation

• Substructure searching finds chemically identical regions, regardless of what they are named

• Fastsearch indexing of areas with explicit chemistry makes searching perform well and scale well

• Complements existing sequence searching tools (BLAST, FASTA, ...)

Searching

• Represent large biomolecules, focussing on chemical modifications
– Templates for unmodified residues
– Explicit chemistry for modified regions

• Search for chemically modified features in large biomolecules
– Emphasis on searching modified regions as explicit chemistry

• Complements existing sequence-based systems

• Supported in
– Symyx Direct 7.0
– Accelrys Draw 4.0

Hybrid Representation

The following slides were not part of the formal presentation

They reflect what was done in the live demonstration of Accelrys Draw 4.0

Supplementary material

Draw 4.0 – Interpret UniProt file: human insulin

Draw 4.0 – Edit Sequence: Modify

Draw 4.0 – Edit Sequence: Mutate
