http://www.pdb.org/ • info@rcsb.org Data Integration and Management A PDB Perspective


Page 1

http://www.pdb.org/ • info@rcsb.org

Data Integration and Management

A PDB Perspective

Page 2

What is PDB?

• Single international repository of three-dimensional data for biological macromolecules
• Public community resource
• Established at Brookhaven in 1971 (7 structures)
• Moved to RCSB in 1998
• wwPDB established in 2004
• > 25,000 structures in PDB

Page 3

Community

• Scientific Community - at all levels
  – Structural biologists (crystallography, NMR, cryo-EM)
  – Biologists
  – Computational biologists
• Journals
• General Community
  – Secondary school
  – General public
• Internal
  – RCSB PDB staff
  – wwPDB members

Page 4

Data Representation

• Macromolecular Crystallographic Information Framework
• XML DTD/Schema Mapping
• SQL Schema Mapping
• CORBA IDL Mapping
• Supporting emerging ontology representations - OWL

Page 5

Elements of Dictionary Metadata

• Data Attributes
  – Definition
  – Examples
  – Data type (primitive type/regular expression patterns)
  – Range or allowed values
• Classes
  – Categories
  – Subcategories
  – Category groups
• Associations
  – Parent-child relationships
  – Interdependencies/exclusivity
  – Methods
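As a rough illustration of how the attribute-level metadata listed above might be held in software, here is a minimal Python sketch. The class `AttributeDef`, its field names, and the sample `cell.length_a` entry with its pattern and range are hypothetical choices made for this example; they are not taken from the PDB dictionaries or from any RCSB tool.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AttributeDef:
    """Hypothetical in-memory form of one dictionary data attribute."""
    name: str
    definition: str
    pattern: str                          # regular expression describing the data type
    examples: List[str] = field(default_factory=list)
    minimum: Optional[float] = None       # range limits; only meaningful for numeric patterns
    maximum: Optional[float] = None
    allowed: Optional[Set[str]] = None    # enumerated values, if any
    category: str = ""                    # class (category) the attribute belongs to
    parent: Optional[str] = None          # parent attribute in a parent-child relationship

    def check(self, value: str) -> List[str]:
        """Return a list of problems found in value (empty if it conforms)."""
        if re.fullmatch(self.pattern, value) is None:
            return [f"{self.name}: {value!r} does not match the type pattern"]
        problems = []
        if self.allowed is not None and value not in self.allowed:
            problems.append(f"{self.name}: {value!r} is not an allowed value")
        if self.minimum is not None and float(value) < self.minimum:
            problems.append(f"{self.name}: {value} is below {self.minimum}")
        if self.maximum is not None and float(value) > self.maximum:
            problems.append(f"{self.name}: {value} is above {self.maximum}")
        return problems


# Illustrative entry only; the pattern and range are assumptions.
length_a = AttributeDef(
    name="cell.length_a",
    definition="Unit-cell length a, in angstroms.",
    pattern=r"-?\d+(\.\d+)?",
    examples=["52.3"],
    minimum=0.0,
    category="cell",
)

print(length_a.check("52.3"))
print(length_a.check("-1.0"))
```

Because the checks are driven entirely by the metadata record, adding or changing an attribute means editing data rather than code.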

Page 6

Difficult Issues

• Resolving semantic ambiguities – encoding meaning

• Integrating controlled vocabularies

• Separation of primary and derived information

• Supporting rapid evolution of science

Page 7

What’s Driving Data Definition

• IUCr-sponsored community effort

• Automated data acquisition

• Data management and data exchange for PDB

• New technologies (e.g. cryo-electron microscopy)

• High-throughput structure determination and structural genomics

Page 8

Typical Project Deposition Data Flow

[Diagram. Box labels: Target Selection, Protein Production, Crystal Production, Structure Determination, PDB Deposition, Project Database, Merged Project Data, Exchange Dictionary]

Page 9

Data Sharing Nightmare

Page 10

Incremental Data Pipeline

Page 11

Current Integration Strategy

• Provide software tools to collect bits of data from the output from each program step

• Convert data in log and output files to a common representation

• Merge the data corresponding to the successful outcome

• Provide an editor tool to enter remaining data and check consistency of results
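The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the log phrases, the item names (`resolution_high`, `r_work`, `title`), and the functions `harvest`, `merge`, and `missing_items` are all hypothetical and do not describe the actual RCSB tools. Selecting which program steps belong to the successful outcome is assumed to be done by the caller.

```python
import re

# Hypothetical mapping from phrases in program logs to common item names.
LOG_PATTERNS = {
    "resolution_high": re.compile(r"High resolution limit\s*[:=]\s*([\d.]+)"),
    "r_work": re.compile(r"R-work\s*[:=]\s*([\d.]+)"),
}

# Hypothetical list of items a complete deposition record needs.
REQUIRED_ITEMS = ["resolution_high", "r_work", "title"]


def harvest(log_text):
    """Collect known items from one program's log text into a common dict form."""
    found = {}
    for item, pattern in LOG_PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            found[item] = match.group(1)
    return found


def merge(steps):
    """Merge per-step records in order; later steps override earlier ones."""
    merged = {}
    for step in steps:
        merged.update(step)
    return merged


def missing_items(record):
    """List required items that still have to be entered by hand."""
    return [item for item in REQUIRED_ITEMS if item not in record]


# Made-up log excerpts from two steps of the successful path.
scaling_log = "High resolution limit: 1.80\n"
refinement_log = "Final R-work = 0.192\n"

record = merge([harvest(scaling_log), harvest(refinement_log)])
print(record)
print("Still to enter in the editor:", missing_items(record))
```

The list returned by `missing_items` corresponds to the data that the editor tool would ask the depositor to supply.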

Page 12

Data Deposition and Annotation

[Diagram. Labels: Depositor, ADIT, Annotate, Validate, PDB Entry, PDB ID, Validation Report, Corrections, Depositor Approval, Core DB, Archival Data, Distribution Site, Functional Annotation; stages marked Step 1 to Step 5]

Page 13

Integrated Data Processing System

[Diagram. Labels: Data Assembled by Depositor, Client Input Tool, ADIT, ADITsrv, MAXIT, Validation, Database Loader, Metadata Dictionaries, Data Views, Reports, Final Files]

Page 14

Features of System

• Different dictionaries without software changes

• Metadata customization of both functionality and content

• Automatically scales with changes in content

• Can be distributed to multiple deposition sites

• Reference data and standard nomenclature (ERFs)

• Self-monitoring

Page 15

Data Distribution

[Diagram. Labels: mmCIF Data Files (Data Reference Standard), mmCIF Parsers, XML Files, Relational Database, API Servers, Applications]

Page 16

Automatic Production of Macromolecular Structure API Components

[Diagram. Labels: PDB Exchange Dictionary + API Specific Data Dictionaries; Metamodel Framework; CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders; Database Access Classes]
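To make the idea of producing components from a dictionary concrete, here is a minimal Python sketch that derives SQL table definitions from a small dictionary description. The `DICTIONARY` contents, the type codes, the `SQL_TYPES` mapping, and the function `sql_schema` are assumptions made for illustration; the actual metamodel framework and its outputs are not described in these slides.

```python
# Hypothetical, simplified dictionary: category -> list of (attribute, type code).
DICTIONARY = {
    "cell": [("entry_id", "code"), ("length_a", "float"), ("length_b", "float")],
    "struct": [("entry_id", "code"), ("title", "text")],
}

# Assumed mapping from dictionary type codes to SQL column types.
SQL_TYPES = {
    "code": "VARCHAR(32)",
    "float": "DOUBLE PRECISION",
    "int": "INTEGER",
    "text": "TEXT",
}


def sql_schema(dictionary):
    """Generate one CREATE TABLE statement per dictionary category."""
    statements = []
    for category, attributes in dictionary.items():
        columns = ",\n".join(
            f"    {name} {SQL_TYPES[type_code]}" for name, type_code in attributes
        )
        statements.append(f"CREATE TABLE {category} (\n{columns}\n);")
    return "\n\n".join(statements)


ddl = sql_schema(DICTIONARY)
print(ddl)
```

The same dictionary structure could feed other generators (for example, for XML schemas or access classes), which is the point of keeping the definitions in one place.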

Page 17

Management

• Complex challenges in technology and sociology
• Communicate and work with diverse community
• Help create and enforce community policies and standards
• Must take advantage of the most current innovations in new technologies
• New technologies must be introduced so as to enable and not disrupt the users of the resource
• Beyond all else is the need for good data and a robust data representation

Page 18

Access

• RCSB Protein Data Bank Site
  • http://www.pdb.org/
• OpenMMS site (Java implementation)
  • http://openmms.sdsc.edu/
• RCSB PDB Software Download Site (C++ and Python implementation, NDB server)
  • http://deposit.pdb.org/mmcif/FILM/
• RCSB PDB Dictionary Resource Site
  • http://deposit.pdb.org/mmcif/
• RCSB PDB Beta Data Site
  • ftp://beta.rcsb.org/pub/pdb/uniformity/data/

Page 19

http://www.pdb.org/ • info@rcsb.org

Operated by three members of the RCSB: Rutgers, The State University of New Jersey; San Diego Supercomputer Center at the University of California, San Diego; Center for Advanced Research in Biotechnology/UMBI/NIST

The RCSB PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Office of Science, Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).

The RCSB PDB is a member of the wwPDB (http://www.wwpdb.org/)