Topics

The topics:

basic concepts of molecular biology
more on Perl
overview of the field
biological databases and database searching
sequence alignments
phylogenetic trees
protein structure prediction
microarray data analysis

The Human Genome Project

The human genome sequence is complete - almost - approximately 3 billion base pairs.

Some of these slides are adapted from Lecture Notes of Stuart M. Brown at NYU

Whole genome sequencing has now become routine

How does the human genome stack up?

Organism                              Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                  3.2 billion           25,000
Laboratory mouse (M. musculus)        2.6 billion           25,000
Mustard weed (A. thaliana)            100 million           25,000
Roundworm (C. elegans)                97 million            19,000
Fruit fly (D. melanogaster)           137 million           13,000
Yeast (S. cerevisiae)                 12.1 million          6,000
Bacterium (E. coli)                   4.6 million           3,200
Human immunodeficiency virus (HIV)    9,700                 9

U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003
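One way to read the comparison above is as a rough gene density: the average number of bases per estimated gene. The sketch below is only an illustration; the numbers are copied from the table, while the names `GENOMES`, `bases_per_gene` and `density_table` are made up for this example.

```python
# Genome size (bases) and estimated gene count, copied from the table above.
GENOMES = {
    "Human": (3_200_000_000, 25_000),
    "Laboratory mouse": (2_600_000_000, 25_000),
    "Mustard weed": (100_000_000, 25_000),
    "Roundworm": (97_000_000, 19_000),
    "Fruit fly": (137_000_000, 13_000),
    "Yeast": (12_100_000, 6_000),
    "E. coli": (4_600_000, 3_200),
    "HIV": (9_700, 9),
}


def bases_per_gene(size, genes):
    """Average number of bases per estimated gene (a crude density measure)."""
    return size / genes


def density_table(genomes=GENOMES):
    """Map each organism to its rounded bases-per-gene value."""
    return {
        name: round(bases_per_gene(size, genes))
        for name, (size, genes) in genomes.items()
    }


if __name__ == "__main__":
    for name, value in density_table().items():
        print(f"{name:18s} {value:>10,d} bases per gene")
```

With these figures, human and mouse come out at roughly 100,000 or more bases per gene, while mustard weed, with a similar gene count, needs only about 4,000. Genome size alone says little about gene number.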

The Path Forward

How does DNA impact health?
Identify and understand the differences in DNA sequence (A, T, C, G) among human populations

What do all the genes do?
Discover the functions of human genes by experimentation and by finding genes with similar functions in the model organisms

What are the functions of nongene areas?
Identify important elements in the nongene regions of DNA

How does information in the genome enable life?
Explore life at the ultimate level of the whole organism instead of single genes/proteins.

U.S. Department of Energy, 2005
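As a toy illustration of the first question above (differences in DNA sequence between individuals), the sketch below lists the positions at which two already-aligned sequences differ. The two sequences and the function name `differing_positions` are hypothetical; real variation studies work on far longer sequences and must also handle insertions and deletions.

```python
def differing_positions(seq_a, seq_b):
    """Return (position, base_a, base_b) for every site where two
    aligned, equal-length sequences differ. Positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [(i + 1, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Made-up sequences, for illustration only.
individual_1 = "ATGCCGTAAGCT"
individual_2 = "ATGCTGTAAGCC"

if __name__ == "__main__":
    for pos, a, b in differing_positions(individual_1, individual_2):
        print(f"position {pos}: {a} -> {b}")
```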

Diverse applications

Medicine – customized treatments, …
Microbes for energy and the environment – generate clean energy source, clean up toxic wastes, …
Bioanthropology – human lineage
Agriculture, livestock breeding, bioprocessing – crops & animals more resistant to diseases, efficient industrial processes, …
DNA identification – implicate people accused of crimes, identify contaminants in air, water, …

U.S. Department of Energy, 2005

Genomics: Journey to the Center of Biology

Without doubt, the greatest achievement in biology over the past millennium has been the elucidation of the mechanism of heredity. The instructions for assembling every organism on the planet are all specified in DNA sequences that can be translated into digital information and stored in a computer for analysis. As a consequence of this revolution, biology in the 21st century is rapidly becoming an information science. Powerful new types of bioinformatics will clearly be required to assimilate and interpret the data that will issue from various types of genomics research.

Eric Lander & Robert Weinberg, Science, 2000
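To make "DNA sequences translated into digital information" concrete: with four possible bases, each base can be stored in two bits. The sketch below is one possible encoding, not a standard file format; the bit assignment and the names `pack`, `unpack` and `megabytes_needed` are assumptions made for this example.

```python
# Arbitrary (assumed) two-bit code for the four bases.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(seq):
    """Pack a DNA string into a single integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from a packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def megabytes_needed(n_bases):
    """Storage needed at two bits per base, in megabytes (10**6 bytes)."""
    return n_bases * 2 / 8 / 1_000_000


if __name__ == "__main__":
    packed = pack("ACGT")
    print(packed, unpack(packed, 4))
    print(megabytes_needed(3_000_000_000), "MB for about 3 billion bases")
```

At two bits per base, a genome of about 3 billion base pairs fits in roughly 750 MB before any compression or annotation.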

Nucleic Acid Sequence Databases

The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported worldwide, and exchange new and updated entries on a daily basis.

Nucleic acid sequence Databases

EMBL (European Molecular Biology Laboratory)

GenBank (USA)

DDBJ (DNA Data Bank of Japan)

ENSEMBL (project between EMBL-EBI and the Sanger Institute, to produce and maintain automatic annotation on selected eukaryotic genomes)

dbEST (division of GenBank)

GSDB (Genome Sequence DataBase, division of GenBank)

GenBank

Once upon a time, GenBank sent out sequence updates on CD-ROM disks a few times per year.

Specialised Genomic Resources

In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources.

These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.

Specialised Genomic Resources

SGD – Saccharomyces Genome Database

UniGene – gene-oriented clusters from GenBank

TIGR – Databases of The Institute for Genomic Research

ACeDB – A C. elegans DataBase

Protein Information Resources

The primary structure of a protein is its amino acid sequence.

The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).

The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.

Levels of protein sequence and structural organisation: primary, secondary, tertiary

Primary Protein Databases

The primary structure of a protein is its amino acid sequence. These are stored in primary databases as linear alphabets that denote the constituent residues.
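Because a primary structure is just a string over the 20-letter amino acid alphabet, ordinary string tools apply to it directly. The sketch below counts residue composition; the peptide shown is a made-up example and the names `AMINO_ACIDS` and `composition` are hypothetical.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def composition(protein):
    """Count how often each residue occurs in a protein sequence
    given in one-letter code."""
    seq = protein.upper()
    unknown = set(seq) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")
    return Counter(seq)


# Made-up peptide, for illustration only.
peptide = "MKTAYIAKQR"

if __name__ == "__main__":
    for residue, count in sorted(composition(peptide).items()):
        print(residue, count)
```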

Protein sequence Databases

SWISS-PROT – Protein knowledgebase

TrEMBL – Computer-annotated supplement to Swiss-Prot

PIR – Protein Information Resource

MIPS – Munich Information Centre for Protein Sequences

NRL-3D – produced by PIR

Structure Classification DBs

Contain 3D structures available from crystallographic and spectroscopic studies

Structure Classification Databases

PDB – Protein Data Bank

CATH – Class, Architecture, Topology, Homology

SCOP – Structural Classification of Proteins

PDB: Growth (2006)

Databases concerning Mutations

dbSNP http://www.ncbi.nlm.nih.gov/SNP

HGBASE (Human Genome Variation Database) http://hgbase.cgr.ki.se

The SNP Consortium (TSC) http://snp.cshl.org

Literature Databases

PubMed http://www.ncbi.nlm.nih.gov/entrez/query

Bioinformatics Online http://www.bioinformatics.oupjournals.org

Nature http://www.nature.com

Science http://www.sciencemag.org

Systems Biology

Integrate different levels of information to understand how biological systems function

Use computational and mathematical models to analyze, model and simulate cellular networks, interactions and pathways.

Microarray

DNA microarray is a new technology to measure the level of the mRNA gene products of a living cell.

Affymetrix GeneChip® Probe Arrays

[Figure: GeneChip Probe Array and Image of Hybridized Probe Array. Labels: 1.28 cm; 24~50 µm; "Each probe cell or feature contains millions of copies of a specific oligonucleotide probe"; Hybridized Probe Cell; Oligonucleotide probe; Single stranded, fluorescently labeled cRNA target.]

Bioinformatics Tools

Database & searching
Computational algorithms
Alignment
Similarity
Clustering
Pattern Searching
Structure predictions
Statistical methods
Data visualization
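As a small example of the "pattern searching" item in the list above, the sketch below finds every occurrence of a short motif in a DNA sequence using Python's standard `re` module. The motif GAATTC is the EcoRI recognition site; the sequence and the function name `find_motif` are made up for this illustration.

```python
import re


def find_motif(sequence, motif):
    """Return the 1-based start positions of every occurrence of motif
    in sequence, including overlapping ones."""
    # A zero-width lookahead lets the regex engine report overlapping hits.
    pattern = re.compile(f"(?={re.escape(motif.upper())})")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


# Made-up sequence containing two EcoRI sites (GAATTC).
dna = "TTGAATTCAGGCGAATTCAA"

if __name__ == "__main__":
    print(find_motif(dna, "GAATTC"))
```

A plain exact-match search like this is the simplest case; searching with mismatches or degenerate bases needs a richer pattern or a dedicated alignment tool.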

Bioinformatics

Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational biology is the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
