
Biology 4900 Biocomputing. Chapter 2 Molecular Databases and Data Analysis


Page 1: Biology 4900 Biocomputing. Chapter 2 Molecular Databases and Data Analysis

Biology 4900

Biocomputing

Chapter 2

Molecular Databases and Data Analysis

Literature Databases

• Online databases available at CSU
  – Galileo
  – JSTOR
• Online databases at other sites
  – PubMed. If you find a useful article, you can check PubMed Central to see if it is available online for free.
• Where to get articles
  – PubMed Central
  – GIL
  – Interlibrary loan

Sources of Molecular Data

[Diagram: DNA → RNA → protein → phenotype, with the databases that hold each kind of data]

• DNA: genomic DNA databases
• RNA: cDNA, ESTs (Expressed Sequence Tags), UniGene
• Protein: protein sequence databases

Molecular Databases

• Primary Databases
  – Archival: sequences submitted directly from experimental sequencing results
    • Very little interpretation
    • Anyone can submit; accuracy not checked
    • Examples
      – Nucleic acid: EMBL, DDBJ, GenBank
      – Protein: Swiss-Prot, PIR, PDB
• Secondary Databases
  – Curated: sequences are validated/checked and may be annotated
    • RefSeq (nucleic acids and proteins, but limited to certain organisms)
    • TrEMBL, GenPept, UniProt

Nucleic Acid Databases

• Contain:
  – Nucleic acid sequences
    • Chain termination method (Sanger sequencing)
      – Used for sequences 100-1000 bp
    • Whole Genome Shotgun (WGS) Sequencing
      – Used for sequences >1000 bp
      – DNA chopped into little chunks
      – Chunks sequenced using the chain termination method (reads)
      – Numerous, overlapping reads are collected and assembled into sequence (computational methods)
  – Annotations for each sequence
    • Putative identification of open reading frames (ORFs = parts of gene that encode protein) in sequence
    • Putative intron (excised)/exon (retained) locations
    • Authors, dates, publication, etc.
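The "overlapping reads are collected and assembled" step can be illustrated with a toy example. This is only a minimal sketch, assuming error-free reads from a single strand and a simple greedy merge; real assemblers are far more elaborate. The function names and the three toy reads are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Toy reads (illustrative only), each overlapping the next by 4 bases
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

With these toy reads the three fragments collapse into a single contig. Reads that share no overlap of at least `min_len` bases stay as separate contigs.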

International Nucleotide Sequence Database Collaboration
(Public nucleotide and protein sequence databases)

[Diagram: GenBank, EMBL, and DDBJ, each pair linked by daily information sharing]

Name: GenBank
Location: National Institutes of Health, National Center for Biotechnology Information

Name: European Molecular Biology Laboratory (EMBL)
Location: European Bioinformatics Institute (EBI)

Name: DNA Database of Japan (DDBJ)
Location: National Institute of Genetics, Mishima

GenBank

• As of April 2011, there were approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions.
• Read the following paper: http://www.ncbi.nlm.nih.gov/pubmed/21071399
• Home Page: http://www.ncbi.nlm.nih.gov/genbank/

Organism                          Bases
Homo sapiens                      14.9 billion
Mus musculus                      8.9 billion
Rattus norvegicus                 6.5 billion
Bos taurus                        5.4 billion
Zea mays                          5.0 billion
Sus scrofa                        4.8 billion
Danio rerio                       3.1 billion
Strongylocentrotus purpuratus     1.4 billion
Oryza sativa (japonica)           1.2 billion
Nicotiana tabacum                 1.2 billion

GenBank Home Page

NCBI Resources

• PubMed
• BLAST
• OMIM
• Taxonomy Browser
• Structure

NCBI key features: PubMed

• National Library of Medicine's search service

• 21 million citations from MEDLINE & others (as of 2011)

• Links to other online journals

• http://www.ncbi.nlm.nih.gov/pubmed

• Starting point for most research

Literature Searches through PubMed


Use the pull-down menu to access related resources such as Medical Subject Headings (MeSH)

A “how to” pull-down menu links to tutorials

Use “Advanced search” to limit by author, year, language, etc.

PubMed search strategies

Try the tutorial

Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease

Try using limits (see Advanced search)

There are links to find Entrez entries and external resources
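The same Boolean query can also be sent to PubMed programmatically. The sketch below only builds the request URL for NCBI's E-utilities `esearch` endpoint; E-utilities are not covered in these slides, and the helper name `pubmed_search_url` and the `retmax` value are illustrative choices. Nothing is downloaded here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an esearch URL for a PubMed query (Boolean operators in capitals)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = pubmed_search_url("lipocalin AND disease")
print(url)
```

Opening the resulting URL in a browser should return the matching PubMed IDs as XML, subject to NCBI's usage guidelines.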

lipocalin AND disease (504 results)

lipocalin OR disease (2,500,000 results)

lipocalin NOT disease (2,370 results)

[Venn diagrams of two overlapping circles, 1 and 2, with panels for 1 AND 2, 1 OR 2, and 1 NOT 2]
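AND, OR, and NOT behave like set intersection, union, and difference. The sketch below uses made-up article IDs (not real PubMed results) to show the three operations, and then checks one consequence with the counts quoted above: every lipocalin article either mentions disease or does not, so the AND and NOT counts add up to the total for lipocalin alone.

```python
# Made-up article IDs for illustration only
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease           # lipocalin AND disease
either = lipocalin | disease         # lipocalin OR disease
only_lipocalin = lipocalin - disease # lipocalin NOT disease

print(sorted(both), sorted(either), sorted(only_lipocalin))

# AND and NOT partition the first set, so their sizes sum to its size
assert len(both) + len(only_lipocalin) == len(lipocalin)

# Applied to the counts above: 504 (AND) + 2,370 (NOT) articles mention lipocalin
lipocalin_total = 504 + 2370
print(lipocalin_total)
```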

Save Searches, Save Results, Get Papers

PubMed Author Search

Google Scholar Search

• http://scholar.google.com/

• Includes references that may not be found in PubMed

NCBI key features

A search from the NCBI main page will search:

• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes

• String search: search by author, date, keyword, publication, etc.

Classroom exercise:
• Author searches
• Paper searches
• Protein searches

NCBI key features: BLAST

BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases

[Figure label: 3CLN]

NCBI key features: OMIM

• Online Mendelian Inheritance in Man
• Catalog of human genes and genetic disorders

NCBI key features: Taxonomy Browser

• Browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• Taxonomy information such as genetic codes
• Molecular data on extinct organisms
• Useful to find a protein or gene from a species

NCBI key features: Structure

• Molecular Modeling Database (MMDB)
• Biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• Vector Alignment Search Tool (VAST)

Cn3D

• A 3D-structure viewer
• Must download (ftp://ftp.ncbi.nlm.nih.gov/cn3d/Cn3D-4.3.msi)
• Use to align structures identified as similar by VAST

Example: Researching beta globin

• Beta globin is a protein, so information about it can be found in 3 different types of databases

DNA:
• GenBank dbGSS
• GenBank dbHTGS
• GenBank dbSTS

RNA*:
• GenBank Entrez Gene
• GenBank dbEST
• UniGene
• Gene Expression Omnibus

Proteins:
• Entrez Protein
• UniProt
• PDB
• SCOP
• CATH

*Because RNA is unstable, it is typically reverse-transcribed into complementary DNA (cDNA)

Necessary (yet annoying) Definitions

• Sequence Tagged Site (STS): Small DNA fragments with both DNA sequence data and mapping data (genes assigned to chromosomes)
• Expressed Sequence Tag (EST): Partial DNA sequence of a complementary DNA (cDNA) clone
  – Typically these are randomly selected cDNA clones sequenced on a single strand (300-800 bp)
  – Useful for identifying novel genes
  – Higher rate of error

http://genome.wellcome.ac.uk/doc_WTD020755.html

UniGene

• Unique Gene (UniGene): project to create gene-oriented clusters by partitioning ESTs into non-redundant sets
  – http://www.ncbi.nlm.nih.gov/unigene
  – Ultimately there should be only 1 cluster per gene
  – Usually more than 1 due to errors
  – Types of errors
    • 2 or more clusters may represent different parts of the same gene
    • Sequence errors
    • Cloning artifacts (DNA transcribed during creation of cDNA that doesn’t correspond to authentic transcript)

[Diagram: ESTs grouped into a UniGene cluster]

UniGene

http://www.ncbi.nlm.nih.gov/unigene

GenBank Flatfile

• A format for organizing genomic sequence data. Includes the following:
• Sequence and annotations
• Header
  – Locus name or accession number: unique to sequence description
  – Size: number of nucleotide bases or amino acid residues
  – Molecule: DNA, RNA, strandedness (ds, ss), and type of RNA or DNA
  – GenBank division code: 18 divisions (PRI = primate, PLN = plant, BCT = bacterial, etc.)
  – Date of last modification
• Definition Line: brief description of sequence (e.g. source organism, protein/gene name, function)
• Accession: unique identifier for a record
• Version
  – May be more than one accession
  – Record modification (accession.1; accession.2)
  – GI: is specific to version; may be more than one
• Keywords
• Source: organism or clone description
• Reference: publications that discuss data reported
• Authors and Journal publication info
• PubMed identifier: link to sequence record (abstract)
• Features: vary (chromosomal info, coding info, protein ID, % of each nucleotide)
• Sequence Data
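The accession.version notation (accession.1, accession.2) is easy to take apart in code. The sketch below is a hypothetical helper, not part of any NCBI tool; it assumes the version is whatever follows the first dot and that an identifier without a dot carries no version.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no '.version' is present.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


print(split_accession("NP_006179.2"))
print(split_accession("X02775"))
```

The accession stays the same when a record is revised, while the version number increases, so comparing the two parts separately shows whether two identifiers refer to the same record at different revisions.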


What is an accession number?

An accession number is a label that is used to identify a sequence. It is a (unique) string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig (overlapping DNA fragments)
rs7079946    dbSNP (single nucleotide polymorphism)

RNA
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)

Protein
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######    e.g. NM_006744
Protein                NP_######    e.g. NP_006735
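Because the RefSeq prefix encodes the molecule type, an identifier can be classified with a short pattern match. This is a sketch covering only the four prefixes listed above (RefSeq defines others); the function name and the optional ".version" handling are illustrative assumptions.

```python
import re

# Only the prefixes listed on this slide
REFSEQ_PREFIXES = {
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Prefix, underscore, digits, and an optional ".version" suffix
REFSEQ_PATTERN = re.compile(r"^(NC|NT|NM|NP)_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None if unrecognized."""
    match = REFSEQ_PATTERN.match(accession)
    return REFSEQ_PREFIXES[match.group(1)] if match else None


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, "->", classify_refseq(acc))
```

X02775 is a GenBank accession rather than a RefSeq one, so the sketch returns None for it.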

UniGene Name Search: Oncomodulin

All results listed

Allows filtering

UniGene Name Search: Select Human Oncomodulin

• 4 Expressed Sequence Tags from 1 complementary DNA library
• Identifies chromosome and map position on chromosome
• Compares cluster transcripts with RefSeq proteins

UniGene Name Search: Select Human Oncomodulin

Click on link for menu of other links:
• Conserved domains
• Gene summary
• Protein sequence

Clicking on the Protein sequence link then takes you to the predicted protein sequence file (NP_006179.2)

UniGene Name Search: Select Human Oncomodulin

Once here, you can:
1. Open FASTA file
2. Run BLAST
3. Identify and view conserved domains
4. See related proteins

Access to sequences: Gene at NCBI

Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

Example: RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mRNA) or protein (NP_000509)

Because these reference records are curated, they should be more reliable than uncurated entries

Gene Name Search: Oncomodulin

• Returns list of gene entries for oncomodulin for different organisms
• Click on a highlighted link to see details

Gene Name Search: Select Human Oncomodulin

Summary of all gene information, including mapping (when available). Note that this sequence has been validated as a RefSeq.

Scrolling down, you can find a link to protein data through UniProt.

Gene Name Search: Link to Oncomodulin Protein

Protein Name Search: Oncomodulin

Notice that I filtered this search so that results show only human oncomodulin


You can change the display (as shown)…

FASTA format: versatile and compact, with one header line followed by a string of nucleotides or amino acids in the single-letter code
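The structure just described (a header line, then sequence lines) makes FASTA simple to read. The sketch below is a minimal parser written for illustration; it assumes headers start with ">" and that a sequence may wrap over several lines. The example record is made up and is not a real database entry.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up record for illustration
example = """>demo_protein illustrative record
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```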

Comparison of Gene to other resources

Gene: collects key information on each gene/protein from major databases. It covers all major organisms.

UniGene: Database with information on where in a body, when in development, and how abundantly a transcript is expressed

HomoloGene: Gathers information on sets of related proteins based on common genetic ancestry.

HomoloGene Name Search: Oncomodulin

Provides list of homologous (related) genes

HomoloGene Name Search: Oncomodulin

Shows conserved domains of protein sequences. Clicking on the graphic takes you to a summary of domain/family information.

ExPASy to access protein and DNA sequences

• ExPASy (Expert Protein Analysis System) sequence retrieval system
• Visit http://www.expasy.ch/
• Similar to Entrez for NCBI

Example: Search for calmodulin

UniProt: a centralized protein database (uniprot.org)

UniProt is separate from NCBI, but the two are interlinked.

UniProt: Calmodulin

• Search results for bovine calmodulin (P62157)

Protein Secondary Structure: PDBsum (EMBL-EBI)

• http://www.ebi.ac.uk/pdbsum/

• Either enter a PDB file or load a new/existing sequence


ExPASy: vast proteomics resources (www.expasy.ch)

Genome Browsers

Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information.

The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is not commonly used.

Ensembl genome browser (www.ensembl.org)

[Screenshot callout: click “human”]

[Screenshot callout: enter “beta globin”]


Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom).

There are various horizontal annotation tracks.

The UCSC Genome Browser

• This browser’s focus is on humans and other eukaryotes

• you can select which tracks to display (and how much information for each track)

• tracks are based on data generated by the UCSC team and by the broad research community

• you can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it

• The Table Browser is as important as the more visual Genome Browser, and you can move between the two
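One common way to format a custom track is the tab-separated BED format that the UCSC browser accepts: a `track` line followed by rows of chromosome, start, end, and name. The sketch below writes such a file body from spreadsheet-like rows. The helper name, track name, and coordinates are all hypothetical placeholders, and BED starts are assumed to be 0-based with an exclusive end.

```python
def make_bed_track(name, rows):
    """Format (chrom, start, end, label) rows as a minimal BED custom track."""
    lines = [f'track name="{name}"']
    for chrom, start, end, label in rows:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Placeholder rows, as they might appear in a spreadsheet
rows = [
    ("chr11", 1000, 2000, "featureA"),
    ("chr11", 5000, 5600, "featureB"),
]

bed_text = make_bed_track("my_track", rows)
print(bed_text)
```

The resulting text can be pasted into the browser's custom track form or saved to a file and uploaded.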


[1] Visit http://genome.ucsc.edu/, click Genome Browser

[2] Choose organisms, enter query (beta globin), hit submit

[4] On the UCSC Genome Browser: choose which tracks to display

Protein Databases

• What do they contain?
  – Amino acid sequences
    • Primary sequence
      – Direct submissions: protein sequencing
      – SWISS-PROT, PIR
    • Secondary sequence
      – Translations: putative proteins resulting from modifying (i.e. intron splicing) nucleic acid sequence
      – GenPept, TrEMBL
    • Structure
      – Protein Data Bank
  – Annotations
    • Function, domains, etc.

SWISS-PROT

• Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva
• Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio
• Few redundancies
• Direct submission (from sequencing, not translation)

PIR

• PIR (The Protein Information Resource) was created by M.O. Dayhoff in 1965
• Maintained by many
• In 2004, joined with other databases (Swiss-Prot and TrEMBL) to become part of the UniProt consortium

Protein Data Bank

• Archive of 3-D structural data of biological macromolecules
• Based on experimental data
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB)
  – Rutgers & UCSD
• As of January 11, 2012, contained 78,477 structures
• ~5,000 membrane proteins

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

PDB: Source of protein sequence and structure data