Upload
mervyn-sanders
View
213
Download
0
Tags:
Embed Size (px)
Citation preview
Bioinformatics
Jack MinOffice 3012Office hours: TR 12:15 – 4
How would you define "bioinformatics"?
Can you distinguish between a gene and a genome?
Do you know what a Blast search is?
Do you know what any of the letters in "blast" represent?
What is Genbank?What is NCBI?What is meant by "codon usage"?
Here are two nucleotide sequences:(i) AGTCGGTAACCTAAG(ii) GGCAAAUACUAAGGA
Which of these is a DNA sequence? Explain your answer.
Here is a DNA sequence: 5' ATTGGRTCCAATA 3‘
(i) What do the 5' and 3' mean?(ii) What does the R stand for?(iii) Write the sequence of the
complementary DNA strand, using the same notation.
What is meant by the "template" and "coding" strands of DNA?
Name three different kinds of RNA.Which amino acids are represented by the following symbols? A
LQKGW
International Union of Pure and Applied Chemistry
What is a PAM matrix?Can both nucleotide and amino acid sequences
be used to build molecular phylogenies?Explain the method of parsimony for building
sequence-based phylogenies.What is the program CLUSTAL used for?
What do you know about microarrays? For instance, are there different types of microarrays, and what are they used for?
This question is to test your general knowledge of genetics and molecular biology. Can you define the following terms?
(a) Gene
(b) Locus
(c) Allele
(d) Linkage
(e) Linkage disequilibrium
(f) synonymous substitution
(g) Intron
(h) concerted evolution
(i) Pleiotropy
(j) PCR
(k) RFLP
(l) Haplotype
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems
What are the goals of the course?
Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Textbook
Bioinformatics and Functional Genomics Second edition (Wiley, 2009).
ReferenceBioinformaticsThird edition (Wiley, 2005)Baxevanis and Ouellette
Web sites
The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle(or Google “moodle bioinformatics”)--This site contains the powerpoints for each lecture. including color and black & white versions--The weekly quizzes are here
The textbook website is:http://www.bioinfbook.orgThis has powerpoints, URLs, etc. organized by chapter
Literature references
You are encouraged to read original source articles(posted on moodle). They will enhance your understanding of the material. Readings are optional but recommended.
http://ghr.nlm.nih.gov/handbook.pdfhttp://www.ncbi.nlm.nih.gov/books/NBK21101/
Themes throughout the course: gene/protein families
We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course.
Globins including hemoglobin and myoglobin carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts including
--sequence alignment--gene expression--protein structure--phylogeny--homologs in various species
Computer labs
You can use any computer you can find –
Computer lab in the department, in your lab,
Computers in my lab (3013)
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins--Definition of an accession number--Four ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment: introduction
What is Bioinformatics?Broad definition:
The use of computational tools for:
acquiring,
storing and
analyzing
biological information.
Narrow definition:The use of computational tools for:
acquiring,storing andanalyzing
molecular sequence information.
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
What is bioinformatics?
Tool-users
Tool-makers
bioinformatics
public healthinformatics
medicalinformatics
infrastructure
databases algorithms
Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
DNA RNA phenotypeprotein
Page 5
Time ofdevelopment
Body region, physiology, pharmacology, pathology
Page 5
After Pace NR (1997) Science 276:734
Page 6
Fig. 2.1Page 17
Growth of GenBank + Whole Genome Shotgun(1982-November 2008)
Nu
mb
er
of s
eq
uen
ces
in G
en
Ban
k (m
illio
ns)
0
50
100
150
200
250
1982 1987 1992 1997 2002 2007
Ba
se p
air
s o
f DN
A in
Gen
Ba
nk (
bill
ions
) B
ase
pa
irs
in G
en
Ban
k +
WG
S (
billi
ons
)
DNA RNA protein
Central dogma of molecular biology
genome transcriptome proteome
Central dogma of bioinformatics and genomics
DNA RNA
cDNAESTsUniGene
phenotype
genomicDNAdatabases
protein sequence databases
protein
Fig. 2.2Page 20
GenBankEMBL DDBJ
There are three major public DNA databases
The underlying raw DNA sequences are identical
Page 16
GenBankEMBL DDBJ
Housedat EBI
EuropeanBioinformatics
Institute
There are three major public DNA databases
Housed at NCBINational
Center forBiotechnology
Information
Housed in Japan
Page 16
The Trace Archive at NCBI containsover 2 billion traces
11/08
Taxonomy at NCBI:~200,000 species are represented in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi11/08
The most sequenced organisms in GenBank
Homo sapiens 13.1 billion basesMus musculus 8.4bRattus norvegicus 6.1bBos taurus 5.2bZea mays 4.6bSus scrofa (wild boar) 3.6bDanio rerio (zebrafish) 3.0bOryza sativa (japonica) 1.5bStrongylocentrotus purpurata (sea urchins) 1.4bNicotiana tabacum 1.1b
Updated 11-6-08GenBank release 168.0Excluding WGS, organelles, metagenomics
Table 2-2Page 18
National Center for BiotechnologyInformation (NCBI)
www.ncbi.nlm.nih.gov
Page 24
www.ncbi.nlm.nih.govFig. 2.5Page 25
Fig. 2.5Page 25
PubMed is… • National Library of Medicine's search service• 16 million citations in MEDLINE• links to participating online journals• PubMed tutorial (via “Education” on side bar)
Page 24
Entrez integrates…
• the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes
Page 24
Entrez is a search and retrieval system that integrates NCBI databases
Page 24
BLAST is…
• Basic Local Alignment Search Tool• NCBI's sequence similarity search tool• supports analysis of DNA and protein databases• 100,000 searches per day
Page 25
OMIM is…
•Online Mendelian Inheritance in Man•catalog of human genes and genetic disorders•created by Dr. Victor McKusick; led by Dr. Ada Hamosh
at JHMI
Page 25
Books is…
• searchable resource of on-line books
Page 26
TaxBrowser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)• taxonomy information such as genetic codes• molecular data on extinct organisms
Page 26
Structure site includes…
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)• Cn3D (a 3D-structure viewer)• vector alignment search tool (VAST)
Page 26
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins--Definition of an accession number--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment: introduction
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or theraw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequenceor other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775 GenBank genomic DNA sequenceNT_030059 Genomic contigRs7079946 dbSNP (single nucleotide polymorphism)
N91759.1 An expressed sequence tag (1 of 170)NM_006744 RefSeq DNA sequence (from a transcript)
NP_007635 RefSeq proteinAAC02945 GenBank proteinQ28369 SwissProt protein1KT7 Protein Data Bank structure record
protein
DNA
RNA
Page 27
Five ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq[2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)[4] ExPASy Sequence Retrieval System (separate from NCBI)[5] UCSC Genome Browser
Page 27
5 ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collectskey information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635)
Page 27
From the NCBI homepage, type “beta globin”and hit “Go”
revised 11/08Fig. 2.7Page 29
revisedFig. 2.7Page 29
By applying limits, there are now fewer entries
revisedFig. 2.8Page 30
Entrez Gene (top of page)
Note that links tomany other HBB database entries are available
Entrez Gene (middle of page)
Entrez Gene (middle of page, continued)
Entrez Gene (bottom of page): RefSeqs
Entrez Gene (bottom of page): non-RefSeq accessions
Fig. 2.9Page 32
Fig. 2.9Page 32
Fig. 2.9Page 32
FASTA format:versatile, compact with
>one header line followed by a string of nucleotides or amino acids in the single letter code
Fig. 2.10Page 32
Five ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq[2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)[4] ExPASy Sequence Retrieval System (separate from NCBI)[5] UCSC Genome Browser
Page 31
DNA RNA
complementary DNA(cDNA)
protein
UniGene
Fig. 2.3Page 23
UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
Cluster sizes in UniGene
This is a gene with1 EST associated;the cluster size is 1 Fig. 2.3
Page 23
Cluster sizes in UniGene
This is a gene with10 ESTs associated;the cluster size is 10
Cluster sizes in UniGene (human)
Cluster size (ESTs) Number of clusters1 40,3002 18,5003-4 18,0005-8 13,4009-16 8,10017-32 5,200
500-1000 1,9001000-4000 9404000-16,000 7416,000-65,000 8
UniGene build 216, 11/0816000:70000[ESTC]
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look upinformation about expressed genes. UniGenedisplays information about the abundance of atranscript (expressed gene), as well as its regionaldistribution of expression (e.g. brain vs. liver).
We will discuss UniGene further in gene expression section.
Page 31
Five ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq[2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)[4] ExPASy Sequence Retrieval System (separate from NCBI)[5] UCSC Genome Browser
Page 31
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premierhuman genome web browser.
We will encounter Ensembl as we study the human genome,BLAST, and other topics.
clickhuman
enterRBP4
Five ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq[2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)[4] ExPASy Sequence Retrieval System (separate from NCBI)[5] UCSC Genome Browser
Page 33
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system(ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
Fig. 2.11Page 33
Five ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq[2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)[4] ExPASy Sequence Retrieval System (separate from NCBI)[5] UCSC Genome Browser
Page 33
[1] Visit http://genome.ucsc.edu/, click Genome Browser
[2] Choose organisms, enter query (beta globin), hit submit
Example of how to access sequence data:HIV-1 pol
There are many possible approaches. Begin at the mainpage of NCBI, and type an Entrez query: hiv-1 pol
Page 34
11/08
Page 34
Searching for HIV-1 pol:Following the “genome” link yields
a manageable five results
Example of how to access sequence data:HIV-1 pol
For the Entrez query: hiv-1 polthere are about 80,000 nucleotide or protein records(and >200,000 records for a search for “hiv-1”),but these can easily be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]--limit the output to RefSeq!
Page 34
only 1 RefSeq
over 200,000nucleotide entriesfor HIV-1
Examples of how to access sequence data:histone
query for “histone” # results
protein records 21847RefSeq entries 7544
RefSeq (limit to human) 1108NOT deacetylase 697
At this point, select a reasonable candidate (e.g.histone 2, H4) and follow its link to Entrez Gene.There, you can confirm you have the right gene/protein.
8-12-06
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins--Definition of an accession number--Four ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment: introduction
PubMed at NCBIto find literatureinformation
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.
It has >18 million records dating back to 1950s.
Page 35Updated 11-08
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE.
The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature.
Page 35
lipocalin AND disease(60 results)
lipocalin OR disease(1,650,000 results)
lipocalin NOT disease(530 results)
1 AND 2
1 OR 2
1 NOT 2
1
1
1
2
2
2
Fig. 2.12Page 34
true positive“globin” is found
“globin” is not found
“globin” is present
“globin” is absent
Article contents:
Search result:
true negativefalse negative(article discusses
globins)
false positive(article does not
discuss globins)
Pairwise sequence alignment
Outline: pairwise alignment
• Overview and examples• Definitions: homologs, paralogs, orthologs• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices• Alignment algorithms: Needleman-Wunsch, Smith-Waterman• Statistical significance of pairwise alignments
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Pairwise sequence alignment is the most fundamental operation of bioinformatics
Pairwise alignment: protein sequencescan be more informative than DNA
• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments
Page 54
• DNA can be translated into six potential proteins
5’ CAT CAA 5’ ATC AAC 5’ TCA ACT
5’ GTG GGT 5’ TGG GTA 5’ GGG TAG
Pairwise alignment: protein sequencescan be more informative than DNA
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Pairwise alignment: protein sequencescan be more informative than DNA
• Many times, DNA alignments are appropriate--to confirm the identity of a cDNA--to study noncoding regions of DNA--to study DNA polymorphisms--example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240 |||||||| |||| |||||| ||||| | ||||||||||||||||||||||||||||||| Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
Pairwise alignment The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Definitions
HomologySimilarity attributed to descent from a common ancestor.
Definitions
Page 42
HomologySimilarity attributed to descent from a common ancestor.
Definitions
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
Page 44
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59 + K++ + ++ GTW++MA + L + A glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs Homologous sequences within a single species that arose by gene duplication.
Definitions: two types of homology
Page 43
Orthologs:members of a gene (protein)family in variousorganisms.This tree showsRBP orthologs.
common carp
zebrafish
rainbow trout
teleost
African clawed frog
chicken
mouserat
rabbitcowpighorse
human
10 changes Page 43
Paralogs:members of a gene (protein)family within aspecies
apolipoprotein D
retinol-bindingprotein 4
Complementcomponent 8
prostaglandinD2 synthase
neutrophilgelatinase-associatedlipocalin
10 changesLipocalin 1Odorant-bindingprotein 2A
progestagen-associatedendometrialprotein
Alpha-1Microglobulin/bikunin
Page 44
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein 4 and -lactoglobulin
Page 46
SimilarityThe extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
IdentityThe extent to which two sequences are invariant.
Conservation Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Definitions
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein and -lactoglobulin
Identity(bar)
Page 46
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein and -lactoglobulin
Somewhatsimilar
(one dot)
Verysimilar
(two dots)
Page 46
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein and -lactoglobulin
Internalgap
Terminalgap
Page 46
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• In BLAST, it is rarely necessary to change gap values from the default.
Gaps