CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2004 Dr. Eric Rouchka
Dr. Awwad Abdoh Radwan
Dept Pharm Organic Chemistry,
Faculty of Pharmacy,
Assiut University.
29/09/2005
Required Texts
Image Source: http://www.amazon.com/
Other Bioinformatics Books
Image Source: http://www.amazon.com/
Other Reference Books
Image Source: http://www.amazon.com/
Tentative Schedule of Topics
• Overview of molecular biology
• Pairwise sequence alignment
• Multiple sequence alignment
• Sequence Databases
• Database searching
• Construction of phylogenetic trees
• RNA secondary structure prediction
• Microarray image analysis
• Sequence assembly techniques
• Gene Prediction
• Protein Folding Prediction
What is Bioinformatics/ Computational Biology?
• Bioinformatics: collection and storage of biological information
• Computational biology: development of algorithms and statistical models to analyze biological data
• The terms bioinformatics and computational biology will be used interchangeably
Why should I care?
• SmartMoney ranks Bioinformatics as #1 among next HotJobs
• Business Week 50 Masters of Innovation
• Jobs available, exciting research potential
• Important information waiting to be decoded!
http://smartmoney.com/consumer/index.cfm?story=working-june02
Why is bioinformatics hot?
• Supply/demand: few people adequately trained in both biology and computer science
• Genome sequencing, microarrays, etc. lead to large amounts of data to be analyzed
• Leads to important discoveries
• Saves time and money
What skills are needed?
• Well-grounded in one of the following areas:
  – Computer science
  – Molecular biology
  – Statistics
• Working knowledge and appreciation of the others!
Overview of Molecular Biology
• Cells
• Chromosomes
• DNA
• RNA
• Amino Acids
• Proteins
• Genome/Transcriptome/Proteome
Cells
• Complex system enclosed in a membrane
• Organisms are unicellular (bacteria, baker’s yeast) or multicellular
• Humans:
  – 60 trillion cells
  – 320 cell types
Example Animal Cell
Image Source: www.ebi.ac.uk/microarray/biology_intro.htm
Chromosomes
• In eukaryotes, nucleus contains one or several double stranded DNA molecules organized as chromosomes
• Humans:
  – 22 Pairs of autosomes
  – 1 pair sex chromosomes
Human Karyotype
Image Source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html
What is DNA?
• DNA: Deoxyribonucleic Acid
• Single stranded molecule (oligomer, polynucleotide): chain of nucleotides
• 4 different nucleotides, distinguished by their bases:
  – Adenine (A)
  – Cytosine (C)
  – Guanine (G)
  – Thymine (T)
Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• Difference is in base structure
Image Source: www.ebi.ac.uk/microarray/biology_intro.htm
DNA polynucleotides (oligomers)
• Different nucleotides are strung together to form polynucleotides
• Ends of the polynucleotide are different
• A directionality is present
• Convention is to label the coding strand from 5’ to 3’
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookDNAMOLGEN.html
Single Strand Polynucleotide
Example polynucleotide:
5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’
Double Stranded DNA
Source: unknown
Double Helix
• Two complementary DNA strands form a stable DNA double helix
• Spring 2003 marked the 50th anniversary of its discovery
Image source: www.ebi.ac.uk/microarray/biology_intro.htm
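Complementary pairing can be made concrete with a short sketch. Assuming the standard Watson-Crick pairs (A with T, C with G) and that both strands are written 5’ to 3’, the second strand is the reverse complement of the first. The function name `reverse_complement` is illustrative, and the input is the example polynucleotide from the Single Strand Polynucleotide slide.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'."""
    # The two strands run in opposite directions, so walk the input backwards.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


top = "GTAAAGTCCCGTTAGC"          # example polynucleotide, 5' to 3'
bottom = reverse_complement(top)  # its partner strand, 5' to 3'
print(top)
print(bottom)
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.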
RNA
• Ribonucleic Acid
• Important in a variety of ways, including protein synthesis
• Similar to DNA
• Thymine (T) is replaced by uracil (U)
• RNA can be:
  – Single stranded
  – Double stranded
  – Hybridized with DNA
mRNA
• Messenger RNA
• Linear molecule encoding genetic information copied from DNA molecules
• Transcription: process in which DNA is copied into an RNA molecule
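At the sequence level, transcription can be sketched in one line. In the cell the RNA is built against the template strand, so the resulting mRNA reads like the coding strand with uracil in place of thymine. The function name `transcribe` and the sample sequence are illustrative assumptions, not part of the original slides.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (5' to 3'), with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))
```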
mRNA Processing
Image source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm
tRNA structure
Source: http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm
tRNA
• Amino acid attached to each tRNA
• Which amino acid is determined by the 3 base anticodon sequence (complementary to the mRNA codon)
• Translation: process in which the nucleotide sequence of the processed mRNA is used to join amino acids together into a protein with the help of ribosomes and tRNA
Genetic Code
• 4 possible bases (A, C, G, U)
• 3 bases in the codon
• 4 * 4 * 4 = 64 possible codon sequences
• Start codon: AUG
• Stop codons: UAA, UAG, UGA
• 61 codons to code for amino acids (AUG as well)
• 20 amino acids – redundancy in genetic code
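The codon arithmetic above can be checked by enumeration. This is a minimal sketch that lists every triplet over the four RNA bases and removes the three stop codons; the variable names are illustrative.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of bases: 4 * 4 * 4 of them.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
# Codons that code for an amino acid (the start codon AUG is among them).
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons))    # 64
print(len(sense_codons))  # 61
```

With 61 coding codons and only 20 amino acids, most amino acids must be specified by more than one codon, which is the redundancy noted in the last bullet.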
20 Amino Acids
• Glycine (G, GLY)
• Alanine (A, ALA)
• Valine (V, VAL)
• Leucine (L, LEU)
• Isoleucine (I, ILE)
• Phenylalanine (F, PHE)
• Proline (P, PRO)
• Serine (S, SER)
• Threonine (T, THR)
• Cysteine (C, CYS)
• Methionine (M, MET)
• Tryptophan (W, TRP)
• Tyrosine (Y, TYR)
• Asparagine (N, ASN)
• Glutamine (Q, GLN)
• Aspartic acid (D, ASP)
• Glutamic acid (E, GLU)
• Lysine (K, LYS)
• Arginine (R, ARG)
• Histidine (H, HIS)
• START: AUG
• STOP: UAA, UAG, UGA
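Translation with the one-letter amino acid codes can be sketched as a table lookup. The block below builds the standard genetic code from its usual compact form (bases in the order U, C, A, G, with `*` marking a stop), then reads codons from the first AUG until a stop codon. The function name `translate` and the sample mRNA are illustrative assumptions; real tools also handle reading frames, ambiguous bases, and alternative genetic codes.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first stop codon (or the end of the sequence)."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# AUG GCC UAC UGG UAA: Met, Ala, Tyr, Trp, then stop.
print(translate("GGAUGGCCUACUGGUAAGC"))
```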
Proteins
• Polypeptides having a three dimensional structure.
• Primary – sequence of amino acids constituting the polypeptide chain
• Secondary – local organization into secondary structures such as α helices and β sheets
• Tertiary – three dimensional arrangements of the amino acids as they react to one another due to the polarity and resulting interactions between their side chains
• Quaternary – number and relative positions of the protein subunits
Protein Structure
Image source: www.ebi.ac.uk/microarray/biology_intro.html
Central Dogma
DNA → RNA → PROTEIN
Image source: unknown
What is a Gene?
• the physical and functional unit of heredity that carries information from one generation to the next
• DNA sequence necessary for the synthesis of a functional protein or RNA molecule
Brief History of Sequencing
• Discovery of Complementary Bases
  – Erwin Chargaff, 1950
• Discovery of DNA Double Helix
  – 1953 – only 50 years ago
  – James Watson
  – Francis Crick
  – Rosalind Franklin
Image: www.simr.org.uk/pages/biotechnology/biotechnology_2.html
History Of Genetic Code
• Genetic Code Completely uncovered (1965)
  – Marshall Nirenberg
Brief History of Sequencing
• First Protein Sequence
  – ~1955 Bovine Insulin (Fred Sanger)
• First Nucleic Acid Sequence
  – ~1965 yeast alanine tRNA (77 bases)
• Development of DNA sequencing
  – Maxam-Gilbert and Sanger Methods (1977)
Genetic Mapping
• Sex-linked genes studied since early 1900s
• Gene mapping takes off in late 1970s
  – David Botstein (RFLPs 1978)
• 1979: 579 Genes Mapped
• 2003: ~30,000 Genes Mapped
  – Mapping of Huntington’s Disease (First Disease Gene)
    • Triplet Repeat
    • 1983
    • Nancy Wexler
Shotgun Sequencing Approach
• Developed 1991 TIGR
  – Craig Venter, Hamilton Smith
• Break genome into millions of pieces
  – Sequence each piece
  – Reassemble into full genomes
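The reassembly step can be illustrated with a toy greedy merge, a sketch under strong assumptions rather than how any production assembler works: reads are error-free, all come from the same strand, and the sequence has no repeats. The helper names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # no overlaps left: these are the final contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three overlapping pieces of the example polynucleotide GTAAAGTCCCGTTAGC.
pieces = ["CCGTTAGC", "GTAAAGTC", "AGTCCCGT"]
print(greedy_assemble(pieces))
```

On real data, sequencing errors and repeated regions make the greedy choice unreliable, which is one reason so many different assemblers exist.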
Whole Genome Shotgun Approach
• reads generated directly from a whole-genome library
• assemble the genome all at once
Base calling and Assembly Software
• PHRED and PHRAP Developed (1988)
  – PHRED: Base calling software
  – PHRAP: Assists in assembly of sequenced data
Available Assemblers
• SEQAID (Peltola et al., 1984)
• CAP (Huang, 1992)
• PHRAP (Green, 1994)
• TIGR Assembler (Sutton et al., 1995)
• AMASS (Kim et al., 1999)
• CAP3 (Huang and Madan, 1999)
• Celera Assembler (Myers et al., 2000)
• EULER (Pevzner et al., 2001)
• ARACHNE (Batzoglou et al., 2002)
Human Genome Project
• Began in 1990 (US DOE – 15 years)
  – Identify all genes in human DNA
  – Determine sequence of human genome
  – Develop faster sequencing technologies
  – Develop tools for data analysis
Genomes
– Fruit Fly
– Mouse
– Rat
– Rice
– Zebra fish
– Puffer fish
– Chicken
– Dog
– Frog
Growth of GenBank
• 1982: 600,000 Bases
• 2002: 28.5 Billion Bases
Image source: www.ncbi.nlm.nih.gov
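The two figures above imply a doubling time, which can be worked out directly. This sketch assumes steady exponential growth between 1982 and 2002, which is a simplification; the variable names are illustrative.

```python
import math

bases_1982 = 600_000
bases_2002 = 28.5e9
years = 2002 - 1982

fold_increase = bases_2002 / bases_1982   # how many times larger
doublings = math.log2(fold_increase)      # number of doublings in 20 years
doubling_time_years = years / doublings

print(f"{fold_increase:,.0f}-fold growth")
print(f"doubling roughly every {doubling_time_years * 12:.0f} months")
```

Under that assumption the database grew about 47,500-fold, doubling roughly every 15 months.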
Other Notables
• Dayhoff ATLAS Database of Proteins (1960s)
• Sequence Comparison Algorithms
  – 1970, Needleman-Wunsch (global alignment)
• Protein Databank
  – Brookhaven PDB (1973)
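For readers unfamiliar with global alignment, the idea can be sketched as the commonly taught dynamic programming recurrence associated with Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not the parameters of the 1970 paper, and the sketch returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATACA"))
```

The second sequence is the first with one T removed, so the best alignment has six matches and one gap.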
Other Notables
• NMR for protein structure identification (1980)
• IntelliGenetics Founded
  – DNA and Protein sequence analysis (1980)
Other Notables
• Smith-Waterman algorithm
  – Local sequence alignment (1981)
• GenBank Database created (1982)
• Genetics Computer Group Founded
  – GCG suite (1982)
• PCR First Described (1985)
Other Notables
• FASTP Algorithm – Protein database searching (1985)
• SWISS-PROT – Protein Database (1986)
Other Notables
• PERL Programming Language
  – Allows for sequence manipulation (1987)
• NCBI Established (1988)
• Human Genome Initiative (1988)
Other Notables
• FASTA Program released (1988)
  – DNA and Protein sequence database searches
• BLAST Program released (1990)
  – Allows for quick database searches
• Informax Founded (1990)
• Human Genome Project Begins (1990)
Other Notables
• First Commercial Microarray chips produced (1996)
• Dolly Cloned (1997)
• Capillary Sequencing machines introduced (1997)
Microarrays
• Microarray:
  – New Technology (less than 10 years old)
• Allows study of thousands of genes at same time
  – Study genes under different conditions
  – Glass slide of DNA molecules
    • Molecule: string of bases
    • Uniquely identifies gene or unit to be studied
Microarray Image Analysis
• Microarrays detect gene interactions: 4 colors:
  – Green: high control
  – Red: High sample
  – Yellow: Equal
  – Black: None
• Problem is to quantify image signals
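One common way to quantify a spot is the log ratio of the sample (red) channel to the control (green) channel. The sketch below maps a pair of intensities onto the four colors listed above. The background cutoff, the two-fold threshold, the pseudocount, and the function names are all hypothetical values chosen for illustration; real pipelines also need spot finding, background subtraction, and normalization.

```python
import math


def log_ratio(red: float, green: float, pseudocount: float = 1.0) -> float:
    """log2(sample / control); the pseudocount avoids division by zero."""
    return math.log2((red + pseudocount) / (green + pseudocount))


def spot_color(red: float, green: float,
               background: float = 50.0, fold: float = 2.0) -> str:
    """Classify one spot from its red (sample) and green (control) intensities."""
    if red < background and green < background:
        return "black"    # no signal in either channel
    ratio = log_ratio(red, green)
    threshold = math.log2(fold)
    if ratio >= threshold:
        return "red"      # higher in sample
    if ratio <= -threshold:
        return "green"    # higher in control
    return "yellow"       # roughly equal


for red, green in [(4000, 500), (500, 4000), (1000, 1100), (10, 20)]:
    print(red, green, spot_color(red, green))
```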