CAP5510 – Bioinformatics, Fall 2009, Tamer Kahveci, CISE Department, University of Florida


1

CAP5510 – Bioinformatics, Fall 2009

Tamer Kahveci

CISE Department

University of Florida

2

Vital Information

• Instructor: Tamer Kahveci

• Office: E436

• Time: Mon/Wed/Thu 3:00 - 3:50 PM

• Office hours: Mon/Wed 2:00-2:50 PM

• TA: TBA

• Course page: http://www.cise.ufl.edu/~tamer/teaching/fall2009

3

Goals

• Understand the major components of bioinformatics data and how computer technology is used to understand this data better.

• Learn the main potential research problems in bioinformatics and gain background information.

4

This Course will

• Give you a feeling for main issues in molecular biological computing: sequence, structure and function.

• Give you exposure to classic biological problems, as represented computationally.

• Encourage you to explore research problems and make a contribution.

5

This Course will not

• Teach you biology.

• Teach you programming.

• Teach you how to be an expert user of off-the-shelf molecular biology computer packages.

• Force you to make a novel contribution to bioinformatics.

6

Course Outline

• Introduction to terminology
• Biological sequences
• Sequence comparison
– Lossless alignment (DP)
– Lossy alignments (BLAST, etc.)
• Substitution matrices, statistics
• Multiple alignment
• Phylogeny
• Protein structures and function (primary, secondary, etc.)
• Structure alignment
• Structure prediction ?
• Pathways

7

Grading

• Homework (35%)
• Project (50%)
– Contribution (2.5% bonus)
• Survey (15%)

How can I get an A ?

Bioinformatics Daily: First homework is posted

8

Expectations

• Require
– Data structures and algorithms
– Coding (C, Java)

• Encourage
– Actively participate in discussions in the classroom
– Read bioinformatics literature in general
– Attend colloquiums on campus

• Academic honesty

9

Text Book

• Not required, but recommended.
• Class notes + papers.

10

Where to Look ?

• Journals
– Bioinformatics
– Genome Research
– Nucleic Acids Research
– Journal of Computational Biology
– Protein Science

• Conferences
– RECOMB
– ISMB
– PSB
– CSB
– VLDB, ICDE, SIGMOD

11

What is Bioinformatics?

• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.

From NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html

12

Does biology have anything to do with computer science?

13

Challenges 1/6

• Data diversity
– DNA (ATCCAGAGCAG)
– Protein sequences (MHPKVDALLSR)
– Protein structures
– Microarrays
– Pathways
– Bio-images
– Time series

14

Challenges 2/6

• Database diversity
– GenBank, SwissProt, …
– PDB, Prosite, …
– KEGG, EcoCyc, MetaCyc, …

15

Challenges 3/6

• Database size
– GenBank: as of August 2009, there are over 85,759,586,764 bases.
– 400 K protein sequences, each about 300 long
– 50K protein structures in PDB. 400K in Modbase.

Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Pekso, Nature 401: 115-116 (1999)

16

Challenges 4/6

• Moore’s Law matched by growth of data
• CPU vs. disk
– As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial.

[Figure: structures in PDB, 1980–1995, and CPU instruction time (ns) vs. number of protein domain structures, 1979–1995]

17

Challenges 5/6

• Deciphering the code
– Within same data type: hard
– Across data types: harder

caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact

18

Challenges 6/6

• Inaccuracy

• Redundancy

19

What is the Real Solution?

We need better computational methods

• Compact summarization
• Fast and accurate analysis of data
• Efficient indexing

20

A Gentle Introduction to Molecular Biology

21

Goals

• Understand major components of biological data
– DNA, protein sequences, expression arrays, protein structures

• Get familiar with basic terminology

• Learn commonly used data formats

22

Genetic Material: DNA

• Deoxyribonucleic Acid, 1950s
– Basis of inheritance
– Eye color, hair color, …

• 4 nucleotides – A, C, G, T

23

Chemical Structure of Nucleotides

Purines

Pyrimidines

24

Making of Long Chains

5’ -> 3’

25

DNA structure

• Double stranded, helix (Watson & Crick)

• Complementary
– A-T
– G-C

• Antiparallel
– 5’ -> 3’ (downstream)
– 3’ -> 5’ (upstream)

• Animation (ch3.1)

26

Base Pairs

27

Question

• 5’ - GTTACA – 3’

• 5’ – XXXXXX – 3’ ?

• 5’ – TGTAAC – 3’

• Reverse complements.
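The question above can be checked with a few lines of Python. This is a minimal sketch, assuming sequences are plain strings over A, C, G, T written 5’ -> 3’; the function name `reverse_complement` is only illustrative.

```python
# Watson-Crick pairing from the "DNA structure" slide: A-T and G-C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string written 5' -> 3'."""
    # Complement every base, then reverse so the result also reads 5' -> 3'.
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GTTACA"))  # prints TGTAAC
```

Applying the function twice returns the original strand, which is a handy sanity check.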

28

Repetitive DNA

• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting

• Interspersed repeats: moderately repetitive
– LINE
– SINE

• Proteins contain repetitive patterns too
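To make the idea of a tandem repeat concrete, here is a rough sketch that scans a DNA string for micro satellite-like runs. It assumes the second figure in each line above is the repeat unit length (so 1 to 6 bp for micro satellites) and uses an arbitrary threshold of at least three adjacent copies; the function name and threshold are illustrative, and practical repeat finders also tolerate mismatches, which this one does not.

```python
import re

# A unit of 1-6 bases followed by at least two more adjacent copies of it.
# The lazy quantifier prefers the shortest unit that repeats at a position.
TANDEM = re.compile(r"([ACGT]{1,6}?)\1{2,}")

def find_microsatellites(seq):
    """Return (start, unit, copies) for each non-overlapping exact tandem repeat."""
    hits = []
    for m in TANDEM.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

for start, unit, copies in find_microsatellites("TTGACACACACAGGATCATCATCTT"):
    print(start, unit, copies)
```

On this toy input the scan reports an AC repeat with four copies and an ATC repeat with three copies.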

29

Genetic Material: an Analogy

• Nucleotide => letter
• Gene => sentence
• Contig => chapter
• Chromosome => book
– Gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
   • http://gslc.genetics.utah.edu/units/disorders/karyotype/
– Chromosome number varies for species
   • http://www.web-books.com/MoBio/Free/Ch1C2.htm
– We have 46 (23 + 23) chromosomes
   • http://www.web-books.com/MoBio/Free/Ch1C5.htm
• Complete genome => volumes of encyclopedia
• Hershey & Chase experiment shows that DNA is the genetic material. (ch14)

31

Functions of Genes 2/2

• Movement: contracting in order to pull things together or push things apart.

• Transcription control: deciding when other genes should be turned ON/OFF
– Animation (ch7)

• Trafficking: affecting where different elements end up inside the cell

32

Central Dogma

33

Introns and Exons 1/2

34

Introns and Exons 2/2

• Humans have about 35,000 genes = 40,000,000 DNA bases = 3% of total DNA in genome.

• Remaining 2,960,000,000 bases for control information. (e.g. when, where, how long, etc...)
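A quick arithmetic check of these figures, assuming the genome size is simply the sum of the two base counts above. With these numbers the gene fraction comes out closer to 1.3% than 3%, so the 3% is probably best read as a rounded upper estimate.

```python
GENE_BASES = 40_000_000        # bases in genes, from the slide
OTHER_BASES = 2_960_000_000    # remaining bases, from the slide

genome_bases = GENE_BASES + OTHER_BASES
gene_fraction = GENE_BASES / genome_bases

print(f"{genome_bases:,} bases, {gene_fraction:.1%} in genes")
# prints 3,000,000,000 bases, 1.3% in genes
```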

35

Central dogma

DNA (Genotype) -> Gene expression -> Protein (Phenotype)

36

Gene Expression

• Building proteins from DNA
– Promoter sequence: start of a gene, 13 nucleotides.

• Positive regulation: proteins that bind to DNA near promoter sequences increase transcription.

• Negative regulation

37

Microarray

Animation on creating microarrays

38

Amino Acids

• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• ~300 amino acids in an average protein, ~400 K known protein sequences

• How many nucleotides can encode one amino acid?
– 4^2 < 20 < 4^3 (16 < 20 < 64)
– E.g., Q (glutamine) = CAG
– Degeneracy
– Triplet code (codon)
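The counting argument and the degeneracy point can be illustrated with the standard genetic code. The sketch below builds the 64-entry codon table from a compact string (codons enumerated in T, C, A, G order, `*` for stop); the helper names `translate` and `synonymous_codons` are illustrative, not from any library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate coding DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def synonymous_codons(aa):
    """All codons that encode the given one-letter amino acid."""
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)

print(len(CODON_TABLE))          # 4^3 = 64 codons
print(CODON_TABLE["CAG"])        # Q
print(synonymous_codons("Q"))    # ['CAA', 'CAG']
```

Because 64 codons map onto 20 amino acids plus stop, most amino acids have several codons, which is what degeneracy refers to.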

39

Triplet Code

40

Molecular Structure of Amino Acid

Side Chain

• Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
• Polar, Hydrophilic (S, T, C, Y, N, Q)
• Electrically charged (D, E, K, R, H)
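As a small illustration of the side-chain grouping, the sketch below tallies the classes in a protein string, using exactly the three groups listed on this slide. The function name `side_chain_profile` is hypothetical, and a letter outside the 20 standard amino acids raises a `KeyError`.

```python
from collections import Counter

# Side-chain groups exactly as listed on the slide.
SIDE_CHAIN_CLASS = {}
for label, residues in (
    ("non-polar", "GAVLIMFWP"),
    ("polar", "STCYNQ"),
    ("charged", "DEKRH"),
):
    for residue in residues:
        SIDE_CHAIN_CLASS[residue] = label

def side_chain_profile(protein):
    """Count how many residues fall into each side-chain class."""
    return Counter(SIDE_CHAIN_CLASS[r] for r in protein.upper())

# The example protein fragment from the "Challenges 1/6" slide.
print(side_chain_profile("MHPKVDALLSR"))
```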

41

Peptide Bonds

42

Direction of Protein Sequence

Animation on protein synthesis (ch15)

43

Data Format

• GenBank

• EMBL (European Mol. Biol. Lab.)

• SwissProt

• FASTA

• NBRF (Nat. Biomedical Res. Foundation)

• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII
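Of these formats, FASTA is the simplest: each record is a header line starting with `>` followed by one or more sequence lines. A minimal reader might look like the sketch below; the record names in the example are made up, and the sequences are the fragments from the "Challenges 1/6" slide.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

example = """>seq1 hypothetical DNA fragment
ATCCAGAGCAG
>seq2 hypothetical peptide
MHPKV
DALLSR
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```

The other formats carry richer annotation (features, references, cross-links) and are usually read with an existing library rather than by hand.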

44

Primary Structure of Proteins

[Figure: backbone dihedral angles phi1, psi1, phi2, …; 2N angles]

45

Secondary Structure: Alpha Helix

• 1.5 Å translation per residue
• 100 degree rotation per residue
• Phi = -60
• Psi = -60
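The translation and rotation figures above determine the familiar helix parameters. The short calculation below is a sketch that treats both values as per-residue quantities; `helix_length` is an illustrative helper that gives only a rough axial length.

```python
RISE_PER_RESIDUE = 1.5     # angstroms along the helix axis, from the slide
TURN_PER_RESIDUE = 100.0   # degrees of rotation, from the slide

# One full turn is 360 degrees, so 360 / 100 = 3.6 residues per turn.
residues_per_turn = 360.0 / TURN_PER_RESIDUE
# Pitch: axial distance covered by one full turn, 3.6 * 1.5 = 5.4 angstroms.
pitch = residues_per_turn * RISE_PER_RESIDUE

def helix_length(n_residues):
    """Approximate axial length (angstroms) of an ideal helix of n residues."""
    return n_residues * RISE_PER_RESIDUE

print(residues_per_turn, pitch, helix_length(10))
```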

46

Secondary Structure: Beta sheet

• Anti-parallel, parallel
• Phi = -135
• Psi = 135

47

Ramachandran Plot

Sample PDB entry (http://www.rcsb.org/pdb/)

48

Tertiary Structure

• 3-d structure of a polypeptide sequence
– Interactions between non-local atoms

[Figure: tertiary structure of myoglobin]

49

Quaternary Structure

• Arrangement of protein subunits

[Figures: quaternary structure of Cro; human hemoglobin tetramer]

50

Structure Summary

• 3-d structure determined by protein sequence

• Prediction remains a challenge

• Diseases caused by misfolded proteins
– Mad cow disease

• Classification of protein structure

51

STOP

Next Week:
• Basic sequence comparison
• Dynamic programming methods
– Global/local alignment
– Gaps