Introduction to Algorithms in Computational Biology Lecture 1 This class has been edited from Nir Friedman’s lecture which is available at nir

.

Introduction to Algorithms in Computational Biology

Lecture 1

This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger.

Background Readings: The first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

http://www.cs.huji.ac.il/~pmai



2

Course InformationMeetings:

Lecture, by Dan Geiger: Mondays 16:30 –18:30, Taub 4. Tutorial, by Ydo Wexler: Tuesdays 10:30 – 11:30, Taub 2.

Grade: 20% in five question sets. These questions sets are obligatory. Each

contains 4-6 theoretical problems. Submit in pairs in two weeks time 80% test. Must pass beyond 55 for the homework’s grade to count

Information and handouts:

www.cs.technion.ac.il/~cs236522

A brochure with zeroxed material at Taub library

http://www.cs.technion.ac.il/~cs236522

3

Course PrerequisitesComputer Science and Probability Background Data structure 1 (cs234218) Algorithms 1 (cs234247) Probability (any course)

Some Biology Background Formally: None, to allow CS students to take this course. Recommended: Biology 1 (especially for those in the

Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).

Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.

4

Relations to Some Other Courses

Intro to Bioinformatics (cs236523). This course covers practical aspects and hands on experience with web-based bioinformatics Software . Albeit not a formal requirement, it is recommended that you look on the web site http://webcourse.technion.ac.il/234523/ and examine the relevant software.

Algorithms in Computational Biology (cs236522). This is the current course which focuses on modeling some bioinformatics problems and presents algorithms for their solution.

Bioinformatics project (cs5236524). Developing bioinformatics tools under close guidance.

http://webcourse.technion.ac.il/234523/

5

First Homework Assignment

Solve two of the questions for Chapter 2 and two of the questions for Chapter 3.

Due time: During the third tutorial class, or earlier in the teaching assistant’s mail slot. Recall to submit in pairs.

Read carefully the first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

6

Computational Biology

Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.

Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat narrower by restricting the field to molecular Biology only.

7

Examples of Areas of Interest

• Building evolutionary trees from molecular (and other) data• Efficiently assembling genomes of various organisms• Understanding the structure of genomes (SNP, SSR, Genes)• Understanding function of genes in the cell cycle and disease• Deciphering structure and function of proteins

8

Exponential growth of biological information: growth of sequences, structures, and literature.

9

Four Aspects

Biological What is the task?

Algorithmic How to perform the task at hand efficiently?

Learning How to adapt/estimate/learn parameters and

models describing the task from examples

Statistics How to differentiate true phenomena from

artifacts

10

Example: Sequence Comparison

Biological Evolution preserves sequences, thus similar genes might

have similar function

Algorithmic Consider all ways to “align” one sequence against

another

Learning How do we define “similar” sequences? Use examples to

define similarity

Statistics When we compare to ~106 sequences, what is a random

match and what is true one

11

Course Goals

Learning about computational tools for (primarily) molecular biology.

We will cover computational tasks that are posed by modern molecular biology

We will discuss the biological motivation and setup for these tasks

We will understand the kinds of solutions that exist and what principles justify them

12

Topics I

Dealing with DNA/Protein sequences: Finding similar sequences Models of sequences: Hidden Markov Models Gene finding Genome projects and how sequences are found

13

Topics II

Models of genetic change: Long term: evolutionary changes among species Reconstructing evolutionary trees from sequences Short term: genetic variations in a population Finding genes by linkage and association

14

Topics III (One class, if time allows)

Protein World: How proteins fold - secondary & tertiary structure How to predict protein folds from sequences data How to analyze proteins changes from raw

experimental measurements (MassSpec)

15

Human Genome

Most human cells contain

46 chromosomes:

2 sex chromosomes (X,Y):

XY – in males.

XX – in females.

22 pairs of chromosomes named autosomes.

16

DNA OrganizationS

ourc

e: A

lber

ts e

t al

17

The Double HelixS

ourc

e: A

lber

ts e

t al

18

DNA Components

Four nucleotide types: Adenine Guanine Cytosine Thymine

Hydrogen bonds(electrostatic connection): A-T C-G

19

Genome Sizes

E.Coli (bacteria) 4.6 x 106 bases Yeast (simple fungi) 15 x 106 bases Smallest human chromosome 50 x 106 bases Entire human genome 3 x 109 bases

20

Genetic Information

Gene – basic unit of genetic information. They determine the inherited characters.

Genome – the collection of genetic information.

Chromosomes – storage units of genes.

21

GenesThe DNA strings include: Coding regions (“genes”)

E. coli has ~4,000 genes Yeast has ~6,000 genes C. Elegans has ~13,000 genes Humans have ~32,000 genes

Control regions These typically are adjacent to the genes They determine when a gene should be

expressed “Junk” DNA (unknown function)

22

The Cell

All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.

23

Example: Tissues in Stomach

How is this variety encoded and expressed ?

24

Central Dogma

Transcription

mRNA

Translation

ProteinGene

cells express different subset of the genesIn different tissues and under different conditions

שעתוק תרגום

25

Transcription

Coding sequences can be transcribed to RNA

RNA nucleotides: Similar to DNA, slightly different backbone Uracil (U) instead of Thymine (T)

Sou

rce:

Mat

hew

s &

van

Hol

de

26

Transcription: RNA Editing

Exons hold information, they are more stable during evolution.This process takes place in the nucleus. The mRNA molecules diffuse through the nucleus membrane to the outer cell plasma.

1. Transcribe to RNA2. Eliminate introns3. Splice (connect) exons* Alternative splicing exists

27

RNA roles Messenger RNA (mRNA)

Encodes protein sequences. Each three nucleotide acids translate to an amino acid (the protein building block).

Transfer RNA (tRNA) Decodes the mRNA molecules to amino-acids. It connects

to the mRNA with one side and holds the appropriate amino acid on its other side.

Ribosomal RNA (rRNA) Part of the ribosome, a machine for translating mRNA to

proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.

...

28

Translation (Outside the nucleolus)

Translation is mediated by the ribosome Ribosome is a complex of protein & rRNA

molecules The ribosome attaches to the mRNA at a

translation initiation site Then ribosome moves along the mRNA sequence

and in the process constructs a sequence of amino acids (polypeptide) which is released and folds into a protein.

29

Genetic Code

There are 20 amino acids from which proteins are build.

30

Protein Structure

Proteins are poly-peptides of 70-3000 amino-acids

This structure is (mostly) determined by the sequence of amino-acids that make up the protein

31

Protein Structure

32

Evolution

Related organisms have similar DNA Similarity in sequences of proteins Similarity in organization of genes along the

chromosomes Evolution plays a major role in biology

Many mechanisms are shared across a wide range of organisms

During the course of evolution existing components are adapted for new functions

33

Evolution

Evolution of new organisms is driven by Diversity

Different individuals carry different variants of the same basic blue print

Mutations The DNA sequence can be changed due to

single base changes, deletion/insertion of DNA segments, etc.

Selection bias

34

The Tree of Life

Sou

rce:

Alb

erts

et

al

35

Example for Phylogenetic AnalysisInput: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.

Question: Which evolutionary tree best explains these sequences ?

AGAAAA

GGAAAG

AAA AAA

AAA

21 1

Total #substitutions = 4

One Answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (Also called phylogenetic tree).

36

Example ContinuedThere are many trees possible. For example:

AGAGGA

AAAAAG

AAA AGA

AAA

11

1


GGAAAA

AGAAAG

AAA AAA

AAA

11 2


The left tree is “better” than the right tree.

Questions:Is this principle yielding realistic phylogenetic trees ? (Evolution)How can we compute the best tree efficiently ? (Computer Science)What is the probability of substitutions given the data ? (Learning)Is the best tree found significantly better than others ? (Statistics)

37

Werner’s Syndrome

A successful application of genetic linkage analysis

38

The Disease

First references in 1960s Causes premature ageing Linkage studies from 1992 WRN gene cloned in 1996 Subsequent discovery of mechanisms involved in

wild-type and mutant proteins

39

A sample Input

The study used 13 Markers; here we see only one.

The study used 14 families; here we see only one.

2

4

5

1

3

H

A1/A1

D

A2/A2

H

A1/A2

D

A1/A2

H

A2/A2

D DA1 A2

H DA1 A2

H | DA2 | A2

D DA2 A2

Recombinant

Phase inferred

40

Genehunter Output

position LOD_score information 0.00 -1.254417 0.224384 1.52 2.836135 0.226379 ...[data skipped]...

18.58 13.688599 0.384088 19.92 14.238474 0.401992 21.26 14.718037 0.426818 22.60 15.159389 0.462284 22.92 15.056713 0.462510 23.24 14.928614 0.463208 23.56 14.754848 0.464387

...[data skipped]...

81.84 1.939215 0.059748 90.60 -11.930449 0.087869

distance between markers in centi-

morgans

Most ‘likely’ position

D8S339D8S131

D8S259

Marker’s name

Log likelihood of placing disease

gene at distance, relative to it being

unlinked.

Maximum log likelihood score

41

Final Location

Marker D8S131

Marker D8S259

location of marker D8S339

WRN Gene final location

Error in location by genetic linkage of about 1.25M base pairs.

Documents

Introduction to Algorithms in Computational Biology Lecture 1 This class has been edited from Nir Friedman’s lecture which is available at nir