Lecture 1 – Sep 27, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 Welcome to CSE 527: Computational Biology 1

Page 1

Lecture 1 – Sep 27, 2011
CSE 527 Computational Biology, Fall 2011

Instructor: Su-In Lee
TA: Christopher Miles

Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022

Welcome to CSE 527: Computational Biology

Page 2

Who is the instructor?

Prof. Su-In Lee
  Assistant Professor
  Joint faculty member in Computer Science & Engineering and Genome Sciences
  Office hours: Wednesday 1:30-2:30

Research interests
  Developing machine learning techniques applied to computational biology (genetics, systems biology)
  Predictive medicine, translational medicine

Page 3

Teaching assistant
  Christopher Miles (CSE PhD student)
  Office: TBA
  Office hours: Monday 1:30-2:30
  Email: [email protected]

Page 4

What is the Coolest Thing a Computational/Mathematical Scientist Can Do?

  Curing cancer.
  Understanding how the blueprint of life (DNA) determines important traits (e.g. diseases).
  Predicting your disease susceptibilities based on your biological information, including DNA sequence.
  Predicting sudden changes in the condition of patients in the ICU (intensive care unit).
  Determining the order of A, G, C and T in my 3-billion-long DNA sequence.
  :

CSE 527 will provide you with basic concepts and ML/statistical techniques that you can use to realize these goals.

Page 5

More and More of Biology is Becoming an Information Science

Biological information (data)
  DNA sequence information
  RNA levels of 30K genes
  Protein levels of 30K genes
  DNA molecule’s 3D structure
  :

[Figure: the cell, the basic unit of life. Genes (~30,000 in human) on the DNA are expressed as RNA and then as protein. The figure also labels gene expression, gene regulation, RNA degradation and a gene interaction map.]

  DNA:     AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC
  RNA:     AUGUGGAUUGUU   AUGCGCGUC   AUGUUACGCACCUAC   AUGAUUGAU
  Protein: MWIV           MRV         MLRTY             MID

A cell’s biological state can be described by millions of numbers!

Which biological discoveries we can make depends heavily on the computational method we use to analyze the data.

Machine learning techniques provide very effective tools.

Page 6

Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics

Page 7

Goals of this course

Introduction to Computational Biology
  Basic concepts and scientific questions
  Basic biology for computational scientists
  In-depth coverage of ML techniques
  Current active areas of research

Useful machine learning (ML) algorithms
  Probabilistic graphical models, clustering, classification
  Learning techniques (MLE, EM)

Page 8

Topics in CSE 527

Part 1: Basic ML algorithms
  Introduction to probabilistic models
  Bayesian networks, hidden Markov models
  Representation and learning

Part 2: Topics in computational biology and areas of active research
  Genetics, systems biology, predictive medicine, sequence analysis
  Finding genetic factors for complex biological traits
  Inferring biological networks from data
  Comparative genomics
  DNA/RNA sequence analysis

Page 9

Course responsibilities

Class participation and attendance (10%)
  Good answers to the questions asked in class
  Initiating a productive discussion

Homework assignments (40%)
  Four problem sets
  Due at beginning of class
  Up to 3 late days (24-hr period) for the quarter
  Collaboration allowed
    Teams of 2 or 3 students
    Individual writeups

Final project (50%)
  A group of up to two students

Page 10

Project overview (1/2)

Topic
  Choose from the list of project topics on the course website, or come up with your own.
  Open-ended

Project deliverables
  Project proposal (due 10/19)
  Midterm report (due 11/16)
  Final report (due 12/14)
  Final presentations or poster session (12/7)

Page 11

Project overview (2/2)

Final report
  Short report (up to 10 pages)
  Conference-style presentation
  Successful project reports can be submitted to computational biology/ML conferences (ISMB, RECOMB, NIPS, ICML)
  or journals (PLoS journals, Nature journals, PNAS, Genome Research and so on)

Page 12

Reading material

Lecture notes
  Mostly based on recent papers & old seminar papers

Biological background
  The Cell: A Molecular Approach, by Cooper
  Genetics: From Genes to Genomes, by Hartwell and more
  Principles of Population Genetics, by Hartl & Clark

Computational background
  Probabilistic Graphical Models, by Profs. Daphne Koller & Nir Friedman
  Prof. Andrew Ng’s machine learning lecture notes (cs229.stanford.edu)

No textbook required for the course

Page 13

Class resources

Course website: cs.washington.edu/527
  Lecture notes, assignments, project topics
  Deadlines of assignments and projects

Mailing list: [email protected]

Page 14

Outline
  Course logistics
  A zero-knowledge based introduction to biology
    Prepared by George Asimenos (PhD student, Stanford) for CS262 Computational Genomics by Prof. Serafim Batzoglou (Stanford).
  Potential project topics

Page 15

Cells: Building Blocks of Life

Key terms: cell, nucleus, cytoplasm, mitochondrion

[Figure: cell image, © 1997-2005 Coriell Institute for Medical Research]

Eukaryotes: plants, animals, humans
  DNA resides in the nucleus
  Contain other compartments for other specialized functions

Prokaryotes: bacteria
  Do not contain compartments
  Little recognizable substructure

Page 16

DNA: “Blueprints” for a cell

Genetic information is encoded in long strings of double-stranded DNA.

The building blocks of deoxyribonucleic acid (DNA) come in only four flavors: Adenine, Cytosine, Guanine, Thymine

Page 17

Nucleotide

Key terms: deoxyribose, nucleotide, base, A, C, G, T, 3’, 5’

[Figure: chemical structure of a nucleotide: a deoxyribose sugar ring with a phosphate group, with bonds labeled “to previous nucleotide”, “to next nucleotide” and “to base”, and the 3’ and 5’ ends marked.]

Bases: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)

Let’s write “AGACC”!

Page 18

“AGACC” (backbone)

Page 19

“AGACC” (DNA)

Key terms: deoxyribonucleic acid (DNA)

[Figure: AGACC drawn as DNA, with the 5’ and 3’ ends labeled]

Page 20

DNA is double stranded

Key terms: strand, reverse complement

[Figure: two paired strands running in opposite directions, each with its 5’ and 3’ ends labeled]

DNA is always written 5’ to 3’

AGACC or GGTCT
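The two spellings on this slide name the same double-stranded molecule: reading the opposite strand 5’ to 3’ gives the reverse complement. A minimal Python sketch of that operation follows; the function name `reverse_complement` is illustrative and not part of the course material.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' to 3'.
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original sequence, which is a quick sanity check when handling strand orientation.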

Page 21

DNA Packaging

Key terms: histone, nucleosome, chromatin, chromosome, centromere, telomere

[Figure: DNA (~146 bp) wraps around histones H2A, H2B, H3 and H4 to form a nucleosome, with histone H1 also shown; nucleosomes pack into chromatin, which forms a chromosome with its centromere and telomere labeled.]

Page 22

The Genome

The genome is the full set of hereditary information for an organism.

Humans bundle two copies of the genome into 46 chromosomes in every cell
  = 2 x (1-22 + X/Y), that is, 2 x (22 autosomes + 1 sex chromosome) = 46

Page 23

Building an organism

[Figure: cell, DNA]

Every cell has the same sequence of DNA

Subsets of the DNA sequence determine the identity and function of different cells

Page 24

From DNA To Organism

?

Proteins do most of the work in biology, and are encoded by subsequences of DNA, known as genes.

Page 25

RNA

Key terms: ribonucleotide, U

[Figure: chemical structure of a ribonucleotide. Compared with the DNA nucleotide, the sugar carries an additional OH group; bonds are labeled “to previous ribonucleotide”, “to next ribonucleotide” and “to base”, and the 3’ and 5’ ends are marked.]

Bases: Adenine (A), Cytosine (C), Guanine (G), Uracil (U)

T → U

Page 26

Genes & Proteins

Key terms: gene, transcription, translation, protein

Double-stranded DNA
  5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
  3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’

(transcription)

Single-stranded RNA
  AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

(translation)

protein

Page 27

Gene Transcription

Key terms: promoter

  5’ G A T T A C A . . . 3’
  3’ C T A A T G T . . . 5’

Page 28

Gene Transcription

Key terms: transcription factor, binding site, RNA polymerase

Transcription factors (TFs): proteins that bind to DNA and help initiate gene transcription.

Transcription factor binding sites: short sequences of DNA (6-20 bp) recognized and bound by TFs.

RNA polymerase binds a complex of TFs in the promoter.

  5’ G A T T A C A . . . 3’
  3’ C T A A T G T . . . 5’

Page 29

Gene Transcription

The two strands are separated

  5’ G A T T A C A . . . 3’
  3’ C T A A T G T . . . 5’

Page 30

Gene Transcription

An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template

  5’ G A T T A C A . . . 3’   (DNA)
  3’ C T A A T G T . . . 5’   (DNA, template)
     G A U U A C A             (RNA)
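The copying step on this slide can be sketched in a few lines of Python. This is only an illustration of base pairing, not a model of RNA polymerase; the function names `transcribe` and `transcribe_coding` are assumptions made for this example.

```python
# DNA template base -> RNA base that pairs with it (A pairs with U in RNA).
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_3_to_5: str) -> str:
    """RNA (5'->3') made from a template strand given in 3'->5' order."""
    return template_3_to_5.upper().translate(TEMPLATE_TO_RNA)


def transcribe_coding(coding_5_to_3: str) -> str:
    """Same RNA, read off the non-template strand by replacing T with U."""
    return coding_5_to_3.upper().replace("T", "U")


print(transcribe("CTAATGT"))         # GAUUACA
print(transcribe_coding("GATTACA"))  # GAUUACA
```

Both routes give the same RNA, which is why the RNA is described as a copy of the 5’→3’ strand even though it is built on the other one.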

Page 31

Gene Transcription

  pre-mRNA  5’ G A U U A C A . . . 3’

  5’ G A T T A C A . . . 3’
  3’ C T A A T G T . . . 5’

Page 32

RNA Processing

Key terms: 5’ cap, polyadenylation, exon, intron, splicing, UTR, mRNA

[Figure: the pre-mRNA receives a 5’ cap and a poly(A) tail, and its introns are spliced out; the resulting mRNA consists of exons, with a 5’ UTR and a 3’ UTR at its ends.]
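Splicing can be pictured as keeping the exon intervals of the pre-mRNA and discarding everything between them. The sketch below is a simplified illustration: the transcript, its exon coordinates and the function name `splice` are made up for this example, and the 5’ cap and poly(A) tail are not modeled.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Made-up transcript: two exons around one intron with GU...AG boundaries.
exon1 = "GGAUGGCC"
intron = "GUAAGUUUUCAG"
exon2 = "UGGACUUGAAA"
pre_mrna = exon1 + intron + exon2

exon_coords = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(pre_mrna)),
]
mrna = splice(pre_mrna, exon_coords)
print(mrna)
```

In this toy transcript the mature mRNA is simply the two exons joined end to end.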

Page 33

Gene Structure

[Figure: a gene drawn 5’ to 3’: promoter, 5’ UTR, exons separated by introns, 3’ UTR; regions are marked as coding or non-coding.]

Page 34

How many? (Human Genome)

Genes: ~20,000
Exons per gene: ~8 on average (max: 148)
Nucleotides per exon: 170 on average (max: 12k)
Nucleotides per intron: 5,500 on average (max: 500k)
Nucleotides per gene: 45k on average (max: 2.2M)
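A back-of-the-envelope check, using only the averages quoted on this slide, shows how little of a typical gene is exonic. Multiplying averages is only a rough approximation (it is not the same as averaging per-gene totals), so the total below is expected to be in the same range as the quoted 45k rather than to match it.

```python
# Averages quoted on the slide (human genome).
exons_per_gene = 8
nt_per_exon = 170
nt_per_intron = 5_500

exonic_nt = exons_per_gene * nt_per_exon            # 1,360
intronic_nt = (exons_per_gene - 1) * nt_per_intron  # 38,500 (7 introns between 8 exons)
total_nt = exonic_nt + intronic_nt                  # 39,860

print(f"exonic: {exonic_nt}, intronic: {intronic_nt}, total: {total_nt}")
print(f"exonic fraction: {exonic_nt / total_nt:.1%}")
```

Under these assumptions only about 3% of the gene's length ends up in exons.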

Page 35

From RNA to Protein

Proteins are long strings of amino acids joined by peptide bonds
Translation from RNA sequence to amino acid sequence is performed by ribosomes
20 amino acids: 3 RNA letters are required to specify a single amino acid

Page 36

Amino acid

Key terms: amino acid

[Figure: chemical structure of an amino acid: an amino group, a central carbon carrying a hydrogen and a side chain R, and a carboxyl group.]

There are 20 standard amino acids

Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine

Page 37

Proteins

Key terms: N-terminus, C-terminus

[Figure: amino acids in a protein chain, each linked “to previous aa” and “to next aa”. The chain runs from the N-terminus (start) to the C-terminus (end), following the mRNA from 5’ to 3’.]

Page 38

Translation

Key terms: ribosome, codon

The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.

[Figure: ribosome on the mRNA, with its P site and A site labeled]

Page 39

The genetic code

Mapping from a codon to an amino acid
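The mapping can be written down as a 64-entry lookup table. The sketch below builds the standard genetic code from its conventional one-line encoding (bases in U, C, A, G order, `*` for stop) and checks the counting argument for why codons have three letters; the variable names are choices made for this example.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Two letters give 4**2 = 16 words, too few for 20 amino acids; three give 64.
assert 4 ** 2 < 20 <= 4 ** 3

codons_per_aa = Counter(CODON_TABLE.values())
print(len(CODON_TABLE))        # 64 codons
print(codons_per_aa["*"])      # 3 stop codons
print(len(codons_per_aa) - 1)  # 20 amino acids
print(CODON_TABLE["AUG"])      # M (Met, also the start codon)
```

Because 61 sense codons map onto 20 amino acids, most amino acids have several codons, which is what makes silent mutations possible.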

Page 40

Translation

  5’ . . . AUU AUG GCC UGG ACU UGA . . . 3’
           UTR Met Ala Trp Thr Stop

Start codon: AUG (Met)
Stop codon: UGA
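The reading of this example mRNA can be reproduced with a short Python sketch. It assumes the simplest possible rule (start at the first AUG, stop at the first in-frame stop codon), which ignores much of real translation initiation; the function name `translate` is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT, i.e. Met Ala Trp Thr
```

The leading AUU is skipped because it lies in the UTR, before the start codon.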

Page 41

Translation

[Figure: tRNA molecules carrying amino acids (Met, Ala, Trp) pair with the codons of the mRNA]

  5’ . . . AUU AUG GCC UGG ACU UGA . . . 3’

Page 42

Errors?

Key terms: mutation

What if the transcription / translation machinery makes mistakes?

What is the effect of mutations in coding regions?

Page 43

Reading Frames

Key terms: reading frame

  G C U U G U U U A C G A A U U A G

  GCU UGU UUA CGA AUU AG
  G CUU GUU UAC GAA UUA G
  GC UUG UUU ACG AAU UAG
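The three ways of grouping the same mRNA into codons can be enumerated mechanically. A minimal sketch, with the function name `reading_frames` chosen for this example:

```python
def reading_frames(mrna: str):
    """Return the complete codons for each of the three forward reading frames."""
    return [
        [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        for offset in range(3)
    ]


for offset, frame in enumerate(reading_frames("GCUUGUUUACGAAUUAG")):
    print(offset, " ".join(frame))
# 0 GCU UGU UUA CGA AUU
# 1 CUU GUU UAC GAA UUA
# 2 UUG UUU ACG AAU UAG
```

Only complete triplets are returned, so leftover bases at either end of a frame are dropped.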

Page 44

Synonymous Mutation

Key terms: synonymous (silent) mutation, fourfold site

  Original:  GCU UGU UUA CGA AUU AG
             Ala Cys Leu Arg Ile

  A → G:     GCU UGU UUG CGA AUU AG
             Ala Cys Leu Arg Ile

Page 45

Missense Mutation

Key terms: missense mutation

  Original:  GCU UGU UUA CGA AUU AG
             Ala Cys Leu Arg Ile

  U → G:     GCU UGG UUA CGA AUU AG
             Ala Trp Leu Arg Ile

Page 46

Nonsense Mutation

Key terms: nonsense mutation

  Original:  GCU UGU UUA CGA AUU AG
             Ala Cys Leu Arg Ile

  U → A:     GCU UGA UUA CGA AUU AG
             Ala STOP
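The three kinds of single-base substitution shown on the last three slides differ only in what happens to the affected codon, so they can be told apart with a codon-table lookup. The sketch below is an illustration under simple assumptions (reading frame starting at index 0, standard genetic code); `classify_point_mutation` is a name invented for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(mrna, pos, new_base, frame=0):
    """Classify a substitution at 0-based index pos as synonymous, missense or nonsense."""
    start = pos - (pos - frame) % 3  # first index of the codon containing pos
    old_codon = mrna[start:start + 3]
    if len(old_codon) < 3:
        raise ValueError("position lies in an incomplete trailing codon")
    offset = pos - start
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if new_aa == old_aa:
        return "synonymous"
    if new_aa == "*":
        return "nonsense"
    return "missense"


mrna = "GCUUGUUUACGAAUUAG"
print(classify_point_mutation(mrna, 8, "G"))  # synonymous (UUA -> UUG, Leu -> Leu)
print(classify_point_mutation(mrna, 5, "G"))  # missense   (UGU -> UGG, Cys -> Trp)
print(classify_point_mutation(mrna, 5, "A"))  # nonsense   (UGU -> UGA, Cys -> stop)
```

Note that the same position (index 5) yields a missense or a nonsense mutation depending on the new base.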

Page 47

Frameshift

Key terms: frameshift

  Original:        GCU UGU UUA CGA AUU AG
                   Ala Cys Leu Arg Ile

  One U deleted:   GCU UGU UAC GAA UUA G
                   Ala Cys Tyr Glu Leu
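Unlike a substitution, a single-base deletion changes every codon downstream of it. A minimal sketch of the slide's example, with helper names (`codons`, `delete_base`) invented for illustration:

```python
def codons(mrna: str, frame: int = 0):
    """Split an mRNA into complete triplets, starting at the given offset."""
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]


def delete_base(mrna: str, pos: int) -> str:
    """Remove the base at 0-based index pos."""
    return mrna[:pos] + mrna[pos + 1:]


original = "GCUUGUUUACGAAUUAG"
shifted = delete_base(original, 7)  # drop one U from the UUU run

print(codons(original))  # ['GCU', 'UGU', 'UUA', 'CGA', 'AUU']
print(codons(shifted))   # ['GCU', 'UGU', 'UAC', 'GAA', 'UUA']
```

The first two codons are untouched, while every codon after the deletion is different, matching the change from Leu Arg Ile to Tyr Glu Leu.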

Page 48

Transcription and translation

[Illustration from Radboud University Nijmegen]

Let’s see how this happens!
  Transcription: http://www.youtube.com/watch?v=DA2t5N72mgw
  Translation: http://www.youtube.com/watch?v=WkI_Vbwn14g&feature=related

Page 49

Gene Expression Regulation

Key terms: regulation, signal transduction

When should each gene be expressed?
  Regulate gene expression

Examples:
  Make more of gene A when substance X is present
  Stop making gene B once you have enough
  Make genes C1, C2, C3 simultaneously

Why? Every cell has the same DNA but each cell expresses different proteins.

Signal transduction: one signal is converted to another
  A cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins, ...

Page 50

Gene Regulation

Gene expression is controlled at many levels
  DNA chromatin structure
  Transcription
  Post-transcriptional modification
  RNA transport
  Translation
  mRNA degradation
  Post-translational modification

Page 51

Transcription regulation

Much gene regulation occurs at the level of transcription.

Primary players:
  Binding sites (BS) in cis-regulatory modules (CRMs)
  Transcription factor (TF) proteins
  RNA polymerase II

Primary mechanism:
  TFs link to BSs
  Complex of TFs forms
  Complex assists or inhibits formation of the RNA polymerase II machinery

Page 52

Transcription Factor Binding Sites

Short, degenerate DNA sequences recognized by particular TFs

For complex organisms, cooperative binding of multiple TFs is required to initiate transcription

[Figure: binding sequence logo]
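A sequence logo summarizes a set of aligned binding sites by how conserved each position is. The sketch below computes the two quantities a logo is usually built from, per-position base frequencies and information content in bits, on a handful of made-up sites. The sites are hypothetical, not data from the lecture, and the small-sample correction used by some logo tools is omitted.

```python
import math
from collections import Counter

# Hypothetical aligned binding sites for one TF (not data from the lecture).
SITES = ["GATTACA", "GATTACA", "GACTACA", "GATTATA"]


def position_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    n = len(sites)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]


def information_content(freqs):
    """Bits per position: 2 minus the Shannon entropy (no small-sample correction)."""
    return [2 + sum(p * math.log2(p) for p in column.values()) for column in freqs]


freqs = position_frequencies(SITES)
consensus = "".join(max(column, key=column.get) for column in freqs)
bits = information_content(freqs)
print(consensus)
print([round(b, 2) for b in bits])
```

Fully conserved positions score 2 bits, while the "degenerate" positions, where the sites disagree, score less.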

Page 53

Summary

All hereditary information is encoded in double-stranded DNA
Each cell in an organism has the same DNA
DNA → RNA → protein
Proteins have many diverse roles in the cell
Gene regulation diversifies protein products within different cells

Page 54

Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics

Page 55

Which Drug Should Patient X Be Treated With?

Example project topic #1 (1/3)

Say that a cancer patient X undergoes chemotherapy.

There are >200 drugs patient X can be treated with. How do doctors choose which drug to use in chemotherapy treatment?

[Figure: Patient X, described by a few histologic features; the diagnoses shown are follicular lymphoma and diffuse large B cell lymphoma.]

Chemotherapy drugs
  5-Iodotubercidin
  Acrichine
  ARQ-197
  Arsenic trioxide
  AS101
  AS-703026
  AT-7519
  Axitinib
  Azacitidine
  :

How can we improve this?

Page 56

Which Drug Should Patient X Be Treated With?

Example project topic #1 (2/3)

Doctors cannot handle millions of numbers! How about computers?

[Figure: Patient X, with follicular lymphoma and diffuse large B cell lymphoma shown as diagnoses, is now described by much more data.]

Data for Patient X
  A few histologic features
  DNA sequence: …ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…
  RNA levels of genes
  Protein levels of genes
  Epigenetics (methylation)

Chemotherapy drugs
  5-Iodotubercidin
  Acrichine
  ARQ-197
  Arsenic trioxide
  AS101
  AS-703026
  AT-7519
  Axitinib
  Azacitidine
  :

Page 57

Let’s Build a Prediction Model

Example project topic #1 (3/3)

This is a pure machine learning problem!

[Figure: RNA levels of 30,000 genes (g1, g2, ..., g30,000) in cancer cells from ~100 patients at UWMC are paired with drug sensitivity test results for 160 drugs (Drug 1, ..., Drug i, ..., Drug 160); the model predicts the drug sensitivities of Patient X.]

30,000 features! (feature selection)
Prior knowledge on drugs’ targets
Publicly available RNA level data from >3000 patients (transfer learning, feature reconstruction)

In collaboration with Tony Blau, Pam Becker, Ray Monnat, David Hawkins (Medicine)

Goal: realizing personalized cancer treatment
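To make "feature selection" concrete, here is one of the simplest possible approaches: rank genes by how strongly their RNA level correlates with the measured sensitivity to one drug, and keep the top few. This is a generic filter method shown for illustration, not the method used in the project; the data are a toy example and the function names are invented here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (neither constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_genes(expression, sensitivity, k):
    """Indices of the k genes whose RNA level correlates most strongly with sensitivity."""
    scores = [abs(pearson(gene, sensitivity)) for gene in expression]
    return sorted(range(len(expression)), key=lambda g: scores[g], reverse=True)[:k]


# Toy data: 4 genes (rows) x 6 patients (columns); the lecture has ~30,000 genes.
expression = [
    [2, 4, 6, 8, 10, 12],
    [5, 1, 4, 2, 6, 3],
    [6, 5, 4, 3, 1, 2],
    [1, 1, 2, 1, 2, 1],
]
sensitivity = [1, 2, 3, 4, 5, 6]  # made-up response to one drug

print(top_genes(expression, sensitivity, k=2))  # [0, 2]
```

With ~100 patients and 30,000 genes, many genes will correlate with the outcome by chance, which is one reason the slide points to prior knowledge and to additional public data.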

Page 58

How Well Can We Predict Disease-related Traits Based on DNA?

Example project topic #2 (1/2)

[Figure: DNA sequences of N individuals (N instances, p≈10^6 sequence variations s1, s2, ..., sp) are paired with a trait, obesity. The allele at the highlighted site is listed after each sequence.]

  Individual 1     …ACTCGGTAGACCTAAATTCGGCCCGG…   A
  Individual 2     …ACCCGGTAGACCTTTATTCGGCCCGG…   A
  Individual 3     …ACCCGGTAGACCTTAATTCGGCCGGG…   A
  :
  Individual N-1   …ACCCGGTAGTCCTATATTCGGCCCGG…   T
  Individual N     …ACTCGGTAGTCCTATATTCGGCCGGG…   T

  A → thin, T → fat

Standard approach
  Find a simple rule!
  Failed to detect the DNA variation affecting many important traits.

[Figure: sequence variations s1, s2, ..., sp act on the trait through the cell, a complex system, together with environmental factors; individual effects can be too weak to be detected. Causality?]

One of the most important research problems in this area is to develop new computational methods that can represent more complicated interactions between sequence variation and trait.
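The "simple rule" of the standard approach can be sketched as a scan over variable sites, looking for one where the allele alone determines the trait. The sequences below are the five shown on the slide; the trait labels are an assumption chosen to match the slide's rule (A: thin, T: fat), and the function names are invented for this example. Real association studies use statistical tests over many individuals rather than an exact match.

```python
def variable_sites(seqs):
    """Positions where the aligned sequences do not all carry the same base."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def simple_rule_sites(seqs, traits):
    """Variable positions where each allele occurs with exactly one trait value."""
    hits = []
    for i in variable_sites(seqs):
        traits_by_allele = {}
        for seq, trait in zip(seqs, traits):
            traits_by_allele.setdefault(seq[i], set()).add(trait)
        if all(len(values) == 1 for values in traits_by_allele.values()):
            hits.append(i)
    return hits


# The five sequences shown on the slide.
seqs = [
    "ACTCGGTAGACCTAAATTCGGCCCGG",
    "ACCCGGTAGACCTTTATTCGGCCCGG",
    "ACCCGGTAGACCTTAATTCGGCCGGG",
    "ACCCGGTAGTCCTATATTCGGCCCGG",
    "ACTCGGTAGTCCTATATTCGGCCGGG",
]
# Assumed labels, chosen to match the slide's rule "A: thin, T: fat".
traits = ["thin", "thin", "thin", "fat", "fat"]

print(variable_sites(seqs))             # [2, 9, 13, 14, 23]
print(simple_rule_sites(seqs, traits))  # [9]
```

Only one of the five variable sites follows the rule here. A trait shaped by many weak or interacting sites, or by the environment, would leave this kind of scan empty-handed, which is the limitation the slide describes.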

Page 59

How Well Can We Predict Disease-related Traits Based on DNA?

Example project topic #2 (2/2)

Longitudinal study
Environmental factors
  Age, sex, smoking status

[Figure: sequence information (sequence variations s1, s2, s3, s4, ..., sP; p≈10^6, feature selection) for ~2000 subjects is linked to phenotype data (cholesterol, fatty acid, glucose, insulin) recorded from Year 0 to Year 25.]

Age-specific genetic influence
Structural learning

In collaboration with Alex Reiner (Epidemiology)

Page 60

More project topics at the course website!

Questions?