Lecture 1 – Sep 27, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022
Welcome to CSE 527: Computational Biology
Who is the instructor?
Prof. Su-In Lee
  Assistant Professor
  A joint faculty member: Computer Science & Engineering, Genome Sciences
  Office hours: Wednesday 1:30-2:30
  Research interests: developing machine learning techniques applied to
    Computational Biology (genetics, systems biology)
    Predictive Medicine, Translational Medicine
Teaching assistant
Christopher Miles (CSE PhD student)
  Office: TBA
  Office hours: Monday 1:30-2:30
  Email: [email protected]
What is the Coolest Thing a Computational/Mathematical Scientist Can Do?
  Curing cancer.
  Understanding how the blueprint of life (DNA) determines important traits (e.g. diseases).
  Predicting your disease susceptibilities based on your biological information, including DNA sequence.
  Predicting sudden changes in the condition of patients in the ICU (intensive care unit).
  Determining the order of A, G, C and T in my 3-billion-long DNA sequence.
  ...
CSE 527 will provide you with basic concepts and ML/statistical techniques that you can use to realize these goals.
More and More of Biology is Becoming an Information Science
Biological information (data)
  DNA sequence information
  RNA levels of 30K genes
  Protein levels of 30K genes
  DNA molecule’s 3D structure
  ...
Cell: The basic unit of life
DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC
RNA: AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU
Protein: MWIV, MRV, MLRTY, MID
Gene (~30,000 in human)
[Figure: genes in the DNA, their RNAs, and their proteins; panels labeled gene expression, RNA degradation, gene regulation, gene interaction map]
A cell’s biological state can be described by millions of numbers!
The biological discoveries we can make depend heavily on the computational methods we use to analyze the data.
Machine learning techniques provide very effective tools.
Outline
Course logistics
A zero-knowledge based introduction to biology
Potential project topics
Goals of this course
Introduction to Computational Biology
  Basic concepts and scientific questions
  Basic biology for computational scientists
  In-depth coverage of ML techniques
  Current active areas of research
Useful machine learning (ML) algorithms
  Probabilistic graphical models, clustering, classification
  Learning techniques (MLE, EM)
Topics in CSE 527
Part 1: Basic ML algorithms
  Introduction to probabilistic models
  Bayesian networks, Hidden Markov models
  Representation and learning
Part 2: Topics in computational biology and areas of active research
  Genetics, systems biology, predictive medicine, sequence analysis
  Finding genetic factors for complex biological traits
  Inferring biological networks from data
  Comparative genomics
  DNA/RNA sequence analysis
Course responsibilities
Class participation and attendance (10%)
  Good answers to the questions asked in class
  Initiating a productive discussion
Homework assignments (40%)
  Four problem sets
  Due at beginning of class
  Up to 3 late days (24-hr period) for the quarter
  Collaboration allowed: teams of 2 or 3 students, individual writeups
Final project (50%)
  A group of up to two students
Project overview (1/2)
Topic
  Choose from the list of project topics on the course website, or come up with your own.
  Open-ended
Project deliverables
  Project proposal (due 10/19)
  Midterm report (due 11/16)
  Final report (due 12/14)
  Final presentations or poster session (12/7)
Project overview (2/2)
Final report
  Short report (up to 10 pages)
  Conference-style presentation
  Successful project reports can be submitted to computational biology/ML conferences (ISMB, RECOMB, NIPS, ICML)
  Or journals (PLoS journals, Nature journals, PNAS, Genome Research and so on)
Reading material
Lecture notes
  Mostly based on recent papers & old seminar papers
Biological background
  The Cell: A Molecular Approach by Cooper
  Genetics: From Genes to Genomes by Hartwell et al.
  Principles of Population Genetics by Hartl & Clark
Computational background
  Probabilistic Graphical Models by Profs. Daphne Koller & Nir Friedman
  Prof. Andrew Ng’s machine learning lecture notes (cs229.stanford.edu)
No textbook required for the course
Class resources
Course website: cs.washington.edu/527
  Lecture notes, assignments, project topics
  Deadlines of assignments and projects
Mailing list: [email protected]
Outline
Course logistics
A zero-knowledge based introduction to biology
  Prepared by George Asimenos (PhD student, Stanford) for CS262 Computational Genomics by Prof. Serafim Batzoglou (Stanford).
Potential project topics
Cells: Building Blocks of Life
[Figure © 1997-2005 Coriell Institute for Medical Research]
cell, nucleus, cytoplasm, mitochondrion
Eukaryotes: plants, animals, humans
  DNA resides in the nucleus
  Contain other compartments for other specialized functions
Prokaryotes: bacteria
  Do not contain compartments
  Little recognizable substructure
DNA: “Blueprints” for a cell
  Genetic information is encoded in long strings of double-stranded DNA
  Deoxyribonucleic acid (DNA) comes in only four flavors: Adenine, Cytosine, Guanine, Thymine
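Nucleotide
[Figure: chemical structure of a nucleotide, a deoxyribose sugar with a phosphate group, showing the bonds to the next nucleotide, to the previous nucleotide, and to the base, with the 3’ and 5’ positions marked]
Deoxyribose, nucleotide, base, A, C, G, T, 3’, 5’
Bases: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)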
Let’s write “AGACC”!
[Figure: “AGACC” (backbone)]
[Figure: “AGACC” (DNA), deoxyribonucleic acid (DNA), with the 5’ and 3’ ends marked]
DNA is double stranded
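[Figure: two paired strands, one drawn 5’ to 3’ and the other 3’ to 5’]
DNA is always written 5’ to 3’
The same double-stranded molecule can be written as AGACC or, reading the other strand, as GGTCT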
strand, reverse complement
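The AGACC / GGTCT example can be checked with a few lines of code. This is a minimal sketch, not part of the original slides; the function name `reverse_complement` is an illustrative choice.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5' to 3'."""
    # Reverse first (the strands run in opposite directions), then pair each base.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original strand, which is why either spelling identifies the same molecule.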
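DNA Packaging
histone, nucleosome, chromatin, chromosome, centromere, telomere
[Figure: DNA (~146bp) wrapped around histones H2A, H2B, H3, H4, with H1, forming a nucleosome; nucleosomes pack into chromatin; a chromosome with its centromere and telomere labeled]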
The Genome
  The genome is the full set of hereditary information for an organism
  Humans bundle two copies of the genome into 46 chromosomes in every cell: 46 = 2 x (chromosomes 1-22 + X/Y)
Building an organism
[Figure: cell, DNA]
  Every cell has the same sequence of DNA
  Subsets of the DNA sequence determine the identity and function of different cells
From DNA To Organism
[Figure: DNA, ?, organism]
Proteins do most of the work in biology, and are encoded by subsequences of DNA, known as genes.
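RNA
[Figure: chemical structure of a ribonucleotide, drawn like the nucleotide but with an OH group on the sugar, showing the bonds to the next ribonucleotide, to the previous ribonucleotide, and to the base, with the 3’ and 5’ positions marked]
ribonucleotide, U
Bases: Adenine (A), Cytosine (C), Guanine (G), Uracil (U)
T → U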
Genes & Proteins
Double-stranded DNA
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein
gene, transcription, translation, protein
Gene Transcription
promoter
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
  Transcription factors (TFs): a type of protein that binds to DNA and helps initiate gene transcription.
  Transcription factor binding sites: short sequences of DNA (6-20 bp) recognized and bound by TFs.
  RNA polymerase binds a complex of TFs in the promoter.
transcription factor, binding site, RNA polymerase
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
The two strands are separated
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template
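The copying step can be sketched in code using the GATTACA example from these slides. This is an illustrative sketch only: the function names are made up for this example, and the real process involves RNA polymerase and transcription factors rather than string operations.

```python
# RNA base that pairs with each DNA template base (RNA uses U where DNA uses T).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Build the 5'->3' RNA by pairing each base of the 3'->5' template strand."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Shortcut: the RNA equals the 5'->3' strand with every T replaced by U."""
    return coding_5_to_3.replace("T", "U")


print(transcribe("CTAATGT"))              # GAUUACA
print(transcribe_from_coding("GATTACA"))  # GAUUACA
```

Both routes give the same RNA, which is why the RNA is described as a copy of the 5’→3’ sequence even though it is built from the 3’→5’ template.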
pre-mRNA: 5’ G A U U A C A . . . 3’
DNA:      5’ G A T T A C A . . . 3’
          3’ C T A A T G T . . . 5’
RNA Processing
5’ cap, polyadenylation, exon, intron, splicing, UTR, mRNA
[Figure: mRNA with a 5’ cap and a poly(A) tail; exons and an intron are labeled, with the 5’ UTR and 3’ UTR at the two ends]
Gene Structure
[Figure: a gene drawn 5’ to 3’: promoter, 5’ UTR, exons separated by introns, 3’ UTR; regions marked as coding or non-coding]
How many? (Human Genome)
  Genes: ~20,000
  Exons per gene: ~8 on average (max: 148)
  Nucleotides per exon: 170 on average (max: 12k)
  Nucleotides per intron: 5,500 on average (max: 500k)
  Nucleotides per gene: 45k on average (max: 2.2M)
From RNA to Protein
  Proteins are long strings of amino acids joined by peptide bonds
  Translation from RNA sequence to amino acid sequence is performed by ribosomes
  20 amino acids: 3 RNA letters are required to specify a single amino acid (2 letters give only 4 x 4 = 16 combinations; 3 letters give 64)
Amino acid
amino acid
[Figure: chemical structure of an amino acid, with atoms C, O, N, H, an OH group, and the side chain R]
There are 20 standard amino acids:
Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
Proteins
[Figure: an amino acid within a chain, bonded to the previous and to the next amino acid; the chain starts at the N-terminus and ends at the C-terminus, following the 5’ → 3’ direction of the mRNA]
N-terminus, C-terminus
Translation
The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.
ribosome, codon
[Figure: ribosome on the mRNA, with its P site and A site labeled]
The genetic code
  Mapping from a codon to an amino acid
Translation
5’ . . . A U U | A U G | G C C | U G G | A C U | U G A . . . 3’
UTR | Met (start codon) | Ala | Trp | Thr | Stop (stop codon)
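The codon-by-codon reading of this example can be sketched as follows. The table is the standard genetic code written compactly with one-letter amino acid names; the function name `translate` and the rule "start at the first AUG, stop at the first in-frame stop codon" are simplifying assumptions for illustration, not a model of the ribosome.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code for codons in the order UUU, UUC, UUA, UUG, UCU, ...
# One-letter amino acid names; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT, i.e. Met Ala Trp Thr
```

The leading AUU is skipped because it lies in the untranslated region before the start codon, and UGA ends the protein without adding an amino acid.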
Translation
[Figure: tRNAs carrying the amino acids Met, Ala and Trp at the mRNA 5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’]
t-RNA, amino acid
Errors?
What if the transcription / translation machinery makes mistakes?
What is the effect of mutations in coding regions?
mutation
Reading Frames
reading frame
Sequence: G C U U G U U U A C G A A U U A G
Frame 1:  G C U | U G U | U U A | C G A | A U U | A G
Frame 2:  G | C U U | G U U | U A C | G A A | U U A | G
Frame 3:  G C | U U G | U U U | A C G | A A U | U A G
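To see how much the choice of reading frame matters, the sketch below translates the same sequence in each of the three forward frames. It is an illustration under simple assumptions: one-letter amino acid names, the standard genetic code, and a made-up helper name `reading_frames`.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, ...; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reading_frames(rna: str) -> list:
    """Translate every complete codon in each of the three forward frames."""
    frames = []
    for offset in range(3):
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append("".join(CODON_TABLE[codon] for codon in codons))
    return frames


for offset, peptide in enumerate(reading_frames("GCUUGUUUACGAAUUAG")):
    print(f"frame {offset + 1}: {peptide}")
# frame 1: ACLRI
# frame 2: LVYEL
# frame 3: LFTN*
```

The first frame gives Ala Cys Leu Arg Ile, the reading used on the mutation slides; the other two frames give unrelated peptides, and the third ends in a stop codon.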
Synonymous Mutation
synonymous (silent) mutation, fourfold site
Original: G C U | U G U | U U A | C G A | A U U | A G
          Ala     Cys     Leu     Arg     Ile
A → G:    G C U | U G U | U U G | C G A | A U U | A G
          Ala     Cys     Leu     Arg     Ile
Missense Mutation
missense mutation
Original: G C U | U G U | U U A | C G A | A U U | A G
          Ala     Cys     Leu     Arg     Ile
U → G:    G C U | U G G | U U A | C G A | A U U | A G
          Ala     Trp     Leu     Arg     Ile
Nonsense Mutation
nonsense mutation
Original: G C U | U G U | U U A | C G A | A U U | A G
          Ala     Cys     Leu     Arg     Ile
U → A:    G C U | U G A | U U A | C G A | A U U | A G
          Ala     STOP
Frameshift
frameshift
Original:      G C U | U G U | U U A | C G A | A U U | A G
               Ala     Cys     Leu     Arg     Ile
One U deleted: G C U | U G U | U A C | G A A | U U A | G
               Ala     Cys     Tyr     Glu     Leu
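The four mutation slides can be reproduced with a short sketch. The helper names (`translate_frame`, `classify_substitution`) and the 0-based positions are choices made for this illustration; amino acids use one-letter names and the standard genetic code.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, ...; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(rna: str) -> str:
    """Translate every complete codon, reading from the first base."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def classify_substitution(rna: str, pos: int, new_base: str) -> str:
    """Classify a single-base substitution at 0-based position pos."""
    start = pos - pos % 3                      # first base of the affected codon
    mutated = rna[:pos] + new_base + rna[pos + 1:]
    old_aa = CODON_TABLE[rna[start:start + 3]]
    new_aa = CODON_TABLE[mutated[start:start + 3]]
    if new_aa == old_aa:
        return "synonymous"
    if new_aa == "*":
        return "nonsense"
    return "missense"


original = "GCUUGUUUACGAAUUAG"
print(classify_substitution(original, 8, "G"))  # synonymous (UUA -> UUG, Leu -> Leu)
print(classify_substitution(original, 5, "G"))  # missense   (UGU -> UGG, Cys -> Trp)
print(classify_substitution(original, 5, "A"))  # nonsense   (UGU -> UGA, Cys -> stop)

# Frameshift: deleting one base changes every codon downstream of the deletion.
deleted = original[:5] + original[6:]
print(translate_frame(original))  # ACLRI
print(translate_frame(deleted))   # ACYEL
```

A substitution changes at most one amino acid, while the single-base deletion rewrites every amino acid after the deletion point, which is why frameshifts are usually far more disruptive.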
Transcription and translation
[Illustration from Radboud University Nijmegen]
Let’s see how this happens!
  Transcription: http://www.youtube.com/watch?v=DA2t5N72mgw
  Translation: http://www.youtube.com/watch?v=WkI_Vbwn14g&feature=related
Gene Expression Regulation
When should each gene be expressed?
  Regulate gene expression
Examples:
  Make more of gene A when substance X is present
  Stop making gene B once you have enough
  Make genes C1, C2, C3 simultaneously
Why? Every cell has the same DNA, but each cell expresses different proteins.
Signal transduction: one signal is converted to another
  A cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins, ...
regulation, signal transduction
Gene Regulation
Gene expression is controlled at many levels:
  DNA chromatin structure
  Transcription
  Post-transcriptional modification
  RNA transport
  Translation
  mRNA degradation
  Post-translational modification
Transcription regulation
Much gene regulation occurs at the level of transcription.
Primary players:
  Binding sites (BS) in cis-regulatory modules (CRMs)
  Transcription factor (TF) proteins
  RNA polymerase II
Primary mechanism:
  TFs link to BSs
  Complex of TFs forms
  Complex assists or inhibits formation of the RNA polymerase II machinery
Transcription Factor Binding Sites
  Short, degenerate DNA sequences recognized by particular TFs
  For complex organisms, cooperative binding of multiple TFs is required to initiate transcription
[Figure: binding sequence logo]
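A sequence logo summarizes how degenerate each position of a binding site is. The sketch below shows one common way to compute the quantities behind a logo: per-position base frequencies and information content in bits (2 + Σ p·log2 p, without any small-sample correction). The four aligned sites are hypothetical, invented only for this example, and the function names are illustrative.

```python
import math


def position_frequencies(sites):
    """Frequency of A, C, G, T at each position of equal-length aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        freqs.append({base: column.count(base) / len(sites) for base in "ACGT"})
    return freqs


def information_content(freq):
    """Bits at one position: 2 for a fully conserved base, 0 for a uniform one."""
    return 2.0 + sum(p * math.log2(p) for p in freq.values() if p > 0)


# Hypothetical aligned binding sites (NOT taken from the slides or from real data).
sites = ["GATTACA", "GATTACA", "GCTTACA", "GATAACA"]
for i, freq in enumerate(position_frequencies(sites), start=1):
    print(i, round(information_content(freq), 2))
```

In a logo, the stack at each position is as tall as its information content, and each letter's height is its frequency times that total, so conserved positions stand out and degenerate positions shrink.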
Summary
  All hereditary information is encoded in double-stranded DNA
  Each cell in an organism has the same DNA
  DNA → RNA → protein
  Proteins have many diverse roles in the cell
  Gene regulation diversifies protein products within different cells
Outline
Course logistics
A zero-knowledge based introduction to biology
Potential project topics
Which Drug Should Patient X Be Treated With?
Example project topic #1 (1/3)
  Say that cancer patient X undergoes chemotherapy.
  There are >200 drugs patient X can be treated with.
  How do doctors choose which drug to use in chemotherapy treatment?
Patient X: a few histologic features (e.g. follicular lymphoma, diffuse large B cell lymphoma)
Chemotherapy drugs: 5-Iodotubercidin, Acrichine, ARQ-197, Arsenic trioxide, AS101, AS-703026, AT-7519, Axitinib, Azacitidine, ...
How can we improve this?
Which Drug Should Patient X Be Treated With?
Example project topic #1 (2/3)
Patient X: a few histologic features (e.g. follicular lymphoma, diffuse large B cell lymphoma), plus
  DNA sequence (…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…)
  RNA levels of genes
  Protein levels of genes
  Epigenetics (methylation)
Chemotherapy drugs: 5-Iodotubercidin, Acrichine, ARQ-197, Arsenic trioxide, AS101, AS-703026, AT-7519, Axitinib, Azacitidine, ...
Doctors cannot handle millions of numbers! How about computers?
Let’s Build a Prediction Model
Example project topic #1 (3/3)
This is a pure machine learning problem!
  Input: RNA levels of genes in cancer cells (g1, g2, ..., g30,000) for patient X
  Output: drug sensitivity to each of 160 drugs (Drug 1, ..., Drug i, ..., Drug 160)
  Training data: drug sensitivity tests on ~100 patients at UWMC
  30,000 features! (feature selection)
  Prior knowledge on drugs’ targets
  Publicly available RNA level data for >3000 patients (transfer learning, feature reconstruction)
Goal: realizing personalized cancer treatment
In collaboration with Tony Blau, Pam Becker, Ray Monnat, David Hawkins (Medicine)
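To make "feature selection" concrete, here is a minimal sketch of one simple filter approach: rank genes by the absolute correlation between their RNA level and sensitivity to one drug, and keep the top few. This is an assumed illustration, not the method used in the project. The data are synthetic stand-ins (300 genes rather than 30,000, with one gene deliberately made informative), and all names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_genes(expression, sensitivity, k):
    """Rank genes by |correlation| with drug sensitivity and keep the top k."""
    scores = [(abs(pearson(levels, sensitivity)), gene)
              for gene, levels in enumerate(expression)]
    return [gene for _, gene in sorted(scores, reverse=True)[:k]]


# Synthetic stand-in data (NOT real patient data): 300 genes x 100 patients.
rng = random.Random(0)
n_genes, n_patients, informative_gene = 300, 100, 42
expression = [[rng.gauss(0, 1) for _ in range(n_patients)] for _ in range(n_genes)]
# Assumption for the demo: sensitivity to one drug tracks a single gene plus noise.
sensitivity = [expression[informative_gene][i] + rng.gauss(0, 0.3)
               for i in range(n_patients)]

selected = top_genes(expression, sensitivity, k=5)
print(selected)  # the planted gene should rank first
```

With only ~100 patients and tens of thousands of genes, some uninformative genes will correlate with the outcome by chance, which is one reason the slide points to prior knowledge and extra public data.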
How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (1/2)
DNA sequence and trait (e.g. obesity) for N individuals (N instances); sequence variants s1, s2, ..., sp with p ≈ 10^6!
  Individual 1:   …ACTCGGTAGACCTAAATTCGGCCCGG…
  Individual 2:   …ACCCGGTAGACCTTTATTCGGCCCGG…
  Individual 3:   …ACCCGGTAGACCTTAATTCGGCCGGG…
  :
  Individual N-1: …ACCCGGTAGTCCTATATTCGGCCCGG…
  Individual N:   …ACTCGGTAGTCCTATATTCGGCCGGG…
Standard approach
  Find a simple rule! (e.g. A → thin, T → fat)
  Failed to detect the DNA affecting many important traits.
[Figure labels: cell, a complex system; environmental factors; Causality?; too weak to be detected]
One of the most important research problems in this area is to develop new computational methods that can represent more complicated interactions between sequence variation and trait.
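The "simple rule" approach and the difficulty of detecting weak effects can be illustrated with a small sketch. The allele column (A, A, A, T, T) follows the slide; the trait values are hypothetical, and the Bonferroni correction is brought in here as one standard way to account for testing many sites. It is an assumption of this example, not something the slides prescribe.

```python
import math
from statistics import fmean


def mean_trait_by_allele(alleles, trait):
    """Average trait value among the individuals carrying each allele at one site."""
    groups = {}
    for allele, value in zip(alleles, trait):
        groups.setdefault(allele, []).append(value)
    return {allele: fmean(values) for allele, values in groups.items()}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests


alleles = ["A", "A", "A", "T", "T"]      # one variable site, as on the slide
trait = [21.0, 22.0, 23.0, 30.0, 32.0]   # hypothetical body-mass values
print(mean_trait_by_allele(alleles, trait))  # {'A': 22.0, 'T': 31.0}

# With p of about 10^6 sites tested one at a time, each test must clear a very strict bar.
print(bonferroni_threshold(0.05, 10**6))
```

Under this correction a single site has to reach a p-value of roughly 5e-8, so a variant with a small individual effect can easily go undetected, which motivates models that combine many variants and their interactions.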
How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (2/2)
Longitudinal study of ~2000 subjects
  Sequence information: variants s1, s2, s3, s4, ..., sP, with p ≈ 10^6! (feature selection)
  Phenotype data, from Year 0 to Year 25: cholesterol, fatty acid, glucose, insulin
  Environmental factors: age, sex, smoking status
Age-specific genetic influence
Structural learning
In collaboration with Alex Reiner (Epidemiology)
More project topics at the course website!
Questions?