Lecture 1 – Sep 27, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022
Welcome to CSE 527: Computational Biology
Who is the instructor?
Prof. Su-In Lee
  Assistant Professor
  A joint faculty member: Computer Science & Engineering, Genome Sciences
  Office hours: Wednesday 1:30-2:30
  Research interests
    Developing machine learning techniques applied to Computational Biology (genetics, systems biology)
    Predictive Medicine, Translational Medicine
Teaching assistant
Christopher Miles (CSE PhD student)
  Office: TBA
  Office hours: Monday 1:30-2:30
  Email: [email protected]
What is the Coolest Thing a Computational/Mathematical Scientist Can Do?
  Curing cancer.
  Understanding how the blueprint of life (DNA) determines important traits (e.g. diseases).
  Predicting your disease susceptibilities based on your biological information, including DNA sequence.
  Predicting sudden changes in the condition of patients in the ICU (intensive care unit).
  Determining the order of A, G, C and T in my 3-billion-long DNA sequence.
  :
CSE 527 will provide you with basic concepts and ML/statistical techniques that you can use to realize these goals.
Biological information (data)
  DNA sequence information
  RNA levels of 30K genes
  Protein levels of 30K genes
  DNA molecule’s 3D structure
  :
More and More of Biology is Becoming an Information Science
Cell: The basic unit of life
[Figure: a stretch of DNA (AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC) containing several genes (~30,000 in human). Each gene is copied into RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU), and each RNA is translated into a protein (MWIV, MRV, MLRTY, MID). Successive panels illustrate gene expression, RNA degradation, gene regulation, and the gene interaction map.]
A cell’s biological state can be described by millions of numbers!
The biological discoveries we can make depend heavily on the computational methods we use to analyze the data.
Machine learning techniques provide very effective tools.
Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics
Goals of this course
Introduction to Computational Biology
  Basic concepts and scientific questions
  Basic biology for computational scientists
  In-depth coverage of ML techniques
  Current active areas of research
Useful machine learning (ML) algorithms
  Probabilistic graphical models, clustering, classification
  Learning techniques (MLE, EM)
Topics in CSE 527
Part 1: Basic ML algorithms
  Introduction to probabilistic models
  Bayesian networks, Hidden Markov models
  Representation and learning
Part 2: Topics in computational biology and areas of active research
  Genetics, systems biology, predictive medicine, sequence analysis
  Finding genetic factors for complex biological traits
  Inferring biological networks from data
  Comparative genomics
  DNA/RNA sequence analysis
Course responsibilities
Class participation and attendance (10%)
  Good answers to the questions asked in class
  Initiating a productive discussion
Homework assignments (40%)
  Four problem sets
  Due at beginning of class
  Up to 3 late days (24-hr period) for the quarter
  Collaboration allowed
    Teams of 2 or 3 students
    Individual writeups
Final project (50%)
  A group of up to two students
Project overview (1/2)
Topic
  Choose from the list of project topics on the course website, or come up with your own.
  Open-ended
Project deliverables
  Project proposal (due 10/19)
  Midterm report (due 11/16)
  Final report (due 12/14)
  Final presentations or poster session (12/7)
Project overview (2/2)
Final report
  Short report (up to 10 pages)
  Conference-style presentation
  Successful project reports can be submitted to computational biology/ML conferences (ISMB, RECOMB, NIPS, ICML)
  Or journals (PLoS journals, Nature journals, PNAS, Genome Research and so on)
Reading material
Lecture notes
  Mostly based on recent papers & old seminar papers
Biological background
  The Cell: A Molecular Approach by Cooper
  Genetics: From Genes to Genomes by Hartwell et al.
  Principles of Population Genetics by Hartl & Clark
Computational background
  Probabilistic Graphical Models by Profs. Daphne Koller & Nir Friedman
  Prof. Andrew Ng’s machine learning lecture notes (cs229.stanford.edu)
No textbook required for the course
Class resources
Course website – cs.washington.edu/527
  Lecture notes, assignments, project topics
  Deadlines of assignments and projects
Mailing list
  [email protected]
Outline
  Course logistics
  A zero-knowledge based introduction to biology
    Prepared by George Asimenos (PhD student, Stanford) for CS262 Computational Genomics by Prof. Serafim Batzoglou (Stanford).
  Potential project topics
Cells: Building Blocks of Life
[Figure: a cell with its nucleus, cytoplasm and mitochondrion labeled. © 1997-2005 Coriell Institute for Medical Research]
Keywords: cell, nucleus, cytoplasm, mitochondrion
Eukaryotes: plants, animals, humans
  DNA resides in the nucleus
  Contain other compartments for other specialized functions
Prokaryotes: bacteria
  Do not contain compartments
  Little recognizable substructure
DNA: “Blueprints” for a cell
Genetic information is encoded in long strings of double-stranded DNA
Deoxyribonucleic acid (DNA) is built from nucleotides that come in only four flavors: Adenine, Cytosine, Guanine, Thymine
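Nucleotide
[Figure: chemical structure of a nucleotide: a deoxyribose sugar with a phosphate group, with links to the previous nucleotide, to the next nucleotide, and to the base; the 3’ and 5’ ends are marked.]
Keywords: deoxyribose, nucleotide, base, A, C, G, T, 3’, 5’
Bases: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)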
Let’s write “AGACC”!
“AGACC” (backbone) [figure]
“AGACC” (DNA) [figure, with the 5’ and 3’ ends of the strand labeled]
Keyword: deoxyribonucleic acid (DNA)
DNA is double stranded
[Figure: the two strands run in opposite directions, one 5’ to 3’ and the other 3’ to 5’.]
DNA is always written 5’ to 3’
AGACC or GGTCT
Keywords: strand, reverse complement
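The "AGACC or GGTCT" pair above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the names `COMPLEMENT` and `reverse_complement` are chosen here for illustration.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'.

    The opposite strand runs in the other direction, so we walk the
    input backwards and swap each base for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original strand, which is why either strand can be used to name the same piece of double-stranded DNA.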
DNA Packaging
Keywords: histone, nucleosome, chromatin, chromosome, centromere, telomere
[Figure labels: DNA, histone H1, histones H2A, H2B, H3, H4, ~146 bp, nucleosome, chromatin, centromere, telomere.]
The Genome
The genome is the full set of hereditary information for an organism
Humans bundle two copies of the genome into 46 chromosomes in every cell
  46 = 2 x (chromosomes 1-22 + X/Y)
Building an organism
[Figure: a cell and its DNA.]
Every cell has the same sequence of DNA
Subsets of the DNA sequence determine the identity and function of different cells
From DNA To Organism
Proteins do most of the work in biology, and are encoded by subsequences of DNA, known as genes.
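RNA
[Figure: chemical structure of a ribonucleotide: a ribose sugar (with an OH group where deoxyribose has an H) with a phosphate group, with links to the previous ribonucleotide, to the next ribonucleotide, and to the base; the 3’ and 5’ ends are marked.]
Keywords: ribonucleotide, U
Bases: Adenine (A), Cytosine (C), Guanine (G), Uracil (U)
T → U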
Genes & Proteins
Keywords: gene, transcription, translation, protein
Double-stranded DNA
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein
Gene Transcription
Keyword: promoter
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
Transcription factors: a type of protein that binds to DNA and helps initiate gene transcription.
Transcription factor binding sites: short sequences of DNA (6-20 bp) recognized and bound by TFs.
RNA polymerase binds a complex of TFs in the promoter.
Keywords: transcription factor, binding site, RNA polymerase
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
The two strands are separated
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
RNA: G A U U A C A
Gene Transcription
pre-mRNA: 5’ G A U U A C A . . . 3’
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
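The transcription steps above can be sketched in Python. This is an illustrative sketch only: the function names are made up here, and real transcription also involves the promoter, transcription factors and RNA polymerase described on the earlier slides.

```python
# RNA base that pairs with each DNA base on the template strand.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(template_3_to_5: str) -> str:
    """Build the RNA 5'->3' by pairing against the template read 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Shortcut: the RNA equals the 5'->3' strand with U in place of T."""
    return coding_5_to_3.replace("T", "U")


print(transcribe_from_template("CTAATGT"))  # GAUUACA
print(transcribe_from_coding("GATTACA"))    # GAUUACA
```

Both routes give the same RNA, which is why a gene is usually written as its 5’→3’ strand even though the other strand serves as the template.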
RNA Processing
Keywords: 5’ cap, polyadenylation, exon, intron, splicing, UTR, mRNA
[Figure: exons and introns of the transcript, and the resulting mRNA with its 5’ cap, 5’ UTR, 3’ UTR and poly(A) tail.]
Gene Structure
[Figure: a gene drawn 5’ to 3’: promoter, 5’ UTR, exons separated by introns, 3’ UTR; regions are marked as coding or non-coding.]
How many? (Human Genome)
  Genes: ~20,000
  Exons per gene: ~8 on average (max: 148)
  Nucleotides per exon: 170 on average (max: 12k)
  Nucleotides per intron: 5,500 on average (max: 500k)
  Nucleotides per gene: 45k on average (max: 2.2M)
From RNA to Protein
Proteins are long strings of amino acids joined by peptide bonds
Translation from RNA sequence to amino acid sequence is performed by ribosomes
20 amino acids: 3 RNA letters are required to specify a single amino acid
Amino acid
Keyword: amino acid
[Figure: chemical structure of an amino acid, built from N, H, C, O and OH groups plus a side chain R.]
There are 20 standard amino acids:
Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
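Proteins
Keywords: N-terminus, C-terminus
[Figure: a chain of amino acids, each linked to the previous and to the next amino acid. The chain runs from the N-terminus (start) to the C-terminus (end), following the mRNA from 5’ → 3’.]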
Translation
The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.
Keywords: ribosome, codon
[Figure: a ribosome on the mRNA, with its P site and A site labeled.]
The genetic code
Mapping from a codon to an amino acid
Translation
5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’
  A U U: UTR
  A U G: Met (start codon)
  G C C: Ala
  U G G: Trp
  A C U: Thr
  U G A: stop codon
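The codon-by-codon reading above can be reproduced with a short Python sketch. The compact string used to build `CODON_TABLE` is the standard genetic code listed in U, C, A, G order; the function name `translate` and the rule "start at the first AUG" are simplifying assumptions for illustration, since real translation initiation is more involved.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in U, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT, i.e. Met Ala Trp Thr
```

The leading AUU is skipped because it lies before the start codon, matching the UTR label on the slide.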
Translation
[Figure: tRNAs carry the amino acids Met, Ala and Trp to the codons of 5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’.]
Keywords: tRNA, amino acid
Errors?
What if the transcription / translation machinery makes mistakes?
What is the effect of mutations in coding regions?
Keyword: mutation
Reading Frames
Keyword: reading frame
G C U U G U U U A C G A A U U A G
[Figure: the same sequence repeated, grouped into codons in each possible reading frame.]
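As a small illustration of reading frames, the sketch below groups the slide's sequence into codons starting at each of the three possible offsets. The function name `reading_frames` is an assumption made for this example.

```python
def reading_frames(rna: str) -> list[list[str]]:
    """Split an RNA string into complete codons for offsets 0, 1 and 2."""
    frames = []
    for offset in range(3):
        # Only keep complete triplets; leftover bases at the end are dropped.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


for offset, codons in enumerate(reading_frames("GCUUGUUUACGAAUUAG")):
    print(offset, " ".join(codons))
# 0 GCU UGU UUA CGA AUU
# 1 CUU GUU UAC GAA UUA
# 2 UUG UUU ACG AAU UAG
```

The same letters give three entirely different codon sequences, so the choice of frame determines which protein, if any, is encoded.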
Synonymous Mutation
Keywords: synonymous (silent) mutation, fourfold site
Original: G C U U G U U U A C G A A U U A G → Ala Cys Leu Arg Ile
Mutated (A → G): G C U U G U U U G C G A A U U A G → Ala Cys Leu Arg Ile
Missense Mutation
Keyword: missense mutation
Original: G C U U G U U U A C G A A U U A G → Ala Cys Leu Arg Ile
Mutated (U → G): G C U U G G U U A C G A A U U A G → Ala Trp Leu Arg Ile
Nonsense Mutation
Keyword: nonsense mutation
Original: G C U U G U U U A C G A A U U A G → Ala Cys Leu Arg Ile
Mutated (U → A): G C U U G A U U A C G A A U U A G → Ala STOP
Frameshift
Keyword: frameshift
Original: G C U U G U U U A C G A A U U A G → Ala Cys Leu Arg Ile
Mutated (one U deleted): G C U U G U U A C G A A U U A G → Ala Cys Tyr Glu Leu
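The four mutation types on the preceding slides can be told apart mechanically. The following is a rough sketch under simplifying assumptions: both sequences are read from their first base, only one change is present, and the names `protein` and `classify_mutation` are invented for this example.

```python
from itertools import product

# Standard genetic code, bases in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def protein(rna: str) -> str:
    """Translate from the first base; the result ends with '*' if a stop is hit."""
    result = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        result.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(result)


def classify_mutation(original: str, mutated: str) -> str:
    """Label a single mutation by its effect on the encoded protein."""
    if (len(original) - len(mutated)) % 3 != 0:
        return "frameshift"  # insertion/deletion that shifts the codon grouping
    before, after = protein(original), protein(mutated)
    if before == after:
        return "synonymous"
    if after.endswith("*") and not before.endswith("*"):
        return "nonsense"  # a new stop codon truncates the protein
    return "missense"


original = "GCUUGUUUACGAAUUAG"
print(classify_mutation(original, "GCUUGUUUGCGAAUUAG"))  # synonymous
print(classify_mutation(original, "GCUUGGUUACGAAUUAG"))  # missense
print(classify_mutation(original, "GCUUGAUUACGAAUUAG"))  # nonsense
print(classify_mutation(original, "GCUUGUUACGAAUUAG"))   # frameshift
```

Insertions or deletions whose length is a multiple of three are not handled separately here; they would fall through to the substitution labels, which is a limitation of this sketch.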
Transcription and translation
Illustration from Radboud University Nijmegen
Let’s see how this happens!
  Transcription: http://www.youtube.com/watch?v=DA2t5N72mgw
  Translation: http://www.youtube.com/watch?v=WkI_Vbwn14g&feature=related
Gene Expression Regulation
When should each gene be expressed? Regulate gene expression
Examples:
  Make more of gene A when substance X is present
  Stop making gene B once you have enough
  Make genes C1, C2, C3 simultaneously
Why? Every cell has the same DNA, but each cell expresses different proteins.
Signal transduction: one signal is converted to another
  A cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins, ...
Keywords: regulation, signal transduction
Gene Regulation
Gene expression is controlled at many levels
  DNA chromatin structure
  Transcription
  Post-transcriptional modification
  RNA transport
  Translation
  mRNA degradation
  Post-translational modification
Transcription regulation
Much gene regulation occurs at the level of transcription.
Primary players:
  Binding sites (BS) in cis-regulatory modules (CRMs)
  Transcription factor (TF) proteins
  RNA polymerase II
Primary mechanism:
  TFs link to BSs
  Complex of TFs forms
  Complex assists or inhibits formation of the RNA polymerase II machinery
Transcription Factor Binding Sites
Short, degenerate DNA sequences recognized by particular TFs
For complex organisms, cooperative binding of multiple TFs is required to initiate transcription
[Figure: binding sequence logo]
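To make "degenerate" concrete, here is a minimal sketch of how a sequence logo is derived from aligned binding sites: count the bases at each position, then measure how conserved each position is in bits. The four sites in `SITES` are hypothetical examples, not data from the lecture, and the small-sample correction used by some logo tools is omitted.

```python
import math
from collections import Counter

# Hypothetical aligned binding sites for one TF (illustration only).
SITES = ["GATTACA", "GATTACA", "GATAACA", "GCTTACA"]


def position_counts(sites: list[str]) -> list[Counter]:
    """Count how often each base occurs at each aligned position."""
    return [Counter(site[i] for site in sites) for i in range(len(sites[0]))]


def consensus(sites: list[str]) -> str:
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


def information_content(counts: Counter) -> float:
    """Bits of information at one position: 2 minus the Shannon entropy."""
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy


print(consensus(SITES))  # GATTACA
for i, counts in enumerate(position_counts(SITES)):
    print(i, dict(counts), round(information_content(counts), 2))
```

Fully conserved positions score 2 bits and appear as tall letters in a logo; positions where the TF tolerates several bases score lower and appear shorter.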
Summary
All hereditary information is encoded in double-stranded DNA
Each cell in an organism has the same DNA
DNA → RNA → protein
Proteins have many diverse roles in the cell
Gene regulation diversifies protein products within different cells
Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics
Which Drug Should Patient X Be Treated With?
Example project topic #1 (1/3)
Say that cancer patient X undergoes chemotherapy.
There are >200 drugs patient X can be treated with.
How do doctors choose which drug to use in chemotherapy treatment?
[Figure: Patient X, described by a few histologic features (follicular lymphoma or diffuse large B cell lymphoma), next to a list of chemotherapy drugs: 5-Iodotubercidin, Acrichine, ARQ-197, Arsenic trioxide, AS101, AS-703026, AT-7519, Axitinib, Azacitidine, ...]
How can we improve this?
Which Drug Should Patient X Be Treated With?
Example project topic #1 (2/3)
Doctors cannot handle millions of numbers! How about computers?
Information available for patient X:
  A few histologic features
  DNA sequence (…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…)
  RNA levels of genes
  Protein levels of genes
  Epigenetics (methylation)
This is a pure machine learning problem!
Let’s Build a Prediction Model
Example project topic #1 (3/3)
[Figure: RNA levels of 30,000 genes (g1, g2, ..., g30,000) in the cancer cells of patient X are linked to the results of a drug sensitivity test for 160 drugs (Drug 1, Drug 2, ..., Drug 160).]
Data: ~100 patients at UWMC
30,000 features! (feature selection)
Prior knowledge on drugs’ targets
Publicly available RNA level data from >3000 patients (transfer learning, feature reconstruction)
Goal: realizing personalized cancer treatment
In collaboration with Tony Blau, Pam Becker, Ray Monnat, David Hawkins (Medicine)
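One simple way to picture the feature selection step is a correlation filter: score each gene by how strongly its RNA level tracks drug sensitivity across patients and keep the top few. The sketch below is only an illustration of that idea; the slides do not specify a method, and the gene names, expression values and sensitivity values are made-up toy data.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def top_k_genes(expression: dict[str, list[float]],
                sensitivity: list[float], k: int) -> list[str]:
    """Rank genes by absolute correlation with sensitivity and keep k."""
    scores = {gene: abs(pearson(levels, sensitivity))
              for gene, levels in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: RNA levels of 4 genes in 5 patients (hypothetical values).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "g3": [5.0, 5.0, 5.0, 5.0, 5.0],
    "g4": [5.0, 4.0, 3.0, 1.0, 2.0],
}
sensitivity = [0.1, 0.2, 0.3, 0.4, 0.5]  # response of each patient to one drug

print(top_k_genes(expression, sensitivity, k=2))  # ['g1', 'g4']
```

With 30,000 genes and only ~100 patients, many genes will correlate with the outcome by chance, which is part of why the slide points to prior knowledge and additional public data as extra sources of information.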
How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (1/2)
[Figure: DNA sequences of N individuals (Individual 1, Individual 2, ..., Individual N), each paired with a trait such as obesity:
…ACTCGGTAGACCTAAATTCGGCCCGG…
…ACCCGGTAGACCTTTATTCGGCCCGG…
…ACCCGGTAGACCTTAATTCGGCCGGG…
:
…ACCCGGTAGTCCTATATTCGGCCCGG…
…ACTCGGTAGTCCTATATTCGGCCGGG…
The data have N instances and p ≈ 10^6 sequence variations (s1, s2, ..., sp). Other labels: the cell as a complex system, environmental factors, causality?, effects too weak to be detected.]
Standard approach
  Find a simple rule! Example at one position: A → thin, T → fat
  Failed to detect the DNA affecting many important traits.
One of the most important research problems in this area is to develop new computational methods that can represent more complicated interactions between sequence variation and trait.
How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (2/2)
Longitudinal study
  ~2000 subjects
  Sequence information: p ≈ 10^6 sequence variations (s1, s2, s3, s4, ..., sP)! (feature selection)
  Phenotype data from Year 0 to Year 25: cholesterol, fatty acid, glucose, insulin
  Environmental factors: age, sex, smoking status
Age-specific genetic influence
Structural learning
In collaboration with Alex Reiner (Epidemiology)
More project topics at the course website!
Questions?