1- 1
Chapter 1
Introduction
1- 2
Introduction – Gene History
1865 Mendel: The basic unit of inheritance is a gene. Mendel's work was forgotten until about 1900.
1944 The gene was shown to be made of DNA (deoxyribonucleic acid).
1953 James Watson and Francis Crick: the double-helical structure of DNA.
1- 3
Introduction – Gene History (Cont.)
1990 The Human Genome Project started.
1995 The first free-living organism to be sequenced: Haemophilus influenzae.
1998 Celera joined the gene research.
2000 The draft of the human DNA sequence was completed (published in 2001).
1- 4
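Bioinformatics - Related Domestic Programs
2000 National Science Council (NSC): interdisciplinary research on "Bioinformatics"
2001 NSC national research program: National Research Program for Genomic Medicine
2001 NSC interdisciplinary project research
  Engineering Division: information technology
  Biology Division: bioinformatics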
1- 5
Animal Cell (nucleus, cytoplasm, cell membrane)
DNA is located inside the cell nucleus.
1- 6
DNA Double Helix
1- 7
DNA Double Helix
1- 8
Bonds Between Nucleotides in DNA
1- 9
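Nucleotide
A nucleotide is the building block of a nucleic acid molecule. A nucleotide contains:
  a five-carbon sugar (deoxyribose in DNA)
  a phosphate group
  one of the nitrogenous bases (A, G, C, T, U)
[Figure label: cytosine (C)]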
1- 10
The Four Nitrogenous Bases of DNA
1- 11
DNA Double Helix
1- 12
DNA Sequence
1- 13
DNA and RNA
Nucleotide bases:
  adenine (A)
  guanine (G)
  cytosine (C)
  thymine (T)
  uracil (U)
DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)
1- 14
DNA Length
The total length of human DNA is about 3 × 10^9 (3 billion) base pairs.
Only 1% to 1.5% of the DNA sequence is useful (coding).
Number of human genes: 30,000 to 40,000
  This is a conclusion of the Human Genome Project; the originally expected number was 100,000.
1- 15
DNA Sequencing
Given DNA sequence:
TGCACTTGAACGCATGCT
Cut the sequence just before a randomly chosen A; each fragment is the suffix that starts at that A:
ATGCT length=5
ACGCATGCT length=9
AACGCATGCT length=10
ACTTGAACGCATGCT length=15
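The fragment lengths above determine where each A sits in the sequence. The following is a minimal sketch of that idea, assuming the 18-base sequence TGCACTTGAACGCATGCT (the one consistent with the four listed fragments); the helper name `fragments_starting_at` is hypothetical.

```python
def fragments_starting_at(seq, base="A"):
    """Return every suffix of seq that begins with the given base."""
    return [seq[i:] for i, ch in enumerate(seq) if ch == base]

# Assumed input: the sequence consistent with the four fragments on the slide.
seq = "TGCACTTGAACGCATGCT"
frags = fragments_starting_at(seq)

# Electrophoresis sorts fragments by length; only the lengths are observed.
lengths = sorted(len(f) for f in frags)

# A fragment of length L starts at 1-based position len(seq) - L + 1,
# so the lengths reveal the positions of A in the original sequence.
positions = sorted(len(seq) - L + 1 for L in lengths)

print(lengths)
print(positions)
```

Repeating the same experiment for C, G and T would, in this simplified picture, place every base of the sequence.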
1- 16
DNA Sequencing: Electrophoresis
1- 17
DNA Sequencing
1- 18
Amino Acids
Amino acids are the basic units of proteins; there are 20 kinds.
1- 19
General Structure of an Amino Acid
[Figure: an amino acid drawn with its carboxyl group (COO-), its amino group (H3N+) on the central carbon, and an R group made of four CH2 units ending in NH3+.]
3 groups:
Amino group
Carboxyl group
R group
1- 20
Amino Acid Molecules
1- 21
Amino Acid Molecules
1- 22
Protein Molecules
1- 23
Amino Acids and RNA
Every three nucleotides (a codon, the unit of the genetic code) correspond to one amino acid.

First     Second position of codon                                      Third
position  U              C              A              G                position
U         UUU Phe [F]    UCU Ser [S]    UAU Tyr [Y]    UGU Cys [C]      U
          UUC Phe [F]    UCC Ser [S]    UAC Tyr [Y]    UGC Cys [C]      C
          UUA Leu [L]    UCA Ser [S]    UAA Ter [end]  UGA Ter [end]    A
          UUG Leu [L]    UCG Ser [S]    UAG Ter [end]  UGG Trp [W]      G
C         CUU Leu [L]    CCU Pro [P]    CAU His [H]    CGU Arg [R]      U
          CUC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]      C
          CUA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]      A
          CUG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]      G
A         AUU Ile [I]    ACU Thr [T]    AAU Asn [N]    AGU Ser [S]      U
          AUC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]      C
          AUA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]      A
          AUG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]      G
G         GUU Val [V]    GCU Ala [A]    GAU Asp [D]    GGU Gly [G]      U
          GUC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]      C
          GUA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]      A
          GUG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]      G

AUG is also the "start" codon.
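As an illustration of how the codon table is used, here is a small sketch that translates an RNA string into one-letter amino acid codes. It assumes translation begins at the first AUG and stops at the first Ter codon; the function name `translate` and the sample RNA string are made up for this example.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids in table order: first, second, third codon position
# each run through U, C, A, G. '*' marks the three Ter (stop) codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna):
    """Translate from the first AUG until a stop codon or the end of the string."""
    start = rna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical input: AUG UUU GGC UAA preceded and followed by extra bases.
print(translate("GGAUGUUUGGCUAAAC"))
```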
1- 24
From DNA via RNA to Protein
1- 25
DNA, Genes and Proteins
DNA: program for cell processes
Proteins: execute cell processes
[Figure; labels: DNA (base runs such as TCCAA, CGGTGC, TGAGGT, GCAC), gene, protein.]
1- 26
Promoter and Gene
[Figure; labels: promoter, TATA, TTG, upstream, downstream, transcriptional start site, ATG, TAG, transcriptional termination site, intron, exon.]
1- 27
Regulation of Genes
[Figure by Blanchette; labels: gene, regulatory element, RNA polymerase (protein), transcription factor (protein), DNA.]
1- 28
Regulation of Genes
[Figure by Blanchette; labels: gene, RNA polymerase, transcription factor (protein), regulatory element, DNA.]
1- 29
Regulation of Genes
[Figure by Blanchette; labels: gene, RNA polymerase, transcription factor, regulatory element, DNA, new protein.]
1- 30
From DNA via RNA to Protein
1- 31
From RNA to Protein
1- 32
From RNA to Protein
1- 33
Primary Structure of Protein
Amino acid sequence of bovine insulin (a protein)
1- 34
Secondary Structure of Protein
1- 35
Tertiary Structure of Protein
Tertiary structure of the hemoglobin molecule
1- 36
Quaternary Structure of Protein
Quaternary structure of the hemoglobin molecule
1- 37
Problems on Different Levels
1- 38
Some Problems in Bioinformatics
Sequence comparison
  Longest common subsequence
  Edit distance
  Similarity
  Multiple sequence alignment
Fragment assembly of DNA sequences
  Shortest common superstring
Physical mapping
  Double digest problem
  Consecutive ones problem
Evolutionary trees
Molecular structure prediction
  Protein folding
1- 39
Sequence Comparison
Goals:
Database search: Given a sequence S and a set of sequences G, find all the sequences in G that are similar to S.
Similarity: Find which parts of the sequences are alike and which parts differ.
- Sequence alignment (global alignment)
- Local alignment
1- 40
Sequence Alignment
Global alignment
Local alignment
1- 41
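Longest Common Subsequence(1)
To find a longest common subsequence between two strings.
string1: TAGTCACG
string2: AGACTGTC
LCS: AGACG
Dynamic programming, with a_i the i-th character of string1 and b_j the j-th character of string2:
$$
c_{i,j} =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
c_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
\max\{c_{i-1,j},\ c_{i,j-1}\} & \text{if } a_i \ne b_j
\end{cases}
$$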
1- 42
Longest Common Subsequence(2)
S1: TAGTCACG
S2: AGACTGTC
LCS: AGACG

        S2:  -  A  G  A  C  T  G  T  C
S1:  -       0  0  0  0  0  0  0  0  0
     T       0  0  0  0  0  1  1  1  1
     A       0  1  1  1  1  1  1  1  1
     G       0  1  2  2  2  2  2  2  2
     T       0  1  2  2  2  3  3  3  3
     C       0  1  2  2  3  3  3  3  4
     A       0  1  2  3  3  3  3  3  4
     C       0  1  2  3  4  4  4  4  4
     G       0  1  2  3  4  4  5  5  5
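The table-filling procedure for the longest common subsequence can be sketched as follows. This is a minimal illustration of the standard dynamic program, not necessarily the exact code behind the slides; the function name `lcs` is an assumption, and the traceback returns one LCS, which may differ from the AGACG shown when several exist.

```python
def lcs(a, b):
    """Return (length, one longest common subsequence) of strings a and b."""
    m, n = len(a), len(b)
    # c[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Trace back from the bottom-right cell to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return c[m][n], "".join(reversed(out))

length, subseq = lcs("TAGTCACG", "AGACTGTC")
print(length, subseq)
```

The table costs O(mn) time and space for strings of lengths m and n.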
1- 43
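Edit Distance(1)
To find a smallest edit process between two strings.
S1: TAGTCAC-G--
S2: -AG--ACTGTC
Operation: DMMDDMMIMII (D = delete, M = match, I = insert)
$$
c_{i,j} = \min
\begin{cases}
c_{i-1,j-1} + 0 & \text{if } a_i = b_j \quad \text{(Match)} \\
c_{i-1,j} + dist(a_i, -) & \text{(Delete)} \\
c_{i,j-1} + dist(-, b_j) & \text{(Insert)}
\end{cases}
$$
Suppose dist(a_i, -) = dist(-, b_j) = 1.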
1- 44
Edit Distance(2)
S1: TAGTCAC-G--
S2: -AG--ACTGTC
    DMMDDMMIMII

        S2:  -  A  G  A  C  T  G  T  C
S1:  -       0  1  2  3  4  5  6  7  8
     T       1  2  3  4  5  4  5  6  7
     A       2  1  2  3  4  5  6  7  8
     G       3  2  1  2  3  4  5  6  7
     T       4  3  2  3  4  3  4  5  6
     C       5  4  3  4  3  4  5  6  5
     A       6  5  4  3  4  5  6  7  6
     C       7  6  5  4  3  4  5  6  7
     G       8  7  6  5  4  5  4  5  6

Each cell c(i,j) is computed from c(i-1,j-1), c(i-1,j) and c(i,j-1).
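A short sketch of this edit-distance computation follows, assuming (as on the slides) that only match, insert and delete are allowed and that each insert or delete costs 1. The function name `edit_table` is hypothetical.

```python
def edit_table(a, b, indel=1):
    """Fill the edit-distance table using match (cost 0), delete and insert only."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = i * indel          # delete the first i characters of a
    for j in range(1, n + 1):
        c[0][j] = j * indel          # insert the first j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(c[i - 1][j] + indel,   # delete a[i-1]
                       c[i][j - 1] + indel)   # insert b[j-1]
            if a[i - 1] == b[j - 1]:
                best = min(best, c[i - 1][j - 1])  # match, cost 0
            c[i][j] = best
    return c

table = edit_table("TAGTCACG", "AGACTGTC")
print(table[-1][-1])
```

Because substitution is not an operation here, the distance equals the number of D and I symbols in the operation string (3 deletes and 3 inserts in the example).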
1- 45
Similarity
Two sequences s1 and s2.
p is the match value if ai = bj, else it is the mismatch value.
g is the gap penalty.
$$
c_{i,j} = \max
\begin{cases}
c_{i-1,j-1} + p \\
c_{i-1,j} + g \\
c_{i,j-1} + g
\end{cases}
$$
1- 46
Sequence Alignment
a = TAGTCACG
b = AGACTGTC

Alignment 1:
----TAGTCACG
AGACT-GTC---

Alignment 2:
TAGTCAC-G--
-AG--ACTGTC

Which one is better?
1- 47
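Sequence Alignment Formula
c_{0,0} = 0, c_{i,0} = -i, c_{0,j} = -j
$$
c_{i,j} = \max
\begin{cases}
c_{i-1,j-1} + 2 & \text{if } a_i = b_j \\
c_{i-1,j-1} - 1 & \text{if } a_i \ne b_j \\
c_{i-1,j} - 1 \\
c_{i,j-1} - 1
\end{cases}
$$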
1- 48
Sequence Alignment Example
TAGTCAC-G--
-AG--ACTGTC

        -   A   G   A   C   T   G   T   C
  -     0  -1  -2  -3  -4  -5  -6  -7  -8
  T    -1  -1  -2  -3  -4  -2  -3  -4  -5
  A    -2   1   0   0  -1  -2  -3  -4  -5
  G    -3   0   3   2   1   0   0  -1  -2
  T    -4  -1   2   2   1   3   2   2   1
  C    -5  -2   1   1   4   3   2   1   4
  A    -6  -3   0   3   3   3   2   1   3
  C    -7  -4  -1   2   5   4   3   2   3
  G    -8  -5  -2   1   4   4   6   5   4
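The alignment table can be reproduced with a few lines of code. This sketch assumes the scoring that is consistent with the table values: +2 for a match, -1 for a mismatch and -1 for each gap. The function name `alignment_table` is an assumption made for illustration.

```python
def alignment_table(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment scores: c[i][j] is the best score of a[:i] against b[:j]."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = i * gap            # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        c[0][j] = j * gap            # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = c[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            c[i][j] = max(diag, c[i - 1][j] + gap, c[i][j - 1] + gap)
    return c

table = alignment_table("TAGTCACG", "AGACTGTC")
print(table[-1][-1])
```

The bottom-right cell is the optimal global score. The alignment TAGTCAC-G-- over -AG--ACTGTC has 5 matches and 6 gaps, so it reaches 5 × 2 - 6 = 4.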
1- 49
Multiple Sequence Alignment
s1 = ATTCGAT
s2 = TTGAG
s3 = ATGCT

Alignment:
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT

If the number of sequences is k, and k is large, how do we solve the problem?
NP-complete problem
1- 50
Multiple Sequence Alignment - SP
Sum-of-pairs score = $\sum_{i<j} scoring(S_i, S_j)$
1- 51
Example of Sum-of-pairs Score
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT
For the alignment, the pairwise alignment scores are:
score(s1,s2) = 5
score(s2,s3) = 0
score(s1,s3) = 5
SP score = 10
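The pairwise scores in this example can be checked with a small sketch. It assumes a scoring scheme that reproduces the listed values: +2 per match, -1 per mismatch, -1 per column with one gap, and 0 for a column where both rows have a gap. The names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations

def pair_score(x, y, match=2, mismatch=-1, gap=-1):
    """Score two rows of a multiple alignment column by column."""
    total = 0
    for p, q in zip(x, y):
        if p == "-" and q == "-":
            continue                 # gap against gap: assumed to score 0
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows):
    """Sum-of-pairs score: add the pairwise score of every pair of rows."""
    return sum(pair_score(x, y) for x, y in combinations(rows, 2))

rows = ["ATTCGAT", "-TT-GAG", "AT--GCT"]
print([pair_score(x, y) for x, y in combinations(rows, 2)])
print(sp_score(rows))
```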
1- 52
Multiple Sequence Alignment - Star
Star alignment is an approximation of the sum-of-pairs (SP) scoring system.
Star alignment score = $\sum_{i \ne T} scoring(S_T, S_i)$, where S_T is the center sequence.
1- 53
Example of Star Score
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT
For the alignment, the pairwise alignment scores are:
score(s1,s2) = 5
score(s2,s3) = 0
score(s1,s3) = 5
Star score = max{score(s1,s2)+score(s1,s3), score(s2,s1)+score(s2,s3), score(s3,s1)+score(s3,s2)}
= max{5+5, 5+0, 5+0} = 10
1- 54
Multiple Sequence Alignment - Tree
Tree score = $\sum_{i,j} scoring(S_i, S_j) + \sum_{k,l} scoring(S_k, S_l)$
where Si and Sj are adjacent, Sk and Sl are adjacent.
1- 55
Fragment Assembly
Depending on experimental factors:
Fragment length can be as low as 200 or as high as 700.
Typical problems involve target sequences 30,000 to 100,000 base pairs long, and the total number of fragments is in the range 500 to 2000.
1- 56
Shortest Common Superstring
Given a set of k strings P = {s1, s2, …, sk}, find a shortest superstring s containing every string in P as a substring. That is, |s| is minimal.

ACCGT     --ACCGT--
CGTGC     ----CGTGC
TTAC      TTAC-----
TACCGT    -TACCGT--
          ---------
          TTACCGTGC

NP-complete problem
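For a handful of strings the problem can still be solved exhaustively. The sketch below is a brute-force illustration (exponential in the number of strings, so suitable only for tiny inputs): it drops strings contained in others, then tries every order and merges neighbours with the longest possible overlap. The function names are hypothetical.

```python
from itertools import permutations

def merge(s, t):
    """Append t to s, reusing the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return s + t[k:]
    return s + t

def shortest_superstring(strings):
    # A string contained in another one is covered automatically, so drop it.
    uniq = set(strings)
    core = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    best = None
    for order in permutations(core):
        candidate = ""
        for s in order:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

P = ["ACCGT", "CGTGC", "TTAC", "TACCGT"]
print(shortest_superstring(P))
```

Practical fragment assembly relies on heuristics such as greedy merging instead, because the exhaustive search grows factorially.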
1- 57
Physical Mapping
Given a (0,1) matrix of probes versus clones, reconstruct the relative positions of the clones or probes.
NP-complete problem
1- 58
Consecutive Ones Problem(1)
1- 59
Consecutive Ones Problem(2)
Consider a (0,1) matrix M, with rows indexed by clones and columns by probes, where position (i, j) is 1 if clone i contains probe j.
The problem is to permute the columns so that the ones in each row are consecutive.
A (0, 1) matrix has the k-consecutive ones property (k-C1P) if there exists a column order such that in each row the ones appear in at most k blocks of consecutive ones.
The k-consecutive ones problem: Does a given (0, 1) matrix have the k-consecutive ones property?
NP-complete for k ≥ 2
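To make the definition concrete for k = 1, the following sketch tests every column order of a small matrix. It is a brute-force illustration only (factorial time); the two matrices are made-up examples and the function names are hypothetical.

```python
from itertools import permutations

def rows_consecutive(matrix, order):
    """True if, with columns arranged in this order, each row's ones form one block."""
    for row in matrix:
        ones = [pos for pos, col in enumerate(order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def find_c1p_order(matrix):
    """Return a column order giving consecutive ones in every row, or None."""
    ncols = len(matrix[0])
    for order in permutations(range(ncols)):
        if rows_consecutive(matrix, order):
            return order
    return None

# Hypothetical clone-by-probe matrices.
good = [[1, 0, 1],
        [0, 1, 1]]
bad = [[1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]]   # every pair of columns would have to be adjacent
print(find_c1p_order(good))
print(find_c1p_order(bad))
```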
1- 60
Double Digest Problem(1)
enzyme a = |A| = {1, 3, 3, 12}
enzyme b = |B| = {1, 2, 3, 3, 4, 6}
c = |A B| = |C| = {1, 1, 1, 1, 2, 2, 2, 3, 6}
1- 61
Double Digest Problem(2)
Given the lengths of fragments, |Xi Xj|, 1 ≤ i ≤ j ≤ n, obtained by applying either one of the two restriction enzymes A and B, or both, determine the order of these fragments.
a = |A| = {ai: 1 ≤ i ≤ n} from the first digest
b = |B| = {bi: 1 ≤ i ≤ m} from the second digest.
c = |A B| = |C| = {ci: 1 ≤ i ≤ l} from first and second digests.
$$
\sum_{1 \le i \le n} a_i = \sum_{1 \le i \le m} b_i = \sum_{1 \le i \le l} c_i
$$
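For very small inputs the double digest problem can be solved by trying every ordering of the two single digests. The sketch below is such a brute-force search, run on the fragment lengths {1, 3, 3, 12}, {1, 2, 3, 3, 4, 6} and {1, 1, 1, 1, 2, 2, 2, 3, 6}; the function names are hypothetical, and the search is exponential in general.

```python
from itertools import accumulate, permutations

def cut_sites(fragments):
    """Interior cut positions implied by fragment lengths listed left to right."""
    return set(accumulate(fragments[:-1]))

def fragments_from_sites(sites, total):
    """Sorted fragment lengths produced by cutting [0, total] at the given sites."""
    points = [0] + sorted(sites) + [total]
    return sorted(q - p for p, q in zip(points, points[1:]))

def double_digest(a, b, c):
    """Return orderings (pa, pb) of a and b whose combined cuts yield c, or None."""
    total = sum(a)
    target = sorted(c)
    for pa in set(permutations(a)):
        sites_a = cut_sites(pa)
        for pb in set(permutations(b)):
            if fragments_from_sites(sites_a | cut_sites(pb), total) == target:
                return pa, pb
    return None

a = [1, 3, 3, 12]
b = [1, 2, 3, 3, 4, 6]
c = [1, 1, 1, 1, 2, 2, 2, 3, 6]
print(double_digest(a, b, c))
```

Several orderings usually satisfy the data (a mirrored map always does), so the search simply returns the first one it finds.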
1- 62
Evolutionary Trees(1)
[Figure: evolutionary tree of the following species]
siamang
gibbon
orangutan
human
gorilla
chimpanzee
1- 63
Evolutionary Trees(2)
Genome sequences: Given the genomes of several organisms, build an evolutionary tree in which the number of mutations (changes) is minimal.
Character matrix: Given a (0, 1) character state matrix of several organisms, build a perfect evolutionary tree.
Distance matrix: Given a distance matrix of several organisms, build a tree satisfying the distances between all organisms.
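For the character-matrix version there is a well-known test for binary characters: assuming state 0 is ancestral for every character, a perfect evolutionary tree exists exactly when, for every pair of characters, the sets of organisms carrying them are either disjoint or nested. A minimal sketch of that check follows; the two matrices are made-up examples and the function name is hypothetical.

```python
from itertools import combinations

def has_perfect_phylogeny(matrix):
    """Pairwise test for a binary character matrix (rows: organisms, columns: characters).

    Assumes the all-zero state is ancestral for every character.
    """
    ncols = len(matrix[0])
    # For each character, the set of organisms that have it (value 1).
    carriers = [{r for r, row in enumerate(matrix) if row[c] == 1}
                for c in range(ncols)]
    for s, t in combinations(carriers, 2):
        overlapping = bool(s & t)
        nested = s <= t or t <= s
        if overlapping and not nested:
            return False
    return True

# Hypothetical character matrices.
compatible = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]]
conflicting = [[1, 1],
               [1, 0],
               [0, 1]]
print(has_perfect_phylogeny(compatible))
print(has_perfect_phylogeny(conflicting))
```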
1- 64
Perfect Phylogeny(1)
1- 65
Protein Structure
1- 66
Protein Folding
Given the primary structure of a protein, compute or evaluate its 3-dimensional structure.
Primary structure (sequence):
1- 67
Protein Folding Problem
The characteristic of each amino acid:
H (hydrophobic, non-polar; water-hating)
P (hydrophilic, polar; water-loving)
The amino acid sequence of a protein can be viewed as a binary sequence of H's (1's) and P's (0's).
1- 68
Example of H-P Model
Input sequence: 011001001110010
[Figure: two foldings of the input sequence on a 2D lattice, one with Score = 5 and one with Score = 3.]
1- 69
Protein folding on H-P Model
Protein folding on the H-P model: Given a sequence of 1's (H's) and 0's (P's), find a self-avoiding path embedded in either a 2D or 3D lattice such that the number of pairs of adjacent 1's is maximized.
NP-complete even for the 2D lattice.
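The objective can be illustrated with a small 2D sketch. It assumes that a pair of 1's counts only when the two residues are neighbours on the lattice but not consecutive in the sequence (consecutive residues are always neighbours, so counting them would just add a constant). The exhaustive search is exponential and meant only for very short sequences; the function names are hypothetical.

```python
def hp_score(seq, path):
    """Count lattice-adjacent pairs of 1's that are not consecutive in seq."""
    pos = {p: i for i, p in enumerate(path)}
    assert len(pos) == len(seq), "path must be self-avoiding and as long as seq"
    score = 0
    for (x, y), i in pos.items():
        if seq[i] != "1":
            continue
        # Look only right and up so that each pair is counted once.
        for nb in ((x + 1, y), (x, y + 1)):
            j = pos.get(nb)
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                score += 1
    return score

def best_fold(seq):
    """Try every self-avoiding path on the 2D grid; return (best score, path)."""
    best = (-1, None)

    def extend(path, used):
        nonlocal best
        if len(path) == len(seq):
            s = hp_score(seq, path)
            if s > best[0]:
                best = (s, list(path))
            return
        x, y = path[-1]
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb not in used:
                used.add(nb)
                path.append(nb)
                extend(path, used)
                path.pop()
                used.remove(nb)

    extend([(0, 0)], {(0, 0)})
    return best

square = [(0, 0), (1, 0), (1, 1), (0, 1)]   # hypothetical fold of a 4-residue chain
print(hp_score("1001", square))
print(best_fold("1001")[0])
```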
1- 70
RNA Secondary Structure Prediction (1)
RNA: {A, G, C, U}
Base pairs:
G≡C (Watson-Crick base pair)
A=U (Watson-Crick base pair)
G-U (wobble base pair)
ρ(a,b) is defined as 1 if a and b can form a base pair; otherwise it is 0.
1- 71
RNA Secondary Structure Prediction (2)
1- 72
RNA Secondary Structure Prediction (3)
X(i,j) is the maximum number of base pairs in the sequence a_i a_{i+1} … a_j, i ≤ j.
Dynamic Programming:
X(i,j) = 0 if |j − i| ≤ 1.
$$
X(i,j) = \max\left\{ X(i,j-1),\ \max_{i \le k \le j-1} \left[ X(i,k-1) + 1 + X(k+1,j-1) \right] \rho(a_k, a_j) \right\}
$$
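The dynamic program for the maximum number of base pairs can be sketched as below. This is an illustration under stated assumptions rather than the slides' own code: the allowed pairs are G-C, A-U and G-U, the helper `can_pair` is a hypothetical name, and two directly adjacent bases are never paired (so k runs only up to j − 2), in line with the base case X(i,j) = 0 for |j − i| ≤ 1.

```python
from functools import lru_cache

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def can_pair(a, b):
    """1 if bases a and b can form a Watson-Crick or wobble pair, else 0."""
    return 1 if (a, b) in PAIRS else 0

def max_base_pairs(seq):
    """Maximum number of non-crossing base pairs in seq."""

    @lru_cache(maxsize=None)
    def X(i, j):
        if j - i <= 1:               # empty, single base, or two adjacent bases
            return 0
        best = X(i, j - 1)           # base j stays unpaired
        for k in range(i, j - 1):    # base j pairs with base k (assumed k <= j - 2)
            if can_pair(seq[k], seq[j]):
                best = max(best, X(i, k - 1) + 1 + X(k + 1, j - 1))
        return best

    return X(0, len(seq) - 1) if seq else 0

print(max_base_pairs("GGAACC"))
print(max_base_pairs("GAU"))
```

The recursion visits O(n^2) subproblems and spends O(n) on each, so it runs in O(n^3) time for a sequence of length n.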
1- 73
Reference - Books
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Dan Gusfield, Cambridge University Press, 1997.
Introduction to Computational Molecular Biology, Joao Carlos Setubal and Joao Meidanis, PWS Pub., 1997.
Introduction to Computational Biology: Maps, Sequences and Genomes, Michael S. Waterman, Chapman & Hall, 1995.
Manuscript of Prof. R. C. T. Lee http://www.csie.ncnu.edu.tw/~rctlee/biology.html
1- 74
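Reference – Books (Biology)
Biology (生物學), by C. Starr & R. Taggart; translated and edited by 丁澤民, 王偉, 張世衿, 連慧瑞
Modern Molecular Biology (現代分子生物學), by 朱玉賢 and 李毅; 藝軒出版社
Introduction to Molecular Biology (分子生物學入門), by 駒野徹 and 酒井裕; translated by 何士慶; 科技圖書
Illustrated DNA Handbook, a glossary of terms (DNA 圖解小百科), by 威惹利, 培瑞, 李哈; translated by 潘震澤; 新新聞文化公司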
1- 75
Reference – Journals (1)
Bioinformatics (SCI)
Bulletin of Mathematical Biology (SCI)
Computer Applications in the Biosciences
Journal of Computational Biology (SCI Expanded)
Journal of Mathematical Biology (SCI)
Journal of Molecular Biology (SCI)
Nucleic Acids Research (SCI)
Gene (SCI)
Science (SCI)
1- 76
Reference – Journals (2)
Genome Research (SCI)
PROTEINS: Structure, Function, and Bioinformatics (SCI)
Current Opinion in Structural Biology (SCI)
BMC Bioinformatics (SCI Expanded)
Computational Biology and Chemistry (SCI)
BioSystems (SCI)
1- 77
Reference – Web Sites
C. B. Yang http://par.cse.nsysu.edu.tw (Bioweb link of C. B. Yang)
BioWeb http://bioweb.uwlax.edu/
MIT Biology Hypertextbook http://esg-www.mit.edu:8001/esgbio/
Bioinformatics Related Journals http://www.iscb.org/journals.html
NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/
1- 78
Conclusion (1)
Bioinformatics and Computer Science
Algorithm: all computing problems.
Image processing: 3D images of RNA folds or proteins.
Database: massive databases and retrieval.
Distributed systems and parallel processing: massive storage and accelerated computation.
1- 79
Conclusion (2)
Biology easily has 500 years of exciting problems to work on.
-- Donald E. Knuth