
1- 1

Chapter 1

Introduction

1- 2

Introduction – Gene History

1865 Mendel: The basic unit of inheritance is a gene. Mendel’s work was forgotten until about 1900.

1944 The gene was shown to be made of DNA (deoxyribonucleic acid).

1953 James Watson and Francis Crick: the double-helical structure of DNA.

1- 3

Introduction – Gene History (Cont.)

1990 The Human Genome Project started.

1995 The first free-living organism to be sequenced: Haemophilus influenzae.

1998 Celera joined the genome sequencing effort.

2000 The draft of the human DNA sequence was completed (published in 2001).

1- 4

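Bioinformatics - Related Domestic Programs

2000: National Science Council interdisciplinary research on "Bioinformatics".

2001: National Science Council national research program: National Research Program for Genomic Medicine.

2001: National Science Council interdisciplinary project research:

- Department of Engineering: information technology
- Department of Life Sciences: bioinformatics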

1- 5

Animal cell (nucleus, cytoplasm, cell membrane)

DNA is located inside the cell nucleus.

1- 6

DNA Double Helix

1- 7


1- 8

Bonds between nucleotides in DNA

1- 9

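Nucleotide

Nucleotides are the building blocks of nucleic acid molecules. A nucleotide consists of:

- a five-carbon sugar (deoxyribose in DNA)
- a phosphate group
- one of the nitrogenous bases (A, G, C, T, U)

[Figure: a nucleotide with the base cytosine (C).]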

1- 10

The four nitrogenous bases of DNA

1- 11


1- 12

DNA Sequence

1- 13

DNA and RNA

Nucleotides:

- adenine (A)
- guanine (G)
- cytosine (C)
- thymine (T)
- uracil (U)

DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)

RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)

1- 14

DNA Length

The total length of the human DNA is about $3 \times 10^9$ (3 billion) base pairs.

Only about 1% to 1.5% of the DNA sequence is "useful" (protein-coding).

Number of human genes: 30,000 to 40,000. This is a conclusion of the Human Genome Project; the number originally expected was 100,000.

1- 15

DNA Sequencing

Given DNA sequence:

TGCACTTGAACGCATGCT

Cut the sequence at a randomly chosen A; each fragment runs from that A to the end of the sequence:

ATGCT length=5

ACGCATGCT length=9

AACGCATGCT length=10

ACTTGAACGCATGCT length=15
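The fragment list on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `fragments_starting_at` is made up here, and the input is the 18-base sequence implied by the four listed fragments (TGCACTTGAACGCATGCT).

```python
def fragments_starting_at(sequence, base="A"):
    """Return every suffix of `sequence` that begins with `base`, shortest first."""
    suffixes = [sequence[i:] for i, ch in enumerate(sequence) if ch == base]
    return sorted(suffixes, key=len)


# Assumed input: the sequence implied by the fragments on the slide.
dna = "TGCACTTGAACGCATGCT"
for fragment in fragments_starting_at(dna):
    print(fragment, "length =", len(fragment))
```

Sorting the fragments by length mirrors what electrophoresis does physically: shorter fragments travel farther, so reading the lengths in order gives the positions of A in the sequence.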

1- 16

DNA Sequencing: Electrophoresis

1- 17

DNA Sequencing

1- 18

Amino Acids

Amino acids are the basic units of proteins; there are 20 kinds.

1- 19

General Structure of an Amino Acid

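[Figure: structure of an amino acid, with the carboxyl group (COO), the amino group (H3N+), and the R group labeled; in the example, the R group is the chain CH2-CH2-CH2-CH2-NH3+.]

An amino acid has 3 groups:

- Amino group
- Carboxyl group
- R group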

1- 20

Amino Acid Molecules

1- 21

Amino Acid Molecules

1- 22

Protein Molecules

1- 23

Amino Acids and RNA

Every three nucleotides (a codon, the unit of the genetic code) correspond to one amino acid.

First     Second position                                               Third
position  U              C              A              G                position

U         UUU Phe [F]    UCU Ser [S]    UAU Tyr [Y]    UGU Cys [C]      U
          UUC Phe [F]    UCC Ser [S]    UAC Tyr [Y]    UGC Cys [C]      C
          UUA Leu [L]    UCA Ser [S]    UAA Ter [end]  UGA Ter [end]    A
          UUG Leu [L]    UCG Ser [S]    UAG Ter [end]  UGG Trp [W]      G

C         CUU Leu [L]    CCU Pro [P]    CAU His [H]    CGU Arg [R]      U
          CUC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]      C
          CUA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]      A
          CUG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]      G

A         AUU Ile [I]    ACU Thr [T]    AAU Asn [N]    AGU Ser [S]      U
          AUC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]      C
          AUA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]      A
          AUG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]      G

G         GUU Val [V]    GCU Ala [A]    GAU Asp [D]    GGU Gly [G]      U
          GUC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]      C
          GUA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]      A
          GUG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]      G

AUG is also the “start” codon.
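As a rough illustration of how the codon table is used, the sketch below translates an mRNA string into one-letter amino acid codes. The function name `translate` and the sample RNA strings are assumptions made for this example; the 64-letter string encodes the table above in the order U, C, A, G for each codon position, with `*` standing for Ter [end].

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = rna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":  # UAA, UAG or UGA: end of translation
            break
        protein.append(amino)
    return "".join(protein)


print(translate("AUGUUUGGCUAA"))  # AUG UUU GGC UAA
```

Scanning for the first AUG is a simplification; real translation initiation depends on more context than the start codon alone.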

1- 24

From DNA via RNA to Protein

1- 25

DNA, Genes and Proteins

DNA: the program for cell processes.

Proteins: execute cell processes.

[Figure: a stretch of DNA containing a gene, and the protein produced from that gene.]

1- 26

Promoter and Gene

[Figure: layout of a gene on the DNA from upstream to downstream. Labels: promoter, TATA, TTG, transcriptional start site, ATG, exon, intron, TAG, transcriptional termination site.]

1- 27

Regulation of Genes

[Figure by Blanchette. Labels: DNA, gene, regulatory element, RNA polymerase (protein), transcription factor (protein).]

1- 28

Regulation of Genes

[Figure by Blanchette. Labels: DNA, gene, regulatory element, RNA polymerase, transcription factor (protein).]

1- 29

Regulation of Genes

[Figure by Blanchette. Labels: DNA, gene, regulatory element, RNA polymerase, transcription factor, new protein.]

1- 30

From DNA via RNA to Protein

1- 31

From RNA to Protein

1- 32

From RNA to Protein

1- 33

Primary Structure of Protein

Amino acid sequence of bovine insulin (a protein)

1- 34

Secondary Structure of Protein

1- 35

Tertiary Structure of Protein

Tertiary structure of the hemoglobin molecule

1- 36

Quaternary Structure of Protein

Quaternary structure of the hemoglobin molecule

1- 37

Problems on Different Levels

1- 38

Some Problems in Bioinformatics

- Sequence comparison
  - Longest common subsequence
  - Edit distance
  - Similarity
  - Multiple sequence alignment
- Fragment assembly of DNA sequences
  - Shortest common superstring
- Physical mapping
  - Double digest problem
  - Consecutive ones problem
- Evolutionary trees
- Molecular structure prediction
  - Protein folding

1- 39

Sequence Comparison

Goals:

- Database search: given a sequence S and a set of sequences G, find all the sequences in G that are similar to S.
- Similarity: find which parts of the sequences are alike and which parts differ.
  - Sequence alignment (global alignment)
  - Local alignment

1- 40

Sequence Alignment

Global alignment

Local alignment

1- 41

Longest Common Subsequence(1)

To find a longest common subsequence between two strings.

string1: TAGTCACG

string2: AGACTGTC

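LCS: AGACG

Dynamic programming, where $c_{i,j}$ is the length of a longest common subsequence of $a_1 \ldots a_i$ and $b_1 \ldots b_j$, and $c_{i,0} = c_{0,j} = 0$:

$$c_{i,j} = \max \begin{cases} c_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ c_{i-1,j} + 0 & \text{if } a_i \ne b_j \\ c_{i,j-1} + 0 & \text{if } a_i \ne b_j \end{cases}$$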

1- 42

Longest Common Subsequence(2)

S1 = TAGTCACG (rows), S2 = AGACTGTC (columns), LCS: AGACG

      -  A  G  A  C  T  G  T  C
  -   0  0  0  0  0  0  0  0  0
  T   0  0  0  0  0  1  1  1  1
  A   0  1  1  1  1  1  1  1  1
  G   0  1  2  2  2  2  2  2  2
  T   0  1  2  2  2  3  3  3  3
  C   0  1  2  2  3  3  3  3  4
  A   0  1  2  3  3  3  3  3  4
  C   0  1  2  3  4  4  4  4  4
  G   0  1  2  3  4  4  5  5  5
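The table-filling procedure for the longest common subsequence can be sketched in Python as follows. The function name `lcs` and the traceback tie-breaking rule are choices made for this example; when several longest common subsequences exist, the traceback returns one of them, which need not be the one shown on the slide.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # c[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Trace back from the bottom-right corner to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


result = lcs("TAGTCACG", "AGACTGTC")
print(result, len(result))
```

The table takes O(mn) time and space for strings of lengths m and n.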

1- 43

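Edit Distance(1)

To find a smallest edit process between two strings.

S1: TAGTCAC-G--

S2: -AG--ACTGTC

Operation: DMMDDMMIMII (D = delete, M = match, I = insert)

$$c_{i,j} = \min \begin{cases} c_{i-1,j-1} + 0 & \text{(match, } a_i = b_j\text{)} \\ c_{i-1,j} + \mathrm{dist}(a_i) & \text{(delete)} \\ c_{i,j-1} + \mathrm{dist}(b_j) & \text{(insert)} \end{cases}$$

Suppose $\mathrm{dist}(a_i) = \mathrm{dist}(b_j) = 1$.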

1- 44

Edit Distance(2)

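S1: TAGTCAC-G--

S2: -AG--ACTGTC

Operation: DMMDDMMIMII

S1 = TAGTCACG (rows), S2 = AGACTGTC (columns):

      -  A  G  A  C  T  G  T  C
  -   0  1  2  3  4  5  6  7  8
  T   1  2  3  4  5  4  5  6  7
  A   2  1  2  3  4  5  6  7  8
  G   3  2  1  2  3  4  5  6  7
  T   4  3  2  3  4  3  4  5  6
  C   5  4  3  4  3  4  5  6  5
  A   6  5  4  3  4  5  6  7  6
  C   7  6  5  4  3  4  5  6  7
  G   8  7  6  5  4  5  4  5  6

Each entry $c_{i,j}$ is computed from its three neighbors $c_{i-1,j-1}$, $c_{i-1,j}$, and $c_{i,j-1}$.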
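A minimal sketch of this edit distance in Python is given below. It assumes, as on the slide, that only match, delete, and insert are allowed (there is no substitution operation) and that each delete or insert costs 1; the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a, b, cost=1):
    """Insert/delete-only edit distance between strings a and b."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = i * cost          # delete the first i symbols of a
    for j in range(1, n + 1):
        c[0][j] = j * cost          # insert the first j symbols of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(c[i - 1][j] + cost,      # delete a[i-1]
                       c[i][j - 1] + cost)      # insert b[j-1]
            if a[i - 1] == b[j - 1]:
                best = min(best, c[i - 1][j - 1])  # match costs 0
            c[i][j] = best
    return c[m][n]


print(edit_distance("TAGTCACG", "AGACTGTC"))  # bottom-right entry of the table: 6
```

With unit costs and no substitution, the result equals len(a) + len(b) minus twice the LCS length, which is 8 + 8 - 2·5 = 6 for this pair.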

1- 45

Similarity

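Two sequences $s_1 = a_1 a_2 \ldots$ and $s_2 = b_1 b_2 \ldots$

$p$ is the match value if $a_i = b_j$; otherwise it is the mismatch value.

$g$ is the gap penalty.

$$c_{i,j} = \max \begin{cases} c_{i-1,j-1} + p & \text{(align } a_i \text{ with } b_j\text{)} \\ c_{i-1,j} + g & \text{(align } a_i \text{ with a gap)} \\ c_{i,j-1} + g & \text{(align } b_j \text{ with a gap)} \end{cases}$$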

1- 46

Sequence Alignment

a = TAGTCACG

b = AGACTGTC

Alignment 1:

----TAGTCACG
AGACT-GTC---

Alignment 2:

TAGTCAC-G--
-AG--ACTGTC

Which one is better?

1- 47

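Sequence Alignment Formula

$c_{0,0} = 0$, $c_{i,0} = -i$, $c_{0,j} = -j$

$$c_{i,j} = \max \begin{cases} c_{i-1,j-1} + 2 & \text{if } a_i = b_j \\ c_{i-1,j-1} - 1 & \text{if } a_i \ne b_j \\ c_{i-1,j} - 1 \\ c_{i,j-1} - 1 \end{cases}$$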

1- 48

Sequence Alignment Example

TAGTCAC-G--
-AG--ACTGTC

Rows: TAGTCACG; columns: AGACTGTC.

       -   A   G   A   C   T   G   T   C
  -    0  -1  -2  -3  -4  -5  -6  -7  -8
  T   -1  -1  -2  -3  -4  -2  -3  -4  -5
  A   -2   1   0   0  -1  -2  -3  -4  -5
  G   -3   0   3   2   1   0   0  -1  -2
  T   -4  -1   2   2   1   3   2   2   1
  C   -5  -2   1   1   4   3   2   1   4
  A   -6  -3   0   3   3   3   2   1   3
  C   -7  -4  -1   2   5   4   3   2   3
  G   -8  -5  -2   1   4   4   6   5   4
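The global alignment table can be computed with the short Python sketch below. The scoring values (match +2, mismatch -1, gap -1) are the ones that reproduce the table on this slide; the function name `global_alignment_score` is an assumption of this example.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = i * gap           # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        c[0][j] = j * gap           # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            c[i][j] = max(c[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                          c[i - 1][j] + gap,        # a[i-1] against a gap
                          c[i][j - 1] + gap)        # b[j-1] against a gap
    return c[m][n]


print(global_alignment_score("TAGTCACG", "AGACTGTC"))  # bottom-right entry: 4
```

The alignment TAGTCAC-G-- over -AG--ACTGTC achieves this score: 5 matches at +2 and 6 gaps at -1 give 10 - 6 = 4.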

1- 49

Multiple Sequence Alignment

s1 = ATTCGAT

s2 = TTGAG

s3 = ATGCT

Alignment:

s1 = ATTCGAT

s2 = -TT-GAG

s3 = AT--GCT

If the number of sequences is k, and k is large, how can the problem be solved? It is an NP-complete problem.

1- 50

Multiple Sequence Alignment - SP

Sum-of-pairs score:

$$\text{SP score} = \sum_{i < j} \text{scoring}(S_i, S_j)$$

1- 51

Example of Sum-of-pairs Score

s1 = ATTCGAT

s2 = -TT-GAG

s3 = AT--GCT

For the alignment, the pairwise alignment scores are:

score(s1,s2) = 5

score(s2,s3) = 0

score(s1,s3) = 5

SP score = 5 + 0 + 5 = 10

1- 52

Multiple Sequence Alignment - Star

Star alignment is an approximation of the sum-of-pairs (SP) scoring system.

$$\text{Star alignment score} = \sum_{i \ne T} \text{scoring}(S_T, S_i)$$

where $S_T$ is the center sequence of the star.

1- 53

Example of Star Score

s1 = ATTCGAT

s2 = -TT-GAG

s3 = AT--GCT

For the alignment, the pairwise alignment scores are:

score(s1,s2) = 5

score(s2,s3) = 0

score(s1,s3) = 5

Star score = max{score(s1,s2)+score(s1,s3), score(s2,s1)+score(s2,s3), score(s3,s1)+score(s3,s2)} = max{5+5, 5+0, 5+0} = 10
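The sum-of-pairs and star scores of the three-row alignment can be checked with the sketch below. The column scoring is an assumption inferred from the pairwise scores on the slides: match +2, mismatch -1, a symbol against a gap -1, and a gap against a gap 0. The function names are made up for this example.

```python
from itertools import combinations


def pair_score(x, y, match=2, mismatch=-1, gap=-1):
    """Score two rows of a multiple alignment, column by column."""
    total = 0
    for p, q in zip(x, y):
        if p == "-" and q == "-":
            continue                # gap against gap scores 0 (assumed)
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows):
    """Sum of pair_score over all unordered pairs of rows."""
    return sum(pair_score(x, y) for x, y in combinations(rows, 2))


def star_score(rows):
    """Best total score of one center row against all other rows."""
    return max(
        sum(pair_score(rows[t], rows[i]) for i in range(len(rows)) if i != t)
        for t in range(len(rows))
    )


alignment = ["ATTCGAT", "-TT-GAG", "AT--GCT"]
print(sp_score(alignment), star_score(alignment))  # 10 10
```

For k rows the sum-of-pairs score evaluates k(k-1)/2 pairs, while the star score for a fixed center evaluates only k-1.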

1- 54

Multiple Sequence Alignment - Tree

$$\text{Tree score} = \sum_{(i,j)} \text{scoring}(S_i, S_j)$$

where the sum is taken over every pair of sequences $S_i$ and $S_j$ that are adjacent in the tree.

1- 55

Fragment Assembly

Depending on experimental factors:

- Fragment length can be as low as 200 or as high as 700 base pairs.
- Typical problems involve target sequences 30,000 to 100,000 base pairs long, and the total number of fragments is in the range 500 to 2000.

1- 56

Shortest Common Superstring

Given a set of k strings P = {s1, s2, …, sk}, find a shortest superstring s containing every string in P as a substring. That is, |s| is minimal.

Example:

ACCGT     --ACCGT--
CGTGC     ----CGTGC
TTAC      TTAC-----
TACCGT    -TACCGT--
          ---------
          TTACCGTGC

NP-complete problem
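For a handful of strings, a shortest common superstring can be found by trying every ordering and merging neighbors with their maximum overlap. The sketch below is a brute-force illustration only (its running time grows factorially with the number of strings), and the names `overlap` and `shortest_common_superstring` are chosen for this example.

```python
from itertools import permutations


def overlap(x, y):
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search; practical only for a few short strings."""
    unique = set(strings)
    # A string contained in another one is covered automatically, so drop it.
    core = [s for s in unique if not any(s != t and s in t for t in unique)]
    best = None
    for order in permutations(core):
        merged = order[0] if order else ""
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))
```

On the four strings from the slide, the best ordering is TTAC, TACCGT, CGTGC, which merges to TTACCGTGC. Practical fragment assemblers rely on heuristics such as greedy merging instead of exhaustive search.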

1- 57

Physical Mapping

Given a (0,1) matrix of probes versus clones, reconstruct the relative positions of the clones or probes.

NP-complete problem

1- 58

Consecutive Ones Problem(1)

1- 59

Consecutive Ones Problem(2)

Consider a (0,1) matrix M, with rows indexed by clones and columns by probes, where position (i, j) is 1 if clone i contains probe j.

The problem is to permute the columns so that the ones in each row are consecutive.

A (0,1) matrix has the k-consecutive ones property (k-C1P) if there exists a column order such that in each row the ones appear in at most k consecutive blocks.

The k-consecutive ones problem: does a given (0,1) matrix have the k-consecutive ones property? It is NP-complete for $k \ge 2$.
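The plain consecutive ones property (a single block of ones per row) can be illustrated on tiny matrices by trying every column order, as in the sketch below. This brute force is for illustration only and the function names are assumptions; it is not one of the efficient algorithms known for this case.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if all 1s in the row form a single block (or there are none)."""
    ones = [i for i, v in enumerate(row) if v == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_c1p_order(matrix):
    """Return a column order making every row's 1s consecutive, or None."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[c] for c in order]) for row in matrix):
            return order
    return None


# Rows are clones, columns are probes (hypothetical toy data).
clones = [[1, 0, 1],
          [0, 1, 1]]
print(find_c1p_order(clones))
```

A matrix in which one probe must sit next to three different probes, such as rows {0,1}, {0,2}, {0,3} over four columns, has no valid order and the function returns None.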

1- 60

Double Digest Problem(1)

enzyme a = |A| = {1, 3, 3, 12}

enzyme b = |B| = {1, 2, 3, 3, 4, 6}

c = |A ∧ B| = |C| = {1, 1, 1, 1, 2, 2, 2, 3, 6} (both enzymes applied)

1- 61

Double Digest Problem(2)

Given the lengths of fragments, $|X_i - X_j|$, $1 \le i \le j \le n$, obtained by applying either one of the two restriction enzymes A and B, or both, determine the order of these fragments.

- $a = |A| = \{a_i : 1 \le i \le n\}$ from the first digest.
- $b = |B| = \{b_i : 1 \le i \le m\}$ from the second digest.
- $c = |A \wedge B| = |C| = \{c_i : 1 \le i \le l\}$ from the first and second digests together.

$$\sum_{1 \le i \le n} a_i = \sum_{1 \le i \le m} b_i = \sum_{1 \le i \le l} c_i$$
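A brute-force sketch of the double digest problem is shown below: it tries every ordering of the A fragments and of the B fragments and keeps one whose combined cut sites produce the multiset c. The function names are assumptions of this example, and exhaustive search is only feasible for small inputs such as the one on the previous slide.

```python
from itertools import permutations


def cut_sites(order):
    """Interior cut positions when fragments are laid out in this order."""
    sites, pos = set(), 0
    for length in order[:-1]:
        pos += length
        sites.add(pos)
    return sites


def double_digest(a_order, b_order):
    """Sorted fragment lengths after applying both sets of cuts."""
    total = sum(a_order)
    cuts = sorted(cut_sites(a_order) | cut_sites(b_order) | {0, total})
    return sorted(y - x for x, y in zip(cuts, cuts[1:]))


def solve_double_digest(a, b, c):
    """Return (a_order, b_order) consistent with c, or None."""
    target = sorted(c)
    for a_order in set(permutations(a)):
        for b_order in set(permutations(b)):
            if double_digest(a_order, b_order) == target:
                return a_order, b_order
    return None


a = [1, 3, 3, 12]
b = [1, 2, 3, 3, 4, 6]
c = [1, 1, 1, 1, 2, 2, 2, 3, 6]
print(solve_double_digest(a, b, c))
```

One consistent answer for this input is the A order 12, 3, 3, 1 with the B order 6, 1, 3, 4, 2, 3; the search may return a different but equally valid pair, since solutions are generally not unique.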

1- 62

Evolutionary Trees(1)

siamang

gibbon

orangutan

human

gorilla

chimpanzee

1- 63

Evolutionary Trees(2)

- Genome sequences: given the genomes of several organisms, build an evolutionary tree in which the number of mutations (changes) is minimal.
- Character matrix: given a (0,1) character state matrix of several organisms, build a perfect evolutionary tree.
- Distance matrix: given a distance matrix of several organisms, build a tree satisfying the distances between all organisms.

1- 64

Perfect Phylogeny(1)

1- 65

Protein Structure

1- 66

Protein Folding

Given the primary structure of a protein, compute or evaluate its 3-dimensional structure.

Primary structure (sequence):

1- 67

Protein Folding Problem

The characteristic of each amino acid:

- H (hydrophobic, non-polar): "water-hating"
- P (hydrophilic, polar): "water-loving"

The amino acid sequence of a protein can be viewed as a binary sequence of H’s (1’s) and P’s (0’s).

1- 68

Example of H-P Model

Input sequence: 011001001110010

[Figure: two foldings of the input sequence on a 2D lattice, one with Score = 5 and one with Score = 3.]

1- 69

Protein folding on H-P Model

The protein folding problem on the H-P model: given a sequence of 1’s (H’s) and 0’s (P’s), find a self-avoiding path embedded in either a 2D or 3D lattice such that the number of pairs of adjacent 1’s is maximized.

NP-complete even for the 2D lattice.
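The H-P model can be explored on very short sequences with an exhaustive search over self-avoiding paths on the 2D square lattice, as sketched below. Two assumptions are made here: only pairs of 1’s that are lattice neighbors but not consecutive in the chain are counted (consecutive pairs are adjacent in every folding), and the function names are invented for this example. The search is exponential, so it is suitable only for sequences of roughly ten symbols or fewer.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def contacts(seq, path):
    """Count lattice-adjacent pairs of 1s that are not chain neighbours."""
    where = {p: i for i, p in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "1":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = where.get((x + dx, y + dy))
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                count += 1
    return count


def best_fold_score(seq):
    """Maximum contact count over all self-avoiding 2D foldings of seq."""
    best = 0

    def extend(path, used):
        nonlocal best
        if len(path) == len(seq):
            best = max(best, contacts(seq, path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in used:
                used.add(nxt)
                path.append(nxt)
                extend(path, used)
                path.pop()
                used.remove(nxt)

    if seq:
        extend([(0, 0)], {(0, 0)})
    return best


print(best_fold_score("1001"))  # folding into a unit square brings both 1s together
```

On a square lattice two positions can only touch if their chain indices differ by an odd number, which is why a sequence such as 10101 scores 0 under this counting.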

1- 70

RNA Secondary Structure Prediction (1)

RNA: {A, G, C, U}

Base pairs:

- G≡C (Watson-Crick base pair)
- A=U (Watson-Crick base pair)
- G-U (wobble base pair)

$\rho(a,b)$ is defined as 1 if a and b can form a base pair; otherwise it is 0.

1- 71


1- 72


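$X(i,j)$ is the maximum number of base pairs in the sequence $a_i a_{i+1} \ldots a_j$, $i \le j$.

Dynamic Programming:

$X(i,j) = 0$ if $|j - i| \le 1$.

$$X(i,j) = \max \begin{cases} X(i, j-1) \\ \max\limits_{i \le k < j-1} \left[ X(i, k-1) + 1 + X(k+1, j-1) \right] \rho(a_k, a_j) \end{cases}$$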
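The base-pair maximization recurrence can be sketched in Python with memoization. The version below assumes that X(i, j) = 0 whenever j - i ≤ 1 (so adjacent bases never pair) and that k ranges over i ≤ k < j - 1; the function name `max_base_pairs` and the test strings are assumptions of this example.

```python
from functools import lru_cache

# Pairs allowed on the slide: G-C and A-U (Watson-Crick), G-U (wobble).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna):
    """Maximum number of non-crossing base pairs in the RNA string."""

    @lru_cache(maxsize=None)
    def X(i, j):
        if j - i <= 1:               # empty, single base, or adjacent bases
            return 0
        best = X(i, j - 1)           # a_j is left unpaired
        for k in range(i, j - 1):    # a_k pairs with a_j, i <= k < j-1
            if (rna[k], rna[j]) in PAIRS:
                best = max(best, X(i, k - 1) + 1 + X(k + 1, j - 1))
        return best

    return X(0, len(rna) - 1)


print(max_base_pairs("GGAACC"))  # G-C pairs at (0,5) and (1,4)
```

There are O(n²) subproblems and each one scans O(n) split points, so the running time is O(n³) for a sequence of length n. The recursive form is fine for short examples; long sequences would need an iterative table to avoid deep recursion.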

1- 73

Reference - Books

- Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
- Joao Carlos Setubal and Joao Meidanis, Introduction to Computational Molecular Biology, PWS Pub., 1997.
- Michael S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman & Hall, 1995.
- Manuscript of Prof. R. C. T. Lee, http://www.csie.ncnu.edu.tw/~rctlee/biology.html

1- 74

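Reference – Books (Biology)

- Biology (生物學), by C. Starr & R. Taggart; translated and edited by 丁澤民, 王偉, 張世衿, and 連慧瑞.
- Modern Molecular Biology (現代分子生物學), edited by 朱玉賢 and 李毅; 藝軒出版社.
- Introduction to Molecular Biology (分子生物學入門), by 駒野徹 and 酒井裕; translated by 何士慶; 科技圖書.
- An Illustrated Guide to DNA, with glossary (DNA 圖解小百科), by 威惹利, 培瑞, and 李哈; translated by 潘震澤; 新新聞文化公司.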

1- 75

Reference – Journals (1)

- Bioinformatics (SCI)
- Bulletin of Mathematical Biology (SCI)
- Computer Applications in the Biosciences
- Journal of Computational Biology (SCI Expanded)
- Journal of Mathematical Biology (SCI)
- Journal of Molecular Biology (SCI)
- Nucleic Acids Research (SCI)
- Gene (SCI)
- Science (SCI)

1- 76

Reference – Journals (2)

- Genome Research (SCI)
- PROTEINS: Structure, Function, and Bioinformatics (SCI)
- Gene (SCI)
- Current Opinion in Structural Biology (SCI)
- Protein-Structure Function and Bioinformatics (SCI)
- BMC Bioinformatics (SCI Expanded)
- Computational Biology and Chemistry (SCI)
- BioSystems (SCI)

1- 77

Reference – Web Sites

- C. B. Yang: http://par.cse.nsysu.edu.tw
- Bioweb link of C. B. Yang
- BioWeb: http://bioweb.uwlax.edu/
- MIT Biology Hypertextbook: http://esg-www.mit.edu:8001/esgbio/
- Bioinformatics Related Journals: http://www.iscb.org/journals.html
- NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/

1- 78

Conclusion (1)

Bioinformatics and Computer Science:

- Algorithm: all computing problems.
- Image processing: 3D images of RNA folds or proteins.
- Database: massive databases and retrieval.
- Distributed systems and parallel processing: massive storage and accelerated computation.

1- 79

Conclusion (2)

Biology easily has 500 years of exciting problems to work on.

-- Donald E. Knuth
