Page 1: Bioinformation Technology: Case Studies in Bioinformatics and Biocomputing with DNA Chips Byoung-Tak Zhang Center for Bioinformation Technology (CBIT)

Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT)
Seoul National University
[email protected]
http://bi.snu.ac.kr/~btzhang

Page 2: Outline

- Bioinformation Technology
- Bioinformatics
- DNA Chip Data Analysis: IT for BT
- DNA Computing: BT for IT
- DNA Computing with DNA Chips
- Outlook

Page 3: Human Genome Project

Genome Health Implications
- A New Disease Encyclopedia
- New Genetic Fingerprints
- New Diagnostics
- New Treatments

Goals
- Identify the approximately 40,000 genes in human DNA
- Determine the sequences of the 3 billion bases that make up human DNA
- Store this information in databases
- Develop tools for data analysis
- Address the ethical, legal and social issues that arise from genome research

Page 4: Bioinformatics vs. Biocomputing

[Fig.] Diagram relating IT and BT through Bioinformatics and Biocomputing.

Page 5: Bioinformatics

Page 6: What is Bioinformatics?

- Bioinformatics vs. Computational Biology
- Bioinformatik (in German): biology-based computer science as well as bioinformatics (in English)
- Informatics: computer science
- Bio: molecular biology
- Bioinformatics: solving problems arising from biology using methodology from computer science.

Page 7: Molecular Biology: Flow of Information

DNA → RNA → Protein → Function

[Fig.] A DNA strand (ACTGGA AGCTTATC) and a protein drawn as a chain of amino acids (Phe, Cys, Lys, Cys, Asp, Cys, Arg, Ser, Ala, Leu).
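The flow from DNA to protein can be sketched in a few lines of Python. This is only an illustration: the input sequence is made up, and `CODON_TABLE` is a deliberately partial subset of the standard genetic code, just large enough for the example.

```python
# Minimal sketch of the DNA -> RNA -> protein flow.
# CODON_TABLE is an illustrative subset of the standard genetic code.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe", "UGU": "Cys", "UGC": "Cys",
    "AAA": "Lys", "AAG": "Lys", "GAU": "Asp", "GAC": "Asp",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate codon by codon until a stop codon or an unknown codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid is None or amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


# Hypothetical coding sequence, chosen to use only codons in the table above.
mrna = transcribe("ATGTTTTGTAAATGCGATTAA")
protein = translate(mrna)
print(mrna, protein)
```

Real transcription reads the template strand and real translation depends on the reading frame and start-codon context; both are ignored here for brevity.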

Page 8: DNA (Gene) → RNA → Protein

[Fig.] Gene → mRNA → protein. Labels: control statement (TATA, start), control statement (termination, stop), ribosome binding, gene, 5’ UTR, 3’ UTR, transcription (RNA polymerase), mRNA, translation (ribosome), protein.

Page 9: Nucleotide and Protein Sequence

aacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc

gcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgcaacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc gcgggcccgc cgcttgtcggccgccggggg ggcgcctctg

cgcttgtcgg ccgccgggggccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg aacctgcggaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcg

SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other

DNA (Nucleotide) Sequence

CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC

Protein (Amino Acid) Sequence
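A summary line like the "SQ sequence ..." entry above can be produced by counting bases. The sketch below uses the first 20 bases of the nucleotide sequence shown on this slide; the function name `composition_line` is a hypothetical helper, not part of any database tool.

```python
from collections import Counter


def composition_line(sequence):
    """Build an 'SQ'-style summary line with the length and base counts."""
    seq = sequence.replace(" ", "").upper()
    counts = Counter(seq)
    # Anything that is not A, C, G or T is reported as "other".
    other = len(seq) - sum(counts[base] for base in "ACGT")
    return "SQ sequence {} BP; {} A; {} C; {} G; {} T; {} other".format(
        len(seq), counts["A"], counts["C"], counts["G"], counts["T"], other)


line = composition_line("aacctgcgga aggatcatta")
print(line)
```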

Page 10: Some Facts

- 10^14 cells in the human body.
- 3 × 10^9 letters in the DNA code in every cell in your body.
- DNA differs between humans by 0.2% (1 in 500 bases).
- Human DNA is 98% identical to that of chimpanzees.
- 97% of DNA in the human genome has no known function.

Page 11: Topics in Bioinformatics

Structure analysis
- Protein structure comparison
- Protein structure prediction
- RNA structure modeling

Pathway analysis
- Metabolic pathway
- Regulatory networks

Sequence analysis
- Sequence alignment
- Structure and function prediction
- Gene finding

Expression analysis
- Gene expression analysis
- Gene clustering

Page 12: Extension of Bioinformatics Concept

- Genomics
  - Functional genomics
  - Structural genomics
- Proteomics: large-scale analysis of the proteins of an organism
- Pharmacogenomics: developing new drugs that will target a particular disease
- Microarray: DNA chip, protein chip

Page 13: Applications of Bioinformatics

- Drug design
- Identification of genetic risk factors
- Gene therapy
- Genetic modification of food crops and animals
- Biological warfare, crime etc.

Personal Medicine? E-Doctor?

Page 14: Bioinformatics as Information Technology

[Fig.] Bioinformatics at the center of related information technologies:
- Information retrieval: biomedical text analysis
- Database: GenBank, SWISS-PROT
- Hardware: supercomputing
- Agent: information filtering, monitoring agent
- Machine learning: clustering, rule discovery, pattern recognition
- Algorithm: sequence alignment

Page 15: Background of Bioinformatics

- Biological information infrastructure
  - Biological information management systems
  - Analysis software tools
  - Communication networks for biological research
- Massive biological databases
  - DNA/RNA sequences
  - Protein sequences
  - Genetic map linkage data
  - Biochemical reactions and pathways
- Need to integrate these resources to model biological reality and exploit the biological knowledge that is being gathered.

Page 16: Areas and Workflow of Bioinformatics

[Fig.] Areas: Structural Genomics, Functional Genomics, Proteomics, Pharmacogenomics. Supporting layers: Microarray (Biochip), Infrastructure of Bioinformatics. Example sequence: AGCTAGTTCAGTACATGGATCCATAAGGTACTCAGTCATTACTGCAGGTCACTTACGATATCAGTCGATCACTAGCTGACTTACGAGAGT

Page 17: DNA Chip Data Analysis: IT for BT

Page 18: cDNA Microarray

[Fig.] cDNA microarray workflow: cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray → hybridize mRNA target to microarray → excitation (laser 1, laser 2) and emission → scanning → overlay images and normalize → analysis.

Page 19: The Complete Microarray Bioinformatics Solution

[Fig.] Components: data management, databases, statistical analysis, image processing, automation, data mining, cluster analysis.

Page 20: DNA Chip Applications

- Gene discovery: gene/mutated gene
  - Growth, behavior, homeostasis …
- Disease diagnosis
  - Cancer classification
- Drug discovery: Pharmacogenomics
- Toxicological research: Toxicogenomics

Page 21: Disease Diagnosis: Cancer Classification with DNA Microarray

- cDNA microarray data of 6567 gene expression levels [Khan ’01].

- Filter genes that are correlated to the classification of cancer using PCA and ANN learning.

- Hierarchical clustering of the DNA chip samples based on the filtered 96 genes.

- Disease diagnosis based on DNA chip.

[Fig.] Flowchart of the experimental procedure.

Page 22: Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels

- Hierarchical clustering of cancer by 96 gene expression levels.

- The relation between gene expression and cancer category.

- Four cancer diagnostic categories

[Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: samples).
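As a rough illustration of how samples can be grouped by expression profile, the following sketch implements agglomerative clustering in plain Python. The linkage rule (average linkage), the distance (1 minus Pearson correlation) and the toy data are assumptions made for the example; the slides do not state which settings were used.

```python
import math


def pearson_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two clusters with the smallest mean pairwise distance."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [pearson_distance(profiles[p], profiles[q])
                         for p in clusters[i] for q in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: 4 samples, each with 5 gene expression levels.
samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.1, 2.9, 4.2, 5.1],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [4.8, 4.1, 3.2, 1.9, 1.0],
]
groups = average_linkage(samples, 2)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree that a dendrogram draws.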

Page 23: AI Methods for DNA Chip Data Analysis

- Classification and prediction
  - ANNs, support vector machines, etc.
  - Disease diagnosis
- Cluster analysis
  - Hierarchical clustering, probabilistic clustering, etc.
  - Functional genomics
- Genetic network analysis
  - Differential models, relevance networks, Bayesian networks, etc.
  - Functional genomics, drug design, etc.

Page 24: Cluster Analysis

[DNA microarray dataset]

[Gene Cluster 1]

[Gene Cluster 2]

[Gene Cluster 3]

[Gene Cluster 4]

Page 25: Methods for Cluster Analysis

- Hierarchical clustering [Eisen ’98]
- Self-organizing maps [Tamayo ’99]
- Bayesian clustering [Barash ’01]
- Probabilistic clustering using latent variables [Shin ’00]
- Non-negative matrix factorization [Shin ’00]
- Generative topographic mapping [Shin ’00]

Page 26: Clustering of Cell Cycle-regulated Genes in S. cerevisiae (the Yeast)

- Identify cell cycle-regulated genes by cluster analysis.
  - 104 genes are already known to be cell cycle-regulated.
  - Known genes are clustered into 6 clusters.
- Cluster the 104 known genes and other genes together.
- Same cluster → similar functional categories.

[Fig.] Expression levels of the 104 known genes over the cell cycle (row: time step, column: gene).

Page 27: Probabilistic Clustering Using Latent Variables

- g_i: ith gene
- z_k: kth cluster
- t_j: jth time step
- p(g_i|z_k): generating probability of the ith gene given the kth cluster
- v_k = p(t|z_k): prototype of the kth cluster

$$p(z_k \mid \mathbf{g}_i) = \frac{p(z_k)\, p(\mathbf{g}_i \mid z_k)}{p(\mathbf{g}_i)}$$

$$f(\mathbf{g}, t, z) = \sum_i \sum_j \sum_k \log\big( p(z_k)\, p(\mathbf{g}_i \mid z_k)\, p(t_j \mid z_k) \big) \qquad (*)$$

(*): objective function (maximized by EM)

$$\mathrm{similarity}(\mathbf{x}_i, \mathbf{v}_k) = \sum_j x_{ij}\, v_{kj}$$
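Once the cluster priors p(z_k) and the generating probabilities p(g_i|z_k) are available, each gene can be assigned to a cluster through Bayes' rule. The sketch below shows only that assignment step, not the EM fitting itself; all probabilities are made-up numbers for illustration.

```python
def cluster_posteriors(prior, likelihood):
    """Return p(z_k | g_i) for every gene i.

    prior[k]         = p(z_k)
    likelihood[k][i] = p(g_i | z_k)
    """
    posteriors = []
    for i in range(len(likelihood[0])):
        joint = [prior[k] * likelihood[k][i] for k in range(len(prior))]
        evidence = sum(joint)  # p(g_i) = sum_k p(z_k) p(g_i | z_k)
        posteriors.append([value / evidence for value in joint])
    return posteriors


# Hypothetical model with 2 clusters and 3 genes.
prior = [0.5, 0.5]
likelihood = [
    [0.6, 0.3, 0.1],  # p(g_i | z_1)
    [0.1, 0.3, 0.6],  # p(g_i | z_2)
]
posteriors = cluster_posteriors(prior, likelihood)
# Hard assignment: the most probable cluster for each gene (ties go to the first).
assignments = [max(range(len(prior)), key=lambda k: row[k]) for row in posteriors]
print(posteriors, assignments)
```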

Page 28: Experimental Result: Identify Cell Cycle-Regulated Genes

Clustering result

[Table] Clustering result with α-factor arrest data. In 4 clusters, genes with a high probability of being cell cycle-regulated were found.

Page 29: Experimental Result: Prototype Expression Levels of Found Clusters

[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).

• The genes in the same cluster show similar expression patterns during the cell cycle.
• The genes with similar expression patterns are likely to have correlated functions.

Page 30: Clustering Using Non-negative Matrix Factorization (NMF)

NMF (non-negative matrix factorization)

$$G \approx WH, \qquad (G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$

$$G_{i\mu},\; W_{ia},\; H_{a\mu} \ge 0$$

- G: gene expression data matrix
- W: basis matrix (prototypes)
- H: encoding matrix (in low dimension)

NMF as a latent variable model

$$\mathbf{g} \approx W\mathbf{h}$$

[Fig.] Latent variables h_1, h_2, ..., h_r connected through W to the observed variables g_1, g_2, ..., g_n.
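One common way to compute such a factorization is the multiplicative update rule for the squared reconstruction error. The slides do not say which algorithm was used, so the sketch below is an assumption: it uses those updates on a small made-up matrix, with plain Python lists instead of a numerical library.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(g, w, h):
    """Squared reconstruction error ||G - WH||^2."""
    wh = matmul(w, h)
    return sum((g[i][j] - wh[i][j]) ** 2
               for i in range(len(g)) for j in range(len(g[0])))


def nmf(g, rank, iterations=200, seed=0, eps=1e-9):
    """Multiplicative updates; W and H stay non-negative by construction."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(g, w, h)
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, g), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        ht = transpose(h)
        num, den = matmul(g, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h, initial_error


# Hypothetical expression matrix: 4 genes x 3 time steps, rank 2 by construction.
G = [[1, 2, 0], [2, 4, 0], [0, 1, 3], [0, 2, 6]]
W, H, initial_error = nmf(G, rank=2)
final_error = frobenius_error(G, W, H)
print(initial_error, final_error)
```

Each column of W then plays the role of a prototype expression pattern, and each gene can be assigned to the prototype with the largest weight.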

Page 31: Experimental Result: Five Clusters Found by NMF

[Fig.] Five prototype expression levels during the cell cycle (x-axis: time step in cell cycle, 1 to 18; y-axis: expression level, 0 to 0.18).

Page 32: Clustering Using Generative Topographic Mapping (GTM)

• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Fig.] The mapping y(x;W) takes a grid in the latent space (axes x1, x2) to the data space (axes t1, t2, t3). Arrow labels: Generation, Visualization.

Page 33: Experimental Result: Clusters Found by GTM

Three cell cycle-regulated clusters found by GTM

Columns: cluster center; no. of training data / no. in cluster; correct no. / test data; overall mean expression levels (Cln/b) of known genes.

S/G2 5 / 1 / 2 (.148 .184 -.367 -.044)

S (0.111 –0.333) 5 / 5 5 / 5 (100%) (1.075 1.482 -.233 -.375)

M/G1 c1 c2 c3

(0.111 0.333)(-0.111 –0.111)(0.323 0.1)

13 / 7 / 2 / 2

1 / 60 / 60 / 6

(-.171 -.573 .091 .311)

G2/M c1 c2

(0.111 0.333)(0.111 0.111)

10 / 5 / 3

0 / 53 / 5 (80%)

(-.616 –1.01 1.832 1.596)

G1 c1 c2

(-0.111 0.333)(-0.111 0.111)

35 / 18 / 7

10 / 16 (62%) 0 / 16

(.894 .907 -.766 -.479)

Page 34: Experimental Result: Comparison with Other Methods

Comparison of prototype expression levels

Phase | No. of selected genes | Mean expression levels by GTM | No. of selected genes by Spellman | Mean expression levels by Spellman
S/G2 | 92 | (.13 -.06 -.1 .01) | 121 | (.13 .05 -.16 .03)
S | 25 | (.84 .81 -.42 -.33) | 71 | (.46 .47 -.43 -.18)
M/G1 c1 | 120 | (.82 .65 -.65 -.38) | 113 | (-.21 -.61 -.04 .07)
M/G1 c2 | 34 | (-.04 -.37 -.01 -.11) | |
M/G1 c3 | 10 | (.32 .29 -.3 .05) | |
G2/M c1 | 33 | (-.59 -.96 1.34 1.29) | 195 | (-.32 -.62 .49 .54)
G2/M c2 | 60 | (.08 -.30 .51 .57) | |
G1 c1 | 122 | (.92 .74 -.62 -.33) | 300 | (.66 .49 -.55 -.33)
G1 c2 | 74 | (.79 .82 -.48 -.34) | |
Total | 570 | | 800 |

Page 35: Genetic Network Analysis

- Discover the complex regulatory interaction among genes.

- Disease diagnosis, pharmacogenomics and toxicogenomics

- Boolean networks

- Differential equations

- Relevance networks [Butte ’97]

- Bayesian networks [Friedman ’00] [Hwang ’00]

[Fig.] Basin of attraction of 12-gene Boolean genetic network model [Somogyi ’96].

Page 36: Bayesian Networks

Represent the joint probability distribution among random variables efficiently using the concept of conditional independence.

[Fig.] Example Bayes net over A, B, C, D, E with edges A → C, B → C, B → D and C → E.

$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) \quad \text{(by the chain rule)}$$

$$\phantom{P(A,B,C,D,E)} = P(A)\,P(B)\,P(C \mid A,B)\,P(D \mid B)\,P(E \mid C) \quad \text{(by the example Bayes net)}$$

• Given B, D is independent of A and C.

• C asserts dependency between A and B.

• Given C, E is independent of A and B.

An edge denotes the possibility of a causal relationship between nodes.
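The factorization can be checked numerically. The sketch below assumes binary variables and made-up conditional probability tables; it builds the joint distribution from the five local factors, confirms that it sums to one, and verifies one of the independence statements.

```python
from itertools import product

# Hypothetical CPTs for binary variables; each number is P(variable = 1 | parents).
P_A = 0.3
P_B = 0.6
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}  # key: (A, B)
P_D = {0: 0.2, 1: 0.8}  # key: B
P_E = {0: 0.1, 1: 0.6}  # key: C


def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b) * bernoulli(P_C[(a, b)], c)
            * bernoulli(P_D[b], d) * bernoulli(P_E[c], e))


def marginal(fixed):
    """Sum the joint over all assignments consistent with `fixed` (index -> value)."""
    return sum(joint(*v) for v in product((0, 1), repeat=5)
               if all(v[i] == x for i, x in fixed.items()))


total = marginal({})
# D is independent of A given B: P(D=1 | A=1, B=1) equals P(D=1 | B=1).
p_d_given_ab = marginal({0: 1, 1: 1, 3: 1}) / marginal({0: 1, 1: 1})
p_d_given_b = marginal({1: 1, 3: 1}) / marginal({1: 1})
print(total, p_d_given_ab, p_d_given_b)
```

With binary variables the factored form needs 1 + 1 + 4 + 2 + 2 = 10 numbers, whereas an unrestricted joint table over five variables needs 2^5 - 1 = 31.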

Page 37: Bayesian Networks Learning

- Dependence analysis [Margaritis ’00]
  - Mutual information and χ² test
- Score-based search

$$p(D, S) = p(S)\, p(D \mid S) = p(S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

  • D: data, S: Bayesian network structure

  - NP-hard problem
  - Greedy search
- Heuristics to find good massive network structures quickly (local to global search algorithm)
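A score of this Bayesian Dirichlet form is usually evaluated in log space with the log-gamma function. The sketch below assumes a uniform hyperparameter (every alpha_ijk equal to 1) and takes the counts N_ijk as given; the function names are hypothetical.

```python
from math import exp, lgamma


def log_bd_family(counts, alpha=1.0):
    """Log of the BD term for one variable.

    counts[j][k] = N_ijk, the number of cases with parent configuration j
    and state k. `alpha` is the assumed uniform hyperparameter alpha_ijk.
    """
    total = 0.0
    for row in counts:
        alpha_ij = alpha * len(row)  # alpha_ij = sum_k alpha_ijk
        n_ij = sum(row)              # N_ij = sum_k N_ijk
        total += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
        for n_ijk in row:
            total += lgamma(alpha + n_ijk) - lgamma(alpha)
    return total


def log_bd_score(families, log_prior=0.0):
    """log p(D, S) = log p(S) + sum over variables of the family terms."""
    return log_prior + sum(log_bd_family(counts) for counts in families)


# One binary variable without parents, observed once in each state.
score = log_bd_score([[[1, 1]]])
print(exp(score))
```

For this tiny case the score equals the integral of θ(1 - θ) over a uniform prior, which is 1/6, so the result is easy to check by hand.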

Page 38: The Small Bayesian Network for Classification of Cancer

[Fig.] Bayesian network over the nodes Zyxin, Leukemia class, MB-1, C-myb and LTC4S.

• The Bayesian network was learned by full search using BD (Bayesian Dirichlet) score with uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification (http://waldo.wi.mit.edu/MPR/).

Method | Training error | Test error
Bayes nets | 0/38 | 2/34
Neural trees | 0/38 | 1/34
RBF networks | 0/38 | 1.3/34

[Table] Comparison of the classification performance with other methods [Hwang ’00].

Page 39: Large-Scale Bayesian Network with 1171 Genes

- Genetic networks for understanding the regulatory interaction among genes and their derivatives

- Pharmacogenomics and Toxicogenomics

[Fig.] The Bayesian network structure constructed from DNA microarray data for cancer classification (partial view).

Page 40: DNA Computing: BT for IT

Page 41: DNA Computing: BioMolecules as Computer

011001101010001 ATGCTCGAAGCT

Page 42: Why DNA Computing?

- 6.022 × 10^23 molecules / mole
- Immense, brute-force search of all possibilities
  - Desktop: 10^9 operations / sec
  - Supercomputer: 10^12 operations / sec
  - 1 mol of DNA: 10^26 reactions
- Favorable energetics: Gibbs free energy, ΔG = -8 kcal mol^-1
  - 1 J for 2 × 10^19 operations
- Storage capacity: 1 bit per cubic nanometer

Page 43: HPP

Flow of DNA Computing

[Fig.] Solving a Hamiltonian path problem (HPP) on a graph with nodes 0 to 6 using DNA. Steps: Encoding → Ligation → PCR (Polymerase Chain Reaction) → Gel Electrophoresis → Affinity Column → Decoding.

Node 0: ACG
Node 1: CGA
Node 2: GCA
Node 3: TAA
Node 4: ATG
Node 5: TGC
Node 6: CGT

Solution: ACGCGAGCATAAATGTGCCGT (the path 0 → 1 → 2 → 3 → 4 → 5 → 6)
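The generate-and-filter logic of this flow can be imitated in software. The sketch below is a simplification under stated assumptions: the node codes come from the slide, but the edge list `EDGES` is hypothetical (the graph itself is only in the figure), candidate walks are enumerated exhaustively instead of forming by random ligation, and strands are plain concatenations of node codes without separate edge strands.

```python
CODES = {0: "ACG", 1: "CGA", 2: "GCA", 3: "TAA", 4: "ATG", 5: "TGC", 6: "CGT"}
# Hypothetical directed edges; the graph in the figure is not recoverable from the text.
EDGES = {(0, 1), (0, 3), (0, 6), (1, 2), (1, 3), (2, 1), (2, 3),
         (3, 2), (3, 4), (4, 1), (4, 5), (5, 2), (5, 6)}


def ligate(max_nodes):
    """Generate every walk of up to max_nodes nodes (stand-in for random ligation)."""
    walks = [[n] for n in CODES]
    frontier = list(walks)
    for _ in range(max_nodes - 1):
        frontier = [w + [v] for w in frontier
                    for (u, v) in sorted(EDGES) if u == w[-1]]
        walks.extend(frontier)
    return walks


def solve_hpp(start, end):
    n = len(CODES)
    pool = ligate(n)
    # PCR: keep walks that begin at `start` and finish at `end`.
    pool = [w for w in pool if w[0] == start and w[-1] == end]
    # Gel electrophoresis: keep walks with exactly n nodes.
    pool = [w for w in pool if len(w) == n]
    # Affinity column: one pass per node, keeping walks that contain it.
    for node in CODES:
        pool = [w for w in pool if node in w]
    return pool


def encode(path):
    """Concatenate the node codes along a path."""
    return "".join(CODES[n] for n in path)


paths = solve_hpp(0, 6)
strands = [encode(p) for p in paths]
print(paths, strands)
```

In the test tube every filtering step acts on all strands at once, which is where the parallelism of DNA computing comes from; the list comprehensions here only mimic the selection logic.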

Page 44: Biointelligence on a Chip?

[Fig.] Labels: Information Technology, Biotechnology, Bioinformation Technology, Biological Computer, Molecular Electronics, Biointelligence Chip.
- Computing models: the limit of conventional computing models
- Computing devices: the limit of silicon semiconductor technology

Page 45: Intelligent Biomolecular Information Processing

[Fig.] Components: Theoretical Models, Bio-Memory, Bio-Processor, Biocomputing. Labels: S, GFP, Cytochrome c; Input A, Controller, Reaction Chamber (Calculating), Output.

Page 46: Evolvable Biomolecular Hardware

Sequence programmable and evolvable molecular systems have been constructed as cell-free chemical systems using biomolecules such as DNA and proteins.

Page 47: DNA Computers vs. Conventional Computers

DNA-based computers | Microchip-based computers
slow at individual operations | fast at individual operations
can do billions of operations simultaneously | can do substantially fewer operations simultaneously
can provide huge memory in small space | smaller memory
setting up a problem may involve considerable preparations | setting up only requires keyboard input
DNA is sensitive to chemical deterioration | electronic data are vulnerable but can be backed up easily

Page 48: Molecular Operators for DNA Computing

• Hybridization: complementary pairing of two single-stranded polynucleotides

5’- AGCATCCA –3’ + 3’- TCGTAGGT –5’

→

5’- AGCATCCA –3’

3’- TCGTAGGT –5’

• Ligation: attaching sticky ends to a blunt-ended molecule

TGACTACGACTG

ATGCATGCTACG + ATGCATGCTGAC

TACGTACGTGAC

sticky end
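The hybridization rule, that each base pairs with its Watson-Crick partner, can be checked with a short sketch. The function names are hypothetical, and the second strand is written 3’ to 5’ as in the example above, so no reversal is needed.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base Watson-Crick complement (same left-to-right order)."""
    return "".join(COMPLEMENT[base] for base in strand)


def can_hybridize(strand_5to3, strand_3to5):
    """True if the two strands pair perfectly along their whole length."""
    return (len(strand_5to3) == len(strand_3to5)
            and complement(strand_5to3) == strand_3to5)


print(can_hybridize("AGCATCCA", "TCGTAGGT"))
print(can_hybridize("AGCATCCA", "TGCTAGGT"))
```

The second call fails because two bases are swapped, which shows how a single sequencing or typing error breaks perfect pairing.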

Page 49: Research Groups

- MIT, Caltech, Princeton University, Bell Labs
- EMCC (European Molecular Computing Consortium), composed of national groups from 11 European countries
- BioMIP Institute (BioMolecular Information Processing) at the German National Research Center for Information Technology (GMD)
- Molecular Computer Project (MCP) in Japan
- Leiden Center for Natural Computation (LCNC)

Page 50: Applications of Biomolecular Computing

- Massively parallel problem solving
- Combinatorial optimization
- Molecular nano-memory with fast associative search
- AI problem solving
- Medical diagnosis
- Cryptography
- Drug discovery
- Further impact in biology and medicine:
  - Wet biological databases
  - Processing of DNA labeled with digital data
  - Sequence comparison
  - Fingerprinting

Page 51: NACST (Nucleic Acid Computing Simulation Toolkit)

[Fig.] NACST architecture: GUI; NACST Engine Controller; DNA Sequence Generator; DNA Sequence Optimizer; Genetic Algorithm; Ligation Unit; PCR Unit; Electrophoresis Unit; Affinity Column Unit; Enzyme Unit.

Page 52: NACST

[Fig.] NACST inputs and outputs.

Page 53: Combinatorial Problem Solver

TSP (Traveling Salesman Problem)

[Fig.] A weighted graph with nodes 0 to 6 (edge weights such as 3, 5, 7, 9 and 11) and the example tour 0 → 1 → 2 → 3 → 4 → 5 → 6 → 0.

Representations
- Node sequences: each node i has two parts, PiA and PiB.
  - Node 1: AGCT (P1A), TAGG (P1B)
  - Node 2: ATGG (P2A), CATG (P2B)
  - Node 3: CGAT (P3A), CGAA (P3B)
- Edge sequences: the complement of PiB, then the weight sequence Wij, then the complement of PjA.
  - Edge 1 → 2: ATCC (P1B) ATCA (W12) TACC (P2A)
  - Edge 1 → 3: ATCC (P1B) GCCT (W13) GCTA (P3A)

Page 54: Combinatorial Problem Solver

Weight representation methods

1. Molecules with high G-C content tend to hybridize easily.

2. Molecules with high G-C content tend to be denatured at higher temperature.

3. Molecules with a larger population in the tube have a higher probability of hybridizing.

[Fig.] Procedure: Hybridization/Ligation → PCR/Gel electrophoresis → Affinity chromatography → PCR/Gel electrophoresis → Temperature Gradient Gel Electrophoresis → Gradient PCR.
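The first two weight representations rely on G-C content, which is easy to compute. The sketch below applies it to the two weight sequences shown for the TSP encoding (W12 = ATCA, W13 = GCCT). The melting temperature uses the Wallace rule of thumb for short oligonucleotides, 2 °C per A or T and 4 °C per G or C; that rule is an assumption brought in for illustration, not something the slides specify.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in degrees C: 2(A+T) + 4(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


weights = {"W12": "ATCA", "W13": "GCCT"}
for name, seq in sorted(weights.items()):
    print(name, seq, gc_content(seq), wallace_tm(seq))
```

Under this rule the G-C rich sequence GCCT is more stable than ATCA, which is the property the weight encoding exploits.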

Page 55: Experimental Results for 4-TSP

- Hybridization (37 °C)
- Ligation (16 °C, 15 hr)
- PCR (36 cycles)
- Gel electrophoresis (10% polyacrylamide gel)

[Fig.] Gel image. Lanes: 50 bp marker, oligomer mixture, ligation result, final PCR result (140 bp).

Page 56: Molecular Theorem Prover

Resolution refutation method

Problem under consideration: given P ∧ Q → R, S ∧ T → Q, S, T and P, is R true?

Turn each A → B into ¬A ∨ B, and add ¬R as a clause:

¬Q ∨ ¬P ∨ R, ¬S ∨ ¬T ∨ Q, S, T, P, ¬R

Resolution steps:
1. ¬Q ∨ ¬P ∨ R and P give ¬Q ∨ R.
2. ¬S ∨ ¬T ∨ Q and S give ¬T ∨ Q.
3. ¬T ∨ Q and T give Q.
4. ¬Q ∨ R and Q give R.
5. R and ¬R give nil, so R is true!
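The same refutation can be reproduced in software with a small propositional resolution loop. This is a plain saturation procedure written for illustration, not the molecular implementation; literals are strings and "~" marks negation.

```python
from itertools import combinations


def negate(literal):
    """Return the complementary literal ('Q' <-> '~Q')."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolve(c1, c2):
    """All resolvents of two clauses (clauses are frozensets of literals)."""
    return [frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
            for lit in c1 if negate(lit) in c2]


def refutes(clauses):
    """True if the empty clause (nil) can be derived by resolution."""
    known = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(known, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True  # derived nil
                new.add(resolvent)
        if new <= known:
            return False  # nothing new can be derived
        known |= new


# Clause form of: P and Q -> R, S and T -> Q, S, T, P.
premises = [frozenset(c) for c in (
    {"~Q", "~P", "R"}, {"~S", "~T", "Q"}, {"S"}, {"T"}, {"P"})]
# Add the negated goal ~R and look for a contradiction.
r_is_proved = refutes(premises + [frozenset({"~R"})])
print(r_is_proved)
```

In the molecular version the pairing of a literal with its negation is carried out by hybridization of complementary strands, so many resolution steps can happen in parallel.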

Page 57: Molecular Theorem Prover (Abstract Implementation)

[Fig.] Implementation 1 and Implementation 2: two arrangements of the clauses ¬Q ¬P R, ¬S ¬T Q, S, T, P and ¬R in which complementary literals (Q and ¬Q, P and ¬P, S and ¬S, T and ¬T, R and ¬R) pair up.

Page 58: Molecular Theorem Prover (Experiments for Method 1)

Experimental procedure

I. Mix the molecules (100 pmol each, total 20 µl)

II. Denaturation (95 °C, 10 min)

III. Annealing (95 °C for 1 min, then down to 15 °C at 1 °C/min)

IV. Polyacrylamide gel electrophoresis (PAGE, 20%)

V. Detection of solution: 75 bp ds DNA

Experimental results

[Fig.] Gel image with lanes 1 to 6 and size labels at 20 bp and 200 bp. Lane labels: 20 bp DNA marker (Talara), Mixture, Reaction.

Page 59: Solving Logic Problems by Molecular Computing

- Satisfiability problem
  - Find Boolean values for variables that make the given formula true.
- 3-SAT problem
  - Every NP problem can be seen as the search for a solution that simultaneously satisfies a number of logical clauses, each composed of three variables.

[Eq.] Example formulas: clauses of the form (x1 or x2 or x3) joined by AND, over variables x1 to x6.

Page 60: Bioinformation Technology: Case Studies in Bioinformatics and Biocomputing with DNA Chips Byoung-Tak Zhang Center for Bioinformation Technology (CBIT)

DNA Computing with DNA Chips


DNA Chips for DNA Computing

I. Make: oligomer synthesis

II. Attach (immobilize): 5’HS-C6-T15-CCTTvvvvvvvvTTCG-3’

III. Mark: hybridization

IV. Destroy: enzyme reaction (e.g., EcoRI)

V. Unmark
   * Remove all strands that do not satisfy the problem

VI. Readout: if solutions remain after the final step of the N cycles, amplify them by PCR to confirm
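The Make, Attach, Mark, Destroy, Unmark cycle can be mimicked with set operations. The sketch below is an abstract model, not a description of the chemistry: it assumes one strand per truth assignment of four variables w, x, y, z, one cycle per clause, and the formula (w ∨ x ∨ y) ∧ (w ∨ ¬y ∨ z) ∧ (¬x ∨ y) ∧ (¬w ∨ ¬y), whose negation marks are an assumption. All names are hypothetical.

```python
from itertools import product

VARS = ("w", "x", "y", "z")

# Assumed formula; each literal is (variable, value that makes it true).
CLAUSES = [
    [("w", 1), ("x", 1), ("y", 1)],   # (w or x or y)
    [("w", 1), ("y", 0), ("z", 1)],   # (w or not y or z)
    [("x", 0), ("y", 1)],             # (not x or y)
    [("w", 0), ("y", 0)],             # (not w or not y)
]

def make_surface():
    """Make + Attach: one immobilized strand per assignment, 2**4 = 16 in all."""
    return set(product((0, 1), repeat=len(VARS)))

def mark(surface, clause):
    """Mark: strands that satisfy the clause hybridize and become protected."""
    return {
        strand for strand in surface
        if any(strand[VARS.index(var)] == value for var, value in clause)
    }

def destroy(surface, marked):
    """Destroy: unmarked strands are removed from the surface."""
    return surface & marked

def run_cycles(clauses):
    """Run one Mark/Destroy/Unmark cycle per clause and return the survivors."""
    surface = make_surface()
    for clause in clauses:
        marked = mark(surface, clause)
        surface = destroy(surface, marked)
        # Unmark: the protecting complements are washed off; in this model
        # the surviving strands simply carry over to the next cycle.
    return sorted(surface)

survivors = run_cycles(CLAUSES)
print(survivors)
```

After N cycles (N is the number of clauses), only strands that satisfied every clause remain, which is what the Readout step then amplifies and identifies.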


Variable Sequences and the Encoding Scheme


Three-dimensional Plot and Histogram of the Fluorescence

S3: w=0, x=0, y=1, z=1
S7: w=0, x=1, y=1, z=1
S8: w=1, x=0, y=0, z=0
S9: w=1, x=0, y=0, z=1

y=1: satisfies (w ∨ x ∨ y)
z=1: satisfies (w ∨ ¬y ∨ z)
x=0 or y=1: satisfies (¬x ∨ y)
w=0: satisfies (¬w ∨ ¬y)

Four spots with high fluorescence intensity correspond to the four expected solutions.

DNA sequences identified in the readout step via addressed array hybridization.
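The spot labels can be decoded and checked clause by clause. This is a small sketch under two assumptions: a label S<k> is read as the 4-bit binary number wxyz (w most significant), and the formula is (w ∨ x ∨ y) ∧ (w ∨ ¬y ∨ z) ∧ (¬x ∨ y) ∧ (¬w ∨ ¬y), with negations written as `~`. Function names are hypothetical.

```python
VARS = "wxyz"

# Assumed clauses; "~" marks a negated literal.
CLAUSES = [("w", "x", "y"), ("w", "~y", "z"), ("~x", "y"), ("~w", "~y")]

def decode(index):
    """Map a spot index to an assignment, e.g. 3 -> w=0, x=0, y=1, z=1."""
    bits = format(index, "04b")
    return {var: int(bit) for var, bit in zip(VARS, bits)}

def witnesses(assignment, clause):
    """Return the literals of the clause that are true under the assignment."""
    return [
        lit for lit in clause
        if assignment[lit[-1]] == (0 if lit.startswith("~") else 1)
    ]

def is_solution(index):
    """A spot is a solution if every clause has at least one true literal."""
    assignment = decode(index)
    return all(witnesses(assignment, clause) for clause in CLAUSES)

solution_spots = [i for i in range(16) if is_solution(i)]
for i in solution_spots:
    assignment = decode(i)
    first_true = [witnesses(assignment, clause)[0] for clause in CLAUSES]
    print(f"S{i}", assignment, "satisfied via", first_true)
```

Under these assumptions the only indices that pass all four clauses are 3, 7, 8 and 9, matching the four bright spots listed on the slide.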


Outlook

IT is growing in importance for the advancement of BT.
  Bioinformatics
  DNA Microarray Data Mining

IT can benefit much from BT.
  Biocomputing and Biochips
  DNA Computing (with DNA Chips)

Bioinformation technology (BIT) is essential as a next-generation information technology.
  In Silico Biology vs. In Vivo Computing


References

[Barash ’01] Barash, Y. and Friedman, N., Context-specific Bayesian clustering for gene expression data, Proc. of RECOMB’01, 2001.

[Butte ’97] Butte, A.J. et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc. Natl Acad. Sci. USA, 94, 1997.

[Eisen ’98] Eisen, M.B. et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, 95, 1998.

[Friedman ’00] Friedman, N. et al., Using Bayesian networks to analyze expression data, Proc. of RECOMB’00, 2000.

[Heckerman ’95] Heckerman, D. et al., Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning, 20(3), 1995.

[Hwang ’00] Hwang, K.-B. et al., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, CAMDA’00, 2000.


References

[Khan ’01] Khan, J. et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7(6), 2001.

[Margaritis ’00] Margaritis, D. and Thrun, S., Bayesian network induction via local neighborhoods, Proc. of NIPS’00, 2000.

[Shin ’00] Shin, H.-J. et al., Probabilistic models for clustering cell cycle-regulated genes in the yeast, CAMDA’00, 2000.

[Somogyi ’96] Somogyi, R. and Sniegoski, C.A., Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation, Complexity, 1(6), 1996.

[Tamayo ’99] Tamayo, P. et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl Acad. Sci. USA, 96, 1999.


More information at
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/