CSCI 6900/4900 Special Topics in Computer Science
Automata and Formal Grammars for Bioinformatics
Bioinformatics problems
• sequence comparison
• pattern/structure search
• pattern/structure recognition
• relationship of sequences
Algorithm design
• optimal algorithms
• heuristic algorithms
• parallel algorithms
Probabilistic models
• stochastic finite state automata (HMMs)
• stochastic regular grammars
• stochastic context-free grammars
• more complex grammar models
Probabilistic modeling and algorithms
M: modeling a family of sequences (e.g. RNA) to capture certain properties Q1, Q2, ….
(1) Each sequence x possesses a property Qk(x) with probability Pk(x)
(2) For each sequence x, the Pk(x) form a probability distribution over the properties, i.e., ∑k Pk(x) = 1 for each given x
(3) The most likely property Q*(x) is the one with the highest probability, i.e., Q*(x) = arg maxk { Pk(x) }
(4) Algorithms are designed to find the most likely property for given sequences. But how?
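As a minimal sketch of items (1) to (3), assume the model M has already produced a non-negative score for each property of a given sequence x. The property names, the scores, and the helper functions below are hypothetical illustrations, not part of the course material; how M actually computes the scores is the subject of the later parts.

```python
def normalize(scores):
    """Turn non-negative scores into probabilities P_k(x) that sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {k: v / total for k, v in scores.items()}


def most_likely_property(probs):
    """Return Q*(x) = arg max_k P_k(x)."""
    return max(probs, key=probs.get)


# Hypothetical unnormalized scores for one sequence x under three properties.
scores = {"Q1": 2.0, "Q2": 6.0, "Q3": 2.0}
probs = normalize(scores)           # item (2): the P_k(x) sum to 1
best = most_likely_property(probs)  # item (3): the arg max over k
```

Here the arg max is found by enumerating every property, which only works when the number of properties is small. When the properties are alignments or structures, the candidates are too many to list and dynamic programming is typically needed instead.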
Modeling mechanism
M: computational linguistic systems can describe desired properties of biological sequences
D (sample, training data): assigning probabilities
Outline for the course
• Part 0: molecular biology basics and review of probability theory
• Part 1: pairwise alignment, HMMs, profile-HMMs, gene finding, and multiple alignment (chapters 1-6)
potential research projects: efficient HMM algorithms, gene finding
• Part 2: RNA stem-loops, SCFG, secondary structure prediction, structural homology search (chapters 9-10)
potential research projects: efficient SCFG algorithms, pseudoknot prediction, protein secondary structure prediction
• Part 3: phylogeny reconstruction, probabilistic approaches (chapters 7-8)
potential research projects: grammar modeling of evolution
The ways this course is to be conducted
• To learn new concepts and techniques
Lectures (by the instructor and students)
• To apply learned knowledge to research
Research discussions (led by students and the instructor)
• To demonstrate learning effectiveness
Presentations of research results (by students)
The central dogma of molecular biology
Nucleotides
• Purines: adenine, guanine
• Pyrimidines: cytosine, thymine
Building blocks of DNA
Double helix of DNA
DNA replication
Genetic code
Mutations
(1) synonymous
(2) missense
(3) nonsense
(4) frame-shift
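The four mutation classes can be told apart by comparing the proteins encoded before and after the change. The sketch below uses the standard genetic code and a hypothetical helper `classify_mutation`. It assumes both sequences are coding DNA read from position 0, and it simplifies in two ways: any length change that is not a multiple of three is called a frame-shift, and a substitution that destroys the stop codon is reported as missense.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate coding DNA from position 0 up to the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def classify_mutation(ref, mut):
    """Hypothetical helper: label mut relative to ref."""
    if (len(mut) - len(ref)) % 3 != 0:
        return "frame-shift"      # reading frame is disrupted
    if len(mut) != len(ref):
        return "in-frame indel"   # outside the four classes listed above
    ref_protein, mut_protein = translate(ref), translate(mut)
    if mut_protein == ref_protein:
        return "synonymous"       # same amino acids
    if len(mut_protein) < len(ref_protein):
        return "nonsense"         # premature stop codon
    return "missense"             # at least one amino acid changed


ref = "ATGTACGCCTAA"  # made-up example: Met-Tyr-Ala-stop
```

With this made-up reference, `classify_mutation(ref, "ATGTACGCTTAA")` (GCC to GCT, both alanine) gives "synonymous", and `classify_mutation(ref, "ATGTAAGCCTAA")` (TAC to TAA) gives "nonsense".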
RNA synthesis
RNA can fold onto itself
Protein synthesis
Biological information flow
Genome
AGACGCTGGTATCGCATTAACTAACGGGTTACTCGGATATTACCTTACTATAGGGCGCTATCGCGCGTTAATCTGGTATC
Introns
Exons
Gene sequence
Protein sequence
Protein structure
Regulatory DNA sequence
Sequence family
Structure family
Protein-DNA interactions
Protein-protein interactions
Gene regulation
Gene expression
Protein function
Protein abundance
Cellular role
What bioinformatics is NOT:
• Not just using a computer to speed up biology
• Not just applying computer algorithms to biology
• Not just the accountant of genomic data
What bioinformatics is then:
• The creative use of computers to define and solve central biological puzzles
• The computer becomes a hypothesis machine, making predictions to be tested at the bench.