04/18/23 Changhui (Charles) Yan 1
Gene Finding
Changhui (Charles) Yan
Gene Finding
Genomes of many organisms have been sequenced
Genome
Completely Sequenced Genomes
Gene Finding
More than 60 eukaryotic genome sequencing projects are underway
Human Genome Project (HGP)
To determine the sequences of the 3 billion bases that make up human DNA
  99% of the human DNA sequence finished to 99.99% accuracy (April 2003)
To identify the approximately 100,000 genes in human DNA (the estimate was revised to 20,000-25,000 by Oct 2004)
  15,000 full-length human genes identified (March 2003)
To store this information in databases
To develop tools for data analysis
Gene Finding
Genomes of many organisms have been sequenced
We need to decipher the raw sequences
  Where are the genes?
  What do they encode?
  How are the genes regulated?
Gene Finding
Homology-based methods, also called 'extrinsic methods'
  It seems that only approximately half of the genes can be found by homology to other known genes (although this percentage is of course increasing as more genomes get sequenced).
Gene prediction methods, or 'intrinsic methods' (http://www.nslij-genetics.org/gene/)
Machine Learning Approach
Split data into a training set and a test set
Use the training set to train a classifier
Test the classifier on the test set
The classifier can then be applied to novel data
[Diagram: training data -> machine learning algorithm -> classifier; test data -> evaluation of classifier; novel data -> prediction]
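The split/train/test/apply workflow above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the toy sequences are randomly generated, the single feature (GC content) and the threshold classifier are stand-ins chosen for brevity, and all function names are hypothetical. Nothing here is the method used by any particular gene-finder.

```python
import random
from collections import Counter

def make_toy_examples(n_per_class=40, length=30, seed=1):
    """Hypothetical data: class 1 is GC-rich, class 0 is AT-rich."""
    rng = random.Random(seed)
    examples = []
    for label, weights in ((1, [1, 4, 4, 1]), (0, [4, 1, 1, 4])):
        for _ in range(n_per_class):
            seq = "".join(rng.choices("acgt", weights=weights, k=length))
            examples.append((seq, label))
    return examples

def gc_content(seq):
    """Fraction of bases that are g or c (an assumed, deliberately simple feature)."""
    counts = Counter(seq.lower())
    return (counts["g"] + counts["c"]) / len(seq)

def split(examples, test_fraction=0.25, seed=0):
    """Shuffle, then hold out a fraction of the examples as the test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(training_set):
    """Learn a threshold halfway between the mean GC content of the two classes."""
    means = {}
    for label in (0, 1):
        values = [gc_content(s) for s, y in training_set if y == label]
        if not values:
            raise ValueError(f"training set has no examples of class {label}")
        means[label] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    positive_above = means[1] >= means[0]

    def classify(seq):
        above = gc_content(seq) >= threshold
        return 1 if above == positive_above else 0

    return classify

def accuracy(classifier, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    return sum(classifier(s) == y for s, y in test_set) / len(test_set)

if __name__ == "__main__":
    training_set, test_set = split(make_toy_examples())
    classifier = train(training_set)            # train on the training set only
    print("test accuracy:", accuracy(classifier, test_set))
    print("novel sequence ->", classifier("gcgcgcatgcgcggcgcgcc"))
```

The point of the sketch is the order of operations: the test set is set aside before training and is used only for evaluation, so the reported accuracy says something about how the classifier might behave on novel data.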
Data, examples, classes, classifier
ccgctttttgccagcataacggtgtcga, 1
accacgttttttgccagcatttgccagca, 0
atcatcacgatcacgaacatcaccacg, 0
…
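As a rough illustration of how such examples might be handled in code, the sketch below parses "sequence, label" lines into (sequence, class) pairs and turns each sequence into a fixed-length numeric vector. The slide does not say what the classes 1 and 0 stand for, so they are treated as opaque labels here. The k-mer count encoding and the function names are assumptions made for this example, not something taken from the slides.

```python
from collections import Counter
from itertools import product

# The three examples shown on the slide, one "sequence, class" pair per line.
RAW = """\
ccgctttttgccagcataacggtgtcga, 1
accacgttttttgccagcatttgccagca, 0
atcatcacgatcacgaacatcaccacg, 0
"""

def parse_examples(text):
    """Turn 'sequence, label' lines into a list of (sequence, int label) pairs."""
    examples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        seq, label = line.rsplit(",", 1)
        examples.append((seq.strip(), int(label)))
    return examples

def kmer_counts(seq, k=2):
    """Count overlapping k-mers, reported in a fixed alphabetical order.

    This is one possible feature encoding (an assumption for illustration);
    it gives every sequence a vector of length 4**k regardless of its length.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(kmer)] for kmer in product("acgt", repeat=k)]

if __name__ == "__main__":
    for seq, label in parse_examples(RAW):
        print(label, kmer_counts(seq))
```

A classifier is then a function from such a vector (or from the raw sequence) to one of the classes.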
N-fold cross-validation
[Figure: 3-fold cross-validation on the E. coli K12 genome (4,639,675 bp); in each of rounds 1 to 3 a different part of the data serves as the test set and the rest as the training set]
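The round-by-round scheme can be sketched as follows. This is a minimal illustration, assuming the examples are shuffled and dealt into N roughly equal folds; the function names and the `train_fn` / `score_fn` parameters are hypothetical placeholders for whatever learning algorithm and evaluation measure are being used.

```python
import random

def n_fold_splits(n_examples, n_folds=3, seed=0):
    """Yield (train_indices, test_indices) once per round.

    Each example index lands in the test set in exactly one round
    and in the training set in all other rounds.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(examples, train_fn, score_fn, n_folds=3, seed=0):
    """Average score_fn over the rounds; train_fn sees only that round's training set."""
    scores = []
    for train_idx, test_idx in n_fold_splits(len(examples), n_folds, seed):
        model = train_fn([examples[i] for i in train_idx])
        scores.append(score_fn(model, [examples[i] for i in test_idx]))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    for round_no, (train, test) in enumerate(n_fold_splits(9), start=1):
        print(f"Round {round_no}: train={sorted(train)} test={sorted(test)}")
```

Compared with a single training/test split, every example is used for testing once, which makes the averaged score less dependent on one particular split.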
Gene-finders
Prokaryotes vs. Eukaryotes
Prokaryotes are organisms without a cell nucleus.
  Most prokaryotes are bacteria.
  Prokaryotes can be divided into Bacteria and Archaea (archaebacteria).
Eukaryotes are organisms with a membrane-bound nucleus.
Prokaryotes vs. Eukaryotes
Prokaryotes’ genomes are relatively simple: coding region (genes) vs. non-coding region.
Eukaryotes’ genomes are complicated.
Eukaryotic genes