Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
1896 1920 1987 2006
Omics Big Data: Advanced Algorithms for Biological Sequence Analysis
Chaochun Wei (韦朝春)
http://cgm.sjtu.edu.cn/
Jing Li(李婧)
1Spring 2019
Contents
Background
Some examples about Omics Big Data
Course information
• Goal
• Contents
• Organization
• Grading
2
“Biology is an information science” -- Leroy Hood
A Big Picture of Biology
http://genomicscience.energy.gov/biofuels/3
http://genomicscience.energy.gov/biofuels/
A Big Picture of Biology
5
Bioinformatics/Computational Biology /
Systems Biology/Synthetic Biology/Omics Big Data
Bioinformatics
The science of collecting and analyzing
complex biological data such as genetic
codes.
-- Oxford Dictionary
Major research areas
• Sequence analysis
• Genome annotation
• Computational evolutionary biology
• Analysis of gene expression, regulation
• Comparative genomics
• Literature analysis
• Biological systems modeling
• Structural Biology
6
Milestone of Modern Biology:The human genome project
Feb. 15, 2001 Nature Feb. 16, 2001 Science
7
Human Genome Project3 billion dollars,3 billion residues
“This is ...
• a WASTE of tax payers’
money….
• We can do a lot more
8
George ChurchProfessor of Genetics Harvard Medical School
Automated high throughput sequencing
AGAACGACCATCAACTAAATCAAAATGCCTTTCAAACCAGCAGACAACCCAAAATGCCCAAAATGCGGCAAATCCGTATACGCCGCNGAAGAAAAAGTAGCTGGAGGATACAAATACCACAAATCCTGCTTCAAATGCGGTATGTGCAATAAAATGCTCGACTCCACCAACGTAACTGAACACGAAGCTGAATTGTACTGCAAAAATTGCCATGGACGTAAATACGGACCTAAAGGATACGGATTCGGTGGTGGAGCTGGGTGCTTAAGTATGGACGATGGAGCCCAATTCAAGGGAACACAATAATTTTAAGAAGGAATCAATGTGAAGATGGCGGCCAAAACCACACCAACTGTCAGCGGTCGTCAGTTCTACCCTTTTCCATCCCCCACTATACACTAATGTAATATTTTTAGATCTTAAATTACAGACTTAGTTTTAATTTATAAATTTTCGTATGACACGTTATAAATAAGAATTCGGTTATTTTGTAATAATTGAATTAAATAAATCTTATTTAAGACCAAAAAA
9
Next-generation sequencing technology
10
Sanger method: huge lab, numerous machines and staffs
<
Next-generation:one staff, one machine
NATURE METHODS ,16 | VOL.5 NO.1 | JANUARY 2008
NGS platforms
11
Illumina / Solexa HiSeq
Applied Biosystems
ABI 3730XL
Roche / 454 Genome Sequencer FLX
Applied BiosystemsSOLiD
HeliScope™ Single Molecule Sequencer
3rd generation: Oxford NanoporeMinION
Illumina / Solexa MiSeqPacific Biosciences RS II
Comparison of sequencing platforms (2019.2)
12
Platforms Sanger 454HiSeq X Ten *
MiSeq *NovaSeq
6000*
PacBioSequel
**Nanopore
Read length650-1100
150-1000
150 36-300 2x250Average 30k
Up to 2 Mb
# of reads/run 96 0.4-2M 5.3-6 B15M –25M
32-40B ~500k Up to 500
Error rate 10^-3 <10^-2 ~10^-3 ~10^-3 ~10^-3 ~1% Varies
Cost($/Mbp) 5000 ~5 <0.01 ~0.5 <0.001 ~0.3 ~0.1
Time/run~3
hours~7
hours<3 days
4-56 hours
13-38hr< 20 hours
As little as 5 mins
Throughput 100Kb ~1Gb 1.6-1.8Tb540Mb-
15Gb4.8-6Tb
Up to 20 Gb
10-30Gb
* https://www.illumina.com/systems/** https://www.pacb.com/products-and-services/sequel-system/
Latest sequencing platforms
Complete Genomics (obtained by BGI, 2013)
• Finish 10,000 genomes/year (2010)
Oxford Nanopore
• Human genome:$3000, 6 hours (2012)
Visigen
BIGIS (China)
more…13
Ion Torrent
Pacific Biosciences RS IIRead length: Maximum > 30 Kb
Popular sequencing platforms
HiSeq X series (2015)
• Ten: > 18,000 human genomes/year
• <$1000 per 30X genome
NovaSeq 5000 (2017 from Illumina)
• <$100 per human genome
• 2Tb in 2.5 days, 1.6 billion reads 14
* http://www.illumina.com/systems
Latest sequencing platforms
NovaSeq 6000 (2018 from Illumina)
• <$100 per human genome
• 6Tb in 2 days, 20 billion reads
MinION (Oxford Nanopore)
• Use USB 3.0 cable as power
• 10-30Gb/flow cell
• Can be used at ANYWHERE
• $1000/pack(device + kits + membership)
* http://www.illumina.com/systems
BGI Sequencer
Launched on Oct. 2018
6Tb/day
The first a few personal genomes
Craig Venter genome
James Watson genome
2 Koran genomes
1 Chinese genome
2 cancer genomes
1 African genome
Stephen Quake genome
Family of Four by Institute of System Biology
……….
17
Latest personal genome projects
1000 Genome Projects (UK, China, US)
ClinSeq (NHGRI)
International Cancer Genome Consortium (Canada)
23andMe Research Revolution (US)
18
Latest personal genome projects
23andMe Research Revolution (US)
10,000 Human Genome Project (USA)
300,000 Human Genome Project (Iceland)
500,000 Genome Projects (UK *)
1,000,000 Human Genome Project (USA)
Cancer genomes
* http://www.ukbiobank.ac.uk/
Personal genomics
6 billions genomic DNA base pairs, ~22k genes,
>300,000 proteins,
Personalized genomic difference: 6 million bps
Common and rare variations
Common variations
– GWAS generally only have the power
to detect common SNP variations!
How to detect rare variations?
– whole genome sequencing!
In genetic epidemiology, a genome-wide association study (GWAS) is
an examination of many common genetic variants in different
individuals to see if any variant is associated with a trait. GWAS
typically focus on associations between single-nucleotide
polymorphisms (SNPs) and traits like major diseases.
Caner Genomes
TCGA:http://cancergenome.nih.gov/ ICGC:http://icgc.org/
De novo sequencing
targeted sequencing
a large number of small genomes
SNP discovery
without reference genomes
Transcriptom study
Unknown Transcriptom
Metagenome study
Microbial genomes in nature
Epigentics study
Regulatory element
Chip-seq, RNA-seq
Other new projects
High throughput sequence alignment
New ideas, new projects
Metagenomes, Pan-genomes, Single-molecular sequencing
Metagenomes
• HMG
• Marine metagenomes
Pan-genomes
• 3,000 rice genomes (50TB raw reads)
• 1,000 genomes of Arabdopsis thaliana
• 15,000 tomato genomes?
Single-cell sequencing
• Different tissues, development stages, conditions
• …24
Sequencing every cells in a human body
10^14 cells
25
https://www.humancellatlas.org/
Sequencing the world!
26
Sequencing all Eukaryotes
http://www.sciencemag.org/news/2017/02/biologists-propose-sequence-dna-all-life-earth
Sequencing the Earth!
Earth Microbiome Project
• Constructing the Microbial Map for Planet Earth
• 200,000 samples
• 500,000 reconstructed microbial genomes
27
http://www.earthmicrobiome.org/
Single Cell Sequencing
Mouse Cell Atlas
• 400,000 single cell mRNA-seq*
*Han et. al, Cell (2018), 172(5):1091-110728
Big data science in China
The size of data you have been working onNational Science and Technology Survey, 2017.02
https://sojump.com/jq/11572197.aspx
Big data science in China
Where do you publish your data?National Science and Technology Survey, 2017.02
https://sojump.com/jq/11572197.aspx
Big data science in China
Where do you store your data?National Science and Technology Survey, 2017.02
https://sojump.com/jq/11572197.aspx
Big data science in China
Top 10 reasons people don’t share dataNational Science and Technology Survey, 2017.02
https://sojump.com/jq/11572197.aspx
Computer cluster/Super computer
Pi: Ranked 158 world wide, 2013750TB(2013.1-2016.9) 3.6PB (2016.12) 10PB (2019.6)
Go inside
Course organization • Introduction (week 1)
• Part I: Genomics, Transcriptomics (week 2 – 5)
• Theories and algorithms (week 2)
– Chapter 1, Probability, Statistics and Information Theory
– Chapter 2, Algorithm complexity analysis
– Chapter 3, Dynamic Programming
• Sequence analysis methods
– Chapter 4, Sequence comparison (week 3)
– Chapter 5, Sequencing technology(week 3)
– Chapter 6 Hidden Markov Model (week 4)
– Chapter 7 Gene prediction (week 4)
– Chapter 8 Multiple sequence alignment (week 4)
• Metagenomics (week 5)
• Part II: Transcriptomics and Proteomics (week 6-10)
• Part III:Student presentations (Week 11-16)
• *Invited Talk (Week 16)
• Final Exam(Week 17 or 18)35
Course features
Subjects
Biological sequences ( Genomic, transcriptomic and Proteomic seqs)
From reads to whole genomes (peptides to proteomes)
From an individual to a population of omics data
Topics
Algorithms
Models
Biology
Theory and Practice:
Probability Theory
Complexity analysis for algorithms
Design and implementation of a bioinformatics tool, such as an HMM-
based gene prediction system
36
Pre-requirements (*)
Mathematics
Calculus
Probability Theory
Statistics
Advanced Algebra
Computer Science
Programming
Biology
Molecular Biology
37
*You have to meet the prerequisites
References
Biological sequence analysis:Probabilistic Models for
Proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh,
G.Mitchison, Cambridge University Press, 1999
生物信息学基础,孙啸, 陆祖宏, 谢建明清华大学出版社,
2004
Introduction to Algorithms, Thomas Cormen, Charles
Leiserson, and Ronald Rivest, The MIT Press.
Unix and Perl (V.2.3.4), K. Bradnam & I. Korf, 2009
An Introduction to Bioinformatics Algorithms
Neil C. Jones and Pavel A. Pevzner
中译本: 生物信息学算法导论, 【美】 N.C琼斯 P.A.帕夫纳 著
王翼飞 等译 , 化学工业出版社 (生物.医药出版分社)
38
Goals
Basic knowledge about omics data analysis
Some practice about omics big data analysis
Critical reading of scientific literatures
Presentation skills
Grading
Homework 30%
Projects* 20%
Presentation 10%
Exam 40%
* Part of the following datasets have been installed in our
local server for you to use for your project. •3K Rice Genome Project Data:17TB
•10K Human Genome Data:to be downloaded
作业规定
作业允许合作,但是必须注明各人的贡献
作业报告必须用自己的语言独立完成
期末考试需要独立完成
严禁抄袭
抄袭者:不及格(F)
被抄袭者:成绩降一级(AB, BC, CD, DF)
不要迟交
41
Reading materials
Nature. 2001 Feb 15;409(6822):860-921.Initial sequencing and analysis of the human genome. International Human Genome Sequencing Consortium
Homework 1(due 10:00 AM next Wednesday,March 6,2019)
Please include:
A. Your Name, ID, Major
B. Your education background, your long term goals
C. Why do you take this course? Or What do you want to learn from this
course?
D. Do you meet all the prerequisites?
E. What time slot is the best for office hour
F. A short report (a page or two) about the reading materials
Read the paper critically, and give your comments
G. A potential topic that you would like to do for your project
Submit it to http://cgm.sjtu.edu.cn/test/obd/index.html
Questions? Ask TA Huimin Lu: [email protected]