Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf ·...


Omics Big Data: Advanced Algorithms for Biological Sequence Analysis

Chaochun Wei (韦朝春)

ccwei@sjtu.edu.cn

http://cgm.sjtu.edu.cn/

Jing Li (李婧)

jing.li@sjtu.edu.cn

Spring 2019

Contents

Background

Some examples about Omics Big Data

Course information

• Goal

• Contents

• Organization

• Grading


“Biology is an information science” -- Leroy Hood

A Big Picture of Biology

http://genomicscience.energy.gov/biofuels/


Bioinformatics / Computational Biology / Systems Biology / Synthetic Biology / Omics Big Data

Bioinformatics

The science of collecting and analyzing complex biological data such as genetic codes.

-- Oxford Dictionary

Major research areas

• Sequence analysis

• Genome annotation

• Computational evolutionary biology

• Analysis of gene expression, regulation

• Comparative genomics

• Literature analysis

• Biological systems modeling

• Structural Biology


Milestone of Modern Biology: The Human Genome Project

Nature, Feb. 15, 2001; Science, Feb. 16, 2001

Human Genome Project: 3 billion dollars, 3 billion residues

“This is ...

• a WASTE of taxpayers’ money….

• We can do a lot more

George Church, Professor of Genetics, Harvard Medical School

Automated high throughput sequencing

AGAACGACCATCAACTAAATCAAAATGCCTTTCAAACCAGCAGACAACCCAAAATGCCCAAAATGCGGCAAATCCGTATACGCCGCNGAAGAAAAAGTAGCTGGAGGATACAAATACCACAAATCCTGCTTCAAATGCGGTATGTGCAATAAAATGCTCGACTCCACCAACGTAACTGAACACGAAGCTGAATTGTACTGCAAAAATTGCCATGGACGTAAATACGGACCTAAAGGATACGGATTCGGTGGTGGAGCTGGGTGCTTAAGTATGGACGATGGAGCCCAATTCAAGGGAACACAATAATTTTAAGAAGGAATCAATGTGAAGATGGCGGCCAAAACCACACCAACTGTCAGCGGTCGTCAGTTCTACCCTTTTCCATCCCCCACTATACACTAATGTAATATTTTTAGATCTTAAATTACAGACTTAGTTTTAATTTATAAATTTTCGTATGACACGTTATAAATAAGAATTCGGTTATTTTGTAATAATTGAATTAAATAAATCTTATTTAAGACCAAAAAA
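A read like the one above is usually summarized before any deeper analysis. The following is a minimal Python sketch, not part of the original slides, that counts the bases in a read and computes its GC content. The function names `base_counts` and `gc_content` are illustrative, and the example string is only a prefix of the read shown above. Ambiguous calls such as `N` are counted separately and excluded from the GC denominator, which is one common convention rather than the only one.

```python
from collections import Counter


def base_counts(seq):
    """Count A, C, G, T and lump every other symbol (e.g. N) under 'N'."""
    counts = Counter(seq.upper())
    result = {base: counts.get(base, 0) for base in "ACGT"}
    # Anything that is not a called base is treated as ambiguous.
    result["N"] = sum(n for base, n in counts.items() if base not in "ACGT")
    return result


def gc_content(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    counts = base_counts(seq)
    called = sum(counts[base] for base in "ACGT")
    if called == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / called


if __name__ == "__main__":
    # Prefix of the read shown on the slide (includes one ambiguous N).
    read = ("AGAACGACCATCAACTAAATCAAAATGCCTTTCAAACCAGCAGACAACCCAAAATGCCC"
            "AAAATGCGGCAAATCCGTATACGCCGCNGAAGAAAAAG")
    print(base_counts(read))
    print(f"GC content: {gc_content(read):.3f}")
```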


Next-generation sequencing technology

Sanger method: a huge lab, numerous machines, and many staff

Next-generation: one staff member, one machine

Nature Methods, Vol. 5, No. 1, p. 16, January 2008

NGS platforms

• Illumina / Solexa HiSeq

• Applied Biosystems ABI 3730XL

• Roche / 454 Genome Sequencer FLX

• Applied Biosystems SOLiD

• HeliScope™ Single Molecule Sequencer

• Illumina / Solexa MiSeq

• Pacific Biosciences RS II

• 3rd generation: Oxford Nanopore MinION

Comparison of sequencing platforms (2019.2)

| Platform | Sanger | 454 | HiSeq X Ten* | MiSeq* | NovaSeq 6000* | PacBio Sequel** | Nanopore |
|---|---|---|---|---|---|---|---|
| Read length | 650-1100 | 150-1000 | 150 | 36-300 | 2x250 | Average 30k | Up to 2 Mb |
| # of reads/run | 96 | 0.4-2M | 5.3-6B | 15M-25M | 32-40B | ~500k | Up to 500 |
| Error rate | 10^-3 | <10^-2 | ~10^-3 | ~10^-3 | ~10^-3 | ~1% | Varies |
| Cost ($/Mbp) | 5000 | ~5 | <0.01 | ~0.5 | <0.001 | ~0.3 | ~0.1 |
| Time/run | ~3 hours | ~7 hours | <3 days | 4-56 hours | 13-38 hours | <20 hours | As little as 5 mins |
| Throughput | 100 Kb | ~1 Gb | 1.6-1.8 Tb | 540 Mb-15 Gb | 4.8-6 Tb | Up to 20 Gb | 10-30 Gb |

* https://www.illumina.com/systems/

** https://www.pacb.com/products-and-services/sequel-system/
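To make the Cost ($/Mbp) row concrete, the sketch below converts each per-megabase cost into an approximate cost for one human genome. It is an illustration added here, not part of the original slides. It assumes a genome of about 3,000 Mbp (the "3 billion residues" quoted earlier) and 30X coverage (the depth quoted for the HiSeq X series). The name `genome_cost` is hypothetical, and the entries given as "<" or "~" in the table are treated as plain numbers, so the results are rough bounds only.

```python
GENOME_SIZE_MBP = 3_000   # assumption: ~3 billion bases per human genome
COVERAGE = 30             # assumption: 30X sequencing depth

# Cost in $/Mbp, copied from the comparison table ("<" and "~" dropped).
COST_PER_MBP = {
    "Sanger": 5000,
    "454": 5,
    "HiSeq X Ten": 0.01,
    "MiSeq": 0.5,
    "NovaSeq 6000": 0.001,
    "PacBio Sequel": 0.3,
    "Nanopore": 0.1,
}


def genome_cost(cost_per_mbp, genome_mbp=GENOME_SIZE_MBP, coverage=COVERAGE):
    """Approximate sequencing cost in dollars for one genome at a given depth."""
    return cost_per_mbp * genome_mbp * coverage


if __name__ == "__main__":
    for platform, cost in COST_PER_MBP.items():
        print(f"{platform:>14}: ${genome_cost(cost):,.0f}")
```

Under these assumptions the HiSeq X Ten figure comes out at roughly $900 and the NovaSeq 6000 figure at roughly $90, in line with the "<$1000 per 30X genome" and "<$100 per human genome" claims on the later platform slides.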

Latest sequencing platforms

Complete Genomics (acquired by BGI, 2013)

• Finish 10,000 genomes/year (2010)

Oxford Nanopore

• Human genome: $3000, 6 hours (2012)

Visigen

BIGIS (China)

Ion Torrent

Pacific Biosciences RS II

• Read length: maximum > 30 kb

more…

Popular sequencing platforms

HiSeq X series (2015)

• Ten: > 18,000 human genomes/year

• <$1000 per 30X genome

NovaSeq 5000 (2017 from Illumina)

• <$100 per human genome

• 2 Tb in 2.5 days, 1.6 billion reads

* http://www.illumina.com/systems

Latest sequencing platforms

NovaSeq 6000 (2018 from Illumina)

• <$100 per human genome

• 6Tb in 2 days, 20 billion reads

MinION (Oxford Nanopore)

• Powered through a USB 3.0 cable

• 10-30 Gb/flow cell

• Can be used ANYWHERE

• $1000/pack (device + kits + membership)

* http://www.illumina.com/systems

BGI Sequencer

Launched in Oct. 2018

6 Tb/day

The first few personal genomes

Craig Venter genome

James Watson genome

2 Korean genomes

1 Chinese genome

2 cancer genomes

1 African genome

Stephen Quake genome

Family of four, by the Institute for Systems Biology

……….

Latest personal genome projects

1000 Genomes Project (UK, China, US)

ClinSeq (NHGRI)

International Cancer Genome Consortium (Canada)

23andMe Research Revolution (US)


10,000 Human Genome Project (USA)

300,000 Human Genome Project (Iceland)

500,000 Genomes Project (UK *)

1,000,000 Human Genome Project (USA)

Cancer genomes

* http://www.ukbiobank.ac.uk/

Personal genomics

6 billion genomic DNA base pairs, ~22k genes, >300,000 proteins

Personal genomic difference: 6 million bp
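The two numbers above imply a simple ratio, worked out in the small Python sketch below. It is an added illustration that uses only the 6 billion base pairs and 6 million differing base pairs quoted on this slide; the function name `difference_fraction` is hypothetical.

```python
GENOME_BP = 6_000_000_000      # genomic DNA base pairs (from the slide)
DIFFERENCE_BP = 6_000_000      # personal genomic difference (from the slide)


def difference_fraction(diff_bp, genome_bp):
    """Fraction of genome positions that differ between individuals."""
    return diff_bp / genome_bp


if __name__ == "__main__":
    frac = difference_fraction(DIFFERENCE_BP, GENOME_BP)
    print(f"Fraction differing: {frac:.4%}")
    print(f"About one difference every {round(1 / frac):,} bp")
```

With these figures the difference is about 0.1% of the genome, or roughly one position in every thousand.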

Common and rare variations

Common variations

– GWAS generally has the power to detect only common SNP variations!

How to detect rare variations?

– whole genome sequencing!

In genetic epidemiology, a genome-wide association study (GWAS) is an examination of many common genetic variants in different individuals to see if any variant is associated with a trait. GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major diseases.
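As a rough illustration of what "associated with a trait" means for a single SNP, the sketch below runs a Pearson chi-square test (1 degree of freedom) on a 2x2 table of allele counts in cases and controls. This is a simplified stand-in added here, not a method described on the slides: real GWAS pipelines test very many SNPs, correct for multiple testing (a threshold near 5 × 10^-8 is conventional), and usually adjust for covariates. The function name `allelic_chi_square` and the counts in the example are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic and p-value for a 2x2 allele-count table.

    Rows are cases/controls, columns are alternate/reference allele counts.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        raise ValueError("every row and column total must be non-zero")
    # Shortcut formula for the 2x2 Pearson chi-square statistic.
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        row_case * row_ctrl * col_alt * col_ref
    )
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical allele counts for one SNP (not real data).
    chi2, p = allelic_chi_square(case_alt=60, case_ref=40,
                                 ctrl_alt=40, ctrl_ref=60)
    print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```

A rare variant contributes very few alternate alleles to either row, so the statistic stays small unless the sample is very large, which is one way to see why GWAS mostly detects common variants.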

Cancer Genomes

TCGA: http://cancergenome.nih.gov/

ICGC: http://icgc.org/

De novo sequencing

targeted sequencing

a large number of small genomes

SNP discovery

without reference genomes

Transcriptome study

Unknown transcriptome

Metagenome study

Microbial genomes in nature

Epigenetics study

Regulatory element

ChIP-seq, RNA-seq

Other new projects

High throughput sequence alignment

New ideas, new projects

Metagenomes, Pan-genomes, Single-molecule sequencing

Metagenomes

• HMG

• Marine metagenomes

Pan-genomes

• 3,000 rice genomes (50TB raw reads)

• 1,000 genomes of Arabidopsis thaliana

• 15,000 tomato genomes?

Single-cell sequencing

• Different tissues, development stages, conditions

• …

Sequencing every cell in a human body

10^14 cells

https://www.humancellatlas.org/

Sequencing the world!


Sequencing all Eukaryotes

http://www.sciencemag.org/news/2017/02/biologists-propose-sequence-dna-all-life-earth

Sequencing the Earth!

Earth Microbiome Project

• Constructing the Microbial Map for Planet Earth

• 200,000 samples

• 500,000 reconstructed microbial genomes


http://www.earthmicrobiome.org/

Single Cell Sequencing

Mouse Cell Atlas

• 400,000 single cell mRNA-seq*

* Han et al., Cell (2018), 172(5):1091-1107

Big data science in China

National Science and Technology Survey, 2017.02 (https://sojump.com/jq/11572197.aspx). Survey questions shown:

• The size of data you have been working on

• Where do you publish your data?

• Where do you store your data?

• Top 10 reasons people don’t share data

Computer cluster/Super computer

Pi: ranked 158th worldwide, 2013

750 TB (2013.1-2016.9), 3.6 PB (2016.12), 10 PB (2019.6)

Go inside

Course organization

• Introduction (week 1)

• Part I: Genomics, Transcriptomics (week 2 – 5)

• Theories and algorithms (week 2)

– Chapter 1, Probability, Statistics and Information Theory

– Chapter 2, Algorithm complexity analysis

– Chapter 3, Dynamic Programming

• Sequence analysis methods

– Chapter 4, Sequence comparison (week 3)

– Chapter 5, Sequencing technology (week 3)

– Chapter 6, Hidden Markov Model (week 4)

– Chapter 7, Gene prediction (week 4)

– Chapter 8, Multiple sequence alignment (week 4)

• Metagenomics (week 5)

• Part II: Transcriptomics and Proteomics (week 6-10)

• Part III: Student presentations (Week 11-16)

• *Invited Talk (Week 16)

• Final Exam (Week 17 or 18)

Course features

Subjects

Biological sequences (genomic, transcriptomic, and proteomic sequences)

From reads to whole genomes (peptides to proteomes)

From an individual to a population of omics data

Topics

Algorithms

Models

Biology

Theory and Practice:

Probability Theory

Complexity analysis for algorithms

Design and implementation of a bioinformatics tool, such as an HMM-based gene prediction system

Prerequisites (*)

Mathematics

Calculus

Probability Theory

Statistics

Advanced Algebra

Computer Science

Programming

Biology

Molecular Biology

* You have to meet the prerequisites

References

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Cambridge University Press, 1999

Fundamentals of Bioinformatics (生物信息学基础), Sun Xiao, Lu Zuhong, Xie Jianming, Tsinghua University Press, 2004

Introduction to Algorithms, Thomas Cormen, Charles Leiserson, and Ronald Rivest, The MIT Press

Unix and Perl (V.2.3.4), K. Bradnam & I. Korf, 2009

An Introduction to Bioinformatics Algorithms, Neil C. Jones and Pavel A. Pevzner

Chinese translation: 生物信息学算法导论, N. C. Jones and P. A. Pevzner, translated by Wang Yifei et al., Chemical Industry Press (Biology and Medicine Publishing Branch)

Goals

Basic knowledge about omics data analysis

Some practice about omics big data analysis

Critical reading of scientific literature

Presentation skills

Grading

Homework 30%

Projects* 20%

Presentation 10%

Exam 40%

* Parts of the following datasets have been installed on our local server for you to use in your project:

• 3K Rice Genome Project data: 17 TB

• 10K Human Genome data: to be downloaded

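Homework policy

Collaboration on homework is allowed, but each person's contribution must be stated

Homework reports must be written independently, in your own words

The final exam must be completed independently

Plagiarism is strictly forbidden

Those who copy: fail (F)

Those whose work is copied: grade lowered by one level (A to B, B to C, C to D, D to F)

Do not submit late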

Reading materials

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.

Homework 1 (due 10:00 AM next Wednesday, March 6, 2019)

Please include:

A. Your Name, ID, Major

B. Your educational background and your long-term goals

C. Why are you taking this course? Or what do you want to learn from it?

D. Do you meet all the prerequisites?

E. Which time slot is best for office hours?

F. A short report (a page or two) about the reading materials. Read the paper critically, and give your comments

G. A potential topic that you would like to do for your project

Submit it to http://cgm.sjtu.edu.cn/test/obd/index.html

Questions? Ask TA Huimin Lu: linuslu6@outlook.com
