43
1896 1920 1987 2006 Omics Big Data: Advanced Algorithms for Biological Sequence Analysis Chaochun Wei (韦朝春) [email protected] http://cgm.sjtu.edu.cn/ Jing Li(李婧) [email protected] 1 Spring 2019

Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

1896 1920 1987 2006

Omics Big Data: Advanced Algorithms for Biological Sequence Analysis

Chaochun Wei (韦朝春)

[email protected]

http://cgm.sjtu.edu.cn/

Jing Li(李婧)

[email protected]

1Spring 2019

Page 2: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Contents

Background

Some examples about Omics Big Data

Course information

• Goal

• Contents

• Organization

• Grading

2

Page 3: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

“Biology is an information science” -- Leroy Hood

A Big Picture of Biology

http://genomicscience.energy.gov/biofuels/3

Page 4: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

http://genomicscience.energy.gov/biofuels/

A Big Picture of Biology

Page 5: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

5

Bioinformatics/Computational Biology /

Systems Biology/Synthetic Biology/Omics Big Data

Page 6: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Bioinformatics

The science of collecting and analyzing

complex biological data such as genetic

codes.

-- Oxford Dictionary

Major research areas

• Sequence analysis

• Genome annotation

• Computational evolutionary biology

• Analysis of gene expression, regulation

• Comparative genomics

• Literature analysis

• Biological systems modeling

• Structural Biology

6

Page 7: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Milestone of Modern Biology:The human genome project

Feb. 15, 2001 Nature Feb. 16, 2001 Science

7

Page 8: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Human Genome Project3 billion dollars,3 billion residues

“This is ...

• a WASTE of tax payers’

money….

• We can do a lot more

8

George ChurchProfessor of Genetics Harvard Medical School

Page 9: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Automated high throughput sequencing

AGAACGACCATCAACTAAATCAAAATGCCTTTCAAACCAGCAGACAACCCAAAATGCCCAAAATGCGGCAAATCCGTATACGCCGCNGAAGAAAAAGTAGCTGGAGGATACAAATACCACAAATCCTGCTTCAAATGCGGTATGTGCAATAAAATGCTCGACTCCACCAACGTAACTGAACACGAAGCTGAATTGTACTGCAAAAATTGCCATGGACGTAAATACGGACCTAAAGGATACGGATTCGGTGGTGGAGCTGGGTGCTTAAGTATGGACGATGGAGCCCAATTCAAGGGAACACAATAATTTTAAGAAGGAATCAATGTGAAGATGGCGGCCAAAACCACACCAACTGTCAGCGGTCGTCAGTTCTACCCTTTTCCATCCCCCACTATACACTAATGTAATATTTTTAGATCTTAAATTACAGACTTAGTTTTAATTTATAAATTTTCGTATGACACGTTATAAATAAGAATTCGGTTATTTTGTAATAATTGAATTAAATAAATCTTATTTAAGACCAAAAAA

9

Page 10: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Next-generation sequencing technology

10

Sanger method: huge lab, numerous machines and staffs

<

Next-generation:one staff, one machine

NATURE METHODS ,16 | VOL.5 NO.1 | JANUARY 2008

Page 11: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

NGS platforms

11

Illumina / Solexa HiSeq

Applied Biosystems

ABI 3730XL

Roche / 454 Genome Sequencer FLX

Applied BiosystemsSOLiD

HeliScope™ Single Molecule Sequencer

3rd generation: Oxford NanoporeMinION

Illumina / Solexa MiSeqPacific Biosciences RS II

Page 12: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Comparison of sequencing platforms (2019.2)

12

Platforms Sanger 454HiSeq X Ten *

MiSeq *NovaSeq

6000*

PacBioSequel

**Nanopore

Read length650-1100

150-1000

150 36-300 2x250Average 30k

Up to 2 Mb

# of reads/run 96 0.4-2M 5.3-6 B15M –25M

32-40B ~500k Up to 500

Error rate 10^-3 <10^-2 ~10^-3 ~10^-3 ~10^-3 ~1% Varies

Cost($/Mbp) 5000 ~5 <0.01 ~0.5 <0.001 ~0.3 ~0.1

Time/run~3

hours~7

hours<3 days

4-56 hours

13-38hr< 20 hours

As little as 5 mins

Throughput 100Kb ~1Gb 1.6-1.8Tb540Mb-

15Gb4.8-6Tb

Up to 20 Gb

10-30Gb

* https://www.illumina.com/systems/** https://www.pacb.com/products-and-services/sequel-system/

Page 13: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Latest sequencing platforms

Complete Genomics (obtained by BGI, 2013)

• Finish 10,000 genomes/year (2010)

Oxford Nanopore

• Human genome:$3000, 6 hours (2012)

Visigen

BIGIS (China)

more…13

Ion Torrent

Pacific Biosciences RS IIRead length: Maximum > 30 Kb

Page 14: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Popular sequencing platforms

HiSeq X series (2015)

• Ten: > 18,000 human genomes/year

• <$1000 per 30X genome

NovaSeq 5000 (2017 from Illumina)

• <$100 per human genome

• 2Tb in 2.5 days, 1.6 billion reads 14

* http://www.illumina.com/systems

Page 15: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Latest sequencing platforms

NovaSeq 6000 (2018 from Illumina)

• <$100 per human genome

• 6Tb in 2 days, 20 billion reads

MinION (Oxford Nanopore)

• Use USB 3.0 cable as power

• 10-30Gb/flow cell

• Can be used at ANYWHERE

• $1000/pack(device + kits + membership)

* http://www.illumina.com/systems

Page 16: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

BGI Sequencer

Launched on Oct. 2018

6Tb/day

Page 17: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

The first a few personal genomes

Craig Venter genome

James Watson genome

2 Koran genomes

1 Chinese genome

2 cancer genomes

1 African genome

Stephen Quake genome

Family of Four by Institute of System Biology

……….

17

Page 18: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Latest personal genome projects

1000 Genome Projects (UK, China, US)

ClinSeq (NHGRI)

International Cancer Genome Consortium (Canada)

23andMe Research Revolution (US)

18

Page 19: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Latest personal genome projects

23andMe Research Revolution (US)

10,000 Human Genome Project (USA)

300,000 Human Genome Project (Iceland)

500,000 Genome Projects (UK *)

1,000,000 Human Genome Project (USA)

Cancer genomes

* http://www.ukbiobank.ac.uk/

Page 20: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Personal genomics

6 billions genomic DNA base pairs, ~22k genes,

>300,000 proteins,

Personalized genomic difference: 6 million bps

Page 21: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Common and rare variations

Common variations

– GWAS generally only have the power

to detect common SNP variations!

How to detect rare variations?

– whole genome sequencing!

In genetic epidemiology, a genome-wide association study (GWAS) is

an examination of many common genetic variants in different

individuals to see if any variant is associated with a trait. GWAS

typically focus on associations between single-nucleotide

polymorphisms (SNPs) and traits like major diseases.

Page 22: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Caner Genomes

TCGA:http://cancergenome.nih.gov/ ICGC:http://icgc.org/

Page 23: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

De novo sequencing

targeted sequencing

a large number of small genomes

SNP discovery

without reference genomes

Transcriptom study

Unknown Transcriptom

Metagenome study

Microbial genomes in nature

Epigentics study

Regulatory element

Chip-seq, RNA-seq

Other new projects

High throughput sequence alignment

New ideas, new projects

Page 24: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Metagenomes, Pan-genomes, Single-molecular sequencing

Metagenomes

• HMG

• Marine metagenomes

Pan-genomes

• 3,000 rice genomes (50TB raw reads)

• 1,000 genomes of Arabdopsis thaliana

• 15,000 tomato genomes?

Single-cell sequencing

• Different tissues, development stages, conditions

• …24

Page 25: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Sequencing every cells in a human body

10^14 cells

25

https://www.humancellatlas.org/

Page 26: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Sequencing the world!

26

Sequencing all Eukaryotes

http://www.sciencemag.org/news/2017/02/biologists-propose-sequence-dna-all-life-earth

Page 27: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Sequencing the Earth!

Earth Microbiome Project

• Constructing the Microbial Map for Planet Earth

• 200,000 samples

• 500,000 reconstructed microbial genomes

27

http://www.earthmicrobiome.org/

Page 28: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Single Cell Sequencing

Mouse Cell Atlas

• 400,000 single cell mRNA-seq*

*Han et. al, Cell (2018), 172(5):1091-110728

Page 29: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Big data science in China

The size of data you have been working onNational Science and Technology Survey, 2017.02

https://sojump.com/jq/11572197.aspx

Page 30: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Big data science in China

Where do you publish your data?National Science and Technology Survey, 2017.02

https://sojump.com/jq/11572197.aspx

Page 31: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Big data science in China

Where do you store your data?National Science and Technology Survey, 2017.02

https://sojump.com/jq/11572197.aspx

Page 32: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Big data science in China

Top 10 reasons people don’t share dataNational Science and Technology Survey, 2017.02

https://sojump.com/jq/11572197.aspx

Page 33: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Computer cluster/Super computer

Pi: Ranked 158 world wide, 2013750TB(2013.1-2016.9) 3.6PB (2016.12) 10PB (2019.6)

Page 34: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Go inside

Page 35: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Course organization • Introduction (week 1)

• Part I: Genomics, Transcriptomics (week 2 – 5)

• Theories and algorithms (week 2)

– Chapter 1, Probability, Statistics and Information Theory

– Chapter 2, Algorithm complexity analysis

– Chapter 3, Dynamic Programming

• Sequence analysis methods

– Chapter 4, Sequence comparison (week 3)

– Chapter 5, Sequencing technology(week 3)

– Chapter 6 Hidden Markov Model (week 4)

– Chapter 7 Gene prediction (week 4)

– Chapter 8 Multiple sequence alignment (week 4)

• Metagenomics (week 5)

• Part II: Transcriptomics and Proteomics (week 6-10)

• Part III:Student presentations (Week 11-16)

• *Invited Talk (Week 16)

• Final Exam(Week 17 or 18)35

Page 36: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Course features

Subjects

Biological sequences ( Genomic, transcriptomic and Proteomic seqs)

From reads to whole genomes (peptides to proteomes)

From an individual to a population of omics data

Topics

Algorithms

Models

Biology

Theory and Practice:

Probability Theory

Complexity analysis for algorithms

Design and implementation of a bioinformatics tool, such as an HMM-

based gene prediction system

36

Page 37: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Pre-requirements (*)

Mathematics

Calculus

Probability Theory

Statistics

Advanced Algebra

Computer Science

Programming

Biology

Molecular Biology

37

*You have to meet the prerequisites

Page 38: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

References

Biological sequence analysis:Probabilistic Models for

Proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh,

G.Mitchison, Cambridge University Press, 1999

生物信息学基础,孙啸, 陆祖宏, 谢建明清华大学出版社,

2004

Introduction to Algorithms, Thomas Cormen, Charles

Leiserson, and Ronald Rivest, The MIT Press.

Unix and Perl (V.2.3.4), K. Bradnam & I. Korf, 2009

An Introduction to Bioinformatics Algorithms

Neil C. Jones and Pavel A. Pevzner

中译本: 生物信息学算法导论, 【美】 N.C琼斯 P.A.帕夫纳 著

王翼飞 等译 , 化学工业出版社 (生物.医药出版分社)

38

Page 39: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Goals

Basic knowledge about omics data analysis

Some practice about omics big data analysis

Critical reading of scientific literatures

Presentation skills

Page 40: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Grading

Homework 30%

Projects* 20%

Presentation 10%

Exam 40%

* Part of the following datasets have been installed in our

local server for you to use for your project. •3K Rice Genome Project Data:17TB

•10K Human Genome Data:to be downloaded

Page 41: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

作业规定

作业允许合作,但是必须注明各人的贡献

作业报告必须用自己的语言独立完成

期末考试需要独立完成

严禁抄袭

抄袭者:不及格(F)

被抄袭者:成绩降一级(AB, BC, CD, DF)

不要迟交

41

Page 42: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Reading materials

Nature. 2001 Feb 15;409(6822):860-921.Initial sequencing and analysis of the human genome. International Human Genome Sequencing Consortium

Page 43: Omics Big Data - SJTUcgm.sjtu.edu.cn/index/pub/courses/2019/omics/week-1-1_Introduction.pdf · References Biological sequence analysis:Probabilistic Models for Proteins and Nucleic

Homework 1(due 10:00 AM next Wednesday,March 6,2019)

Please include:

A. Your Name, ID, Major

B. Your education background, your long term goals

C. Why do you take this course? Or What do you want to learn from this

course?

D. Do you meet all the prerequisites?

E. What time slot is the best for office hour

F. A short report (a page or two) about the reading materials

Read the paper critically, and give your comments

G. A potential topic that you would like to do for your project

Submit it to http://cgm.sjtu.edu.cn/test/obd/index.html

Questions? Ask TA Huimin Lu: [email protected]