Bioinformatics and Connection to Computational Pathology Fayyaz Minhas Department of Computer Science University of Warwick


Bioinformatics and Connection to Computational Pathology

Fayyaz Minhas

Department of Computer Science

University of Warwick


Why Bioinformatics?

• How do we know that humans and chimpanzees share more than 95% of their DNA?

• Human Genome Project

• How to compare?
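
One simple way to put a number on "how much DNA is shared" is percent identity. The sketch below assumes the two sequences have already been aligned to the same length; the function name `percent_identity` and the toy sequences are illustrative assumptions, not real human or chimpanzee data.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length, pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count positions where both sequences carry the same base.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Toy sequences, made up for illustration only.
toy_a = "ACGTACGTAC"
toy_b = "ACGTACGTTC"
print(percent_identity(toy_a, toy_b))
```

Real genomes differ by insertions and deletions as well as substitutions, so they have to be aligned before a measure like this is meaningful.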


Why Bioinformatics?

• The knapsack problem

• Uses dynamic programming
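
As a minimal sketch of dynamic programming on the knapsack problem named above, the code below solves the 0/1 variant (each item used at most once) by filling a table of best values per capacity. The function name and the example items are assumptions chosen for illustration.

```python
def knapsack(values, weights, capacity):
    """Best total value of items that fit in `capacity`, each item used at most once."""
    # best[c] holds the best value achievable with total weight <= c.
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Walk capacities downwards so each item is counted at most once.
        for cap in range(capacity, weight - 1, -1):
            best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]


# Three hypothetical items (value, weight) and a knapsack of capacity 50.
print(knapsack([60, 100, 120], [10, 20, 30], 50))
```

The same table-filling idea underlies classic sequence alignment algorithms such as Needleman-Wunsch and Smith-Waterman, which is why dynamic programming matters for comparing DNA.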


Why Bioinformatics?

• Tree of life


Why Bioinformatics?

• How are humans across the Earth related to each other?

• Human Genographic project


Why Bioinformatics?

• How can we screen for disease?


Why Bioinformatics?

• Personalized medicine


Why Bioinformatics?

• How can we fight against diseases like cancer?


Handling viruses

• In Silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction with Cotton leaf curl Multan betasatellite encoded βC1

• βC1, pathogenicity determinant encoded by Cotton leaf curl Multan betasatellite interacts with calmodulin-like protein 11 (CML11) in Gossypium hirsutum

Kamal, Hira, Fayyaz ul Amir Afsar Minhas, Hanu Pappu, Imran Amin et al. “In Silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction with Cotton leaf curl Multan betasatellite encoded βC1.” Frontiers in Plant Science 10 (2019): 656.

Kamal, Hira, Fayyaz Minhas, et al. “Bioinformatics and molecular analysis of Gossypium hirsutum calmodulin-like protein (CML11) interaction with begomovirus-transcription activator protein C2.” PLoS ONE (in press).


Why Bioinformatics?

• How can we find out what the effects of a certain disease are?


Why Bioinformatics?

• How can we design new life?


https://www.ted.com/talks/craig_venter_unveils_synthetic_life


Molecular Biology Fundamentals


Genome: The ‘Program’

• Genome is the genetic material of an organism

• Deoxyribonucleic acid (DNA)

• Encodes these genetic instructions


How is the program stored?


Watson & Crick with DNA model

Rosalind Franklin with X-ray image of DNA


Program size: DNA base pairs (bp)

Organism                          # of base pairs      # of chromosomes

Virus
HIV                               9193                 1
SARS                              29751                1
Porcine circovirus                1759                 1

Prokaryotic
Haemophilus influenzae            1.8 x 10^6           1
Escherichia coli (bacterium)      4.6 x 10^6           1
Carsonella ruddii                 159,662 (0.16M)      1

Eukaryotic
S. cerevisiae (yeast)             1.35 x 10^7          17
Drosophila melanogaster (fly)     1.65 x 10^8          4
Homo sapiens (human)              2.9 x 10^9           23
Paris japonica                    150 x 10^9           -

http://www.nature.com/news/2006/061009/full/news061009-10.html

http://en.wikipedia.org/wiki/Genome


Execution of the program: Central dogma of molecular biology

• DNA → (pre-mRNA) → mRNA: (A,T,G,C) to (A,U,G,C), carried out by RNA polymerase and followed by splicing

• mRNA → Protein: RNA bases to amino acids, (A,U,G,C) to (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y), carried out by the ribosome
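
The two steps of the central dogma can be sketched in a few lines. This is a simplified illustration: it assumes the input is the coding (sense) strand, ignores splicing, and starts translating at the first base. The function names and the short example sequence are assumptions; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """DNA (A,T,G,C) to mRNA (A,U,G,C), given the coding strand."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGTGCACTGA")
print(mrna, translate(mrna))
```

The table maps 64 codons onto 20 amino acids plus the stop signal, so several codons encode the same amino acid.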


Proteins


Nonsense Mutation

• A point mutation in a sequence of DNA that results in a premature stop codon

• Protein product is incomplete or non-functional

• Beta-Thalassemia

• Results from a single point mutation

• HBB gene on chromosome 11

• Reduction in production of hemoglobin

• HBB blockage over time leads to decreased Beta-chain synthesis

• Having a single gene for thalassemia may protect against malaria

• One of the most commonly inherited disorders in Pakistan

• With a prevalence rate of 6% in the Pakistani population

• 5000-9000 children every year
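
To make the effect of a premature stop codon concrete, the sketch below applies a single point mutation to a toy coding sequence and translates both versions. The sequence is made up for illustration and is not the real HBB gene; the helper names are assumptions.

```python
from itertools import product

# Standard genetic code over DNA codons (T, C, A, G order); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(coding_dna: str) -> str:
    """Translate a coding strand codon by codon until the first stop codon."""
    protein = []
    for start in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def point_mutation(dna: str, position: int, new_base: str) -> str:
    """Return a copy of `dna` with the base at `position` (0-based) replaced."""
    return dna[:position] + new_base + dna[position + 1:]


normal = "ATGCAGGTGAAGTAA"               # toy gene: ATG CAG GTG AAG TAA
mutant = point_mutation(normal, 3, "T")  # CAG becomes the stop codon TAG

print(translate_dna(normal))  # full-length toy protein
print(translate_dna(mutant))  # truncated after the first amino acid
```

A single base change is enough to cut the protein short, which is why the product is incomplete or non-functional.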


What is going on in your body?


What can the cell do?

What is it doing?

How is it doing it?


Sequencing Technologies &

Algorithms


Human Genome Project

• Started in 1990

• Objective: Sequence the human genome by 2005

• Achieved: 2000

• Government consortium

• Cost: $3 Billion

• Craig Venter’s Celera / Solexa

• $1000 genome project

• 1000 genomes project


Sequencing Technologies

• Sanger Sequencing

• 454 Sequencing / Roche

• GS Junior System

• GS FLX+ System

• Illumina (Solexa)

• HiSeq System

• Genome analyzer IIx

• MiSeq

• Applied Biosystems - Life Technologies

• SOLiD 5500 System

• SOLiD 5500xl System

• Ion Torrent - Life Technologies

• Personal Genome Machine (PGM)

• Proton

• Helicos

• Helicos Genetic Analysis System

• Pacific Biosciences

• PacBio RS

• Oxford Nanopore Technologies

• GridION System

• MinION

First Generation

2nd Generation (Next Generation Sequencing, NGS; Deep Sequencing; High-throughput sequencing): amplified single molecule sequencing, most widely used right now

3rd Generation (Next Next Generation Sequencing): single molecule sequencing

HiSeq 2000


Steps in Sequencing

• DNA Extraction

• Preprocessing (Amplification, …)

• Sequencing

• Shotgun sequencing

• Reads

• Assembly

• Data analysis


Shotgun Sequencing: The case of exploding newspapers


Joining overlapping reads


Completing the overlap puzzle


Sequencing and newspaper explosions: 1

• Take (millions of) copies of the DNA you want to sequence


Sequencing and newspaper explosions: 2

• Fragment the DNA into smaller pieces

• Because our sequencing technologies can only read very short fragments reliably


Sequencing and newspaper explosions: 3

• The short fragments resulting from DNA fragmentation are called reads

• Some reads disappear


Sequencing and newspaper explosions: 3

• We get the reads but we have no idea where they came from in the DNA

• No position information

• Need to reconstruct the DNA sequence
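
The situation described above can be imitated with a small simulation: fixed-length reads are drawn from random positions of a known sequence and then shuffled, so no position information survives. The function name, the toy genome, and the read length are assumptions for illustration; real reads also contain sequencing errors, which this sketch ignores.

```python
import random


def shotgun_reads(genome: str, read_length: int, num_reads: int, seed: int = 0):
    """Draw `num_reads` substrings of `read_length` from random positions, in random order."""
    last_start = len(genome) - read_length
    if last_start < 0:
        raise ValueError("read_length is longer than the genome")
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, last_start)  # start position is not recorded
        reads.append(genome[start:start + read_length])
    rng.shuffle(reads)
    return reads


toy_genome = "ATGCGTACCTGAGGTCA"  # made-up sequence
for read in shotgun_reads(toy_genome, read_length=8, num_reads=6):
    print(read)
```

Because start positions are random, some parts of the sequence may be covered many times and others not at all, which is one reason many copies are sequenced.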


Sequencing and newspaper explosions: 4

• Solve it as an overlap puzzle


Sequencing and newspaper explosions: 5

• Reconcile the pieces
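
The overlap puzzle can be sketched as a greedy procedure: repeatedly merge the two pieces with the longest suffix-prefix overlap. This is a toy heuristic under strong assumptions (error-free reads, no repeats, a made-up sequence, hypothetical function names) and not how production assemblers work; they typically use overlap-layout-consensus or de Bruijn graph methods.

```python
def overlap(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap: int = 3):
    """Merge reads by largest overlap first; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining pair overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        # Discard pieces that the merged contig now contains.
        contigs = [c for c in rest if c not in merged] + [merged]
    return contigs


reads = ["TGAGGTCA", "ATGCGTAC", "GTACCTGA"]  # unordered toy reads
print(greedy_assemble(reads))
```

Repeated regions longer than a read make overlaps ambiguous, which is where a greedy strategy can join the wrong pieces.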


Sequencing as a computational problem


Comparison of de novo assemblers

• Zhang, Wenyu, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, and Bairong Shen. “A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies.” PLoS ONE 6, no. 3 (March 14, 2011). doi:10.1371/journal.pone.0017915.


Sequencing: Costs & Amount


http://sulab.org/2013/06/sequenced-genomes-per-year/


Applications of Sequencing


How sequencing machines work

• Input DNA/RNA Sample

• Output

• FASTQ Files: Reads stored

• Also have quality information

• Phred quality scoring

• Different machines use different encodings for quality


!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
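
The printable characters above are what a FASTQ quality line is made of: each character encodes one Phred score Q, and Q corresponds to an error probability of 10^(-Q/10). The sketch below assumes simple four-line FASTQ records and an offset of 33 (Phred+33); some older Illumina pipelines used an offset of 64, which is why the offset is a parameter. The function names and the example record are assumptions.

```python
def phred_scores(quality_string: str, offset: int = 33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def parse_fastq(text: str, offset: int = 33):
    """Parse four-line FASTQ records into (name, sequence, scores) tuples."""
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, phred_scores(quality, offset)))
    return records


example = "@read1\nACGT\n+\nII5!\n"  # made-up record
for name, sequence, scores in parse_fastq(example):
    print(name, sequence, scores, [error_probability(q) for q in scores])
```

With Phred+33, `!` is the lowest score (0) and `I` corresponds to 40, that is an error probability of 1 in 10,000.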


Genome Assembly

• Based on reads, assemble them into a genome

• Whole Genome Sequencing

• Input: FASTQ file

• Output: Genome

• Whole Exome or Targeted Sequencing

• Input: FASTQ file of reads, Reference Sequence

• Output: Genome


RNA-Seq: What are you doing?

• Input: Reference Genome, RNA reads

• Output: Alignment File (SAM or BAM)

• Tells where each read is aligned
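
A SAM file is tab-separated text with eleven mandatory fields per alignment (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); BAM is its compressed binary form. The sketch below pulls out the fields that say where a read aligned. The record class, function names, and example line are assumptions for illustration; header lines starting with `@` and optional tags are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int        # 1-based leftmost position on the reference
    mapping_quality: int
    cigar: str
    sequence: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line of a SAM file (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


def is_mapped(record: SamRecord) -> bool:
    """Flag bit 0x4 is set when the read is unmapped."""
    return not record.flag & 0x4


line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"  # made-up alignment
record = parse_sam_line(line)
print(record.reference, record.position, is_mapped(record))
```

For real data, an established library is a safer choice than hand-written parsing; this sketch only shows what the alignment file records.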


RNA-Seq


Differential expression

http://www.fejes.ca/labels/figures.html
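
At its simplest, differential expression compares how many reads map to a gene under two conditions. The sketch below computes a log2 fold change from hypothetical read counts; the function name, gene names, counts, and the pseudocount of 1 are all assumptions. Real analyses also normalize for sequencing depth and test for statistical significance, which this sketch leaves out.

```python
import math


def log2_fold_change(control_counts, treated_counts, pseudocount: float = 1.0) -> float:
    """log2 ratio of mean treated to mean control counts, with a pseudocount to avoid log(0)."""
    mean_control = sum(control_counts) / len(control_counts)
    mean_treated = sum(treated_counts) / len(treated_counts)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Hypothetical read counts per gene: (control replicates, treated replicates).
counts = {
    "geneA": ([7, 7], [31, 31]),
    "geneB": ([50, 54], [49, 55]),
}
for gene, (control, treated) in counts.items():
    print(gene, round(log2_fold_change(control, treated), 2))
```

A value of 2 means roughly four times as many reads in the treated condition, and a value near 0 means no change.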


Gene Expression meets Pathology

https://www.nejm.org/doi/full/10.1056/NEJMoa021967


Knowledge Transfer between spaces

• Platinum-based combination chemotherapy response in Ovarian Cancer

• 224 Cases (159 Sensitive, 65 Resistant)

• Both Gene Expression and H&E WSIs available

• Evaluation: patient-wise bootstrap ROC

Gene Expression Based

• Gene expression: selected 227 genes

• RBF SVM

Pathology Imaging (MIL-CNN)

• WSI: 80K x 80K (input space), 20X

• Patch extraction: ROI patches of 512x512, top 50 scoring patches per slide

• Patch score: color factor (pink/purple), saturation and value factor, tissue quantity factor, tissue percentage

• Stain normalization: source, reference (Ref), stain normalized

• Formation of bags for MIL: positive bag (1 per “sensitive” patient), negative bag (1 per “resistant” patient)

• CNN pretrained on breast cancer classification

• If a tumor is sensitive, then all of it may not be sensitive

• If a tumor is resistant, then all (or at least most) of it is not sensitive

MIL Based Loss (MIL training for CNN)

\[
\max\left(0,\; 1 - Y_B \max_{i \in B} f(\mathbf{x}_i; \boldsymbol{\theta})\right)
\]

where $B$ is a bag of patches, $Y_B$ is the bag label, and $f(\cdot\,; \boldsymbol{\theta})$ is the CNN.

https://github.com/deroneriksson/python-wsi-preprocessing/blob/master/docs/wsi-preprocessing-in-python/index.md

https://arxiv.org/abs/1803.04054
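
The MIL based loss on this slide can be written out directly for a single bag. This is a minimal sketch in plain Python that takes per-patch scores as given (in the slide they come from the CNN) and assumes the bag label is +1 for a "sensitive" patient and -1 for a "resistant" one; the function name and example scores are hypothetical.

```python
def mil_hinge_loss(patch_scores, bag_label: int) -> float:
    """Hinge loss on the highest-scoring patch of a bag: max(0, 1 - Y_B * max_i f(x_i))."""
    if bag_label not in (1, -1):
        raise ValueError("bag_label must be +1 or -1")
    if not patch_scores:
        raise ValueError("a bag needs at least one patch")
    return max(0.0, 1.0 - bag_label * max(patch_scores))


# Made-up patch scores for two bags.
sensitive_bag = [-0.5, 0.2, 2.0]   # one confident patch is enough for a positive bag
resistant_bag = [-2.0, -1.5, -3.0]  # every patch must score low for a negative bag

print(mil_hinge_loss(sensitive_bag, +1))
print(mil_hinge_loss(resistant_bag, -1))
```

Taking the maximum over patches matches the two statements on the slide: a sensitive tumor needs only some sensitive region, while a resistant tumor should have none scoring high.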


Integration of genetic information changes tile selection

[Figure: tiles selected for cases TCGA-13-1488, TCGA-13-1497, TCGA-29-1705 and TCGA-25-2393, comparing selection in the pathology space only with selection in the heterogeneous feature space]


Thoughts on the future

• Multi-view prediction!

• Understanding similarity in gene space and pathology space

• Linking pathways to pathology

• Understanding causal connections


Biology easily has 500 years of exciting problems to work on.

Knuth.