Bioinformatics and Connection to Computational Pathology
Fayyaz Minhas
Department of Computer Science
University of Warwick
Why Bioinformatics?
• How do we know that humans and chimpanzees share more than 95% of their DNA?
• Human Genome Project
How to compare?
Why Bioinformatics?
• The knapsack problem
• Uses dynamic programming
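As a rough illustration of the dynamic programming idea named on this slide, here is a minimal sketch of the 0/1 knapsack problem. The function name `knapsack` and the example items are assumptions made for illustration; they do not come from the slides.

```python
def knapsack(items, capacity):
    """Best total value for a 0/1 knapsack.

    items: list of (weight, value) pairs with non-negative integer weights.
    capacity: maximum total weight (non-negative integer).
    """
    # best[c] = best value achievable with total weight at most c
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Go through capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# Hypothetical items: (weight, value)
items = [(1, 1), (3, 4), (4, 5), (5, 7)]
print(knapsack(items, 7))
```

The table `best` is filled from smaller subproblems to larger ones, which is the same pattern used by dynamic programming methods for comparing sequences.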
Why Bioinformatics?
• Tree of life
Why Bioinformatics?
• How are humans across the Earth related to each other?
• The Genographic Project
Why Bioinformatics?
• How can we screen for disease?
Why Bioinformatics?
• Personalized medicine
Why Bioinformatics?
• How can we fight against diseases like cancer?
Handling viruses
• In Silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction with Cotton leaf curl Multan betasatellite encoded βC1
• βC1, pathogenicity determinant encoded by Cotton leaf curl Multan betasatellite interacts with calmodulin-like protein 11 (CML11) in Gossypium hirsutum
Kamal, Hira, Fayyaz ul Amir Afsar Minhas, Hanu Pappu, Imran Amin et al. “In Silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction with Cotton leaf curl Multan betasatellite encoded βC1.” Frontiers in Plant Science 10 (2019): 656.
Kamal, Hira, Fayyaz Minhas, et al. “Bioinformatics and molecular analysis of Gossypium hirsutum calmodulin-like protein (CML11) interaction with begomovirus-transcription activator protein C2.” PLoS ONE (in press).
Why Bioinformatics?
• How can we find out what the effects of a certain disease are?
Why Bioinformatics?
• How can we design new life?
https://www.ted.com/talks/craig_venter_unveils_synthetic_life
Molecular Biology Fundamentals
Genome: The ‘Program’
• Genome is the genetic material of an organism
• Deoxyribonucleic acid (DNA)
• Encodes these genetic instructions
How is the program stored?
Watson & Crick with DNA model
Rosalind Franklin with X-ray image of DNA
Program size: DNA base pairs (bp)
Organism                         # of base pairs     # of chromosomes
Virus
  HIV                            9193                1
  SARS                           29751               1
  Porcine circovirus             1759                1
Prokaryotic
  Haemophilus influenzae         1.8 x 10^6          1
  Escherichia coli (bacterium)   4.6 x 10^6          1
  Carsonella ruddii              159,662 (0.16M)     1
Eukaryotic
  S. cerevisiae (yeast)          1.35 x 10^7         17
  Drosophila melanogaster (fly)  1.65 x 10^8         4
  Homo sapiens (human)           2.9 x 10^9          23
  Paris japonica                 150 x 10^9          -
http://www.nature.com/news/2006/061009/full/news061009-10.html
http://en.wikipedia.org/wiki/Genome
Execution of the program: Central dogma of molecular biology
• DNA → pre-mRNA (RNA polymerase): bases (A,T,G,C) to (A,U,G,C)
• pre-mRNA → mRNA (splicing)
• mRNA → Protein (ribosome): RNA bases (A,U,G,C) to amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)
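The two mappings on this slide, DNA to mRNA and mRNA to protein, can be sketched in a few lines. This is a simplified illustration under stated assumptions: it reads the coding strand and replaces T with U, it ignores splicing, it starts translating at the first base, and it uses the standard genetic code. The names `transcribe`, `translate`, and `CODON_TABLE` are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA (A,T,G,C) to mRNA (A,U,G,C). Splicing is ignored."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")  # toy sequence, not a real gene
print(mrna, translate(mrna))
```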
Proteins
Nonsense Mutation
• A point mutation in a sequence of DNA that results in a premature stop codon
• Protein product is incomplete or non-functional
• Beta-thalassemia
• Results from a single point mutation
• HBB gene on chromosome 11
• Reduction in production of hemoglobin
• HBB blockage over time leads to decreased beta-chain synthesis
• Having a single gene for thalassemia may protect against malaria
• One of the most commonly inherited disorders in Pakistan
• With a prevalence rate of 6% in the Pakistani population
• 5000-9000 children every year
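To make the idea of a premature stop codon concrete, the sketch below applies a single point mutation to a toy coding sequence and reports where the first in-frame stop codon falls before and after. The sequence is invented for illustration and is not the HBB gene; `first_stop_codon` and `point_mutation` are hypothetical helper names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(dna):
    """Index (counted in codons) of the first in-frame stop codon, or None."""
    for i in range(0, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def point_mutation(dna, position, base):
    """Return a copy of dna with the base at `position` (0-based) replaced."""
    return dna[:position] + base + dna[position + 1:]


normal = "ATGGTGCAGTGGTAA"               # toy sequence: ATG GTG CAG TGG TAA
mutant = point_mutation(normal, 6, "T")  # CAG becomes TAG, a stop codon

print(first_stop_codon(normal))  # stop at the natural end of the sequence
print(first_stop_codon(mutant))  # stop appears earlier: truncated protein
```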
What is going on in your body?
What can the cell do?
What is it doing?
How is it doing it?
Sequencing Technologies & Algorithms
Human Genome Project
• Started in 1990
• Objective: Sequence the human genome by 2005
• Achieved: 2000
• Government consortium
• Cost: $3 Billion
• Craig Venter’s Celera / Solexa
• $1000 genome project
• 1000 genomes project
Sequencing Technologies
• Sanger Sequencing
• 454 Sequencing / Roche
• GS Junior System
• GS FLX+ System
• Illumina (Solexa)
• HiSeq System
• Genome analyzer IIx
• MiSeq
• Applied Biosystems - Life Technologies
• SOLiD 5500 System
• SOLiD 5500xl System
• Ion Torrent - Life Technologies
• Personal Genome Machine (PGM)
• Proton
• Helicos
• Helicos Genetic Analysis System
• Pacific Biosciences
• PacBio RS
• Oxford Nanopore Technologies
• GridION System
• MinION
First Generation
2nd Generation (Next Generation Sequencing, NGS) (Deep Sequencing) (High-throughput sequencing)
• Amplified Single Molecule Sequencing
• Most widely used right now
3rd Generation (Next Next Generation Sequencing)
• Single molecule sequencing
HiSeq 2000
Steps in Sequencing
• DNA Extraction
• Preprocessing (Amplification, …)
• Sequencing
• Shotgun sequencing
• Reads
• Assembly
• Data analysis
Shotgun Sequencing: The case of exploding newspapers
Joining overlapping reads
Completing the overlap puzzle
Sequencing and newspaper explosions: 1
• Take (millions of) copies of the DNA you want to sequence
Sequencing and newspaper explosions: 2
• Fragment the DNA into smaller pieces
• Because our sequencing technologies can only read very short fragments reliably
Sequencing and newspaper explosions: 3
• The short fragments resulting from DNA fragmentation are called reads
• Some reads disappear
Sequencing and newspaper explosions: 3 (continued)
• We get the reads but we have no idea where they came from in the DNA
• No position information
• Need to reconstruct the DNA sequence
Sequencing and newspaper explosions: 4
• Solve it as an overlap puzzle
Sequencing and newspaper explosions: 5
• Reconcile the pieces
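The overlap puzzle can be sketched as a greedy merge: repeatedly join the two reads with the longest suffix-prefix overlap. This is a toy illustration under strong assumptions (error-free reads, no repeats, all reads from the same strand), not how production assemblers work. The function names and the example reads are invented for this sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining pieces do not overlap: several contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Toy reads taken from the made-up sequence ATGGCGTGCAATCC, in shuffled order
reads = ["GCAATCC", "ATGGCGT", "GCGTGCAA"]
print(greedy_assemble(reads))
```

If some reads are missing, the loop stops with more than one piece, which mirrors the "some reads disappear" case in the newspaper analogy.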
Sequencing as a computational problem
Comparison of de novo assemblers
• Zhang, Wenyu, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, and Bairong Shen. “A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies.” PLoS ONE 6, no. 3 (March 14, 2011). doi:10.1371/journal.pone.0017915.
Sequencing: Costs & Amount
http://sulab.org/2013/06/sequenced-genomes-per-year/
Applications of Sequencing
How sequencing machines work
• Input DNA/RNA Sample
• Output
• FASTQ Files: Reads stored
• Also have quality information
• Phred quality scoring
• Different machines use different formats on quality
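A minimal sketch of how the quality line of a FASTQ record can be decoded. It assumes the common Phred+33 encoding, where each character's ASCII code minus 33 is the quality score Q and the error probability is 10^(-Q/10). Since encodings differ between machines, the offset is a parameter. The record shown is made up.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, quality string)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


# Hypothetical record
record = ["@read1", "GATTACA", "+", "IIII5!!"]
name, sequence, quality = parse_fastq_record(record)
for base, q in zip(sequence, phred_scores(quality)):
    print(base, q, error_probability(q))
```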
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Genome Assembly
• Based on reads, assemble them into a genome
• Whole Genome Sequencing
• Input: FASTQ file
• Output: Genome
• Whole Exome or Targeted Sequencing
• Input: FASTQ file of reads, Reference Sequence
• Output: Genome
RNA-Seq: What are you doing?
• Input: Reference Genome, RNA reads
• Output: Alignment File (SAM or BAM)
• Tells where each read is aligned
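As a sketch of what "where each read is aligned" looks like in practice, the snippet below reads the first six tab-separated fields of a SAM alignment line (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR). The `Alignment` type, the function name, and the example lines are assumptions for illustration. Real pipelines would normally use a dedicated library rather than hand parsing, and BAM files are binary, so they cannot be read this way.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    read_name: str
    reference: Optional[str]  # e.g. chromosome name; None if unmapped
    position: Optional[int]   # 1-based leftmost position; None if unmapped
    mapq: int                 # mapping quality
    cigar: str                # alignment description, e.g. "7M"


def parse_sam_line(line):
    """Parse one alignment line of a SAM file (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    if int(flag) & 0x4:  # bit 0x4 set: the read is unmapped
        return Alignment(qname, None, None, int(mapq), cigar)
    return Alignment(qname, rname, int(pos), int(mapq), cigar)


# Hypothetical SAM lines
mapped = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
unmapped = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGATTACA\tIIIIIII"
print(parse_sam_line(mapped))
print(parse_sam_line(unmapped))
```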
RNA-Seq
Differential expression
http://www.fejes.ca/labels/figures.html
Gene Expression meets Pathology
https://www.nejm.org/doi/full/10.1056/NEJMoa021967
Knowledge Transfer between spaces
• Platinum-based combination chemotherapy response in Ovarian Cancer
• 224 Cases (159 Sensitive, 65 Resistant)
• Both Gene Expression and H&E WSIs available
Patient-wise bootstrap ROC
Gene Expression: selected 227 genes
https://github.com/deroneriksson/python-wsi-preprocessing/blob/master/docs/wsi-preprocessing-in-python/index.md
RBF SVM
(MIL-CNN)
Gene Expression Based
https://arxiv.org/abs/1803.04054
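One way to read "patient-wise bootstrap ROC" is to resample patients with replacement and recompute the area under the ROC curve each time. The sketch below follows that reading; it is an assumption about the procedure, not the authors' code. It takes one label and one prediction score per patient, and the patient IDs and scores are invented.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined with only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc(patients, n_boot=200, seed=0):
    """patients: dict mapping patient id -> (label, score). Returns AUC list."""
    rng = random.Random(seed)
    ids = list(patients)
    aucs = []
    for _ in range(n_boot):
        # Resample whole patients, so all data of a patient stays together.
        sample = [patients[rng.choice(ids)] for _ in ids]
        value = auc([y for y, _ in sample], [s for _, s in sample])
        if value is not None:
            aucs.append(value)
    return aucs


# Hypothetical patients: 1 = sensitive, 0 = resistant
patients = {
    "P1": (1, 0.9), "P2": (1, 0.7), "P3": (1, 0.4),
    "P4": (0, 0.6), "P5": (0, 0.3), "P6": (0, 0.2),
}
aucs = bootstrap_auc(patients)
print(sum(aucs) / len(aucs))
```

If a patient has several slides, their scores would first need to be combined into one score per patient before resampling.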
Stages: Stain Normalization, Patch Extraction, Formation of bags for MIL
• WSI: 80K x 80K (Input Space), 20X
• ROI Patches: 512 x 512
• Top 50 scoring patches per slide
• Patch score factors: color factor (pink/purple), saturation and value factor, tissue quantity factor, tissue percentage
• Stain normalization figure labels: Ref, Source, Stain Normalized
• Positive bag (1 per “sensitive” patient)
• Negative bag (1 per “resistant” patient)
Pretrained on breast cancer classification
If a tumor is sensitive, then all of it may not be sensitive
If a tumor is resistant, then all (or at least most) of it is not sensitive
MIL Based Loss
$$\max\left(0,\; 1 - Y_B \max_{i \in B} f(\mathbf{x}_i; \boldsymbol{\theta})\right)$$
(f: the CNN with parameters θ)
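The bag-level hinge loss on this slide can be sketched directly: take the highest patch score in a bag and apply a hinge against the bag label. This sketch assumes bag labels of +1 (sensitive) and -1 (resistant) and takes already computed patch scores as plain numbers. In training, the scores would come from the CNN and the loss would be differentiated by a deep learning framework. The function name and the example scores are invented.

```python
def mil_hinge_loss(patch_scores, bag_label):
    """Hinge loss on the top-scoring patch of one bag.

    patch_scores: CNN outputs f(x_i; theta), one per patch in the bag.
    bag_label: +1 for a positive bag, -1 for a negative bag.
    """
    if bag_label not in (1, -1):
        raise ValueError("bag_label must be +1 or -1")
    return max(0.0, 1.0 - bag_label * max(patch_scores))


scores = [-0.5, 0.2, 2.0]  # hypothetical patch scores for one slide
print(mil_hinge_loss(scores, +1))  # one confident patch is enough
print(mil_hinge_loss(scores, -1))  # the same patch is penalized
```

Using the maximum matches the two statements above the loss: a positive bag needs only one high-scoring patch, while a negative bag is penalized if any patch scores high.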
MIL Training for CNN
Gene Expression
Pathology Imaging
Integration of genetic information changes tile selection
Figure: selected tiles for slides TCGA-13-1488, TCGA-13-1497, TCGA-29-1705, and TCGA-25-2393, each shown for two settings: Pathology space only and Heterogeneous feature space
Thoughts on the future
• Multi-view prediction!
• Understanding similarity in gene space and pathology space
• Linking pathways to pathology
• Understanding causal connections
Biology easily has 500 years of exciting problems to work on.
Knuth.