Jaap Heringa Bioinformatica
Mathematics, Statistics
Computer Science
Informatics
Biology, Molecular biology
Medicine
Chemistry
Physics
Bioinformatics
“The best of many worlds”
Gathering knowledge
• Anatomy, architecture
• Dynamics, mechanics
• Informatics (Cybernetics – Wiener, 1948). Cybernetics has been defined as the science of control in machines and animals.
• Genomics, bioinformatics
Rembrandt, 1632
Newton, 1726
Bioinformatics
We are good at recognising anatomical/dynamical patterns, but not at dealing with informational patterns
2d-3d, crossing street, bumbles, eye dynamics/information
Bioinformatics
“Studying informational processes in biological systems” (Hogeweg, Utrecht; early 1970s)
Applying algorithms with mathematical formalisms in biology (genomics): started in the USA, but now everywhere
Taking care of the computational infrastructure and data management everywhere
“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)
Bioinformatics
• Offers an ever more essential input to
– Molecular Biology
– Pharmacology (drug design)
– Agriculture
– Biotechnology
– Clinical medicine
– Anthropology
– Forensic science
– Chemical industries (detergent industries, etc.)
The Big Bang for Bioinformatics: The Human Genome -- 26 June 2000
Dr Craig Venter, Celera Genomics -- Shotgun method
Sir John Sulston, Human Genome Project
The Human Genome
cctggacctc ctgtgcaaga acatgaaaca nctgtggttc ttccttctcc tggtggcagc 60 tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc 120 tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc 180 caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac 240 acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc 300 nnngtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc 360 caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag 420 ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc 480 caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac 540 cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc 600 cctcccagcc cccatcgaga aaaccatctc caaagccaaa ggacagcccn nnnnnnnnnn 660 nnnnnnnnnn nnnnnnnnnn nnnnngagga gatgaccaag aaccaagtca gcctgacctg 720 cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc 780 ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta 840 cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt 900 gatgcatgag gctctgcaca accgctacac gcagaagagc ctctccctgt ctccgggtaa 960 atgagtgcca tggccggcaa gcccccgctc cccgggctct cggggtcgcg cgaggatgct 1020 tggcacgtac cccgtgtaca tacttcccag gcacccagca tggaaataaa gcacccagcg 1080 ctgccctgg 1089
DNA compositional biases
• Base composition of genomes:
– E. coli: 25% A, 25% C, 25% G, 25% T
– P. falciparum (malaria parasite): 82% A+T
• Translation initiation:
– ATG is the near-universal motif indicating the start of translation in a DNA coding sequence.
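The two biases above can be measured directly on a sequence string. The following is a minimal sketch, assuming plain upper- or lower-case DNA text; the function names `base_composition` and `start_codon_positions` are illustrative, not taken from the lecture.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T in a DNA string (other symbols such as N are ignored)."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def start_codon_positions(seq):
    """Return the 0-based positions of every ATG triplet, in any reading frame."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


comp = base_composition("AACCGGTT")
print(comp)                                 # each base at 0.25, as quoted for E. coli
print(comp["A"] + comp["T"])                # A+T content: 0.5
print(start_codon_positions("CCATGGATGA"))  # [2, 6]
```

An ATG found this way is only a candidate start site: the motif also occurs inside genes and in non-coding DNA.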
Genomics
“DNA makes RNA makes Protein”
Genome contains genes (genetic blueprint)
Genes are expressed into mRNA
mRNA is translated into protein
Proteins perform cellular functions (doers in the cell)
A gene codes for a protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
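The example sequence on this slide can be checked by machine. The sketch below assumes the standard genetic code and that the DNA shown is the coding strand read in frame from its first base; `transcribe` and `translate` are illustrative names.

```python
# Standard genetic code, packed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```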
Human genome -- a few facts
• Human genome contains about 30K genes
• DNA in each cell comprises ~3 × 10^9 base pairs
• Human body contains ~3.5 × 10^12 cells
• DNA between different people varies by only 0.2% or less, so only 2 letters in 1000 are expected to differ. Over the whole genome, this means that about 5-6 million letters would differ between individuals.
• Large part of DNA not expressed (“junk/nonsense DNA”)
• Eukaryotes: expressed DNA stretches are called exons, which are interrupted by introns
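The figure of 5-6 million differing letters follows from the two numbers quoted in this list. A small worked check, using integer arithmetic and the illustrative name `expected_differences`:

```python
def expected_differences(genome_size_bp, differing_per_thousand=2):
    """Number of positions expected to differ between two individuals."""
    return genome_size_bp * differing_per_thousand // 1000


# ~3 x 10^9 base pairs, at most 2 differing letters per 1000 (0.2%)
print(expected_differences(3_000_000_000))  # 6000000
```

Since 0.2% is given as an upper bound, 6 million is the top of the quoted range.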
Humans have spliced genes…
DNA makes RNA makes Protein
Some further facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
• Some exons can be as small as 1 or 3 bp.
• HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising <2% of a 67,000 bp gene
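The "<2%" figure for HUMFMR1S can be reproduced from the numbers in the bullet. This is a short arithmetic sketch; `exon_fraction` is an illustrative helper, and all exons are assumed to sit at one end of the quoted 40-60 bp range.

```python
def exon_fraction(n_exons, exon_len_bp, gene_len_bp):
    """Fraction of a gene's length that is covered by exons of equal size."""
    return n_exons * exon_len_bp / gene_len_bp


low = exon_fraction(17, 40, 67_000)   # all exons at the short end
high = exon_fraction(17, 60, 67_000)  # all exons at the long end
print(f"{low:.1%} to {high:.1%}")     # 1.0% to 1.5%
```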
Genomic Data Sources
• DNA/protein sequence data (more than 80 genomes)
• Expression (microarray) data
• Proteome (X-ray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)
• Protein interaction data
Integrative Bioinformatics
Structural/Functional Genomics
Genetic diseases
• Many diseases run in families and result from genes that predispose family members to these illnesses.
• Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, and heart diseases.
• Some of these diseases can be caused by a problem within a single gene, as with CF.
Genetic diseases (Cont.)
• For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible.
• A “problem” within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or prevents the body from fighting the disease sufficiently).
• Persons with different combinations of these nucleotides could then be unaffected by these diseases.
Genetic diseases (Cont.)
Cystic Fibrosis
• Known since very early on (“Celtic gene”)
• Inherited autosomal recessive condition (Chr. 7)
• Symptoms:
– Clogging and infection of lungs (early death)
– Intestinal obstruction
– Reduced fertility and (male) anatomical anomalies
• CF gene CFTR has a 3-bp deletion leading to Del508 (Phe) in the 1480 aa protein (epithelial Cl- channel); the protein is degraded in the ER instead of being inserted into the cell membrane
DNA makes RNA makes Protein: Expression data
• More copies of mRNA for a gene lead to more protein
• mRNA can now be measured for all the genes in a cell at once through microarray technology
• Can have 60,000 spots (genes) on a single gene chip
• Colour change gives intensity of gene expression (over- or under-expression)
cDNA microarrays
Compare the genetic expression in two samples of cells
PRINT: cDNA from one gene on each spot (cDNA clones, robotic printing)
SAMPLES: cDNA labelled red/green with fluorescent dyes, e.g. treatment / control, normal / tumor tissue
HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: Laser and detector. Detector measures ratio of induced fluorescence of two samples
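The fluorescence ratio measured per spot is commonly analysed on a log2 scale, so that over- and under-expression are symmetric around zero. The sketch below assumes red labels the treatment sample and green the control, and uses a two-fold change as the cut-off; the function names, the threshold and the intensity values are all hypothetical.

```python
from math import log2


def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return log2(red / green)


def call_expression(red, green, threshold=1.0):
    """Label a spot using a symmetric log2 threshold (1.0 means a two-fold change)."""
    m = log2_ratio(red, green)
    if m >= threshold:
        return "over-expressed"
    if m <= -threshold:
        return "under-expressed"
    return "unchanged"


# Hypothetical spot intensities: (red, green)
spots = {"geneA": (800.0, 200.0), "geneB": (150.0, 160.0), "geneC": (100.0, 400.0)}
for name, (red, green) in spots.items():
    print(name, round(log2_ratio(red, green), 2), call_expression(red, green))
```

Real pipelines also correct for background and dye bias before taking ratios; those steps are left out here.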
Metabolic networks
Glycolysis and Gluconeogenesis
KEGG database (Japan)
Data explode, for example:
Protein Data Bank (PDB): 14,500 protein 3D structures
10,900 X-ray crystallography, 1,810 NMR, 278 theoretical models, others...
Dickerson’s formula: equivalent to Moore’s law
On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!
n = e^(0.19 (y − 1960)), with y the year.
Not only data explode: computations can explode as well
• Many problems are NP-complete (NP: nondeterministic polynomial time): all known exact algorithms need computer time that grows exponentially with data size
• We often need to reformulate the problem to make it tractable
• Or use heuristics (clever rules of thumb) to reduce computations
Bioinformatics grand challenges
• Understanding (multi)cellular functioning in terms of genomic data:
• Protein folding problem (IBM)
• Complex diseases (cancer, heart disease)
• Integrating genomic data
• Predicting functions and interactions of all proteins
Protein folding problem
MTSPQAVLFKTGGVLRKAID (sequence) → fold [figure: folded chain with N- and C-termini and active/binding site]
With only 2 angles per amino acid: a protein of 100 amino acids has 2^100 possible folds!
Best bet is homology modelling
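To give a feel for the size of 2^100, the sketch below counts the conformations and estimates how long a brute-force enumeration would take. The rate of 10^12 conformations per second is a purely hypothetical assumption, and `conformations` is an illustrative name.

```python
def conformations(n_residues, states_per_residue=2):
    """Number of chain conformations if every residue independently takes one of a few states."""
    return states_per_residue ** n_residues


n = conformations(100)
print(n)           # 1267650600228229401496703205376
print(f"{n:.2e}")  # 1.27e+30

# Hypothetical enumeration rate: 10^12 conformations per second
seconds_per_year = 365.25 * 24 * 3600
years = n / 1e12 / seconds_per_year
print(f"{years:.1e} years")  # 4.0e+10 years
```

This is why exhaustive search is not an option and why the slide points to homology modelling instead.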
Protein folding problem
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)
The DEATH Domain
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…
Bioinformatics tool
[Figure: Data, Algorithm (tool), Biological Interpretation (model)]
Tool components:
• Metric, objective function (model containing biology)
• Search function
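Bioinformatics tool
• Scoring function (‘biology’, most important)
– Metric, objective function, model
• Search method
– Optimisation: DP (dynamic programming), GA (genetic algorithm), HMM (hidden Markov model), MC (Monte Carlo), Simulated Annealing, MCMC (Markov chain Monte Carlo), SVM (support vector machine)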
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky)
“Nothing in bioinformatics makes sense except in the light of Biology”
Bioinformatics
Pattern recognition
Some are easy to describe, others not
• Visual patterns (colour in RGB mode)
• Audio patterns (musical scores)
• Knitting patterns
• Taste: cooking recipes
• Smell:
Biological patterns are often not easy to recognise
Multivariate statistics – Cluster analysis
[Figure: raw table (rows 1-5, columns C1, C2, C3, C4, C5, C6, ..) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → phylogenetic tree]
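The pipeline in the figure (raw table, pairwise scores, clustering) can be sketched in a few lines. The slide names neither the similarity criterion nor the cluster criterion, so two simple stand-ins are assumed here: Euclidean distance between table rows, and single linkage, which repeatedly merges the two clusters whose closest members are nearest. The data values are hypothetical.

```python
from itertools import combinations
from math import dist


def distance_matrix(rows):
    """Pairwise Euclidean distances between rows of the raw table (n x n matrix)."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def single_linkage(rows):
    """Agglomerative clustering; returns the merge order, which defines the tree."""
    d = distance_matrix(rows)
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        # Cluster criterion: smallest distance between any two members
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(d[i][j] for i in pair[0] for j in pair[1]),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Hypothetical raw table: 5 objects (rows) measured on 2 variables (columns)
table = [(0, 0), (0, 1), (5, 5), (5, 7), (20, 20)]
for a, b in single_linkage(table):
    print("merge", a, "+", b)
```

Swapping in another distance or linkage rule changes the resulting tree, which is why both criteria appear as explicit choices in the figure.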
Example: Divergent evolution
Ancestral sequence: T D W V I K
Descendant 1: T D W V T A L K (I→L mutation and insertion)
Descendant 2: T D W L I K (V→L mutation)
Pair-wise alignment (sequence alignment):
T D W V T A L K
T D W L - - I K
How to do it?
Pair-wise alignment
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
C(2n, n) = (2n)! / (n!)^2 ≈ 2^(2n) / √(πn)
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
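The binomial count and its approximation can be evaluated directly, since Python integers have arbitrary precision. This is a minimal sketch; the function names are illustrative.

```python
from math import comb, log10, pi


def alignments_exact(n):
    """Central binomial coefficient C(2n, n) = (2n)! / (n!)^2."""
    return comb(2 * n, n)


def log10_alignments_approx(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * log10(2) - 0.5 * log10(pi * n)


for n in (300, 1000):
    exact = log10(alignments_exact(n))
    approx = log10_alignments_approx(n)
    print(n, round(exact, 1), round(approx, 1))
# 300 179.1 179.1
# 1000 600.3 600.3
```

The printed values are base-10 exponents, so two sequences of 1000 residues give on the order of 10^600 alignments under this count.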
T D W V T A L K
T D W L - - I K
Solution: Pair-wise sequence alignment (more than just string matching – guaranteed optimal alignment)
Global dynamic programming
Input sequences:
MDAGSTVILCFVG
MDAASTILCGS
Parameters (evolution): amino acid exchange matrix (20×20), gap penalties (open, extension)
Search matrix
Alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
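Global dynamic programming fills the search matrix cell by cell and then traces back from the last cell, which avoids enumerating alignments one by one. The sketch below is a simplified version under stated assumptions: it scores identity versus mismatch in place of a 20×20 amino acid exchange matrix, and it uses one linear gap penalty in place of separate open and extension penalties. The parameter values are hypothetical, so the alignment it returns need not match the one on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""

    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS")
print(best)
print(row1)
print(row2)
```

The running time is proportional to the product of the two sequence lengths.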
Integrative Bioinformatics Institute VU (IBIVU)
• Integrating data sources, integrating methods
• Integrating data through methods
• Making new tools to analyse the genomic data (integrative data mining) and predict cellular and molecular features, including for example:
– Structure, function and interaction of proteins
– Signalling and metabolic networks
– Complex diseases
Bioinformatics @ VU
• New genomics data is being collected (pharmacogenomics, VUMC microarray)
• Strong biology groups (neural biology, metabolome, metabolic control)
• Great computational groups (HTC, Visualisation, IC Video wall, Machine learning, Computational intelligence)
• Very good mathematical groups (Statistics, Stochastics)
• You!
Bioinformatics @ VU
• Combine many areas such as mathematics (statistics), computer science (machine learning, high-throughput computing), molecular biology, medicine, etc.
• Analyse and predict molecular features
• Make advanced methods and websites
• Do you dare?
Bioinformatics teaching @ VU
• “Medische Natuurwetenschappen (MNW)” (Medical Natural Sciences), 2nd year: Introduction to Bioinformatics
Bioinformatics teaching @ VU
• New 2-Year Masters Course: mixture of courses and practical projects
• Developing a diverse set of courses
• Diverse palette of 3/6/9/12-month projects
• Student gets a mentor for flexible guidance