
Jaap Heringa Bioinformatica


Bioinformatics: “The best of many worlds”

Contributing disciplines: Mathematics, Statistics, Computer Science, Informatics, Biology, Molecular biology, Medicine, Chemistry, Physics

Gathering knowledge

• Anatomy, architecture

• Dynamics, mechanics

• Informatics (Cybernetics – Wiener, 1948). Cybernetics has been defined as the science of control in machines and animals.

• Genomics, bioinformatics

Rembrandt, 1632

Newton, 1726

Bioinformatics

We are good at recognising anatomical/dynamical patterns, but not at dealing with informational patterns

2d-3d, crossing street, bumbles, eye dynamics/information

Bioinformatics

“Studying informational processes in biological systems” (Hogeweg, Utrecht; early 1970s)

• Applying algorithms with mathematical formalisms in biology (genomics): started in the USA, but now done everywhere

• Taking care of the computational infrastructure and data management everywhere

“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

Bioinformatics
• Offers an ever more essential input to
  – Molecular Biology
  – Pharmacology (drug design)
  – Agriculture
  – Biotechnology
  – Clinical medicine
  – Anthropology
  – Forensic science
  – Chemical industries (detergent industries, etc.)

The Big Bang for Bioinformatics: The Human Genome -- 26 June 2000

Dr Craig Venter, Celera Genomics -- Shotgun method

Sir John Sulston, Human Genome Project


The Human Genome

cctggacctc ctgtgcaaga acatgaaaca nctgtggttc ttccttctcc tggtggcagc 60
tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc 120
tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc 180
caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac 240
acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc 300
nnngtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc 360
caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag 420
ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc 480
caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac 540
cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc 600
cctcccagcc cccatcgaga aaaccatctc caaagccaaa ggacagcccn nnnnnnnnnn 660
nnnnnnnnnn nnnnnnnnnn nnnnngagga gatgaccaag aaccaagtca gcctgacctg 720
cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc 780
ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta 840
cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt 900
gatgcatgag gctctgcaca accgctacac gcagaagagc ctctccctgt ctccgggtaa 960
atgagtgcca tggccggcaa gcccccgctc cccgggctct cggggtcgcg cgaggatgct 1020
tggcacgtac cccgtgtaca tacttcccag gcacccagca tggaaataaa gcacccagcg 1080
ctgccctgg 1089

DNA compositional biases

• Base composition of genomes:
  – E. coli: 25% A, 25% C, 25% G, 25% T
  – P. falciparum (malaria parasite): 82% A+T

• Translation initiation:
  – ATG is the near-universal motif indicating the start of translation in a DNA coding sequence.
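The composition figures and the ATG motif above can be checked on any sequence with a few lines of code. The following is a minimal sketch; the function names `base_composition`, `at_content` and `find_start_codons` are illustrative choices, not part of the lecture, and ambiguous bases such as `n` are simply ignored when computing fractions.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def at_content(seq):
    """Return the combined A+T fraction (high for P. falciparum)."""
    comp = base_composition(seq)
    return comp["A"] + comp["T"]


def find_start_codons(seq):
    """Return the 0-based positions of every ATG in the given strand."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


if __name__ == "__main__":
    fragment = "cctggacctcctgtgcaagaacatgaaaca"  # first 30 bases of the sequence above
    print(base_composition(fragment))
    print(at_content(fragment))
    print(find_start_codons(fragment))
```

An ATG found this way is only a candidate start site: the motif also occurs inside coding regions and in non-coding DNA, so real gene finders combine it with further evidence.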

Genomics

“DNA makes RNA makes Protein”

Genome contains genes (genetic blueprint)

Genes are expressed into mRNA

mRNA is translated into protein

Proteins perform cellular functions (doers in the cell)

A gene codes for a protein

DNA: CCTGAGCCAACTATTGATGAA

↓ transcription

mRNA: CCUGAGCCAACUAUUGAUGAA

↓ translation

Protein: PEPTIDE
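The DNA, mRNA and protein strings shown above can be reproduced with a small sketch of transcription and translation. It assumes the given DNA is the coding strand (so transcription is just T→U), uses the standard genetic code, and reads in frame from the first base; the names `CODON_TABLE`, `transcribe` and `translate` are illustrative.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 0, stopping at the first stop codon."""
    coding = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[pos:pos + 3]]  # KeyError on ambiguous bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    mrna = transcribe(dna)
    print(mrna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(mrna))  # PEPTIDE
```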

Human genome -- a few facts
• The human genome contains about 30K genes
• DNA in each cell comprises ~3 × 10^9 base pairs
• The human body contains ~3.5 × 10^12 cells
• DNA between different people varies by only 0.2% or less, so only 2 letters in 1000 are expected to differ. Over the whole genome, this means that about 5-6 million letters would differ between individuals.
• A large part of the DNA is not expressed (“junk/nonsense DNA”)
• Eukaryotes: expressed DNA stretches are called exons, which are interrupted by introns

Humans have spliced genes…


DNA makes RNA makes Protein

Some further facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
• Some exons can be as small as 1 or 3 bp
• HUMFMR1S is not atypical: 17 exons, each 40-60 bp long, comprising <2% of a 67,000 bp gene
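The “<2%” figure for HUMFMR1S follows directly from the numbers given. As a quick check, the sketch below computes the exon fraction for the shortest and longest exon lengths quoted; the helper name `exon_fraction` is an illustrative choice.

```python
def exon_fraction(n_exons, exon_length_bp, gene_length_bp):
    """Fraction of a gene covered by n_exons exons of the given length."""
    return n_exons * exon_length_bp / gene_length_bp


# HUMFMR1S: 17 exons of 40-60 bp in a 67,000 bp gene.
low = exon_fraction(17, 40, 67_000)
high = exon_fraction(17, 60, 67_000)
print(f"HUMFMR1S exons cover {low:.1%} to {high:.1%} of the gene")
```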

Genomic Data Sources
• DNA/protein sequence data (more than 80 genomes)
• Expression (microarray) data
• Proteome (X-ray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)
• Protein interaction data

Integrative Bioinformatics

Structural/Functional Genomics

Genetic diseases
• Many diseases run in families and are a result of genes which predispose such family members to these illnesses
• Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases.
• Some of these diseases can be caused by a problem within a single gene, such as with CF.

Genetic diseases (Cont.)
• For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible.
• A “problem” within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or prevents the body from fighting the disease sufficiently).
• Persons with different combinations of these nucleotides could then be unaffected by these diseases.

Genetic diseases (Cont.): Cystic Fibrosis
• Known since very early on (“Celtic gene”)
• Inherited autosomal recessive condition (Chr. 7)
• Symptoms:
  – Clogging and infection of lungs (early death)
  – Intestinal obstruction
  – Reduced fertility and (male) anatomical anomalies
• The CF gene CFTR has a 3-bp deletion leading to Del508 (loss of Phe at position 508) in the 1480 aa protein (an epithelial Cl- channel); the protein is degraded in the ER instead of being inserted into the cell membrane


DNA makes RNA makes Protein: Expression data

• More copies of mRNA for a gene lead to more protein

• mRNA can now be measured for all the genes in a cell at once through microarray technology

• Can have 60,000 spots (genes) on a single gene chip

• Colour change gives intensity of gene expression (over- or under-expression)

cDNA microarrays

Compare the genetic expression in two samples of cells

• PRINT: cDNA clones; cDNA from one gene on each spot (robotic printing)

• SAMPLES: cDNA labelled red/green with fluorescent dyes, e.g. treatment / control or normal / tumor tissue

• HYBRIDIZE: add equal amounts of labelled cDNA samples to the microarray

• SCAN: laser and detector; the detector measures the ratio of induced fluorescence of the two samples
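The ratio measured by the detector is usually analysed on a log2 scale, so that two-fold over- and under-expression become +1 and -1. The sketch below illustrates this under stated assumptions: red is taken to be the treatment channel and green the control, a pseudocount of 1 guards against division by zero, and a two-fold threshold decides the call. None of these choices come from the lecture, and the function names are illustrative.

```python
import math


def log2_ratio(red, green, pseudocount=1.0):
    """log2 of the red/green intensity ratio for one spot.

    The pseudocount is an assumption of this sketch; it avoids log(0)
    and division by zero for empty spots.
    """
    return math.log2((red + pseudocount) / (green + pseudocount))


def classify(red, green, threshold=1.0):
    """Call a spot over-, under-expressed or unchanged (two-fold by default)."""
    ratio = log2_ratio(red, green)
    if ratio >= threshold:
        return "over-expressed"
    if ratio <= -threshold:
        return "under-expressed"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical (red, green) intensities for three spots.
    spots = {"gene_a": (799, 199), "gene_b": (199, 799), "gene_c": (500, 500)}
    for name, (red, green) in spots.items():
        print(name, round(log2_ratio(red, green), 2), classify(red, green))
```

Real analyses also normalise the two channels against each other before comparing spots; that step is left out here.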

Metabolic networks

Glycolysis and Gluconeogenesis

KEGG database (Japan)


Data explode, for example:
Protein Data Bank (PDB): 14,500 protein 3D structures (10,900 X-ray crystallography, 1,810 NMR, 278 theoretical models, others...)

Dickerson’s formula: equivalent to Moore’s law

On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!

n = e^(0.19 (y − 1960)), with y the year.

Not only data explode: computations can explode as well
• Many problems are NP-complete (NP stands for nondeterministic polynomial time): no polynomial-time algorithm is known for them, and the computer time of known exact methods grows exponentially with data size

• We often need to reformulate the problem to make it tractable

• Or use heuristics (clever rules of thumb) to reduce computations

Bioinformatics grand challenges

• Understanding (multi)cellular functioning in terms of genomic data:

• Protein folding problem (IBM)
• Complex diseases (cancer, heart disease)
• Integrating genomic data
• Predicting functions and interactions of all proteins

Protein folding problem

Sequence (N- to C-terminus): MTSPQAVLFKTGGVLRKAID

With only 2 angles per amino acid, a protein of 100 amino acids has 2^100 possible folds!

Figure labels: sequence → fold; active/binding site

Best bet is homology modelling

Protein folding problem: protein structure hierarchical levels

• PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

• SECONDARY STRUCTURE (helices, strands)

• TERTIARY STRUCTURE (fold)

• QUATERNARY STRUCTURE (oligomers)


The DEATH Domain
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.

http://www.mshri.on.ca/pawson

Pyruvate kinase (phosphotransferase)
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Bioinformatics tool

Data → Algorithm (tool) → Biological interpretation (model)

Tool components:
• Metric, objective function (model containing biology)
• Search function

Bioinformatics tool
• Scoring function (‘biology’, most important)
  – Metric, objective function, model
• Search method
  – Optimisation: DP, GA, HMM, MC, Simulated Annealing, MCMC, SVM

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky)

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics: pattern recognition

Some are easy to describe, others not

• Visual patterns (colour in RGB mode)
• Audio patterns (musical scores)
• Knitting patterns
• Taste: cooking recipes
• Smell:

Biological patterns are often not easy to recognise


Multivariate statistics – Cluster analysis

Raw table (objects 1-5 × variables C1, C2, C3, C4, C5, C6, ..) → similarity criterion → scores in a 5×5 similarity matrix → cluster criterion → phylogenetic tree
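The cluster-analysis pipeline (raw table, pairwise scores, cluster criterion, tree) can be sketched in a few lines. The choices below are assumptions made for illustration, not the lecture's method: Euclidean distance stands in for the similarity criterion (so smaller means more similar), single linkage is the cluster criterion, and the tree is returned as nested tuples of row indices.

```python
import math
from itertools import combinations


def distance_matrix(table):
    """Pairwise Euclidean distances between the rows of a raw data table."""
    n = len(table)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(table[i], table[j])
    return dist


def single_linkage(dist):
    """Agglomerative clustering; returns the tree as nested tuples."""
    # Each cluster is a pair (subtree, list of member row indices).
    clusters = [(i, [i]) for i in range(len(dist))]
    while len(clusters) > 1:
        # Cluster criterion: distance between the two closest members.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist[p][q]
                for p in clusters[pair[0]][1]
                for q in clusters[pair[1]][1]
            ),
        )
        merged = (
            (clusters[a][0], clusters[b][0]),
            clusters[a][1] + clusters[b][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    # Hypothetical raw table: 5 objects (rows) measured on 2 variables.
    table = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
    print(single_linkage(distance_matrix(table)))  # (4, ((0, 1), (2, 3)))
```

Other cluster criteria (complete or average linkage) only change the inner `min` over member pairs; the similarity criterion and the cluster criterion are independent choices.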

Example: Divergent evolution

Ancestral sequence: T D W V I K

Descendant 1: T D W V T A L K (I→L mutation and insertion)

Descendant 2: T D W L I K (V→L mutation)

Sequence alignment (pair-wise alignment):

T D W V T A L K
T D W L - - I K

How to do it?

Pair-wise alignment

Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

Number of alignments of two sequences of length n:

C(2n, n) = (2n)! / (n!)^2 ≈ 2^(2n) / √(πn)

2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
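The binomial count and its approximation are far too large to print directly, but their base-10 logarithms are easy to compute. The sketch below evaluates log10 of (2n)!/(n!)^2 via the log-gamma function and compares it with the approximation 2^(2n)/√(πn); the function names are illustrative. For n = 1000 both give an exponent of about 600.

```python
import math


def log10_binomial_2n_n(n):
    """log10 of (2n)! / (n!)^2, via log-gamma to avoid huge integers."""
    natural_log = math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)
    return natural_log / math.log(10)


def log10_approximation(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    n = 1000
    print(f"exact:         10^{log10_binomial_2n_n(n):.2f}")
    print(f"approximation: 10^{log10_approximation(n):.2f}")
```

Numbers of this size rule out enumerating alignments one by one, which is why an exact method has to avoid explicit enumeration.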

T D W V T A L K
T D W L - - I K

Solution: Pair-wise sequence alignment (more than just string matching – guaranteed optimal alignment)

Input sequences:

MDAGSTVILCFVG
MDAASTILCGS

Parameters (evolution): amino acid exchange matrix (20×20); gap penalties (open, extension)

Global dynamic programming over the search matrix yields the alignment:

MDAGSTVILCFVG-
MDAAST-ILC--GS
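Global dynamic programming fills a search matrix cell by cell and then traces back the optimal path. The sketch below is a simplified Needleman-Wunsch implementation: instead of a 20×20 amino acid exchange matrix and separate gap-open and gap-extension penalties, it assumes a flat match/mismatch score and a single linear gap penalty. The scoring values and the function name are illustrative, so the alignment it returns need not be identical to the one shown in the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Search matrix: score[i][j] is the best score for a[:i] versus b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Traceback from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS")
    print(total)
    print(row_a)
    print(row_b)
```

The matrix has (n+1)(m+1) cells and each cell is filled in constant time, so the optimal alignment is found in time proportional to the product of the sequence lengths rather than by enumerating all alignments.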

• Integrating data sources, integrating methods
• Integrating data through methods
• Making new tools to analyse the genomic data (integrative data mining) and predict cellular and molecular features, including for example:
  – Structure, function and interaction of proteins
  – Signalling and metabolic networks
  – Complex diseases

Integrative Bioinformatics Institute VU (IBIVU)


Bioinformatics @ VU
• New genomics data is being collected (pharmacogenomics, VUMC microarray)
• Strong biology groups (neural biology, metabolome, metabolic control)
• Great computational groups (HTC, Visualisation, IC Video wall, Machine learning, Computational intelligence)

• Very good mathematical groups (Statistics, Stochastics)

• You!

Bioinformatics @ VU

• Combine many areas such as mathematics (statistics), computer science (machine learning, high-throughput computing), molecular biology, medicine, etc.

• Analyse and predict molecular features
• Make advanced methods and websites
• Do you dare?

Bioinformatics teaching @ VU

• “Medische Natuurwetenschappen (MNW)” (Medical Natural Sciences), 2nd year: Introduction to Bioinformatics

Bioinformatics teaching @ VU

• New 2-year Masters course: mixture of courses and practical projects
• Developing diverse set of courses
• Diverse palette of 3/6/9/12-month projects
• Student gets mentor for flexible guidance