Bioinformatics Lecture 1. DNA - the basics Drew Berry – DNA animations


Drew Berry – DNA animations

http://www.youtube.com/watch?v=WFCvkkDSfIU&index=4&list=PL9CBBEA5A85DBCDEF








Organisation of DNA

• DNA is packed in chromosomes
• Karyotype: the chromosome set of a species
• Chromosomes are dynamic structures

The human karyotype:
• 23 pairs of chromosomes
• 46 DNA molecules


DNA replication

• The ability of DNA to replicate itself is a fundamental driver of life
• DNA copying is catalysed by enzymes (DNA polymerases)
• The complementary strand is synthesised from a template strand, using deoxynucleotides and a primer
• Synthesis is directional (5’->3’)

[Figure: DNA replication. Labels: 5’ and 3’ strand ends, primer, template DNA strand, deoxyribonucleotides (dNTPs), DNA polymerase, reverse complement copy.]
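The template/copy relationship described above can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` is chosen here, and the input is assumed to contain only the four bases A, C, G and T.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(template: str) -> str:
    """Return the newly synthesised strand, written 5'->3'.

    The template is read 3'->5' while the copy grows 5'->3',
    so the copy is the complement of the template, reversed.
    """
    return template.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("TCAG"))  # CTGA
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.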

The polymerase chain reaction

• Replication requires a DNA polymerase
• Thermostable DNA polymerase (e.g. Taq polymerase)
• Efficient DNA amplification
• No error correction

Kary Mullis, Nobel Prize in Chemistry: 1993

Melt DNA (94-98 °C)

Anneal primers (50-65 °C)

Elongation (72 °C)

Exponential replication
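"Exponential replication" means that each melt/anneal/elongate cycle can at most double the number of target molecules. A minimal sketch of that arithmetic follows; the `efficiency` parameter (fraction of molecules copied per cycle) is an assumption added for illustration and is not taken from the slides.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the number of target molecules after a number of PCR cycles.

    efficiency = 1.0 models perfect doubling every cycle;
    lower values model a reaction in which only some templates are copied.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, f"{pcr_copies(1, n):.3g}")
```

With perfect doubling, a single molecule becomes 1024 copies after 10 cycles and roughly a billion after 30, which is why a few dozen cycles are enough for most applications.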

DNA Sequencing (Sanger)

• The polymerase reaction is terminated using randomly incorporated dideoxynucleotides (ddNTPs)

• Older methods use radiolabelled phosphate

• Newer methods use ddNTPs carrying fluorescent dyes

• Truncated DNA strands are separated on a gel or by capillary electrophoresis
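The logic of chain termination can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of real chemistry: every molecule stops at the first position where a terminator happens to be incorporated, the termination probability `p_terminate` is a made-up constant, and the function names are chosen here for illustration.

```python
import random


def terminated_fragments(new_strand: str, n_molecules: int = 2000,
                         p_terminate: float = 0.05, seed: int = 0) -> set:
    """Simulate random chain termination on many copies of one strand.

    Returns a set of (fragment_length, terminal_base) pairs, i.e. what a
    dye-terminator experiment observes after separation by size.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < p_terminate:
                fragments.add((length, base))  # this molecule stops here
                break
    return fragments


def read_sequence(fragments: set) -> str:
    """Sort fragments by length and read the labelled terminal bases in order."""
    return "".join(base for _, base in sorted(fragments))


print(read_sequence(terminated_fragments("ACGTTGCA")))
```

Because the fragments are ordered by length, the terminal bases spell out the synthesised strand; if some length were never produced, that position would simply be missing from the read.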

Next Generation Sequencing

• Next generation sequencing refers to methods newer than the Sanger approach
• A variety of techniques developed by different companies
• DNA is generally immobilized on a solid support
• Very large numbers of small reads
• Multiple reads of each section of genomic DNA (e.g. 30x)
• Assembling the genome becomes a significant computational problem

• Some ‘single molecule’ methods do not require PCR (reduces errors)
• Cost has fallen substantially: the $1000 genome!

• Ref: Metzker, M. L. Sequencing Technologies — the Next Generation. Nat. Rev. Genet. 2010, 11, 31–46.
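The "30x" figure above is a mean coverage: total sequenced bases divided by genome size. The sketch below shows the arithmetic; the read length of 150 bases and the round genome size of 3 × 10^9 bases are illustrative assumptions, not values from the slides.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each genome position is read."""
    return n_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_length)


GENOME_SIZE = 3_000_000_000  # ASSUMPTION: round figure for a human-sized genome
READ_LENGTH = 150            # ASSUMPTION: hypothetical short-read length

print(reads_needed(30, READ_LENGTH, GENOME_SIZE))  # 600000000
```

Hundreds of millions of short reads for a single genome is what turns assembly into a significant computational problem.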

The Human Genome Project

• Funded by the US government
• The human genome was published in February 2001
• Project completed in 2003
• Cost $US 2.7 billion in 1991 dollars
• Hierarchical shotgun sequencing (genome is broken down into many smaller fragments)
• Automated Sanger type sequencing

• Ref: http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828








Human genome by function

• The human genome contains about 21,000 genes (about 100,000 were expected!)
• 98% of the human genome is noncoding DNA
• Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription

• Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 20018762

https://en.wikiversity.org/wiki/Medical_gallery_of_Mikael_H%C3%A4ggstr%C3%B6m_2014


The druggable genome – Current drug targets

Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.

The druggable genome – Human genes

Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.

Human genome resources

• Three useful sites providing a huge number of resources such as genome browsers

• NCBI: National Center for Biotechnology Information
– http://www.ncbi.nlm.nih.gov/
– http://www.ncbi.nlm.nih.gov/genome/guide/human/

• UCSC genome browser
– http://genome.ucsc.edu/

• Ensembl: European site at the Sanger centre
– http://www.ensembl.org


Next-gen Sequencing Overview

• Ref: http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf

Multiple Genomes

• Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491, 56-65.

Bioinformatics

• Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
– Identify genes
– Identify functions of gene products (proteins)
– Compare genes between species
– Identify relationships (similarities) between species

The Genetic Code

In general:
• Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
• Amino acids with similar physical properties have similar codons, causing conservative substitutions in the case of mutations or mistranslation
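The point about similar codons can be tried out with a small translation sketch. It builds the standard codon table from the conventional T, C, A, G ordering; the function name `translate` and the example sequences are chosen here for illustration, and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon (one-letter amino acid codes)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGATTAA"))  # MD*
print(translate("ATGGAATAA"))  # ME*  (GAT -> GAA: Asp becomes Glu)
```

The single-base change in the second sequence swaps aspartate for glutamate, two acidic amino acids, which is an example of a conservative substitution.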

Genetic mutation

The genetic code can be changed by a variety of processes

Small scale:
• Damage to DNA (radiation or chemical damage)
• DNA replication errors

Large scale:
• Duplication of sections of DNA
• Deletion of sections of DNA
• Transposition of sections of DNA

The rate of genetic mutation

• The mutation rate (per year or per generation) differs between species and even between different sections of the genome

• Different types of mutations occur with different frequencies

• The average mutation rate is estimated to be ~2.5 × 10⁻⁸ mutations per nucleotide site, or 175 mutations per diploid genome per generation

• Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics, 156, 297 (2000).
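The two figures quoted above are linked by simple multiplication. The sketch below assumes a diploid genome of about 7 × 10^9 nucleotide sites, which is the size that makes the per-site rate and the per-genome count agree; that genome size is an assumption stated here, not a number given on the slide.

```python
MUTATION_RATE_PER_SITE = 2.5e-8  # per nucleotide site per generation (from the slide)
DIPLOID_GENOME_SITES = 7e9       # ASSUMPTION: approximate diploid human genome size


def mutations_per_generation(rate_per_site: float, n_sites: float) -> float:
    """Expected number of new mutations per genome per generation."""
    return rate_per_site * n_sites


expected = mutations_per_generation(MUTATION_RATE_PER_SITE, DIPLOID_GENOME_SITES)
print(round(expected))  # 175
```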

Amino acid substitution matrices

• Substitution matrices describe the probability that one AA is converted to another and ‘accepted’

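• The matrix is a ‘log odds’ matrix: each entry is the logarithm of the ratio between how often a substitution is observed and how often it would be expected by chance (i.e. here the probability of conversion from Ala to Arg is 1/log(30))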
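A log-odds score compares how often two residues are seen aligned with how often they would be paired by chance. The sketch below shows the general form only; the frequencies in the example are made-up numbers, and the `scale` parameter stands in for whatever scaling and rounding a particular matrix family applies.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, scale: float = 1.0) -> float:
    """Log-odds score for aligning residue i with residue j.

    q_ij : observed frequency of the aligned pair (i, j)
    p_i, p_j : background frequencies of the two residues
    A positive score means the pair is seen more often than chance predicts.
    """
    return scale * math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, for illustration only.
print(log_odds(q_ij=0.02, p_i=0.1, p_j=0.1))   # pair observed twice as often as chance
print(log_odds(q_ij=0.005, p_i=0.1, p_j=0.1))  # pair observed half as often as chance
```

Because the scores are logarithms, the scores of the individual columns of an alignment can be added to give a score for the whole alignment.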

PAM and BLOSUM matrices

• Scoring matrices are used to:
– produce sequence alignments and score similarity between two or more proteins
– search a database to find sequences similar to a test sequence

• Commonly used families of matrices:
– PAM (Point Accepted Mutation) matrices (Dayhoff)
  • Derived from global alignments of entire proteins
  • Better for closely related proteins
– BLOSUM (BLOcks SUbstitution Matrix) matrices (Steven and Jorja Henikoff)
  • Derived from local alignments of blocks of sequences
  • Better for evolutionarily divergent sequences

BLAST - Searching genomes

• BLAST is a rapid method for searching protein or DNA sequences in large databases

• Sequences are divided into words of k amino acids or bases, e.g. PGFHJIQMQVVS gives PGF, GFH, FHJ, HJI, etc. (k=3)

• Common or repeated sequences are discarded
• Sections of exact sequence match are searched for
• The sequence alignment is expanded from sections that are exact matches

• BLAST can miss difficult matches
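The word-splitting and seed-and-extend steps listed above can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it extends only while residues are identical, whereas the real program scores extensions with a substitution matrix and applies filtering and statistics. All function names are chosen here for illustration.

```python
def kmers(seq: str, k: int = 3) -> list:
    """Split a sequence into all overlapping words of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def seed_hits(query: str, subject: str, k: int = 3) -> list:
    """Find (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for j, word in enumerate(kmers(subject, k)):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in enumerate(kmers(query, k))
            for j in index.get(word, [])]


def extend_ungapped(query: str, subject: str, i: int, j: int, k: int = 3) -> str:
    """Grow an exact seed in both directions while the residues stay identical."""
    left = 0
    while i - left > 0 and j - left > 0 and query[i - left - 1] == subject[j - left - 1]:
        left += 1
    right = k
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return query[i - left:i + right]


print(kmers("PGFHJIQMQVVS")[:4])                    # ['PGF', 'GFH', 'FHJ', 'HJI']
print(seed_hits("PGFHJIQ", "AAGFHJTT"))             # [(1, 2), (2, 3)]
print(extend_ungapped("PGFHJIQ", "AAGFHJTT", 1, 2)) # GFHJ
```

Indexing the subject once and looking up query words in that index is what makes the seed search fast; it is also why a match with no exact k-letter word in common is missed.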

http://blast.ncbi.nlm.nih.gov/


Sequence alignment

• Protein or DNA sequences can be aligned
• Differences between sequences are interpreted as mutations, insertions or deletions
• Substitution matrices are used to score the likelihood of a match
• Alignment scores are calculated between pairs of sequences
• Multiple alignments can be performed
• Many alignment programs: Clustal, T-Coffee
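How a pairwise alignment score is calculated can be sketched with the classic dynamic-programming (Needleman-Wunsch) recurrence for global alignment. The scoring values below (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration; real programs typically use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost).

    Only two rows of the dynamic-programming table are kept, since each cell
    depends only on its left, upper and upper-left neighbours.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "ACT"))         # 1
```

In the second example the best alignment pairs three identical bases and introduces one gap (3 × 1 − 2 = 1), which is how a deletion shows up in the score.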

Sequence alignments and protein structural similarity

• Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity

• High sequence similarity implies (but does not guarantee) structural similarity

• High sequence similarity implies (but does not guarantee) similar protein function

Comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)

Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
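RMSD (root-mean-square deviation) is the distance measure used in the comparison above. A minimal sketch of the calculation follows; it assumes the two structures have already been superimposed and that atoms are listed in matching order, so it does not search for the optimal superposition itself. The function name and the coordinate format (a list of x, y, z tuples) are choices made here for illustration.

```python
import math


def rmsd(coords_a: list, coords_b: list) -> float:
    """Root-mean-square deviation between two paired sets of 3D coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)]))  # 0.0
print(rmsd([(0, 0, 0)], [(3, 4, 0)]))                        # 5.0
```

Which atoms are paired is decided by the alignment, so the same two structures can give different RMSD values depending on whether a sequence alignment or a structural alignment supplies the pairing.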

Differences between sequence and structural alignment

Chain A versus chain D from PDB ID 1vr4. The two chains are 100% identical in sequence

A: Alignment by sequence
B: Alignment by structure
C: Overlaid structures

Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891

Improving sequence alignments

• Adding structural information to sequence alignments can improve their quality

Summary

• This lecture should provide an overview of:
– DNA sequencing and the Polymerase Chain Reaction
– Genome sequencing
– BLAST searching
– Sequence alignments and their limitations
