Bioinformatics
Lecture 1
DNA - the basics
Drew Berry – DNA animations
http://www.youtube.com/watch?v=WFCvkkDSfIU&index=4&list=PL9CBBEA5A85DBCDEF
Organisation of DNA
• DNA is packed in chromosomes
• Karyotype: the chromosome set of a species
• Chromosomes are dynamic structures
The human karyotype:
• 23 pairs of chromosomes
• 46 DNA molecules
DNA replication
• The ability of DNA to replicate itself is a fundamental driver of life
• DNA copying is catalysed by enzymes (DNA polymerases)
• The complementary strand is synthesised from a template strand, using deoxynucleotides and a primer
• Synthesis is directional (5’->3’)
[Figure: DNA polymerase extends a primer along the template DNA strand using deoxyribonucleotides (dNTPs), producing the reverse complement copy; the 5’ and 3’ ends of both strands are labelled.]
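The "reverse complement copy" produced during replication can be illustrated with a short Python sketch. It assumes a plain string of the four DNA bases; the function name `reverse_complement` is chosen here for illustration and is not taken from the lecture.

```python
# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(template: str) -> str:
    """Return the complementary strand, written 5'->3'.

    The template is complemented base by base and then reversed,
    because the two strands of a duplex run in opposite directions.
    """
    return template.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on any implementation.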
The polymerase chain reaction
• Replication requires a DNA polymerase
• Thermostable DNA polymerase (e.g. Taq polymerase)
• Efficient DNA amplification
• No error correction (Taq polymerase has no proofreading activity)
Kary Mullis, Nobel Prize in Chemistry, 1993
PCR cycle:
• Melt DNA (94-98 °C)
• Anneal primers (50-65 °C)
• Elongation (72 °C)
• Exponential replication
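Why the replication is exponential can be seen from a rough model: if every template is copied once per cycle, the number of molecules doubles each cycle. The sketch below is an idealised calculation, not a model of a real reaction; the `efficiency` parameter (fraction of templates copied per cycle) is a simplifying assumption introduced here.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised number of DNA copies after a given number of PCR cycles.

    efficiency = 1.0 means perfect doubling every cycle; real reactions
    fall below this and eventually plateau, which this sketch ignores.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# One template molecule after 30 perfect cycles: 2**30 copies.
print(pcr_copies(1, 30))
```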
DNA Sequencing (Sanger)
• The polymerase reaction is terminated using randomly incorporated dideoxynucleotides (ddNTPs)
• Older methods use radiolabelled phosphate
• Newer methods use dye-labelled ddNTPs
• Truncated DNA strands are separated on a gel or by capillary electrophoresis
Next Generation Sequencing
• Next generation sequencing refers to methods newer than the Sanger approach
• A variety of techniques developed by different companies
• DNA is generally immobilized on a solid support
• Very large numbers of small reads
• Multiple reads of each section of genomic DNA (e.g. 30x)
• Assembling the genome becomes a significant computational problem
• Some ‘single molecule’ methods do not require PCR (reduces errors)
• Cost has fallen substantially: the $1000 genome!
• Refs: Metzker, M. L. Sequencing Technologies — the Next Generation. Nat. Rev. Genet. 2009, 11, 31–46.
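The "30x" figure is a mean read depth: total sequenced bases divided by genome size. The sketch below shows that arithmetic. The genome size (3 billion bases) and read length (150 bases) are round illustrative values assumed here, not numbers from the lecture.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 3_000_000_000  # assumed round figure for a human-sized genome
READ_LENGTH = 150            # assumed short-read length

n = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(n, mean_coverage(n, READ_LENGTH, GENOME_SIZE))
```

Mean depth says nothing about how evenly the reads are spread, so some regions will be covered less well than the average suggests.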
The Human Genome Project
• Funded by US government
• A draft of the human genome was published in February 2001
• Project completed in 2003
• Cost $US 2.7 billion in 1991 dollars
• Hierarchical shotgun sequencing (genome is broken down into many smaller fragments)
• Automated Sanger type sequencing
• Ref: http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828
Human genome by function
• The human genome contains about 21,000 genes (about 100,000 were expected!)
• 98% of the human genome is noncoding DNA
• Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription
• Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 20018762
The druggable genome – Current drug targets
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
The druggable genome – Human genes
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
Human genome resources
• Three useful sites providing a huge number of resources such as genome browsers
• NCBI: National Center for Biotechnology Information
– http://www.ncbi.nlm.nih.gov/
– http://www.ncbi.nlm.nih.gov/genome/guide/human/
• UCSC genome browser
– http://genome.ucsc.edu/
• Ensembl: European site at the Sanger centre
– http://www.ensembl.org
Next-gen Sequencing Overview
• Ref: http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf
Multiple Genomes
• Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491, 56-65.
Bioinformatics
• Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
– Identify genes
– Identify functions of gene products (proteins)
– Compare genes between species
– Identify relationships (similarities) between species
The Genetic Code
In general:
• Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
• Amino acids with similar physical properties have similar codons, causing conservative substitutions in the case of mutations or mistranslation
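To make the codon observations concrete, the sketch below builds the standard genetic code as a lookup table and translates a short DNA string. The names `CODON_TABLE` and `translate` are illustrative choices; the table uses the conventional T, C, A, G base ordering, with `*` marking stop codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate the coding strand codon by codon ('*' marks a stop codon).

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))  # MA*
# A first-base change among CTT / ATT / GTT swaps one hydrophobic
# residue for another (Leu, Ile, Val): a conservative substitution.
print(CODON_TABLE["CTT"], CODON_TABLE["ATT"], CODON_TABLE["GTT"])  # L I V
```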
Genetic mutation
The genetic code can be changed by a variety of processes
Small scale:
• Damage to DNA (radiation or chemical damage)
• Translation errors
Large scale:
• Duplication of sections of DNA
• Deletion of sections of DNA
• Transposition of sections of DNA
The rate of genetic mutation
• The mutation rate (per year or per generation) differs between species and even between different sections of the genome
• Different types of mutations occur with different frequencies
• The average mutation rate is estimated to be ~2.5 × 10⁻⁸ mutations per nucleotide site, or 175 mutations per diploid genome per generation
• Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics, 156, 297 (2000).
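The two figures are linked by simple multiplication: mutations per genome = rate per site × number of sites. The sketch below reproduces 175 from 2.5 × 10⁻⁸. The diploid genome size of 7 × 10⁹ nucleotides is an assumption made here because it is the value that makes the two quoted numbers consistent.

```python
RATE_PER_SITE = 2.5e-8       # mutations per nucleotide site per generation
DIPLOID_GENOME_SITES = 7e9   # assumed: roughly two copies of a ~3.5e9-base genome


def mutations_per_generation(rate_per_site: float, n_sites: float) -> float:
    """Expected number of new mutations per genome per generation."""
    return rate_per_site * n_sites


print(mutations_per_generation(RATE_PER_SITE, DIPLOID_GENOME_SITES))  # about 175
```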
Amino acid substitution matrices
• Substitution matrices describe the probability that one AA is converted to another and ‘accepted’
• Scoring matrices are ‘log odds’ matrices: each score is the logarithm of the ratio between the observed frequency of a substitution (e.g. Ala to Arg) and the frequency expected by chance
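A log-odds score compares how often two residues are seen aligned with how often they would be paired by chance. The sketch below shows one common form of that calculation; the frequencies are hypothetical numbers chosen for illustration, and the base-2 logarithm is an assumption (published matrices use their own scaling and rounding).

```python
import math


def log_odds(observed_pair_freq: float, freq_a: float, freq_b: float,
             base: float = 2.0) -> float:
    """Log-odds score: log( observed pair frequency / expected by chance ).

    The chance expectation is the product of the two background
    frequencies. Positive scores mean the pair occurs more often than
    chance; negative scores mean it occurs less often.
    """
    expected = freq_a * freq_b
    return math.log(observed_pair_freq / expected, base)


# Hypothetical frequencies: pair seen twice as often as chance predicts.
print(round(log_odds(0.02, 0.1, 0.1), 3))   # 1.0
# Hypothetical frequencies: pair seen half as often as chance predicts.
print(round(log_odds(0.005, 0.1, 0.1), 3))  # -1.0
```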
PAM and BLOSUM matrices
• Scoring matrices are used to:
– produce sequence alignments and score similarity between two or more proteins
– search a database to find sequences similar to a test sequence
• Commonly used families of matrices:
– PAM (Point Accepted Mutation) matrices (Dayhoff)
• Derived from global alignments of entire proteins
• Better for closely related proteins
– BLOSUM (BLOcks SUbstitution Matrix) matrices (Henikoff and Henikoff)
• Derived from local alignments of blocks of sequences
• Better for evolutionarily divergent sequences
BLAST - Searching genomes
• BLAST is a rapid method for searching protein or DNA sequences in large databases
• Sequences are divided into overlapping words of k amino acids or bases, e.g. PGFHJIQMQVVS gives PGF, GFH, FHJ, HJI, etc. (k=3)
• Common or repeated sequences are discarded
• Sections of exact sequence match are searched for
• The sequence alignment is expanded from sections that are exact matches
• BLAST can miss difficult matches
http://blast.ncbi.nlm.nih.gov/
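The word-splitting step can be sketched in a few lines of Python. This shows only how a query is cut into overlapping words of length k, using the example sequence from the slide; it is not BLAST's implementation, and the function name `kmers` is an illustrative choice.

```python
def kmers(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping word of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("PGFHJIQMQVVS", k=3)
print(words[:4])   # ['PGF', 'GFH', 'FHJ', 'HJI']
print(len(words))  # 10 words from a 12-residue sequence
```

A sequence of length n yields n - k + 1 words, so the word list is almost as long as the sequence itself.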
Sequence alignment
• Protein or DNA sequences can be aligned
• Differences between sequences are interpreted as mutations, insertions or deletions
• Substitution matrices are used to score the likelihood of a match
• Alignment scores are calculated between pairs of sequences
• Multiple alignments can be performed
• Many alignment programs: Clustal, T-Coffee
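How a pairwise alignment score is calculated can be illustrated with a minimal global alignment sketch using dynamic programming (the Needleman-Wunsch approach). This is not how any of the programs named above is implemented. The scoring values (match +1, mismatch -1, gap -1) are simplifying assumptions standing in for a real substitution matrix and gap model.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b by dynamic programming.

    Only the previous row of the score table is kept, because each cell
    depends on its upper, left and upper-left neighbours alone.
    """
    # Row 0: aligning an empty prefix of a against b costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7: seven matches
print(global_alignment_score("ACGT", "AGT"))         # 2: three matches, one gap
```

Replacing the match/mismatch values with a lookup into a substitution matrix gives the protein scoring described in the bullets above.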
Sequence alignments and protein structural similarity
• Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity
• High sequence similarity implies (but does not guarantee) structural similarity
• High sequence similarity implies (but does not guarantee) similar protein function
Comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
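RMSD (root-mean-square deviation) is the quantity on both axes of that comparison. The sketch below computes it for two equal-length lists of 3D coordinates that are assumed to be already superimposed and paired atom by atom. It does not find the optimal superposition, which is a separate step, and the coordinates shown are made-up values.

```python
import math

Point = tuple[float, float, float]


def rmsd(coords_a: list[Point], coords_b: list[Point]) -> float:
    """Root-mean-square deviation between paired, pre-superimposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates: one atom displaced by a 3-4-5 triangle.
print(rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))  # 5.0
```

Which atoms are paired depends on the alignment used, which is why a sequence-based pairing and a structure-based pairing can give different RMSD values for the same two chains.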
Differences between sequence and structural alignment
Chain A versus chain D from PDB ID 1vr4. The two chains are 100% identical in sequence
A: Alignment by sequence
B: Alignment by structure
C: Overlaid structures
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
Improving sequence alignments
• Adding structural information to sequence alignments can improve their quality
Summary
• This lecture should provide an overview of:
– DNA sequencing and the Polymerase Chain Reaction
– Genome sequencing
– BLAST searching
– Sequence alignments and their limitations