Bio305 genome analysis and annotation 2012

Bio305 Bacterial Genome Annotation and Analysis

Professor Mark Pallen

Overview
Features of Bacterial Genomes
Genome Sequencing
Assembly of bacterial genomes
Annotation of bacterial genomes
Identifying and annotating CDSs
  An ORF is NOT a CDS!
Power and pitfalls of using homology
  BLAST and PFAM

General features of genomes

Microbial
Small WYSIWYG genomes (Mbp)
Gene density high (>90%), intergenic regions short
Very little repetitive or non-coding DNA
Introns very rare
Protein-coding genes (CDSs) short (~1 kbp)
Operons with promoters just upstream
Fewer non-coding RNAs

Human
Very large genomes (Gbp)
Gene density low: only 25% is genes
Introns mean only 1% codes
Genes can span ≥30 kbp
Genes have ~3 transcripts
Splicing and splice variants
Promoter regions distant from gene

Bacterial genome organisation

Chromosomes
Most commonly a single circular chromosome (always DNA)
BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
BUT a few species have two chromosomes (e.g. Vibrio cholerae)
Can be a mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)

Plasmids
Independent autonomous replicons, can be circular or linear
May integrate into chromosome
Copy number varies from 1 to 10s
Often carry non-essential genes that confer an adaptive advantage in certain conditions

Overview of a genome project
Choose strain: fresh isolate or tractable lab strain?
Choose strategy: shotgun sequencing, paired-end sequencing, draft or complete?
Choose chemistry: Sanger; 454; Illumina; Ion Torrent
Assembly: automated
Closure and finishing: manually intensive, difficulty depends on how repetitive the genome is
Data release: immediate or delayed?
Annotation: manually intensive bottleneck
Publication

Whole-Genome Shotgun Sanger Sequencing
[Figure: bacterial chromosome; random shearing; size selection; cloning into plasmid vector; pick colonies to create shotgun library; plasmid preps; sequence each insert with two primers]

High-throughput Sequencing
100x faster, 100x cheaper!

A disruptive technology
Several technologies in the marketplace from 2007 onwards: 454 (Roche), Illumina, Ion Torrent, PacBio

Fundamentally new approaches
Solid-phase amplification of clonal templates in “molecular colonies”
Massive increase in number of “clones” compensates for shorter read length

New chemistries for sequence reading
454: pyrophosphate detection on base addition
Illumina: reversible de-protection of fluorescent bases

High-Throughput Shotgun Sequencing
[Figure: random shearing; size selection; add adapters; amplify; sequence]

Illumina Sequencing

The Sequence Assembly Problem
Sequencing technologies generate reads of <1000 bp

These reads must be assembled into a single continuous genomic sequence.

Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves

The Repeat Problem
Repeats at read ends can be assembled in multiple ways

ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGATATCCCT

[Figure: the same four reads shown in two alternative layouts, one labelled Correct and one labelled Incorrect]
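The ambiguity can be made concrete with a minimal sketch that computes suffix-prefix overlaps between the four reads shown for the repeat problem. The function names and the `min_len` cut-off are illustrative assumptions, not part of any particular assembler; real assemblers also handle sequencing errors and reverse complements, which this sketch ignores.

```python
from itertools import permutations

def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len=5):
    """Map (i, j) -> overlap length for every ordered pair of reads that overlap."""
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap(reads[i], reads[j], min_len)
        if k:
            graph[(i, j)] = k
    return graph

# The four reads from the slide
reads = [
    "ATTTATGTGTGTGTGGTGTG",
    "GTGTGGTGTGCACTACTGCT",
    "ACTACTGCTGACTACTGTGTGGTGTG",
    "GTGTGGTGTGATATCCCT",
]

graph = overlap_graph(reads)
for (i, j), k in sorted(graph.items()):
    print(f"read {i + 1} -> read {j + 1}: overlap {k} bp")
```

Reads 1 and 3 both end in the repeat GTGTGGTGTG, and reads 2 and 4 both begin with it, so each of reads 1 and 3 can be followed equally well by read 2 or read 4. The overlaps alone cannot say which layout is correct.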

Paired-end Sequencing
Random shearing
Size selection for 3 kb or 8 kb etc.
Add linkers
Circularise
Shear and select on size and presence of linkers
Add adapters
Obtain sequences from either side of linker, a known distance apart in genome

Create long fragments of known length
Obtain sequence from paired ends a known distance apart
Allows assembly of contigs across repeats into scaffolds

[Figure: a scaffold made up of Contig 1, Contig 2 and Contig 3, with a physical gap and a sequence gap marked]

Genome Assembly

Re-sequencing
Short reads (<200 bp) are inefficient for de novo assembly

Instead they are mapped against a reference genome

Re-sequencing is like assembling a jigsaw puzzle using the image on the lid

SNP calling
Comparisons between closely related strains allow identification of SNPs that are informative for
Identifying biologically significant changes, e.g. during evolution in lab or patient
Reconstructing phylogenies using neutral changes
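As a minimal sketch of the idea, the comparison can be reduced to listing positions where two already-aligned, gap-free sequences differ. This is an assumption-laden simplification: real SNP calling works from mapped reads and weighs coverage and base quality, none of which is modelled here. The function name `call_snps` is hypothetical.

```python
def call_snps(reference, sample):
    """Return (position, ref_base, sample_base) for each mismatch.

    Assumes both sequences are already aligned, gap-free and of equal length.
    Positions are 1-based, as in feature tables.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    snps = []
    for pos, (ref_base, alt_base) in enumerate(zip(reference.upper(), sample.upper()), start=1):
        # Skip ambiguous bases such as N rather than calling them as SNPs
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            snps.append((pos, ref_base, alt_base))
    return snps

print(call_snps("ACGTACGT", "ACGAACGT"))
```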

Genome annotation
Annotation is the addition of information about the predicted sequence features to the flat file of DNA code
Identification of potential coding sequences (CDSs)
Homology searches to predict function
Other features can be annotated as well
rRNAs, tRNAs, small non-coding RNAs
Potential promoters
Repeat sequences
Insertion sequences (ISs), transposons, gene fragments
Location of the origin of replication
Determination of the number of bases, genes, and G+C%
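The simplest of these statistics, number of bases and G+C%, can be computed directly from a FASTA-format flat file. The sketch below assumes plain FASTA text and ignores ambiguous bases when computing G+C%; the function names are illustrative.

```python
def read_fasta(text):
    """Parse FASTA-format text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}

def gc_percent(seq):
    """G+C content as a percentage of unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

records = read_fasta(">demo\nATGC\nGGCC\n")
for header, seq in records.items():
    print(header, len(seq), "bp", gc_percent(seq), "% G+C")
```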

How to go from this….?
>Escherichia coli K-12 MG1655_3870656-3890655

TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAACGGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCGGTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTTCAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAATCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCCCGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACCTCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCTCAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTGGCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGAGACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATTCCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGGTCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCTGCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATCGCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATTCAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCATTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGTTGCGCGAG
GTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGGAGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTTGTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTACTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAGTTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGATGACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACGGAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCAAAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCACCGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCAAGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCTTCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGGAATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATTCATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGAGCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGGCGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTTAATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGCGCGTGCGT
GCTCCCTTTTTGCTTAGCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACTGGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGAGAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTTCATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTATAGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCATTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCGCAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCGCCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACTCGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTTCAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACGATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTGATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGTCAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAAGCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCGCGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCGACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTCGTTCATCT
GCTGTTCAACGCCGATTTCACCTCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCTGTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCCATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCATAGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGAAGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGGATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACTTCCGCCGCCGA

…to this?
FT   gene            complement(9299..10702)
FT                   /db_xref="GenBank:2367266"
FT                   /gene="dnaA"
FT                   /note="b3702"
FT   CDS             complement(9299..10702)
FT                   /db_xref="GI:2367267"
FT                   /db_xref="PID:g2367267"
FT                   /function="putative regulator; DNA - replication, repair,
FT                   restriction/modification"
FT                   /codon_start=1
FT                   /protein_id="AAC76725.1"
FT                   /gene="dnaA"
FT                   /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT                   FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT                   QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT                   AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT                   KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT                   LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT                   NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT                   SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT                   SNLIRTLSS"
FT                   /product="DNA biosynthesis; initiation of chromosome
FT                   replication; can be transcription regulator"
FT                   /transl_table=11
FT                   /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT                   CG Site No. 851"

Or this?

Caveat
Real bioinformaticians do not use graphical web-based tools
Real bioinformaticians use the Unix (typically Linux) command line interface
Often glue programs together into pipelines using Perl
Write programs in e.g. Perl or Python
Aim here is to equip lab-based workers with basic know-how
If you want to become a bioinformatician, do an MSc in Bioinformatics

Sources of information for annotation
Comparison with genome sequences from related organisms
Published experimental data
Demonstration of function of a gene
Demonstration of function of a homologous gene
Review articles on protein families or groups of proteins
Prediction that the CDS encodes a member of the family
Prediction that the CDS encodes a conserved motif
Protein sequence analysis

Annotations are only predictions
Sequences generated from RNA-Seq and protein mass spectrometry support annotations
Expert knowledge on an organism or protein family can assist in annotation

Approaches to functional annotation
Most of the work is now done automatically by programs
Analyses are strung together into pipelines, so that on our xBASE site we can assemble then annotate a genome in half an hour
But automated approaches work best if a closely related sequence is available
Wherever there are conflicting predictions, one has to rely on human judgment and interpretation of context
Adjusting start codons
Fine-tuning descriptors
Annotation should rely on an evidence trail that leads back to experimental results (“genomic isnad”)

[Figure: circular genome plot with tracks for GC skew, (G-C)/(G+C), which identifies the origin of replication and the leading and lagging strands; %G+C; genes coded by location and function; genes shared with E. coli; genes unique to S. typhi]

Base composition aids genome analysis
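GC skew, (G-C)/(G+C), is straightforward to compute in windows along a sequence. The sketch below is illustrative only: the window and step sizes are arbitrary assumptions, and the rule of thumb that the cumulative skew tends to reach its minimum near the origin and its maximum near the terminus holds for many, but not all, bacterial chromosomes.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return [(window_start, skew)] where skew = (G - C) / (G + C) per window."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((start, (g - c) / (g + c) if g + c else 0.0))
    return skews

def cumulative_skew(skews):
    """Running sum of window skews; turning points suggest origin and terminus."""
    total = 0.0
    running = []
    for start, skew in skews:
        total += skew
        running.append((start, total))
    return running

# Toy example with tiny windows so the arithmetic is easy to follow
toy = gc_skew("GGGCCCCG", window=4, step=4)
print(toy)
print(cumulative_skew(toy))
```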

Analysis of nucleotide sequence data
Search for Sequence Features
Promoters, ribosome-binding sites
Repeats, inverted repeats
Consensus sequences for regulator binding sites
Often rely on sequence motifs
tRNA, rRNA, ncRNA: tRNAscan-SE, Rfam, RNAmmer

Gene Finding in bacteria
Ab initio gene prediction
  By open reading frame: find ORFs, find credible CDSs within ORFs, resolve conflicting ORFs
  By codon usage
  By Markov models
By homology
  Similarity searches via protein or translated BLAST
  Comparative genomics

Identifying protein-coding sequences
In bacteria, the quick and dirty approach is to find ORFs (open reading frames)
Stretches of sequence without termination codons
Can end in any of 3 termination codons: TAG, TGA, TAA
BUT variant genetic codes in mycoplasmas
Can be in any of 6 frames: 3 forward and 3 reverse
Do NOT necessarily start with an initiation codon

Do NOT confuse ORFs and CDSs
CDSs have an initiation codon
Can be any of 3 initiation codons: ATG, GTG, TTG
Has to be in the same frame as the termination codon unless the CDS is frame-shifted
Homology to other protein sequences can help identify a CDS
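The ORF versus CDS distinction can be sketched in a few lines. The code below is a minimal illustration, assuming the standard bacterial stop and start codons listed above. It finds stop-free stretches in all six frames and then looks for the first in-frame initiation codon inside an ORF. Function names are illustrative, and coordinates on the reverse strand are reported relative to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame):
    """Stop-free stretches in one frame as (start, end) with end exclusive (stop not included)."""
    orfs = []
    start = frame
    last = frame  # end of the last complete codon seen
    for i in range(frame, len(seq) - 2, 3):
        last = i + 3
        if seq[i:i + 3] in STOPS:
            if i > start:
                orfs.append((start, i))
            start = i + 3
    if last > start:  # ORF running off the end of the sequence
        orfs.append((start, last))
    return orfs

def six_frame_orfs(seq):
    """ORFs for 3 forward and 3 reverse frames, keyed by (strand, frame)."""
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame)] = orfs_in_frame(s, frame)
    return result

def first_start(seq, orf):
    """Position of the first in-frame initiation codon in an ORF, or None."""
    begin, end = orf
    for i in range(begin, end, 3):
        if seq[i:i + 3] in STARTS:
            return i
    return None

seq = "CCATGAAATTTGGGTAACC"
for key, orfs in six_frame_orfs(seq).items():
    print(key, orfs)
```

In this toy sequence the ORF in forward frame 2 contains an ATG and so is a candidate CDS, whereas the ORF in forward frame 1 has no initiation codon at all: an ORF, but not a CDS.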

The problem of conflicting ORFs
[Figure: overlapping ORFs in different frames, distinguishing non-coding ORFs from CDSs (note an ORF can extend upstream of the start codon)]

The Problem of Frameshift Errors

Actual sequence
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
M  S  T  A  K  L  V  K  S  K  A  T  N  L  L  Y  T  R  N  D  V  S  D  S  E  K
 •  V  P  L  N  •  L  N  Q  K  R  P  I  C  F  I  P  A  T  M  S  P  T  A  R  K
  E  Y  R  •  I  S  •  I  K  S  D  Q  S  A  L  Y  P  Q  R  C  L  R  Q  R  E  K

Frameshifted sequence after single base error
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
M  S  T  A  K  L  V  K  S  K  S  D  Q  S  A  L  Y  P  Q  R  C  L  R  Q  R  E
 •  V  P  L  N  •  L  N  Q  K  A  T  N  L  L  Y  T  R  N  D  V  S  D  S  E  K
  E  Y  R  •  I  S  •  I  K  K  R  P  I  C  F  I  P  A  T  M  S  P  T  A  R  K
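The effect of a single inserted base can be reproduced with a short translation sketch. It uses the standard genetic code (which for these codons agrees with bacterial translation table 11) and the sequence from the frameshift slide. The variable and function names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Build the codon table from the conventional TCAG ordering of the standard code
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(seq, frame=0):
    """Translate one forward frame; '*' marks a stop, 'X' an unrecognised codon."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
          "CGCAACGATGTCTCCGACAGCGAGAAA")
# Simulate a sequencing error: one extra A inserted after base 30
shifted = actual[:30] + "A" + actual[30:]

print(translate(actual))   # correct protein
print(translate(shifted))  # identical up to the error, unrelated afterwards
```

The first ten residues are unchanged, but everything downstream of the inserted base is read in a different frame, so the predicted protein bears no resemblance to the real one.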

CDS Prediction: Graphical Plots

GC content by reading frame

Amino-acid composition by reading frame, compared to average for globular proteins

CDS Prediction: Markov Models
Markov Model-based programs
Use probabilities of states and transitions between these states to predict features
Glimmer is the industry standard for bacterial genomes
Can be trained on a related genome
Or use the long-ORFs (>500 codons) option to bootstrap a model

Problems
Smaller genes are not statistically significant so are thrown out
Algorithms are trained with sequences from known genes, which biases against genes about which nothing is known
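To give a feel for how transition probabilities can separate coding from non-coding sequence, here is a toy fixed-order Markov chain scored as a log-odds ratio. This is a deliberately simplified sketch and not Glimmer's method (Glimmer uses interpolated Markov models); the training sequences, the order and the pseudocount are all made-up assumptions.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2, pseudocount=1.0):
    """Return a function prob(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding, background, order=2):
    """Sum of log(P_coding / P_background) over each base given its context."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / background(context, base))
    return score

# Hypothetical training sets: a GC-rich "coding" set and an AT-rich "background" set
coding_model = train_markov(["ATGGCGGCGGCGGCG" * 3])
background_model = train_markov(["ATATATATATAT" * 3])

print(log_odds("GCGGCGGCG", coding_model, background_model))  # positive: looks coding
print(log_odds("ATATATAT", coding_model, background_model))   # negative: looks like background
```

The sketch also shows the bias noted above: a model can only favour sequences that resemble what it was trained on.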

Annotation of protein-coding genes
Structure and composition
Transmembrane domains
Signal peptides
Post-translational modifications
Homology to other proteins
Function(s)
Catalytic activity / cofactors / induction / regulation
Metabolic pathways
Structural genes
Cellular location
Phase variants, pseudogenes, SNPs, coding repeats, etc.

Annotation pipeline
Predict CDSs with Glimmer
On the predicted genes
Do homology searches (BLAST) against nearest relative
Port annotation across on orthologues
Apply in-depth analysis to strain-specific genes (or all genes if de novo sequence)
Domain searches: PFAM or CDD
PSI-BLAST
Perform other analyses: coiled coils, signal peptides, TM domains

Homology
Similarity that arises because of descent from a common ancestor…
“The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel… We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation… Languages, like organic beings, can be classed in groups under groups; and they can be classed either naturally according to descent, or artificially by other characters… The survival or preservation of certain favoured words in the struggle for existence is natural selection.”
Charles Darwin, 1871, The Descent of Man, Chapter 3

Homology
Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)

Homology is not just sequence similarity
Two sequences can be similar without any common ancestry, particularly if low complexity

the cat sat on the mat
die Katze sass auf der Matte

vge|GBant88-2   ITLITCVSVKDNSKRYVVAG
vge|GEfae9-178  LTLITCDQATKTTGRIIVIA
vge|GSpne1-403  MTLITCDPIPTFNKRLLVNF
sortase_staur   LTLITCDDYNEKTGVWEKRK

Types of Homology
Homologues can be divided into
Orthologues: lines of descent congruent with whole genome
Paralogues: result of gene duplication
Xenologues: result of horizontal gene transfer (HGT)

Homology Searches
The aim of homology searches is to identify sequences within databases that are homologous to your sequence.
This involves comparing your sequence with all the database sequences, looking for stretches of sequence that appear to be similar,
then scoring the matches and ranking them
A measure of the significance of each match is given

What is BLAST?
Basic Local Alignment Search Tool
Developed in 1990, refined in 1997 (Stephen Altschul)
A method of searching sequence databases to find sequences similar to the input sequence
Scans a database for alignments to a query sequence
Fastest and most frequently used sequence alignment tool: the industry standard
Can be extremely informative, giving clues to functionality, evolutionary history, important residues
Basis for many forms of bioinformatic analysis

The several flavours of BLAST
BLASTP: protein query versus protein sequence database
BLASTN: nucleotide query versus nucleotide sequence database
BLASTX: translated nucleotide query versus protein sequence database
TBLASTN: protein query versus translated nucleotide sequence database
TBLASTX: translated nucleotide query versus translated nucleotide sequence database

Choosing the right flavour
What program will best suit your query, and desired output?
If you are dealing with a protein-coding gene, comparisons at the protein level give better results
Sequence complexity: 20 amino acids versus 4 nucleotides
Moderately similar nucleotide sequences could encode a highly similar protein sequence!
Use BLASTP or a translated BLAST search, rather than BLASTN
Reserve BLASTN for non-coding regions or rRNA/tRNA genes
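The point about nucleotide versus protein similarity can be checked with a small worked example. The two sequences below are made up for illustration: they differ at four third-codon positions, all synonymous, so the nucleotide identity drops while the encoded peptide is unchanged. The standard genetic code is assumed.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(seq):
    """Translate frame 0 of a nucleotide sequence using the standard code."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def percent_identity(a, b):
    """Percentage of identical positions between two equal-length, gap-free sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

gene_a = "ATGGCTAAAGAAGTT"  # hypothetical gene fragment
gene_b = "ATGGCCAAGGAGGTG"  # same peptide, different codon choices

print(percent_identity(gene_a, gene_b))                        # nucleotide level
print(percent_identity(translate(gene_a), translate(gene_b)))  # protein level
```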

Low complexity filtering
Low complexity sequence with pronounced compositional bias can lead to spurious alignments
Modern versions of BLAST either take into account amino-acid composition or screen out regions of low complexity
At NCBI, adjustment for compositional bias is on but the low-complexity filter is off by default
For a “no stones unturned” approach, explore results with adjustments and filter on and off

Watch out for…
transmembrane or signal peptide regions
coiled-coil regions
short amino acid repeats (collagen, elastin)
homopolymeric repeats
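One simple way to see what "low complexity" means is to measure Shannon entropy in a sliding window. The sketch below is only an illustration of the concept; it is not the SEG algorithm that BLAST uses for protein queries, and the window size, threshold and example sequence are arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in Counter(segment).values())

def low_complexity_windows(protein, window=12, threshold=2.2):
    """Start positions (0-based) of windows whose entropy falls below the threshold."""
    return [
        start
        for start in range(len(protein) - window + 1)
        if shannon_entropy(protein[start:start + window]) < threshold
    ]

# Hypothetical protein: a collagen-like GPP repeat followed by ordinary sequence
protein = "GPPGPPGPPGPP" + "MKTAYIAKQRQISFVKSHFSRQ"
hits = low_complexity_windows(protein)
print(hits)
```

Windows covering the repeat are flagged, while windows in the compositionally diverse tail are not.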

Understanding BLAST Results
Graphic representation of results

Top of graph represents query sequence

Underlying bars show where hits occur

Colors represent alignment scores

Grey areas represent non similar regions surrounded by similar regions

Scrolling over bar shows accession and description of hit

Clicking on a bar takes you to its alignment with the query

Bit Scores: high is good
E-values: low is good

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
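Bit scores and E-values are linked by the relation E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The sketch below applies that relation directly. It is an approximation, since BLAST uses effective (edge-corrected) lengths, and the lengths in the example are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: 300-residue query against a 1e9-residue database
for bits in (30, 40, 50, 68):
    print(bits, expect_value(bits, 300, 1_000_000_000))
```

Each extra bit halves the E-value, and doubling the database size doubles it, which is why the same alignment becomes less significant as databases grow.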

Typical Blast Output

                                                                      Sum
                                                     Reading  High  Probability
Sequences producing High-scoring Segment Pairs:        Frame Score  P(N)      N

emb|X69337|ECDPS     E.coli dps gene for binding protein       +2   834  6.4e-109  1
gb|U04242|ECU04242   Escherichia coli core starvation p...     +3   828  2.7e-106  1
emb|X14180|ECGLNHPQ  Escherichia coli glutamine permeas...     +3   443  2.8e-53   1
gb|U18769|HDU18769   Haemophilus ducreyi fine tangled p...     +1   150  4.0e-18   2
dbj|D01016|ANALTI46  Anabaena variabilis lti46 gene. >e...     +2   129  4.8e-12   2
gb|M84990|P26BPO     Plasmid pOP2621 ORF1 gene, 5' end;...     -2   131  6.7e-09   1
gb|U16121|HPU16121   Helicobacter pylori neutrophil act...     +1   112  1.8e-06   1
gb|M32401|TRPTYF1    T.pallidum pallidum antigen TyF1 g...     +3   101  5.6e-06   2
emb|X71436|RPNTRB    R.phaseoli ntrB gene                      +1    67  0.76      2
gb|L35598|DRODGC1A   Drosophila melanogaster receptor g...     +1    48  0.97      3

Typical Blast Output

gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
Length = 780
Plus Strand HSPs:
Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = +1

Query: 30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
E L ++ +L+LI K AHWN+ G FIAVHEMLD + D +D +AER LG
Sbjct: 253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432

Query: 90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
G + + YPL QDHLK
Sbjct: 433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519

Domain database searches
Rationale
Now that databases are very large, BLAST results can be difficult to interpret when there are 1000s of hits
If one part of a protein has many hits and another part has few hits, useful information may get swamped or lost

Solution
Search databases that contain collections of protein domains/families
Pfam pfam.sanger.ac.uk/
CDD www.ncbi.nlm.nih.gov/cdd
Represented as sequence alignments and/or HMMs
Annotated with information about key features of domain

Pfam domains

Pfam search results

The Annotation Catastrophe
[Figure: domain diagrams of Protein A (“a protease”), Protein B and Protein C, labelled with signal peptide, protease and coiled coil domain]
Homology lies in one domain
But functional assignment for the whole of protein A comes from another domain, carried across in error, so proteins B and C get misannotated as proteases

Annotation: rules to consider
Don’t trust your computer blindly
Adopt Cartesian doubt!
Examine and think about your results
Confirm with multiple lines of evidence: BLAST, genomic context, PFAM

Overview
Features of Bacterial Genomes
Genome Sequencing
Assembly of bacterial genomes
Annotation of bacterial genomes
Identifying and annotating CDSs
  An ORF is NOT a CDS!
Power and pitfalls of using homology
  BLAST and PFAM
