Genome Analysis and Genome Comparison. Outline Overview Why do comparative genomic analysis? Assumptions/Limitations Genome Analysis and Annotation Standard

Genome Analysis and Genome Comparison

Outline

• Overview
• Why do comparative genomic analysis?
• Assumptions/Limitations
• Genome Analysis and Annotation Standard Procedure
• General Purpose Databases for Comparative Genomics
• Organism Specific Databases
• Genome Analysis Environments
• Genome Sequence Alignment Programs
• Genomic Comparison Visualization Tools

Genome Project – status

June 14, 2002:

• 93 Published Complete genomes
– 16 archaeal
– 65 bacterial
– 12 eukaryotic

• 284 Ongoing Prokaryotic Genomes

• 195 Ongoing Eukaryotic Genomes

Some of the prokaryotic genomes

Organism | Disease / relevance | Status
Bacteroides fragilis | Opportunistic | In progress
Bordetella bronchiseptica | Veterinary | In progress
Bordetella parapertussis | Whooping cough |
Bordetella pertussis | Whooping cough | Complete
Burkholderia cepacia | Lung infections in CF | In progress
Burkholderia pseudomallei | Melioidosis | In progress
Chlamydophila abortus | Veterinary | Funded
Clostridium botulinum | Botulism | Funded
Clostridium difficile | Colitis | In progress
Corynebacterium diphtheriae | Diphtheria | Complete
Erwinia carotovora | Plant pathogen | Funded
Escherichia/Shigella spp. (5) | Various | In progress
Mycobacterium bovis | Tuberculosis | In progress
Mycobacterium marinum | Various | In progress
Neisseria meningitidis (serogroup C) | Bacterial meningitis | In progress
Salmonella typhi | Typhoid fever | Complete
Salmonella spp. (5) | Various | In progress
Staphylococcus aureus (MRSA) | Various (Nosocomial) | Complete
Staphylococcus aureus (MSSA) | Various (Community acquired) | In progress
Streptococcus pneumoniae | Bacterial meningitis | In progress
Streptococcus pyogenes | Various (ARF-associated) | In progress
Streptococcus suis | Veterinary | In progress
Streptococcus uberis | Veterinary | In progress
Streptomyces coelicolor | Non-pathogenic | Complete
Tropheryma whipplei | Whipple’s disease | In progress
Wolbachia (Culex quinquefasciatus) | Vector (Bancroftian filariasis) | In progress
Wolbachia (Onchocerca volvulus) | River Blindness | Funded
Yersinia enterocolitica | Food poisoning | In progress
Yersinia pestis | Plague | Complete

Complete

Some of the eukaryotic genomes

Organism | Disease / relevance | Status
Aspergillus fumigatus | Farmer’s lung | In progress
Dictyostelium discoideum | Soil amoeba | In progress
Entamoeba histolytica | Amoebic dysentery | In progress
Leishmania major | Leishmaniasis | In progress
Plasmodium falciparum | Malaria | In progress
Schistosoma mansoni | Bilharzia | In progress
Schizosaccharomyces pombe | Fission yeast | Complete
Theileria annulata | Veterinary | In progress
Toxoplasma gondii | Toxoplasmosis | In progress
Trypanosoma brucei | Sleeping sickness | In progress

Bioinformatics Flow Chart

1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & Protein expression data
7. Drug screening: ab initio drug design OR drug compound screening in database of molecules
8. Genetic variability

[Figure: shotgun sequencing workflow] Genomic DNA → Shearing/Sonication → Subclone and Sequence → Shotgun reads → Assembly → Contigs → Finishing (finishing reads) → Complete sequence

Genome Sequencing - Review

Stages: Strategy → Libraries → Sequencing → Assembly → Closure → Annotation → Release

• Strategy: clone by clone vs whole genome shotgun
• Libraries: subcloning; generate small insert libraries
• Sequencing
• Assembly: process of turning raw single-pass reads into contiguous consensus sequences (Phred/Phrap)
• Closure: process of ordering and merging consensus sequences into a single contiguous sequence
• Annotation:
– DNA features (repeats/similarities)
– Gene finding
– Peptide features
– Initial role assignment
– Others: regulatory regions
• Release: release data to the public, e.g. EMBL or GenBank

• Most genomes will be sequenced and can be sequenced; few problems are unsolvable.
• The problem lies in understanding what you have:
– Gene prediction/gene finding
– Annotation
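To make the assembly stage concrete, here is a toy sketch of merging overlapping reads into a contig. It is not how Phred/Phrap work: it ignores base qualities, sequencing errors and reverse-complement reads, and the function names and the `min_len` parameter are assumptions made for illustration only.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left sequence, index of right sequence)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]  # hypothetical shotgun reads
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is the situation the closure stage then has to resolve.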

Annotation of eukaryotic genomes

[Figure: from gene to function] Genomic DNA → (transcription) → Unprocessed RNA → (RNA processing) → Mature mRNA (poly-A tail, AAAAAAA) → (translation) → Nascent polypeptide → (folding) → Active enzyme → Function (Reactant A → Product B)

Annotation approaches shown: ab initio gene prediction, comparative gene prediction, functional identification

Why do comparative genomics?

• Many of the genes encoded in each genome from the genome projects had no known or predictable function.
• Analysis of protein sets from completely sequenced genomes
• Uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)
• The function of many of these genes can be predicted by comparing genomes with known functional annotation and transferring the functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms.
• Cross-species comparison helps reveal conserved coding regions
• No prior knowledge of the sequence motif is necessary
• Complement to algorithmic analysis

Assumptions/Limitations

• Homologous genes are relatively well preserved, while noncoding regions show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression and maintaining the structural organization of the genome, and most likely have other functions as well.

• Cross species comparative genomics is influenced by the evolutionary distance of the compared species.

Genome Analysis and Annotation: General Procedure

• Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
• Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, programs based on the Smith-Waterman algorithm are preferred, at the cost of greater analysis time.
• Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
• Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
• Generate a secondary and tertiary (if possible) structure prediction.
• Annotation:
– Transfer function information from a well-characterized organism to a lesser studied organism and/or
– Use phylogenetic patterns (or profiles) and/or
– Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
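The logical operations on gene sets in the last step map directly onto set operations. The sketch below is only an illustration: the genome names, the group identifiers and the `differential_display` helper are hypothetical and are not taken from the COG database or its search tool.

```python
# Hypothetical presence/absence data: which ortholog groups occur in which genome.
genomes = {
    "pathogen_A": {"COG0001", "COG0002", "COG0003", "COG0005"},
    "pathogen_B": {"COG0001", "COG0003", "COG0004", "COG0005"},
    "free_living": {"COG0001", "COG0002", "COG0004"},
}


def differential_display(present_in, absent_from, table):
    """Groups found in ALL `present_in` genomes (AND) and in none of `absent_from` (NOT)."""
    shared = set.intersection(*(table[g] for g in present_in))
    excluded = set().union(*(table[g] for g in absent_from))
    return shared - excluded


# AND + NOT: groups common to both pathogens but missing from the free-living relative.
candidates = differential_display(["pathogen_A", "pathogen_B"], ["free_living"], genomes)
print(sorted(candidates))

# OR: groups present in at least one of the two pathogens.
either = genomes["pathogen_A"] | genomes["pathogen_B"]
print(sorted(either))
```

Groups that survive such a filter are candidates for functions tied to the shared phenotype; they still need the similarity and motif checks from the earlier steps before any annotation is transferred.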

Genome Analysis and Annotation: One Possible Procedure

• Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
• Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, programs based on the Smith-Waterman algorithm are preferred, at the cost of greater analysis time.
• Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
• Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
• Generate a secondary and tertiary (if possible) structure prediction.
• Transfer function information from a well-characterized organism to a lesser studied organism and/or use phylogenetic patterns (or profiles) and/or use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).

Automated Genome Annotation

• GeneQuiz – limited number of searches/day
• MAGPIE – outside users cannot submit their own sequences
• PEDANT – commercial version allows for full capacity
• SEALS – semi-automated

General Databases Useful for Comparative Genomics

• LocusLink/RefSeq
– http://www.ncbi.nih.gov/LocusLink/
• PEDANT - Protein Extraction Description ANalysis Tool
– http://pedant.gsf.de/
• MIPS
– http://mips.gsf.de/
• COGs - Clusters of Orthologous Groups (of proteins)
– http://www.ncbi.nih.gov/COG/
• KEGG - Kyoto Encyclopedia of Genes and Genomes
– http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database
– http://mbgd.genome.ad.jp/
• GOLD - Genomes OnLine Database
– http://wit.integratedgenomics.com/GOLD/
• TOGA
– http://www.tigr.org/xxxxx

Problems with existing sequence alignment algorithms for genomic analysis

• Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene

• Most algorithms were based on assigning a score to all the possible alignments (usually by the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then finding the optimal or near-optimal alignment based on the chosen scoring scheme.

• Unfortunately, most of these programs cannot accurately handle long alignments.

• Linear-space Smith-Waterman variants are too computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming. The trade-off is higher speed vs increased sensitivity.
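The scoring idea described above (similarity values summed over aligned residues, minus gap penalties, then maximised) can be sketched as a small Smith-Waterman scorer. The match, mismatch and gap values are arbitrary assumptions for illustration; real programs use substitution matrices and usually affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score under a simple match/mismatch/linear-gap scheme."""
    prev = [0] * (len(b) + 1)  # only two rows are kept, so memory is linear in len(b)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTACGT", "ACGTCGT"))  # 7 matches and one gap
```

Even with linear memory, every cell of the len(a) × len(b) matrix is still visited. For two sequences of 5 million bases each that is 2.5 × 10^13 cell updates, which is why exhaustive dynamic programming does not scale to whole genomes.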

Genome-size comparative alignment tools

• ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
– ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
• BLAT
– http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
• DIALIGN - DIagonal ALIGNment
– http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
• DBA - DNA Block Aligner
– http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
• GLASS - GLobal Alignment SyStem
– http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
• LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS
– Email: [email protected] (Buhler 2001)
• MegaBlast
– http://www.ncbi.nih.gov/blast/ (Zhang 2000)
• MUMmer - Maximal Unique Match (mer)
– http://www.tigr.org/softlab/ (Delcher et al. 1999)
• PIPMaker - Percent Identity Plot MAKER
– http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• SSAHA - Sequence Search and Alignment by Hashing Algorithm
– http://www.sanger.ac.uk/Software/analysis/SSAHA/
• WABA - Wobble Aware Bulk Aligner
– http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)

SSAHA

• Sequence Search and Alignment by Hashing Algorithm
• Software tool for very fast matching and alignment of DNA sequences
• Achieves fast search speed by converting sequence information into a hash table data structure, which can then be searched very rapidly for matches
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Run from the Unix command line
• Needs > 1 GB RAM (needs a lot of memory)
• The SSAHA algorithm is best for applications requiring exact or “almost exact” matches between two sequences, e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
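The hash table idea can be sketched in a few lines: store where every short word (k-mer) occurs in the subject sequence, then look up the words of the query. This is a simplified illustration and not SSAHA's actual data layout; the function names, the word size `k` and the choice to index every overlapping word are assumptions of the sketch.

```python
from collections import defaultdict


def build_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer of `subject` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_hits(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every query k-mer found in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, spos))
    return hits


index = build_index("ACGTTGCAAC", k=4)
hits = find_hits("TTGCA", index, k=4)
print(hits)
# Hits sharing the same diagonal (subject_pos - query_pos) belong to one candidate match.
print({spos - qpos for qpos, spos in hits})
```

Each lookup is a constant-time dictionary access, so search time depends mainly on the query length, while the index itself grows with the subject sequence. That is consistent with the large memory requirement noted above.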

Genome Analysis Environment

• MAGPIE - Multipurpose Automated Genome Project Investigation Environment

• PEDANT

• SEALS

Problems with Visualizing Genomes

• Alignment program output was often viewed as a text file, which can be difficult to interpret intuitively when comparing genomes.
• Visualization tools need to handle the complexity and volume of data and present the information in a comprehensive and comprehensible manner to a biologist for interpretation.
• Genome alignment visualization tools need to provide:
– Interpretable alignments
– Gene prediction and database homologies from different sources
– Interactive features: real-time capabilities, zooming, searching specific regions of homology
– Representation of breaks in synteny
– Display of multiple alignments
– Display of contigs of unfinished genomes alongside finished genomes
– Handling of various data formats
– Software availability (no black box)

Genome Comparison Visualization Tools

• ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis, an annotation tool)
– http://www.sanger.ac.uk/Software/ACT/
• Alfresco (displays DBA alignments and ...)
– http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
• PipMaker (displays BLASTZ alignments)
– http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• Enteric/Menteric/Maj (displays BLASTZ alignments)
– http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al. 2000)
• Intronerator (displays WABA alignments and ...)
– http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b)
• VISTA (Visualization Tool for Alignment) (displays GLASS alignments)
– http://www-gsd.lbl.gov/vista/
• SynPlot (displays DIALIGN and GLASS alignments)
– http://www.sanger.ac.uk/Users/igrg/SynPlot/

Artemis Comparison Tool (ACT)

- ACT is a DNA sequence comparison viewer based on Artemis
- Can read complete EMBL and GenBank entries or sequences in FASTA or raw format
- Additional sequence features can be in EMBL, GenBank, or GFF format
- ACT is free software and is distributed under the GNU General Public License
- Java-based software
- Latest release (2.0) has better support for eukaryotic genome comparison

http://www.sanger.ac.uk/Software/ACT/

ASSIRC

• Accelerated Search for SImilarity Regions in Chromosomes
• ASSIRC finds regions of similarity in pair-wise genomic sequence alignments.
• The method involves three steps:
– (i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions;
– (ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure (i.e. the four bases are associated;
– (iii) final selection of regions of similarity by assessing alignments of the putative sequences.
• The authors used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions and seed sizes.
• This approach can be tailored to the user's specifications.
• The authors looked for regions of similarity between two yeast chromosomes (V and IX). The efficiency of the approach was compared to that of the conventional programs BLAST and FASTA by assessing the CPU time required and the regions of similarity found for the same data set.
• http://www.biologie.ens.fr/perso/vincens/assirc.html
• ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz
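The three steps can be sketched as a generic seed-and-extend search. This is not ASSIRC's implementation: step (ii) below swaps the 'random walk' procedure for a simple ungapped extension that stops once the score drops too far below its best value, and the scores, thresholds and function names are assumptions for illustration.

```python
from collections import defaultdict


def seeds(a: str, b: str, k: int):
    """Step (i): exact k-mers shared by both sequences, found with a hash table."""
    table = defaultdict(list)
    for i in range(len(a) - k + 1):
        table[a[i:i + k]].append(i)
    for j in range(len(b) - k + 1):
        for i in table.get(b[j:j + k], ()):
            yield i, j


def extend(a: str, b: str, i: int, j: int, step: int, x_drop: int) -> int:
    """Step (ii), simplified: walk from (i, j) in direction `step` (+1 or -1), scoring
    +1 per match and -1 per mismatch; return the length at which the score peaked."""
    score = best = best_len = n = 0
    while 0 <= i < len(a) and 0 <= j < len(b) and best - score <= x_drop:
        score += 1 if a[i] == b[j] else -1
        n += 1
        if score > best:
            best, best_len = score, n
        i += step
        j += step
    return best_len


def similar_regions(a: str, b: str, k: int = 4, x_drop: int = 2, min_identity: float = 0.8):
    """Step (iii): keep extended regions whose fraction of identical bases is high enough."""
    found = set()
    for i, j in seeds(a, b, k):
        left = extend(a, b, i - 1, j - 1, -1, x_drop)
        right = extend(a, b, i + k, j + k, +1, x_drop)
        start_a, start_b, length = i - left, j - left, left + k + right
        matches = sum(a[start_a + n] == b[start_b + n] for n in range(length))
        if matches / length >= min_identity:
            found.add((start_a, start_b, length))
    return sorted(found)


# Two hypothetical sequences sharing a 12-base region that differs at one position.
seq_a = "TTTT" + "ACGTACGATCGA" + "TTTT"
seq_b = "GGGG" + "ACGTACCATCGA" + "GGGG"
print(similar_regions(seq_a, seq_b))  # (start in seq_a, start in seq_b, length)
```

The seed size controls the trade-off mentioned on the slide: longer seeds give fewer spurious hits and a faster search, but regions without any exact stretch of that length are missed.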

BLAT

• Only DNA sequences of 25,000 or fewer bases and protein or translated sequences of 5,000 or fewer letters will be processed. If multiple sequences are submitted at the same time, the total limit is 50,000 bases or 12,500 letters.
• BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein BLAT on land vertebrates.
• BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein index takes a little more than 2 gigabytes.
• BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is free to all. Sources and executables to run batch jobs on your own server are available free for academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also available. Contact Jim for details.
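A small calculation shows how the 33-base and 22-base figures could follow from an index of non-overlapping 11-mers. It counts how many index words lie completely inside a perfect match, depending on where the match starts relative to the word boundaries. The assumption that two index words must hit before a region is examined is not stated on the slide and is labelled as such in the code; the function names are hypothetical.

```python
def tiles_inside(match_start: int, match_len: int, k: int = 11) -> int:
    """Number of non-overlapping k-mers (starting at multiples of k) that lie
    entirely inside a perfect match covering [match_start, match_start + match_len)."""
    first_tile = -(-match_start // k)           # ceil(match_start / k)
    end_tile = (match_start + match_len) // k   # index of the first tile that no longer fits
    return max(0, end_tile - first_tile)


def min_and_max_tiles(match_len: int, k: int = 11) -> tuple[int, int]:
    """Worst and best case over all possible offsets of the match within a tile."""
    counts = [tiles_inside(start, match_len, k) for start in range(k)]
    return min(counts), max(counts)


REQUIRED_HITS = 2  # assumption for this sketch: two index words must match

for length in (20, 22, 33):
    low, high = min_and_max_tiles(length)
    always = low >= REQUIRED_HITS
    sometimes = high >= REQUIRED_HITS
    print(length, low, high, "always" if always else "sometimes" if sometimes else "never")
```

Under that assumption a 33-base perfect match always contains at least two whole index words, while a 22-base match contains two only when it happens to start exactly on a word boundary.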
