Web-based/Open-source Tools for Bioinformatics and Genome Analysis

http://www.faculty.ucr.edu/~tgirke/Teaching/Gen240B_2003.ppt

Bioinformatics Areas

A. Traditional Bioinformatics
- Sequence analysis
- Gene expression analysis
- Proteomics
- Metabolic profiling
- Phenotypes
- Networks

B. Structural Bioinformatics
- Molecular modeling
- Drug design

C. Biological Databases

Systems Biology

Focus of this Seminar

1. Sequences

2. Structure

3. Expression

4. Functional Groups

Bio* Projects and Databases

1. Some Analysis Steps

- Fragment assembly: ESTs and genes
- Mapping
- Annotation
- Gene predictions: ORFs, UTRs, introns, exons, promoters (lots of errors in eukaryote genomes!)
- Similarity searches: BLAST, FASTA, Smith-Waterman
- Gene families: domain databases, multiple alignments
- Structure/function: 2D, 3D structure (availability?)
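Of the similarity search methods named above, Smith-Waterman is the exact local alignment algorithm that BLAST and FASTA approximate heuristically. The following is a minimal sketch, not taken from the slides: the function name and the scoring values (match, mismatch, linear gap penalty) are illustrative assumptions, and real protein searches would use a substitution matrix and affine gaps instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions (simple match/mismatch,
    linear gap penalty), not parameters from the lecture.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The dynamic programming matrix grows with the product of the two sequence lengths, which is why heuristic tools are preferred for whole-database searches.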

Important Sequence Databases (Selection)

NCBI
- Entrez: http://www.ncbi.nlm.nih.gov/
- Batch Entrez: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi
- Downloads: ftp://ftp.ncbi.nih.gov/blast/db/

EMBL-EBI
- General: http://www.ebi.ac.uk/
- Downloads: http://www.ebi.ac.uk/FTP/

Swiss-Prot
- General: http://us.expasy.org/
- Downloads: http://us.expasy.org/expasy_urls.html

TIGR
- General: http://www.tigr.org/
- Downloads: ftp://ftp.tigr.org/pub/data/

Protein Data Bank (PDB)
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data

Example: NCBI

Sequence Database Searches

Important search algorithms: Smith-Waterman, FASTA, BLAST

BLAST flavors: http://www.ncbi.nlm.nih.gov/Sitemap/index.html#BLAST

- BLAST: BLASTN, BLASTP, TBLASTN, TBLASTX
- Psi-BLAST: Position-Specific Iterated BLAST
- RPS-BLAST: Reverse Position-Specific BLAST
- Phi-BLAST: Pattern Hit Initiated BLAST
- Mega-BLAST: 10 times faster than BLASTN
- BLAST2: pairwise comparisons
- WU-BLAST: Washington University BLAST

Download of NCBI BLAST tools: ftp://ftp.ncbi.nih.gov/toolbox/
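The basic BLAST programs differ only in the type of query and database and in whether nucleotide sequences are translated before comparison. The sketch below encodes that standard naming as a small lookup; the function name and its `translate` argument are hypothetical helpers for illustration, and BLASTX (translated nucleotide query against a protein database) is included although it is not on the list above.

```python
def choose_blast_program(query_type, db_type, translate=False):
    """Pick a basic BLAST program from the query and database types.

    query_type, db_type: "nucleotide" or "protein".
    translate: for nucleotide vs. nucleotide searches, compare the six-frame
    translations of both sides (TBLASTX) instead of the raw DNA (BLASTN).
    """
    key = (query_type, db_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translate else "blastn"
    if key == ("protein", "protein"):
        return "blastp"
    if key == ("protein", "nucleotide"):
        return "tblastn"  # database is translated in six frames
    if key == ("nucleotide", "protein"):
        return "blastx"   # query is translated; not listed on the slide
    raise ValueError(f"unsupported combination: {key}")


if __name__ == "__main__":
    # An EST (nucleotide) searched against a genomic database at the protein level
    print(choose_blast_program("nucleotide", "nucleotide", translate=True))
```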

Homework Assignment (finish only one assignment!)

- Go to http://www.ncbi.nlm.nih.gov/, select the protein database and run the query: P450 & hydroxylase & human [organism]. Under ‘Limits’, select SwissProt.
- Report the final query syntax from the ‘Details’ page.
- Save the GIs from this final query to a file (select the ‘GI List’ format under display) and report how many GIs you retrieved.
- Retrieve the corresponding sequences through Batch Entrez (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi), using the GI list file as query input, and save the sequences in FASTA format.
- Generate a multiple alignment and tree of these sequences using MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html).
- Save the multiple alignment and tree to a file.
- Identify the putative heme-binding cysteine.
- Open the corresponding SwissProt page (http://us.expasy.org/sprot/) for the first P450 sequence in your list and compare the putative heme-binding cysteine with the consensus pattern from the PROSITE database.
- Report the corresponding Pfam ID.
- How many mouse (Mus musculus) sequences are in this family? (Use the ‘species tree’ on the Pfam database.)
- Run BLASTP against the nr database (again using the first P450 in your list), click on “See Conserved Domains from CDD” (this runs RPS-BLAST), then click on the red P450 domain.
- Compare the resulting alignment with the result from MultAlin.
- View the 3D structure in Cn3D, highlight the heme-binding cysteine and save the structure (screenshot).
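Once the sequences are saved in FASTA format, locating a conserved residue with a PROSITE-style pattern can also be scripted. The sketch below is only an illustration: the helper names are hypothetical, the converter handles just the basic pattern elements (`x`, `[...]`, `{...}`, repeat counts), and both the toy sequences and the toy pattern are invented. It is not the actual PROSITE entry for cytochrome P450, which should be taken from the PROSITE database itself.

```python
import re


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def prosite_to_regex(pattern, capture=None):
    """Translate a simple PROSITE-style pattern into a Python regex.

    A literal residue equal to `capture` is wrapped in a group so that its
    position can be reported. Anchors (< and >) are not handled here.
    """
    pieces = []
    for element in pattern.strip(".").split("-"):
        core, _, repeat = element.partition("(")
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # excluded residues
        elif core == capture:
            piece = "(" + core + ")"           # residue of interest
        else:
            piece = core                       # single residue or [ABC] class
        if repeat:
            piece += "{" + repeat.rstrip(")") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_captured_residue(records, pattern, residue="C"):
    """Return {header: 1-based position of the captured residue, or None}.

    Assumes the pattern contains `residue` as a literal element.
    """
    regex = re.compile(prosite_to_regex(pattern, capture=residue))
    positions = {}
    for header, seq in records.items():
        match = regex.search(seq)
        positions[header] = match.start(1) + 1 if match else None
    return positions


# Toy data: sequences and pattern are invented for illustration only.
TOY_FASTA = """>toy1
MKTAGSRH
CLAAW
>toy2
MKTAAAAA
"""
TOY_PATTERN = "G-x(2)-{P}-C-[LIV]"

if __name__ == "__main__":
    print(find_captured_residue(read_fasta(TOY_FASTA), TOY_PATTERN))
```

The reported position can then be checked by eye against the conserved column in the MultAlin alignment.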

Remote Homology Detection

- Psi-BLAST/RPS-BLAST
- HMMs: HMMER, SAM
- Domain databases
- Fold recognition approaches (Meta Servers)

Protein Domain Databases (Selection)

- PFAM: http://pfam.wustl.edu/
- PROSITE: http://us.expasy.org/prosite/
- ProDom: http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
- InterPro: http://www.ebi.ac.uk/interpro/

Selection of Tools for Promoter Analysis

- Verbumculus, UC Riverside: http://www.cs.ucr.edu/%7Estelo/Verbumculus/
- AlignACE & ScanACE: http://arep.med.harvard.edu/mrnadata/mrnasoft.html
- MEME and META-MEME, San Diego Supercomputer Center: http://www.sdsc.edu/Research/biology/
- Regulatory Sequence Analysis Tools (RSA): http://rsat.ulb.ac.be/rsat/
- Gibbs Motif Sampler, Cold Spring Harbor: http://argon.cshl.org/ioschikz/gibbsDNA/mgibbsDNA-form.html
- Motif Sampler, searches for over-represented motifs: http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html
- Stanford, motif finding in upstream sequences: http://genome-www4.stanford.edu/cgi-bin/ewing/oligoAnalysis.pl
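Several of the tools above search upstream sequences for over-represented oligonucleotides. The core idea can be sketched as a comparison of k-mer frequencies between a set of upstream sequences and a background set. This is a simplified illustration under stated assumptions (hypothetical function names, a plain frequency ratio with a pseudocount, single-strand counting); it is not the statistical model used by any of the listed tools.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count all overlapping k-mers in a list of sequences (one strand only)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(upstream, background, k=6, min_ratio=2.0, pseudocount=1.0):
    """Return (kmer, count, ratio) tuples sorted by decreasing ratio.

    ratio = frequency in the upstream set / frequency in the background set.
    The pseudocount avoids division by zero for k-mers that are absent from
    the background; its value is an arbitrary illustrative choice.
    """
    fg, bg = count_kmers(upstream, k), count_kmers(background, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    hits = []
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * 4 ** k)
        ratio = fg_freq / bg_freq
        if ratio >= min_ratio:
            hits.append((kmer, n, ratio))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    # Invented toy sequences sharing the hexamer GACGTC
    upstream = ["TTGACGTCATT", "AAGACGTCAAA"]
    background = ["ATATATATATATATAT"]
    for kmer, n, ratio in overrepresented(upstream, background)[:3]:
        print(kmer, n, round(ratio, 1))
```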

Example: RSA

Promoter Databases (Selection)

- Regulatory Sequence Analysis Tools (RSA): http://rsat.ulb.ac.be/rsat/
- Eukaryotic Promoter Database: http://www.epd.isb-sib.ch/
- Human Promoter Database: http://zlab.bu.edu/%7Emfrith/HPD.html
- Arabidopsis: http://exon.cshl.org/cgi-bin/atprobe/atprobe.pl

Alternative Homework (do only one assignment!)

Work through the tutorial of the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Provide a short summary of the different tools.

2. Protein Modeling

Tool collection: http://faculty.ucr.edu/~tgirke/Links.htm

Databases:

Protein Data Bank
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data

More databases: http://faculty.ucr.edu/~tgirke/Links.htm#Databases

3. Microarrays and Chips

Definition: Hybridization-based technique that allows simultaneous analysis of thousands of samples on a solid substrate.

Applications (examples):
- Transcriptional profiling
- Gene copy number
- Resequencing
- Genotyping
- Single-nucleotide polymorphism
- DNA-protein interaction
- Insertional library screening
- Identification of new cell lines
- Etc.

Developing areas:
- Protein arrays
- Chemical arrays

Why Microarrays?

Simultaneous analysis of over 50,000 genes

Signaling and Metabolic Networks

Regulatory genes

First step in discovery of gene function

Prediction of limiting factors in biological processes

Rapid analysis of mutants and transgenics

Reduce time of costly clinical studies and field trials

DNA Arrays (gene expression)

Input samples: WT, mutants, transgenics, treatments (biotic, abiotic, chemicals)

Outputs: prognosis, diagnosis, target identification

Basic Analysis Steps

- Image analysis
- Filtering, background correction
- Standardization, scaling and normalization
- Significance analysis (replicates)
- Cluster analysis (time series)
- Integration with sequence and functional information
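The background correction and normalization steps listed above can be illustrated for a two-channel (Cy5/Cy3) array. This is a minimal sketch with hypothetical function names; the global background value, the intensity floor and the choice of median centering are simplifying assumptions, and dedicated tools apply more refined, often intensity-dependent, corrections.

```python
import math
from statistics import median


def log_ratios(cy5, cy3, background=0.0, floor=1.0):
    """Background-correct both channels and return log2(Cy5/Cy3) per spot.

    A single global background value and a floor of 1.0 (to avoid zero or
    negative intensities) are illustrative simplifications.
    """
    ratios = []
    for red, green in zip(cy5, cy3):
        red = max(red - background, floor)
        green = max(green - background, floor)
        ratios.append(math.log2(red / green))
    return ratios


def median_center(values):
    """Global normalization: shift log ratios so that their median is zero."""
    center = median(values)
    return [v - center for v in values]


if __name__ == "__main__":
    # Invented spot intensities for three genes
    cy5 = [400.0, 100.0, 200.0]
    cy3 = [100.0, 100.0, 100.0]
    raw = log_ratios(cy5, cy3)
    print(raw)                 # log2 ratios before normalization
    print(median_center(raw))  # after median centering
```

Median centering rests on the assumption that most genes on the array are not differentially expressed.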

Planning Steps of Transcriptional Profiling Experiments

1. Biological question(s), e.g.:

- Which genes are up or down-regulated in a mutant/transgenic line?

- Which genes cycle during a series of treatments?

2. Selection of best biological samples

- Minimize variability in sample collection.

3. Develop validation and follow-up strategy for expected expression hits

- e.g. real-time PCR and analysis of transgenics or mutants

4. Choose type of experiment

- pairwise: e.g. WT vs. mutant/transgenic

- series of time points or treatments: allows cluster analysis

5. Choose reference

- sample with the maximum number of expressed genes (maximum biological information)

- pooled RNA of all points: less variability from the reference, saves chips

[Figure: experimental design diagrams with wild-type (WT) and mutant (MT) samples at time points t1 to t5]


6. How many replicates?

- biological replicate: starts with sample collection

- technical replicate: usually starts with the same RNA isolation

- dye-swaps: (1) WT-Cy3:MT-Cy5, (2) WT-Cy5:MT-Cy3

7. Management of sample collection and RNA isolation

- Define a “realistic” volume

- RNA quality tests!!!!

8. cDNA/cRNA labeling

- Which labeling technique? RNA amplification, reliability, sensitivity, etc.

9. Array hybridizations and post-processing

10. Array scanning
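The dye-swap design in step 6 lets a gene-specific dye bias cancel out. With M1 = log2(Cy5/Cy3) on array (1) and M2 = log2(Cy5/Cy3) on array (2), the mutant-vs-WT log ratio is estimated as (M1 - M2)/2 and the dye bias as (M1 + M2)/2. A minimal sketch under these assumptions (function names are hypothetical, and the bias is assumed to be additive on the log scale):

```python
def combine_dye_swap(m_forward, m_reverse):
    """Estimate log2(MT/WT) per gene from a dye-swap pair.

    m_forward: log2(Cy5/Cy3) from array (1), WT-Cy3 : MT-Cy5
    m_reverse: log2(Cy5/Cy3) from array (2), WT-Cy5 : MT-Cy3
    An additive dye bias appears with the same sign in both arrays,
    so it cancels in the difference.
    """
    return [(f - r) / 2 for f, r in zip(m_forward, m_reverse)]


def dye_bias(m_forward, m_reverse):
    """Estimate the per-gene dye bias, which remains in the sum."""
    return [(f + r) / 2 for f, r in zip(m_forward, m_reverse)]


if __name__ == "__main__":
    # Invented values: true log ratios 1.0, -2.0, 0.0 with a dye bias of 0.5
    forward = [1.5, -1.5, 0.5]
    reverse = [-0.5, 2.5, 0.5]
    print(combine_dye_swap(forward, reverse))
    print(dye_bias(forward, reverse))
```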

Important Pattern Recognition (clustering) Methods

- Hierarchical clustering: single, average (UPGMA) and complete linkage
- Non-hierarchical clustering: Self-Organizing Maps (SOM), k-means
- Dimension reduction analysis: Principal Component Analysis
- Neural networks & machine learning
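The three hierarchical linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members: the minimum (single), the mean (average, UPGMA) or the maximum (complete). The sketch below shows this with a naive agglomerative procedure. The function names, the Euclidean distance and the stopping rule (a fixed number of clusters) are illustrative assumptions, and the implementation is far too slow for real array data.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hierarchical_clustering(profiles, n_clusters, linkage="average"):
    """Naive agglomerative clustering of expression profiles.

    Returns a list of clusters, each a list of row indices into `profiles`.
    linkage: "single" (min), "average" (UPGMA, mean) or "complete" (max).
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda dists: sum(dists) / len(dists),
    }[linkage]
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for x, y in combinations(range(len(clusters)), 2):
            dists = [euclidean(profiles[i], profiles[j])
                     for i in clusters[x] for j in clusters[y]]
            d = combine(dists)
            if best is None or d < best[0]:
                best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # x < y, so index x is unaffected
    return clusters


if __name__ == "__main__":
    # Invented log-ratio profiles for five genes over three time points
    profiles = [[0, 0, 0], [0.1, 0, 0], [5, 5, 5], [5, 5.2, 5], [10, 0, 10]]
    print(hierarchical_clustering(profiles, n_clusters=3))
```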

Tools for Microarray Analysis

- Image analysis: ScanAlyze
- Normalization: SNOMAD, R projects
- Mining/clustering: J-Express, R projects
- Much more: http://faculty.ucr.edu/%7Etgirke/Links.htm#Profiling

Example of an Integrated Clustering Tool: J-Express

Microarray Databases (Selection)

- Stanford Microarray Database (SMD): http://genome-www5.stanford.edu/MicroArray/SMD/
- Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/

Alternative Homework Assignment (do only one assignment!)

- Go to the SNOMAD page (Standardization and Normalization of Microarray Data): http://pevsnerlab.kennedykrieger.org/snomadinput.html

- Select “Use an Example dataset to see how SNOMAD works” and choose either option #2 (Incyte dataset) or #3 (Affymetrix dataset). If you prefer, you can use your own or other public data instead. A good resource for downloading public data is the Stanford site: http://genome-www5.stanford.edu/cgi-bin/SMD/publicData.pl

- Select all possible transformations and graphs and submit the data for processing.

- Report: Give a short description (one or two sentences) of each graph/transformation in the returned results.

4. Functional Groups

Assigning “Biological Meaning” to Profiling Data

Protein Families
- COGs (43 genomes, NCBI): http://www.ncbi.nlm.nih.gov/COG/
- Protein domain databases (PFAM)

Gene Ontology Consortium
- Definition: controlled vocabulary for all organisms
- http://www.geneontology.org/

Pathways
- KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/kegg2.html
- WIT Database (39 genomes): http://wit.mcs.anl.gov/WIT2/

Toolboxes for Bioinformaticians

Popular scripting languages
- Perl: http://www.perl.com/
- Python: http://www.python.org/

Bio* modules for processing data from databases and applications
- BioPerl: http://bio.perl.org/
- BioPython: http://biopython.org/
- BioJava: http://www.biojava.org/
- BioRuby: http://bioruby.org/

Statistics
- R: http://www.R-project.org
- BioConductor (Microarray): http://www.bioconductor.org/

Database systems
- MySQL: http://www.mysql.com/
- PostgreSQL: http://www.postgresql.org/
