http://www.faculty.ucr.edu/~tgirke/Teaching/Gen240B_2003.ppt
Web-based/Open-source Tools for Bioinformatics and
Genome Analysis
Bioinformatics Areas
A. Traditional Bioinformatics
- Sequence analysis
- Gene expression analysis
- Proteomics
- Metabolic profiling
- Phenotypes
- Networks
B. Structural Bioinformatics
- Molecular modeling
- Drug design
C. Biological Databases
Systems Biology
Focus of this Seminar
1. Sequences
2. Structure
3. Expression
4. Functional Groups
Bio* Projects and Databases
1. Some Analysis Steps
- Fragment assembly: ESTs and genes
- Mapping
- Annotation
- Gene predictions: ORFs, UTRs, introns, exons, promoters. Lots of errors in eukaryote genomes!
- Similarity searches: BLAST, FASTA, Smith-Waterman
- Gene families: domain databases, multiple alignments
- Structure/Function: 2D, 3D structure (availability?)
Important Sequence Databases (Selection)
NCBI
- Entrez: http://www.ncbi.nlm.nih.gov/
- Batch Entrez: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi
- Downloads: ftp://ftp.ncbi.nih.gov/blast/db/
EMBL-EBI
- General: http://www.ebi.ac.uk/
- Downloads: http://www.ebi.ac.uk/FTP/
Swiss-Prot
- General: http://us.expasy.org/
- Downloads: http://us.expasy.org/expasy_urls.html
TIGR
- General: http://www.tigr.org/
- Downloads: ftp://ftp.tigr.org/pub/data/
Protein Data Bank (PDB)
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data
Example: NCBI
Sequence Database Searches
Important search algorithms: Smith-Waterman, FASTA, BLAST
BLAST Flavors: http://www.ncbi.nlm.nih.gov/Sitemap/index.html#BLAST
- BLAST: BLASTN, BLASTP, TBLASTN, TBLASTX
- Psi-BLAST: Position-Specific Iterated BLAST
- RPS-BLAST: Reverse Position-Specific BLAST
- Phi-BLAST: Pattern Hit Initiated BLAST
- Mega-BLAST: 10x faster than BLASTN
- BLAST2: pairwise comparisons
- WU-BLAST: Washington University BLAST
Download of NCBI BLAST tools: ftp://ftp.ncbi.nih.gov/toolbox/
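The basic BLAST flavors differ mainly in the type of query, the type of database, and whether nucleotide sequences are translated before comparison. The lookup below is a minimal sketch of that relationship. The function name and the key layout are illustrative assumptions, not part of any NCBI tool, and BLASTX is included for completeness although the list above does not name it.

```python
# (query type, database type, level of comparison) -> basic BLAST program.
# Nucleotide sequences compared at the protein level are translated in all
# six reading frames by the program itself.
PROGRAMS = {
    ("nucleotide", "nucleotide", "nucleotide"): "blastn",
    ("protein", "protein", "protein"): "blastp",
    ("nucleotide", "protein", "protein"): "blastx",
    ("protein", "nucleotide", "protein"): "tblastn",
    ("nucleotide", "nucleotide", "protein"): "tblastx",
}


def choose_blast_program(query_type, database_type, compare_as="protein"):
    """Return the basic BLAST program for a query/database combination.

    Hypothetical helper for illustration only.
    """
    key = (query_type, database_type, compare_as)
    if key not in PROGRAMS:
        raise ValueError(f"no basic BLAST program for {key}")
    return PROGRAMS[key]


if __name__ == "__main__":
    # Protein query against a nucleotide database (e.g. ESTs).
    print(choose_blast_program("protein", "nucleotide"))
```

For the homework that follows, a protein query against the nr protein database corresponds to the protein/protein entry.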
Homework Assignment
Finish only one assignment!
- Go to http://www.ncbi.nlm.nih.gov/, select the protein DB, run the query: P450 & hydroxylase & human [organism], and select SwissProt under ‘Limits’.
  - Report the final query syntax from the ‘Details’ page.
- Save the GIs from this final query to a file (select the ‘GI List’ format under display).
  - Report how many GIs you retrieved.
- Retrieve the corresponding sequences through Batch-Entrez (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi) using the GI list file as query input -> save the sequences in FASTA format.
- Generate a multiple alignment and tree of these sequences using MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html).
  - Save the multiple alignment and tree to file.
  - Identify the putative heme binding cysteine.
- Open the corresponding SwissProt page (http://us.expasy.org/sprot/) for the first P450 sequence in your list.
  - Compare the putative heme binding cysteine with the consensus pattern from the Prosite database.
  - Report the corresponding Pfam ID.
  - How many mouse (Mus musculus) sequences are in this family? (Use ‘species tree’ on the Pfam db.)
- Run BLASTP against the nr database (again using the first P450 in your list), select “See Conserved Domains from CDD” (this runs RPS-BLAST), and click on the red P450 domain.
  - Compare the resulting alignment with the result from MultAlin.
  - View the 3D structure in Cn3D, save the structure (screen shot) and highlight the heme binding cysteine.
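Comparing a sequence with a Prosite consensus pattern can also be done programmatically, because the PROSITE pattern syntax maps closely onto regular expressions. The sketch below converts a pattern and reports where it matches. The pattern string is assumed here to be the P450 cysteine heme-iron ligand signature and should be verified against the Prosite entry before use; the toy sequence and the function names are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate PROSITE pattern syntax into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")    # x(2,4): repeat counts
        element = element.replace("x", ".")                      # x: any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence termini
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched segment) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


# ASSUMPTION: pattern transcribed for illustration; check it against Prosite.
P450_PATTERN = "[FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]"

# Toy sequence (not a real protein) containing one match.
toy_sequence = "MKTFGQGPRICLGEEL"

for start, segment in find_motif(toy_sequence, P450_PATTERN):
    # The cysteine is the eighth element of this pattern, so it sits
    # seven residues after the start of the match.
    print("match at", start, segment, "cysteine at", start + 7)
```

The same function can be applied to each FASTA sequence retrieved in the assignment to check whether the cysteine identified in the alignment falls inside a pattern match.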
Remote Homology Detection
- Psi-BLAST/RPS-BLAST
- HMMs: HMMER, SAM
- Domain databases
- Fold recognition approaches (Meta Servers)
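Profile methods such as Psi-BLAST and RPS-BLAST score a sequence against position-specific scores derived from an alignment rather than against a single sequence. The following is a deliberately simplified sketch of that idea. It assumes gap-free aligned segments, a uniform background frequency and a flat pseudocount, none of which reflects how the real programs weight sequences or estimate pseudocounts; all names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from gap-free aligned segments."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # simplification: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as long as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(pssm)
    windows = range(len(sequence) - width + 1)
    return max((score(pssm, sequence[i:i + width]), i) for i in windows)


if __name__ == "__main__":
    # Toy alignment, made up for illustration.
    pssm = build_pssm(["CLG", "CIG", "CLG"])
    print(best_hit(pssm, "AAACLGAAA"))
```

Residues outside the 20-letter alphabet would raise a `KeyError` here; a real implementation needs to handle ambiguity codes and gaps.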
Protein Domain Databases (Selection)
PFAM http://pfam.wustl.edu/
PROSITE http://us.expasy.org/prosite/
ProDom http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
InterPro http://www.ebi.ac.uk/interpro/
Selection of Tools for Promoter Analysis
Verbumculus, UC Riverside
• http://www.cs.ucr.edu/%7Estelo/Verbumculus/
AlignACE & ScanACE
• http://arep.med.harvard.edu/mrnadata/mrnasoft.html
MEME and META-MEME, San Diego Supercomputer Center
• http://www.sdsc.edu/Research/biology/
Regulatory Sequence Analysis Tools (RSA)
• http://rsat.ulb.ac.be/rsat/
Gibbs Motif Sampler, Cold Spring Harbor
• http://argon.cshl.org/ioschikz/gibbsDNA/mgibbsDNA-form.html
Motif Sampler, searches for over-represented motifs
• http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html
Stanford, motif finding in upstream sequences
• http://genome-www4.stanford.edu/cgi-bin/ewing/oligoAnalysis.pl
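Several of these tools look for motifs that are over-represented in a set of upstream sequences. A very rough sketch of the simplest form of that idea is to count fixed-length words (k-mers) in the upstream set and compare their frequency with a background set. This is not the algorithm of any tool listed here (the samplers use probabilistic models); the function names, the pseudocount and the toy sequences are assumptions for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping word of length k in a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(upstream, background, k=6, min_ratio=2.0, pseudocount=1.0):
    """Return (k-mer, frequency ratio) pairs, highest ratio first."""
    fg = count_kmers(upstream, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    if fg_total == 0:
        return []
    hits = []
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        # The pseudocount keeps words absent from the background from
        # producing a division by zero.
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * 4 ** k)
        ratio = fg_freq / bg_freq
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    upstream = ["TTGACGTCATT", "AAGACGTCAAA", "CCGACGTCACC"]
    background = ["ATATATATATATATATAT", "GCGCGCGCGCGCGCGCGC"]
    for kmer, ratio in overrepresented(upstream, background)[:3]:
        print(kmer, round(ratio, 1))
```

A real analysis would also count both strands and assess significance against a proper background model, which is what the listed tools provide.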
Example: RSA
Promoter Databases (Selection)
Regulatory Sequence Analysis Tools (RSA) http://rsat.ulb.ac.be/rsat/
Eukaryotic Promoter Database http://www.epd.isb-sib.ch/
Human Promoter Database http://zlab.bu.edu/%7Emfrith/HPD.html
Arabidopsis http://exon.cshl.org/cgi-bin/atprobe/atprobe.pl
Alternative Homework
Do only one assignment!
Work through the tutorial of the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Provide a short summary of the different tools.
2. Protein Modeling
Tool collection: http://faculty.ucr.edu/~tgirke/Links.htm
Databases:
Protein Data Bank
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data
More databases: http://faculty.ucr.edu/~tgirke/Links.htm#Databases
3. Microarrays and Chips
Definition: Hybridization-based technique that allows simultaneous
analysis of thousands of samples on a solid substrate.
Applications: Examples
- Transcriptional profiling
- Gene copy number
- Resequencing
- Genotyping
- Single-nucleotide polymorphism
- DNA-protein interaction
- Insertional library screening
- Identification of new cell lines
- Etc.
Developing Areas:
- Protein arrays
- Chemical arrays
Why Microarrays?
Simultaneous analysis of over 50,000 genes
Signaling and Metabolic Networks
Regulatory genes
First step in discovery of gene function
Prediction of limiting factors in biological processes
Rapid analysis of mutants and transgenics
Reduce time of costly clinical studies and field trials
DNA Arrays (gene expression)
Input samples: WT; mutants, transgenics; treatments (biotic, abiotic, chemicals)
Outputs: prognosis, diagnosis, target identification
Basic Analysis Steps
- Image analysis
- Filtering, background correction
- Standardization, scaling and normalization
- Significance analysis (replicates)
- Cluster analysis (time series)
- Integration with sequence and functional information
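The background correction and normalization steps can be illustrated for a two-channel array with a minimal sketch: subtract a background estimate, form log2 ratios, and center them on their median so that most genes are assumed unchanged. The constant background values, the intensity floor and the choice of global median centering are assumptions for illustration; tools such as SNOMAD and the R packages offer more refined, intensity-dependent corrections.

```python
import math
import statistics


def log_ratios(cy5, cy3, background_cy5=0.0, background_cy3=0.0, floor=1.0):
    """Background-correct two channels and return log2(Cy5/Cy3) per spot."""
    ratios = []
    for red, green in zip(cy5, cy3):
        # The floor prevents non-positive values after background subtraction.
        red = max(red - background_cy5, floor)
        green = max(green - background_cy3, floor)
        ratios.append(math.log2(red / green))
    return ratios


def median_center(values):
    """Shift log ratios so that their median is zero (global normalization)."""
    center = statistics.median(values)
    return [value - center for value in values]


if __name__ == "__main__":
    # Toy intensities, made up for illustration.
    cy5 = [220.0, 450.0, 900.0, 1500.0]
    cy3 = [120.0, 200.0, 480.0, 300.0]
    raw = log_ratios(cy5, cy3, background_cy5=20.0, background_cy3=20.0)
    print([round(value, 2) for value in median_center(raw)])
```

Median centering rests on the assumption that most genes do not change between the two samples; where that does not hold, a different reference is needed.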
Planning Steps of Transcriptional Profiling Experiments
1. Biological question(s), e.g.:
- Which genes are up or down-regulated in a mutant/transgenic line?
- Which genes cycle during a series of treatments?
2. Selection of best biological samples
- Minimize variability in sample collection.
3. Develop validation and follow-up strategy for expected expression hits
- e.g. real-time PCR and analysis of transgenics or mutants
4. Choose type of experiment
- pairwise: e.g. WT vs. Mutant/Transgenic
- series of time points or treatments: allows cluster analysis
5. Choose Reference
- sample with maximum number of expressed genes (maximum biological information)
- pooled RNA of all points: less variability from reference, saves chips
[Figure: experimental design diagram; sample labels WTt1, WTt2, MTt1, MTt2 and the series WTt1 to WTt5]
6. How many replicates?
- biological replicate: starts with sample collection
- technical replicate: usually starts with the same RNA isolation
- dye-swaps: (1) WT-Cy3:MT-Cy5, (2) WT-Cy5:MT-Cy3
7. Management of sample collection and RNA isolation
- Define a “realistic” volume
- RNA quality tests!!!!
8. cDNA/cRNA labeling
- Which labeling technique? RNA amplification, reliability, sensitivity, etc.
9. Array hybridizations and post-processing
10. Array scanning
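The dye-swap design in step 6 can be made concrete with a little arithmetic. On slide 1 (WT-Cy3:MT-Cy5) the log ratio log2(Cy5/Cy3) estimates log2(MT/WT) plus any dye bias; on slide 2 (WT-Cy5:MT-Cy3) it estimates log2(WT/MT) plus the same bias. Half the difference therefore cancels the bias and half the sum estimates it. The sketch below assumes a purely multiplicative, gene-specific dye bias and uses made-up intensities; the function name is illustrative.

```python
import math


def dye_swap_log_ratio(slide1_cy3, slide1_cy5, slide2_cy3, slide2_cy5):
    """Combine a dye-swap pair for one gene.

    Slide 1: WT labeled with Cy3, MT with Cy5.
    Slide 2: WT labeled with Cy5, MT with Cy3.
    Returns (bias-corrected log2(MT/WT), estimated dye bias in log2 units).
    """
    m1 = math.log2(slide1_cy5 / slide1_cy3)  # log2(MT/WT) + bias
    m2 = math.log2(slide2_cy5 / slide2_cy3)  # log2(WT/MT) + bias
    log_ratio = (m1 - m2) / 2
    dye_bias = (m1 + m2) / 2
    return log_ratio, dye_bias


if __name__ == "__main__":
    # Toy example: the gene is 2-fold up in MT and Cy5 reads 1.5x too high.
    log_ratio, dye_bias = dye_swap_log_ratio(
        slide1_cy3=100.0, slide1_cy5=300.0,   # WT = 100, MT = 200 * 1.5
        slide2_cy3=200.0, slide2_cy5=150.0,   # MT = 200, WT = 100 * 1.5
    )
    print(round(log_ratio, 3), round(dye_bias, 3))
```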
Important Pattern Recognition (clustering) Methods
Hierarchical clustering
- single, average (UPGMA) and complete linkage
Non-hierarchical clustering
- Self Organizing Maps (SOM)
- k-means
Dimension Reduction Analysis
- Principal Component Analysis
Neural Networks & Machine Learning
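The three linkage rules for hierarchical clustering differ only in how the distance between two clusters is derived from the distances between their members: the minimum (single), the mean (average, UPGMA) or the maximum (complete). The following is a small, unoptimized sketch of agglomerative clustering with these rules. It assumes Euclidean distance between expression profiles and stops at a requested number of clusters instead of building a full tree; the names and toy data are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters of profile indices under a linkage rule."""
    distances = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(distances)
    if linkage == "complete":
        return max(distances)
    if linkage == "average":  # UPGMA: mean of all pairwise distances
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_clustering(profiles, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    if not 1 <= n_clusters <= len(profiles):
        raise ValueError("n_clusters must be between 1 and the number of profiles")
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(cluster) for cluster in clusters)


if __name__ == "__main__":
    # Toy profiles over three time points: two rising genes, two falling genes.
    profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 2.2, 1.1]]
    print(hierarchical_clustering(profiles, 2, linkage="average"))
```

For expression data a correlation-based distance is a common alternative to Euclidean distance; swapping the `euclidean` function is enough to try it.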
Tools for Microarray Analysis
- Image analysis: ScanAlyze
- Normalization: SNOMAD, R projects
- Mining/clustering: J-Express, R projects
- Much more: http://faculty.ucr.edu/%7Etgirke/Links.htm#Profiling
Example of an Integrated Clustering Tool: J-Express
Microarray Databases (Selection)
Stanford Microarray Database (SMD) http://genome-www5.stanford.edu/MicroArray/SMD/
Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/
Alternative Homework Assignment
Do only one assignment!
- Go to the SNOMAD page (Standardization and Normalization of Microarray Data): http://pevsnerlab.kennedykrieger.org/snomadinput.html
- Select “Use an Example dataset to see how SNOMAD works” and choose either option #2 (Incyte dataset) or #3 (Affymetrix dataset). If you prefer, you can use your own or other public data instead. A good resource for downloading public data is the Stanford site: http://genome-www5.stanford.edu/cgi-bin/SMD/publicData.pl
- Select all possible transformations and graphs and submit the data for processing.
- Report: Give a short description (one or two sentences) of each graph/transformation in the returned results.
4. Functional Groups
Assigning “Biological Meaning” to Profiling Data
Protein Families
- COGs (43 genomes, NCBI): http://www.ncbi.nlm.nih.gov/COG/
- Protein Domain Databases (PFAM)
Gene Ontology Consortium
- Definition: controlled vocabulary for all organisms
- http://www.geneontology.org/
Pathways
- KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/kegg2.html
- WIT Database (39 genomes): http://wit.mcs.anl.gov/WIT2/
Toolboxes for Bioinformaticians
Popular scripting languages
- Perl: http://www.perl.com/
- Python: http://www.python.org/
Bio* modules for processing data from databases and applications
- BioPerl: http://bio.perl.org/
- BioPython: http://biopython.org/
- BioJava: http://www.biojava.org/
- BioRuby: http://bioruby.org/
Statistics
- R: http://www.R-project.org
- BioConductor (Microarray): http://www.bioconductor.org/
Database systems
- MySQL: http://www.mysql.com/
- PostgreSQL: http://www.postgresql.org/
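To give a sense of the kind of routine task the Bio* modules take over, here is a minimal FASTA reader in plain Python. It is only a sketch under simple assumptions (identifiers are the first word of each header line, later duplicates overwrite earlier records); the Bio* projects provide tested parsers that cover far more formats and edge cases. The record names and sequences in the example are made up.

```python
def parse_fasta(lines):
    """Read FASTA-formatted lines into a dict of identifier -> sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            # Sequence lines belonging to one record are concatenated.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Toy records, made up for illustration.
    example = [
        ">seq1 hypothetical protein",
        "MKTFGQGP",
        "RICLGEEL",
        ">seq2",
        "MSTNPKPQ",
    ]
    for name, sequence in parse_fasta(example).items():
        print(name, len(sequence))
```

To read a file saved from Batch-Entrez, pass the open file object: `parse_fasta(open("sequences.fasta"))`, where the file name is a placeholder.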