Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected] http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS January 2003

Lecture 5: Bioinformatics software


Page 1: Lecture 5:  Bioinformatics software

Essential Bioinformatics and Biocomputing (LSM2104: Section I)

Biological Databases and Bioinformatics Software

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, NUS
January 2003

Page 2: Lecture 5:  Bioinformatics software

Essential Bioinformatics and Biocomputing (LSM2104)

Lecture 5: Bioinformatics software

Outline:

– Types of bioinformatics software
  • Sequence, pattern and domain
  • Evolutionary analysis
  • Visualization
  • Modeling and prediction (sequence, structure and function)
  • Data mining (bibliographic and text searches)

– Examples

Page 3: Lecture 5:  Bioinformatics software

Types of Bioinformatics software

1. Analysis of biological data/systems and characterization of molecules and sequences.

2. Analysis and interpretation of experimental results.

3. Simulation of laboratory experiments, important for tackling large-scale problems.

4. Predictions that lead to the design of experiments.

5. Bioinformatics software can be accessed via the WWW, or through integrated software packages (such as EMBOSS, GCG, Staden, DNASTAR, …). It may be coupled with databases, or may stand alone.

Page 4: Lecture 5:  Bioinformatics software

Bioinformatics software
Major sources

• Software package at ExPASy Molecular Biology Server http://www.expasy.org ; http://au.expasy.org

• Software at PBIL (Pôle Bio-Informatique Lyonnais) http://pbil.univ-lyon1.fr/

• Toolbox at EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/Tools/index.html

Page 5: Lecture 5:  Bioinformatics software

Bioinformatics software

• Major types of bioinformatics tools
  • Sequence analysis tools
  • Sequence comparison
  • Pattern and domain search
  • Evolutionary analysis
  • Prediction of sequence structure and function
  • Visualization of molecular structures
  • Structure modeling
  • Bibliographic and text searches
  • Specialized and other tools

Page 6: Lecture 5:  Bioinformatics software

Bioinformatics software

Sequence analysis tools

This kind of software focuses on extraction and comparison of properties in DNA and protein sequences.

– Sequence analysis helps identify domains, structure, function, and other properties
– The analysis of individual sequences helps with sequence comparison

• Textbook chapter 5, pages 81-93

Page 7: Lecture 5:  Bioinformatics software

Bioinformatics software

Sequence analysis tools

This kind of software focuses on extraction and comparison of DNA and protein sequence properties such as

– composition of nucleotide or protein sequences
– codon usage in DNA
– translation and backtranslation

Textbook chapter 5, pages 81-93

Page 8: Lecture 5:  Bioinformatics software

Bioinformatics software
Composition of nucleotide or protein sequences

• Composition (the frequency of occurrence of each nucleotide or amino acid) is the most basic analysis. It can give us important functional and structural clues.

• For example, CG-rich regions called CpG islands are often found in promoters. A short region just before the splice site at the end of introns often has high C+T content.
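As a rough illustration of what a composition analysis computes, here is a minimal Python sketch. The function names and the example sequence are made up for this illustration; this is not the code behind any of the tools listed in this lecture.

```python
from collections import Counter


def composition(seq):
    """Return the fraction of each residue (nucleotide or amino acid) in seq."""
    seq = seq.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {residue: n / len(seq) for residue, n in counts.items()}


def gc_content(dna):
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


# Hypothetical sequence, chosen only to demonstrate the calls.
example = "ATGCGCGCGTTA"
print(composition(example))
print(gc_content(example))
```

A high value from `gc_content` over a window of a genomic sequence is the kind of signal that would prompt a closer look for a CpG island; real CpG island finders apply additional criteria that this sketch does not attempt.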

Page 9: Lecture 5:  Bioinformatics software

Bioinformatics software
Composition of protein and DNA sequences

• Web:
  – NPS@ Network Protein Sequence @nalysis http://npsa-pbil.ibcp.fr/ (Amino-acid composition)
  – AA Composition http://molbiol.soton.ac.uk/compute/aacomp.html

• JEMBOSS (in our own laboratory)
  – http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, composition, compseq)

Page 10: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 11: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 12: Lecture 5:  Bioinformatics software

Bioinformatics software
Codon usage in DNA

• Web:
  – Count-codon program in the Codon Usage Database http://www.kazusa.or.jp/codon/countcodon.html (needs start and stop codons at the start and the end of the sequence)
  – Tool for Gene to Codon Usage Table http://www.entelechon.com/eng/genetocut.html (does not care about start and stop codons)

• JEMBOSS (in the laboratory)
  – http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, codon usage, cusp)

A DNA coding region should contain only one in-frame stop codon, at its end.
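The idea of a codon usage count, and the check that a coding region has a single stop codon, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the standard genetic code (stop codons TAA, TAG, TGA) and a sequence that starts in frame, and the function names are hypothetical rather than taken from any of the tools above.

```python
from collections import Counter

# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(cds):
    """Split a coding sequence into codons, reading in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(cds):
    """Count how many times each codon occurs in the coding sequence."""
    return Counter(split_codons(cds))


def has_single_terminal_stop(cds):
    """Return True if the only in-frame stop codon is the last codon."""
    codons = split_codons(cds)
    stops = [i for i, codon in enumerate(codons) if codon in STOP_CODONS]
    return stops == [len(codons) - 1]


# Made-up coding sequence: ATG GCC GCC TAA
print(codon_usage("ATGGCCGCCTAA"))
print(has_single_terminal_stop("ATGGCCGCCTAA"))
```

A sequence that fails `has_single_terminal_stop` is either read in the wrong frame, truncated, or not a single complete coding region.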

Page 13: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 14: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 15: Lecture 5:  Bioinformatics software

Bioinformatics software
Translation (DNA to protein) and back translation (protein to DNA)

• Web:
  – Translate tool at ExPASy http://au.expasy.org/tools/dna.html (DNA to protein)

• JEMBOSS (in the laboratory)
  – http://srs1.bic.nus.edu.sg/jnlp/ (DNA to protein and reverse) (nucleic, translation, transeq; nucleic, translation, backtranseq)

If we translate a DNA sequence and then back translate the resulting protein, we will typically not recover the starting sequence, because most amino acids are encoded by more than one codon.
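The round-trip problem can be seen in a small Python sketch. It assumes the standard genetic code and, for back translation, simply picks the first codon listed for each amino acid; that choice is arbitrary and made only for this illustration (it is not how any particular tool chooses codons). All names here are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One arbitrary codon per amino acid: the first one in table order.
FIRST_CODON = {}
for codon, aa in CODON_TABLE.items():
    FIRST_CODON.setdefault(aa, codon)


def translate(dna):
    """Translate DNA to protein, reading in frame from position 0."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def back_translate(protein):
    """Return one of the many DNA sequences that encode the protein."""
    return "".join(FIRST_CODON[aa] for aa in protein.upper())


original = "ATGGCGCTGTAA"          # made-up sequence: ATG GCG CTG TAA
protein = translate(original)      # "MAL*"
recovered = back_translate(protein)
print(protein, recovered, recovered == original)
```

Here the recovered DNA differs from the original even though both encode the same protein; translating `recovered` again gives back `MAL*`.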

Page 16: Lecture 5:  Bioinformatics software

Bioinformatics software
Sequence comparison (the most important software)

This will be taught next month by A/P Tan Tin Wee.

Web:

• Local alignment (BLAST, FASTA)
  – http://www.ebi.ac.uk/fasta33/
  – http://www.ncbi.nlm.nih.gov/BLAST/
  – http://www.ebi.ac.uk/blast2/

• Multiple alignment (Clustal W)
  – http://www.ebi.ac.uk/clustalw/index.html

• JEMBOSS (in the laboratory)
  – http://srs1.bic.nus.edu.sg/jnlp/
    Local alignment: Smith-Waterman (alignment, local, water)
    Global alignment: Needleman-Wunsch (alignment, global, needle)
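To give a feel for the difference between global and local alignment ahead of the later lectures, the sketch below computes only the optimal alignment score with each method. It uses a deliberately simple scoring scheme (match +1, mismatch -1, gap -1) that is an assumption of this example; real programs typically use substitution matrices and more elaborate gap penalties. The function names and test sequences are made up.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: the best-matching pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


x = "TTTGATTACATTT"
y = "CCGATTACACC"
print(needleman_wunsch_score(x, y))  # penalised by the unrelated flanks
print(smith_waterman_score(x, y))    # rewards only the shared core
```

The two sequences share the core `GATTACA` but have unrelated ends, so the local score is higher than the global one.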

Page 17: Lecture 5:  Bioinformatics software

Bioinformatics software
Evolutionary analysis

• Multiple sequence alignments can be used to estimate the evolutionary distance between proteins. Phylogeny programs are used to represent the evolutionary distances between sequences.

• WebPhylip
  • http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/

• GeneBee
  • http://www.genebee.msu.su/services/phtree_reduced.html

Read textbook, page 83.
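One simple way to turn an alignment into distances is to count the proportion of aligned positions at which two sequences differ. The Python sketch below does this for a toy alignment; the measure, the gap handling (positions with a gap in either sequence are skipped), the names, and the sequences are all assumptions made for illustration, and phylogeny packages generally offer more refined distance models.

```python
def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences.

    Positions where either sequence has a gap ("-") are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise distances for a dict mapping sequence name to aligned sequence."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}


# Made-up alignment of three short sequences.
toy_alignment = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "AC-TTCGA",
}
for pair, dist in distance_matrix(toy_alignment).items():
    print(pair, round(dist, 3))
```

A tree-building method would then take such a matrix as input; that step is not shown here.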

Page 18: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 19: Lecture 5:  Bioinformatics software

Bioinformatics software
Prediction of sequence structure and function

• Sequences that have similar structure often have similar function. For many sequences we can extract secondary and tertiary structure from the PDB database.

• What if our sequence is not in the PDB? We can predict the structure of a biological sequence using appropriate software.

• There are several programs for prediction of secondary structure. For prediction of tertiary structure we can use modelling.

• http://npsa-pbil.ibcp.fr (PHD method for secondary structure prediction)

Page 20: Lecture 5:  Bioinformatics software

Bioinformatics software

• Secondary structure prediction:

Page 21: Lecture 5:  Bioinformatics software

Bioinformatics software

• Secondary structure prediction:
  – The PHD program predicted four alpha helices in human IL-2 (red). The number of helices is correct, but their lengths and boundaries are not (purple).
  – When we make a prediction in bioinformatics, we must have an idea of the accuracy of the prediction programs.
  – To assess the accuracy of a program, we can test it with known data. The test must include enough examples for us to draw reasonable conclusions.

Page 22: Lecture 5:  Bioinformatics software

Bioinformatics software
Secondary structure prediction

• alpha-Lactalbumin (PDB 1A4V)
• http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html

Page 23: Lecture 5:  Bioinformatics software

Bioinformatics software

• We used nine different programs to predict the secondary structure of alpha-Lactalbumin (PDB 1A4V).

• For this molecule the best predictions came from “Predator”, while DSC performed worst.

• This test does not mean that Predator is the best of the tested programs, nor that DSC is the worst. To draw such conclusions we must first build a test set. The test set should contain examples from the family of proteins that our query protein belongs to.

• The learning point: none of the prediction programs is 100% accurate, and this applies to all bioinformatics software, not only secondary structure prediction. Users must be cautious when interpreting results from predictive software.
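A test of this kind can be sketched as follows: compare each predicted secondary structure string with the known one, residue by residue, and average over the test set. The per-residue agreement measure, the one-letter state codes (H, E, C), and the example strings are assumptions of this sketch, not results from the programs mentioned above.

```python
def per_residue_agreement(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def mean_agreement(pairs):
    """Average agreement over a test set of (predicted, observed) pairs."""
    scores = [per_residue_agreement(p, o) for p, o in pairs]
    return sum(scores) / len(scores)


# Made-up test set: H = helix, E = strand, C = coil.
test_set = [
    ("HHHHCCEEEE", "HHHCCCEEEE"),
    ("CCHHHHHHCC", "CCHHHHHCCC"),
    ("EEEECCCCHH", "EEECCCCHHH"),
]
print(mean_agreement(test_set))
```

Running the same test set through several programs and comparing their `mean_agreement` values gives a fairer ranking than a single protein, provided the test set is large enough and relevant to the query protein.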

Page 24: Lecture 5:  Bioinformatics software

Bioinformatics software

• Common measures (other measures also exist)
  • Sensitivity SE = TP/(TP+FN)
  • Specificity SP = TN/(TN+FP)

• For example, prediction of binding peptides to a particular receptor:

              Experimental   Predicted    Class
  Example 1   Binder         Binder       True positive (TP)
  Example 2   Non-binder     Non-binder   True negative (TN)
  Example 3   Binder         Non-binder   False negative (FN)
  Example 4   Non-binder     Binder       False positive (FP)

• A prediction system that has SE=0.8 and SP=0.9 will correctly predict 8 of every 10 experimental positives, and for every 10 experimental negatives it will make one false positive prediction. This accuracy may be very good for prediction of peptide binding, but is not very good for some other predictions, for example gene prediction.
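The two measures are straightforward to compute once predictions have been compared with experimental results. The Python sketch below uses made-up data arranged to reproduce the SE=0.8, SP=0.9 case described above; the function names are hypothetical.

```python
def confusion_counts(experimental, predicted):
    """Count TP, TN, FN, FP from parallel lists of booleans (True = binder)."""
    tp = tn = fn = fp = 0
    for exp, pred in zip(experimental, predicted):
        if exp and pred:
            tp += 1
        elif not exp and not pred:
            tn += 1
        elif exp and not pred:
            fn += 1
        else:
            fp += 1
    return tp, tn, fn, fp


def sensitivity(tp, fn):
    """SE = TP / (TP + FN): fraction of experimental positives found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SP = TN / (TN + FP): fraction of experimental negatives rejected."""
    return tn / (tn + fp)


# Made-up data: 10 experimental binders, 10 experimental non-binders.
experimental = [True] * 10 + [False] * 10
predicted = [True] * 8 + [False] * 2 + [False] * 9 + [True] * 1

tp, tn, fn, fp = confusion_counts(experimental, predicted)
print(sensitivity(tp, fn), specificity(tn, fp))
```

With these data the counts are TP=8, FN=2, TN=9 and FP=1.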

Page 25: Lecture 5:  Bioinformatics software

Bioinformatics software

• Prediction of 3-D structure

• Various modelling programs
  – comparative modelling, using known structures as templates
  – ab initio modelling, using atomic simulation, residue statistics, etc.

• These methods will be covered later in the course

• An example of comparative modelling software is SWISS-MODEL http://www.expasy.org/swissmod/SWISS-MODEL.html

• The resulting model is returned by email.

• This tool has a facility for assessing the quality of predictions

Page 26: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 27: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 28: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 29: Lecture 5:  Bioinformatics software

Bioinformatics software

• Software for visualisation of 3-D structures. It provides different views of a 3-D molecular structure; this will be taught by A/P Shoba.
  – Chime, RasMol (they use files in PDB format)
  – The Scorpion database uses Chime. Chime can be downloaded from: http://www.mdli.com/downloads/downloads.html?uid=&key=&id=1

Page 30: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 31: Lecture 5:  Bioinformatics software

Bioinformatics software

Page 32: Lecture 5:  Bioinformatics software

Bioinformatics software

• Text searches

• Text-searching software is used in association with databases. Most commonly we search by keywords or combinations of keywords.

• Examples of PubMed searches (with number of matches):
  – Diabetes: 181,672
  – Diabetes AND IDDM: 35,841
  – Diabetes AND IDDM AND autoimmunity: 1,109
  – Diabetes OR autoimmunity: 190,674
  – Diabetes[Title/Abstract]: 114,624

• The last example uses a more advanced PubMed option, “Preview/Index”.
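The effect of AND and OR on the number of matches can be illustrated with sets. The sketch below uses a tiny made-up index of records and keywords; it does not query PubMed, and the record IDs, keywords, and function name are hypothetical.

```python
# Toy index: record ID -> set of keywords. All entries are made up.
RECORDS = {
    1: {"diabetes", "iddm", "autoimmunity"},
    2: {"diabetes", "iddm"},
    3: {"diabetes"},
    4: {"autoimmunity"},
    5: {"asthma"},
}


def search(term, records=RECORDS):
    """Return the IDs of records whose keywords include term (case-insensitive)."""
    term = term.lower()
    return {rid for rid, words in records.items() if term in words}


diabetes = search("Diabetes")
iddm = search("IDDM")
autoimmunity = search("autoimmunity")

print(len(diabetes))                        # single keyword
print(len(diabetes & iddm))                 # AND narrows the result
print(len(diabetes & iddm & autoimmunity))  # each extra AND narrows it further
print(len(diabetes | autoimmunity))         # OR widens the result
```

The same set logic explains the counts above: AND can only shrink a result set and OR can only grow it. Assuming the counts were taken at the same time, 190,674 - 181,672 = 9,002 records matched autoimmunity without matching Diabetes.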

Page 33: Lecture 5:  Bioinformatics software

Bioinformatics software
Summary of Today’s lecture

• Why bioinformatics software?

• Types of software: sequence, motif, evolution, visualization, structural modeling, simulation, text search.

• Examples of selected software:
  – Sequence composition
  – DNA-protein sequence translation
  – Evolutionary analysis
  – Protein secondary structure prediction
  – Comparative modeling
  – Text search

• To be taught later: sequence comparison, visualization, etc.

Page 34: Lecture 5:  Bioinformatics software

Summary of the Section:
Biological databases and bioinformatics software

• We first focused on biological databases. We covered:
  – types of biological databases
  – brief descriptions of popular databases
  – structure of the GenBank and SWISS-PROT entries
  – searching biological databases
  – types of questions that can be answered by searching databases
  – completeness and errors in the databases

Page 35: Lecture 5:  Bioinformatics software

Summary of the Section:
Biological databases and bioinformatics software

• The second topic was bioinformatics software. We covered:
  – why we need bioinformatics software
  – the major types of bioinformatics software
  – software for sequence composition, codon usage, translation and backtranslation
  – the concept of sequence alignment, and evolutionary analysis
  – secondary and tertiary structure prediction, and molecular visualization
  – accuracy of prediction software
  – text searching