2nd Lec Student Copy_3

  • 8/12/2019 2nd Lec Student Copy_3

    Introduction to Bioinformatics

    2nd lecture

    Muhammad Muddassir Ali

    [email protected]

    IBBT

    Topics of this lecture

    Course aims and learning goals

    Databases (primary and secondary)

    Different sequence formats

    NCBI

    Learning outcomes of the lecture

    The student should be able to:

    Define in detail the concept of a database and its types

    Understand the NCBI and its related basic information

    Is a database just a collection of data?

    DB = structured (organized) collection of data

    Must have:

    Routines for adding new entries and updating old entries

    Methods for handling user queries, i.e. access to the data
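
    The two requirements above (routines for adding and updating entries, and methods for handling queries) can be illustrated with a minimal sketch. The class name `TinyDatabase` and its methods are hypothetical, chosen only for illustration; real biological databases rely on far more elaborate software.

```python
class TinyDatabase:
    """A toy structured collection: entries keyed by a unique identifier."""

    def __init__(self):
        self._entries = {}

    def add(self, entry_id, record):
        """Add a new entry; refuse to overwrite an existing one."""
        if entry_id in self._entries:
            raise KeyError(f"entry {entry_id!r} already exists")
        self._entries[entry_id] = dict(record)

    def update(self, entry_id, **changes):
        """Update fields of an existing entry."""
        self._entries[entry_id].update(changes)

    def query(self, **criteria):
        """Return the sorted ids of entries whose fields match all criteria."""
        return sorted(
            entry_id
            for entry_id, record in self._entries.items()
            if all(record.get(key) == value for key, value in criteria.items())
        )


db = TinyDatabase()
db.add("AB000263", {"organism": "Homo sapiens", "molecule": "mRNA"})
db.update("AB000263", note="toy annotation")  # hypothetical field
print(db.query(organism="Homo sapiens"))
```

    A plain dictionary is enough here because every entry has a unique identifier; the point is only that a database couples stored data with routines for maintaining and querying it.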

    Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.

    They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.

    Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, as well as similarities of biological sequences and structures.

    http://en.wikipedia.org/wiki/Genomics

    http://en.wikipedia.org/wiki/Proteomics

    http://en.wikipedia.org/wiki/Metabolomics

    http://en.wikipedia.org/wiki/Microarray

    http://en.wikipedia.org/wiki/Phylogenetics
    Types of DBs:

    Relational

    Object-oriented

    Flat file, i.e. all DB entries stored in text file(s)

    Many biological DBs are in flat file format:

    Historical reasons (that's what biologists started with)

    Easy to distribute, download and store

    No need for database management software

    (This classification is on the basis of format, i.e. the style in which the data is stored.)

    Types of DBs:

    On the basis of the kind of data included:

    Primary database

    Secondary database

    Biological experiments -> Primary database -> analysis -> Secondary database

    Examples

    Primary databases

    GenBank

    Entrez (NCBI's search and retrieval system)

    EMBL and DDBJ

    The Sequence Retrieval System (SRS)

    SWISS-PROT

    UniProt

    Secondary databases

    PROSITE

    PRINTS

    Pfam

    InterPro

    Sequence formats

    Plain sequence format

    A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.

    An example sequence in plain format is:

    ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC

    CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC

    AGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTG

    ACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGC

    CCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGC

    ACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTT

    CTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGT

    CACGCAAGTTTAATTACAGACCTGAA
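
    Because a plain-format file holds exactly one sequence and no annotation, reading it amounts to joining the lines and discarding whitespace. A minimal sketch, assuming the file content is already available as a string (the function name `read_plain_sequence` is hypothetical):

```python
def read_plain_sequence(text):
    """Return the single sequence in plain-format text.

    All whitespace (spaces, tabs, line breaks) is dropped and the
    result is upper-cased.
    """
    return "".join(text.split()).upper()


# First two lines of the example sequence, as they might appear in a file.
plain_text = """ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC
CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC
"""
sequence = read_plain_sequence(plain_text)
print(len(sequence))
```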

    EMBL format

    A file in EMBL format can contain several sequences.

    One sequence entry starts with an identifier line ("ID"), followed by further annotation lines.

    The start of the sequence is marked by a line starting with "SQ", and the end of the sequence is marked by two slashes ("//").

    An example sequence in EMBL format is:

    ID AB000263 standard; RNA; 240 BP.

    XX

    AC AB000263;

    XX

    DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

    XX

    SQ Sequence 240 BP;

    acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60

    ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120

    caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180

    aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240

    //
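
    The three markers described above ("ID", "SQ" and "//") are enough to pull the identifier and the sequence out of each entry. The following is a minimal sketch under that assumption, not a full EMBL parser; the function name `parse_embl` is hypothetical, and all annotation lines other than "ID" are simply ignored.

```python
def parse_embl(text):
    """Yield (identifier, sequence) for each entry in EMBL-style text."""
    entry_id, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            # Two slashes close the current entry.
            yield entry_id, "".join(chunks)
            entry_id, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only: drops the spaces and the trailing position count.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            entry_id = line.split()[1]
        elif line.startswith("SQ"):
            in_sequence = True


embl_text = """ID AB000263 standard; RNA; 240 BP.
XX
AC AB000263;
XX
SQ Sequence 240 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
//
"""
entries = list(parse_embl(embl_text))
for identifier, sequence in entries:
    print(identifier, len(sequence))
```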

    FASTA format

    FASTA stands for "FAST-All": it handles both protein (FAST-P) and nucleotide (FAST-N) sequences.

    A file in FASTA format can contain several sequences.

    Each sequence begins with a single-line description, followed by lines of sequence data.

    The description line must begin with a greater-than (">") symbol in the first column.

    >AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368

    ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGG

    GTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAA

    GCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCT

    CGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGG

    GCCCCTCATAGGAGAGG
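
    The rule that every record starts with a ">" description line makes FASTA easy to read by hand. A minimal sketch follows; the function name `parse_fasta` and the two toy records (`seq1`, `seq2`) are hypothetical and serve only to show several sequences in one file.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


fasta_text = """>seq1 first toy sequence
ACAAGATGCC
ATTGTCCCCC
>seq2 second toy sequence
GGCCTCCTGC
"""
records = parse_fasta(fasta_text)
for description, sequence in records:
    print(description.split()[0], len(sequence))
```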

    GCG format

    Contains exactly one sequence

    Begins with annotation lines

    The start of the sequence is marked by a line ending with two dot ("..") characters; this line also contains the sequence identifier and the sequence length.

    Should only be used if the file was created with the GCG package.

    An example sequence in GCG format is:

    ID AB000263 standard; RNA; 121 BP.

    XX

    AC AB000263

    XX

    DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

    XX

    SQ Sequence 121 BP;

    AB000263 Length: ..

    1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg

    61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg

    121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
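
    In GCG format everything before the line ending in ".." is annotation, and everything after it is sequence data with leading position numbers. A minimal sketch under that assumption; the function name `parse_gcg` and the simplified toy record (`TOY0001`) are hypothetical, and files written by the GCG package carry more fields on the ".." line than are shown here.

```python
def parse_gcg(text):
    """Return (identifier, sequence) from GCG-style text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            header, body = line, lines[index + 1:]
            break
    else:
        raise ValueError("no line ending with '..' found")
    # Keep letters only: drops position numbers and spaces.
    sequence = "".join(ch for body_line in body for ch in body_line if ch.isalpha())
    return header.split()[0], sequence


gcg_text = """ID TOY0001 standard; DNA; 20 BP.
XX
TOY0001 Length: 20 ..
1 acaagatgcc attgtccccc
"""
identifier, sequence = parse_gcg(gcg_text)
print(identifier, len(sequence))
```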

    GenBank format

    A file in GenBank format can contain several sequences.

    Each sequence starts with a line containing the word LOCUS and a number of annotation lines.

    The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

    LOCUS AB000263 181 bp mRNA linear PRI 05-FEB-1999

    DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

    ACCESSION AB000263

    ORIGIN

    1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg

    61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg

    121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc

    181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag

    //
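
    The same marker-based reading works for GenBank files: "LOCUS" opens a record, "ORIGIN" starts the sequence and "//" closes it. A minimal sketch, not a full GenBank parser; the function name `parse_genbank` and the two toy records (`TOY0001`, `TOY0002`) are hypothetical.

```python
def parse_genbank(text):
    """Return a list of (locus_name, sequence) pairs from GenBank-style text."""
    records = []
    locus, chunks, in_origin = None, [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "//":
            # Two slashes close the current record.
            records.append((locus, "".join(chunks)))
            locus, chunks, in_origin = None, [], False
        elif in_origin:
            # Keep letters only: drops position numbers and spaces.
            chunks.append("".join(ch for ch in stripped if ch.isalpha()))
        elif stripped.startswith("LOCUS"):
            locus = stripped.split()[1]
        elif stripped.startswith("ORIGIN"):
            in_origin = True
    return records


genbank_text = """LOCUS TOY0001 20 bp DNA linear
DEFINITION Hypothetical toy record.
ACCESSION TOY0001
ORIGIN
1 acaagatgcc attgtccccc
//
LOCUS TOY0002 10 bp DNA linear
ORIGIN
1 ggcctcctgc
//
"""
genbank_records = parse_genbank(genbank_text)
for locus_name, sequence in genbank_records:
    print(locus_name, len(sequence))
```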

    NCBI

    GenBank is hosted at NCBI (National Center for Biotechnology Information)

    GenBank founded 1982 (NCBI itself was established in 1988)

    Nucleotide sequences

    Anyone can add new data:

    BankIt: web-based submission of a single sequence

    Sequin: software for larger submissions

    Scientific journals require sequence submission

    Locus

    Unique, but may change

    Shows organism

    Here, SCU = S. cerevisiae

    Accession number

    Unique and permanent

    No info, just a number

    Most reliable identification

    CDS

    Coding sequence

    Exons

    Translation shown

    Origin

    The actual nucleotide sequence

    Summary

    Databases

    Types of databases

    Sequence formats

    NCBI