2nd Lec Student Copy_3

8/12/2019 2nd Lec Student Copy_3


Introduction to Bioinformatics

2nd lecture

Muhammad Muddassir Ali

[email protected]

IBBT

Topics of this lecture

Course aims and learning goals

Databases (primary and secondary)

Different sequence formats

NCBI



Learning outcomes of the lecture

The student should be able to:

Define in detail the concept of a database and its types

Understand NCBI and the basic information related to it



Is a database just a collection of data?

DB = structured (organized) collection of data

Must have:

Routines for adding new entries and updating old entries

Methods for handling user queries, i.e. access to the data
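
The two requirements above (routines for adding and updating entries, and methods for handling queries) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and an in-memory database; the table layout and the function names are assumptions made for illustration and do not correspond to any real biological database.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one (hypothetical) table of entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "accession TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    return conn


def add_entry(conn, accession, description, sequence):
    """Routine for adding a new entry."""
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)",
        (accession, description, sequence),
    )


def update_description(conn, accession, description):
    """Routine for updating an old entry."""
    conn.execute(
        "UPDATE entries SET description = ? WHERE accession = ?",
        (description, accession),
    )


def find_by_keyword(conn, keyword):
    """Method for handling a user query: search the descriptions for a keyword."""
    rows = conn.execute(
        "SELECT accession FROM entries WHERE description LIKE ? ORDER BY accession",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]


conn = open_db()
add_entry(conn, "AB000263", "Homo sapiens mRNA", "ACAAGATGCC")
update_description(
    conn, "AB000263", "Homo sapiens mRNA for prepro cortistatin like peptide"
)
print(find_by_keyword(conn, "cortistatin"))
```

A plain collection of data would offer none of these three routines; that difference is what makes it a database.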

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.

They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.

Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, as well as similarities of biological sequences and structures.

Types of DBs (on the basis of the format or style in which data is stored):

Relational

Object-oriented

Flat file, i.e. all DB entries stored in text file(s)

Many biological DBs are in flat file format:

Historical reasons (that's what biologists started with)

Easy to distribute, download and store

No need for database management software



Types of DBs (on the basis of the kind of data included):

Primary database: data obtained directly from biological experiments

Secondary database: results derived from the analysis of primary database data



Examples

Primary databases

GenBank

Entrez

EMBL and DDBJ

The Sequence Retrieval System (SRS)

Swiss-Prot

UniProt

Secondary databases

PROSITE

PRINTS

Pfam

InterPro



Sequence formats

Plain sequence format

A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC

CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC

AGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTG

ACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGC

CCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTT

CTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGT

CACGCAAGTTTAATTACAGACCTGAA
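
Because a plain-format file holds nothing but one sequence, reading it amounts to joining the lines and dropping whitespace. The sketch below does that and also checks the letters; the function name and the accepted alphabet (A, C, G, T, U, N) are assumptions for illustration, since real files may use further ambiguity codes.

```python
def read_plain_sequence(text):
    """Join the lines of a plain-format sequence and check its letters."""
    # Remove line breaks, spaces and tabs, then normalise to upper case.
    sequence = "".join(text.split()).upper()
    # Assumed alphabet for this sketch: nucleotides only.
    unexpected = set(sequence) - set("ACGTUN")
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(sorted(unexpected)))
    return sequence


# First 50 bases of the example above, split over three lines.
example = """ACAAGATGCCATTGTCCCCC
GGCCTCCTGCTGCTGCTGCT
CTCCGGGGCC"""
seq = read_plain_sequence(example)
print(len(seq), seq[:10])
```

The check rejects a FASTA description line, for example, which is a simple way to notice that a file is not in plain format.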



A file in EMBL format can contain several sequences.

One sequence entry starts with an identifier line ("ID"), followed by further annotation lines.

The start of the sequence is marked by a line starting with "SQ", and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:

ID AB000263 standard; RNA; 240 BP.

XX

AC AB000263;

XX

DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

XX

SQ Sequence 240 BP;

acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60

ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120

caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180

aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240

//
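
The markers described above ("ID", "SQ" and "//") are enough to split an EMBL file into entries. The following is a minimal sketch, not a complete EMBL parser: the function name and the dictionary keys are assumptions, and only the ID, AC, DE and SQ line types are handled.

```python
def parse_embl(text):
    """Return one dict per entry (an ID ... // block) in EMBL-formatted text."""
    entries = []
    entry = None          # the entry currently being read, if any
    in_sequence = False   # True between the SQ line and the closing //
    for line in text.splitlines():
        if not line.strip():
            continue
        code, rest = line[:2], line[2:].strip()
        if line.startswith("//"):
            # Two slashes close the current entry.
            if entry is not None:
                entries.append(entry)
            entry, in_sequence = None, False
        elif in_sequence:
            # Keep the letters; drop spaces and the running base count.
            entry["sequence"] += "".join(c for c in line if c.isalpha())
        elif code == "ID":
            entry = {
                "id": rest.split()[0].rstrip(";"),
                "accession": "",
                "description": "",
                "sequence": "",
            }
        elif entry is None:
            continue
        elif code == "AC":
            entry["accession"] = rest.split(";")[0].strip()
        elif code == "DE":
            entry["description"] = (entry["description"] + " " + rest).strip()
        elif code == "SQ":
            in_sequence = True
    return entries


# The entry from the slide, shortened to its first sequence line.
example = """ID AB000263 standard; RNA; 240 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 240 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
//
"""
entries = parse_embl(example)
print(entries[0]["accession"], len(entries[0]["sequence"]))
```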



FASTA format

The name FASTA stands for "FAST-All": it covers both protein (FAST-P) and nucleotide (FAST-N) sequences. A file in FASTA format can contain several sequences.

Each sequence begins with a single-line description, followed by lines of sequence data.

The description line must begin with a greater-than (">") symbol in the first column.

>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGG

GTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAA

GCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCT

CGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGG

GCCCCTCATAGGAGAGG
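
The rule that every record starts with a ">" in the first column makes FASTA easy to read by program. A minimal sketch follows; the function name and the two short records are made up for illustration.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        elif description is not None and line.strip():
            # Sequence data may be wrapped over any number of lines.
            chunks.append("".join(line.split()))
    if description is not None:
        yield description, "".join(chunks)


# Two hypothetical records in one file.
example = """>seq1 first example
ACGT
AC
>seq2
GGCC
"""
records = list(parse_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```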



GCG format

A file in GCG format contains exactly one sequence.

It begins with annotation lines.

The start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier and the sequence length.

The format should only be used if the file was created with the GCG package.



An example sequence in GCG format is:

ID AB000263 standard; RNA; 121 BP.

XX

AC AB000263

XX

DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

XX

SQ Sequence 121 BP;

AB000263 Length: ..

1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg

61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg

121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
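
In GCG format the line ending with ".." separates the annotation from the sequence, so a reader only has to find that line. The sketch below assumes that everything after it is sequence data with leading position numbers; the function name is hypothetical and gap symbols are not handled.

```python
def parse_gcg(text):
    """Split GCG-formatted text into (annotation lines, sequence)."""
    header, sequence = [], []
    in_sequence = False
    for line in text.splitlines():
        if in_sequence:
            # Keep the letters; drop spaces and the leading position numbers.
            sequence.append("".join(c for c in line if c.isalpha()))
        elif line.rstrip().endswith(".."):
            # The line ending with two dots marks the start of the sequence.
            header.append(line.strip())
            in_sequence = True
        elif line.strip():
            header.append(line.strip())
    if not in_sequence:
        raise ValueError('no line ending with ".." found')
    return header, "".join(sequence)


# Shortened version of the example above (first two sequence lines).
example = """ID AB000263 standard; RNA; 121 BP.
XX
AB000263 Length: ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
"""
header, seq = parse_gcg(example)
print(header[-1], len(seq))
```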



GenBank format

A file in GenBank format can contain several sequences. One sequence entry starts with a line containing the word LOCUS, followed by a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN", and the end of the sequence is marked by two slashes ("//").

LOCUS AB000263 181 bp mRNA linear PRI 05-FEB-1999

DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.

ACCESSION AB000263

ORIGIN

1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg

61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg

121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc

181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag

//
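
The LOCUS, ORIGIN and "//" markers described above can be used in the same way to read a GenBank flat file. This is a minimal sketch under stated assumptions: the function name and dictionary keys are hypothetical, only the LOCUS and ACCESSION annotation lines are read, and the example record is shortened to 60 bases.

```python
def parse_genbank(text):
    """Return one dict per record (a LOCUS ... // block) in GenBank text."""
    records, record, in_origin = [], None, False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("//"):
            # Two slashes close the current record.
            if record is not None:
                records.append(record)
            record, in_origin = None, False
        elif line.startswith("LOCUS"):
            fields = line.split()
            record = {
                "locus": fields[1] if len(fields) > 1 else "",
                "accession": "",
                "sequence": "",
            }
            in_origin = False
        elif record is None:
            continue
        elif in_origin:
            # Keep the letters; drop spaces and the leading position numbers.
            record["sequence"] += "".join(c for c in stripped if c.isalpha())
        elif line.startswith("ACCESSION"):
            fields = line.split()
            record["accession"] = fields[1] if len(fields) > 1 else ""
        elif line.startswith("ORIGIN"):
            in_origin = True
    return records


# Shortened record: only the first sequence line, so the length is 60 bp here.
example = """LOCUS AB000263 60 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
//
"""
records = parse_genbank(example)
print(records[0]["locus"], records[0]["accession"], len(records[0]["sequence"]))
```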



NCBI

GenBank at NCBI (National Center for Biotechnology Information)

GenBank founded 1982; NCBI itself was established in 1988

Nucleotide sequences

Anyone can add new data:

BankIt: web-based submission of a single sequence

Sequin: software for larger submissions

Scientific journals require sequence submission



Locus

Unique, but may change

Shows organism

Here, SCU = S. cerevisiae

Accession number

Unique and permanent

No info, just a number

Most reliable identification

CDS

Coding sequence

Exons

Translation shown

Origin

The actual nucleotide sequence



Summary

Databases

Types of databases

Sequence formats

NCBI
