MBV3070 Bioinformatics

MBV3070 Bioinformatics. Reading list: Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press 2002. 270 pages.

Reading list for MBV3070 - Bioinformatics

Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press 2002. 270 pages.

In addition:

1. Tom Kristensen: Sekvenssammenstillinger (Sequence alignments). 7 pages.

2. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

3. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods in Enzymology, 266:383-402.

4. ??? (Gene finding)

5. ???? (Microarrays)

Schedule

Introduction. Sequencing. Databases. Entrez and SRS. Dot plots
Pairwise sequence alignment
FASTA and BLAST
Multiple sequence alignment. ClustalW/ClustalX
Motifs, profiles, PSI-BLAST
Phylogeny
Genomes. Analysis of genomic DNA. Gene finding
Microarrays (Ola Myklebost/Ole Chr. Lindgjærde)
Protein modelling
Protein modelling
Protein modelling

Vincent Eijsink

Useful websites for MBV3070

Course home page: http://www.uio.no/studier/emner/matnat/molbio/MBV3070/v04/

Textbook home page: http://www.oup.com/uk/lesk/bioinf/

What is bioinformatics?

The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Other ways of defining bioinformatics

"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." Fredj Tekaia, Institut Pasteur

"The use of computers to store, retrieve, analyze or predict the composition or the structure of biomolecules." Damian Counsell, bioinformatics.org

“It tries experiments. It wakes up every morning, does a little mutagenesis, changes a nucleotide here and there, and sees how it works. If it’s a success, it keeps the notes. In this notebook, we have all of the information of the greatest experimental tinkerer ever.”

“For the last three and a half billion years, evolution has been taking notes.”

Dr. Eric Lander, Director of the Whitehead Institute/MIT Center for Genome Research

What does this mean?

Base symbols

A  Adenine
C  Cytosine
G  Guanine
T  Thymine
U  Uracil
R  Guanine / Adenine (puRine)
Y  Cytosine / Thymine (pYrimidine)
K  Guanine / Thymine (Keto)
M  Adenine / Cytosine (aMino)
S  Guanine / Cytosine (Strong)
W  Adenine / Thymine (Weak)
B  Guanine / Thymine / Cytosine (not A)
D  Guanine / Adenine / Thymine (not C)
H  Adenine / Cytosine / Thymine (not G)
V  Guanine / Cytosine / Adenine (not T)
N  Adenine / Guanine / Cytosine / Thymine

Why ambiguity symbols?

Sequencing instruments cannot always read the sequence unambiguously.

Ambiguity symbols are useful in consensus sequences.

Sequence 1  aagcggtaccag
Sequence 2  aaacagcaccaa
Consensus   aarcrgyaccar
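The consensus line above can be derived mechanically from the base symbol table: for each column, collect the bases that occur and look up the symbol that stands for exactly that set. The following is a minimal sketch under stated assumptions. The names `IUPAC`, `SYMBOL_FOR` and `consensus` are illustrative and not from the course material, the sequences are assumed to be aligned already and of equal length, and U and gap characters are left out.

```python
# Base symbols from the table: each symbol maps to the bases it stands for.
# U and gap characters are deliberately left out of this sketch.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "k": "gt", "m": "ac",
    "s": "cg", "w": "at",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg",
    "n": "acgt",
}

# Reverse lookup: set of bases -> the single symbol that represents it.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def consensus(*sequences):
    """Return the consensus of equal-length, already aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*sequences):
        bases = set()
        for symbol in column:
            # An ambiguous input symbol contributes all bases it stands for.
            bases.update(IUPAC[symbol.lower()])
        result.append(SYMBOL_FOR[frozenset(bases)])
    return "".join(result)


print(consensus("aagcggtaccag", "aaacagcaccaa"))  # aarcrgyaccar
```

Columns where both sequences agree keep the plain base. A column with g and a becomes r (purine), and a column with t and c becomes y (pyrimidine).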

The genetic code

The genetic code

Amino acid symbols

A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
U  Sec  selenocysteine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine (or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides)

Two ways of sequencing

Shotgun sequencing: the strategy chosen by Celera for commercial sequencing of the human genome.

Ordered sequencing (top down): the strategy used in the "public" sequencing of the genome, an international collaboration.

Top-down strategy for sequencing

Two ways of sequencing the genome

BAC to BAC Sequencing

The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.

Whole Genome Shotgun Sequencing

The shotgun sequencing method goes straight to the job of decoding, bypassing the need for a physical map. Therefore, it is much faster.

Fragmenting the genome

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

Cloning the fragments

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

Placing the BAC clones on the map

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

This step is not needed in shotgun sequencing.

Subclones from the BAC clones

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

This step is not needed in shotgun sequencing.

Sequencing the clones

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

Raw sequence from a sequencing instrument

Building contiguous sequences

BAC to BAC Sequencing

Whole Genome Shotgun Sequencing

Assembling individual reads into longer sequences
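The idea behind this step can be illustrated with a toy example: reads that share an overlapping end are merged repeatedly until no overlaps remain. The sketch below is a deliberately naive greedy merge and not the method used by any particular assembly program. The function names `overlap` and `greedy_assemble` and the `min_len` parameter are illustrative assumptions, and the reads are assumed to be error-free, on the same strand and free of repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            # No remaining pair overlaps by at least min_len bases.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads, given out of order.
reads = ["acatagttctatgg", "cgttatttaaggtg", "aaggtgttacatag"]
print(greedy_assemble(reads))  # ['cgttatttaaggtgttacatagttctatgg']
```

Real assembly programs must also cope with sequencing errors, reads from both strands and repeated regions, which this sketch ignores.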

                                     

               

DNA sequencing 2001

Biological databases

Primary databases (archival): GenBank, EMBL, DDBJ, PDB

Secondary databases (curated): PIR, SwissProt and everything else

Database Categories List
http://www3.oup.co.uk/nar/database/c/

Genomics Databases (non-vertebrate)
Human and other Vertebrate Genomes
Human Genes and Diseases
Metabolic and Signaling Pathways
Microarray Data and other Gene Expression Databases
Nucleotide Sequence Databases
Other Molecular Biology Databases
Protein sequence databases
Proteomics Resources
RNA sequence databases
Structure Databases

548 databases in all, 162 more than a year ago.

GenBank entry

LOCUS       LISOD         756 bp    DNA             BCT       30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
ACCESSION   X64011 S78972
NID         g44010
VERSION     X64011.1  GI:44010
KEYWORDS    sod gene; superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillaceae;
            Listeria.
REFERENCE   1  (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
  TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii by
            functional complementation in Escherichia coli and characterization
            of the gene product
  JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE   92140371
REFERENCE   2  (bases 1 to 756)
  AUTHORS   Kreft,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
            Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG

GenBank entry (cont.)

FEATURES             Location/Qualifiers
     source          1..756
                     /organism="Listeria ivanovii"
                     /strain="ATCC 19119"
                     /db_xref="taxon:1638"
     RBS             95..100
                     /gene="sod"
     gene            95..746
                     /gene="sod"
     CDS             109..717
                     /gene="sod"
                     /EC_number="1.15.1.1"
                     /codon_start=1
                     /transl_table=11
                     /product="superoxide dismutase"
                     /protein_id="CAA45406.1"
                     /db_xref="SWISS-PROT:P28763"
                     /translation="MTYELPKLPYTYD…
     terminator      723..746
                     /gene="sod"
BASE COUNT      247 a    136 c    151 g    222 t
ORIGIN
        1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
       61 gtaatttctt
//
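As a quick consistency check on the entry above, the four numbers on the BASE COUNT line should add up to the sequence length given on the LOCUS line. The sketch below assumes only the two line layouts shown in this entry. The function names `parse_base_count` and `parse_locus_length` are illustrative and do not belong to any existing library.

```python
import re


def parse_base_count(line):
    """Turn 'BASE COUNT 247 a 136 c 151 g 222 t' into {'a': 247, ...}."""
    pairs = re.findall(r"(\d+)\s+([a-z]+)", line)
    return {base: int(count) for count, base in pairs}


def parse_locus_length(line):
    """Extract the sequence length in bp from a LOCUS line."""
    match = re.search(r"(\d+)\s+bp", line)
    if match is None:
        raise ValueError("no length found in LOCUS line")
    return int(match.group(1))


counts = parse_base_count("BASE COUNT 247 a 136 c 151 g 222 t")
length = parse_locus_length("LOCUS LISOD 756 bp DNA BCT 30-JUN-1993")
print(sum(counts.values()), length)  # 756 756
```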

EMBL database entry

EMBL:TRBG361

ID   TRBG361 standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX

EMBL database entry (cont.)

RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17:209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   AGDR; X56734; X56734.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX

EMBL database entry (cont.)

FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSI….
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg

EMBL database fields

Note that each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:

ID - identification (begins each entry; 1 per entry)

AC - accession number (>=1 per entry)

SV - new sequence identifier (>=1 per entry)

DT - date (2 per entry)

DE - description (>=1 per entry)

KW - keyword (>=1 per entry)

OS - organism species (>=1 per entry)

OC - organism classification (>=1 per entry)

OG - organelle (0 or 1 per entry)

RN - reference number (>=1 per entry)

RC - reference comment (>=0 per entry)

EMBL database fields (cont.)

RP - reference positions (>=1 per entry)

RX - reference cross-reference (>=0 per entry)

RA - reference author(s) (>=1 per entry)

RT - reference title (>=1 per entry)

RL - reference location (>=1 per entry)

DR - database cross-reference (>=0 per entry)

FH - feature table header (0 or 2 per entry)

FT - feature table data (>=0 per entry)

CC - comments or notes (>=0 per entry)

XX - spacer line (many per entry)

SQ - sequence header (1 per entry)

bb - (blanks) sequence data (>=1 per entry)

// - termination line (ends each entry; 1 per entry)
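Because every line starts with a two-character code, an entry can be split into fields by looking at the first two characters of each line. The following is a minimal sketch of that idea, not a full parser. The function name `group_by_line_code` is illustrative, XX spacer lines are dropped, reading stops at the // termination line, and sequence data lines end up under the blank code (two spaces).

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one EMBL entry by their two-character line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # skip completely empty lines
        code, rest = line[:2], line[2:].strip()
        if code == "//":
            break  # termination line: end of the entry
        if code == "XX":
            continue  # spacer line carries no information
        fields[code].append(rest)
    return dict(fields)


entry = """\
ID   TRBG361 standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
OS   Trifolium repens (white clover)
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
//
"""

fields = group_by_line_code(entry)
print(fields["AC"])       # ['X56734; S46826;']
print(len(fields["DT"]))  # 2
```

Keeping the values as lists reflects the fact that most line types may occur more than once per entry, as the counts in the list above indicate.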

The feature table

The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three databases to exchange data on a daily basis.

The range of features to be represented is diverse, including regions which:

perform a biological function,

affect or are the result of the expression of a biological function,

interact with other molecules,

affect replication of a sequence,

affect or are the result of recombination of different sequences,

are a recognizable repeated unit,

have secondary or tertiary structure,

exhibit variation, or

have been revised or corrected.

Feature table terminology

The format and wording in the feature table use common biological research terminology whenever possible. For example, an item in the new feature table such as:

Key             Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"

might be read as:

The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called 'alcohol dehydrogenase' and corresponds to the gene called 'adhI'.

Feature table terminology (cont.)

A more complex description:

Key             Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial

which might be read as:

This feature, which is a partial coding sequence, is formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
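The two location forms used in these examples, a simple range `start..end` and a `join(...)` of several ranges, are easy to read programmatically. The sketch below handles only those two forms, with 1-based inclusive coordinates as in the examples. Other location operators defined by the feature table format, such as `complement(...)`, are not handled, and the function names `parse_location` and `location_length` are illustrative.

```python
import re


def parse_location(location):
    """Return a list of (start, end) pairs, 1-based and inclusive."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


def location_length(location):
    """Total number of bases covered by the location."""
    return sum(end - start + 1 for start, end in parse_location(location))


print(parse_location("join(544..589,688..1032)"))  # [(544, 589), (688, 1032)]
print(location_length("23..400"))                   # 378
print(location_length("join(544..589,688..1032)"))  # 391
```

The complete CDS at 23..400 covers 378 bases, a multiple of three as expected for a full coding sequence. The joined CDS covers 391 bases, which is not a multiple of three and fits its `/partial` qualifier.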

Feature key examples

Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure