Databases in bioinformatics II

Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine

BIOINFORMATICS AND SYSTEMS BIOLOGY, MSC PROGR
Sequence analysis, UMF018, 2009

Source: Göteborgs universitet, bio.biomedicine.gu.se/courses/ht09/bio2/2009_DB2_toprint.pdf

Overview

– Genome sequencing

– Sequencing methods
  • Sanger, Maxam
  • Next generation methods (2nd, 3rd)
  • Uses
  • Implications

– RefSeq vs GenBank

– TraceArchive

– Refining searches at Entrez

– eUtils (programming utilities)

Organization of GenBank

Divisions allow:
– querying specific subsets
– grouping by particular technique
– interpretation of data from a proper biological point of view

Traditional divisions: direct submissions (Sequin and BankIt); accurate, well characterized
  PRI  Primate
  ROD  Rodent
  MAM  Mammalian
  VRT  Other Vertebrate
  INV  Invertebrate
  PLN  Plant and Fungal
  BCT  Bacterial and Archaeal
  VRL  Viral
  PHG  Phage
  SYN  Synthetic (cloning vectors)
  UNA  Unannotated

Bulk divisions: batch submission (email and FTP); inaccurate, poorly characterized
  EST  Expressed Sequence Tag
  GSS  Genome Survey Sequence
  HTG  High Throughput Genomic
  STS  Sequence Tagged Site
  HTC  High Throughput cDNA
  PAT  Patent
  WGS  Whole Genome Shotgun
  ENV  Environmental Samples
  CON  Constructed sequences

Benson DA, et al. 2008. Nucleic Acids Research

Why Sequencing Genomes

Organisms are remarkably similar at the molecular level despite their obvious outward differences

Genes with similar DNA sequences tend to perform similar functions

By understanding the function of a gene in one organism, we may get an idea of what function that gene performs in a more complex organism (humans)

Applied to various fields: medicine, biological engineering, forensics, etc.

Archon X Prize

"the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."

Prize: $10 million

HGP: 1993; 1st draft 2000; final 2003 ($3 billion)

2007: $2 million in 2 months
James Watson, Craig Venter

2008: $60,000-$100,000 in 4 weeks

Personalized medicine

Deep sequencing: mutations, cancer genetics, pharmacogenetics

Lab computer infrastructure
– data storage
– data transfer capacity

Training lab personnel

Statistical methods
– incidental discoveries
– uncertain clinical significance

Ethics
– Consent (children, incompetent adults)
– Results with uncertain clinical significance
– Prediction of serious diseases that can't be prevented/treated
– Results with implications for family members
– Whole genome data used to analyze a small portion
– Time and place of data storage
– Access to data: patient, physician, insurance companies, police
– When can it be used: identification of disaster victims, confirmation of citizenship

Sequencing methods

1954: Whitfeld PR. Sequencing by degradation

1975-1977:
– W. Gilbert, A. Maxam (chemical modification)
– F. Sanger (chain termination)

Next generation

Cyclic array sequencing
– Illumina/Solexa
– Roche/454
– AB SOLiD
– Helicos/HeliScope

Sequencing by hybridization
– Affymetrix

Sequencing in real time (3rd generation)
– Oxford Nanopore Technologies
– Pacific Biosciences SMRT

Maxam-Gilbert sequencing

- Chemical modification of DNA (radiolabelling)

- Cleavage at specific bases (G, G+A, C, C+T)

- Size separation (gel electrophoresis)

- Autoradiography (X-ray film)

Reading the gel (lanes 1 to 4):
- Strong band in the 1st lane with a weaker band in the 2nd: A
- Strong band in the 2nd lane with a weaker band in the 1st: G
- Band in the 3rd and 4th lanes: C
- Band only in the 4th lane: T

Maxam AM, Gilbert W. A new method for sequencing DNA. Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4

PROS: Purified DNA can be used directly

CONS:
- Technically complex
- Use of hazardous chemicals
- Difficult to scale up

Sanger method

dNTP (deoxynucleotide) vs. ddNTP (dideoxynucleotide)

Arthur Kornberg: DNA replication

Chain termination

Sanger method: labeled dNTP

Radioactively or fluorescently labeled dNTP

Reaction components:
– DNA template
– Polymerase
– Primer
– dNTP
– ddNTP
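
As a rough illustration of chain termination, the sketch below simulates the four ddNTP reactions and reads the sequence back from fragment lengths. It is a toy model under stated assumptions: every position terminates in exactly one fragment, the template string and function names are invented for the example, and primer length is ignored.

```python
# Toy model of Sanger chain termination (illustrative only).

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def sanger_fragments(new_strand):
    """Return {base: [fragment lengths]} for the four ddNTP reactions.
    A fragment of length i ends at position i of the synthesized strand."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes

def read_lanes(lanes):
    """Read the sequence by sorting all fragments by length,
    as a gel does, and noting which reaction each came from."""
    by_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            by_length[length] = base
    return "".join(by_length[n] for n in sorted(by_length))

template = "TGCGTAG"                          # hypothetical template, written 3'->5'
new_strand = template.translate(COMPLEMENT)   # strand built by the polymerase
lanes = sanger_fragments(new_strand)
print(lanes)
print(read_lanes(lanes))
```

The read-out equals the synthesized strand, which is the complement of the template.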

[Figure: gel with one lane per reaction, labeled A, C, T, G]

Sanger method: dye-labeled primer

Dye-labeled primer

PROS: Upon completion, these four reactions can be combined into one lane on a gel, and run on a machine that can scan the lanes with a laser

http://www.escience.ws/b572/L8/L8.htm

Sanger method: dye-labeled terminator

Dye-labeled terminator

PROS: Uses an optical system
– faster
– more economical
– automated

Single reaction (a different dye for each nucleotide)

Large scale sequencing strategies

Sanger: not practical for sequencing a complete genome
– Only about 1000 bases can be sequenced accurately
– A primer of known sequence is required

A privately funded sequencing project: Celera Genomics

The publicly funded HGP: NIH/DOE

Cyclic array sequencing

1.- DNA library preparation (ligation of adaptors)

2.- Amplification: emulsion PCR (ePCR) or bridge PCR

3.- Sequencing reaction: polymerase-based, ligation-based, pyrosequencing

4.- Imaging

5.- Bioinformatics: image analysis, statistical measures, assembly …

Illumina/Solexa genome analyzer

http://www.illumina.com/media.ilmn?Title=Sequencing-Workflow-Video&Cap=&Img=spacer.gif&PageName=illumina%20sequencing%20technology&PageURL=203&Media=10

http://www.illumina.com/

Sequencing by synthesis

Detects the fluorescence of the added nucleotide at each position while the complementary strand is synthesized. Uses reversible terminators.

Pyrosequencing

Detects the activity of DNA polymerase with a chemiluminescent enzyme while the complementary strand is synthesized.

[Figure: reaction cascade. Each nucleotide incorporation releases PPi; sulfurylase converts PPi to ATP; luciferase uses the ATP to convert luciferin to oxyluciferin, emitting light. Pyrogram read-out: C G T C C G G A]

http://www.biotagebio.com/DynPage.aspx?id=7454

http://www.roche-applied-science.com/publications/multimedia/genome_sequencer/flx_multimedia/wbt.htm
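
To make the pyrogram idea concrete, here is a minimal sketch of how a read could be decoded, assuming that the light peak for each dispensed nucleotide is proportional to the number of bases incorporated. The dispensation order and peak heights are made up so that the result matches the read-out shown on the slide; real signal processing is more involved.

```python
# Sketch: decode a pyrogram from a dispensation order and peak heights.
# Assumption: peak height is roughly the number of identical bases incorporated.

def decode_pyrogram(dispensation_order, peak_heights):
    """Return the synthesized sequence implied by the peaks."""
    sequence = []
    for nucleotide, height in zip(dispensation_order, peak_heights):
        copies = round(height)            # nearest whole number of incorporations
        sequence.append(nucleotide * copies)
    return "".join(sequence)

# Hypothetical run: nucleotides dispensed one at a time, with noisy peak heights.
decoded = decode_pyrogram("CGTCGA", [1.0, 1.1, 0.9, 2.0, 1.9, 1.0])
print(decoded)
```

A peak of height 0 simply contributes nothing, which models a dispensed nucleotide that was not incorporated.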

Applied Biosystems / SOLiD™ System

http://appliedbiosystems.cnpg.com/Video/flatFiles/699/index.aspx

http://www3.appliedbiosystems.com/AB_Home/index.htm

Sequencing by ligation

Uses the enzyme DNA ligase to identify the nucleotide present at a given position in a DNA sequence.

2-base color encoding data

1 dye = 4 possible dinucleotides

2 bases are interrogated in each ligation reaction providing increased specificity
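
Two-base color encoding can be illustrated with the commonly described XOR formulation, in which each base gets a 2-bit code and the color of a dinucleotide is the XOR of the two codes. The numeric codes and color labels 0 to 3 are a convention assumed for this sketch, not taken from the slides.

```python
# Sketch of 2-base color encoding (color space), using an assumed 2-bit code per base.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode(sequence):
    """Translate a base sequence into one color per overlapping dinucleotide."""
    return [CODE[a] ^ CODE[b] for a, b in zip(sequence, sequence[1:])]

def decode(first_base, colors):
    """Rebuild the bases from the known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)

colors = encode("ACGCATC")
print(colors)
print(decode("A", colors))

# Each color stands for exactly four dinucleotides ("1 dye = 4 possible dinucleotides").
dinucleotides_per_color = {
    color: [a + b for a in BASES for b in BASES if CODE[a] ^ CODE[b] == color]
    for color in range(4)
}
print(dinucleotides_per_color)
```

Because every base is covered by two color calls, a single wrong color changes all downstream bases on decoding, while a true SNP changes two adjacent colors; this is the basis of the error checking mentioned for re-sequencing.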

Sequencing by ligation

[Figure: primer round 1 and primer round 2 shown; total of 5 primer rounds]

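Re-sequencing

[Figure: alignment in color space (CS) with rows for the reference sequence, CS reference, CS reads, CS consensus, and base-space (BS) consensus; one polymorphism and one error are marked]

Higher accuracy through a built-in error checking capability: discrimination between measurement errors and SNPs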

Helicos HeliScope™

http://www.helicosbio.com/Technology/TrueSingleMoleculeSequencing/tSMStradeHowItWorks/tabid/162/Default.aspx

http://www.helicosbio.com/Default.aspx?base

Sequencing by synthesis

Affymetrix

http://www.affymetrix.com/index.affx

Sequencing by hybridization

Microarray – DNA chip (non-enzymatic)

[Figure: hybridization of the sample to the probes on the chip, followed by imaging]

Sequencing by hybridization

1. DNA sample: ACGCATC

2. Hybridization

3. Spectrum: the probes that hybridized (TGC, GCG, CGT, GTA, TAG)

4. Reconstruct the sequence: A C G C A T C

[Figure: 4 × 4 chip of 3-base probes, hybridized with copies of the sample ACGCATC. Probes on the chip:
TGC ATG CCC GTA
CTA CAA GAT AAA
GCG GGG TAG CAT
TGA TTC TTT CGT]

Drmanac R et al. Adv Biochem Eng Biotechnol. 2002
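
Step 4 can be sketched as chaining the k-mers of the spectrum by their overlaps. The code below is a minimal illustration that only handles the simple case in which every (k-1)-base overlap is unique (no repeats); the function name and error messages are invented for the example, and the probes are complemented base by base to obtain the k-mers present in the sample.

```python
# Sketch: reconstruct a sequence from its hybridization spectrum by overlap chaining.
# Works only when the chain is unambiguous (no repeated (k-1)-mers).

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reconstruct(spectrum):
    spectrum = list(spectrum)
    k = len(spectrum[0])
    by_prefix = {kmer[:k - 1]: kmer for kmer in spectrum}
    suffixes = {kmer[1:] for kmer in spectrum}
    # The first k-mer is the one whose prefix is not the suffix of any other k-mer.
    starts = [kmer for kmer in spectrum if kmer[:k - 1] not in suffixes]
    if len(by_prefix) != len(spectrum) or len(starts) != 1:
        raise ValueError("ambiguous spectrum; simple chaining does not apply")
    sequence = starts[0]
    for _ in range(len(spectrum) - 1):
        next_kmer = by_prefix.get(sequence[-(k - 1):])
        if next_kmer is None:
            raise ValueError("spectrum does not form a single chain")
        sequence += next_kmer[-1]
    return sequence

# Probes that hybridized in the slide's example, in arbitrary order.
hybridized_probes = ["TGC", "GTA", "TAG", "GCG", "CGT"]
sample_kmers = [probe.translate(COMPLEMENT) for probe in hybridized_probes]
reconstructed = reconstruct(sample_kmers)
print(reconstructed)
```

With repeats the spectrum no longer determines a single chain, which is one of the limitations of the method.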

ACC GCG CCT CCA CCG TCC GCC CTC

Oligomers on chip = 4^(number of bases per probe)

In our example: 3 bp = 64 oligomers

25 bases = 1,125,899,906,842,624 oligomers!

Probe: 5-25 bases

Probe overlap: each base is read by multiple probes (helps to detect SNPs)

Limitations:
– Hybridization conditions are not homogeneous: melting temperature depends strongly on the GC/AT ratio
– Repeats
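
The probe counts quoted above follow directly from 4 to the power of the probe length; a short check, with a hypothetical helper name:

```python
# Number of distinct probes needed to cover every possible sequence of a given length.

def probe_count(probe_length):
    return 4 ** probe_length

for length in (3, 5, 25):
    print(length, f"{probe_count(length):,}")
```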

Pacific Biosciences / SMRT™ technology

http://www.pacificbiosciences.com/video_lg.html

http://www.pacificbiosciences.com/

– Single Molecule Real Time
– Not commercially available
– Platform for single molecule real time detection based on DNA polymerase activity

Oxford Nanopore™ Technologies

http://www.nanoporetech.com/sequences/index/34

http://www.nanoporetech.com/

– Reads the sequence as a DNA strand transits through nanopores (transmembrane cellular proteins)
– Voltage → electrical current
– The amount of current is very sensitive to the size and shape of the nanopore

[Figure: nanopore with bases labeled G, T, A, C]

More on sequencing methods …

VisiGen Biotechnologies, Inc
http://visigenbio.com/home.html
http://visigenbio.com/technology_movie_streaming.html

Intelligent BioSystems
http://www.intelligentbiosystems.com/index%20mod%201.html

Complete Genomics
http://www.completegenomics.com/

Sequencing and gene expression

Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression.

Expression in normal circumstances vs. an altered state (?)

– Identify the gene (genome bioinformatics)
– Identify and study the protein(s) coded by the gene

Redundancy at GenBank

Many sequences are represented more than once in GenBank: a huge degree of redundancy

2003 RefSeq collection:
– curated secondary database
– non-redundant
– selected organisms

Contents:
• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein

RefSeq vs GenBank

GenBank                                  RefSeq
Not curated                              Curated
Author submits                           NCBI creates from existing data
Only author can revise                   NCBI revises as new data emerge
Multiple records from same loci common   Single record for each molecule of major organisms
Records can contradict each other        -
No limit to species included             Limited to model organisms
Data exchange among INSDC members        Exclusive NCBI database
Akin to primary literature               Akin to review articles
Proteins identified and linked           Proteins and transcripts identified and linked
Access via NCBI Nucleotide db            Access via Nucleotide and Protein dbs

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

RefSeq accession numbers

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

Curated                  Automated
mRNA:    NM_000000       Model mRNA:    XM_000000
RNA:     NR_000000       Model RNA:     XR_000000
protein: NP_000000       Model protein: XP_000000

Gene:       NG_000000
Contig:     NT_000000, NW_000000
Chromosome: NC_000000
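
The prefixes on this slide lend themselves to a simple lookup. The sketch below only uses the prefixes listed above; the function name and the handling of unknown input are assumptions made for illustration, and it does not validate the numeric part or version suffixes.

```python
# Sketch: classify a RefSeq accession by its two-letter prefix (prefixes from the slide).

REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NR": "RNA",
    "NP": "protein",
    "NG": "gene",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
    "NT": "contig",
    "NW": "contig",
    "NC": "chromosome",
}

def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None if unrecognized."""
    prefix, separator, _number = accession.partition("_")
    if not separator:
        return None          # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())

for accession in ("NM_000000", "XP_000000", "NC_000000", "AB123456"):
    print(accession, classify_accession(accession))
```

The underscore is a quick way to tell RefSeq accessions from GenBank ones, which do not contain it.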

Trace Archive

2001: NCBI and EMBL/ENSEMBL

Purpose: collect raw data from sequencing centers worldwide

PERMANENT repository of single-pass reads (300-1,000 nt)

Uses:
– Hunt for polymorphisms in gene sequences
– Insights into the impact of genetic variation on health

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?

2,1112,309,330 traces (2009-11-06)

Entrez

Refining search results

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Searching_Entrez_usi

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Searching_PubMed

Limits

Refine search results to retrieve only the most relevant documents

Allow restriction of a search to a defined subset of the database

Index

Alphabetical lists of terms from searchable database fields

Used to browse and/or select the terms by which records and/or data are described

Search Field Descriptions and Qualifiers

Index search field Qualifier

Accession [ACCN] or [ACCESSION]

All Fields [ALL] or [ALL FIELDS]

Author [AUTH] or [AUTHOR]

EC/RN Number [ECNO]

Feature Key [FKEY]

Filter [FILT] or [SB]

Gene Name [GENE]

Issue [ISS] or [ISSUE]

Keyword [KYWD] or [KEYWORD]

Journal Name [JOUR] or [JOURNAL]

Modification Date [MDAT]

Organism [ORGN] or [ORGANISM]

Page Number [PAGE]

Primary Accession [PACC]

Properties [PROP]

Protein Name [PROT]

Publication Date [PDAT]

SeqID String [SQID]

Sequence Length [SLEN]

Substance Name [SUBS]

Text Word [WORD]

Title [TITL]

Volume [VOL]

Entrez date [EDAT]

Journal title [TA]

Language [LA]

MeSH term [MH]

Title/Abstract [TIAB]

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.table.EntrezHelp.T7

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip

Advanced search statements

term [field] OPERATOR term [field]

Find all human nucleotide sequences with D-loop annotations:
D-loop[FKEY] AND human[ORGN] in Nucleotide database

Find Drosophila population studies published in the Journal of Molecular Evolution:
j mol evol[JOUR] AND drosophila[ORGN] in PopSet database
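
The `term [field] OPERATOR term [field]` pattern is easy to assemble programmatically. The helper names below are hypothetical; the sketch only builds and URL-encodes the query string with the Python standard library and does not contact NCBI.

```python
# Sketch: build fielded Entrez query strings and URL-encode them.
from urllib.parse import quote_plus

def fielded(term, qualifier):
    """Attach a search-field qualifier, e.g. fielded('human', 'ORGN') -> 'human[ORGN]'."""
    return f"{term}[{qualifier}]"

def combine(operator, *terms):
    """Join terms with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(terms)

d_loop_query = combine("AND", fielded("D-loop", "FKEY"), fielded("human", "ORGN"))
popset_query = combine("AND", fielded("j mol evol", "JOUR"), fielded("drosophila", "ORGN"))

print(d_loop_query)
print(popset_query)
print(quote_plus(d_loop_query))   # form suitable for the term= parameter of a URL
```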

History

Provides a record of the searches performed during a search session.

– Database specific
– Lost after eight hours of inactivity

Used to review, revise, or combine the results of earlier searches.

Combining results

Query translation

Details

Display your search strategy as translated using Entrez's search and syntax rules

Error messages, when applicable

Author search

Example - author

Example - journal

eUtils: Entrez Programming Utilities

• Tools that provide access to Entrez data outside of the regular web query interface
• Set of 8 server-side programs: ESearch, ESummary, EGQuery, EInfo, EFetch, ELink, EPost, ESpell
• Helpful for retrieving search results (to be manipulated in another environment)
• Usable from Perl, Python, Java, and C++
• Currently includes 35 databases

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

Uses:
• Perform searches on large datasets
• Implement data pipelines for genomic, proteomic, or microarray analysis
• Create automated searches to keep local databases current
• Create and download customized datasets
• Seamlessly combine local data with NCBI data
• Develop a focused interface to NCBI data

URL → Result (XML)

Common Entrez Engine

Assemble a list of UIDs:
– ESearch (for a given db)
– EGQuery (global version, all dbs)

Retrieve a brief summary record (DocSum):
– ESummary (for a list of UIDs)

URL

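http://www.ncbi.nlm.nih.gov/sites/gquery?term=cancer+stem+cells

[Base_URL][Eutils_URL][Query]

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml

[Base_URL][Eutils_URL][DB][Query]
– Base_URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
– Eutils_URL: esummary.fcgi?
– DB: db=taxonomy
– Query: id=9913&retmode=xml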
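
The URL anatomy above can be reproduced with a small helper. This is a sketch using only the Python standard library; the function name is hypothetical, the base URL is copied from the slides (NCBI's current documentation should be checked for the present address), and the code only builds the URL without sending a request.

```python
# Sketch: assemble an E-utilities URL from its parts.
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"   # as given on the slides

def eutils_url(utility, db=None, **query):
    """Build [Base_URL][Eutils_URL][DB][Query] for a utility such as 'esearch'."""
    params = {}
    if db is not None:
        params["db"] = db          # E-utility database name, e.g. 'pubmed'
    params.update(query)           # remaining parameters, in the order given
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"

summary_url = eutils_url("esummary", db="taxonomy", id=9913, retmode="xml")
search_url = eutils_url("esearch", db="pubmed", term="cancer",
                        reldate=60, datetype="edat", retmax=10)
spell_url = eutils_url("espell", db="pubmed", term="brest cancer")

print(summary_url)
print(search_url)
print(spell_url)
```

`urlencode` escapes spaces as `+`, which matches the way the slides write multi-word terms.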

URL: DB

Entrez Database E-Utility Database Name

3D Domains domains

Domains cdd

Genome genome

Nucleotide nucleotide

OMIM omim

PopSet popset

Protein protein

ProbeSet geo

PubMed pubmed

Structure structure

SNP snp

Taxonomy taxonomy

UniGene unigene

UniSTS unists

Each Entrez DB has an E-Utility name, used instead of its original name in the [DB] part of the URL:

[Base_URL][Eutils_URL][DB][Query]

URL: Query

[Table: parameters accepted by each E-utility, for the [Query] part of [Base_URL][Eutils_URL][DB][Query].
Columns: EFetch (Tax, Seq, Lit), EGQuery, ESpell, EInfo, ESearch, ESummary, ELink, EPost.
Rows: db, term, field, reldate, mindate, maxdate, datetype, retstart, retmax, retmode, rettype, history, WebEnv, query_key, id, report, strand, seq_start, seq_stop, dbfrom, cmd.
Parameters marked for a single utility: field (ESearch); report (EFetch Tax); strand, seq_start, seq_stop (EFetch Seq); dbfrom, cmd (ELink).]

ESpell

Retrieves spelling suggestions when available (PubMed only)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=brest+cancer

EInfo

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

Provides detailed information about a given database: term counts, last update, and available links

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?

EGQuery

Provides Entrez database counts in XML for a single search using GQuery

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2&rettype=html

ESummary

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml

retmode values: xml, ref, html, text, asn.1

Retrieves DocSums from a list of primary IDs

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?

ELink

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=protein&id=7140346

– Checks for the existence of an external or Related Articles link from a list of UIDs
– Retrieves IDs related to a list of UIDs (same db, external db)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10611131&retmode=ref&cmd=prlinks

– Creates a hyperlink to the primary LinkOut provider for a specific ID
– Lists LinkOut URLs and attributes for multiple IDs

ESearch

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=10

Returns a list of matching UIDs (text search) in a given Entrez database

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?

datetype values: edat, mdat, pdat
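
ESearch returns its UID list as XML, which can be parsed with the standard library. The response text below is a hand-written, abbreviated sample rather than real NCBI output (the IDs are borrowed from the ESummary example in these slides), and the function name is hypothetical.

```python
# Sketch: extract the hit count and UID list from an ESearch XML result.
import xml.etree.ElementTree as ET

# Illustrative, abbreviated response; a real result contains more elements.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11850928</Id>
    <Id>11482001</Id>
  </IdList>
</eSearchResult>"""

def parse_esearch(xml_text):
    """Return (total hit count, list of UIDs as strings)."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [element.text for element in root.findall("./IdList/Id")]
    return count, uids

hit_count, uids = parse_esearch(SAMPLE_RESPONSE)
print(hit_count, uids)
print(",".join(uids))   # comma-separated form used by the id= parameter
```

The comma-separated UID string is what ESummary, ELink and EFetch expect in their `id` parameter.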

EFetch

Generates formatted output for a list of input IDs:
– abstracts from PubMed
– FASTA format from Protein

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?

DBs:
– Literature databases: PubMed, Journals, PubMed Central, OMIM
– Sequence and other molecular biology databases: Nucleotide, Protein, Gene, etc.
– Taxonomy

EFetch - Literature

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12091962,9997&retmode=html&rettype=abstract

Rettype

Rettype   Scope            Description
count     PubMed           Hit counts
uilist    all              Default format for viewing hits
sort      PubMed and gene
abstract  PubMed
citation  PubMed
medline   PubMed
full      PubMed
native    all              Default format for viewing sequences
fasta     sequence         FASTA view of a sequence
gb        nucleotide       GenBank view for sequences
est       dbEST            EST report
gp        protein          GenPept view
seqid     sequence         Converts a list of GIs into a list of seqids
acc       sequence         Converts a list of GIs into a list of accessions
chr       dbSNP only       SNP chromosome report
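A small Python sketch of how the rettype values in the table above might be used when building an EFetch URL. The helper `efetch_url` and the set `KNOWN_RETTYPES` are hypothetical names; the set simply copies the first column of the table and is not an authoritative list of what NCBI accepts.

```python
from urllib.parse import urlencode

# Base address as written on the slides
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# rettype values copied from the table above (scope column not checked here)
KNOWN_RETTYPES = {
    "count", "uilist", "sort", "abstract", "citation", "medline", "full",
    "native", "fasta", "gb", "est", "gp", "seqid", "acc", "chr",
}


def efetch_url(db, ids, rettype, retmode=None):
    """Build an EFetch URL for a list of IDs (hypothetical helper)."""
    if rettype not in KNOWN_RETTYPES:
        raise ValueError("rettype not in the table: " + rettype)
    params = {"db": db, "id": ",".join(str(i) for i in ids)}
    if retmode is not None:
        params["retmode"] = retmode   # e.g. html, text, xml
    params["rettype"] = rettype
    # safe="," keeps the commas between IDs readable in the URL
    return BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


# The literature example: two PMIDs as HTML abstracts
print(efetch_url("pubmed", [12091962, 9997], "abstract", retmode="html"))
# A sequence example: one nucleotide record in FASTA format
print(efetch_url("nucleotide", [5], "fasta"))
```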

EFetch - Sequences

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=fasta

strand: 1 (+), 2 (-)

EFetch - Taxonomy

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=44689&report=docsum

report: uilist, brief, docsum, xml

Exercise

Search in Journals for the term obstetrics:

In PubMed, display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type:

From Entrez Gene, display the GenomeID 2 as xml:

Retrieve PubMed related articles for protein 61742829 with a publication date from 1995 to the present:

Combining eUtils calls

The eUtils are useful when used by themselves in single URLs; however, their full potential is reached when successive eUtils URLs are combined to create a data pipeline.

• Retrieving data records matching an Entrez query

ESearch → ESummary

ESearch → EFetch

• Finding IDs linked to records matching an Entrez query

ESearch → ELink

• Retrieving data records in database B linked to records in database A matching an Entrez query

ESearch → ELink → ESummary

ESearch → ELink → EFetch
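The last pipeline, ESearch → ELink → EFetch, might be sketched in Python as below. All function names (`http_get`, `call`, `linked_records`) are hypothetical. The sketch assumes the XML layouts that ESearch (`IdList/Id`) and ELink (`LinkSetDb/Link/Id`) return, and it passes ID lists directly rather than using the history server. The `fetch` argument is there so the pipeline can be tried with canned responses instead of live requests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address as written on the slides
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url):
    """Fetch a URL and return the raw response bytes."""
    with urlopen(url) as response:
        return response.read()


def call(utility, fetch, **params):
    """Call one eUtil (e.g. "esearch") with the given URL parameters."""
    return fetch(BASE_URL + utility + ".fcgi?" + urlencode(params, safe=","))


def linked_records(db_a, term, db_b, rettype, fetch=http_get, retmax=5):
    """Return records in db_b linked to db_a records matching `term`."""
    # Step 1 (ESearch): UIDs in database A that match the query
    found = ET.fromstring(call("esearch", fetch, db=db_a, term=term, retmax=retmax))
    ids = [e.text for e in found.findall(".//IdList/Id")]
    if not ids:
        return ""
    # Step 2 (ELink): UIDs in database B linked to those records
    links = ET.fromstring(call("elink", fetch, dbfrom=db_a, db=db_b, id=",".join(ids)))
    linked = [e.text for e in links.findall(".//LinkSetDb/Link/Id")]
    if not linked:
        return ""
    # Step 3 (EFetch): formatted records for the linked UIDs
    data = call("efetch", fetch, db=db_b, id=",".join(linked),
                rettype=rettype, retmode="text")
    return data.decode("utf-8")


# Example call (performs live requests, so it is left commented out):
# print(linked_records("nuccore", "factor ix human", "protein", "fasta"))
```

Swapping `"efetch"` for `"esummary"` in step 3 gives the ESearch → ELink → ESummary variant.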

A Perl example

TASK: Retrieve protein sequences of factor IX in FASTA format

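use strict;
use warnings;

# Base address shared by all eUtils calls
my $Base_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";
my $esearch_URL = "esearch.fcgi?";
my $DB = "db=protein&";
# Spaces in the query are written as "+" so that the URL stays valid
my $Query = "term=factor+ix+human";
# usehistory=y keeps the result on the server (returns QueryKey and WebEnv)
my $esearch_Parameters = "retmax=1&usehistory=y&";
# Full ESearch URL
my $E_search = "$Base_URL$esearch_URL$DB$esearch_Parameters$Query";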

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&retmax=1&usehistory=y&term=factor+ix+human

ESearch → EFetch

Output from ESearch

QueryKey - WebEnv

$WebEnv: cookie value (an encoded server address) passed to EFetch in place of the list of primary IDs

$QueryKey: history search number (a label) that corresponds to a UID list, for use in subsequent search strategies
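To show where these two values come from, here is a minimal Python sketch that reads them out of an ESearch result obtained with `usehistory=y`. The XML in `SAMPLE` is a hand-trimmed, made-up response (the counts, ID and WebEnv string are placeholders), and `history_handle` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative ESearch output, trimmed by hand; all values are placeholders
SAMPLE = """<eSearchResult>
  <Count>42</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
  <IdList><Id>123456</Id></IdList>
</eSearchResult>"""


def history_handle(esearch_xml):
    """Return (QueryKey, WebEnv) from an ESearch XML result."""
    root = ET.fromstring(esearch_xml)
    query_key = root.findtext("QueryKey")   # label of the stored search
    web_env = root.findtext("WebEnv")       # handle of the server-side session
    if query_key is None or web_env is None:
        # Both elements are absent when the search was run without usehistory=y
        raise ValueError("no history data found; was usehistory=y set?")
    return query_key, web_env


query_key, web_env = history_handle(SAMPLE)
print(query_key, web_env)
```

The pair returned here is what a following EFetch call would pass as `query_key` and `WebEnv`.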

A Perl example

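# $Base_URL and $DB are the same as in the ESearch step;
# $QueryKey and $WebEnv are taken from the ESearch output
my $efetch_URL = "efetch.fcgi?";
# Ask for FASTA text and point EFetch at the stored search result
my $efetch_Parameters = "rettype=fasta&retmode=text&query_key=$QueryKey&WebEnv=$WebEnv";
# Full EFetch URL
my $E_fetch = "$Base_URL$efetch_URL$DB$efetch_Parameters";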

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&query_key=1&WebEnv=0ujfmXBW0U0hNr3FjaUutLkz1bR-NnJ9kp5vybL3u1AbTQdD7uMETHEtG5N@1EE047D172B3B8D0_0015SID
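The same EFetch URL can be assembled in Python. This is a sketch under stated assumptions: `efetch_history_url` is a hypothetical helper, and the `query_key` and `WebEnv` arguments are assumed to have been read from an earlier ESearch call made with `usehistory=y`.

```python
from urllib.parse import urlencode

# Base address as used in the Perl example
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_history_url(db, query_key, web_env, rettype="fasta", retmode="text"):
    """Build an EFetch URL that refers to a stored search instead of an ID list."""
    params = {
        "db": db,
        "rettype": rettype,
        "retmode": retmode,
        "query_key": query_key,   # label of the stored search
        "WebEnv": web_env,        # handle of the server-side session
    }
    # urlencode percent-encodes special characters that may occur in WebEnv
    return BASE_URL + "efetch.fcgi?" + urlencode(params)


# Placeholder WebEnv value, for illustration only
print(efetch_history_url("protein", "1", "EXAMPLE_WEBENV_STRING"))
```

Because no ID list appears in the URL, the same call works regardless of how many records the stored search matched.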

Output from EFetch