12
Introduction The increased demands for DNA sequencing services and greater volume of data have created a need for software designed to support sequencing activities and applications. Commercial data management systems, such as the Finch-Server (Geospiza Inc., Seattle, WA), provide laboratories with robust data processing and quality analysis tools designed specifically to support DNA sequencing. Even laboratories new to sequencing can implement these systems quickly to analyze projects and identify problem areas. Data management systems offer a variety of benefits. Two immediate benefits are a decrease in cost and a shorter time to complete sequencing projects. Small increases in quality or read length can have a large impact on the number of sequencing reactions required to complete a project. Because the overall cost of a sequencing project is determined primarily by the number of reads, project costs are lowered when data management tools are used to detect and correct problems in a timely fashion. A later benefit to researchers is the ability to link sequence data to sample history. It is important to know, for example, if a sequence was obtained from a cloned DNA fragment or from a clinical sample that might contain a heterogeneous mixture of sequences. Finch-Servers were designed to solve common problems and meet the data management requirements shared by many laboratories. This approach benefits customers because multiple users participate in the design, review, and testing process. Additional benefits include a shorter time frame from purchase to use and a more robust and adaptable system, 10 Geospiza’s Finch-Server: A Complete Data Management System for DNA Sequencing Sandra Porter, Joe Slagel, and Todd Smith Geospiza, Inc., Seattle, WA DNA Sequencing: Optimizing the Process and Analysis Edited by Jan Kieleczawa ©2005 Jones and Bartlett Publishers 131 KOO10 5/31/04 12:17 PM Page 131

10 Geospiza’s Finch-Server: A Complete Data Management

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 10 Geospiza’s Finch-Server: A Complete Data Management

Introduction

The increased demands for DNA sequencing services and greater volumeof data have created a need for software designed to support sequencingactivities and applications. Commercial data management systems, suchas the Finch-Server (Geospiza Inc., Seattle, WA), provide laboratories withrobust data processing and quality analysis tools designed specifically tosupport DNA sequencing. Even laboratories new to sequencing canimplement these systems quickly to analyze projects and identify problemareas.

Data management systems offer a variety of benefits. Two immediatebenefits are a decrease in cost and a shorter time to complete sequencingprojects. Small increases in quality or read length can have a large impacton the number of sequencing reactions required to complete a project.Because the overall cost of a sequencing project is determined primarilyby the number of reads, project costs are lowered when data managementtools are used to detect and correct problems in a timely fashion. A laterbenefit to researchers is the ability to link sequence data to sample history.It is important to know, for example, if a sequence was obtained from a cloned DNA fragment or from a clinical sample that might contain aheterogeneous mixture of sequences.

Finch-Servers were designed to solve common problems and meet the data management requirements shared by many laboratories. Thisapproach benefits customers because multiple users participate in thedesign, review, and testing process. Additional benefits include a shortertime frame from purchase to use and a more robust and adaptable system,

10 Geospiza’s Finch-Server: AComplete Data ManagementSystem for DNA Sequencing

Sandra Porter, Joe Slagel, and Todd SmithGeospiza, Inc., Seattle, WA

DNA Sequencing: Optimizing the Process and AnalysisEdited by Jan Kieleczawa©2005 Jones and Bartlett Publishers 131

KOO10 5/31/04 12:17 PM Page 131

Page 2: 10 Geospiza’s Finch-Server: A Complete Data Management

unlike custom systems that require a long development process and sig-nificant time investment on the part of the laboratory.

This chapter describes the software requirements of different types ofsequencing laboratories and discusses how Finch-Servers help those laboratories meet their needs.

The Geospiza Finch-Server

A Finch-Server is a Web application with selected components of theFinch-Suite and a relational database management system (RDBMS),either Oracle (Redwood Shores, CA) or Solid (Mountain View, CA),installed on a computer with a UNIX or Linux operating system. Techni-cians and researchers use Web browsers such as Internet Explorer orNetscape to interact with the Finch-Server through a local intranet, or, inthe case of geographically dispersed sites, over a secure Internet connec-tion. Finch-Servers operate either as stand-alone systems or within a dis-tributed computing environment.

Finch-Systems are Finch-Servers designed for different types of labo-ratories and different data processing requirements. Each Finch-Systemincludes a Finch-Server along with selected components of the Finch-Suite, a set of integrated software modules designed to support datamanagement and analysis. These include the: Sequencing RequestManager, Instrument Manager, Chromatogram Manager, AssemblyManager, Data Repository Manager, and Basic Local Alignment SequenceTool (BLAST) Manager. Components of the Finch-Suite are organizedwithin the Finch Core DNA Sequencing System (Figure 10.1), and theFinch Assembly and BLAST Systems (Figure 10.2). The Complete FinchDNA Sequencing System contains all of the Finch-Suite components.

Many laboratories need to store large numbers of data files andsequence assemblies. Customers have reported storing well over half amillion chromatogram files in the Finch-Server and generating sequenceassemblies from over 60,000 reads. These systems also provide a way forresearchers to maintain their original, unprocessed data files, thus makingit possible to apply new base-calling technology or other analysis tools toolder data and data from varied sources. Storage media can be added toaccommodate an increasing quantity of data over time. The Finch-Server,therefore, is able to provide a robust, scalable system, for storing chromatogram files, sequence assemblies, sequence databases, andBLAST results.

Much of the data stored in the Finch-Server are presented in tablesthat can be sorted by column. For example, one can easily sort reads bythe number of vector clones by selecting the appropriate column heading.Customized data views also can be generated using data browsers that

132 Porter, Slagel, and Smith

KOO10 5/27/04 5:25 PM Page 132

Page 3: 10 Geospiza’s Finch-Server: A Complete Data Management

quickly extract selected information. A laboratory manager might use thedata browser to view all the sequencing runs performed with a certaininstrument. A researcher setting up an assembly might select all of thereads from a genomic library that match E. coli and use the move icon totransfer them to a “trash” folder, thus keeping E. coli sequences out of theassembly.

Further, Finch-Servers store information in a relational database man-agement system, allowing users to obtain selected information inresponse to an SQL (Structured Query Language) statement. The capac-ity to perform customized queries using SQL and obtain customizedreports and information is a powerful tool for additional research, processmanagement, and project oversight.

Data Management for Core Facilities

Managing Sequencing Requests

Core laboratories in academic institutions and biotechnology companieshave a variety of general and customized requirements that softwaresystems must be able to meet. First, the laboratory’s customers need a con-venient method for submitting work requests. This is accomplished in theFinch-Server through Web forms that allow customers to make requestsusing their own desktop computers. Core lab customers use the Finch-Sequencing Request Manager to enter experimental information forexperiments performed on different scales. This information is stored inthe database, allowing retrieval at a later date. Web forms, designed toaccommodate tubes, 96- or 384-well plates, or batch modes, simplify dataentry and sample naming (Figure 10.1A). A configurable, automatednaming system assigns unique names to each sample. For example,samples that are used for sequence assembly must be named in a certainway to be compatible with the Phrap assembly program (6). The Sequenc-ing Request Manager helps customers name samples appropriately witha minimal amount of extra work.

Sample Tracking

Service laboratories also require the ability to track samples through eachstage of the sequencing process. Not only can laboratory personnel deter-mine where samples are located in a queue, the Finch-Server allows cus-tomers to monitor the status of their own samples over the Internet,saving time spent on the phone. At the end of the sequencing process, theFinch-Server simplifies data delivery. Customers can either view (andstore) their data in the Finch-Server or download the data to their desktopcomputer.

Chapter 10. The Finch-Server Data Management System 133

KOO10 5/27/04 5:25 PM Page 133

Page 4: 10 Geospiza’s Finch-Server: A Complete Data Management

Fin

ch(a

)

(b)

(g)

134

KOO10 5/27/04 5:25 PM Page 134

Page 5: 10 Geospiza’s Finch-Server: A Complete Data Management

(c)

(e)

(d)

(f)

Figu

re 1

0.1.

Th

e Fi

nch

Cor

e D

NA

Seq

uen

cin

g S

yste

m.(

a) A

web

pag

e fr

om t

he F

inch

-Seq

uenc

ing

Req

uest

Man

ager

. Par

t of

the

form

for

sub

mit

ting

a r

eque

st t

o se

quen

ce s

ampl

es f

rom

a 9

6 w

ell

plat

e is

sho

wn.

Sin

gle

wel

ls c

an b

e se

lect

ed,

or m

ulti

ple

wel

ls,

chos

en b

y se

lect

ing

colu

mn

or r

ow h

ead

ings

. (b)

The

Fin

ch-C

hrom

atog

ram

Man

ager

Fol

der

Rep

ort.

Thi

s re

port

sum

mar

izes

qua

lity

and

sel

ecte

d s

tati

stic

s fo

r al

l th

e re

ads

in a

n in

div

idua

l fo

lder

, inc

lud

ing

the

num

ber

of s

eque

nces

tha

t m

atch

vec

tor

sequ

ence

s, E

.co

li, o

r ot

her

sele

cted

seq

uenc

es. (

c) T

he C

hrom

atog

ram

Det

ails

rep

ort,

also

fro

m t

he C

hrom

atog

ram

Man

ager

. Qua

lity

scor

es, f

rom

Phre

d, a

re s

how

n fo

r ea

ch b

ase

in th

e re

ad, a

long

wit

h th

e se

quen

ce a

nd o

ther

info

rmat

ion.

Lin

ks a

re a

vaila

ble

for

view

ing

the

trac

efil

e (d

) an

d/

or d

ownl

oad

ing

the

dat

a. (

d)

Ach

rom

atog

ram

tra

ce fi

le t

oget

her

wit

h a

plot

of

Phre

d q

ualit

y sc

ores

. (e)

The

Seq

uenc

-in

g R

un D

etai

ls r

epor

t fr

om t

he F

inch

-Ins

trum

ent

Man

ager

. T

he r

ead

len

gth

(dar

ker

colo

r) a

nd s

eque

nce

qual

ity

(lig

ht c

olor

) ar

esh

own

on th

e y

axis

and

the

ind

ivid

ual c

apill

arie

s an

d/

or la

nes,

on

the

x ax

is. T

he r

egul

arly

rep

eati

ng b

lock

s of

poo

r qu

alit

y sa

mpl

es(i

ndic

ated

by

a d

ark

colo

r) r

esul

t fr

om a

plu

gged

fan

g in

the

seq

uenc

ing

inst

rum

ent.

(f)

The

Ins

trum

ent

Cap

illar

y U

sage

rep

ort.

Ave

rage

seq

uenc

e qu

alit

y is

rep

rese

nted

by

colo

rs r

angi

ng f

rom

bri

ght

gree

n (b

est)

to

blac

k (p

oor)

. Run

dat

es a

re s

how

n at

the

top

,fr

om r

ight

to

left

. Pro

blem

cap

illar

ies

(or

sam

ples

) ar

e sh

own

by t

he b

lack

box

es in

the

firs

t tw

o co

lum

ns. (

g) T

he F

inch

Cor

e D

NA

Sequ

enci

ng S

yste

m o

verv

iew

. L

abor

ator

ies

subm

it r

eque

sts

usin

g th

e Se

quen

cing

Req

uest

Man

ager

. T

he l

ab u

ses

the

Inst

rum

ent

Man

ager

to s

et u

p sa

mpl

e sh

eets

and

dow

nloa

ds

them

to th

e se

quen

cing

inst

rum

ent.

Dat

a ar

e lo

aded

into

the

Chr

omat

ogra

m M

anag

eran

d t

he c

usto

mer

s ge

t th

e re

sult

s.

135

KOO10 5/27/04 5:25 PM Page 135

Page 6: 10 Geospiza’s Finch-Server: A Complete Data Management

Security

Core facilities often have clients from different laboratories, making itimportant to control access to data. Customers, on the other hand, wantto share data with collaborators and other members of their laboratory,yet simultaneously prevent data access by competitors. These require-ments are further complicated by the needs of the core facility. Personnelin the core facility must be able to view all of the data in order to monitorthe quality of the laboratory’s work. This problem is solved in the Finch-Server with a user hierarchy that allows administrators and technicianswider access than the facility’s customers and assigns researchers to spe-cific lab groups. All researchers within a lab group are able to view databelonging to that group but are prevented from viewing data owned byothers. Only the facility administrator can add or delete researchers froma lab group by the facility director. As a result, researchers can share datathrough the Web in a protected manner with their collaborators andselected colleagues, while laboratory personnel can monitor data qualitywithout compromising proprietary information.

Preparation of Sample Sheets and Managing Instrument-Related Data

After a customer has submitted a work request to a service laboratory, thelaboratory needs to review requests and organize workflow. Techniciansuse the Finch Instrument Manager to create sample sheets in the Finch-Server by combining samples from different work requests to produce theoptimum number of samples for each sequencing instrument. Finch-Servers support all of the widely used sequencing instruments, includingthose from ABI instruments (Applied Biosystems, Foster City, CA), theMegaBACE (Amersham, Piscataway, NJ), the Beckman CEQ2000(Beckman, Fullerton, CA), and others. Laboratory technicians downloadsample sheets to their sequencing instrument and complete the sequenc-ing run. Instrument-related service information and serial numbers alsocan be stored in the Instrument Manager.

Management of Chromatogram Files and Quality Control

The Finch Chromatogram Manager provides a useful tool for storingsequences and chromatogram-related information. Chromatogram filesfrom a completed sequencing run are uploaded and stored in the Chromatogram Manager. The Chromatogram Manager also can be usedto store individual files or packages of related chromatogram files that havebeen downloaded from public or private databases through the Internet.

Chromatogram file data are checked by the Finch-Server during the upload process to prevent accidental loading of duplicate

136 Porter, Slagel, and Smith

KOO10 5/27/04 5:25 PM Page 136

Page 7: 10 Geospiza’s Finch-Server: A Complete Data Management

chromatograms. This feature allowed one customer to diagnose an unsus-pected problem in their instrument software. The instrument softwareapparently created two files with different file names but identical chro-matogram data. This bug was confirmed when the customer checked withthe instrument vendor. Once in the Finch-Server, the chromatogram filesenter a data processing pipeline. Some of the steps in this pipeline include:base-calling, quality measurement, and quality trimming with eitherPhred (3–6), TraceTuner (Paracel, Inc.), or the KB base caller (AppliedBiosystems); identification of vector sequences with Cross Match (6); andthe generation of graphical reports that summarize run information andstatistics for sets of sequences (examples are shown in Figure 10.1b–d).Data quality analysis can be customized using pipelines to select the base-caller depending on the sequencing instrument, or laboratory.

Sequencing Instrument Performance

Instrument reports provide feedback to laboratories about the status ofeach run and the performance of each sequencing instrument. Blockedliquid transfer devices or problem capillaries can be quickly identified byviewing graphical reports (Figure 10.1, e, f). The ability to monitor cap-illary performance for changes in data quality over time helps laborato-ries determine when instruments require service. Service information andidentification numbers are easily stored in the Finch-Server, providing ameans to quickly view the service and repair history for any sequencinginstrument.

Billing

Shared facilities need to monitor use and bill customers accordingly, in atimely fashion. Billing numbers can be stored in the Finch-Server whenwork requests are submitted. Structured Query Language (SQL) state-ments can be used in conjunction with billing numbers or other identifi-cation, to generate reports that define how resources are used, the amountof work performed, who requested the work, and the request date. Thesereports can be generated automatically, through custom services, anddelivered to accounting departments.

Data Management for Large-Scale Sequencing

Laboratories engaged in genomic and production sequencing need addi-tional features in a data management system. Production sequencing,

Chapter 10. The Finch-Server Data Management System 137

KOO10 5/27/04 5:25 PM Page 137

Page 8: 10 Geospiza’s Finch-Server: A Complete Data Management

expressed sequence tags (ESTs) clustering, single-nucleotide polymor-phism (SNP) discovery, and genomic sequencing require the ability toassemble sequences; set up, maintain, and update sequence databases;pipelines can be designed specifically for automated data processing, fil-tering, and BLAST searches (1). These additional requirements can be metby stand-alone Finch systems, such as the Finch Assembly System and/orthe Finch BLAST System; or with the Complete Finch DNA SequencingSystem, which includes all of the components of the core system in addi-tion to the Finch-Assembly Manager, the Finch-Data Repository Manager,and the Finch-BLAST Manager. An overview of these components isshown in Figure 10.2.

DNA Sequence Assembly

The Finch-Assembly Manager works with the Chromatogram Manager toassemble sets of reads into longer, contiguous sequences. Folders withspecific sets of reads are created and organized in the ChromatogramManager. To assemble sequences, one chooses the appropriate folder andassembly program, and presses a button to start the assembly. The Assem-bly Manager provides different options for sequence assembly, includingPhrap (http://www.phrap.org; Green, International Human GenomeSequencing Consortium, 2001), a parallel version, SPS-Phrap (SouthwestParallel Software, Albuquerque, NM, USA), or both.

The Finch-Server’s Relational Database Management System(RDBMS) stores all the parameters and details of the assembly, making iteasy to repeat the assembly later with new or additional data. Sequenceassemblies also can be treated as a type of experiment and performed withsets of sequences and different parameters. Becuase all of the assemblyresults and conditions are stored in the RDBMS, the optimum parametersare easily located for later work.

Phrap assembles reads by comparing pairs of sequences, locatingoverlapping regions, and using quality information from the base-callerto help reconstruct the original sequence. The assembly can be improvedby incorporating additional information about each read. For example,Phrap qualities—those used to build the final contig—incorporate infor-mation about the relative orientation of each read, and the sequencingchemistry. If a base has been confirmed by an alternate method (sequenc-ing the opposite strand or using a different type of chemistry), Phrapassigns a higher quality value to the base at that position. The highestquality bases are used to build a mosaic, which represents the sequenceof the contig.

Information from each assembly is provided in tables and graphicalreports. Tables include information about the number of reads in eachcontig, the location where each read aligns, and links to the read sequence,

138 Porter, Slagel, and Smith

KOO10 5/27/04 5:25 PM Page 138

Page 9: 10 Geospiza’s Finch-Server: A Complete Data Management

Finch(a)

(b) (c) (d)

Figure 10.2. The Finch-Assembly and BLAST Systems. (a) The Finch-Assembly Manager uses read data, stored in the Chromatogram Manager, to setup sequence assemblies. The Finch-Data Repository Manager maintains currentversions of public and private sequence databases. BLAST searches of selecteddatabases are conducted with the Finch-BLAST Manager. Query sequences can bepasted in a form, uploaded from a file, or selected from either the ChromatogramManager or the Assembly Manager. Data processing pipelines can be set up toperform queries in an automated fashion. (b) The Assembly Set Details reportfrom the Assembly Manager. This report provides information from the assemblyalong with a graph showing the number of reads per kilobase (y axis) for differ-ent length contigs (x axis). (c) A contig report from the Assembly Manager showsthe Phrap quality across the length of the contig (top), sequence discrepancies asa function of quality (middle), and the number of sequences mapping to each partof the contig (bottom). (d) A screenshot from the BLAST Manager shows BLASTparameters and a table with the results.

139

KOO10 5/27/04 5:25 PM Page 139

Page 10: 10 Geospiza’s Finch-Server: A Complete Data Management

quality values, and trace file. Chimera reports are provided to help diag-nose experimental artifacts. If further information is needed, say the locations of potential deletion clones, the Phrap output file can be downloaded for viewing.

Figure 10.2 b and c show two of the Assembly Manager’s graphicalreports. For this experiment, reads from several ESTs were obtained fromthe Washington University Genome Center (St. Louis, MO), stored in theChromatogram Manager, and assembled with the Assembly Manager. Agraph of the assembly results (Figure 10.2b) shows the number of readsper kb (y axis) versus the contig length (x axis). Because the reads usedin this assembly were obtained from ESTs, sequences with a large numberof reads/kb were likely to represent highly expressed genes, repetitivesequences, mitochondrial sequences, or ribosomal RNA. If these resultswere obtained from assembling reads from a genomic sequencing project,a large number of reads/kb might help flag problem contigs and findassembly problems due to repetitive sequences.

Graphical reports also provide an overview of the Phrap qualityvalues for each contig, the coverage depth, and positions with discrepantbases. High-quality sequence discrepancies are a strong indicator ofpotential polymorphisms and valuable in SNP discovery (2).

Maintaining Updated Sequence Databases

Many companies and sequencing centers have found it helpful to main-tain local copies of sequence databases. Not only can the searches be fasterwhen using local resources, data are protected because the searches areperformed in a secure environment. Researchers don’t have to take riskswith valuable information by sending proprietary sequences over theInternet. Further, local searches allow one to search proprietary or customdatabases that aren’t available over the Internet.

To meet this need, Geospiza developed the Finch BLAST System(Figure 10.2), which allows researchers to maintain up-to-date versions ofany sequence database and use BLAST as a search tool. The content inpublic and private sequence databases changes on a daily basis, makingit difficult for companies and research institutions to keep local versionscurrent. The Finch-Data Repository Manager operates through acommand line interface and updates local copies of databases on a regularschedule. Users specify which sequence data to retrieve from remote orlocal sites and designate where the files should be stored. When newsequences are retrieved, databases are automatically updated and log filesare generated. E-mail reports instantly notify users that updates haveoccurred. This system also enables researchers to maintain multiple data-bases and multiple versions of each database.

140 Porter, Slagel, and Smith

KOO10 5/27/04 5:25 PM Page 140

Page 11: 10 Geospiza’s Finch-Server: A Complete Data Management

The Finch-BLAST Manager

The Finch-BLAST Manager is used to perform BLAST searches of eithernucleotide or protein sequence databases (Figure 10.2d). Researchers cansearch one or more databases, concurrently, with single or multiplesequences. Several of the BLAST programs are available in the BLASTManager, including blastn, blastp, blastx, tblastx, and tblastn. The BLASTmanager stores the search results and parameters, thus allowing compar-isons to be made between database searches at different points of time.

Data processing pipelines can be employed by users to assemble readsstored in the Chromatogram Manager, and automated BLAST searches ofselected databases performed with assembled sequences or individualreads. For example, a BLAST search can be carried out whenever updatesoccur in a specific database. BLAST can be employed also for qualitycontrol. In one example, a batch of clinical samples from polymerase chainreaction (PCR) assays was mixed up by a commercial sequencing lab. Theproblem was uncovered by constructing a custom database of theexpected PCR products and comparing the reads with the expected prod-ucts using BLAST.

Filters also are available in the BLAST Manager that allow users tomask low complexity sequences that might obscure search results. Repeti-tive sequences can be masked by adding RepeatMasker (Geospiza, Inc.,Seattle, WA, USA) to the system. The addition of filters permits users toset up powerful screening systems designed to enhance data mining,genome sequencing, SNP discovery, or other applications.

Data Management Over the Internet: iFinch

iFinch is a Finch-Server designed for individual researchers and/orsmaller laboratories with short term sequencing projects. Subscriptions toiFinch allow small laboratories immediate access to integrated data man-agement tools without making a long-term investment. Unlike otherFinch-Servers, Geospiza acts as the system administrator for iFinch,ensuring for example, that data are backed up on a regular basis, hard-ware is kept up to date, and that the system runs smoothly. Remote accessallows laboratories to store data off-site, securely, minimizing the risk ofdata loss.

References

1. Altschul, S.F., Madden, T.L., Schäffer, A.A., et al. 1997. Gapped BLAST andPSI-BLAST: a new generation of protein database search programs. NucleicAcids Res 25: 3389–3402.

Chapter 10. The Finch-Server Data Management System 141

KOO10 5/27/04 5:25 PM Page 141

Page 12: 10 Geospiza’s Finch-Server: A Complete Data Management

2. Clifford R., Edmonson, M., Hu, Y., et al. 2000. Expression-based genetic/physical maps of single-nucleotide polymorphisms identified by the cancergenome anatomy project. Genome Res 8: 1259–1265.

3. Ewing, B., and Green, P. 1998. Base-calling of automated sequencer tracesusing Phred. II. Error probabilities. Genome Res 8: 186–194.

4. Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of auto-mated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8:175–185.

5. Green, E. 2001. Strategies for the systematic sequencing of complex genomes.Nature Rev Genet 2: 573–583.

6. International Human Genome Sequencing Consortium. 2001. Initial sequenc-ing and analysis of the human genome. Nature 409: 813–958.

142 Porter, Slagel, and Smith

KOO10 5/27/04 5:25 PM Page 142