Advisory Board Meeting, CSHL 2005 Developments at Sanger Anthony Rogers Wellcome Trust Sanger Institute.

Advisory Board Meeting, CSHL 2005

Developments at Sanger

Anthony Rogers Wellcome Trust Sanger Institute.

Overview

• The build procedure
• Stats for the year
• Team changes
• Model changes
  - “new gene model”
  - Variation
• Future plans
  - InterPro
  - improved mapping of data to genes
  - move off wormsrv2
  - new nematodes
  - new data types

The WormBase Consortium

Wellcome Trust Sanger Institute
Cold Spring Harbor Laboratory
Washington University in St. Louis
California Institute of Technology

● RNAi
● Microarray
● Anatomy / Cell
● Homology groups (KOGs)
● SAGE data
● Gene Ontology
● Papers / References
● Person / Author
● Detailed Functional Annotation

● Gene prediction annotation
● SNPs

● PCR_products / Oligos
● 3D structures
● Yeast 2 Hybrid interactions

Website and tools

● Gene prediction annotation
● Genetic Data
● Alleles
● Gene name info (incl. unique ids)
● Strains

Data Integration and analysis

Build Overview

[Diagram: data from CalTech, Sanger, CSHL, WashU and EMBL flow into WormBase.]

• Align all cDNAs and build transcripts
• Map experimental data, e.g. RNAi, oligos, alleles
• WORMPEP and DNA (mysql) on the Sanger Compute Farm: blastx, blastp, RepeatMasker, Pfam, tmhmm etc.
• Load homology data
• Export GFF, agp, DNA files
• Build release files
• To FTP site and CSHL Dev site

Release cycle

From WS124 (March 2004) to WS150 (October 2005): 26 releases.

All but 2 of these were on schedule. The two late releases were caused by Sanger-wide systems problems associated with the move to a new building.

After WS134 we changed (with SAB approval) to a three-weekly cycle.

If releases were on time, why?

• Increases in data meant a gradual increase in build time.

• Lots of releases were “just in time”.

• Time pressure meant that fixes weren’t being made properly.

• Reduced staff meant that less development was being done.

Gene stats

More polyA / TSL etc and fixing BLAT errors

Experimental Data Stats I

New data class

Experimental Data Stats II

Incorporation of genome-wide experiments

Other classes of interest

InParanoid

Staff Changes

Mary Ann Tuli

Gary Williams

• Great improvement in documentation of procedures.
  - Gene structure curation
  - Allele curation
  - genetic map functions in acedb
  - Sequence feature annotation (polyA, TSL)

• Fresh view of methods for doing things.

Keith Bradnam

Chao-Kung Chen

Dan Lawson

Michael Han

“Where is the new Gene model Keith!?!”

The problem

• Worm genes first existed as Locus objects
  - e.g. dpy-1
• Then genes existed as Sequence objects
  - e.g. F31D4.3
• Some genes exist as both Locus and Sequence objects
• Gene names change… a lot!

The Plan

[Diagram: the Locus and Sequence views of one gene are brought together under a single Gene object, WBGene0000001.]

• Main CGC name: ptp-3
• Sequence name: C09D8.1
• Other names: ptp-1, ypp-1, YPP/1
• CDS: C09D8.1a, C09D8.1b (ptp-3a, ptp-3b)

Linking to a gene

[Diagram: Paper [cgc4265], Antibody, Allele and RNAi result objects link to Gene WBGene0000001. Names shown on the Gene: C09D8.1, ptp-3, ptp-1, ypp-1, YPP/1, abc-1; CDSs C09D8.1a, C09D8.1b, C09D8.1c; ptp-3a, ptp-3b.]
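The idea behind these two slides, one stable Gene ID that owns every name and that other objects link to, can be sketched roughly as below. This is an illustration only and not the ACeDB model: the class `Gene`, its field names and the helper `build_name_index` are assumptions, and the assignment of names to fields is one reading of the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Gene:
    """A single Gene object that owns every name the gene has had."""
    gene_id: str                          # stable identifier; never changes
    cgc_name: Optional[str] = None        # main CGC name (hypothetical field)
    sequence_name: Optional[str] = None   # sequence-based name
    other_names: Set[str] = field(default_factory=set)
    cds: List[str] = field(default_factory=list)

    def all_names(self) -> Set[str]:
        """Return every name under which this gene can be looked up."""
        names = set(self.other_names) | set(self.cds)
        for name in (self.cgc_name, self.sequence_name):
            if name:
                names.add(name)
        return names


def build_name_index(genes: Iterable[Gene]) -> Dict[str, str]:
    """Map every known name to the stable gene ID it belongs to."""
    index: Dict[str, str] = {}
    for gene in genes:
        for name in gene.all_names():
            index[name] = gene.gene_id
    return index


# Example values taken from the slide's diagram.
gene = Gene(
    gene_id="WBGene0000001",
    cgc_name="ptp-3",
    sequence_name="C09D8.1",
    other_names={"ptp-1", "ypp-1", "YPP/1"},
    cds=["C09D8.1a", "C09D8.1b"],
)
index = build_name_index([gene])

# Other objects (papers, alleles, RNAi results) store only the stable ID,
# so a later rename does not break the link.
rnai_result = {"gene": index["ypp-1"]}
gene.cgc_name = "abc-1"  # hypothetical rename
```

Because `rnai_result` holds the ID rather than a name, it still points at the same gene after the rename; only the name index needs rebuilding.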

Progress!

• The (no longer new) Gene model is in place.
• All Genes now have Gene_ids.
• Gene history tracking info is stored
  - merges, splits etc.
• The next part of the plan was to have a central database serving IDs.

Working version

• Sanger “single sign-on”
• User-specific operations
• Operation selection
• Not just WBGene_ids: Variation, RNAi, Person
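A central ID server with history tracking might look something like the following in-memory sketch. It is not the Sanger implementation: the class `IdServer`, its methods, the zero-padding width and the history tuple layout are all assumptions made for illustration, and sign-on and per-user permissions are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class IdServer:
    """Hands out sequential stable IDs and logs what happens to them."""
    prefix: str = "WBGene"   # other classes would use their own prefix
    width: int = 8           # zero-padding width (an assumption)
    next_number: int = 1
    live: Dict[str, bool] = field(default_factory=dict)
    history: List[Tuple[str, ...]] = field(default_factory=list)

    def new_id(self) -> str:
        """Allocate the next unused ID and record its creation."""
        new = f"{self.prefix}{self.next_number:0{self.width}d}"
        self.next_number += 1
        self.live[new] = True
        self.history.append(("created", new))
        return new

    def merge(self, keep: str, retire: str) -> None:
        """Merge two live genes; the retired ID is kept but marked dead."""
        if keep == retire:
            raise ValueError("cannot merge a gene with itself")
        if not (self.live.get(keep) and self.live.get(retire)):
            raise ValueError("both genes must be live to merge")
        self.live[retire] = False
        self.history.append(("merged_into", retire, keep))

    def split(self, parent: str) -> str:
        """Split a live gene; the new part gets a fresh ID."""
        if not self.live.get(parent):
            raise ValueError("only a live gene can be split")
        child = self.new_id()
        self.history.append(("split_from", child, parent))
        return child


server = IdServer()
gene_a = server.new_id()
gene_b = server.new_id()
server.merge(gene_a, gene_b)    # gene_b is retired, never reused
gene_c = server.split(gene_a)   # gene_c gets a brand-new ID
```

Retired IDs stay in `live` with a false flag so that an old link can still be resolved through `history` rather than silently pointing at a different gene.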

Variation Model

[Diagram labels: Locus; Classical Genes; Gene Clusters; Allele; SNPs; Deletions; Transposon_insertions; Variation.]

• Lots of shared data structures (Tags), e.g. mapping data, names, connections to CDSs
• Greater code efficiency and manageability for both build and web
• Easier to search
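One way to picture the benefit: if alleles, SNPs, deletions and transposon insertions share the same tags, a single class can hold them all and one query covers every kind. The sketch below assumes that reading of the diagram; `Variation`, `VariationType`, `find_by_cds` and the example object names and coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, List, Optional


class VariationType(Enum):
    ALLELE = "Allele"
    SNP = "SNP"
    DELETION = "Deletion"
    TRANSPOSON_INSERTION = "Transposon_insertion"


@dataclass
class Variation:
    """One class for every kind of variation, sharing the common tags."""
    name: str
    kind: VariationType
    chromosome: Optional[str] = None   # mapping data
    start: Optional[int] = None
    end: Optional[int] = None
    cds: List[str] = field(default_factory=list)  # connections to CDSs


def find_by_cds(variations: Iterable[Variation], cds_name: str) -> List[Variation]:
    """A single search that works across all variation types."""
    return [v for v in variations if cds_name in v.cds]


# Made-up example objects, for illustration only.
variations = [
    Variation("example_snp", VariationType.SNP, "II", 100, 100, ["C09D8.1a"]),
    Variation("example_del", VariationType.DELETION, "II", 250, 400, ["C09D8.1a"]),
    Variation("example_tn", VariationType.TRANSPOSON_INSERTION, "X", 50, 50, ["F31D4.3"]),
]
hits = find_by_cds(variations, "C09D8.1a")
```

With separate classes per type, the same search would need one query per class and code to combine the results.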

Imminent arrivals and the Future

• InterPro

• Refined Mapping

• Moving build machine

• New nematodes

• New data types

InterPro

• Useful data used in many other resources, so a good point of reference for non-worm specialists.

• We previously got ours from UniProt or ad hoc from St Louis.

• Many databases are covered by InterPro.
  - Prosite, Prints, Pfam, SMART, PIRSF, etc.

• The usual way of searching for database hits is InterProScan, but this is incompatible with the Sanger farm.

• We run each database search individually, using the existing architecture from BLAST etc., and store the results.

• We merge hits with the same InterPro ID.

Merging hits from databases

[Diagram: hits from Pfam, Prints, Prosite and Pirsf along a protein, merged into InterPro entries IPR006209, IPR008401 and IPR006502.]

Results similar but not identical to iprscan
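The merging step could be sketched as follows. The slides only say that hits with the same InterPro ID are merged; joining only overlapping spans, the tuple layout and the function name `merge_hits` are assumptions here, and the coordinates in the example are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (member database, InterPro ID, start, end) on one protein
Hit = Tuple[str, str, int, int]


def merge_hits(hits: List[Hit]) -> Dict[str, List[Tuple[int, int]]]:
    """Group hits by InterPro ID and join spans that overlap."""
    by_entry: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for _database, interpro_id, start, end in hits:
        by_entry[interpro_id].append((start, end))

    merged: Dict[str, List[Tuple[int, int]]] = {}
    for interpro_id, spans in by_entry.items():
        spans.sort()
        joined = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = joined[-1]
            if start <= last_end:
                # Overlaps the previous span: extend it.
                joined[-1] = (last_start, max(last_end, end))
            else:
                joined.append((start, end))
        merged[interpro_id] = joined
    return merged


# InterPro IDs are from the slide; coordinates are made up.
example_hits = [
    ("Pfam", "IPR006209", 10, 60),
    ("Prints", "IPR006209", 40, 80),
    ("Prosite", "IPR008401", 100, 150),
    ("Pirsf", "IPR006209", 200, 250),
]
domains = merge_hits(example_hits)
```

Here the Pfam and Prints hits collapse into one span for IPR006209, while the distant Pirsf hit stays separate. A different overlap rule would give slightly different counts.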

InterPro hits per protein

[Histogram: number of proteins (y-axis, 0 to 7500) against number of InterPro domains per protein (x-axis, 0 to 22).]

15 proteins with >100 domains (max. 186)

Improved Mapping of Variations to Genes

We can describe much more accurately how a mutation affects a gene:
- donor and acceptor splice sites
- introns / exons
- motifs like polyAs and TSLs

... and for coding changes give the amino acid differences.

Variations
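As a rough illustration of this kind of mapping, the sketch below classifies a single position against the exons of one transcript. It is not the WormBase code: it assumes forward-strand, 1-based inclusive coordinates, treats the first and last two bases of an intron as the donor and acceptor sites, and the function name `classify_position` is hypothetical.

```python
from typing import List, Tuple


def classify_position(pos: int, exons: List[Tuple[int, int]]) -> str:
    """Classify a position against sorted (start, end) exons.

    Coordinates are 1-based and inclusive, forward strand only.
    """
    for start, end in exons:
        if start <= pos <= end:
            return "exon"

    # Each intron lies between one exon's end and the next exon's start.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron_start, intron_end = left_end + 1, right_start - 1
        if intron_start <= pos <= intron_end:
            if pos - intron_start < 2:
                return "donor splice site"
            if intron_end - pos < 2:
                return "acceptor splice site"
            return "intron"

    return "outside transcript"


# Made-up transcript: two exons with an intron at 201..300.
exons = [(100, 200), (301, 400)]
labels = [classify_position(p, exons) for p in (150, 201, 250, 300, 50)]
```

A reverse-strand transcript would swap the donor and acceptor ends, and motifs such as polyA sites or TSLs would need their own feature coordinates.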

sra-9

ttc → tta (F → L)

Currently the only connection is to the Gene.

In future we will specify that the SNP is in coding sequence and that it causes a specified amino acid change.

Described by Tags in the database, so searchable.

Predicted snp_AH6[1]
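The amino acid change on this slide can be worked through with the standard genetic code. The snippet below is a sketch for illustration; the names `CODON_TABLE` and `describe_codon_change` are not from the WormBase code base.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_codon_change(ref_codon: str, alt_codon: str):
    """Return (reference amino acid, new amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon.lower()]
    alt_aa = CODON_TABLE[alt_codon.lower()]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


change = describe_codon_change("ttc", "tta")  # ("F", "L", "missense")
```

Storing the three values as separate tags would make queries such as "all SNPs that introduce a stop codon" straightforward.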

Implementation

• GFF data: exons, introns, transcripts, SNPs, alleles etc.
• One table per chromosome (I, II, III, IV, V, X), so all can be loaded together
• All chromosomes can be run in parallel
• cbi1 = 3 x 2cpu
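Because each chromosome has its own table, the per-chromosome work is independent and can be farmed out concurrently. A minimal sketch, assuming a placeholder task that just counts feature types: `map_chromosome` and `map_all_chromosomes` are hypothetical names, and threads are used only to keep the example self-contained (real CPU-bound mapping would more likely use separate processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

CHROMOSOMES = ["I", "II", "III", "IV", "V", "X"]

# (feature type, start, end), as might be read from a GFF file
Feature = Tuple[str, int, int]


def map_chromosome(chrom: str, features: List[Feature]) -> Tuple[str, Dict[str, int]]:
    """Placeholder for the per-chromosome job: count features by type."""
    counts: Dict[str, int] = {}
    for feature_type, _start, _end in features:
        counts[feature_type] = counts.get(feature_type, 0) + 1
    return chrom, counts


def map_all_chromosomes(
    features_by_chrom: Dict[str, List[Feature]], workers: int = 6
) -> Dict[str, Dict[str, int]]:
    """Run one independent job per chromosome and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(map_chromosome, chrom, features_by_chrom.get(chrom, []))
            for chrom in CHROMOSOMES
        ]
        return dict(future.result() for future in futures)


# Made-up input for two of the six chromosomes.
example = {
    "I": [("exon", 1, 10), ("SNP", 5, 5)],
    "X": [("exon", 1, 3)],
}
results = map_all_chromosomes(example)
```

Six workers match the six chromosomes; no job needs data from another, so there is nothing to synchronise beyond collecting the results.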

Death of wormsrv2

5 years ago: Sanger network = bad

Bought a shiny, fast new computer

It has become too slow, and its isolation is a pain

Now: Sanger network = good!

Move to the informatics cluster: fast and parallel

Means modifying the majority of the code base

New nematodes

New nematode genomes

• C. briggsae is a forerunner . . .
  - semi-curated geneset
  - brigpep2
  - protein annotation (PFAM, tmhmm, signalp)
  - ortholog assignment (InParanoid, Erik Sonnhammer)
  - blastp
  - blastx
  - waba (Jim Kent’s genome alignment tool)

• We intend to do all of this for each of the new genomes.

• Mostly done for C. remanei

New Data Types

Any new data types impact on the build:
- new model development
- scripts to integrate and check the data

E.g. mass spec data: we have been in contact with Gennifer Merrihew.