Advisory Board Meeting, CSHL 2005
Developments at Sanger
Anthony Rogers Wellcome Trust Sanger Institute.
Overview
• The build procedure
• Stats for the year
• Team changes
• Model changes
- “new gene model”
- Variation
• Future plans
- InterPro
- improved mapping of data to genes
- move off wormsrv2
- new nematodes
- new data types
Wellcome Trust Sanger Institute
Cold Spring Harbor Laboratory
Washington University in St. Louis
California Institute of Technology
● RNAi
● Microarray
● Anatomy / Cell
● Homology groups (KOGs)
● SAGE data
● Gene Ontology
● Papers / References
● Person / Author
● Detailed Functional Annotation
● Gene prediction annotation
● SNPs
● PCR_products / Oligos
● 3D structures
● Yeast 2 Hybrid interactions
Website and tools
● Gene prediction annotation
● Genetic Data
● Alleles
● Gene name info (incl. unique ids)
● Strains
Data Integration and analysis
The WormBase Consortium
Build Overview
CalTech, Sanger, CSHL, WashU
WormBase
EMBL
Align all cDNAs and build transcripts
Map experimental data, e.g. RNAi, oligos, alleles
mysql
WORMPEP
DNA
Sanger Compute Farm
Blastx, blastp, RepeatMask, PFAM, tmhmm etc.
Load homology data
Export GFF, agp, DNA files
Build release files
To FTP site and CSHL Dev site
Release cycle
From WS124 (March 2004) to WS150 (October 2005): 26 releases.
All but 2 of these were on schedule. Those that were late were delayed by Sanger-wide systems problems associated with moving to the new building.
After WS134 we changed (with SAB approval) to a three-weekly cycle.
If releases were on time, why change?
• Increases in data meant a gradual increase in build time.
• Lots of releases were “just in time”.
• Time pressure meant that fixes weren’t being made properly.
• Reduced staff meant that less development was being done.
Gene stats
More polyA / TSL etc and fixing BLAT errors
Experimental Data Stats I
New data class
Experimental Data Stats II
Incorporation of genome-wide experiments
Other classes of interest
InParanoid
Staff Changes
Mary Ann Tuli
Gary Williams
• Great improvement in documentation of procedures.
Gene structure curation
Allele curation
Genetic map functions in acedb
Sequence feature annotation (polyA, TSL)
• Fresh view of methods for doing things.
Keith Bradnam
Chao-Kung Chen
Dan Lawson, Michael Han
“Where is the new Gene model Keith!?!”
The problem
• Worm genes first existed as Locus objects
- e.g. dpy-1
• Then genes existed as Sequence objects
- e.g. F31D4.3
• Some genes exist as both Locus and Sequence objects
• Gene names change…a lot!
The Plan
[Diagram: the separate Locus and Sequence objects are combined into one Gene object, WBGene0000001. Labels shown: main CGC name ptp-3; sequence name C09D8.1; other names ptp-1, ypp-1, YPP/1; CDS C09D8.1a and C09D8.1b (ptp-3a, ptp-3b).]
Linking to a gene
[Diagram: Paper [cgc4265], Antibody, Allele and RNAi result objects all link to the Gene object WBGene0000001. Labels shown on the gene: C09D8.1, ptp-3, ptp-1, ypp-1, YPP/1, C09D8.1a, C09D8.1b, C09D8.1c, ptp-3a, ptp-3b, abc-1.]
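The linking idea on this slide, where every name resolves to one stable Gene identifier and other objects point at that identifier rather than at a name, can be sketched roughly as follows. This is a minimal Python illustration and not the acedb model itself: the `Gene` class, its field names and the `NameIndex` helper are hypothetical, and the identifier and names are simply those shown on the slide.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Gene:
    """Hypothetical stand-in for a Gene object; field names are assumptions."""
    gene_id: str                          # stable identifier
    cgc_name: Optional[str] = None        # main CGC name
    sequence_name: Optional[str] = None   # name derived from the sequence
    other_names: list = field(default_factory=list)
    cds: list = field(default_factory=list)

    def all_names(self) -> list:
        names = [self.cgc_name, self.sequence_name, *self.other_names, *self.cds]
        return [name for name in names if name]


class NameIndex:
    """Maps any known name of a gene to its stable identifier."""

    def __init__(self) -> None:
        self._by_name: dict = {}

    def add(self, gene: Gene) -> None:
        for name in gene.all_names():
            self._by_name[name] = gene.gene_id

    def resolve(self, name: str) -> Optional[str]:
        return self._by_name.get(name)


gene = Gene(
    gene_id="WBGene0000001",
    cgc_name="ptp-3",
    sequence_name="C09D8.1",
    other_names=["ptp-1", "ypp-1", "YPP/1"],
    cds=["C09D8.1a", "C09D8.1b"],
)
index = NameIndex()
index.add(gene)

# Illustrative links: each object stores the stable ID, whichever name it was
# curated under, so a later rename does not break the connection.
links = {
    "Paper cgc4265": index.resolve("ptp-1"),
    "RNAi result": index.resolve("C09D8.1a"),
}
```

Because the links hold the identifier, changing `cgc_name` later only requires updating the index, not every Paper, Allele or RNAi object.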
Progress!
• The (no longer new) Gene model is in place.
• All Genes now have Gene_ids.
• Gene history tracking info is stored
- merges, splits etc.
• The next part of the plan was to have a central database serving ids.
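A central service that hands out ids and records merges and splits might look something like the sketch below. It is an assumption-laden illustration, not the actual server: the `IdServer` class, the event strings and the zero-padding width are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IdServer:
    """Hypothetical central ID server: issues ids and records gene history."""
    prefix: str = "WBGene"
    width: int = 8            # zero-padding width is an assumption
    next_number: int = 1
    live: set = field(default_factory=set)
    history: dict = field(default_factory=dict)   # id -> list of event strings

    def new_id(self, event: str = "created") -> str:
        new = f"{self.prefix}{self.next_number:0{self.width}d}"
        self.next_number += 1
        self.live.add(new)
        self.history[new] = [event]
        return new

    def merge(self, keep: str, retire: str) -> None:
        """Merge `retire` into `keep`; the retired id is never reissued."""
        self.live.remove(retire)
        self.history[retire].append(f"merged_into {keep}")
        self.history[keep].append(f"acquired_merge {retire}")

    def split(self, parent: str) -> str:
        """Split a gene: the parent keeps its id and a new id is issued."""
        child = self.new_id(event=f"split_from {parent}")
        self.history[parent].append(f"split_into {child}")
        return child


server = IdServer()
first = server.new_id()
second = server.new_id()
server.merge(keep=first, retire=second)
third = server.split(first)
```

The `prefix` parameter is there because the same mechanism could serve ids for other classes; only the counter and the history log are essential to the idea.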
Working version
Sanger “single sign-on”
User-specific operations
Operation selection
Not just WBGene_ids: Variation, RNAi, Person
Variation Model
Locus
SNPs
Classical Genes
Gene Clusters
Allele
Deletions
Transposon_insertions
Lots of shared data structures (Tags), e.g. mapping data, names, connections to CDSs
Variation
Greater code efficiency and manageability for both build and web
Easier to search
Imminent arrivals and the Future
• InterPro
• Refined Mapping
• Moving build machine
• New nematodes
• New data types
InterPro
• Useful data used in many other resources, so a good point of reference for non-worm specialists.
• We previously got ours from UniProt or ad hoc from St Louis.
• Many databases are covered by InterPro.
- Prosite, Prints, Pfam, SMART, PIRSF, etc.
• The usual way of searching for database hits is to use interproscan, but this is incompatible with the Sanger farm.
• Instead we run each database search individually, using the existing architecture from BLAST etc., and store the results.
• We merge hits with the same InterPro ID.
Merging hits from databases
Pfam, Prints, Prosite, Pirsf
IPR006209 IPR008401 IPR006502
Protein
Results similar but not identical to iprscan
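One way to picture the merging step is the sketch below. It assumes that "merging" means combining overlapping hit ranges on the same protein when their signatures map to the same InterPro entry; the slides do not spell out the rule, so treat this as an illustration only. The signature names, the protein name and the coordinates are made up, and only the InterPro ids come from the slide.

```python
from collections import defaultdict

# Hypothetical mapping from member-database signatures to InterPro entries.
SIGNATURE_TO_INTERPRO = {
    "pfam_sig_1": "IPR006209",
    "prosite_sig_1": "IPR006209",
    "prints_sig_1": "IPR008401",
}


def merge_hits(hits):
    """Group hits by (protein, InterPro id) and combine overlapping ranges.

    hits: iterable of (protein, signature, start, end), inclusive coordinates.
    Returns {(protein, interpro_id): [(start, end), ...]}.
    """
    grouped = defaultdict(list)
    for protein, signature, start, end in hits:
        interpro_id = SIGNATURE_TO_INTERPRO.get(signature)
        if interpro_id is None:
            continue  # signature has no InterPro entry in the mapping
        grouped[(protein, interpro_id)].append((start, end))

    merged = {}
    for key, ranges in grouped.items():
        ranges.sort()
        combined = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = combined[-1]
            if start <= last_end:  # overlaps the previous range
                combined[-1] = (last_start, max(last_end, end))
            else:
                combined.append((start, end))
        merged[key] = combined
    return merged


example_hits = [
    ("protein_1", "pfam_sig_1", 10, 50),
    ("protein_1", "prosite_sig_1", 40, 80),
    ("protein_1", "prints_sig_1", 100, 120),
    ("protein_1", "pfam_sig_1", 200, 220),
    ("protein_1", "unmapped_sig", 1, 5),
]
merged_hits = merge_hits(example_hits)
```

A different overlap rule, or a different signature-to-entry mapping, would give slightly different domain boundaries, which is one plausible reason for results that are similar but not identical to another tool's.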
InterPro hits per protein
15 proteins with >100 domains (max. 186)
[Histogram: number of proteins (y-axis, 0 to 7500) by number of InterPro domains per protein (x-axis, 0 to 22)]
Improved Mapping of Variations to Genes
We can describe much more accurately how a mutation affects a gene . . .
- donor and acceptor splice sites
- introns / exons
- motifs like polyAs and TSLs
... and for coding changes give the amino acid differences.
Variations
sra-9
ttc → tta (F → L)
Currently the only connection is to the Gene.
In future we will specify that the SNP is in coding sequence and that it causes a specified amino acid change.
This is described by Tags in the database, so it is searchable.
Predicted snp_AH6[1]
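The amino acid consequence of a coding SNP such as the ttc to tta example can be worked out with the standard genetic code. The sketch below is illustrative only: the function name and the returned keys are hypothetical, and it assumes the two codons are already extracted in frame on the coding strand.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def describe_coding_change(ref_codon, alt_codon):
    """Return the amino acid change caused by replacing one codon with another."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"      # the change introduces a stop codon
    elif ref_aa == "*":
        kind = "stop_lost"     # the change removes a stop codon
    else:
        kind = "missense"
    return {"from": ref_aa, "to": alt_aa, "type": kind}


change = describe_coding_change("ttc", "tta")
```

For the slide's example this gives a change from F to L, classified here as missense; storing such a result as tags is what would make it searchable.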
Implementation
One table per chromosome, so all can be loaded together
GFF data: exons, introns, transcripts, SNPs, alleles etc.
Chromosomes: I, II, III, IV, V, X
All chromosomes can be run in parallel
cbi1 = 3 x 2cpu
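Because each chromosome has its own table, the mapping jobs are independent of one another. A rough sketch of that structure follows; the function names, feature names and coordinates are hypothetical, and a thread pool stands in for what would really be separate jobs on separate CPUs.

```python
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = ["I", "II", "III", "IV", "V", "X"]


def map_chromosome(features, variations):
    """Return {variation name: [names of features it falls in]} for one chromosome.

    features: list of (name, start, end); variations: list of (name, position).
    """
    result = {}
    for var_name, position in variations:
        result[var_name] = [
            name for name, start, end in features if start <= position <= end
        ]
    return result


def map_all(features_by_chrom, variations_by_chrom, workers=6):
    """Submit one independent mapping job per chromosome and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            chrom: pool.submit(
                map_chromosome,
                features_by_chrom.get(chrom, []),
                variations_by_chrom.get(chrom, []),
            )
            for chrom in CHROMOSOMES
        }
        return {chrom: future.result() for chrom, future in futures.items()}


features = {"I": [("exon_a", 100, 200), ("intron_a", 201, 300)]}
variations = {"I": [("var_1", 150), ("var_2", 250), ("var_3", 999)]}
mapped = map_all(features, variations)
```

No job reads another chromosome's data, so the same split works whether the six jobs run on threads, processes or cluster nodes.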
Death of wormsrv2
5 years ago: Sanger network = bad
Bought a shiny, fast new computer
It has become too slow, and its isolation is a pain
Now: Sanger network = good!
Move to use the informatics cluster: fast and parallel
This means modifying the majority of the code base
New nematodes
New nematode genomes
• C. briggsae is a forerunner . . .
- semi-curated geneset
- brigpep2
- protein annotation (PFAM, tmhmm, signalp)
- ortholog assignment (InParanoid, Erik Sonnhammer)
- blastp
- blastx
- waba (Jim Kent’s genome alignment tool)
• We intend to do all of this for each of the new genomes.
• Mostly done for C. remanei
New Data Types
Any new data types impact on the build:
- new model development
- scripts to integrate and check the data
E.g. mass spec data: we have been in contact with Gennifer Merrihew.