Ten Years and Change
the MX data archive at ALS 8.3.1
[Figure: ALS 8.3.1 data collection history. Terabytes (uncompressed), axis 0-70, vs. year, 2001-2013. Series: actual; doubling = 2.8 years]
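The plot legend quotes a doubling time of 2.8 years. Here is a minimal sketch of how such a number can be estimated from yearly totals, assuming simple exponential growth and a least-squares fit in log space. The slides do not say how the 2.8 was obtained, and the volumes below are synthetic, not the beamline's actual totals.

```python
import math

def doubling_time(years, volumes):
    """Fit log2(volume) against year by least squares; return years per doubling."""
    n = len(years)
    logs = [math.log2(v) for v in volumes]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx  # doublings per year
    return 1.0 / slope

# Synthetic illustration only: a curve that doubles every 2.8 years.
# The 0.5 TB starting value is hypothetical.
years = list(range(2001, 2014))
volumes = [0.5 * 2 ** ((y - 2001) / 2.8) for y in years]
fitted = doubling_time(years, volumes)
```

With real yearly totals in place of the synthetic list, `fitted` would be the quantity to compare against the legend.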
[Figure: ALS 8.3.1 data collection history by detector. Terabytes (uncompressed), axis 0-70, vs. year, 2001-2013. Series: Proteum 300; Q210; Q315 (907); Q315r (926)]
[Figure: ALS 8.3.1 data collection history by detector. Images x 10^6, axis 0-5, vs. year, 2001-2013. Series: Proteum 300; Q210; Q315 (907); Q315r (926)]
[Figure: ALS 8.3.1 data collection history. Left axis: images x 10^6, 0-5. Right axis: PDB entries, 0-1250. Series: images collected; PDB data collection; PDB deposition]
DVD data archive: 82 TB
Which data go with which PDB?
• 260,000 images are called “test”
• cell: 48 62 84 90 101 104 is within 5 Å and 5° of 16,000 PDBs
focusing on 2001-2006
• 490 PDBs credit ALS 8.3.1 with data
• 44 of these didn’t actually collect data
• 64 collected data, but no credit
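The unit-cell bullet above shows why a loose cell comparison is a weak identifier. Here is a minimal sketch of such a tolerance test, assuming "within 5 Å and 5°" means every edge and every angle agrees within that tolerance. The function name and the candidate cells are hypothetical, and no axis permutation or cell reduction is attempted.

```python
def cells_match(cell_a, cell_b, length_tol=5.0, angle_tol=5.0):
    """Return True if all edges agree within length_tol (Å)
    and all angles agree within angle_tol (degrees).

    Cells are (a, b, c, alpha, beta, gamma) tuples.
    """
    lengths_ok = all(abs(x - y) <= length_tol
                     for x, y in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(x - y) <= angle_tol
                    for x, y in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok

# Cell quoted on the slide.
query = (48, 62, 84, 90, 101, 104)

# Hypothetical candidates, not real PDB entries.
close_cell = (50, 60, 86, 92, 99, 100)
far_cell = (54, 62, 84, 90, 101, 104)

matches = [c for c in (close_cell, far_cell) if cells_match(query, c)]
```

With tolerances this wide, many unrelated crystals pass, which is consistent with the 16,000 hits quoted above.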
1. images from 2001-2006: 1,604,031
2. collected “near” edges: 682,712
3. find “runs” of >10 images: 3602
4. unify multi-wedge sets: 3331
5. run labelit & XDS: 2524
6. >70% complete?: 1479
7. I/σ > 10: 1054
8. reduced cell vs PDB: 1 to 200+
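Step 3 of the protocol above, finding "runs" of more than 10 images, can be sketched as grouping files by name prefix and looking for consecutive frame numbers. This is an illustration under assumptions, not the script used for the archive: the `prefix_NNN.img` naming pattern and the function name `find_runs` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <prefix>_<frame number>.img
FRAME_RE = re.compile(r"^(?P<prefix>.+)_(?P<num>\d+)\.img$")

def find_runs(filenames, min_length=11):
    """Return (prefix, first_frame, last_frame) for each stretch of
    consecutive frame numbers with at least min_length images."""
    frames = defaultdict(set)
    for name in filenames:
        m = FRAME_RE.match(name)
        if m:
            frames[m.group("prefix")].add(int(m.group("num")))

    runs = []
    for prefix, nums in sorted(frames.items()):
        ordered = sorted(nums)
        start = prev = ordered[0]
        for n in ordered[1:]:
            if n != prev + 1:  # gap in numbering ends the current run
                if prev - start + 1 >= min_length:
                    runs.append((prefix, start, prev))
                start = n
            prev = n
        if prev - start + 1 >= min_length:
            runs.append((prefix, start, prev))
    return runs

# Hypothetical file listing: one 12-image run, a short tail, 3 test shots.
names = [f"xtal_1_{i:03d}.img" for i in range(1, 13)]
names += [f"xtal_1_{i:03d}.img" for i in range(20, 26)]
names += [f"test_1_{i:03d}.img" for i in range(1, 4)]
runs = find_runs(names)
```

The default `min_length=11` encodes ">10 images". Short stretches such as test shots fall out automatically.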
Which data go with which PDB?
Responses to inquiries
“I have to find my old note book as I have no idea what that is.”
“I have changed jobs a few times since and am really far away from crystallography now.”
“Will see what I can find.”
“We solved it but never published it. Sorry!”
DVD data archive
Primary failure mode of DVDs
dataset identification protocol
1. images from 2001-2006: 1,604,031
2. collected “near” edges: 682,712
3. find “runs” of >10 images: 3602
4. sort out multi-wedge sets: 3331
5. run XDS: 2524
6. >70% complete?: 1479
7. I/σ > 10: 1054
8. reduced cell vs PDB: 1 to 200+
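Steps 6 and 7 of the protocol are simple threshold cuts on the processing output. Here is a minimal sketch, assuming each processed run has already been reduced to a completeness percentage and a single I/σ value. The slides do not say whether I/σ is the overall value or a per-shell value, and the `ProcessedRun` record and the example numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProcessedRun:
    name: str
    completeness: float   # percent, as reported by the processing program
    i_over_sigma: float   # assumed here to be the overall I/sigma

def passes_quality_cuts(run, min_completeness=70.0, min_i_over_sigma=10.0):
    """Apply the '>70% complete' and 'I/sigma > 10' cuts."""
    return (run.completeness > min_completeness
            and run.i_over_sigma > min_i_over_sigma)

# Hypothetical processing results, for illustration only.
candidates = [
    ProcessedRun("run_a", completeness=98.5, i_over_sigma=17.2),
    ProcessedRun("run_b", completeness=64.0, i_over_sigma=22.0),
    ProcessedRun("run_c", completeness=91.0, i_over_sigma=6.3),
]
kept = [r.name for r in candidates if passes_quality_cuts(r)]
```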
Unit Cell: 90.9 90.9 46.8 90 90 120
[Figure: best Rcryst after rigid-body refinement (axis 0.3-0.6) vs. RMS unit cell length deviation (Å, axis 0.00-2.00). Labeled: 1hh7 M. TB CSOR; 1rb5; myoglobin]
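The horizontal axis of this plot is an RMS unit cell length deviation. Here is a minimal sketch of one way to compute it, assuming it is the root-mean-square difference of the three cell edges between the observed and the deposited cell. The slides do not give the exact definition, and the "deposited" cell below is hypothetical.

```python
import math

def rms_cell_length_deviation(cell_a, cell_b):
    """RMS difference (Å) of the a, b, c edges of two unit cells.

    Cells are (a, b, c, alpha, beta, gamma); angles are ignored here.
    """
    diffs = [x - y for x, y in zip(cell_a[:3], cell_b[:3])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Cell quoted on the slide.
observed = (90.9, 90.9, 46.8, 90, 90, 120)

# Hypothetical deposited cell, for illustration only.
deposited = (91.4, 91.4, 46.8, 90, 90, 120)

rmsd = rms_cell_length_deviation(observed, deposited)
```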
MAD/SAD datasets
[Figure: Riso vs PDB deposit (axis 0-1) vs. best Rcryst after rigid-body refinement (axis 0.20-0.60). Groups: Published; non-isomorphous; Unsolved?]
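The vertical axis compares archived data to the deposited structure factors through Riso. Here is a minimal sketch under one common convention, Riso = Σ|F1 − F2| / Σ ½(F1 + F2), taken over reflections common to both sets. The slides do not state which convention or scaling was used, so both are assumptions here, and the amplitudes are made up.

```python
def r_iso(f_archive, f_deposited):
    """Isomorphous R factor between two amplitude sets.

    Both arguments are dicts mapping (h, k, l) to amplitude F.
    Only common reflections are used. The inputs are assumed to be
    on the same scale already; no scaling is done here.
    """
    common = f_archive.keys() & f_deposited.keys()
    if not common:
        raise ValueError("no common reflections")
    num = sum(abs(f_archive[hkl] - f_deposited[hkl]) for hkl in common)
    den = sum(0.5 * (f_archive[hkl] + f_deposited[hkl]) for hkl in common)
    return num / den

# Made-up amplitudes, for illustration only.
archive = {(1, 0, 0): 100.0, (0, 1, 0): 200.0, (0, 0, 1): 300.0}
deposit = {(1, 0, 0): 110.0, (0, 1, 0): 190.0, (0, 0, 1): 300.0,
           (1, 1, 1): 50.0}  # extra reflection is ignored

value = r_iso(archive, deposit)
```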
EGDA
Dec 01 19:45:12 2001 egda46_*1_E#_###.img (1112 images, Se MAD)
Dec 02 15:10:06 2001 egda27_*1_###.img (180, 1A, native?)
Dec 02 19:21:55 2001 egdau1_*1_###.img (427, 8000eV (U?) SAD)
Dec 02 20:58:26 2001 egdau1_*2_###.img (360, 8000eV (U?) SAD)
Jun 01 14:07:43 2002 egda60_*1_###.img (360, Lutetium SAD)
“I think that these EGDA data sets are very likely some of xxx’s data sets, he was working on E.coli guanine deaminase, something he brought from yyy. No structure was ever published James, xxx was unable to solve the structure from these data.”
E. coli guanine deaminase
~2.9 Å, P21212
R = 0.32, Rfree = 0.39
PDB ID: ????
Summary
• saving data could double productivity
• unit cell is not a good score
• lossy compression: rallying cry?
• backup vs archive
• metadata: what do we really know?
Brief Summary
• this is a lot of work.
• who is going to pay for it?
backblaze.com “pod” server
backblaze.com offers “unlimited storage” data backup for $5/month.
backblaze does not sell these “pods”, but “protocase.com” will.
compresses 4.2x
compresses 337x
compresses 5x, but only one per dataset!
compresses 3.5x
compressed ~50x
compresses 5.2x
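The ratios above are simply original size divided by compressed size. Here is a minimal sketch of measuring one with Python's standard `zlib` module, which is lossless. The slides do not say which compressors produced the quoted figures, and the byte buffers below are synthetic stand-ins, not diffraction images.

```python
import zlib

def compression_ratio(raw: bytes, level: int = 9) -> float:
    """Return original size / compressed size using zlib (lossless)."""
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)

# Synthetic buffers, for illustration only.
blank = bytes(100_000)                                    # all zeros
patterned = bytes((i * 7) % 251 for i in range(100_000))  # repeating pattern

blank_ratio = compression_ratio(blank)
patterned_ratio = compression_ratio(patterned)
```

Real images carry noise in every pixel, which limits lossless ratios to the single-digit range quoted above. Going much further requires discarding information, which is the subject of the next slide.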
Lossy compression vs R/Rfree
[Figure: R factor (axis 0.18-0.38) vs. compression ratio (log axis, 1-100). Series: R_cryst; R_free]