Ten Years and Change: the MX data archive at ALS 8.3.1

Page 1: Ten Years and Change

Ten Years and Change

the MX data archive at ALS 8.3.1

Page 2: Ten Years and Change

Acknowledgements

ALS 8.3.1 creator: Tom Alber
8.3.1 PRT head: Jamie Cate

Center for Structure of Membrane Proteins
Membrane Protein Expression Center II
Center for HIV Accessory and Regulatory Complexes
W. M. Keck Foundation
Plexxikon, Inc.
M D Anderson CRC
University of California Berkeley
University of California San Francisco
National Science Foundation
University of California Campus-Laboratory Collaboration Grant
Henry Wheeler

The Advanced Light Source is supported by the Director, Office of Science, Office of Basic Energy Sciences, Materials Sciences Division, of the US Department of Energy under contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.

Page 3: Ten Years and Change

ALS 8.3.1 data collection history

[Figure: terabytes (uncompressed) vs. year, 2001-2013; y-axis 0 to 70. Legend: “actual”; “doubling = 2.8 years”.]
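The “doubling = 2.8 years” label implies an exponential growth model, V(t) = V0 · 2^((t − t0)/T) with T = 2.8 years. As a minimal sketch of how such a doubling time could be estimated from yearly totals, the function below fits a straight line to log2(volume) against year. The `doubling_time` helper and the sample numbers are hypothetical; the data are synthetic and generated from the model itself, not taken from the beamline's records.

```python
import math

def doubling_time(years, volumes):
    """Least-squares fit of log2(volume) vs. year; doubling time is 1/slope."""
    logs = [math.log2(v) for v in volumes]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return sxx / sxy  # slope = sxy / sxx, so doubling time = sxx / sxy

# Synthetic illustration only: volumes that double every 2.8 years from 1 TB.
years = list(range(2001, 2014))
volumes = [1.0 * 2 ** ((y - 2001) / 2.8) for y in years]
estimated = doubling_time(years, volumes)
print(f"estimated doubling time: {estimated:.2f} years")
```

Fitting in log space keeps the early, small years from being swamped by the later, large ones.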

Page 4: Ten Years and Change

ALS 8.3.1 data collection history

[Figure: terabytes (uncompressed) vs. year, 2001-2013; y-axis 0 to 70. Broken down by detector: Proteum 300, Q210, Q315 (907), Q315r (926).]

Page 5: Ten Years and Change

ALS 8.3.1 data collection history

[Figure: images × 10^6 vs. year, 2001-2013; y-axis 0 to 5. Broken down by detector: Proteum 300, Q210, Q315 (907), Q315r (926).]

Page 6: Ten Years and Change

DVD data archive: 68 TB

Page 7: Ten Years and Change

DVD data archive

Page 8: Ten Years and Change
Page 9: Ten Years and Change
Page 10: Ten Years and Change

50 TB

Page 11: Ten Years and Change

Primary failure mode of DVDs

Page 12: Ten Years and Change

Primary failure mode of DVDs

3000 files remain unrecoverable (~0.1%)

Page 13: Ten Years and Change

Which data go with which PDB?

• 260,000 images are called “test”

• cell: 48 62 84 90 101 104 is within 5 Å and 5° of 16,000 PDBs

focusing on 2001-2006

• 490 PDBs credit ALS 8.3.1 with data

• 44 of these didn’t actually collect data

• 64 collected data, but no credit
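To illustrate why a tolerance of 5 Å and 5° matches so many entries, here is a minimal sketch of a naive unit cell comparison. The `cell_matches` helper and the candidate cells are hypothetical, and the comparison assumes both cells are listed in the same axis order; it does not reduce cells or try axis permutations, which a real search would need.

```python
def cell_matches(cell, other, length_tol=5.0, angle_tol=5.0):
    """True if all three lengths agree within length_tol (Å)
    and all three angles agree within angle_tol (degrees)."""
    lengths_ok = all(abs(a - b) <= length_tol for a, b in zip(cell[:3], other[:3]))
    angles_ok = all(abs(a - b) <= angle_tol for a, b in zip(cell[3:], other[3:]))
    return lengths_ok and angles_ok

# Query cell from the slide: a, b, c in Å, then alpha, beta, gamma in degrees.
query = (48, 62, 84, 90, 101, 104)

# Hypothetical candidate cells, invented for illustration.
candidates = {
    "cell_A": (50.1, 60.3, 86.0, 90.0, 98.5, 106.2),
    "cell_B": (48.0, 62.0, 95.0, 90.0, 101.0, 104.0),
}

hits = [name for name, c in candidates.items() if cell_matches(query, c)]
print(hits)
```

With tolerances this loose, many unrelated crystals fall inside the window, so a cell match alone does not tie a dataset to one deposit.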

Page 14: Ten Years and Change

Which data go with which PDB?

1. images from 2001-2006: 1,604,031
2. collected “near” edges: 682,712
3. find “runs” of >10 images: 3602
4. unify multi-wedge sets: 3331
5. run labelit & XDS: 2524
6. >70% complete?: 1479
7. I/σ > 10: 1054
8. reduced cell vs PDB: 1 to 200+
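Step 3 above, finding “runs” of more than 10 images, can be sketched as grouping files by name prefix and looking for stretches of consecutive frame numbers. This is only an illustration: the `prefix_frame.img` naming pattern, the `find_runs` helper, and the example filenames are assumptions, not the archive's actual layout or the method used in the talk.

```python
import re
from collections import defaultdict

# Assumed naming convention: <prefix>_<frame number>.img
FRAME_RE = re.compile(r"^(?P<prefix>.+)_(?P<frame>\d+)\.img$")

def find_runs(filenames, min_length=11):
    """Return (prefix, first_frame, last_frame) for each stretch of
    consecutive frame numbers at least min_length long (i.e. >10 images)."""
    frames = defaultdict(list)
    for name in filenames:
        m = FRAME_RE.match(name)
        if m:
            frames[m.group("prefix")].append(int(m.group("frame")))

    runs = []
    for prefix, numbers in frames.items():
        numbers = sorted(set(numbers))
        start = prev = numbers[0]
        # The trailing None flushes the final stretch.
        for n in numbers[1:] + [None]:
            if n is not None and n == prev + 1:
                prev = n
                continue
            if prev - start + 1 >= min_length:
                runs.append((prefix, start, prev))
            if n is not None:
                start = prev = n
    return runs

# Hypothetical filenames: one 12-frame run, two stray frames, three test shots.
names = [f"lys_1_{i:03d}.img" for i in range(1, 13)]
names += ["lys_1_020.img", "lys_1_021.img"]
names += [f"test_1_{i:03d}.img" for i in range(1, 4)]
print(find_runs(names))
```

Short stretches such as the stray frames and the three “test” images are dropped, which is the point of the run-length cutoff.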

Page 15: Ten Years and Change

Unit Cell: 90.9 90.9 46.8 90 90 120

[Figure: best Rcryst after rigid-body refinement (y-axis, 0.3 to 0.6) vs. RMS unit cell length deviation in Å (x-axis, 0 to 2). Labels: “1hh7 M. TB CSOR”, “1rb5”, “myoglobin”.]
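The horizontal axis of this plot is an RMS deviation of unit cell lengths. A minimal sketch of one plausible definition follows: the root mean square of the differences in a, b and c between two cells. The exact definition used for the plot is not stated in the slides, so this formula is an assumption, and the second cell is invented for illustration.

```python
import math

def rms_length_deviation(cell_a, cell_b):
    """Root-mean-square difference of the three cell lengths (Å).
    Assumes both cells list a, b, c in the same order."""
    diffs = [a - b for a, b in zip(cell_a[:3], cell_b[:3])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

reference = (90.9, 90.9, 46.8, 90, 90, 120)   # unit cell from the slide
other = (91.9, 91.9, 46.8, 90, 90, 120)       # hypothetical comparison cell

deviation = rms_length_deviation(reference, other)
print(f"RMS length deviation: {deviation:.3f} Å")
```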

Page 16: Ten Years and Change

MAD/SAD datasets

[Figure: Riso vs PDB deposit (y-axis, 0 to 1) vs. best Rcryst after rigid-body refinement (x-axis, 0.2 to 0.6). Labels: “Published”, “non-isomorphous”, “Unsolved?”.]

Page 17: Ten Years and Change

Responses to inquiries

“I have to find my old note book as I have no idea what that is.”

“I have changed jobs a few times since and am really far away from crystallography now.”

“Will see what I can find.”

“We solved it but never published it. Sorry!”

Page 18: Ten Years and Change

EGDA

Dec 01 19:45:12 2001 egda46_*1_E#_###.img (1112 images, Se MAD)
Dec 02 15:10:06 2001 egda27_*1_###.img (180, 1A, native?)
Dec 02 19:21:55 2001 egdau1_*1_###.img (427, 8000eV (U?) SAD)
Dec 02 20:58:26 2001 egdau1_*2_###.img (360, 8000eV (U?) SAD)
Jun 01 14:07:43 2002 egda60_*1_###.img (360, Lutetium SAD)

“I think that these EGDA data sets are very likely some of xxx’s data sets, he was working on E.coli guanine deaminase, something he brought from yyy. No structure was ever published James, xxx was unable to solve the structure from these data.”

Page 19: Ten Years and Change

~2.9 Å
P21212

R = 0.32
Rfree = 0.39

PDB ID: ????

E. coli guanine deaminase

Page 20: Ten Years and Change

Metadata: can we rely on it?

Duquerroy, et al. (1994). "Lobster enolase crystallized by serendipity", Proteins: Struct., Funct., Bioinf. 18, 390-393.

authors were after lobster arginine kinase

got enolase instead

arginine kinase structure still unknown

Page 21: Ten Years and Change

compresses 4.2x

raw image

Page 22: Ten Years and Change

compresses 337x

just spots

Page 23: Ten Years and Change

compresses 5x, but only one per dataset!

pixel-wise median across dataset

Page 24: Ten Years and Change

compresses 3.5x

deviation from median in “non-spot” areas

Page 25: Ten Years and Change

compressed ~50x

after h264 of non-spot areas

Page 26: Ten Years and Change

compresses 5.2x

difference between raw and compressed
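The decomposition shown on the preceding slides (spots, a pixel-wise median across the dataset, and the deviation from that median in non-spot areas) can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the tiny flat “images”, the noise model, the fixed spot threshold, and the use of lossless zlib in place of the h264 step from the talk. The ratios it prints will not reproduce the slide numbers.

```python
import random
import statistics
import struct
import zlib

def pack(pixels):
    """Serialize a flat list of integers as signed 16-bit little-endian values."""
    return struct.pack(f"<{len(pixels)}h", *pixels)

def ratio(pixels):
    """Lossless zlib compression ratio of the packed pixels."""
    raw = pack(pixels)
    return len(raw) / len(zlib.compress(raw, 9))

# Synthetic dataset: smooth background + small noise + a few bright "spots".
rng = random.Random(0)
n_pixels, n_images = 4096, 9
background = [100 + (i % 64) for i in range(n_pixels)]
images = []
for _ in range(n_images):
    img = [b + rng.randint(-3, 3) for b in background]
    for pos in rng.sample(range(n_pixels), 20):
        img[pos] += 5000
    images.append(img)

# Pixel-wise median across the dataset: one background image per dataset.
median = [int(statistics.median(column)) for column in zip(*images)]

# Split the first image into spots and non-spot deviation from the median.
threshold = 1000  # arbitrary cutoff for this toy example
first = images[0]
spots = [p if p - m > threshold else 0 for p, m in zip(first, median)]
residual = [0 if p - m > threshold else p - m for p, m in zip(first, median)]

# The three parts reassemble into the original image exactly.
rebuilt = [s if s else m + r for s, m, r in zip(spots, median, residual)]

for label, pixels in [("raw", first), ("spots", spots),
                      ("median", median), ("residual", residual)]:
    print(f"{label:9s} compresses {ratio(pixels):.1f}x")
```

The split is lossless as written; loss only enters if the residual is passed through a lossy codec, which is where a video codec such as h264 would come in.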

Page 27: Ten Years and Change

Lossy compression vs R/Rfree

[Figure: R factor (y-axis, 0.18 to 0.38) vs. compression ratio (x-axis, 1 to 100, logarithmic ticks). Series: R_cryst, R_free.]

Page 28: Ten Years and Change

backblaze.com “pod” server

backblaze.com offers “unlimited storage” data backup for $5/month.

Page 29: Ten Years and Change

backblaze offers “unlimited storage” data backup for $5/month.

Page 30: Ten Years and Change

backblaze does not sell these “pods”, but “protocase.com” does.

Page 31: Ten Years and Change
Page 32: Ten Years and Change

Summary

• saving data could double productivity

• unit cell is not a good score

• lossy compression: rallying cry?

• backup vs archive

• metadata: what do we really know?

Page 33: Ten Years and Change

Brief Summary

• this is a lot of work.

• who is going to pay for it?