Calcul Québec - Université Laval
Building a Storage System for Genomics
HPCS 2014, Halifax, NS

Agenda

Genomics storage project background
Reviewing and optimizing the proposal
Network + politics issues
Writing the RFP
Lessons learned

Genomics storage project: background

CFI Leading Edge Fund (2012 competition): “Human and Microbial Integrative Genomics”
Project lead: Dr. Jacques Simard, CRCHUQ (Université Laval)
16 researchers from Université Laval
Bioinformatics and computational infrastructure: Arnaud Droit
Large storage component

CRCHUQ

Genomics, proteomics and metabolomics
Data sources: HiSeq 2500, HiSeq 2000 and MiSeq
Applications: RAY, genomics pipeline, …
Some researchers are already active HPC users
  Jacques Corbeil, Sébastien Boisvert (RAY), Arnaud Droit, Yohan Bossé

Site specifications: Physical

Number of racks in silo: 56 max
Floor loading capacity: 940 lb/ft²

Site specifications: Power

[Power diagram labels]
25 kV hydro line
2 MVA transformer
1 MW generator
Campus Data Center
CII: Centre des Infrastructure Informationelles
Silo: 1.1 MW available (~33% used), 72 kW UPS (+ generator)

Site specifications: Cooling

Rack cooling: 100% air
No CRAC units! Uses the campus-wide chilled-water loop for cooling
Cooling capacity: 1.5 MW
Residual heat transferred to the campus hot-water loop
Partial free air cooling (up to 300 kW)
[Photo labels: cooling coils, air blowers, free air cooling]

Site specifications: Networking

Fibre optic network to Québec hospital research networks
[Network diagram, link speeds not recoverable. Labels: RISQ; Canarie; other Compute Canada sites; Calcul Québec router; other Calcul Québec centres (UdeM, McGill); Université Laval backbone; Université Laval campus; UL computing centre (Colosse); F.O. network reaching all hospital research networks]

Timeline

2012
  Feb: first meeting
  Nov: FW issue identified
2013
  Jan 15: MSSS meeting
  July 9: MSSS derogation
  Oct 3: RFP published
  Nov 20: RFP winner announced
2014
  Jan 6: physical installation
  Jan 22: acceptance testing starts
  Feb 26: all tests pass, system accepted
A number of meetings with vendors/manufacturers (listed under both 2012 and 2013)
Year not recoverable from the slide layout:
  March: finalize budget
  April: CFI conditions met

Researcher LEF proposal

Initial contact: Researcher -> VPR -> CC staff
Initial meetings: review proposal, discussions
  Researchers planned to install storage at CRCHUQ facilities
  Based on a quote from a local supplier
Review and optimize
  Discuss possible optimizations in the proposal
  Scheduled meetings with HPC storage suppliers

CFI LEF

Discussed option to host storage at CC site
  Install some storage and compute at CRCHUQ
  Bulk storage at UL/CQ/CC site
  High-speed connectivity already in place (10G UL-CRCHUQ)
Sounds simple, right? …

Concerns raised

Ease of access to CC-hosted storage
Security

Refining the proposal

Evaluate benefits of hosting at CQ/CC
  Power and cooling infrastructure already in place
  O&M handled by CQ staff, in collaboration with the CRCHUQ sysadmin
Initial budget planned for room renovations and extra A/C: $$ saved for more infrastructure

MOU

MOU sent to CFI (Jan 2013)
  CRCHUQ and CQ/CC staff work on the RFP and acquisition process
  CQ/CC staff manage storage
  Storage is for the exclusive use of CRCHUQ
  Local storage for CRCHUQ, parallel FS at CQ/UL
  Archival (tapes) will use the existing system available at a remote CQ/CC site

Genomics Storage Components

[Architecture diagram; capacities, node counts and link speeds not recoverable. Labels: CRCHUQ; Illumina; workstations; local FS; file server + datamover; colosse; Genome FS; Lustre scratch; Lustre home; compute nodes; genome login + datamover; InfiniBand QDR network; Ethernet network]

Interconnect

Université Laval owns a fibre optic MAN
Interconnects all QC hospital research networks
[Network diagram, link speeds not recoverable. Labels: RISQ; Canarie; other Compute Canada sites; Calcul Québec router; other Calcul Québec centres (UdeM, McGill); Université Laval backbone; Université Laval campus; UL computing centre (Colosse); FW CHUL; FW bioinfo; CRCHUQ; bioinformatics laboratory]

Network testing

[Network diagram: same labels as on the Interconnect slide, link speeds not recoverable. Annotations: test node; test node (VM); 4 Gbps; 141 Mbps. Which path each measurement applies to is not recoverable.]
“IS-QC” firewall limits flows to 1.2 Gbps
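To put the three rates on this slide in perspective, the sketch below estimates how long it would take to move 10 TB, the weekly sequencing output quoted later in the talk. It assumes decimal units (1 TB = 10^12 bytes) and a sustained rate with no protocol overhead; the function name is illustrative only.

```python
def transfer_hours(data_tb: float, rate_mbps: float) -> float:
    """Hours needed to move data_tb terabytes (10**12 bytes each)
    over a link sustaining rate_mbps megabits per second."""
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    bits = data_tb * 1e12 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600


# Rates quoted on the slide, expressed in Mbps.
RATES_MBPS = {"141 Mbps": 141, "1.2 Gbps": 1200, "4 Gbps": 4000}

if __name__ == "__main__":
    for label, rate in RATES_MBPS.items():
        hours = transfer_hours(10, rate)
        print(f"{label}: {hours:.1f} h ({hours / 24:.1f} days) for 10 TB")
```

Under these assumptions the slowest measured rate needs most of a week to move one week of data, which leaves almost no margin for catching up after an outage.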

“Pile of firewalls”

CRCHUQ already manages its own security firewall at its periphery
IS-QC, under MSSS authority, acts as a “safety valve”
Work with CRCHUQ to request a derogation to remove IS-QC

MSSS Derogation

Dec 2012: document and prepare meeting
Jan 2013: meeting with MSSS security staff
Jan 2013: regional security coordinator refusal
Feb 2013: CRCHUQ director writes to the deputy minister of MSSS IT
July 2013: deputy minister (MSSS IT) visits UL/CQ
Derogation done.

Network Measurements

perfSONAR for periodic network measurements
[Network diagram: same labels as on the Interconnect slide; link speeds not recoverable]

Archival

Use existing tape archives at CQ
Plenty of network bandwidth (for now…)
[Network diagram: same labels as on the Interconnect slide; link speeds not recoverable]

RFP

Building the RFP

An iterative process
Based on multiple meetings with researchers, plus the expertise and market knowledge of the local HPC team
[Diagram labels: researcher, HPC team, vendors, RFP]

Premise

2 storage systems with different requirements in very different environments
Parallel storage
  Large and high-speed, in a modern datacenter with plenty of power and cooling
On-site storage
  Smaller capacity with a slower interconnect, in an air-conditioned server room

Challenges

Budget is limited. We want to get the most out of it!
But the most of what?
  Parallel storage capacity
  Parallel storage write speed
  On-site storage capacity
  etc.

Challenges (cont.)

Most of the budget to be allocated to the parallel storage
  To enable computing and mid-term storage
On-site storage must be large enough. No more.
A quality-based RFP allows for such distinctions

How large is large enough?

The sequencing platform could generate 10 TB of data per week
  Operating at full capacity
40 TB would provide 1 month of buffering
[Diagram labels: sequencers, on-site storage (buffering), automated data synchronisation, parallel storage]
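The buffering argument is simple arithmetic, sketched below. The 10 TB per week and 40 TB figures come from this slide; the 50 TB and 120 TB figures are the RFP minimum and the capacity of the selected on-site system quoted later in the talk. The function name is illustrative, and the sketch assumes the sequencers run continuously at full capacity.

```python
def weeks_of_buffer(capacity_tb: float, weekly_output_tb: float = 10.0) -> float:
    """Weeks of sequencer output that fit on the on-site storage
    if nothing is synchronised to the parallel storage."""
    if weekly_output_tb <= 0:
        raise ValueError("weekly output must be positive")
    return capacity_tb / weekly_output_tb


if __name__ == "__main__":
    for label, capacity_tb in [
        ("sizing target", 40),
        ("RFP minimum", 50),
        ("selected system", 120),
    ]:
        print(f"{label}: {capacity_tb} TB -> {weeks_of_buffer(capacity_tb):.0f} weeks")
```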

Quality-based RFP

We chose to publish a quality-based RFP
  In contrast to a lowest-bidder process
Evaluated on cost + « quality criteria »
Vendors are asked to spend at least 95% of the budget

Challenges (solution)

Define 2 independent sets of requirements
Use the « quality criteria » to let vendors know what they should prioritize
  More weight will be given to the parallel storage components

First things first

Hardware only or integrated solution?
A) Hardware only: write an RFP to buy X TB of raw disk space + Y servers and the accompanying interconnect. Integrate everything in-house to deploy a storage system.
B) Integrated solution: ask for a complete system that meets size and performance requirements.

Integrated Solution

Cumbersome question … Lustre, GPFS or anything else?
Should we ask for a specific parallel FS?
  Some parallel FS are tied to a specific vendor or a very small set of vendors
Went with Lustre because it is a multi-vendor ecosystem
  … and our team is already familiar with it

Fostering competition

The RFP can be so specific as to open the door only to a single product!
Or it can let bidders come up with their own solution to our problem
[Diagram labels: specific product, surprise…]

Fostering competition (cont.)

Vendors know when an RFP is targeted at them
  They will price accordingly
Conversely, vendors will not bid if they do not feel they have a fair chance
  Fewer bids will often mean a « higher price »
A less constrained RFP will generally attract more proposals!

Fostering competition (cont.)

Example of being too specific:
  « Storage units with 60 drives in raid5, 8+2 configuration »
Such a statement could apply to a single vendor, while limiting the available technologies

Spec'ing a storage system

Power & cooling capacity
Physical space and room topology
Compatibility with existing infrastructure
  Software
  Physical

Physical infrastructure

Document floor/rack plan
  Maximum weight per square foot?
  How much space do we actually have?
Where does the system need to connect?
  Both power and interconnect
  Cable length

Power & Cooling

How much electrical capacity is available?
  Total? Per rack? UPS?
Can our room cooling system handle that much new power?

Requirements for parallel storage

1 PB usable (or more) Lustre FS
  Compatible with Lustre clients 1.8.9 and 2.4.x
20 GB/s aggregate read/write speed (or more)
Drive and Lustre server redundancy
  « how » is purposely left unspecified
InfiniBand interconnect
  2:1 blocking factor with computing resources

Requirements for parallel storage

Vendor to provide all interconnect
  Leaf IB switch, Ethernet switch for management, and cabling
  Site provides uplink to core switches
20 kW maximum electrical consumption
  Vendor to supply PDUs (switched)
  Site to connect PDUs to existing electrical infrastructure

Requirements for on-site storage

Export network filesystem
  Compatible with sequencers, Windows 7, Linux and Mac
10G Ethernet interconnect
50 TB usable capacity (or more)
  with option to grow up to 300 TB
Drive and server redundancy

Requirements for on-site storage

Site to provide all cabling and interconnect for on-site storage
PDUs and rack space provided by the site

Measuring the quality of a proposal

Final evaluation is based on an « adjusted price » calculated from the bid price and the rating of the « quality criteria » given by the evaluation committee
The « adjusted price » can vary from the real price by up to 30%

Quality criteria

Parallel storage: 45 %
On-site storage: 20 %
Interconnect & networking: 10 %
Vendor’s experience & reputation: 25 %

Quality criteria (cont.)

In the first three categories, meeting the base requirements gives a passing score of 70%. Any specs or meaningful features above the base requirements will improve the mark.
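The slides give the category weights, the 70% passing score and the 30% ceiling on the price adjustment, but not the adjustment formula itself. The sketch below combines the weights into an overall mark and then applies a linear adjustment between 70% and 100%. That linear form, the treatment of marks below 70% as disqualifying, and all names here are assumptions made for illustration, not the committee's actual method.

```python
# Category weights from the "Quality criteria" slide.
WEIGHTS = {
    "parallel_storage": 0.45,
    "onsite_storage": 0.20,
    "interconnect": 0.10,
    "vendor": 0.25,
}


def quality_score(marks: dict) -> float:
    """Weighted overall mark (0-100) from per-category marks (0-100)."""
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


def adjusted_price(bid_price: float, score: float,
                   max_adjust: float = 0.30, passing: float = 70.0) -> float:
    """Hypothetical linear adjustment: no discount at the passing mark,
    the full max_adjust discount at a perfect score of 100."""
    if score < passing:
        raise ValueError("bid does not reach the passing score")
    fraction = (score - passing) / (100.0 - passing)
    return bid_price * (1.0 - max_adjust * fraction)


if __name__ == "__main__":
    marks = {"parallel_storage": 90, "onsite_storage": 75,
             "interconnect": 70, "vendor": 80}
    score = quality_score(marks)
    print(f"score {score:.1f}, adjusted price {adjusted_price(1_000_000, score):,.0f}")
```

With a scheme of this shape, a more expensive bid can still win if its quality mark lowers its adjusted price below that of a cheaper, lower-rated bid.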

Quality criteria (cont.)

In the « vendor » category, the score is based on the bidder’s experience in deploying similar systems, with a requirement for at least 1 such system in the past 18 months.
The support structure and the résumé of the lead architect for the project are also factors.

Benchmarks & stability tests

Acceptance tests

We define stability tests to validate that the system can operate in a real production environment.
We run synthetic benchmarks to make sure the system hits the performance targets set by the vendor, as requested by the quality criteria.

Stability tests

To validate normal operation
  Homogeneous firmware and software versions everywhere
  No errors or warnings
  Verify that the system reboots cleanly
  Lustre mounts properly
Simulate drive failures
  Verify rebuild process
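The homogeneity check lends itself to automation. The sketch below assumes the versions have already been collected into a nested dictionary by some site-specific means; the host names, component names and version strings are hypothetical, and how the inventory is gathered is outside its scope.

```python
from collections import Counter


def version_outliers(inventory: dict) -> dict:
    """Return {component: {host: version}} for every host whose version
    differs from the most common version of that component.

    inventory maps component -> {host: version}. On a tie, the version
    seen first is treated as the reference.
    """
    outliers = {}
    for component, versions in inventory.items():
        if not versions:
            continue
        reference, _ = Counter(versions.values()).most_common(1)[0]
        deviating = {h: v for h, v in versions.items() if v != reference}
        if deviating:
            outliers[component] = deviating
    return outliers


if __name__ == "__main__":
    # Hypothetical inventory for three storage servers.
    inventory = {
        "firmware": {"oss01": "1.2", "oss02": "1.2", "oss03": "1.1"},
        "lustre": {"oss01": "2.4.1", "oss02": "2.4.1", "oss03": "2.4.1"},
    }
    print(version_outliers(inventory))
```

An empty result means every component reports a single version across all hosts.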

Benchmarks

We set some base rules
  No custom tools. Re-use existing software
  Let the vendor tune the tests for their system
  But tests must be large enough to avoid cache effects
What to benchmark
  Read/write speed of a single target: IOzone
  Maximum aggregate read/write speed: IOR
  Maximum I/O operations per second (IOPS): mdtest
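The slide does not say how large "large enough" is. A common rule of thumb, assumed here rather than taken from the talk, is to make the benchmark move at least twice the aggregate RAM of the nodes that could cache the data. The sketch below turns that rule into a minimum per-task data size; the node counts and memory sizes in the example are hypothetical.

```python
import math


def min_gib_per_task(total_ram_gib: float, n_tasks: int,
                     cache_factor: float = 2.0) -> int:
    """Smallest whole number of GiB each benchmark task should write so
    that the total data volume is at least cache_factor times the RAM
    that could hold cached data."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.ceil(cache_factor * total_ram_gib / n_tasks)


if __name__ == "__main__":
    # Hypothetical test bed: 32 client nodes with 24 GiB each, 8 tasks per node.
    total_ram = 32 * 24
    tasks = 32 * 8
    print(f"{min_gib_per_task(total_ram, tasks)} GiB per task")
```

The resulting size would then be passed to whichever benchmark tool is used as its per-task data volume.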

RFP results

Bids

We got 6 valid proposals
  Parallel storage capacity varied by more than 60% across bids
  Aggregate speed for parallel storage varied by almost 50%
  On-site capacity varied by almost 100%
  On-site storage ranged from a NAS on ZFS to full-fledged Lustre or GPFS systems

System selected

Parallel storage: Xyratex CS6000
  1.4 PB usable Lustre FS
  12 OSS and 4 targets per OSS
  4 TB NL-SAS drives + SSD for journals
  30 GB/s maximum aggregate R/W speed
On-site storage: Xyratex CS1500
  120 TB usable Lustre FS (scales to 7 PB)
  4 CIFS/NFS exporters
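A few derived figures help compare the selected parallel storage with the RFP minimums (1 PB usable, 20 GB/s). The sketch below assumes decimal units (1 PB = 1000 TB) and that capacity and bandwidth are spread evenly across targets and servers, which the slides do not state; the dictionary and function names are illustrative.

```python
# Minimums from the RFP and figures from the "System selected" slide.
REQUIRED = {"capacity_pb": 1.0, "speed_gbs": 20.0}
SELECTED = {"capacity_pb": 1.4, "speed_gbs": 30.0,
            "oss": 12, "targets_per_oss": 4}


def derived_figures(system: dict) -> dict:
    """Per-target and per-server averages, assuming an even spread."""
    ost_count = system["oss"] * system["targets_per_oss"]
    return {
        "ost_count": ost_count,
        "tb_per_ost": system["capacity_pb"] * 1000 / ost_count,
        "gbs_per_oss": system["speed_gbs"] / system["oss"],
    }


def headroom(system: dict, required: dict) -> dict:
    """Fraction by which the system exceeds each required minimum."""
    return {key: system[key] / required[key] - 1.0 for key in required}


if __name__ == "__main__":
    print(derived_figures(SELECTED))
    print({k: f"{v:.0%}" for k, v in headroom(SELECTED, REQUIRED).items()})
```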

Deployment

Operation

Both systems in production since early February
Parallel storage dedicated to the research group
  Mounted on compute resources
Data transfers are enabled by Globus endpoints on dedicated DTNs at both sites.
To do:
  Review network topology for transfers
  perfSONAR nodes to be deployed at the research center

Operation (cont.)

Researchers need a CC account to access the parallel storage
Access control and allocations are a challenge
  Shared spreadsheet filled in by the research center to allocate space on the parallel FS for their users (cumbersome!)
  Integration with the CCDB would leverage the existing system to manage storage allocations

Lessons learned

Lessons learned

Time consuming (2-year project)
  Mostly trust and relationship building
  Time needed to write an RFP should not be underestimated
Benefits for the research group
  Access to a team of specialists to lead their project
  Major cost savings on the infrastructure. No investment to upgrade an existing server room (UPS, power, cooling, etc.)

Cost to integrate CS6000

Installation: $900 (rack enclosure)
Power: $1457 (new outlets)
Cooling: $0
InfiniBand: used existing cables
  6 CXP-QSFP cables (18 QDR links)
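As a rough sanity check on the interconnect, the sketch below computes the theoretical data rate of the 18 QDR links and totals the integration costs listed above. It assumes standard 4x QDR links (10 Gbit/s signalling per lane with 8b/10b encoding) and ignores protocol overhead, so the result is an upper bound rather than an achievable throughput.

```python
QDR_LANE_GBITS = 10.0      # signalling rate per lane, Gbit/s
LANES_PER_LINK = 4         # a 4x link
ENCODING_EFFICIENCY = 0.8  # 8b/10b encoding


def qdr_link_gbytes_per_s() -> float:
    """Theoretical data rate of one 4x QDR link in GB/s."""
    return QDR_LANE_GBITS * LANES_PER_LINK * ENCODING_EFFICIENCY / 8


def aggregate_gbytes_per_s(links: int) -> float:
    """Theoretical aggregate data rate over several links in GB/s."""
    return links * qdr_link_gbytes_per_s()


# Integration costs from the slide, in dollars.
COSTS = {"installation": 900, "power": 1457, "cooling": 0}

if __name__ == "__main__":
    print(f"18 QDR links: {aggregate_gbytes_per_s(18):.0f} GB/s theoretical")
    print(f"total integration cost: ${sum(COSTS.values())}")
```

Under these assumptions the existing cabling offers a theoretical ceiling well above the 30 GB/s maximum quoted for the storage system.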

Improving the process

Sharing RFPs between Compute Canada sites could ease the process for new projects
Common benchmarks across Compute Canada would help when designing acceptance tests
  Applies to both storage and computing