Calcul Québec - Université Laval
Building a Storage System for Genomics
HPCS 2014 Halifax, NS

Agenda
- Genomics storage project background
- Reviewing and optimizing the proposal
- Network + politics issues
- Writing the RFP
- Lessons learned

Genomics storage project: background
- CFI Leading Edge Fund (2012 competition): “Human and Microbial Integrative Genomics”
- Project lead: Dr. Jacques Simard, CRCHUQ (Université Laval)
- 16 researchers from Université Laval
- Bioinformatics and Computational Infrastructure (Arnaud Droit)
- Large storage component

CRCHUQ
- Genomics, proteomics and metabolomics
- Data sources: HiSeq 2500, 2000 and MiSeq
- Applications: RAY, genomics pipeline, …
- Some researchers are already active HPC users: Jacques Corbeil, Sébastien Boisvert (RAY), Arnaud Droit, Yohan Bossé

Site specifications: Physical
- Number of racks in silo: 56 max
- Floor loading capacity: 940 lb/ft²

Site specifications: Power
- 25 kV hydro line
- 2 MVA transformer
- 1 MW generator
- Campus data center
- CII: Centre des infrastructures informationnelles
- Silo: 1.1 MW available (~33% used), 72 kW UPS (+ generator)

Site specifications: Cooling
- Rack cooling: 100% air
- No CRAC units! Uses the campus-wide chilled-water loop for cooling
- Cooling capacity: 1.5 MW
- Residual heat transferred to the campus hot-water loop
- Partial free air cooling (up to 300 kW)
[Figure labels: cooling coils, air blowers, free air cooling]

Site specifications: Networking
- Fibre-optic network reaching all Québec hospital research networks
[Figure: network diagram showing RISQ, the UL computing centre (Colosse), the Université Laval backbone, the Calcul Québec router, other Calcul Québec centres (UdeM, McGill), Canarie, other Compute Canada sites and the Université Laval campus; link speeds are not recoverable]

Timeline
2012
- Feb: first meeting
- March: finalize budget
- April: CFI conditions met
- Nov: identify firewall issue
- Number of meetings with vendors/manufacturers
2013
- Jan 15: MSSS meeting
- July 9: MSSS derogation
- Oct 3: RFP published
- Nov 20: RFP winner announced
- Number of meetings with vendors/manufacturers
2014
- Jan 6: physical installation
- Jan 22: acceptance testing starts
- Feb 26: all tests pass, system accepted

Researcher LEF proposal
- Initial contact: Researcher -> VPR -> CC staff
- Initial meetings: review proposal, discussions
  - Researchers planned to install storage at CRCHUQ facilities
  - Based on a quote from a local supplier
- Review and optimize
  - Discuss possible optimizations in the proposal
  - Scheduled meetings with HPC storage suppliers

CFI LEF
- Discussed the option to host storage at the CC site
  - Install some storage and compute at CRCHUQ
  - Bulk storage at the UL/CQ/CC site
  - High-speed connectivity already in place (10G UL-CRCHUQ)
- Sounds simple, right? …

Concerns raised
- Ease of access to CC-hosted storage
- Security

Refining the proposal
- Evaluate the benefits of hosting at CQ/CC
  - Power and cooling infrastructure already in place
  - O&M handled by CQ staff, in collaboration with the CRCHUQ sysadmin
- The initial budget planned for room renovations and extra A/C: $$ saved for more infrastructure

MOU
- MOU sent to CFI (Jan 2013)
  - CRCHUQ and CQ/CC staff work on the RFP and acquisition process
  - CQ/CC staff manage the storage
  - Storage is for the exclusive use of CRCHUQ
  - Local storage for CRCHUQ, parallel FS at CQ/UL
  - Archival (tapes) will use the existing system available at a remote CQ/CC site

Genomics Storage Components
[Figure: storage components diagram. Labels: CRCHUQ, Illumina sequencers, workstations, local FS, file server / data mover, genome login / data mover, Genome FS, Colosse, Lustre scratch, Lustre home, compute nodes, QDR InfiniBand network, Ethernet network; capacities, node counts and link speeds are not recoverable]

Interconnect
- Université Laval owns a fibre-optic MAN
- Interconnects all QC hospital research networks
[Figure: network diagram showing RISQ, the UL computing centre (Colosse), the Université Laval backbone, the CHUL firewall, the bioinformatics laboratory and its firewall, CRCHUQ, the Calcul Québec router, other Calcul Québec centres (UdeM, McGill), Canarie, other Compute Canada sites and the Université Laval campus; link speeds are not recoverable]

Network testing
[Figure: network diagram (RISQ, Colosse, Université Laval backbone, CHUL firewall, bioinformatics laboratory, CRCHUQ, Calcul Québec router, Canarie) with a test node and a test node (VM) added. Figure labels: 4 Gbps, 141 Mbps]
- “IS-QC” firewall limits flows to 1.2 Gbps
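
To put the three rates on this slide in perspective, the sketch below estimates how long one week of sequencer output (10 TB, the figure used later in the talk) would take to move at each of them. It assumes decimal units and a fully sustained rate with no protocol overhead; the function name is illustrative and not part of the talk.

```python
def transfer_hours(terabytes: float, gbps: float) -> float:
    """Hours needed to move `terabytes` (decimal TB) at a sustained `gbps` (Gbit/s)."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9) / 3600


WEEKLY_OUTPUT_TB = 10  # sequencing platform at full capacity, per the talk

for label, rate in [
    ("4 Gbps", 4.0),
    ("1.2 Gbps (IS-QC flow limit)", 1.2),
    ("141 Mbps", 0.141),
]:
    print(f"{label}: {transfer_hours(WEEKLY_OUTPUT_TB, rate):.1f} h")
```

Under these assumptions, 141 Mbps needs roughly 158 hours for 10 TB, which is close to the 168 hours in a week, so a path that slow would barely keep up with the sequencers.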

“Pile of firewalls”
- CRCHUQ already manages its own security firewall at its periphery
- IS-QC, under MSSS authority, acts as a “safety valve”
- Work with CRCHUQ to request a derogation to remove IS-QC

MSSS Derogation
- Dec 2012: document and prepare the meeting
- Jan 2013: meeting with MSSS security staff
- Jan 2013: refusal by the regional security coordinator
- Feb 2013: CRCHUQ director writes to the deputy minister of MSSS IT
- July 2013: deputy minister (MSSS IT) visits UL/CQ
- Derogation granted.

Network Measurements
- perfSONAR for periodic network measurements
[Figure: network diagram showing RISQ, the UL computing centre (Colosse), the Université Laval backbone, the CHUL firewall, the bioinformatics laboratory and its firewall, CRCHUQ, the Calcul Québec router, other Calcul Québec centres (UdeM, McGill), Canarie, other Compute Canada sites and the Université Laval campus; link speeds are not recoverable]

Archival
- Use existing tape archives at CQ
- Plenty of network bandwidth (for now…)
[Figure: network diagram showing RISQ, the UL computing centre (Colosse), the Université Laval backbone, the CHUL firewall, the bioinformatics laboratory and its firewall, CRCHUQ, the Calcul Québec router, other Calcul Québec centres (UdeM, McGill), Canarie, other Compute Canada sites and the Université Laval campus; link speeds are not recoverable]

RFP

Building the RFP
- An iterative process
- Based on multiple meetings with researchers, plus the expertise and market knowledge of the local HPC team
[Figure labels: Vendors, RFP, Researcher, HPC team]

Premise
- 2 storage systems with different requirements in very different environments
- Parallel storage
  - Large and high-speed, in a modern datacenter with plenty of power and cooling
- On-site storage
  - Smaller capacity with a slower interconnect, in an air-conditioned server room

Challenges
- Budget is limited. We want to get the most out of it!
- But the most of what?
  - Parallel storage capacity
  - Parallel storage write speed
  - On-site storage capacity
  - etc.

Challenges (cont.)
- Most of the budget is to be allocated to the parallel storage
  - To enable computing and mid-term storage
- On-site storage must be large enough. No more.
- A quality-based RFP allows for such distinctions

How large is large enough?
- The sequencing platform could generate 10 TB of data per week when operating at full capacity
- 40 TB would provide 1 month of buffering
[Figure: Sequencers -> On-site storage (buffering) -> Parallel storage, with automated data synchronisation]
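
The buffering arithmetic on this slide is simple enough to check directly. The sketch below divides on-site capacity by the weekly sequencer output; it assumes the platform runs at its full 10 TB/week and that nothing is drained to the parallel storage in the meantime. The helper name is illustrative.

```python
def buffer_weeks(capacity_tb: float, weekly_output_tb: float = 10.0) -> float:
    """Weeks of sequencer output the on-site storage can hold with no off-site sync."""
    return capacity_tb / weekly_output_tb


# 40 TB is the sizing target on this slide; 50 TB and 120 TB are the
# RFP minimum and the delivered capacity quoted elsewhere in the talk.
for capacity in (40, 50, 120):
    print(f"{capacity} TB -> {buffer_weeks(capacity):.0f} weeks of buffering")
```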

Quality-based RFP
- We chose to publish a quality-based RFP
  - In contrast to a lowest-bidder process
- Evaluated on cost + « quality criteria »
- Vendors are asked to spend at least 95% of the budget

Challenges (solution)
- Define 2 independent sets of requirements
- Use the « quality criteria » to let vendors know what they should prioritize
  - More weight will be given to the parallel storage components

First things first
Hardware only or integrated solution?
- A) Hardware only: write an RFP to buy X TB of raw disk space + Y servers and the accompanying interconnect. Integrate everything in-house to deploy a storage system.
- B) Integrated solution: ask for a complete system that meets size and performance requirements.

Integrated Solution
- Cumbersome question: Lustre, GPFS or anything?
  - Should we ask for a specific parallel FS?
  - Some parallel FS are tied to a specific vendor or a very small set of vendors
- Went with Lustre because it is a multi-vendor ecosystem … and our team is already familiar with it

Fostering competition
- The RFP can be so specific as to open the door to only a single product
- Or it can let bidders come up with their own solution to our problem
[Figure labels: Specific product, Surprise…]

Fostering competition (cont.)
- Vendors know when an RFP is targeted at them
  - They will price accordingly
- Conversely, vendors will not bid if they do not feel they have a fair chance
  - Fewer bids will often mean a « higher price »
- A less constrained RFP will generally attract more proposals

Fostering competition (cont.)
- Example of being too specific:
  - « Storage units with 60 drives in raid5, 8+2 configuration »
- Such a statement could apply to a single vendor, while limiting the available technologies

Spec'ing a storage system
- Power & cooling capacity
- Physical space and room topology
- Compatibility with existing infrastructure
  - Software
  - Physical

Physical infrastructure
- Document the floor/rack plan
  - Maximum weight per square foot?
  - How much space do we actually have?
- Where does the system need to connect?
  - Both power and interconnect
  - Cable length

Power & Cooling
- How much electrical capacity is available?
  - Total? Per rack? UPS?
- Can our room cooling system handle that much new power?

Requirements for parallel storage
- 1 PB usable (or more)
- Lustre FS
  - Compatible with Lustre clients 1.8.9 and 2.4.x
- 20 GB/s aggregate read/write speed (or more)
- Drive and Lustre server redundancy
  - « how » is purposely left unspecified
- InfiniBand interconnect
  - 2:1 blocking factor with computing resources

Requirements for parallel storage (cont.)
- Vendor to provide all interconnect
  - Leaf IB switch, Ethernet switch for management, and cabling
  - Site provides uplink to core switches
- 20 kW maximum electrical consumption
  - Vendor to supply PDUs (switched)
  - Site to connect PDUs to the existing electrical infrastructure

Requirements for on-site storage
- Export a network filesystem
  - Compatible with sequencers, Windows 7, Linux and Mac
- 10G Ethernet interconnect
- 50 TB usable capacity (or more)
  - With the option to grow up to 300 TB
- Drive and server redundancy

Requirements for on-site storage (cont.)
- Site to provide all cabling and interconnect for on-site storage
- PDUs and rack space provided by the site

Measuring the quality of a proposal
- Final evaluation is based on the « adjusted price »
  - Calculated from the bid price and the rating of the « quality criteria » given by the evaluation committee
- The « adjusted price » can vary from the real price by up to 30%

Quality criteria
Criterion                            Weight
Parallel storage                     45 %
On-site storage                      20 %
Interconnect & networking            10 %
Vendor's experience & reputation     25 %

Quality criteria (cont.)
- In the first three categories, meeting the base requirements gives a passing score of 70%. Any specs or meaningful features above the base requirements will improve the mark.

Quality criteria (cont.)
- In the « vendor » category, the score is based on the bidder's experience in deploying similar systems, with a requirement for at least 1 such system in the past 18 months.
- The support structure and the résumé of the lead architect for the project are also factors.
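
As a rough illustration of how the weights and the « adjusted price » could fit together, the sketch below computes a weighted quality mark and then applies a price adjustment. The weights come from the slides; the adjustment formula does not. The talk only says the adjusted price can differ from the bid by up to 30%, so the linear discount used here (none at the 70% passing mark, 30% at a perfect mark) is an assumption, as are the function names and the example marks.

```python
# Category weights from the "Quality criteria" slide.
WEIGHTS = {
    "parallel_storage": 0.45,
    "onsite_storage": 0.20,
    "interconnect": 0.10,
    "vendor": 0.25,
}


def quality_score(marks: dict[str, float]) -> float:
    """Weighted mark out of 100 from per-category marks out of 100."""
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


def adjusted_price(bid: float, score: float, k: float = 0.30,
                   passing: float = 70.0) -> float:
    """ASSUMED formula: linear discount from 0 at `passing` to `k` at 100."""
    discount = k * (score - passing) / (100.0 - passing)
    return bid * (1.0 - discount)


# Made-up marks for a hypothetical bid of 1,000,000.
marks = {"parallel_storage": 85, "onsite_storage": 70,
         "interconnect": 75, "vendor": 90}
score = quality_score(marks)
print(f"score = {score:.2f}, adjusted price = {adjusted_price(1_000_000, score):,.0f}")
```

With this kind of scheme a more expensive bid can still win if its quality mark is high enough, which is the point of a quality-based RFP.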

Benchmarks & stability tests

Acceptance tests
- We define stability tests to validate that the system can operate in a real production environment.
- We run synthetic benchmarks to make sure the system hits the performance targets set by the vendor, as requested by the quality criteria.

Stability tests
- To validate normal operation
  - Homogeneous firmware and software versions everywhere
  - No errors or warnings
  - Verify the system reboots cleanly
  - Lustre mounts properly
- Simulate drive failures
  - Verify the rebuild process
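
The "homogeneous versions everywhere" check lends itself to a small script. The sketch below is one possible approach, not the team's actual tooling: it takes a mapping of node name to reported version (however that was collected) and returns the nodes that differ from the most common version. Node names and version strings are hypothetical.

```python
from collections import Counter


def version_outliers(versions: dict[str, str]) -> dict[str, str]:
    """Return the nodes whose version differs from the most common one."""
    if not versions:
        return {}
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {node: v for node, v in versions.items() if v != majority}


# Hypothetical inventory; in practice this would be gathered from each server.
reported = {"oss01": "2.4.1", "oss02": "2.4.1", "oss03": "2.4.0", "oss04": "2.4.1"}
outliers = version_outliers(reported)
print(outliers or "all nodes report the same version")
```

An empty result means the check passes; anything else lists exactly which nodes need attention before acceptance.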

Benchmarks
- We set some base rules
  - No custom tools. Re-use existing software
  - Let the vendor tune the tests for its system
  - But tests must be large enough to avoid cache effects
- What to benchmark
  - Read/write speed of a single target: IOzone
  - Maximum aggregate read/write speed: IOR
  - Maximum I/O operations per second (IOPS): mdtest
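
The "large enough to avoid cache effects" rule can be turned into a number before a benchmark is run. The sketch below uses a common rule of thumb that is not stated in the talk: write at least twice the combined RAM of the nodes taking part, so that most of the data cannot be served from memory. The node count, RAM size and safety factor are all hypothetical.

```python
def min_dataset_gib(ram_per_node_gib: float, nodes: int, factor: float = 2.0) -> float:
    """Smallest total benchmark data size (GiB) under the 'factor x aggregate RAM' rule."""
    return ram_per_node_gib * nodes * factor


# Hypothetical test bed: 16 client nodes with 24 GiB of RAM each.
total = min_dataset_gib(ram_per_node_gib=24, nodes=16)
print(f"write at least {total:.0f} GiB in total, i.e. {total / 16:.0f} GiB per client")
```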

RFP results

Bids
- We got 6 valid proposals
  - Parallel storage capacity varied by more than 60% across bids
  - Aggregate speed for parallel storage varied by almost 50%
  - On-site capacity varied by almost 100%
  - On-site storage went from a NAS on ZFS to full-fledged Lustre or GPFS systems

System selected
- Parallel storage: Xyratex CS6000
  - 1.4 PB usable Lustre FS
  - 12 OSS and 4 targets per OSS
  - 4 TB NL-SAS drives + SSD for journals
  - 30 GB/s maximum aggregate R/W speed
- On-site storage: Xyratex CS1500
  - 120 TB usable Lustre FS (scales to 7 PB)
  - 4 CIFS/NFS exporters
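
A quick sanity check of the selected parallel storage against the base requirements stated in the RFP (1 PB usable, 20 GB/s aggregate). The per-server and per-target figures assume the load is spread perfectly evenly, which is an idealization; variable names are illustrative.

```python
# Figures quoted in the talk.
OSS_COUNT = 12
TARGETS_PER_OSS = 4
USABLE_PB = 1.4
PEAK_GB_PER_S = 30.0
REQUIRED_PB = 1.0
REQUIRED_GB_PER_S = 20.0

ost_count = OSS_COUNT * TARGETS_PER_OSS
per_oss = PEAK_GB_PER_S / OSS_COUNT          # GB/s each server must sustain
per_ost = PEAK_GB_PER_S / ost_count          # GB/s each target must sustain
tb_per_ost = USABLE_PB * 1000 / ost_count    # usable TB per target

print(f"{ost_count} targets, {per_oss:.2f} GB/s per OSS, {per_ost:.3f} GB/s per target")
print(f"about {tb_per_ost:.1f} TB usable per target")
print(f"capacity headroom: {USABLE_PB / REQUIRED_PB - 1:.0%}, "
      f"speed headroom: {PEAK_GB_PER_S / REQUIRED_GB_PER_S - 1:.0%}")
```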

Deployment

Operation
- Both systems in production since early February
- Parallel storage dedicated to the research group, mounted on compute resources
- Data transfers are enabled by Globus endpoints on dedicated DTNs at both sites
- To do:
  - Review the network topology for transfers
  - perfSONAR nodes to be deployed at the research center

Operation (cont.)
- Researchers need a CC account to access the parallel storage
- Access control and allocations are a challenge
  - A shared spreadsheet is filled in by the research center to allocate space on the parallel FS for their users (cumbersome!)
  - Integration with the CCDB would leverage an existing system to manage storage allocations

Lessons learned

Lessons learned
- Time-consuming (2-year project)
  - Mostly trust and relationship building
  - The time needed to write an RFP should not be underestimated
- Benefits for the research group
  - Access to a team of specialists to lead their project
  - Major cost savings on the infrastructure: no investment to upgrade an existing server room (UPS, power, cooling, etc.)

Cost to integrate CS6000
- Installation: $900 (rack enclosure)
- Power: $1,457 (new outlets)
- Cooling: $0
- InfiniBand: used existing cables
  - 6 CXP-QSFP cables (18 QDR links)
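
The cable count can be checked against the storage system's peak speed. The sketch below assumes each CXP-QSFP cable fans one 12x port out into three 4x QDR links (which matches the 18 links on the slide) and that a 4x QDR link carries 32 Gbit/s of data per direction (40 Gbit/s signalling with 8b/10b encoding). Real throughput will be lower than this theoretical figure.

```python
CXP_CABLES = 6
LINKS_PER_CXP = 3               # assumed: one 12x CXP port -> three 4x QSFP links
QDR_4X_DATA_GBIT = 32.0         # data rate per 4x QDR link, per direction
PEAK_STORAGE_GB_PER_S = 30.0    # maximum aggregate R/W speed quoted for the CS6000

links = CXP_CABLES * LINKS_PER_CXP
uplink_gb_per_s = links * QDR_4X_DATA_GBIT / 8   # Gbit/s -> GB/s

print(f"{links} QDR links, {uplink_gb_per_s:.0f} GB/s theoretical")
print(f"ratio to storage peak: {uplink_gb_per_s / PEAK_STORAGE_GB_PER_S:.1f}x")
```

Under these assumptions the existing cabling offers more than twice the bandwidth the storage can deliver, which is consistent with reusing it at no extra cost.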

Improving the process
- Sharing RFPs between Compute Canada sites could ease the process for new projects
- Common benchmarks across Compute Canada would help when designing acceptance tests
- Applies to both storage and computing