Are blade servers suitable for HPTC? This talk covers the pros and cons of building your next cluster using blades. Talk given at the International Supercomputing blade workshop in 2007.
- 1. Blades for HPTC
-
- Informatics Systems Group
2. Introduction
-
- What is our HPTC workload?
-
- What are the challenges of doing cluster computing?
-
- Sanger's experience with blade systems.
3. The Science 4. The Post Genomic Era
- Genomes now available for many organisms.
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG
GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA
TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA
TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC
AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC
TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA
ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG
AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC
TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG
AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG
AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA
GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT
ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC 5.
Deciphering the genome
- Sequence needs to be analysed.
-
- Are the genes related to other genes via evolution?
- This analysis is known as gene annotation.
- Provides the basis for new questions:
-
- What happens when the genes go wrong?
-
- How do genes interact with one another?
-
- What do the genes we have never seen before do?
6. Annotation at Sanger
- We have both human and machine annotation efforts.
-
- HAVANA group: manual annotation (10% coverage).
-
- Ensembl project: automated annotation of 26 vertebrate
genomes.
- Data pooled into the Ensembl database.
-
- Access via website (8M hits / week).
-
- Direct SQL access (~150 queries / second; see the sketch after this list).
-
- Core databases: 250GB / month.
- Software is all Open Source (Apache-style license).
- Data is free for download.
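The direct SQL access mentioned above is plain MySQL against the public Ensembl database server. A minimal sketch of a query, assuming the documented public host and anonymous read-only user; database names are release-specific, so treat this as illustrative:

    # List the human core databases on the public Ensembl MySQL server
    # (read-only, anonymous access).
    mysql -h ensembldb.ensembl.org -u anonymous \
          -e 'SHOW DATABASES LIKE "homo_sapiens_core%"'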
7. (The genomic sequence from slide 4, repeated.) Ensembl Annotation 8. Ensembl Annotation 9. Ensembl Annotation 10. Ensembl Annotation 11. How is the data generated?
- Ensembl provides a framework for automated annotation.
-
- Scientist describes annotation required.
- Rulemanager generates a set of compute tasks.
-
- ~20,000 jobs for a moderate genome.
- Runner executes the jobs.
-
- Takes care of dependencies, failures.
-
- LSF is used as the DRM for job execution (see the sketch after this list).
-
- Results and state stored in mysql databases.
-
- Newly sequenced genomes are incorporated into Ensembl reasonably easily.
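As a rough illustration of the LSF step: the jobs are independent, so they map naturally onto an LSF job array. The queue name, script and log paths below are placeholders, and in practice the Perl Runner drives LSF rather than a hand-written bsub:

    #!/bin/bash
    # Submit one LSF job array covering every input chunk of an analysis.
    # Queue, script path and chunk count are illustrative only.
    NCHUNKS=20000
    bsub -J "genebuild[1-${NCHUNKS}]" -q normal \
         -o logs/genebuild.%J.%I.out -e logs/genebuild.%J.%I.err \
         ./run_analysis.sh
    # Inside run_analysis.sh, $LSB_JOBINDEX tells each array element
    # which chunk of the genome to work on.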
12. Genebuild Workflow 13. System requirements
- Many algorithms involved.
-
- perl / java pipeline managers.
- Integer, not floating point, intensive.
-
- General compute rather than specialised processors.
-
- A 64-bit address space is nice, but not essential.
- Lots and lots of disk I/O.
-
- 500GB genomic dataset searched by the pipeline.
- Minimal interprocess communication.
14. System requirements
- System is embarrassingly parallel.
-
- Scales well when we add more nodes.
- We don't need low-latency interconnects.
- Well suited to clusters of commodity hardware.
- (We also need HA clusters for the queuing system and mysql
databases, but that is another presentation)
15. Cluster MK 1
- 360 DS10L 1U servers in 9 racks.
16. But...
-
- New genomes are sequenced.
-
- Errors in old genomes corrected.
- We want to compare all genomes against all others.
17. Compute demand grows with the data
- Science exceeds current compute capacity every 18 months.
- We need a bigger cluster every 18 months.
-
- Keep the current one running and help the users!
18. 5 clusters in 6 years
- 20x increase in compute capacity.
-
-
- (Moore's law helps a bit, but that is transistors, not SPECint.)
-
- Clusters are really hard.
19. Why are clusters hard? 20. Scaling
- Everyone talks about code scaling.
-
- Will my application run on more nodes?
-
- If we double the cluster size, will we have double the
admins?
-
- If it is hard today, what will it be like in 18 months?
-
- If we have to spend less admin time per node, will reliability
suffer?
-
- We should be spending time helping users optimise code.
- Everything that can go wrong on a server can go wrong on a
cluster node.
-
- But we have hundreds of nodes.
21. Clusters Get More Complex
-
- 360 CPUs, local disk storage, single Fast Ethernet.
-
- Multiple trunked GigE networks, cluster filesystems, SAN storage, multiple architectures (ia32, AMD64, a token ia64 and Alpha).
- Bleeding edge hardware / software stacks.
-
- Google may not be your friend if you are the first to find the problem.
22. Manageability is the key
- Numerous, complex systems are hard to manage.
- Clusters need good management tools.
- The fastest cluster in the world is of no use if it does not
stay up long enough to run your jobs.
- Manageability is our number 1 priority when designing
clusters.
-
- We do not buy on price/performance.
-
- We buy on price/manageability.
23. Cluster Management Life Cycle
-
- Getting the cluster configured.
24. Installation
-
- Like disk space, data centres are 80% full 6 months after they are built.
-
- Total heat output vs density.
-
- Each system needs multiple network cables.
-
-
- public network, private network, SAN, mgt network.
-
- Don't forget the switching.
- But the cluster got delivered last week, why can't I run
jobs?
25. Commissioning
- Getting the system up and running.
-
- OS deployment usually comes last!
-
-
- BIOS, NIC, mgt processor, FC HBA etc.
-
- Standardise BIOS settings.
-
-
- HT, memory interleave etc.
-
- Machines with failed DIMMs or CPUs.
-
- OS installation, local customisations.
26. Production
-
- Hardware failures should be detected and the admin told.
-
- Ideally they should be detected before they are fatal.
-
- Black-hole machines (nodes that silently eat jobs) are painful on HPTC clusters.
-
- Can you get a remote console?
-
- Can I power off my cluster from home at 2:00am?
-
- Can I do it before my machines melt?
27. How do blades help? 28. How Do Blades Help?
- Manageability touches on hardware and software.
-
- Good manageability requires smart software and smart
hardware.
- Blades have smart hardware.
-
- Management processors on blades and in chassis.
- Blades have smart software.
-
- Vendors supply OS deployment and management tools.
- Unit of administration is the chassis, not the blade.
-
- We end up managing a smaller number of smarter entities.
29. Smart Hardware
-
- Sits on the blade and/or the chassis.
-
- Key enabler. Almost all benefits flow from this.
-
- Hardware Inventory (MAC addresses, BIOS revs etc).
-
- Remote console (SOL, VNC).
-
- Machine health (memory, fans, CPUs).
-
- BIOS twiddling (PXE boot).
-
- Integrated switch management.
30. Smart software
-
- Provides window into what the hardware is doing.
-
- Provides remote console, power and alerting.
-
- Typically golden image installers.
-
- Allow for rapid and consistent OS installation.
-
- Quick / automated re-tasking of machines.
- May be integrated into single product.
31. 32. 33. Web interface 34. Management Interface
-
- Easy to get to grips with and find features.
- Command line is even better.
-
- Command line means we can script it.
- Command line tools allow you to integrate blade management with
existing tools.
-
- You do not have to use the vendor suggested solution.
35. Why Extend Existing Tools?
- Vendor tools can be limiting.
-
- Tend to be Windows-centric, as Windows is a pain to manage.
-
- May not work with non-standard network or disk configs.
- Linux already has good deployment tools.
-
- Not quite fully automated.
- Management processor command line interface.
-
- We can script and do whatever we want.
-
- Use existing deployment tools to install blades.
-
- Can cope with whatever twisted configs we want to run.
36. The Cluster Management Life Cycle Revisited
- How do blades make it easier?
37. Cluster MK5
-
- 140 dual-core / dual-CPU blades.
-
- 2 GigE trunked private network.
-
- Disk config: hardware RAID1 for OS.
38. Installation
- Blades take up less space.
-
- Less space to clear / tidy.
- Integrated power and networking.
39. Installation
- 42 1U servers with 3 GigE networks:
- 70 blades in 5 chassis with 3 GigE networks:
- One person can rack and patch a cabinet of blades in a
day.
40. Consolidated networking and power
41. Cabling 42. Commissioning
-
- Script sets static IP addresses, alerts etc.
-
- Script configures network switches.
-
- Script updates all blade and mgt module firmware.
- ~0.5 day for the initial config on 10 chassis.
43. Commissioning
- We extended the FAI Debian auto installer.
-
- It can cope with our non-standard network and disk
topologies.
-
- Open Source generic system: future-proof.
-
- Harvest MAC addresses from the mgt processors (see the sketch after this list).
-
- PXE boot blades into FAI.
-
- Construct RAID, flash system BIOS, set BIOS flags.
-
- OS and SW installation and customisation.
-
- Set blade to boot off disk and reboot.
- 160 seconds for a full OS and software install.
-
- Run script, go drink tea.
- Command line tools crucial.
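For flavour, the MAC-harvesting and PXE step might look like the sketch below. The show-blade-macs command is a stand-in for whatever the chassis management CLI really provides, and the host names and paths are made up:

    #!/bin/bash
    # Ask each chassis management module for the MAC address of every
    # blade's first NIC, then emit dhcpd host entries that PXE-boot the
    # blades into FAI.
    for mm in chassis01-mm chassis02-mm chassis03-mm; do
        ssh admin@"$mm" show-blade-macs        # hypothetical vendor command
    done |
    while read -r blade mac; do
        printf 'host %s {\n  hardware ethernet %s;\n  next-server faiserver;\n  filename "pxelinux.0";\n}\n' \
               "$blade" "$mac"
    done > /etc/dhcp3/dhcpd.conf.blades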
44. Production
-
- Remote power and remote console.
-
- Alerts go into helpdesk system.
-
- Manage cluster from anywhere I can get ssh.
-
- DSH: run commands on all blades.
-
- cfengine: manage config files.
-
- ganglia / LSF: load monitoring.
-
- smartmontools for disk failures.
-
- Emergency shutdown script (see the sketch after this list).
-
- Runs round mgt processors and powers off blades.
-
- Keep blowers etc going to reduce heat stress.
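A minimal sketch of the emergency shutdown, assuming the blade management processors speak IPMI over the LAN; the host list, user name and choice of a soft power-off are illustrative:

    #!/bin/bash
    # Walk the blade management processors and power each blade off.
    # The chassis (and its blowers) stay up, so residual heat is still
    # moved out of the racks.  blade-mgmt.list: one BMC hostname per line.
    while read -r bmc; do
        ipmitool -I lanplus -H "$bmc" -U admin -f /root/.ipmipass \
                 chassis power soft \
            && echo "powered off: $bmc"
    done < blade-mgmt.list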
45. Blades make large clusters easier
-
- Grown from 360 to 1456 CPUs.
-
- Shrunk from 360 systems to 42 chassis.
46. How many admins?
- It takes 1 admin-day / week to look after a 1456 CPU cluster.
-
- This went down when we moved from servers to blades.
-
- cf. TCO studies on the web.
-
- 1 full-time admin for 40-50 Unix machines.
- We look after all the rest of the Sanger systems too!
- We spend more time helping users than poking hardware.
-
- We get good usage out of our cluster.
47. Can blades help you? 48. Blade Pros / Cons
- Blades cost more up front.
-
- Pay for the chassis, even if you never fill it.
- Management savings only realised on larger installations.
-
- Would you use blades for an 8-node cluster?
- However, as cluster size increases, costs change.
-
- Management savings multiply as cluster size increases.
-
- Less power overall, but in a small space.
-
- Price / performance / watt ?
49. Interconnects
- We do not use low latency interconnects.
- Blade chassis share a backplane.
-
- Typically 4 GB/s backplane.
-
- This can limit the full bandwidth of the blades.
- Blades have limited specialised network options.
-
- Single half-height PCI card.
-
- Currently limited to 4x InfiniBand, gigabit and SAN.
50. Conclusions
- Good management is the key, whether you run blades or servers.
-
- Good management is easier on blades.
- Blades can do anything a standard server can.
-
- In less of your space and less of your time.
- If you are building larger clusters, consider blades.
51. Acknowledgements
- Informatics Systems Group
52. 53. Storage Concepts 54. The data problem
- Pipeline is IO bound in many places.
-
- 500GB of genomic data to search.
- Keep the data as close as possible to the compute.
-
- BLAST over NFS is a complete disaster.
-
- Data / IO problems are common on bioinformatics clusters of > 20 nodes.
(Diagram: all nodes reading data from a single NFS server, which becomes the bottleneck.) 55. Initial Strategy
- Keep the data on local disk.
-
- Copy the dataset to each machine in the cluster.
(Diagram: the dataset copied onto each node's local disk.) 56. Data Scaling
- Data management was a real headache.
-
- Ever-expanding dataset was copied to each machine in the farm
(400-1000 nodes).
-
- Data grew from 50 to 500GB.
- Copying data onto 1000+ machines takes time (see the sketch after this list).
-
- 0.5-2 days for large data pushes, even with clever
approaches.
- Ensuring data integrity is hard.
- Experience showed it was not a scalable approach for the
future.
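For contrast with the cleverer approaches mentioned above, the naive version of a data push is just an rsync fan-out; the node list, paths and parallelism are made up:

    #!/bin/bash
    # Push the genomic dataset to every node's local disk, 20 nodes at a
    # time.  Even with much smarter, tree-structured copying, large pushes
    # took 0.5-2 days at farm scale.
    SRC=/data/blastdb/
    DEST=/local/blastdb/
    xargs -P 20 -I{} rsync -a --delete "$SRC" {}:"$DEST" < nodes.list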
57. Cluster file systems
- In early 2003 we started investigating cluster file systems for farm usage.
- Most machines had gigabit connections.
-
- Network speeds close to local disk speeds (120 MBytes/s).
- Bitten hard by Tru64 end of life.
-
- We have ~300TB of data on Tru64 / AdvFS cluster filesystems.
-
- We need a future proof storage solution.
-
- Binary kernel modules are evil.
-
- We often run non-standard kernels.
58. Initial Implementation
- No cluster file system would scale to all nodes in the cluster.
-
- Assessed a large number of systems.
- GPFS was the one we settled on.
-
- Not all nodes need SAN connections.
-
- Not open source (you have to start somewhere).
- Divide farm up into a number of small systems.
-
- Chassis is an obvious unit.
-
- File systems spanned 2 or 3 chassis of blades.
-
- End up with 20 file systems.
- Keeping 20 file systems in sync is (relatively) easy.
59. Topology I
- 10x28 clusters of local NSDs
-
- GPFS striped across local disks on all nodes (see the sketch after this list).
-
- Data accessed via gigabit.
-
- Limited by replication level on GPFS and how often we expect
machine failures.
- Performance limited by network.
-
- 80 MBytes/s for a single client.
- Requires no special hardware.
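For flavour, a stripped-down sketch of creating one of these local-NSD filesystems, assuming GPFS 3.x-era commands; the disk-descriptor format and option syntax vary between GPFS releases, so treat this as illustrative rather than a recipe:

    # disks.lst: one local disk per blade, each blade serving its own NSD.
    # Giving every blade its own failure group makes GPFS place the two
    # data replicas on different blades.
    #   DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
    #   /dev/sdb:blade01::dataAndMetadata:1
    #   /dev/sdb:blade02::dataAndMetadata:2

    mmcrnsd -F disks.lst                  # turn the local disks into NSDs
    mmcrfs /gpfs/fs01 fs01 -F disks.lst -m 2 -r 2 -M 2 -R 2
    mmmount fs01 -a                       # mount on every node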
(Diagram: blades with local NSDs, all attached to a gigabit switch.) 60. Topology II: Hybrid
-
- Server machines have SAN storage.
-
- Client machines talk to servers over the LAN.
- Not every machine needs SAN.
-
- Clients do IO to multiple server machines.
-
- Eliminates single server bottleneck.
SAN Switch 61. Future implementation
- Expand cluster file system to the whole cluster.
-
- Allows users to manage their own data.
-
- Use cluster file system for general scratch/work space.
-
- Open source (v. x is proprietary, v. x-1 is open sourced).
-
- Scales to 1000s of nodes.
-
- Performs well; in pilots our network is the bottleneck.
-
- Easy (ish) to add more network.
62. Lustre Config (Diagram: eight OSTs plus MDS and ADM nodes, with a mix of 10G, 4G and 2G links.) 63. The network is vital.
- Cluster IO is very stressful for networks.
-
- We can fill gigabit links from a single client.
- Large amounts of gigabit networking.
- Non-blocking switches are critical.
64.