Blades for HPTC

Transcript
  • 1. Blades for HPTC
      • Guy Coates
      • Informatics Systems Groups
      • [email_address]

2. Introduction

  • The science.
    • What is our HPTC workload?
  • Why are clusters hard?
    • What are the challenges of doing cluster computing?
  • How do blades help us?
    • Sanger's experience with blade systems.
  • Can blades help you?
    • What can blades not do?

3. The Science

4. The Post Genomic Era

  • Genomes now available for many organisms.
  • What does it mean?

TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA ... [raw, unannotated DNA sequence continues]

5. Deciphering the genome

  • The sequence needs to be analysed.
    • Where are the genes?
    • What do the genes do?
    • Are the genes related to other genes via evolution?
  • This analysis is known as gene annotation.
  • Provides the basis for new questions:
    • What happens when the genes go wrong?
    • How do genes interact with one another?
    • What do the genes we have never seen before do?

6. Annotation at Sanger

  • We have both human and machine annotation efforts.
    • Havana Group: manual annotation (10% coverage).
    • Ensembl project: automated annotation of 26 vertebrate genomes.
  • Data pooled into the Ensembl database.
    • Access via website (8M hits / week).
    • Perl/Java/SQL APIs.
    • Bulk download via FTP.
    • Direct SQL access (~150 queries / second); see the query sketch after this list.
    • Core databases 250GB / month.
  • Software is all Open Source (Apache style license).
  • Data is free for download.
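
The direct SQL access mentioned above can be tried against Ensembl's public MySQL mirror. A minimal sketch, assuming the documented public host ensembldb.ensembl.org, the anonymous read-only account, and a hypothetical core database name; check the current Ensembl release for the real schema and database names:

```python
# Minimal sketch: direct SQL access to an Ensembl core database.
# Host, account and database name are assumptions based on Ensembl's
# public MySQL documentation; substitute your mirror and release.
import pymysql

conn = pymysql.connect(
    host="ensembldb.ensembl.org",          # assumed public mirror
    user="anonymous",                      # assumed read-only account
    database="homo_sapiens_core_32_35e",   # hypothetical release name
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM gene")   # table name per core schema
        print("annotated genes:", cur.fetchone()[0])
finally:
    conn.close()
```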

7. Ensembl Annotation
   [The raw DNA sequence from slide 4, shown again with Ensembl annotation progressively overlaid across slides 7-10.]
8. Ensembl Annotation
9. Ensembl Annotation
10. Ensembl Annotation

11. How is the data generated?

  • Ensembl provides a framework for automated annotation.
    • Scientist describes annotation required.
  • Rulemanager generates a set of compute tasks.
    • ~20,000 jobs for a moderate genome.
    • ~10,000 CPU hours.
  • Runner executes the jobs.
    • Takes care of dependencies, failures.
    • LSF is used as the DRM for job execution (see the sketch after this list).
    • Results and state stored in mysql databases.
  • Extensible and reusable.
    • Newly sequenced genomes are incorporated into Ensembl reasonably easily.
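
As a rough illustration of the Runner handing jobs to LSF, here is a minimal sketch; the queue name, log paths and the run_analysis binary are hypothetical placeholders, and only bsub itself is standard LSF:

```python
# Minimal sketch of submitting pipeline jobs through LSF (the DRM).
# Queue, paths and the job binary are hypothetical placeholders.
import subprocess

def submit(job_id: int, chunk: str) -> None:
    subprocess.run([
        "bsub",
        "-q", "normal",                     # hypothetical queue name
        "-o", f"logs/job_{job_id}.out",     # keep stdout for failure analysis
        "-e", f"logs/job_{job_id}.err",
        "run_analysis", chunk,              # hypothetical pipeline binary
    ], check=True)

for i, chunk in enumerate(["chunk_000.fa", "chunk_001.fa"]):
    submit(i, chunk)
```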

12. Genebuild Workflow

13. System requirements

  • Many algorithms involved.
    • Blast, exonerate (C).
    • perl / java pipeline managers.
    • 400 binaries in all.
  • Integer, not floating point, intensive.
    • General compute rather than specialised processors.
  • Moderate memory sizes.
    • 64-bit memory addressing is nice, but not essential.
  • Lots and lots of disk I/O.
    • 500GB genomic dataset searched by the pipeline.
    • I/O bound in many parts.
  • Minimal interprocess communication.
    • The odd 4-node MPI job.

14. System requirements

  • System is embarrassingly parallel.
    • Scales well when we add more nodes.
  • We don't need low-latency interconnects.
    • Ethernet is fine.
  • Well suited to clusters of commodity hardware.
  • (We also need HA clusters for the queuing system and mysql databases, but that is another presentation)

15. Cluster MK 1

  • 360 DS10L 1U servers in 9 racks.
  • Bog standard cluster.

16. But...

  • Data keeps on coming in.
    • New genomes are sequenced.
    • Errors in old genomes corrected.
  • We want to compare all genomes against all others.

17. Compute demand grows with the data

  • Science exceeds current compute capacity every 18 months.
  • We need a bigger cluster every 18 months.
    • Keep the current one running and help the users!

18. 5 clusters in 6 years

  • 20x increase in compute capacity.
      • (Moore's law helps a bit, but that is transistors, not SpecInt.)
  • What did we learn?
    • Clusters are really hard.

19. Why are clusters hard?

20. Scaling

  • Everyone talks about code scaling.
    • Will my application run on more nodes?
  • Do admins scale?
    • If we double the cluster size, will we have double the admins?
    • If it is hard today, what will it be like in 18 months?
    • If we have to spend less admin time per node, will reliability suffer?
    • We should be spending time helping users optimise code.
  • Everything that can go wrong on a server can go wrong on a cluster node.
    • But we have hundreds of nodes.
    • Hundreds of problems.

21. Clusters Get More Complex

  • MK1 cluster:
    • 360 CPUs, local disk storage, single Fast Ethernet.
  • MK5 cluster:
    • Multiple trunked GigE networks, cluster filesystems, SAN storage, multiple architectures (ia32, AMD64, token ia64 and alpha).
  • Bleeding edge hardware / software stacks.
    • Non trivial problems.
    • Google may not be your friend if you are the first to find the problem.

22. Manageability is the key

  • Numerous, complex systems are hard to manage.
  • Clusters need good management tools.
  • The fastest cluster in the world is of no use if it does not stay up long enough to run your jobs.
  • Manageability is our number 1 priority when designing clusters.
    • We do not buy on price/performance.
    • We buy on price/manageability.

23. Cluster Management Life Cycle

  • Installation.
    • Bolting the thing in.
  • Commissioning.
    • Getting the cluster configured.
  • Production.
    • Doing some useful work.

24. Installation

  • Where to put the racks?
    • Like disk space, data centres reach 80% full 6 months after they are built.
  • Power / Aircon.
    • You need to have enough.
    • Total heat output vs density.
  • Networking.
    • Each system needs multiple network cables.
      • public network, private network, SAN, mgt network.
    • Don't forget the switching.
  • But the cluster got delivered last week, why can't I run jobs?

25. Commissioning

  • Getting the system up and running.
    • OS deployment usually comes last!
  • Initial configuration.
    • Firmware updates.
      • BIOS, NIC, mgt processor, FC HBA etc.
    • Standardise BIOS settings.
      • HT, memory interleave etc.
    • RAID configuration.
  • DOA Discovery.
    • Machines with failed DIMMs, CPUs etc.
  • OS Deployment.
    • OS installation, local customisations.
    • Application stack.

26. Production

  • Broken Hardware.
    • Hardware failures should be detected and the admin told.
    • Ideally they should be detected before they are fatal.
    • Black-hole machines (nodes that accept jobs but silently lose them) are painful on HPTC clusters.
  • Sysadmin tasks.
    • Software updates etc.
  • Emergencies.
    • Can you get a remote console?
    • Console logs / oopses.
  • Doomsday scenarios.
    • Power or AC failures.
    • Can I power off my cluster from home at 2:00am?
    • Can I do it before my machines melt?

27. How do blades help?

28. How Do Blades Help?

  • Manageability touches on hardware and software.
    • Good manageability requires smart software and smart hardware.
  • Blades have smart hardware.
    • Management processors on blades and in chassis.
    • (And some servers now.)
  • Blades have smart software.
    • Vendors supply OS deployment and management tools.
  • Unit of administration is the chassis, not the blade.
    • We end up managing a smaller number of smarter entities.

29. Smart Hardware

  • Management processor.
    • Sits on the blade and/or the chassis.
    • Key enabler. Almost all benefits flow from this.
  • Basic Features.
    • Hardware Inventory (MAC addresses, BIOS revs etc).
    • Remote power.
    • Remote console (SOL, VNC).
    • Machine health (memory, fans, CPUs).
    • Alerting.
  • Advanced Features.
    • BIOS twiddling (PXE boot).
    • Firmware updates.
    • Integrated switch management.

30. Smart software

  • Management Suite.
    • Provides window into what the hardware is doing.
    • Provides remote console, power and alerting.
  • OS deployment suite.
    • Typically golden image installers.
    • Allow for rapid and consistent OS installation.
    • Quick / automated re-tasking of machines.
    • Software inventories.
  • May be integrated into single product.

31.
32.
33. Web interface

34. Management Interface

  • Web interfaces are nice.
    • Easy to get to grips with and find features.
  • Command line is even better.
    • Command line means we can script it.
  • Command line tools allow you to integrate blade management with existing tools (see the sketch after this list).
    • You do not have to use the vendor suggested solution.
    • Magic of open source.
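
Because the management processors expose a command line, blade control can be folded into whatever scripts you already run. A minimal sketch, assuming ssh access to a chassis management module; the host name, admin account and the "power" command syntax are hypothetical stand-ins for your vendor's CLI:

```python
# Minimal sketch: wrap a chassis management-module CLI so blade power
# control plugs into existing scripts.  CLI syntax is hypothetical.
import subprocess

def mgt_cmd(chassis: str, command: str) -> str:
    """Run one command on a chassis management module over ssh."""
    result = subprocess.run(
        ["ssh", f"admin@{chassis}", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Power-cycle blade 7 in chassis bc01 (hypothetical command syntax).
print(mgt_cmd("bc01-mgt", "power -cycle -T blade[7]"))
```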

35. Why Extend Existing Tools?

  • Vendor tools can be limiting.
    • Tend to be Windows-centric, as Windows is a pain to manage.
    • May not work with non standard network or disk configs.
  • Linux already has good deployment tools.
    • Why re-invent the wheel?
    • Not quite fully automated.
  • Management processor command line interface.
    • We can script and do whatever we want.
  • Extend existing tools.
    • Use existing deployment tools to install blades.
    • Can cope with whatever twisted configs we want to run.

36. The Cluster Management Life Cycle Revisited

  • ...But with blades.
  • How do blades make it easier?

37. Cluster MK5

  • 560 CPUs
    • 140 dual-core / dual-CPU blades.
    • 10 chassis, 2 cabinets.
  • OS:
    • Debian / AMD64.
  • Networking:
    • 1 GigE external network.
    • 2 GigE trunked private network.
  • Storage:
    • Disk config: hardware RAID1 for OS.
    • Cluster filesystem.

38. Installation

  • Blades take up less space.
    • Less space to clear / tidy.
  • Integrated power and networking.
    • Fewer cables.

39. Installation

  • 42 1U servers with 3 GigE networks:
    • 42 10/100 mgt cables.
    • 126 GigE cables.
    • 42 power cables.
    • External switches.
  • 70 blades in 5 chassis with 3 GigE networks:
    • 5 10/100 mgt cables.
    • 15 GigE cables.
    • 20 power cables.
    • No external switches.
  • One person can rack and patch a cabinet of blades in a day.
    • I know, I've done it!

40. Consolidated networking and power

  • 14 servers:

41. Cabling

42. Commissioning

  • Bootstrap blade chassis (sketch after this list).
    • Configure mgt module.
    • Script sets static IP addresses, alerts etc.
    • Script configures network switches.
  • FW Updates.
    • Script updates all blade and mgt module firmware.
  • ~0.5 day for the initial config on 10 chassis.
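
A minimal sketch of the bootstrap step, looping over the ten chassis and pushing addresses, alerting and switch settings; the management-module commands shown are hypothetical placeholders for the vendor CLI, and only the ssh wrapper is generic:

```python
# Minimal sketch of chassis bootstrap: static IPs, alert target and
# switch config pushed to every management module.
import subprocess

CHASSIS = [f"bc{n:02d}-mgt" for n in range(1, 11)]   # 10 chassis, as deployed

def run(host: str, command: str) -> None:
    subprocess.run(["ssh", f"admin@{host}", command], check=True)

for i, host in enumerate(CHASSIS, start=1):
    run(host, f"ifconfig -eth0 -i 10.1.0.{i} -s 255.255.255.0")   # hypothetical CLI
    run(host, "alertcfg -remote syslog.internal")                 # hypothetical CLI
    run(host, "switchcfg -vlan 100 -trunk ext1,ext2")             # hypothetical CLI
```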

43. Commissioning

  • We extended the FAI Debian auto installer.
    • We use it already.
    • It can cope with our non-standard network and disk topologies.
    • Open Source generic system: future-proof.
  • Install sequence (front end sketched after this list):
    • Harvest MAC addresses from mgt processor.
    • PXE boot blades into FAI.
    • Construct RAID, flash system BIOS, set BIOS flags.
    • OS and SW installation and customisation.
    • Set blade to boot off disk and reboot.
  • 160 seconds for a full OS and software install.
    • Run script, go drink tea.
  • Command line tools crucial.
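
The front end of that install sequence can be sketched as follows: harvest each blade's MAC from the management processor and emit ISC dhcpd host entries so the blades PXE-boot into FAI. The "inventory -mac" command and the naming scheme are hypothetical; the dhcpd.conf syntax is standard:

```python
# Minimal sketch: harvest blade MACs and generate dhcpd entries for
# PXE-booting into FAI.  The management CLI call is hypothetical.
import subprocess

def blade_macs(chassis: str) -> list[str]:
    out = subprocess.run(
        ["ssh", f"admin@{chassis}", "inventory -mac"],   # hypothetical CLI
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

with open("dhcpd-blades.conf", "w") as conf:
    for c, chassis in enumerate(["bc01-mgt", "bc02-mgt"], start=1):
        for b, mac in enumerate(blade_macs(chassis), start=1):
            conf.write(
                f'host blade{c:02d}-{b:02d} {{\n'
                f'  hardware ethernet {mac};\n'
                f'  filename "fai/pxelinux.0";\n'        # hand off to FAI's PXE boot
                f'}}\n'
            )
```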

44. Production

  • Management processor.
    • Remote power and remote console.
    • Hardware failures.
    • Alerts go into helpdesk system.
    • Manage cluster from anywhere I can get ssh.
  • Standard linux tools.
    • DSH: run commands on all blades.
    • cfengine: manage config files.
    • ganglia / LSF: load monitoring.
    • smartmontools for disk failures.
  • Doomsday scenario.
    • Emergency shutdown script (sketch after this list).
    • Runs round mgt processors and powers off blades.
    • Keep blowers etc going to reduce heat stress.
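
The doomsday script is roughly this shape: walk every management processor and power off the blades, while chassis blowers and management modules stay up so airflow continues. Host names and the "power -off" syntax are hypothetical:

```python
# Minimal sketch of the emergency shutdown: power off all blades via the
# management processors.  CLI syntax is hypothetical.
import subprocess

CHASSIS = [f"bc{n:02d}-mgt" for n in range(1, 11)]

for chassis in CHASSIS:
    for blade in range(1, 15):                          # 14 blades per chassis
        subprocess.run(
            ["ssh", f"admin@{chassis}", f"power -off -T blade[{blade}]"],
            check=False,                                # keep going past dead blades
        )
```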

45. Blades make large clusters easier

    • Grown from 360 to 1456 CPUs.
    • Shrunk from 360 systems to 42 chassis.

46. How many admins?

  • It takes 1 admin-day / week to look after a 1456-CPU cluster.
    • This went down when we moved from servers to blades.
    • cf. TCO studies on the web:
    • 1 full-time admin for 40-50 Unix machines.
      • (Windows is half that).
  • We look after all the rest of the Sanger systems too!
  • We spend more time helping users than poking hardware.
    • We get good usage out of our cluster.

47. Can blades help you?

48. Blade Pros / Cons

  • Blades cost more up front.
    • Pay for the chassis, even if you never fill it.
  • Management savings only realised on larger installations.
    • Would you use blades for an 8-node cluster?
  • However, as cluster size increases, costs change.
    • Management savings multiply as cluster size increases.
  • Power density is high.
    • Less power overall, but in a small space.
    • Price / performance / watt?

49. Interconnects

  • We do not use low latency interconnects.
    • We do Gigabit + SAN
  • Blade chassis share a backplane.
    • Typically 4 GB/s backplane.
    • Limits the full bandwidth of the blades.
    • What is the latency hit?
  • Blades have limited specialised network options.
    • Single half height PCI card.
    • Currently limited to 4x InfiniBand, gigabit and SAN.

50. Conclusions

  • Good management is the key, whether you run blades or servers.
    • Good management is easier on blades.
  • Blades can do anything a standard server can.
    • In less of your space and less of your time.
  • If you are building larger clusters, consider blades.

51. Acknowledgements

  • Informatics Systems Group
    • Tim Cutts
    • Mark Rae
    • Simon Kelley
    • Andy Flint
    • Gildas Le Nadan
    • Peter Clapham
  • Special Projects Group
    • John Nicholson
    • Martin Burton
    • Russell Vincent
    • Dave Holland

52.
53. Storage Concepts

54. The data problem

  • Pipeline is IO bound in many places.
    • 500GB of genomic data to search.
  • Keep the data as close as possible to the compute.
    • Blast over NFS is a complete disaster.
    • Data / IO problems are common on bioinformatics clusters of more than 20 nodes.

[Diagram: compute nodes all pulling data from a single NFS server, which becomes the bottleneck.]

55. Initial Strategy

  • Keep the data on local disk.
    • Copy the dataset to each machine in the cluster.

[Diagram: the dataset copied onto local disk on each node.]

56. Data Scaling

  • Data management was a real headache.
    • Ever-expanding dataset was copied to each machine in the farm (400-1000 nodes).
    • Data grown from 50-500GB.
  • Copying data onto 1000+ machines takes time (see the push sketch after this list).
    • 0.5-2 days for large data pushes, even with clever approaches.
  • Ensuring data integrity is hard.
    • Black-hole syndrome.
  • Experience showed it was not a scalable approach for the future.
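
One of the "clever approaches" to pushing data can be sketched as a tree-structured rsync fan-out, where every node that already holds the dataset seeds further nodes rather than all nodes pulling from one server. Host names and the rsync invocation are illustrative, not the exact scripts we used:

```python
# Minimal sketch: exponential fan-out of a dataset push.  Each host that
# already has the data pushes it to two more (ideally in parallel).
import subprocess

NODES = [f"node{n:03d}" for n in range(400)]       # farm nodes
DATA = "/data/blastdb/"                            # dataset to replicate

def push(src: str, dst: str) -> None:
    # Run rsync on the source host, copying its local dataset to dst.
    subprocess.run(["ssh", src, "rsync", "-a", DATA, f"{dst}:{DATA}"], check=True)

seeded, todo = ["fileserver"], list(NODES)
while todo:
    newly = []
    for src in seeded:
        for _ in range(2):                         # each seed feeds two more nodes
            if not todo:
                break
            dst = todo.pop(0)
            push(src, dst)                         # serial here; parallelise in practice
            newly.append(dst)
    seeded += newly
```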

57. Cluster file systems

  • In early 2003 we started investigating cluster file systems for farm usage.
  • Most machines had gigabit connections.
    • Network speeds are close to local disk speeds (~120 MBytes/s).
  • Bitten hard by Tru64 end of life.
    • We have ~300TB of data on Tru64/AdvFS cluster filesystems.
    • No migration path.
    • We need a future proof storage solution.
  • Should be Open Source.
    • Binary kernel modules are evil.
    • We often run non-standard kernels.

58. Initial Implementation

  • No cluster file system would scale to all nodes in the cluster.
    • Assessed a large number of systems.
  • GPFS was the one we settled on.
    • Not all nodes need SAN connections.
    • Not open source (you have to start somewhere).
  • Divide farm up into a number of small systems.
    • Chassis is an obvious unit.
    • File systems spanned 2 or 3 chassis of blades.
    • We end up with 20 file systems.
  • Keeping 20 file systems in sync is (relatively) easy.

59. Topology I

  • 10 x 28-node clusters of local NSDs.
    • GPFS striped across local disks on all nodes.
    • Data accessed via gigabit.
  • 2 chassis per cluster.
    • Limited by replication level on GPFS and how often we expect machine failures.
  • Performance limited by network.
    • 80 MBytes/s for a single client.
  • Requires no special hardware.

[Diagram: blades striping GPFS across their local disks, with data accessed over a gigabit switch.]

60. Topology II: Hybrid

  • 4 x 42-node clusters.
    • Server machines have SAN storage.
    • Client machines talk to servers over the LAN.
  • Not every machine needs SAN.
    • Clients do IO to multiple server machines.
    • Eliminates single server bottleneck.

[Diagram: SAN-attached server blades serving LAN clients through a switch.]

61. Future implementation

  • Expand cluster file system to the whole cluster.
    • Single copy of the data.
    • Allows users to manage their own data.
    • Use cluster file system for general scratch/work space.
    • Eliminate NFS.
  • Implementing Lustre.
    • Open source (version x is proprietary, version x-1 is open sourced).
    • Scales to 1000s of nodes.
    • Performs well; in pilots our network is the bottleneck.
    • Easy (ish) to add more network.

62. Lustre Config

[Diagram: OSTs, an MDS and an ADM node, linked by 2G, 4G and 10G network connections.]

63. The network is vital.

  • Cluster IO is very stressful for networks.
    • We can fill gigabit links from a single client.
  • Large amounts of gigabit networking.
    • Multiple gigabit trunks.
  • Non-blocking switches are critical.

64.

