Blades for HPTC

Transcript
  • 1. Blades for HPTC
      • Guy Coates
      • Informatics Systems Groups
      • [email_address]

2. Introduction

  • The science.
    • What is our HPTC workload?
  • Why are clusters hard?
    • What are the challenges of doing cluster computing?
  • How do blades help us?
    • Sanger's experience with blade systems.
  • Can blades help you?
    • What can blades not do?

3. The Science

4. The Post Genomic Era

  • Genomes now available for many organisms.
  • What does it mean?

TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA ... [raw, unannotated DNA sequence continues]

5. Deciphering the genome

  • The sequence needs to be analysed.
    • Where are the genes?
    • What do the genes do?
    • Are the genes related to other genes via evolution?
  • This analysis is known as gene annotation.
  • Provides the basis for new questions:
    • What happens when the genes go wrong?
    • How do genes interact with one another?
    • What do the genes we have never seen before do?

6. Annotation at Sanger

  • We have both human and machine annotation efforts.
    • Havana Group: manual annotation (10% coverage).
    • Ensembl project: automated annotation of 26 vertebrate genomes.
  • Data pooled into the Ensembl database.
    • Access via website (8M hits / week).
    • Perl/Java/SQL APIs.
    • Bulk download via FTP.
    • Direct SQL access (~150 queries / second); see the query sketch after this list.
    • Core databases 250GB / month.
  • Software is all Open Source (Apache style license).
  • Data is free for download.
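
The direct SQL access mentioned above can be tried against Ensembl's public MySQL mirror. A minimal sketch, assuming the documented public host ensembldb.ensembl.org, the anonymous read-only account, and a hypothetical core database name; check the current Ensembl release for the real schema and database names:

```python
# Minimal sketch: direct SQL access to an Ensembl core database.
# Host, account and database name are assumptions based on Ensembl's
# public MySQL documentation; substitute your mirror and release.
import pymysql

conn = pymysql.connect(
    host="ensembldb.ensembl.org",          # assumed public mirror
    user="anonymous",                      # assumed read-only account
    database="homo_sapiens_core_32_35e",   # hypothetical release name
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM gene")   # table name per core schema
        print("annotated genes:", cur.fetchone()[0])
finally:
    conn.close()
```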

7. Ensembl Annotation
   [The raw DNA sequence from slide 4, shown again with Ensembl annotation progressively overlaid across slides 7-10.]
8. Ensembl Annotation
9. Ensembl Annotation
10. Ensembl Annotation

11. How is the data generated?

  • Ensembl provides a framework for automated annotation.
    • Scientist describes annotation required.
  • Rulemanager generates a set of compute tasks.
    • ~20,000 jobs for a moderate genome.
    • ~10,000 CPU hours.
  • Runner executes the jobs.
    • Takes care of dependencies, failures.
    • LSF is used as the DRM for job execution (see the sketch after this list).
    • Results and state stored in mysql databases.
  • Extensible and reusable.
    • Newly sequenced genomes are incorporated into Ensembl reasonably easily.
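
As a rough illustration of the Runner handing jobs to LSF, here is a minimal sketch; the queue name, log paths and the run_analysis binary are hypothetical placeholders, and only bsub itself is standard LSF:

```python
# Minimal sketch of submitting pipeline jobs through LSF (the DRM).
# Queue, paths and the job binary are hypothetical placeholders.
import subprocess

def submit(job_id: int, chunk: str) -> None:
    subprocess.run([
        "bsub",
        "-q", "normal",                     # hypothetical queue name
        "-o", f"logs/job_{job_id}.out",     # keep stdout for failure analysis
        "-e", f"logs/job_{job_id}.err",
        "run_analysis", chunk,              # hypothetical pipeline binary
    ], check=True)

for i, chunk in enumerate(["chunk_000.fa", "chunk_001.fa"]):
    submit(i, chunk)
```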

12. Genebuild Workflow

13. System requirements

  • Many algorithms involved.
    • Blast, exonerate (C).
    • perl / java pipeline managers.
    • 400 binaries in all.
  • Integer, not floating point, intensive.
    • General compute rather than specialised processors.
  • Moderate memory sizes.
    • 64-bit memory addressing is nice, but not essential.
  • Lots and lots of disk I/O.
    • 500GB genomic dataset searched by the pipeline.
    • I/O bound in many parts.
  • Minimal interprocess communication.
    • The odd 4-node MPI job.

14. System requirements

  • System is embarrassingly parallel.
    • Scales well when we add more nodes.
  • We don't need low-latency interconnects.
    • Ethernet is fine.
  • Well suited to clusters of commodity hardware.
  • (We also need HA clusters for the queuing system and mysql databases, but that is another presentation)

15. Cluster MK 1

  • 360 DS10L 1U servers in 9 racks.
  • Bog standard cluster.

16. But...

  • Data keeps on coming in.
    • New genomes are sequenced.
    • Errors in old genomes corrected.
  • We want to compare all genomes against all others.

17. Compute demand grows with the data

  • Science exceeds current compute capacity every 18 months.
  • We need a bigger cluster every 18 months.
    • Keep the current one running and help the users!

18. 5 clusters in 6 years

  • 20x increase in compute capacity.
      • (Moore's law helps a bit, but that is transistors, not SpecInt.)
  • What did we learn?
    • Clusters are really hard.

19. Why are clusters hard?

20. Scaling

  • Everyone talks about code scaling.
    • Will my application run on more nodes?
  • Do admins scale?
    • If we double the cluster size, will we have double the admins?
    • If it is hard today, what will it be like in 18 months?
    • If we have to spend less admin time per node, will reliability suffer?
    • We should be spending time helping users optimise code.
  • Everything that can go wrong on a server can go wrong on a cluster node.
    • But we have hundreds of nodes.
    • Hundreds of problems.

21. Clusters Get More Complex

  • MK1 cluster:
    • 360 CPUs, local disk storage, single Fast Ethernet.
  • MK5 cluster:
    • Multiple trunked GigE networks, cluster filesystems, SAN storage, multiple architectures (ia32, AMD64, token ia64 and alpha).
  • Bleeding edge hardware / software stacks.
    • Non trivial problems.
    • Google may not be your friend if you are the first to find the problem.

22. Manageability is the key

  • Numerous, complex systems are hard to manage.
  • Clusters need good management tools.
  • The fastest cluster in the world is of no use if it does not stay up long enough to run your jobs.
  • Manageability is our number 1 priority when designing clusters.
    • We do not buy on price/performance.
    • We buy on price/manageability.

23. Cluster Management Life Cycle

  • Installation.
    • Bolting the thing in.
  • Commissioning.
    • Getting the cluster configured.
  • Production.
    • Doing some useful work.

24. Installation

  • Where to put the racks?
    • Like disk space, data centres reach 80% full 6 months after they are built.
  • Power / Aircon.
    • You need to have enough.
    • Total heat output vs density.
  • Networking.
    • Each system needs multiple network cables.
      • public network, private network, SAN, mgt network.
    • Don't forget the switching.
  • But the cluster got delivered last week, why can't I run jobs?

25. Commissioning

  • Getting the system up and running.
    • OS deployment usually comes last!
  • Initial configuration.
    • Firmware updates.
      • BIOS, NIC, mgt processor, FC HBA etc.
    • Standardise BIOS settings.
      • HT, memory interleave etc.
    • RAID configuration.
  • DOA Discovery.
    • Machines with failed DIMMs, CPUs etc.
  • OS Deployment.
    • OS installation, local customisations.
    • Application stack.

26. Production

  • Broken Hardware.
    • Hardware failures should be detected and the admin told.
    • Ideally they should be detected before they are fatal.
    • Black-hole machines (nodes that accept jobs but silently lose them) are painful on HPTC clusters.
  • Sysadmin tasks.
    • Software updates etc.
  • Emergencies.
    • Can you get a remote console?
    • Console logs / oopses.
  • Doomsday scenarios.
    • Power or AC failures.
    • Can I power off my cluster from home at 2:00am?
    • Can I do it before my machines melt?

27. How do blades help?

28. How Do Blades Help?

  • Manageability touches on hardware and software.
    • Good manageability requires smart software and smart hardware.
  • Blades have smart hardware.
    • Management processors on blades and in chassis.
    • (And some servers now.)
  • Blades have smart software.
    • Vendors supply OS deployment and management tools.
  • Unit of administration is the chassis, not the blade.
    • We end up managing a smaller number of smarter entities.

29. Smart Hardware

  • Management processor.
    • Sits on the blade and/or the chassis.
    • Key enabler. Almost all benefits flow from this.
  • Basic Features.
    • Hardware Inventory (MAC addresses, BIOS revs etc).
    • Remote power.
    • Remote console (SOL, VNC).
    • Machine health (memory, fans, CPUs).
    • Alerting.
  • Advanced Features.
    • BIOS twiddling (PXE boot).
    • Firmware updates.
    • Integrated switch management.

30. Smart software

  • Management Suite.
    • Provides window into what the hardware is doing.
    • Provides remote console, power and alerting.
  • OS deployment suite.
    • Typically golden image installers.
    • Allow for rapid and consistent OS installation.
    • Quick / automated re-tasking of machines.
    • Software inventories.
  • May be integrated into single product.

31.
32.
33. Web interface

34. Management Interface

  • Web interfaces are nice.
    • Easy to get to grips with and find features.
  • Command line is even better.
    • Command line means we can script it.
  • Command line tools allow you to integrate blade management with existing tools (see the sketch after this list).
    • You do not have to use the vendor suggested solution.
    • Magic of open source.
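
Because the management processors expose a command line, blade control can be folded into whatever scripts you already run. A minimal sketch, assuming ssh access to a chassis management module; the host name, admin account and the "power" command syntax are hypothetical stand-ins for your vendor's CLI:

```python
# Minimal sketch: wrap a chassis management-module CLI so blade power
# control plugs into existing scripts.  CLI syntax is hypothetical.
import subprocess

def mgt_cmd(chassis: str, command: str) -> str:
    """Run one command on a chassis management module over ssh."""
    result = subprocess.run(
        ["ssh", f"admin@{chassis}", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Power-cycle blade 7 in chassis bc01 (hypothetical command syntax).
print(mgt_cmd("bc01-mgt", "power -cycle -T blade[7]"))
```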

35. Why Extend Existing Tools?

  • Vendor tools can be limiting.
    • Tend to be Windows-centric, as Windows is a pain to manage.
    • May not work with non standard network or disk configs.
  • Linux already has good deployment tools.
    • Why re-invent the wheel?
    • Not quite fully automated.
  • Management processor command line interface.
    • We can script and do whatever we want.
  • Extend existing tools.
    • Use existing deployment tools to install blades.
    • Can cope with whatever twisted configs we want to run.

36. The Cluster Management Life Cycle Revisited

  • ...But with blades.
  • How do blades make it easier?

37. Cluster MK5

  • 560 CPUs
    • 140 dual-core / dual-CPU blades.
    • 10 chassis, 2 cabinets.
  • OS:
    • Debian / AMD64.
  • Networking:
    • 1 GigE external network.
    • 2 GigE trunked private network.
  • Storage:
    • Disk config: hardware RAID1 for OS.
    • Cluster filesystem.

38. Installation

  • Blades take up less space.
    • Less space to clear / tidy.
  • Integrated power and networking.
    • Fewer cables.

39. Installation

  • 42 1U servers with 3 GigE networks:
    • 42 10/100 mgt cables.
    • 126 GigE cables.
    • 42 power cables.
    • External switches.
  • 70 blades in 5 chassis with 3 GigE networks:
    • 5 10/100 mgt cables.
    • 15 GigE cables.
    • 20 power cables.
    • No external switches.
  • One person can rack and patch a cabinet of blades in a day.
    • I know, I've done it!

40. Consolidated networking and power

  • 14 servers:

41. Cabling

42. Commissioning

  • Bootstrap blade chassis (sketch after this list).
    • Configure mgt module.
    • Script sets static IP addresses, alerts etc.
    • Script configures network switches.
  • FW Updates.
    • Script updates all blade and mgt module firmware.
  • ~0.5 day for the initial config on 10 chassis.
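
A minimal sketch of the bootstrap step, looping over the ten chassis and pushing addresses, alerting and switch settings; the management-module commands shown are hypothetical placeholders for the vendor CLI, and only the ssh wrapper is generic:

```python
# Minimal sketch of chassis bootstrap: static IPs, alert target and
# switch config pushed to every management module.
import subprocess

CHASSIS = [f"bc{n:02d}-mgt" for n in range(1, 11)]   # 10 chassis, as deployed

def run(host: str, command: str) -> None:
    subprocess.run(["ssh", f"admin@{host}", command], check=True)

for i, host in enumerate(CHASSIS, start=1):
    run(host, f"ifconfig -eth0 -i 10.1.0.{i} -s 255.255.255.0")   # hypothetical CLI
    run(host, "alertcfg -remote syslog.internal")                 # hypothetical CLI
    run(host, "switchcfg -vlan 100 -trunk ext1,ext2")             # hypothetical CLI
```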

43. Commissioning

  • We extended the FAI Debian auto installer.
    • We use it already.
    • It can cope with our non-standard network and disk topologies.
    • Open Source generic system: future-proof.
  • Install sequence (front end sketched after this list):
    • Harvest MAC addresses from mgt processor.
    • PXE boot blades into FAI.
    • Construct RAID, flash system BIOS, set BIOS flags.
    • OS and SW installation and customisation.
    • Set blade to boot off disk and reboot.
  • 160 seconds for a full OS and software install.
    • Run script, go drink tea.
  • Command line tools crucial.
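
The front end of that install sequence can be sketched as follows: harvest each blade's MAC from the management processor and emit ISC dhcpd host entries so the blades PXE-boot into FAI. The "inventory -mac" command and the naming scheme are hypothetical; the dhcpd.conf syntax is standard:

```python
# Minimal sketch: harvest blade MACs and generate dhcpd entries for
# PXE-booting into FAI.  The management CLI call is hypothetical.
import subprocess

def blade_macs(chassis: str) -> list[str]:
    out = subprocess.run(
        ["ssh", f"admin@{chassis}", "inventory -mac"],   # hypothetical CLI
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

with open("dhcpd-blades.conf", "w") as conf:
    for c, chassis in enumerate(["bc01-mgt", "bc02-mgt"], start=1):
        for b, mac in enumerate(blade_macs(chassis), start=1):
            conf.write(
                f'host blade{c:02d}-{b:02d} {{\n'
                f'  hardware ethernet {mac};\n'
                f'  filename "fai/pxelinux.0";\n'        # hand off to FAI's PXE boot
                f'}}\n'
            )
```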

44. Production

  • Management processor.
    • Remote power and remote console.
    • Hardware failures.
    • Alerts go into helpdesk system.
    • Manage cluster from anywhere I can get ssh.
  • Standard linux tools.
    • DSH: run commands on all blades.
    • cfengine: manage config files.
    • ganglia / LSF: load monitoring.
    • smartmontools for disk failures.
  • Doomsday scenario.
    • Emergency shutdown script (sketch after this list).
    • Runs round mgt processors and powers off blades.
    • Keep blowers etc going to reduce heat stress.
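
The doomsday script is roughly this shape: walk every management processor and power off the blades, while chassis blowers and management modules stay up so airflow continues. Host names and the "power -off" syntax are hypothetical:

```python
# Minimal sketch of the emergency shutdown: power off all blades via the
# management processors.  CLI syntax is hypothetical.
import subprocess

CHASSIS = [f"bc{n:02d}-mgt" for n in range(1, 11)]

for chassis in CHASSIS:
    for blade in range(1, 15):                          # 14 blades per chassis
        subprocess.run(
            ["ssh", f"admin@{chassis}", f"power -off -T blade[{blade}]"],
            check=False,                                # keep going past dead blades
        )
```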

45. Blades make large clusters easier

    • Grown from 360 to 1456 CPUs.
    • Shrunk from 360 systems to 42 chassis.

46. How many admins?

  • It takes 1 admin-day / week to look after a 1456-CPU cluster.
    • This went down when we moved from servers to blades.
    • cf. TCO studies on the web:
    • 1 full-time admin for 40-50 Unix machines.
      • (Windows is half that).
  • We look after all the rest of the Sanger systems too!
  • We spend more time helping users than poking hardware.
    • We get good usage out of our cluster.

47. Can blades help you?

48. Blade Pros / Cons

  • Blades cost more up front.
    • Pay for the chassis, even if you never fill it.
  • Management savings only realised on larger installations.
    • Would you use blades for an 8-node cluster?
  • However, as cluster size increases, costs change.
    • Management savings multiply as cluster size increases.
  • Power density is high.
    • Less power overall, but in a small space.
    • Price / performance / watt?

49. Interconnects

  • We do not use low latency interconnects.
    • We do Gigabit + SAN
  • Blade chassis share a backplane.
    • Typically 4 GB/s backplane.
    • Limits the full bandwidth of the blades.
    • What is the latency hit?
  • Blades have limited specialised network options.
    • Single half height PCI card.
    • Currently limited to 4x InfiniBand, gigabit and SAN.

50. Conclusions

  • Good management is the key, whether you run blades or servers.
    • Good management is easier on blades.
  • Blades can do anything a standard server can.
    • In less of your space and less of your time.
  • If you are building larger clusters, consider blades.

51. Acknowledgements

  • Informatics Systems Group
    • Tim Cutts
    • Mark Rae
    • Simon Kelley
    • Andy Flint
    • Gildas Le Nadan
    • Peter Clapham
  • Special Projects Group
    • John Nicholson
    • Martin Burton
    • Russell Vincent
    • Dave Holland

52.
53. Storage Concepts

54. The data problem

  • Pipeline is IO bound in many places.
    • 500GB of genomic data to search.
  • Keep the data as close as possible to the compute.
    • Blast over NFS is a complete disaster.
    • Data / IO problems are common on bioinformatics clusters of more than 20 nodes.

[Diagram: compute nodes all pulling data from a single NFS server, which becomes the bottleneck.]

55. Initial Strategy

  • Keep the data on local disk.
    • Copy the dataset to each machine in the cluster.

[Diagram: the dataset copied onto local disk on each node.]

56. Data Scaling

  • Data management was a real headache.
    • Ever-expanding dataset was copied to each machine in the farm (400-1000 nodes).
    • Data grown from 50-500GB.
  • Copying data onto 1000+ machines takes time (see the push sketch after this list).
    • 0.5-2 days for large data pushes, even with clever approaches.
  • Ensuring data integrity is hard.
    • Black-hole syndrome.
  • Experience showed it was not a scalable approach for the future.
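
One of the "clever approaches" to pushing data can be sketched as a tree-structured rsync fan-out, where every node that already holds the dataset seeds further nodes rather than all nodes pulling from one server. Host names and the rsync invocation are illustrative, not the exact scripts we used:

```python
# Minimal sketch: exponential fan-out of a dataset push.  Each host that
# already has the data pushes it to two more (ideally in parallel).
import subprocess

NODES = [f"node{n:03d}" for n in range(400)]       # farm nodes
DATA = "/data/blastdb/"                            # dataset to replicate

def push(src: str, dst: str) -> None:
    # Run rsync on the source host, copying its local dataset to dst.
    subprocess.run(["ssh", src, "rsync", "-a", DATA, f"{dst}:{DATA}"], check=True)

seeded, todo = ["fileserver"], list(NODES)
while todo:
    newly = []
    for src in seeded:
        for _ in range(2):                         # each seed feeds two more nodes
            if not todo:
                break
            dst = todo.pop(0)
            push(src, dst)                         # serial here; parallelise in practice
            newly.append(dst)
    seeded += newly
```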

57. Cluster file systems

  • In early 2003 we started investigating cluster file systems for farm usage.
  • Most machines had gigabit connections.
    • Network speeds are close to local disk speeds (~120 MBytes/s).
  • Bitten hard by Tru64 end of life.
    • We have ~300TB of data on Tru64/AdvFS cluster filesystems.
    • No migration path.
    • We need a future proof storage solution.
  • Should be Open Source.
    • Binary kernel modules are evil.
    • We often run non-standard kernels.

58. Initial Implementation

  • No cluster file system would scale to all nodes in the cluster.
    • Assessed a large number of systems.
  • GPFS was the one we settled on.
    • Not all nodes need SAN connections.
    • Not open source (you have to start somewhere).
  • Divide farm up into a number of small systems.
    • Chassis is an obvious unit.
    • File systems spanned 2 or 3 chassis of blades.
    • We end up with 20 file systems.
  • Keeping 20 file systems in sync is (relatively) easy.

59. Topology I

  • 10 x 28-node clusters of local NSDs.
    • GPFS striped across local disks on all nodes.
    • Data accessed via gigabit.
  • 2 chassis per cluster.
    • Limited by replication level on GPFS and how often we expect machine failures.
  • Performance limited by network.
    • 80 MBytes/s for a single client.
  • Requires no special hardware.

[Diagram: blades striping GPFS across their local disks, with data accessed over a gigabit switch.]

60. Topology II: Hybrid

  • 4 x 42-node clusters.
    • Server machines have SAN storage.
    • Client machines talk to servers over the LAN.
  • Not every machine needs SAN.
    • Clients do IO to multiple server machines.
    • Eliminates single server bottleneck.

[Diagram: SAN-attached server blades serving LAN clients through a switch.]

61. Future implementation

  • Expand cluster file system to the whole cluster.
    • Single copy of the data.
    • Allows users to manage their own data.
    • Use cluster file system for general scratch/work space.
    • Eliminate NFS.
  • Implementing Lustre.
    • Open source (version x is proprietary, version x-1 is open sourced).
    • Scales to 1000s of nodes.
    • Performs well; in pilots our network is the bottleneck.
    • Easy (ish) to add more network.

62. Lustre Config

[Diagram: OSTs, an MDS and an ADM node, linked by 2G, 4G and 10G network connections.]

63. The network is vital.

  • Cluster IO is very stressful for networks.
    • We can fill gigabit links from a single client.
  • Large amounts of gigabit networking.
    • Multiple gigabit trunks.
  • Non-blocking switches are critical.

64.

