CSC Site Update HP Nordic TIG April 2008 Janne Ignatius Marko Myllynen Dan Still



CSC at a Glance

Founded in 1970 as a technical support unit for Univac 1108

Reorganized as a company, CSC - Scientific Computing Ltd., in 1993

All shares transferred to the Ministry of Education of Finland in 1997

Operates on a non-profit principle

Facilities in Keilaniemi, Espoo, since March 2005


CSC is the national IT center for science, developing and providing services for universities, research institutes, and industry.


CSC is well known and appreciated in Finland as well as abroad as a pioneer, collaboration partner, and center of competence in the field of information technology for science.

CSC at a Glance

CSC’s Services







Louhi - Cray XT4 Supercomputer

1st phase installed 04/2007

1012 computing nodes, each having a 2.6 GHz AMD Opteron dual-core processor

High-bandwidth, low-latency interconnect (SeaStar2)

1-2 GB memory per core

Peak performance 10.6 teraflops

Final configuration (to be installed Q3/2008)

• Core count open, 1-2 GB memory per core

• Peak performance 70+ teraflops

Murska - HP CP4000 BL ProLiant Supercluster

Installed 04/2007, expanded 11/2007

544 compute nodes, each having two 2.6 GHz AMD Opteron dual-core processors

2176 compute cores

4x DDR InfiniBand interconnect

5 TB total memory: 256 nodes * 4 GB, 128 * 8 GB, 128 * 16 GB, 32 * 32 GB

100 TB SFS/Lustre file system

Peak performance 11.3 teraflops
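The Murska figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not CSC's own calculation: node counts, memory sizes, and clock speed are taken from the slide, while the value of 2 double-precision floating-point operations per core per cycle (`FLOPS_PER_CYCLE`) and the convention 1 TB = 1024 GB are assumptions.

```python
# Sanity check of the Murska figures quoted on the slide.
# Node counts, memory sizes and clock speed come from the slide;
# FLOPS_PER_CYCLE is an assumption, not stated in the source.

NODE_CLASSES = [(256, 4), (128, 8), (128, 16), (32, 32)]  # (nodes, GB per node)
SOCKETS_PER_NODE = 2   # "two ... processors" per node
CORES_PER_SOCKET = 2   # dual core
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2    # assumed double-precision flops per core per cycle


def murska_totals():
    """Return node count, core count, memory in GB and peak in teraflops."""
    nodes = sum(count for count, _ in NODE_CLASSES)
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    memory_gb = sum(count * gb for count, gb in NODE_CLASSES)
    # GHz * flops/cycle gives gigaflops per core; divide by 1000 for teraflops.
    peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0
    return {
        "nodes": nodes,
        "cores": cores,
        "memory_gb": memory_gb,
        "peak_tflops": peak_tflops,
    }


if __name__ == "__main__":
    for name, value in murska_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the sketch reproduces the slide: 544 nodes, 2176 cores, 5120 GB (5 TB) of memory, and a peak of about 11.3 teraflops.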

Murska - HP CP4000 BL ProLiant, cont.

RHEL 4 based HP XC 3.1 cluster operating system

SLURM/LSF

HP-MPI

PGI, PathScale, GNU, TotalView, ACML, …

HP Xtools, collectl, mpe2, …

Blade hardware working surprisingly well

Interconnect working nicely

Disk system also working OK after initial issues

• MSA20 disk array failure recovery suboptimal

• SFS quota still limited to 4 TB

System constantly in heavy use

Murska - HP CP4000 BL Availability

Three unexpected breaks after Nov 2007 upgrades

• 29.1.2008: SFS hang, fixed with disk array reset

• 30.1.2008: Ethernet switch died (in the cabinet where several power supplies had died a few days earlier)

• 12.3.2008: SFS hang, fixed with disk array reset

System availability since Nov 2007 95%-100%

System usage since Nov 2007 30%-100%

Sepeli - HP ProLiant DL145 Cluster

Installed 2005

128 (earlier 256) compute nodes

512 cores and 2 TB memory

• 4x DDR InfiniBand / GigE interconnect

4 TB PVFS2 / NFS disk system

Peak performance 3.1 teraflops

Earlier part of national M-grid, now being dedicated to LHC use (particle collision data analysis)

Sepeli - HP ProLiant DL145 Cluster, cont.

RHEL 4 based Rocks 3.1 cluster operating system

SGE

Overall system lifespan price/performance quite satisfactory

InfiniBand hardware very stable

Tight Grid Engine integration with multiple MPI flavors is labor-intensive

DL145 iLO initially unreliable, improved over time

Material Sciences National Grid Infrastructure (M-grid)

A joint project of CSC, 7 Finnish universities, and Helsinki Institute of Physics, funded by the Finnish Academy for the National Research Infrastructure Program in the Grid area

Aims to build a homogeneous PC-cluster environment with a theoretical peak of approx. 3 teraflops across 350 nodes

Environment

• Hardware: Provided by HP. Dual AMD Opteron 1.8-2.2 GHz nodes with 2-8 GB memory, 1-2 TB shared storage, separate 2xGE (communications and NFS), remote administration

• OS: NPACI Rocks Cluster Distribution / 64 bit, based on Red Hat Enterprise Linux 3, 4

• Grid middleware: NorduGrid ARC Grid MW compiled with Globus 3.2.1 libraries, Sun Grid Engine as LRMS

• Centrally managed configuration with Cfengine
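The "approx. 3 teraflops" target for 350 nodes can be roughly bounded from the hardware description. This is only a sketch under stated assumptions: the node count, the dual-processor layout, and the 1.8-2.2 GHz clock range come from the slide, while one core per processor (`CORES_PER_CPU`) and 2 double-precision floating-point operations per core per cycle (`FLOPS_PER_CYCLE`) are assumptions not given in the source.

```python
# Rough bounds on the M-grid theoretical peak.
# From the slide: 350 nodes, dual Opteron, 1.8-2.2 GHz.
# Assumed (not in the source): single-core CPUs, 2 flops per core per cycle.

NODES = 350
CPUS_PER_NODE = 2      # "Dual AMD Opteron"
CORES_PER_CPU = 1      # assumption
FLOPS_PER_CYCLE = 2    # assumption
CLOCK_RANGE_GHZ = (1.8, 2.2)


def peak_tflops(clock_ghz):
    """Theoretical peak in teraflops if every node ran at clock_ghz."""
    cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
    return cores * clock_ghz * FLOPS_PER_CYCLE / 1000.0


def mgrid_peak_range():
    """Return (lower, upper) peak estimates for the quoted clock range."""
    low_clock, high_clock = CLOCK_RANGE_GHZ
    return peak_tflops(low_clock), peak_tflops(high_clock)


if __name__ == "__main__":
    low, high = mgrid_peak_range()
    print(f"Estimated peak: {low:.2f} to {high:.2f} teraflops")
```

With these assumptions the estimate falls between roughly 2.5 and 3.1 teraflops, which is consistent with the approximate figure on the slide.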

CSC

• Administration tasks

• Maintains operating system, LRMS, Grid middleware, certain libraries

• Separate small test cluster for testing new software releases

• Tools for system monitoring, integrity checking, etc.


Some international activities




Thank You!
