
Page 1

© 2013 IBM Corporation

INCOIS – HPC Training

Page 2

Agenda

Technical Infrastructure
– Cluster layout
– Compute • Sandy Bridge
– Management • xCAT (provisioning tool)
– Interconnect • FDR Mellanox
– Storage • GPFS
– Software stack

Intel Cluster Studio
– Compiler • Optimization methodology
– MPI/OpenMP • Features and optimizations
– Math Library (MKL)
– Debugging • Parallel application debugging
– Profiling/Tracing • VTune trace analyzer

Job Scheduler / Cluster Manager
– LSF
 • Basic architecture
 • Current configuration
 • Scheduling policies
 • Troubleshooting
 • Profiling
 • Queues and priorities
 • Fault tolerance
 • Submission and management
– Hands on

Page 3

Technical Infrastructure

Page 4

High performance computing stack – components:
– Applications
– Tools and compilers
– Scientific libraries and Message Passing Interface
– Job scheduler and cluster administration
– Parallel file system
– Operating system
– Hardware

Page 5

Cluster Overview

800 TeraFlops High Performance Computing System: an IBM iDataPlex cluster with 38,144 Intel Sandy Bridge processor cores and 149 TB of memory.

The login and compute nodes are populated with two Intel Sandy Bridge 8-core processors each.

FDR10 InfiniBand interconnect in a fat-tree configuration serves as the high-speed network for MPI messages and I/O traffic.

GPFS provides the high-performance parallel file system; it is a stable and highly reliable choice for HPC clusters.

Each compute node has two 8-core processors (16 cores), runs its own Red Hat Enterprise Linux OS, and shares 64 GB of memory among the cores.

The cluster is intended to be used through batch-scheduled jobs.

All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission through the job scheduler (a sample submission is sketched below).
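
A minimal LSF submission sketch (the queue name, core count, and script name are assumptions, not this cluster's configuration):

Submit a 32-core job to a queue and check its status:
# bsub -J wrf_run -q normal -n 32 -o wrf.%J.out -e wrf.%J.err ./run_wrf.sh
# bjobs
# bqueues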

Page 6

Page 7

Page 8

Page 9

IBM System x iDataPlex Compute Building Block

72 x IBM System x iDataPlex dx360 M4 servers
– 2 x E5-2670 SandyBridge-EP, 2.6 GHz / 1600 MHz, 20 MB cache, 8-core
– 8 x 8 GB DDR3-1600 DIMMs (4 GB/core), total 64 GB/node
– Dual-port InfiniBand FDR14 mezzanine card

4 x Mellanox 36-port managed FDR14 IB switches
– 4 leaf IB switches
– 18 compute nodes connected to each leaf switch
– 18 uplinks from every leaf switch connect to the IB main switches

Management network
– 2 x BNT RackSwitch G8052F
– 4 x 1 Gb connections from each switch act as uplinks for uninterrupted management traffic

IBM System x iDataPlex rack with RDHX (water cooling)

Performance
– 2.60 GHz x 8 Flops/cycle (AVX) = 20.8 GFlops/core
– 16 cores x 20.8 GFlops/core = 332.8 GFlops/node
– 72 nodes x 332.8 GFlops/node = 23.96 TFlops/rack

Page 10

Compute

Page 11

IBM System x iDataPlex dx360 M4 Compute Node

iDataPlex rack server
1U node density: 84 nodes / 84U rack
Supports SSI planars (EP & EN)
Shared power – Common Form Factor (CFF)
Shared cooling – 80 mm fans
HPC nodes include 2 x 1 GbE onboard, plus a 10 GbE or 40G/QSFP IB mezzanine card option

Page 12

Intel SandyBridge microprocessor

New architecture ("tock" cycle) brings new features:
– Up to 8 cores per socket
– AVX vector units (double the peak FP performance)
– Larger and faster caches
– Improved TLB (Translation Lookaside Buffer)
– Higher memory bandwidth per core
– Enhanced Turbo mode
– Enhanced Hyper-Threading mode
– …

[Figure: SandyBridge-EP model]

Page 13

SandyBridge-EP microprocessor

In addition, Sandy Bridge introduces support for AVX (Advanced Vector Extensions) within an updated execution stack, enabling 256-bit floating-point (FP) operations to be decoded and executed as a single micro-operation (uOp).

The effect of this is a doubling in peak FP capability, sustaining 8 double precision FLOPs/cycle.
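
To benefit from this, applications must be recompiled with AVX code generation for the Sandy Bridge target; a minimal sketch with the Intel compilers (the source file names are illustrative):

# icc -O3 -xAVX -o mycode mycode.c
# ifort -O3 -xAVX -o mysim mysim.f90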

Page 14

SandyBridge-EP microprocessor

The Sandy Bridge processor integrates a high-performance, bidirectional ring architecture interconnecting
– CPU cores, Last Level Cache (LLC, or L3), PCIe, QPI, and the memory controller
– able to return 32 bytes of data on each cycle

Each physical LLC segment is loosely associated with a corresponding core
– but the cache is also shared among all cores as a logical unit

The ring and LLC are clocked with the CPU core, so cache and memory
– latencies have dropped compared to the previous-generation architecture
– bandwidths are significantly improved

Page 15

Turbo Boost

Turbo Boost allows dynamically increasing the CPU clock speed on demand ("dynamic overclocking")
– Frequency will increase in increments of 100 MHz
 • when the processor has not reached its thermal and electrical limits
 • when the user's workload demands additional performance
 • until a thermal or power limit is reached, or until the maximum speed for the number of active cores is reached

Important note:
– On 4-socket systems (like the x3750 M4), the 2.4 GHz CPU will only achieve a 2.8 GHz Turbo upside on a 4S-EP (this is intentionally limited by Intel)
– This is lower than the Turbo upside for an equivalent 2-socket EP processor (which would achieve 3.1 GHz)

Page 16

Storage

Page 17

GPFS Storage server

IBM System x GPFS Storage Server: bringing HPC technology to the mainstream
• Better, sustained performance
– Industry-leading throughput using efficient de-clustered RAID techniques
• Better value
– Leverages System x servers and commercial JBODs
• Better data security
– From the disk platter to the client
– Enhanced RAID protection technology
• Affordably scalable
– Start small and affordably
– Scale via incremental additions
– Add capacity AND bandwidth

Page 18

• 3-year warranty
– Manage and budget costs
• IT-facility friendly
– Industry-standard 42U 19-inch rack mounts
– No special height requirements
– Client racks are OK!
• And all the data management / lifecycle capabilities of GPFS – built in!

Page 19

General Parallel File System (GPFS)

Page 20

Parallel Filesystem

GPFS: a file system for high-performance computing, implemented as a shared-disk, parallel file system for AIX and Linux clusters.

Software features: snapshots, replication, and multi-site connectivity are included in the GPFS license. There are no add-on license keys besides client and server; you get all of the features up front.

Number of files:
• 2 billion per file system
• 256 file systems
• Max file system size: 2^99 bytes
• Max file size = file system size

Disk I/O:
• AIX: 134 GB/sec
• Linux: 66 GB/sec

Number of nodes:
• 1 to 8192

Page 21

Architecture Stats

• GPFS 2.3 or later, architectural file system size limit
– 2^99 bytes
– Current tested limit ~2 PB
• Total number of files per file system
– 4,000,000,000 (four billion) for a GPFS 3.4-created file system; two billion on 3.2 or earlier GPFS versions
• Total number of nodes: 8,192
– A node is in a cluster if:
 • the node shows up in mmlscluster, or
 • the node is in a remote cluster and is mounting a file system in the local cluster
• Maximum number of mounted file systems
– 256
– Before GPFS 3.2, 64 file systems
• Maximum disk size
– Limited by the disk device driver and OS

Page 22

What GPFS provides

– GPFS provides a highly scalable file management infrastructure
– Optimizes storage utilization by centralizing management
– Provides a flexible, scalable alternative to a growing number of NAS appliances
– Highly available grid computing infrastructure
– Scalable information lifecycle tools to manage growing data volumes

Page 23

Architecture: Diagram

[Figure: NSD clients on the LAN access GPFS through NSD servers attached to the SAN]
– Seamless capacity and performance scaling
– Centrally deployed, managed, backed up and grown
– Massive namespace support

Page 24

Internal design

Page 25

Kernel Extension

• The GPFS kernel extension provides:
– Interfaces to the operating system vnode and VFS.
• Flow:
– The application makes file system calls to the OS.
– The OS presents the calls to the GPFS kernel extension.
 • GPFS appears to the application as just another file system.
– The GPFS kernel extension will either satisfy requests using information already available or send a message to the GPFS daemon to complete the request.
• The GPFS daemon
– performs all I/O and buffer management, including read-ahead for sequential reads and write-behind operations.
– All I/O is protected by token management to ensure file system consistency.
– It is multi-threaded, with some threads dedicated to specific functions; examples include space allocation, directory management (insert and removal), and quotas.
– Disk I/O is initiated on threads of the daemon.

Page 26

Node Roles

Manager nodes
• Global lock manager
• File system configuration: recovery, adding disks, …
• Disk space allocation manager
• Quota manager
• File metadata manager – maintains file metadata integrity

File system nodes
• Run user programs, read/write data to/from storage nodes
• Implement the virtual file system interface
• Cooperate with manager nodes to perform metadata operations

Storage nodes
• Implement the block I/O interface
• Shared access to file system and manager nodes
• Interact with manager nodes for recovery

Page 27

Administration: Node Deletion

Use mmdelnode to remove a node from a cluster (a worked sketch follows below):
 mmdelnode {-a | -N {Node[,Node...] | NodeFile | NodeClass}}
– Cannot be the primary or secondary GPFS cluster configuration node (unless removing the entire cluster)
– Cannot be an NSD server (unless removing the entire cluster)
– Can be run from any node remaining in the GPFS cluster
– The GPFS daemon must be stopped on the node being deleted

Deleting some nodes:
– Avoid unexpected consequences due to quorum loss

Deleting a cluster using the mmdelnode command: mmdelnode -a
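
A minimal sketch of removing a single node (the node name is illustrative, not taken from this cluster):

Stop the GPFS daemon on the node, delete it from the cluster, then verify membership:
# mmshutdown -N node072
# mmdelnode -N node072
# mmlscluster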

Page 28

Adding disks

Disks are added to a file system using the mmadddisk command:
 mmadddisk Device {"DiskDesc[;DiskDesc...]" | -F DescFile} [-a] [-r]
   [-v {yes|no}] [-N {Node[,Node...] | NodeFile | NodeClass}]

Optionally, rebalance the data (-r): recommended, but it can cause a performance impact while rebalancing.

The file system can be mounted or unmounted.

The NSD must be created before it can be added using mmadddisk (a worked sketch follows below):
– Create a new disk (mmcrnsd)
– Reuse an available disk (mmlsnsd -F)

# mmlsnsd -F

File system   Disk name   Primary node          Backup node
------------------------------------------------------------
(free disk)   gpfs3nsd    (directly attached)
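
A minimal sketch of creating an NSD and adding it to a file system; the device, server, NSD, and file system names are assumptions, and the colon-separated descriptor follows the GPFS 3.x style (other releases use stanza files):

Write a disk descriptor, create the NSD, then add it with rebalancing:
# cat /tmp/disk.desc
/dev/sdc:nsdserver1:nsdserver2:dataAndMetadata:1:gpfs4nsd
# mmcrnsd -F /tmp/disk.desc
# mmadddisk fs1 -F /tmp/disk.desc -r
mmcrnsd rewrites the descriptor file, so the same file can then be passed to mmadddisk.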

Page 29

Changing disk attributes

Managing disks within a file system
– Disk errors
– Performance evaluation
– Planning for migration

Modify the disk state using the mmchdisk command:
# mmchdisk
Usage:
 mmchdisk Device {resume | start} -a
   [-N {Node[,Node...] | NodeFile | NodeClass}]
 or
 mmchdisk Device {suspend | resume | stop | start | change}
   {-d "DiskDesc[;DiskDesc...]" | -F DescFile}
   [-N {Node[,Node...] | NodeFile | NodeClass}]

Example
– Restart a disk after fixing a storage failure (see the sketch below)
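
A minimal sketch of restarting a disk after the storage failure has been repaired (the file system and NSD names are illustrative):

Bring the disk back online and confirm its state:
# mmchdisk fs1 start -d "gpfs3nsd"
# mmlsdisk fs1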

Page 30

Replacing Disks

A disk can be replaced by a new disk.
– Need a free NSD as large as or larger than the original
– Cannot replace a stopped disk
– Cannot replace a disk if it is the only disk in the file system
– No need to unmount the file system
– No need to re-stripe
– The file system can be mounted or unmounted

The disk is replaced using the mmrpldisk command.
Usage:
 mmrpldisk Device DiskName {DiskDesc | -F DescFile}
   [-v {yes | no}]
   [-N {Node[,Node...] | NodeFile | NodeClass}]
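
A minimal sketch of swapping a disk for a free NSD (names are illustrative; depending on the release, the replacement may need a full disk descriptor rather than just the NSD name):

Replace gpfs2nsd with the free NSD gpfs5nsd in file system fs1, then check the disks:
# mmrpldisk fs1 gpfs2nsd gpfs5nsd
# mmlsdisk fs1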

Page 31

Deleting a Disk

Disks are removed from a file system using the mmdeldisk command.
– Migrates data to the remaining disks in the file system
– Removes the disk from the file system descriptor
– Can be run from any node in the cluster

Usage:
 mmdeldisk Device {"DiskName[;DiskName...]" | -F DiskFile} [-a] [-c] [-r]
   [-N {Node[,Node...] | NodeFile | NodeClass}]

Usage scenarios:
– If the disk is not failing and is still readable by GPFS (see the sketch below):
 • Suspend the disk (mmchdisk disk_name suspend).
 • Re-stripe to rebalance all data onto other disks (mmrestripefs -b).
 • Delete the disk (mmdeldisk).
– If the disk is permanently damaged and the file system is replicated:
 • Suspend and stop the disk (mmchdisk disk_name suspend; mmchdisk disk_name stop)
 • Re-stripe and restore replication for the file system, if possible (mmrestripefs -r)
 • Delete the disk from the file system (mmdeldisk)
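
A minimal sketch of the healthy-disk removal path (the file system and NSD names are illustrative):

Suspend the disk, rebalance data off it, then delete it:
# mmchdisk fs1 suspend -d "gpfs3nsd"
# mmrestripefs fs1 -b
# mmdeldisk fs1 "gpfs3nsd"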

Page 32

File system

mmchfs command
– Usage:
 mmchfs Device [-A {yes | no | automount}] [-D {posix | nfs4}] [-E {yes | no}]
   [-F MaxNumInodes[:NumInodesToPreallocate]]
   [-k {posix | nfs4 | all}] [-K {no | whenpossible | always}]
   [-m DefaultMetadataReplicas] [-o MountOptions]
   [-Q {yes | no}] [-r DefaultDataReplicas] [-S {yes | no}]
   [-T Mountpoint] [-t DriveLetter] [-V {full | compat}] [-z {yes | no}]
 or
 mmchfs Device -W NewDeviceName

Cannot modify
– Block size
– Log file (-L LogFileSize in mmcrfs)
– MaxDataReplicas and MaxMetadataReplicas
– numnodes
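
A minimal sketch of a typical change (the file system name is illustrative): enable quota enforcement, then verify the attribute.
# mmchfs fs1 -Q yes
# mmlsfs fs1 -Q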

Page 33

Setting up user quota

Quotas are set using the mmedquota command.

Issue mmedquota to explicitly set quotas for users, groups, or filesets:
 mmedquota {-u [-p ProtoUser] User... |
   -g [-p ProtoGroup] Group... |
   -j [-p ProtoFileset] Fileset... |
   -d {-u User... | -g Group... | -j Fileset} |
   -t {-u | -g | -j}}
– Confirm using the mmrepquota command (see the sketch below).

Example: edit the quota for user user1
# mmedquota -u user1
*** Edit quota limits for USR tests
NOTE: block limits will be rounded up to the next multiple of the block size.
block units may be: K, M, or G.
fs1: blocks in use: 0K, limits (soft = 0K, hard = 0K)
 inodes in use: 0, limits (soft = 0, hard = 0)
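
A minimal sketch of verifying the limits just set (the file system and user names are illustrative):

Report quotas for all users of the file system, or query a single user:
# mmrepquota -u fs1
# mmlsquota -u user1 fs1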

Page 34

Cluster management

Page 35

What is xCAT?

Extreme Cluster (Cloud) Administration Toolkit
– Open-source Linux/AIX/Windows scale-out cluster management solution

Design principles
– Build upon the work of others
 • Leverage best practices
– Scripts only (no compiled code)
 • Portable
 • Source
– Vox populi – the voice of the people
 • Driven by community requirements
 • Do not assume anything

Page 36

What does xCAT do?

Remote hardware control
– Power, reset, vitals, inventory, event logs, SNMP alert processing
– xCAT can even tell you which light path LEDs are lit up, remotely
Remote console management
– Serial console, SOL, logging / video console (no logging)
Remote destiny control
– Local/SAN boot, network boot, iSCSI boot
Remote automated unattended network installation
– Auto-discovery
 • MAC address collection
 • Service processor programming
 • Remote flashing
– Kickstart, AutoYaST, imaging, stateless/diskless, iSCSI
Scales! Think 100,000 nodes.
xCAT will make you lazy – no need to walk to the datacenter again.

Page 37

Functionality

Remote Hardware Control

– Power, reset, vitals, inventory, event logs, SNMP alert processing

Remote Console Management

– Serial console, SOL, logging

Remote Destiny Control

– Local boot, network boot, iSCSI boot

Parallel Cluster control

– parallel shell, parallel rsync, parallel secure copy, parallel ping

Remote Automated Unattended Network Installation

– Auto-discovery

• MAC address collection

• Service processor programming

– Remote flashing

– Kickstart, Autoyast, imaging, stateless/diskless

Easy to use, and it scales! Think 100,000 nodes
– xCAT will make you lazy – no need to walk to the datacenter again (a few representative commands are sketched below)
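
A minimal sketch of everyday xCAT operations (the node range and osimage name are assumptions, not this cluster's configuration):

Check power state, point the nodes at an OS install image, and reboot them:
# rpower node001-node004 stat
# nodeset node001-node004 osimage=rhels6.3-x86_64-install-compute
# rpower node001-node004 boot
Open a remote serial console on one node:
# rcons node001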

Page 38

Architecture

A single xCAT Management Node (MN) serves N nodes.
– A single-node DHCP/TFTP/HTTP/NFS server.
– Scales to ~128 nodes.
 • If staggered boot is used, this can scale to 1024 nodes (tested)

Page 39

Scale Infrastructure

A single xCAT management node with multiple service nodes provides boot services to increase scaling.

Can scale to 1000s and 10000s of nodes.

xCAT already provides this support for large diskful clusters, and it can be applied to stateless clusters as well.

The number of nodes and the network infrastructure determine the number of DHCP/TFTP/HTTP servers required for a parallel reboot with no DHCP/TFTP/HTTP timeouts.

The number of DHCP servers does not need to equal the number of TFTP or HTTP servers. TFTP servers NFS-mount the /tftpboot and image directories read-only from the management node to provide a consistent set of kernel, initrd, and file system images.

[Figure: the management node feeds multiple service nodes, each running DHCP/TFTP/HTTP/NFS (hybrid) and serving its own group of compute nodes (node001 ... nodennn, nodennn+1 ... nodennn+m)]

Page 40

Tables and Database

xCAT stores all information about the nodes and subsystems it manages in a database.
– The xCAT default database is located in /etc/xcat as SQLite tables. xCAT can be instructed to store the tables in MySQL, PostgreSQL, or DB2 as well.

For most installations you won't need to fill even half of the tables!
– And for the tables that you do need, in most cases you'll only need to put one line in the table!

There are a lot of tables, but only some are common to Linux and AIX; some are only for AIX, some are just for monitoring, and some are for advanced functions (virtual machines, iSCSI settings), …

xCAT comes with a rich set of functions for manipulating tables (see the sketch below).
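
A minimal sketch of inspecting and editing xCAT tables and object definitions (the node and group names are illustrative):

Dump a table, edit another interactively, and change a node definition through the object interface:
# tabdump nodelist
# tabedit noderes
# chdef -t node -o node001 groups=compute,all
# lsdef node001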

Page 41

Page 42

Provisioning methods

[Figure: xCAT provisioning methods, showing how the OS image reaches each node (local disk, remote disk, or memory)]
– Stateful – Diskful: the OS installer writes to local disk (HD or flash)
– Stateful – Disk-Elsewhere: the OS is installed to SAN or iSCSI storage
– Stateless – Disk Optional: an image is pushed to memory (RAM, CRAM, or NFS root)
– Statelite

Page 43

Management & Monitoring

Page 44

Job Scheduler / Intel Cluster Suite