A Database-Centric Approach to System Management
The Blue Gene Supercomputer

Tom Budnik
Mark Megerian

August 2008

© 2008 IBM Corporation
http://w3.ibm.com/ibm/presentations

Database-Centric System Management

Why use a database for Blue Gene?

Need a software representation of the Blue Gene hardware
  A machine of such large scale requires a persistent means of storing errors (RAS events), job history, block definitions, environmental readings, etc.
  DB2 is the central repository of ALL system information
  Allows control system components to get hardware information and topology from the database, which is always kept current
Blue Gene Navigator pulls the majority of the data it displays from the database
  All current jobs, as well as all completed jobs, are stored
  Admins can see a history of every job that has ever been run on the machine
  We record start time and end time, as well as the number of nodes used, and Navigator uses this information to compute machine utilization
  All service actions, and replaced hardware, are tracked in the database
DB2 is used as a method of communication between components
  Setting values in the database can trigger actions in other components
  Can simplify the design by having policy stored in the database itself via procedures, triggers, and constraints instead of in the code
  Enforces consistency across components and reduces bugs
DB2 and the Control System run on the “Service Node” machine (pSeries running Linux), which controls the Blue Gene nodes
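The utilization figure mentioned above can be derived from nothing more than the recorded start time, end time, and node count of each job. The following is a minimal sketch of that arithmetic, not Navigator's actual implementation; the function name, the sample jobs, and the 1024-node machine size are assumptions made for illustration.

```python
from datetime import datetime


def machine_utilization(jobs, total_nodes, window_start, window_end):
    """Return the fraction of node-seconds in the window occupied by jobs.

    `jobs` is an iterable of (start, end, nodes) tuples, mirroring the
    start time, end time, and node count kept in the job history.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0 or total_nodes <= 0:
        raise ValueError("window and machine size must be positive")
    used_node_seconds = 0.0
    for start, end, nodes in jobs:
        # Clip each job to the reporting window before counting it.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            used_node_seconds += (clipped_end - clipped_start).total_seconds() * nodes
    return used_node_seconds / (window_seconds * total_nodes)


# Hypothetical job history for one day on a hypothetical 1024-node machine.
jobs = [
    (datetime(2008, 8, 1, 0, 0), datetime(2008, 8, 1, 12, 0), 512),
    (datetime(2008, 8, 1, 6, 0), datetime(2008, 8, 1, 18, 0), 256),
]
utilization = machine_utilization(
    jobs, 1024, datetime(2008, 8, 1), datetime(2008, 8, 2)
)
print(f"utilization: {utilization:.1%}")
```

Clipping jobs to the window keeps long-running jobs from being counted outside the reporting period.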

Database-Centric System Management - Benefits

DB2 provides the storage of all data (except logs). This provides a well-known set of interfaces for:
  Querying data using existing tools or SQL
  Building web interfaces and browser-based tools using JSF, PHP, Java, CLI, and many other established technologies
  Standard classes so all code can easily interact with the database
System administrators can learn DB2 from books and classes
New team members can come up to speed quickly
Customers can write their own tools; there are no hidden or closed data structures
Functions such as backup and recovery, performance settings, and security are handled by DB2
DB2 is a robust, commercial database, able to handle large multi-user apps

Basic SQL Concepts

Schema
  The collection of objects such as tables, views, indexes, and triggers that define the database
  Blue Gene uses BGPSYSDB
Table (most common database object)
  A table is a collection of rows of data, organized into columns
  The table definition (CREATE TABLE) describes the columns and their names and data types (integer, float, character, timestamp, etc.)
  Once a table is created, you can insert, update, delete rows, and query the contents
  Tables can be joined to other tables, and sorted and nested, to create many useful and complex constructions of data

Example:

    CREATE TABLE TBGPBlockUsers
    (
        blockId   char(32) NOT NULL,
        username  char(32) NOT NULL,
        CONSTRAINT BGPBlkUsers_pk PRIMARY KEY (blockId, username),
        CONSTRAINT BGPBlkUsers_fk FOREIGN KEY (blockId)
            REFERENCES TBGPBlock(blockId) ON DELETE CASCADE
    );

    CREATE ALIAS BGPBlockUsers for TBGPBlockUsers;
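The effect of the ON DELETE CASCADE clause in this table can be tried without a DB2 installation. The sketch below uses Python's built-in sqlite3 module as a stand-in for DB2, so the dialect differs slightly (for instance, CREATE ALIAS is omitted). The one-column TBGPBlock parent table and the sample block and user names are assumptions; the real TBGPBlock has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical, simplified parent table.
    CREATE TABLE TBGPBlock (blockId char(32) NOT NULL PRIMARY KEY);

    CREATE TABLE TBGPBlockUsers (
        blockId  char(32) NOT NULL,
        username char(32) NOT NULL,
        CONSTRAINT BGPBlkUsers_pk PRIMARY KEY (blockId, username),
        CONSTRAINT BGPBlkUsers_fk FOREIGN KEY (blockId)
            REFERENCES TBGPBlock(blockId) ON DELETE CASCADE
    );

    INSERT INTO TBGPBlock VALUES ('BLOCK-A');
    INSERT INTO TBGPBlockUsers VALUES ('BLOCK-A', 'alice');
    INSERT INTO TBGPBlockUsers VALUES ('BLOCK-A', 'bob');
""")

count_sql = "SELECT COUNT(*) FROM TBGPBlockUsers"
users_before = conn.execute(count_sql).fetchone()[0]

# Deleting the block removes its user rows as well (the cascade).
conn.execute("DELETE FROM TBGPBlock WHERE blockId = 'BLOCK-A'")
users_after = conn.execute(count_sql).fetchone()[0]

# A user row for a block that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO TBGPBlockUsers VALUES ('NO-SUCH-BLOCK', 'carol')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(users_before, users_after, orphan_rejected)
```

Because the rule lives in the table definition, every component that deletes a block gets the same cleanup without writing any code for it.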

Basic SQL Concepts - continued

Views
  A view is a virtual representation of data; it stores a description of how to retrieve and map the data, but it stores no data itself
  Generally used to present the same data in different ways, and to act like a “virtual” table

Example:

    CREATE VIEW BGPMidplane as
        SELECT serialnumber, productid, machineserialnumber, status, ismaster,
               posinmachine as location
        FROM TBGPMidplane;
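A small experiment shows both properties of this view: the renamed column and the absence of stored data. The sketch runs the view definition against SQLite through Python's sqlite3 module, standing in for DB2. The untyped table columns and the sample row are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the real table (column types omitted).
    CREATE TABLE TBGPMidplane (
        serialnumber, productid, machineserialnumber,
        status, ismaster, posinmachine, seqid
    );

    CREATE VIEW BGPMidplane AS
        SELECT serialnumber, productid, machineserialnumber, status, ismaster,
               posinmachine AS location
        FROM TBGPMidplane;

    INSERT INTO TBGPMidplane VALUES ('SN1', 'PROD1', 'MSN1', 'A', 'T', 'R00-M0', 0);
""")

# The view exposes posinmachine under the friendlier name "location"
# and leaves seqid out entirely.
cursor = conn.execute("SELECT * FROM BGPMidplane")
view_columns = [description[0] for description in cursor.description]

status_sql = "SELECT status FROM BGPMidplane WHERE location = 'R00-M0'"
status_before = conn.execute(status_sql).fetchone()[0]

# The view stores nothing itself: a change to the table shows up immediately.
conn.execute("UPDATE TBGPMidplane SET status = 'E' WHERE posinmachine = 'R00-M0'")
status_after = conn.execute(status_sql).fetchone()[0]

print(view_columns, status_before, status_after)
```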

Index (a stored, sorted set of pointers to rows)
  Like a view, an index contains no actual “data”
  An index is built to sequence the rows using a certain set of columns that is frequently used for searching and sorting
  A full table scan through millions of rows for a particular value would take several minutes, where a lookup using an index over that column is often sub-second
  Indexes are kept current as the data changes, so a large number of indexes can impact update performance. There is a tradeoff between query performance and update performance, so only necessary and useful indexes should be created.

Example:

    CREATE INDEX EventLogJ on Tbgpeventlog (jobid, recid desc)
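To see an index change how a query is executed, one can compare the query plan before and after creating it. This sketch uses SQLite via Python's sqlite3 module rather than DB2, so the plan text and the optimizer differ from what DB2 would report. The three-column Tbgpeventlog layout, the integer recid key, and the generated rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, much-reduced event log table.
conn.execute(
    "CREATE TABLE Tbgpeventlog (recid INTEGER PRIMARY KEY, jobid INTEGER, message TEXT)"
)
conn.executemany(
    "INSERT INTO Tbgpeventlog (jobid, message) VALUES (?, ?)",
    [(i % 100, f"event {i}") for i in range(10000)],
)

query = "SELECT recid, message FROM Tbgpeventlog WHERE jobid = ? ORDER BY recid DESC"


def query_plan(sql, params):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(query, (42,))
conn.execute("CREATE INDEX EventLogJ ON Tbgpeventlog (jobid, recid DESC)")
plan_after = query_plan(query, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan is a scan of the whole table; afterwards it names EventLogJ as the access path for the jobid lookup.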

Basic SQL Concepts - continued

Triggers
  A trigger allows you to define an action to take place, generally when data is updated
  Triggers can be defined on an insert, update, or delete of rows in a table
  Triggers can fire “before” the action, and possibly modify the action taking place
  Triggers can also fire after the action
  Triggers can generate errors to block the action

Example:

    create trigger sc_history_i
        after insert on tbgpservicecard
        referencing new as n
        for each row mode db2sql
    begin atomic
        insert into tbgpservicecard_history
            (serialNumber, productId, midplanepos, status, vpd, action)
        values
            (n.serialNumber, n.productId, n.midplanepos, n.status, n.vpd, 'I');
    end @
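The history trigger can be exercised end to end with a small script. The sketch below re-expresses sc_history_i in SQLite's trigger syntax (NEW instead of a REFERENCING clause, no MODE DB2SQL) and runs it through Python's sqlite3 module, so it approximates the DB2 behaviour rather than reproducing it. The column types and the sample service card are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real tables.
    CREATE TABLE tbgpservicecard (
        serialNumber TEXT, productId TEXT, midplanepos TEXT, status TEXT, vpd BLOB
    );
    CREATE TABLE tbgpservicecard_history (
        serialNumber TEXT, productId TEXT, midplanepos TEXT, status TEXT, vpd BLOB,
        action TEXT
    );

    -- SQLite spelling of the sc_history_i trigger.
    CREATE TRIGGER sc_history_i
    AFTER INSERT ON tbgpservicecard
    FOR EACH ROW
    BEGIN
        INSERT INTO tbgpservicecard_history
            (serialNumber, productId, midplanepos, status, vpd, action)
        VALUES
            (NEW.serialNumber, NEW.productId, NEW.midplanepos, NEW.status, NEW.vpd, 'I');
    END;
""")

# The application inserts only into the main table...
conn.execute(
    "INSERT INTO tbgpservicecard VALUES (?, ?, ?, ?, ?)",
    ("SN0001", "PROD-A", "R00-M0", "A", None),
)

# ...and the database writes the history row on its own.
history = conn.execute(
    "SELECT serialNumber, midplanepos, action FROM tbgpservicecard_history"
).fetchall()
print(history)
```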

Basic SQL Concepts - continued

Constraint
  A constraint is a “rule” that is enforced by the database
  Check constraints give a list of valid values for a column
  Unique constraints enforce uniqueness on values in a column, or set of columns
  Referential integrity constraints enforce that values in a “child” table exist in the “parent” table

Example:

    CREATE TABLE TBGPMidplane
    (
        serialNumber         char(19),
        productId            char(16) NOT NULL,
        machineSerialNumber  char(19),
        posInMachine         char(6) NOT NULL,
        CONSTRAINT BGPMidPo_chk CHECK ( posInMachine LIKE 'R__-M_' ),
        status               char(1) NOT NULL WITH DEFAULT 'A',
        CONSTRAINT BGPMidSt_chk CHECK ( status IN ('A','M','E','S') ),
        isMaster             char(1) NOT NULL WITH DEFAULT 'T',
        CONSTRAINT BGPMidMs_chk CHECK ( isMaster IN ('T','F') ),
        vpd                  VARCHAR(4096) FOR BIT DATA,
        seqId                BIGINT NOT NULL WITH DEFAULT 0,
        CONSTRAINT BGPMidpplane_pk PRIMARY KEY (posInMachine),
        CONSTRAINT BGPMidMachineId_fk FOREIGN KEY (machineSerialNumber)
            REFERENCES TBGPMachine (serialNumber) ON DELETE RESTRICT,
        CONSTRAINT BGPMidplaneType_fk FOREIGN KEY (productId)
            REFERENCES TBGPProductType (productId)
    )
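The check and primary key constraints above can be watched rejecting bad rows. This is a reduced sketch on SQLite via Python's sqlite3 module, not the DB2 table itself: only two columns are kept, WITH DEFAULT becomes DEFAULT, and the foreign keys are left out. Note that SQLite's LIKE ignores ASCII case by default, which may not match DB2. The helper name and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TBGPMidplane (
        posInMachine char(6) NOT NULL,
        status       char(1) NOT NULL DEFAULT 'A',
        CONSTRAINT BGPMidPo_chk CHECK ( posInMachine LIKE 'R__-M_' ),
        CONSTRAINT BGPMidSt_chk CHECK ( status IN ('A','M','E','S') ),
        CONSTRAINT BGPMidpplane_pk PRIMARY KEY (posInMachine)
    )
""")


def try_insert(position, status="A"):
    """Attempt an insert and report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO TBGPMidplane (posInMachine, status) VALUES (?, ?)",
            (position, status),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised for CHECK, NOT NULL, and PRIMARY KEY violations alike.
        return False


results = {
    "valid": try_insert("R00-M0"),
    "bad_location": try_insert("rack0"),        # does not match 'R__-M_'
    "bad_status": try_insert("R00-M1", "X"),    # 'X' is not an allowed status
    "duplicate": try_insert("R00-M0"),          # location already present
}
print(results)
```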

DB2 Naming Guidelines for BG/P

Tables always start with TBGP, such as TBGPNodeCard or TBGPLinkCard
  Names are NOT case sensitive in SQL
For each table, there is a view that has the more user-friendly columns
  These are named without the T, such as BGPNodeCard
  In some cases, information is omitted from the view
If there is no need for any derived columns in the view, or omitted columns, then an alias is created
  e.g. BGPClockCard
The net effect is that almost all the time, using the “BGP” name will show what you want
If there is a history being kept, then _history is added to the end

BG/P Tables

TBGPBlock
TBGPBPBlockMap
TBGPSmallBlock
TBGPLinkBlockMap
TBGPProductType
TBGPMachine
TBGPMachineSubnet
TBGPMidplane
TBGPNodeCard
TBGPNode
TBGPServiceCard
TBGPLinkCard
TBGPClockCard
TBGPBulkPowerSupply
TBGPSwitch
TBGPCable
TBGPClockCable
TBGPLinkChip
TBGPICON
TBGPFanModule
TBGPJob
TBGPEthGateway
TBGPEGWMachineMap
TBGPPortBlockMap
TBGPBlockUsers
TBGPMidplaneSubnet
TBGPNodeSubnet
TBGPServiceAction
TBGPUserPrefs

TBGPReplacement_history

TBGPMachine_history

TBGPMidplane_history

TBGPNodeCard_history

TBGPNode_history

TBGPServiceCard_history

TBGPLinkCard_history

TBGPClockCard_history

TBGPLinkChip_history

TBGPIcon_history

TBGPFanModule_history

TBGPJob_history

TBGPServiceCardEnvironment

TBGPFanEnvironment

TBGPClockCardEnvironment

TBGPBULKPOWEREnvironment

TBGPNodeCardPOWEREnvironment

TBGPLinkCardPOWEREnvironment

TBGPSrvcCardPOWEREnvironment

TBGPLinkChipEnvironment

TBGPLinkCardEnvironment

TBGPNodeEnvironment

TBGPNodeCardEnvironment

TBGPEventLog

TBGPERRCodes

TBGPDiagRuns

TBGPDiagBlocks

TBGPDiagResults

TBGPDiagTests

BG/P Views

BGPMidplane
BGPMidplaneAll
BGPNodeCard
BGPNodeCardAll
BGPNode
BGPNodeAll
BGPServiceCard
BGPServiceCardAll
BGPLinkCard
BGPLinkCardAll
BGPClockCardAll
BGPBulkPowerSupplyAll
BGPLinkChip
BGPLinkChipAll
BGPFanModule
BGPFanModuleAll
BGPLink
BGPClockCardEnvironment
BGPDiagTests
BGPNodeCardCount
BGPLinkCardCount
BGPServiceCardCount
BGPNodeCount
BGPBasePartition
BGPBPBlockStatus
BGPSwitchLinks
BGPLinkBlockStatus
BGPSwitchPort
BGPPortBlockStatus
BGPBlockSize

BG/P DB2 Structure

DB2
  Configuration Database
  Operational Database
  Environmental Database
  RAS Database

Configuration database is the representation of all the hardware on the system
Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history
Environmental database keeps current values for all of the hardware components on the system, such as fan speeds, temperatures, voltages
RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex

BG/P DB2 Structure

Configuration Database

Configuration database is the representation of all the hardware on the system
  Machine, midplanes, service cards, link cards, link chips, node cards, processor cards, compute & I/O nodes, cables, clock cards, fan modules
Populated during initial system install and kept current during hardware service actions

BG/P DB2 Structure

Operational Database

Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history
  Blocks (partitions), jobs, job history, switch settings, link <-> block map, block users
Maintained by the Blue Gene control system running on the service node

BG/P DB2 Structure

Environmental Database

Environmental database keeps current values for all of the hardware components on the system, such as fan speeds, temperatures, voltages
  Fan Modules: desired and actual fan speed, voltages, temperatures
  Service Cards: ambient temp, voltages
  Node Cards: chip temps, temp limits, wiring faults
  Link Cards: power, status, temps
Hardware Monitor reads and stores information at customizable intervals
By default, BG/P purges the data every 3 months (mmcs_envs_purge_months=3). The db.properties configuration can be altered to store more or less data as required by the local environment.
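As an illustration of what a retention setting like mmcs_envs_purge_months implies, the sketch below deletes environmental readings older than the configured number of months. It is not the control system's purge logic: it assumes the setting means "keep this many months of data", runs on SQLite through Python's sqlite3 module, and uses a hypothetical three-column layout and sample readings for TBGPFanEnvironment.

```python
import calendar
import sqlite3
from datetime import datetime


def purge_cutoff(now, months):
    """Return the timestamp `months` calendar months before `now`."""
    month_index = now.year * 12 + (now.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    # Clamp the day so that, for example, 31 May minus 3 months is 28/29 Feb.
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


# Value as it would appear in db.properties.
properties = {"mmcs_envs_purge_months": "3"}
months = int(properties["mmcs_envs_purge_months"])

conn = sqlite3.connect(":memory:")
# Hypothetical column layout; timestamps are stored as sortable text.
conn.execute("CREATE TABLE TBGPFanEnvironment (location TEXT, time TEXT, rpms INTEGER)")
conn.executemany(
    "INSERT INTO TBGPFanEnvironment VALUES (?, ?, ?)",
    [
        ("FAN-1", "2008-03-01 00:00:00", 9000),
        ("FAN-1", "2008-05-14 09:30:00", 9100),
        ("FAN-1", "2008-06-01 00:00:00", 9050),
        ("FAN-1", "2008-08-10 18:45:00", 9200),
    ],
)

now = datetime(2008, 8, 15, 12, 0, 0)
cutoff = purge_cutoff(now, months)
deleted = conn.execute(
    "DELETE FROM TBGPFanEnvironment WHERE time < ?",
    (cutoff.isoformat(sep=" "),),
).rowcount
remaining = [
    row[0] for row in conn.execute("SELECT time FROM TBGPFanEnvironment ORDER BY time")
]
print(cutoff, deleted, remaining)
```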

BG/P DB2 Structure

RAS Database

RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex.

RAS events collected for bad hardware, missing cards, bad memory, bad cables

RAS events collected from compute complex while jobs are running, from kernel interrupts

RAS events generated by HW monitoring, for wiring faults, bad cards, fan speeds, over temps

RAS events generated by MMCS during link training, software errors, file system errors


Putting It All Together – Database Populate/Verification

Install team runs a Perl script (dbPopulate.pl) that populates the database with the expected configuration for the Blue Gene system.

The machine is powered on, and the InstallServiceAction program finds all hardware on the service network and verifies that the database matches the actual hardware configuration

This information is also modified and kept current during service actions (card replacement, recabling, etc.)

VerifyCables program confirms that the Torus network cabling is correct and VerifyIpAddresses confirms that the IO card IP addresses are correct

BGPMidplane

BGPCable

BGPServiceCard

BGPLinkCard

BGPNodeCard

BGPNode

Putting It All Together – Partitioning

Partitions are defined and the information is stored in DB2
Partitions can be defined using the Navigator block builder, console commands like genblock, the Bridge API pm_add_partition, or dynamically created by an external scheduler or mpirun

BGPBlock

BGPBpBlockMap

BGPLinkBlockMap

BGPPortBlockMap

BGPSmallBlock

Putting It All Together – Booting

Partition information from the database is used to boot the hardware and prepare it for running jobs
Database contains the kernel images for the IO nodes and Compute nodes
Database contains all the switch settings needed to program the link chips in order to create the Torus or Mesh
Database relates block information to specific hardware

BGPBlock

BGPBpBlockMap

BGPLinkBlockMap

BGPPortBlockMap

BGPSmallBlock

BGPMidplane

BGPNodeCard

BGPLinkCard

BGPNode

Putting It All Together – Booting

Database prevents overlap by doing arbitration of nodes, switches, and cables
  This allows multiple partitions to be booted, provided they do not share the same nodes, switch ports, or cables
  They can, however, share the same switch, which allows for pass-through
  Any attempt to boot a partition that overlaps with an already booted partition will fail with a message that the hardware is already in use
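The overlap rule can be stated compactly as a set intersection over the resources each block claims. The sketch below models it in memory purely for illustration; on Blue Gene the arbitration is done by the database, and the Block class, resource identifiers, and error text here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hardware claimed by one partition (hypothetical identifiers)."""
    name: str
    nodes: frozenset = frozenset()
    switch_ports: frozenset = frozenset()
    cables: frozenset = frozenset()


def shared_hardware(candidate, booted):
    """Return {resource kind: sorted shared items} for two blocks."""
    shared = {}
    for kind in ("nodes", "switch_ports", "cables"):
        common = getattr(candidate, kind) & getattr(booted, kind)
        if common:
            shared[kind] = sorted(common)
    return shared


def try_boot(candidate, booted_blocks):
    """Boot the block unless it overlaps with an already booted one."""
    for booted in booted_blocks:
        shared = shared_hardware(candidate, booted)
        if shared:
            raise RuntimeError(
                f"cannot boot {candidate.name}: hardware already in use "
                f"by {booted.name}: {shared}"
            )
    booted_blocks.append(candidate)


booted_blocks = []
block_a = Block("A", nodes=frozenset({"N0", "N1"}), switch_ports=frozenset({"S0:P0"}))
# Block C uses a different port on the same switch S0 (pass-through): allowed.
block_c = Block("C", nodes=frozenset({"N4", "N5"}), switch_ports=frozenset({"S0:P3"}))
# Block B claims node N1, which block A already holds: rejected.
block_b = Block("B", nodes=frozenset({"N1", "N2"}))

try_boot(block_a, booted_blocks)
try_boot(block_c, booted_blocks)
try:
    try_boot(block_b, booted_blocks)
    overlap_error = None
except RuntimeError as error:
    overlap_error = str(error)

booted_names = [block.name for block in booted_blocks]
print(booted_names, overlap_error)
```

Tracking switch ports rather than whole switches is what lets two blocks pass through the same switch without conflicting.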

BGPBlock

BGPBpBlockMap

BGPLinkBlockMap

BGPPortBlockMap

BGPSmallBlock

Putting It All Together – Job execution

Jobs are submitted to booted blocks
  Job submission is done via console, mpirun, submit, Bridge APIs, or external scheduler
  Submitter must either be block owner or block user
Control system polls hardware for RAS events during job execution and writes them into the RAS Event Log table. Each event is identified by the exact piece of hardware on which it occurred.
Control system polls for job completion and writes into the job history table the start and end time, number of nodes, exit status, etc.

BGPJob

BGPBlock

BGPBlockUsers

BGPEventlog

BGPJob_History

Navigator: Web Interface to DB

Browser interface to view DB2 data
Supports the viewing of RAS data, configuration data, diagnostics data, environmental data and operational data
Can be used to see how the hardware fits together
Can be used to find trouble areas, hardware anomalies
Eliminates the need to have SQL expertise to view basic Blue Gene information from the database


Blue Gene Navigator (Job History)


Blue Gene Navigator (Blocks)


Blue Gene Navigator (RAS Event Log)


Blue Gene Navigator (Block Visualizer)

Summary

DB2 is the central repository of all control system information
Database information is not just passively recorded; rather, it is an integrated communications method for the control system
It has greatly enhanced our product:
  Writing reports and queries for job utilization
  Querying RAS events and error trends
  Building end user tools
  Training new people, faster learning curve
  Can test the control system without any real hardware
  Lower cost of ownership for customers with better tools and accessible data
Stability and performance have been excellent, so it's been one thing we have not had to spend a lot of time tuning or debugging

BG/P Security

Tom Budnik
August 2008

Security Admin Tool (bguser.pl)

Roles and their capabilities:

user
  Submit jobs through mpirun (HPC)
  Submit jobs using submit command (HTC)
  Read access to some data (job/block status) on Service node, through Navigator
  Access to the Front End nodes
  Complete access to compilers/tool chain/etc. for development on the Front End nodes
developer
  Submit jobs through mpirun (HPC)
  Submit jobs using submit command (HTC)
  Read access to some data (job/block status) on Service node, through Navigator
  Controlled and limited access to Service node - requires user ID on SN
  No root access but has elevated privileges
  Complete access to compilers/tool chain/etc. for development on the Front End nodes
  Debugging with coreprocessor
admin
  Complete access to Blue Gene/P functions on the Service Node and Front End Node(s)
service (IBM)
  Access to required debug tools, system logs, and read access to database

The Security Administration tool assigns authority consistently to users who access the Blue Gene system. The tool authorizes users to various predetermined functions on the system by adding their profile to a selected group. The groups are: bgpuser, bgpdeveloper, bgpservice and bgpadmin
  The groups are created on the Service Node when the Blue Gene/P system is installed
  Can edit the program to define the groups differently
Existing Linux users are added to groups by running the bguser.pl utility:

    ./bguser.pl [options]

    options are:
        --user userName
        --role [user/developer/service/admin]


Service Node

Groups

db2rasdb

db2iadm1 (DB2 client)

db2fadm1 (DB2 client)

db2asgrp (DB2 client)

Users

bgpsysdb (DB2 server instance)

bgpdb2c (DB2 client instance)

bgpadmin

mpirun

bgpuser

bgpdeveloper

bgpadmin

bgpservice

Front End Node

Groups: bgpadmin, bgpservice, bgpdeveloper, bgpuser
Users: mpirun
Profile: /etc/profile.d/bgp.sh

Database Access on Service Node

The db.properties file contains the information required to access the database. It is typically located in /bgsys/local/etc

Keywords of interest:

    database_name=bgdb0
    database_user=bgpsysdb
    database_password=thepassword
    database_schema_name=bgpsysdb
    system=BGP
    min_pool_connections=1

Access to the Blue Gene DB from the FEN is discouraged for security reasons
  This is the reason for having a front-end and a back-end mpirun
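A tool that needs these settings can read them with a few lines of code. The following is a minimal sketch of a key=value parser, assuming one setting per line and treating lines that start with # or ! as comments; it is not the parser the control system uses, and the function name and sample text are illustrative.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict, skipping comments and blanks."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # ignore lines without an '=' sign
        properties[key.strip()] = value.strip()
    return properties


# Sample content using the keywords shown on the slide.
sample = """\
# database connection settings
database_name=bgdb0
database_user=bgpsysdb
database_password=thepassword
database_schema_name=bgpsysdb
system=BGP
min_pool_connections=1
"""

properties = parse_properties(sample)
database_name = properties["database_name"]
schema_name = properties["database_schema_name"]
min_pool_connections = int(properties["min_pool_connections"])

# Avoid echoing the password when reporting the configuration.
print(database_name, schema_name, min_pool_connections)
```

On a real system the text would come from the db.properties file itself, which should stay readable only by the accounts that need it because it holds the database password.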

Navigator Security Authority

There are three roles defined in Navigator: End User, Service, and Administrator. The mapping from Linux group to Navigator role is defined in the local Navigator configuration file (/bgsys/local/etc/navigator-config.xml). Note that the value can be a Linux group name or gid.

Administrator groups
  Users in these Linux groups have access to all the sections in Navigator. To have multiple groups, add a <administrator-group>groupname</administrator-group> for each group.
Service groups
  Users in these Linux groups have access to the service sections in Navigator. To have multiple groups, add a <service-group>groupname</service-group> for each group.
User groups
  Users in these Linux groups have access to only the end-user pages in the Navigator. These pages do not allow any updates to the Blue Gene system. To have multiple groups, add a <user-group>groupname</user-group> for each group.
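The group-to-role mapping can be sketched as a lookup over those three elements. The example below is an illustration rather than Navigator's own logic: the element names come from the slide, but the enclosing <navigator> root element, the sample groups and gid, and the rule that the most privileged matching role wins are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; only the three *-group element names are
# taken from the slide, the root element is invented for the example.
SAMPLE_CONFIG = """\
<navigator>
  <administrator-group>bgpadmin</administrator-group>
  <service-group>bgpservice</service-group>
  <user-group>bgpuser</user-group>
  <user-group>1042</user-group>
</navigator>
"""

# Ordered from most to least privileged (assumed precedence).
ROLE_TAGS = [
    ("Administrator", "administrator-group"),
    ("Service", "service-group"),
    ("End User", "user-group"),
]


def load_role_groups(xml_text):
    """Return {role: set of group names or gids, as strings}."""
    root = ET.fromstring(xml_text)
    return {
        role: {element.text.strip() for element in root.iter(tag) if element.text}
        for role, tag in ROLE_TAGS
    }


def navigator_role(user_groups, role_groups):
    """Return the most privileged role matching any of the user's groups."""
    identifiers = {str(group) for group in user_groups}
    for role, _tag in ROLE_TAGS:
        if identifiers & role_groups[role]:
            return role
    return None


role_groups = load_role_groups(SAMPLE_CONFIG)
role_for_admin = navigator_role(["bgpuser", "bgpadmin"], role_groups)
role_for_gid = navigator_role([1042], role_groups)   # matched by gid
role_for_stranger = navigator_role(["staff"], role_groups)
print(role_for_admin, role_for_gid, role_for_stranger)
```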

Navigator Security - continued

Navigator runs with the profile of the user that starts bgpmaster (typically bgpadmin)
  The user that starts bgpmaster needs read access to /etc/shadow to allow Navigator to perform authentication
Navigator setup script
  The DB2 libraries must be made available to Tomcat so that it can access the database, and the Java Authentication and Authorization Service (JAAS) plug-in for interfacing to Linux Pluggable Authentication Modules (PAM) must be available so that Tomcat can authenticate users. A script is provided to do the setup:

    $ cd /bgsys/drivers/ppcfloor/navigator
    $ ./setup-tomcat.sh

Set up anonymous access to end-user pages
  By default, the previous setup configures the Navigator to allow only authenticated users to access the Web interface. To enable anonymous access to end-user pages, you need to copy the web-withenduser.xml file into Navigator’s configuration:

    $ cd /bgsys/drivers/ppcfloor/navigator
    $ cp web-withenduser.xml catalina_base/webapps/BlueGeneNavigator/WEB-INF/web.xml

Navigator Security - continued

PAM authentication:
  Navigator uses the bluegene PAM stack to authenticate users. This is set up by creating a file /etc/pam.d/bluegene:

    #%PAM-1.0
    auth     include   common-auth
    auth     required  pam_nologin.so
    account  include   common-account
    password include   common-password
    session  include   common-session

SSL Configuration for Navigator Tomcat HTTP server instance:
  By default the Tomcat instance requires that Secure Sockets Layer (SSL) be configured, and the server listens on port 32072
  By default Navigator uses the /bgsys/local/etc/navigator.keystore keystore. This must be created when the system is configured, using the keytool command.

mpirun Security

End user IDs are not required to exist on the service node
Requires mpirun_be to run under bgpadmin
User’s uid and gid are collected from the front-end and propagated to CIOD
  Used for file system access permissions

mpirun Security - continued

Challenge protocol
  A challenge/response protocol is used to authenticate the mpirun client when connecting to the mpirun daemon on the service node
  Authentication uses the OpenSSL Secure Hash Algorithm (SHA-2) and a shared secret
  Protocol uses the shared secret to create a hash of a random number, thereby verifying that the mpirun front end node has access to the secret
  To protect the secret, it is stored in a configuration file accessible only by the bgpadmin user on the service node and by a special mpirun user on the front end node
  The front end mpirun binary has its setuid flag enabled so it can change its uid to match the mpirun user and read the config file to access the secret
mpirun.cfg file
  The mpirun.cfg file contains the shared secret used by the mpirun daemon
  File needs to exist on the SN and any FENs that submit jobs using mpirun
  Files need to match exactly for authentication to work
  The mpirun.cfg file is in the /bgsys/local/etc/ directory
  Only bgpadmin and the special mpirun user need to have access to the file
  mpirun.cfg file example:

    CHALLENGE_PASSWORD=BGPmpirunPasswd
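The general shape of such a challenge/response exchange is sketched below. The slides do not describe the message format or how the secret and the random number are combined, so this example substitutes HMAC with SHA-256 (a SHA-2 family hash) from Python's standard library; it is not the mpirun protocol itself, and every function name is hypothetical.

```python
import hashlib
import hmac
import os


def parse_secret(cfg_text):
    """Extract the shared secret from mpirun.cfg-style text."""
    for line in cfg_text.splitlines():
        key, separator, value = line.strip().partition("=")
        if separator and key == "CHALLENGE_PASSWORD":
            return value.encode()
    raise ValueError("CHALLENGE_PASSWORD not found")


def make_challenge():
    """Daemon side: generate a fresh random number for each connection."""
    return os.urandom(32)


def respond(secret, challenge):
    """Client side: hash the challenge with the shared secret.

    HMAC-SHA-256 is an assumed stand-in for the real hash construction.
    """
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()


def verify(secret, challenge, response):
    """Daemon side: recompute the hash and compare in constant time."""
    return hmac.compare_digest(respond(secret, challenge), response)


# Both nodes hold identical copies of the file, so their secrets match.
service_node_secret = parse_secret("CHALLENGE_PASSWORD=BGPmpirunPasswd")
front_end_secret = parse_secret("CHALLENGE_PASSWORD=BGPmpirunPasswd")
mismatched_secret = parse_secret("CHALLENGE_PASSWORD=SomethingElse")

challenge = make_challenge()
accepted = verify(service_node_secret, challenge, respond(front_end_secret, challenge))
rejected = verify(service_node_secret, challenge, respond(mismatched_secret, challenge))
print(accepted, rejected)
```

The secret never crosses the network; only the random challenge and the resulting hash do, which is why the two copies of the file must match exactly.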