Sonexion 900 Replacement Procedures 1.4

Contents

About Sonexion 900 Replacement Procedures 1.4
Replace a PCM (SSU, ESU, or MMU)
Replace a 3.5-inch Disk (4U24 SSU or ESU)
Replace a 4U24 SSU OSS Controller
Replace the 4U24 SSU Chassis
Replace a 4U24 ESU SAS EBOD Controller
Replace the 4U24 ESU Chassis
Replace a 2.5-inch Disk (2U24 MMU)
Replace a 2U24 MMU OSS Controller
Replace the 2U24 MMU Chassis
Replace a Management Switch

About Sonexion 900 Replacement Procedures 1.4

This publication describes a procedure to install a Sonexion 900 system running release 1.4 at an installation site. Procedures include unpacking the system at the site, cabling the system into the installation environment, powering on the system, running the configuration wizard on Cray Sonexion System Manager (CSSM), and running connectivity and performance tests. Additionally, this document contains a section on common installation problems (including workarounds) and troubleshooting information.

Procedures That Have Been Replaced by Video Versions

Some replacement procedures are being remade as video procedures viewed from a PC connected to the Sonexion system. In these cases no text-based procedure is included. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:

1. Cable a laptop to any available port on any LMN switch (located at the top of the rack).

2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, then log in to port 8080 of the IP address of the currently active MGMT node (MGMT0):

http://IP_address:8080

where IP_address is the IP address of the currently active (primary) MGMT node.

3. Enter the standard service credentials.
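The URL fallback in step 2 can be sketched as follows. This is only an illustration: the function names and the injectable `is_active` probe are hypothetical, and the IP address in the usage lines is a placeholder, not a value from this manual.

```python
from typing import Callable, Iterable, List, Optional

SERVICE_PORT = 8080  # port given in step 2


def console_urls(mgmt_ip: str) -> List[str]:
    """Candidate service console URLs, in the order step 2 tries them."""
    return [
        f"http://service:{SERVICE_PORT}",
        f"http://{mgmt_ip}:{SERVICE_PORT}",
    ]


def first_active(urls: Iterable[str],
                 is_active: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL the probe reports as active, or None."""
    for url in urls:
        if is_active(url):
            return url
    return None


# Usage: the probe is supplied by the caller (for example, a browser check).
# Here a stand-in probe pretends that only the IP-based URL responds.
urls = console_urls("10.0.0.1")  # placeholder address
chosen = first_active(urls, lambda url: "10.0.0.1" in url)
print(chosen)
```

Keeping the probe as a parameter avoids assuming anything about how reachability is checked on site.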

Scope and Audience

The procedures presented in this manual are to be carried out by technicians at sites where Sonexion systems are installed, employed by either Cray Inc. or the customer organization.

Typographic Conventions

Monospace            A monospace font indicates program code, reserved words or library functions, screen output, file names, path names, and other software constructs.

Monospaced Bold      A bold monospace font indicates commands that must be entered on a command line.

Oblique or Italics   An oblique or italics font indicates user-supplied values for options in the syntax definitions.

Proportional Bold    A proportional bold font indicates a user interface control, window name, or graphical user interface button or control.

Alt-Ctrl-f           Monospaced hyphenated text typically indicates a keyboard combination.

Record of Revision, publication HR5-6135

Publication Number    Date             Description
HR5-6135-0            December 2014    Original Printing, release 1.4

Replace a PCM (SSU, ESU, or MMU)

Prerequisites

Part number

FRU Where Used      Part Number    Description
2U24 MMU            100853500      580W PSU for 2U and 4U components
4U24 SSU or ESU     101228700      764W PSU for Sonexion 900 MMU

Time

30 minutes

Interrupt levels:

● Without USM firmware update: Live (does not require taking the Lustre file system offline)

● With USM firmware update: Interrupt (requires taking the Lustre file system offline, and rebooting the machine)

Tools

● Phillips screwdriver (medium)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

The following procedures can be used to remove and replace a power cooling module (PCM) in the MMU, SSU, or ESU.

PCMs in the Sonexion System

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

The PCMs supply power and cooling to the various enclosures; there are no separate power supplies or fan modules in the chassis.

The PCMs are located to the left and right of the Sonexion 900 OSS controllers (MMU and SSU) or SAS EBOD controllers (ESU), and are accessible from the rear of the rack. If a PCM fails, the system continues to operate normally on the remaining PCMs until the failed module is replaced.

NOTE: As shown in the parts listing above, there are two versions of the power supply: the MMU uses the 580W version, and the SSU and ESU use the 764W version. Make certain to install the correct version.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

WARNING:

● Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.

● If three or four PCMs are fitted, a 4U24 enclosure must be connected to at least two separate and independent power sources to make sure there is protection against electric shock caused by high leakage current (touch current). This is to make sure there is a reliable earth connection.

IMPORTANT:

● The time needed to replace a PCM is critical. The MMU and ESU enclosures operate in a non-redundant state while operating on one PCM. To maintain uninterrupted Sonexion operations, a faulty PCM must be replaced within 24 hours of its failure. A single PCM may be unable to provide sufficient cooling to the system over an extended period of time.

● Do not remove a faulty PCM unless a replacement of the correct type is available and ready for installation. The full complement of PCMs must always be installed.

● Operation of the enclosure with any modules missing will disrupt the airflow and the drives will not receive sufficient cooling. It is essential that all apertures are filled or covered before operating the enclosure system.

● Prior to removing the PCM, disconnect the power from the PCM, either by the mains switch (where present) or by physically removing the power source, in order to ensure that the system has warning of an imminent power shutdown. A faulty PCM must be replaced by a fully operational PCM within 24 hours.
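The 24-hour replacement window stated above is simple date arithmetic. The sketch below is illustrative only; the function names are hypothetical and the failure time in the usage lines is an invented example, not data from a real system.

```python
from datetime import datetime, timedelta

# Limit stated in the IMPORTANT notes: replace a faulty PCM within 24 hours.
REPLACEMENT_WINDOW = timedelta(hours=24)


def replacement_deadline(failed_at: datetime) -> datetime:
    """Latest time by which the faulty PCM should be replaced."""
    return failed_at + REPLACEMENT_WINDOW


def hours_remaining(failed_at: datetime, now: datetime) -> float:
    """Hours left in the window; negative once the window has passed."""
    return (replacement_deadline(failed_at) - now).total_seconds() / 3600.0


# Usage with an invented failure time.
failed_at = datetime(2014, 12, 1, 8, 0)
print(replacement_deadline(failed_at))
print(hours_remaining(failed_at, datetime(2014, 12, 1, 20, 0)))
```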

Remove PCM

1. If the location of the failed PCM is not known, look at the Operator Control Panel for the amber Module Fault LED (front panel). Then look for either an amber AC Fail LED and/or DC Fail LED, and a non-illuminated green PCM OK LED at the rear of the faulty PCM.

Figure 1. Operator Control Panel

Figure 2. Power Control Module

2. Turn off the power switch on the failed PCM.

3. Disconnect the power cord by removing the strain relief bale and unplugging the cord.

4. Release the module latch by grasping it between the thumb and forefinger and gently squeezing it.

Figure 3. Release Latch on 4U24 PCM

Figure 4. Release Latch on a 2U24 PCM

5. Using the latch as a handle, carefully remove the failed PCM from the enclosure.

WARNING: Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.

Figure 5. 4U24: Remove a 4U24 PCM

Figure 6. 2U24: Remove a 2U24 PCM

Install a PCM

1. Carefully inspect the replacement PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If the PCM is damaged, do not install it. Obtain another replacement PCM.

2. With the PCM handle in the open position, slide the PCM into the empty bay at the rear of the enclosure.

Figure 7. Install a 4U24 PCM

Figure 8. Install a 2U24 PCM

3. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. A click should be heard as the handle latch engages.

This action engages the camming mechanism on the side of the module and secures the PCM.

Figure 9. Seat the 4U24 PCM

4. Verify that the power switch on the replacement PCM is in the OFF position.

5. Connect the power cord to the replacement PCM, securing it with the strain relief bale.

a. Connect the power cord to the PCM.

b. Place the bale over and onto the power cord.

6. Turn on the power switch on the new PCM. Wait a few minutes before proceeding to the next step.

7. Verify that the AC Fail and DC Fail LEDs are extinguished and the PCM OK LED is illuminated green at the rear of the PCM, and that the Module Fault LED on the operator control panel of the 4U24 enclosure (front panel) is extinguished.
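The LED check in step 7 amounts to comparing observed states against a fixed set of expected states. The sketch below shows one way to record that comparison; the dictionary keys and the state strings ("off", "green", "amber") are hypothetical labels chosen for the example, and the observations are entered by hand.

```python
# Expected LED states after a successful PCM replacement (from step 7).
EXPECTED_LEDS = {
    "AC Fail": "off",
    "DC Fail": "off",
    "PCM OK": "green",
    "Module Fault": "off",
}


def led_problems(observed):
    """Return (led, expected, observed) tuples for every LED that differs."""
    problems = []
    for led, expected in EXPECTED_LEDS.items():
        seen = observed.get(led, "unknown")  # unreported LEDs count as problems
        if seen != expected:
            problems.append((led, expected, seen))
    return problems


# Usage: a PCM whose PCM OK LED has not yet come on.
print(led_problems({"AC Fail": "off", "DC Fail": "off",
                    "PCM OK": "off", "Module Fault": "off"}))
```

An empty list means every LED matches the state that step 7 calls for.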

Replace a 3.5-inch Disk (4U24 SSU or ESU)

Prerequisites

Part number

101232900 - 4TB 3.5" 7.2k RE FSAS

Time

1 hour

Interrupt level

Live (can be applied to a live system with no service interruption)

Tools

● Lock key (for disk)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

About this task

Use this procedure to remove and replace a failed disk drive in carrier (disk) in the 4U24 enclosure of a Scalable Storage Unit (SSU) or Expandable Storage Unit (ESU) component. This procedure applies to both disks used in the MDRAID array and drives marked as hot spares.

Subtasks:

● Remove and Install a 3.5-inch 4U24 Disk

● Verify 4U24 Hot Spare, Arrays, and Statuses

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

The disks are configured in MDRAID arrays, with one hot spare for local arrays. Each array is a set of drives configured to act as a single volume.

When a disk fails and its MDRAID array is degraded, the hot spare becomes active and the failed disk's data is immediately rebuilt on the spare disk. While the rebuild is underway, Sonexion operations continue without interruption. When the disk is replaced, the new disk is designated as the hot spare.

Failed Hot Spare

If a hot spare fails, it must be replaced with a new drive, but the remove / replace procedure is easier because no disk rebuild is necessary. Instructions to replace a failed hot spare are provided in the following procedure.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. If the location of the failed disk is not known, do the following:

a. Look for the amber Drive Fault LED on the 4U24 Operator Control Panel (OCP), to find the drawer containing the faulty disk.

Figure 10. 4U24 Operator Control Panel

Table 1. Operator Control Panel: 4U24 LED Indicators

LED              State           Description
System Power     Steady Green    AC power is applied to the enclosure.
Module Fault     Steady Amber    Indicates one of the following: Power Cooling Module fault; ESM fault; over or under temperature fault condition. Refer to individual module fault LEDs.
Logical Fault    Steady Amber    Indicates failure of a disk drive.

b. When the faulty drawer is located, look for the amber Drive Fault LED on the failed disk.

The amber LED indicates a problem with the disk.

Figure 11. 3.5-inch Disk LEDs

When viewed from the front of the 4U24 enclosure, the slot numbers start from the top left (slot 0), and continue left to right, top to bottom (slot 23). The dm_report utility reports drive data starting from slot 0.

2. Log in to the primary MGMT node via SSH (user name admin and the customer’s password).

[ADMIN]$ ssh -l admin primary_MGMT_node

3. Log in to the OSS node via SSH:

[MGMT0]$ ssh oss_node

4. If the location of the failed disk is already known, go to the end of this subsection. To verify the location of the failed disk, run the dm_report command:

[admin@n000]$ sudo dm_report

The dm_report command provides status for all drives in the enclosure. The various status types are:

● OK/Failed: for drives owned by the node where the report was run

● Foreign: for drives owned by the other node

● Hot Spare: for hot spare drives owned by either node

● Empty: for slots that do not contain a drive

Following is an example dm_report output, showing a failed disk in slot 1 (status: Failed), which was part of RAID array md1 (status: Degraded).

[admin@snx11000n000 ~]$ sudo dm_report
Diskmonitor Inventory Report: Version: 1.0-3026.xrtx.2287 Host: snx11000n000 Time: Wed Mar 27 07:46:33 2013
encl: 0, wwn: 50050cc10204a72c, dev: /dev/sg25, slots: 24, vendor: XYRATEX , product_id: HB-2435-Sonexion 1600
slot: 0, wwn: 50000c0f01b10500, cap: 4000787029504, dev: sdt, parts: 0, status: Foreign Arrays
slot: 1, wwn: 50000c0f01b06f64, cap: 4000787029504, dev: sdn, parts: 0, status: Failed
slot: 2, wwn: 50000c0f0113593c, cap: 4000787029504, dev: sdm, parts: 0, status: Ok
...
Array: md1, UUID: 8326a7ef-27c123c9-6d45a2d1-c7a226f9, status: Degraded
disk_wwn: 50000c0f0113593c, disk_sd: sdm, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 2
disk_wwn: 50000c0f01b0a480, disk_sd: sdk, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 3
disk_wwn: 50000c0f01b10fbc, disk_sd: sdp, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 5
Array: md127, UUID: 89b60bdb-8890fd5e-6b8f17b2-dbdbcdd2, status: Ok
Array is unmanaged -- found no disks in a managed enclosure
End_of_report

5. Identify the faulty drive by examining the dm_report slot location and the disk drive with the amber Drive Fault LED illuminated. The faulty drive may also be designated as "Failed" in the dm_report output.
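Reading the dm_report output in steps 4 and 5 can be sketched as a small parser that lists failed slots and degraded arrays. This is an illustration built only from the sample output shown in this procedure; it assumes each record occupies one line, and it makes no claim about fields or statuses that the sample does not show.

```python
import re

# Assumption: each dm_report record occupies one line, as in the sample output.
SLOT_RE = re.compile(r"^slot:\s*(\d+),.*\bstatus:\s*(.+?)\s*$")
DEV_RE = re.compile(r"\bdev:\s*(\w+)")
ARRAY_RE = re.compile(r"^Array:\s*(\w+),.*\bstatus:\s*(\w+)")


def parse_dm_report(text):
    """Return ({slot: {"dev", "status"}}, {array: status}) from report text."""
    slots, arrays = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        slot = SLOT_RE.match(line)
        if slot:
            dev = DEV_RE.search(line)
            slots[int(slot.group(1))] = {
                "dev": dev.group(1) if dev else None,
                "status": slot.group(2),
            }
            continue
        array = ARRAY_RE.match(line)
        if array:
            arrays[array.group(1)] = array.group(2)
    return slots, arrays


# Excerpt of the sample output from step 4.
SAMPLE = """\
encl: 0, wwn: 50050cc10204a72c, dev: /dev/sg25, slots: 24, vendor: XYRATEX , product_id: HB-2435-Sonexion 1600
slot: 0, wwn: 50000c0f01b10500, cap: 4000787029504, dev: sdt, parts: 0, status: Foreign Arrays
slot: 1, wwn: 50000c0f01b06f64, cap: 4000787029504, dev: sdn, parts: 0, status: Failed
slot: 2, wwn: 50000c0f0113593c, cap: 4000787029504, dev: sdm, parts: 0, status: Ok
Array: md1, UUID: 8326a7ef-27c123c9-6d45a2d1-c7a226f9, status: Degraded
disk_wwn: 50000c0f0113593c, disk_sd: sdm, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 2
"""

slots, arrays = parse_dm_report(SAMPLE)
failed = sorted(n for n, s in slots.items() if s["status"] == "Failed")
degraded = sorted(a for a, st in arrays.items() if st == "Degraded")
print("failed slots:", failed, "degraded arrays:", degraded)
```

The slot number found this way should still be confirmed against the amber Drive Fault LED, as step 5 requires.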

Remove and Install a 3.5-inch 4U24 Disk

6. If the anti-tamper lock is engaged, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key.

Figure 12. Anti-Tamper Lock Indicator

7. Remove the lock key.

8. Release the disk by pressing the latch handle towards the handle hinge as shown in the following figure.

Figure 13. Releasing 3.5-inch Disk Latch Handle

9. Gently remove the disk approximately 1 inch (25mm) from the drive slot, and then wait 30 seconds.

Figure 14. Removing a 3.5-inch Disk

10. Fully remove the disk from the drive bay.

11. Wait for the system to detect the missing drive. The dm_report will show the slot as Empty.

On a quiescent system, it takes approximately 30 seconds for the missing drive to be detected, longer on a busy system.

12. Verify that the disk handle is released and in the open position.

Figure 15. 3.5-inch Disk with Handle in Open Position

CAUTION: All drive slots must have a disk or a dummy carrier installed to maintain balanced airflow.

IMPORTANT:

● Examine the replacement drive to ensure that it is the same size, model, and manufacturer as the failed drive.

● Ensure that the new disk is oriented so the drive handle opens downward.

● A disk cannot be installed if the anti-tamper lock is activated outside the enclosure.

13. Insert the new disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops.

Make certain the disk orients up and the handle opens from the left.

Figure 16. Installing a 3.5-inch Disk

14. Seat the disk by pressing the handle latch.

A click is audible as the handle latch engages.

Figure 17. Seated 3.5-inch Disk

15. Verify that the new disk is in the same orientation as the other disks in the enclosure.

16. Activate the anti-tamper locks.

a. Carefully insert the lock key into the lock socket.

b. Rotate the key clockwise until the red indicator is visible in the opening above the key.

Figure 18. Anti-Tamper Lock Indicator

c. Remove the lock key.

Verify 4U24 Hot Spare, Arrays, and Statuses

17. Verify that the new disk is registered as the hot spare:

[admin@n000]$ sudo dm_report

Depending on the cluster's load and drive spin up time, it may take a few minutes for the dm_report output to show the new disk registered as the hot spare.

Following is a partial sample of dm_report output showing the disk in slot 1 registered as the hot spare:

$ sudo dm_report
Diskmonitor Inventory Report: Host: snx11000n000 Time: Tue Jan 3 14:30:39 2012
encl: 0, wwn: 50050cc10204a72c, dev: /dev/sg25, slots: 24, vendor: XYRATEX , product_id: HB-2435-Sonexion 1600
slot: 0, wwn: 50000c0f01b10500, cap: 4000787029504, dev: sdt, parts: 0, status: Foreign Arrays
slot: 1, wwn: 50000c0f01131e68, cap: 4000787029504, dev: sde, parts: 0, status: Hot Spare
slot: 2, wwn: 50000c0f0113593c, cap: 4000787029504, dev: sdm, parts: 0, status: Ok
...
Array: md1, UUID: 8326a7ef-27c123c9-6d45a2d1-c7a226f9, status: Ok
disk_wwn: 50000c0f0113593c, disk_sd: sdm, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 2
disk_wwn: 50000c0f01b0a480, disk_sd: sdk, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 3
disk_wwn: 50000c0f01b10fbc, disk_sd: sdp, disk_part: 0, encl_wwn: 50050cc10204a72c, encl_slot: 5

18. If the new disk comes up as a hot spare, go to step 21. If it comes up as anything else, continue to the following step. Following is a partial sample of dm_report output that shows the disk in slot 5 as failed.

[admin@snx11000n000 ~]$ sudo dm_report
Diskmonitor Inventory Report: Version: 1.0-3026.xrtx.2287 Host: snx11000n000 Time: Wed Mar 27 08:29:30 2013
encl: 0, wwn: 50050cc10c201ea4, dev: /dev/sg24, slots: 24, vendor: XYRATEX , product_id: EB-2425P-E6EBD
slot: 0, wwn: 5000c50047a5aa93, cap: 450098159104, dev: sda, parts: 0, status: Foreign Arrays
slot: 1, wwn: 5000c50047ad9dc7, cap: 450098159104, dev: sdb, parts: 0, status: Foreign Arrays
slot: 2, wwn: 5000c50047a5b4ab, cap: 450098159104, dev: sdl, parts: 0, status: Foreign Arrays
slot: 3, wwn: 5000c50047a5b323, cap: 450098159104, dev: sdj, parts: 0, status: Ok
slot: 4, wwn: 5000c50047b5a953, cap: 450098159104, dev: sdi, parts: 0, status: Foreign Arrays
slot: 5, wwn: 5000c50047b5a81f, cap: 450098159104, dev: sdm, parts: 0, status: Failed

19. If the new disk comes up as anything other than hot spare, clear the superblock information:

[admin@n000]$ sudo mdadm --zero-superblock --force /dev/sdXX

where XX is the SD device number.

20. Verify that the new disk is registered as the 'hot spare':

[admin@n000]$ sudo dm_report

After the --zero-superblock command completes, the new drive should show as a hot spare:

slot: 5, wwn: 5000c50047b5a81f, cap: 450098159104, dev: sdm, parts: 0, status: Hot Spare

21. Verify the state of the arrays in the 4U24 enclosure:

[admin@n000]$ sudo cat /proc/mdstat

Following is an example cat /proc/mdstat output showing raid device md1 in recovery.

[admin@eval203 ~]$ sudo cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid6 sdn[11](S) sds[10] sdd[9] sdk[8] sdl[7] sdf[6] sdw[5] sdr[4] sdx[3] sdt[2]
30312499200 blocks super 1.2 level 6, 128k chunk, algorithm 2 [10/8/0] [__UUUUUUUU]
in: 187397 reads, 1880276 writes; out: 3472468040 reads, 1275388538 writes, 353041032 zwrites
1378360615 in raid5d, 75598 out of stripes, 1776296108 handle called
reads: 0 for rmw, 3131597 for rcw
21139999 delayed, 4289582396 bit delayed, 16314 active, queues: 2080342 in, 193974 out
[========>............] recovery = 42.0% (1591677740/3789062400) finish=407.9min speed=89780K/sec
|bitmap: 22/226 pages [88KB], 8192KB chunk, file: /WIBS/eval202:md1/WIB_eval202:md1

22. Verify that the OCP status and Drive Fault LEDs are normal.

a. Verify that the Module Fault LED on the Operator Control Panel (OCP) of the 4U24 enclosure is green.

b. Verify that no Drive Fault LEDs are illuminated.

If a recovery or rebuild is in progress on the MDRAID array, the Activity LEDs will be illuminated for each drive in the array. The Logical Fault LED on the OCP of the 4U24 enclosure will also be illuminated.

23. Log out of the MGMT node.
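The recovery line in the /proc/mdstat sample from step 21 can be read programmatically. The sketch below extracts the progress figures and counts missing array members from the `[__UUUUUUUU]` map; it is based only on the sample shown in this procedure, handles a single array, and uses hypothetical function names.

```python
import re

# Matches e.g. "recovery = 42.0% (1591677740/3789062400) finish=407.9min speed=89780K/sec"
RECOVERY_RE = re.compile(
    r"(recovery|resync)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)"
    r"\s*finish=([\d.]+)min\s*speed=(\d+)K/sec"
)
# Matches the member map, e.g. "[__UUUUUUUU]" (U = present, _ = missing).
MEMBER_MAP_RE = re.compile(r"\[([U_]+)\]")


def recovery_status(mdstat_text):
    """Return recovery figures as a dict, or None if no recovery is running."""
    m = RECOVERY_RE.search(mdstat_text)
    if m is None:
        return None
    return {
        "action": m.group(1),
        "percent": float(m.group(2)),
        "done_blocks": int(m.group(3)),
        "total_blocks": int(m.group(4)),
        "finish_min": float(m.group(5)),
        "speed_k_per_sec": int(m.group(6)),
    }


def missing_members(mdstat_text):
    """Count '_' entries in the first member map, or None if there is no map."""
    m = MEMBER_MAP_RE.search(mdstat_text)
    return m.group(1).count("_") if m else None


# Two lines taken from the sample output in step 21.
SAMPLE_MDSTAT = (
    "30312499200 blocks super 1.2 level 6, 128k chunk, algorithm 2 [10/8/0] [__UUUUUUUU]\n"
    "[========>............] recovery = 42.0% (1591677740/3789062400) "
    "finish=407.9min speed=89780K/sec\n"
)

status = recovery_status(SAMPLE_MDSTAT)
print(status["percent"], status["finish_min"], missing_members(SAMPLE_MDSTAT))
```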

Replace a 4U24 SSU OSS Controller

Prerequisites

Part number

101229500, Sonexion 900 Series Application Controller 16GB (ESU-Ready) For SSU - FRU

Time

1.5 hours

Interrupt levels

● Remove/replace SSU controller: Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)

● USM firmware update needed: Interrupt (requires taking the Lustre file system offline. Perform a USM upgrade only if the firmware version is out of date)

Tools

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

● ESD strap

About this task

Use this procedure to remove and replace a failed OSS controller in the Sonexion 900 4U24 SSU.

Subtasks:

● Remove and Install OSS Controller

● Verify New SSU Controller

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

Each controller hosts one OSS node; there are two OSS nodes per SSU. Within an SSU, the OSS nodes are organized in an HA pair with sequential numbers (for example, node n004 / n005 or snx11000n004 / snx11000n005). If an OSS node goes down because its controller fails, its resources migrate to the HA partner/OSS node in the other controller.
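The sequential HA pairing can be expressed as a small helper that derives the partner's hostname. Note the assumption: the sketch pairs each even-numbered node with the next odd-numbered node, which matches the examples in this procedure (n004 / n005, n008 / n009) but is not stated as a general rule in this manual. The function name is hypothetical.

```python
import re

# Hostname = arbitrary prefix ending in "n", followed by a zero-padded number.
NODE_RE = re.compile(r"^(.*?n)(\d+)$")


def ha_partner(hostname):
    """Return the assumed HA partner of an OSS node hostname.

    Assumption: pairs are (even, even + 1), as in n004/n005 and n008/n009.
    """
    m = NODE_RE.match(hostname)
    if m is None:
        raise ValueError(f"unrecognized node name: {hostname!r}")
    prefix, digits = m.groups()
    number = int(digits)
    partner = number + 1 if number % 2 == 0 else number - 1
    # Keep the same zero padding as the original hostname.
    return f"{prefix}{partner:0{len(digits)}d}"


print(ha_partner("snx11000n004"))
print(ha_partner("n005"))
```

On a real system, confirm the partner from the cluster configuration or the crm_mon output rather than relying on the naming pattern alone.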

A downed OSS node cannot be reached directly by the Sonexion system. Several steps in this procedure involve logging into the HA partner (on the other controller) to determine the downed node's status and whether its resources have successfully failed over to the HA partner.

● Take the OSS node offline if the resources have failed over but the affected node is still online.

● Force a failover of the HA resources to the partner if the local HA resources have not failed over and the affected node is still online.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

IMPORTANT:

● During the customer wizard the cluster is deliberately set into maintenance mode. No FRU procedures are supported until the customer wizard is complete. Forceful exit from maintenance mode is not supported and may damage the cluster.

● Cray support personnel should perform all software instructions via remote system access. Before performing this procedure, please contact Cray Support to coordinate login credentials, timing, etc. If remote system access is not available, the instructions in this document are sufficiently detailed for onsite personnel to perform.

Procedure

1. Determine the physical and logical location (hostname) of the failed controller in the SSU.

2. Log in to the active MGMT node via SSH (user name admin and the customer’s password).

[ADMIN]$ ssh -l admin active_MGMT_node

3. From the active MGMT node, fail over the resources from the affected node to its HA partner:

[admin@n000]$ cscli failover -n nodes

where nodes is the name(s) of the node(s) requiring failover.

For example, if a failure occurs on node snx11000n004 and you need to fail over its resources to node snx11000n005, the command is as follows:

[admin@n000]$ cscli failover -n snx11000n004

4. Log in to the HA partner of the OSS node on the failed controller via SSH.

[admin@n000]$ ssh HA_partner_node_hostname

5. Use the crm_mon utility to display the status of both OSS nodes:

[admin@n000]$ sudo crm_mon -1

When both OSS nodes are online with their resources assigned to them, the crm_mon -1 output looks as follows:

[admin@snx11000n004 ~]$ sudo crm_mon -1
============
Last updated: Wed Jan 23 11:18:30 2013
Last change: Wed Jan 23 09:58:05 2013 via crm_resource on snx11000n005
Stack: Heartbeat
Current DC: snx11000n005 (fd52ad02-9476-4329-91cf-580b06019ebf) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
55 Resources configured.
============
Online: [ snx11000n004 snx11000n005 ]
snx11000n004-stonith (stonith:external/gem_stonith): Started snx11000n004
snx11000n005-stonith (stonith:external/gem_stonith): Started snx11000n005
snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004
snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005
baton (ocf::heartbeat:baton): Started snx11000n005
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
Resource Group: snx11000n004_md0-group
    snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
    snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
Resource Group: snx11000n004_md2-group
    snx11000n004_md2-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md2-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md2-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md2-raid (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md2-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md2-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md3-group
    snx11000n004_md3-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md3-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md3-stop (ocf::heartbeat:XYSTOP): Started snx11000n005

When the OSS node on the failed controller is offline and its resources have failed over to its HA partner, partial crm_mon -1 output looks as follows. (The full output includes eight RAID groups, but the example shows only four.)

[admin@snx11000n009 ~]$ sudo crm_mon -1
============
Last updated: Wed May 1 11:40:32 2013
Last change: Wed May 1 11:33:02 2013 via cibadmin on snx11000n009
Stack: Heartbeat
Current DC: snx11000n009 (a95f9134-b47d-4e5b-8a7f-4c0fc1715fc8) - partition with quorum
Version: 1.1.6.1-3.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
55 Resources configured.
============
Online: [ snx11000n009 ]
OFFLINE: [ snx11000n008 ]
Clone Set: clone-gem-ipmi-stonith [gem-ipmi-stonith]
    Started: [ snx11000n009 ]
    Stopped: [ gem-ipmi-stonith:0 ]
snx11000n009_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n009
baton (ocf::heartbeat:baton): Started snx11000n009
snx11000n009_ibstat (ocf::heartbeat:ibstat): Started snx11000n009
Resource Group: snx11000n008_md0-group
    snx11000n008_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n009
Resource Group: snx11000n008_md1-group
    snx11000n008_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n009
Resource Group: snx11000n008_md2-group
    snx11000n008_md2-wibr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md2-jnlr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md2-wibs (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md2-raid (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md2-fsys (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md2-stop (ocf::heartbeat:XYSTOP): Started snx11000n009
Resource Group: snx11000n008_md3-group
    snx11000n008_md3-wibr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md3-jnlr (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md3-wibs (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md3-raid (ocf::heartbeat:XYRAID): Started snx11000n009
    snx11000n008_md3-fsys (ocf::heartbeat:XYMNTR): Started snx11000n009
    snx11000n008_md3-stop (ocf::heartbeat:XYSTOP): Started snx11000n009

In this example, the OSS node on the failed controller is snx11000n008 and its HA partner is snx11000n009.
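
The failover check above can also be done on saved output rather than by eye. The following Python sketch is illustrative only and is not part of the Sonexion tooling: the function names parse_crm_mon and failover_complete are hypothetical, and the sketch assumes that crm_mon -1 output has the same line layout as the example (an Online/OFFLINE line per state, and one "name (agent): Started node" line per resource) and that resources belonging to the failed node are prefixed with its hostname.

```python
import re

# Matches the node status lines, e.g. "Online: [ snx11000n009 ]".
NODES_RE = re.compile(r"^(Online|OFFLINE):\s*\[(.*)\]$")
# Matches a started resource, e.g.
# "snx11000n008_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n009".
STARTED_RE = re.compile(r"^(\S+)\s+\([^)]+\):\s+Started\s+(\S+)$")


def parse_crm_mon(text):
    """Return (online nodes, offline nodes, {resource: node}) from crm_mon -1 text."""
    online, offline, placements = set(), set(), {}
    for raw in text.splitlines():
        line = raw.strip()
        nodes = NODES_RE.match(line)
        if nodes:
            target = online if nodes.group(1) == "Online" else offline
            target.update(nodes.group(2).split())
            continue
        started = STARTED_RE.match(line)
        if started:
            placements[started.group(1)] = started.group(2)
    return online, offline, placements


def failover_complete(text, failed_node, partner):
    """True if failed_node is offline and all of its resources run on partner.

    Assumption: resources owned by a node are named "<hostname>_...".
    """
    online, offline, placements = parse_crm_mon(text)
    moved = [node for name, node in placements.items()
             if name.startswith(failed_node + "_")]
    return (failed_node in offline and partner in online
            and bool(moved) and all(node == partner for node in moved))
```

With the example output saved to a string, failover_complete(output, "snx11000n008", "snx11000n009") would return True only if every snx11000n008_ resource is reported as started on snx11000n009.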

6. After verifying that the failover has occurred, power off the failed controller. From the active MGMT node, enter:

[admin@n000]$ cscli power_manage -n nodeXX --power-off

Where nodeXX is the name of the affected OSS node.

If the failover has occurred and the failed controller is already powered off, proceed to the next step.

Remove and Install OSS Controller

Perform the following steps at the back of the rack. Wear an ESD strap before removing the OSS controllers.

7. Disconnect the cables from the failed controller.

a. Unplug the two RJ-45 network cables.

b. Unplug the InfiniBand cable.

c. Unplug all SAS cables (there may be up to three SAS cables, depending on the number of ESUs).

Attach a label to each cable indicating the port (A, B, or C) and the Sonexion 900 SSU OSS controller (Upper or Lower) to which the cable must be reconnected.

8. Remove the failed controller from the SSU, using the locking lever to slide the controller out from the back of the rack.

9. Insert the new controller half-way into the SSU, but do not seat it in the enclosure.

10. Connect the cables to the new controller.

a. Plug in the two RJ-45 network cables.

b. Plug in the InfiniBand cable.

c. Plug in the SAS cables to their original ports.

11. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel).

12. Open a terminal session with these settings:

Bits per second 115200

Data bits 8

Parity None

Stop bits 1

Flow control None

The Function Keys are set to VT100+.

This serial connection allows monitoring the startup for any issues.

13. Completely insert the new controller into the SSU (until the locking lever engages and the unit is properly seated in the chassis) and press the power button on the back of the controller.

Figure 19. OSS Rear Panel

A thin implement such as a pen or paperclip is required to reach the Power On button.

14. Monitor the boot cycle from the terminal display and allow the discovery process to complete before proceeding. The following steps should occur during the boot cycle:

● The new OSS controller boots into discovery mode and discovery occurs.

● The controller automatically reboots with the correct hostname and restores the HA configuration.

● The controller reboots automatically again and comes up online and completely operational.

15. Verify that the newly installed controller is running:

[admin@n000]$ sudo pdsh -a uname -r | dshbak -c

16. Log in to the newly installed OSS node:

[admin@n000]$ ssh oss_node
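
The check in step 15 amounts to confirming that every expected node answers the pdsh command. The following Python sketch is illustrative only: the function names group_by_output and missing_nodes are hypothetical, and the sketch assumes the raw pdsh output (before it is piped through dshbak), in which each line has the form "hostname: text" and errors reported by pdsh itself begin with "pdsh@".

```python
from collections import defaultdict


def group_by_output(pdsh_text):
    """Group hostnames by the text each node returned, e.g. its kernel release."""
    groups = defaultdict(set)
    for line in pdsh_text.splitlines():
        if line.startswith("pdsh@"):
            continue  # error reported by pdsh itself, not output from a node
        host, sep, rest = line.partition(": ")
        if sep and host and " " not in host:
            groups[rest.strip()].add(host)
    return dict(groups)


def missing_nodes(pdsh_text, expected):
    """Return the expected hostnames that produced no output, sorted by name."""
    responded = set().union(*group_by_output(pdsh_text).values())
    return sorted(set(expected) - responded)
```

If the newly installed controller's hostname appears in the list returned by missing_nodes, the node has not yet come online and the boot cycle in step 14 should be checked again.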

Verify New SSU Controller

17. Verify that the new controller is online:

[admin@n000]$ sudo crm_mon -1

When both OSS nodes are online with their resources assigned to them, the crm_mon -1 output looks as follows. (The full output includes eight RAID groups, but the example shows only four.)

[admin@snx11000n004 ~]$ sudo crm_mon -1
============
Last updated: Wed Jan 23 11:52:39 2013
Last change: Wed Jan 23 11:52:35 2013 via crm_attribute on snx11000n004
Stack: Heartbeat
Current DC: snx11000n005 (fd52ad02-9476-4329-91cf-580b06019ebf) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
55 Resources configured.
============
Online: [ snx11000n004 snx11000n005 ]
snx11000n004-stonith (stonith:external/gem_stonith): Started snx11000n004
snx11000n005-stonith (stonith:external/gem_stonith): Started snx11000n005
snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005
baton (ocf::heartbeat:baton): Started snx11000n005
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
Resource Group: snx11000n004_md0-group
    snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
Resource Group: snx11000n004_md1-group
    snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
Resource Group: snx11000n004_md2-group
    snx11000n004_md2-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md2-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md2-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md2-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md2-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md2-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
Resource Group: snx11000n004_md3-group
    snx11000n004_md3-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md3-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md3-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md3-stop (ocf::heartbeat:XYSTOP): Started snx11000n005

18. From the active MGMT node, fail back the resources to balance the load between the affected nodes:

[admin@n000]$ cscli failback -n nodes

Where nodes are the names of the node(s) that previously failed over. For example:

[admin@n000]$ cscli failback -n xx04

19. Verify that all local resources have failed back to the OSS node on the new controller:

[admin@n000]$ sudo crm_mon

When both OSS nodes are online with their resources assigned to them, crm_mon -1 output looks similar to this:

[admin@snx11000n004 ~]$ sudo crm_mon -1
============
Last updated: Wed Jan 23 11:18:30 2013
Last change: Wed Jan 23 09:58:05 2013 via crm_resource on snx11000n005
Stack: Heartbeat
Current DC: snx11000n005 (fd52ad02-9476-4329-91cf-580b06019ebf) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
55 Resources configured.
============
Online: [ snx11000n004 snx11000n005 ]
snx11000n004-stonith (stonith:external/gem_stonith): Started snx11000n004
snx11000n005-stonith (stonith:external/gem_stonith): Started snx11000n005
snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004
snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005
baton (ocf::heartbeat:baton): Started snx11000n005
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
Resource Group: snx11000n004_md0-group
    snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
    snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
    snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
    snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005
    snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
    snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
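
After failback, the resource groups should be spread across both OSS nodes rather than all running on the HA partner. The following Python sketch is illustrative only: the function names group_owners, groups_per_node, and split_groups are hypothetical, and the sketch assumes the crm_mon -1 layout shown in the example, where each "Resource Group: <name>-group" line is followed by members named "<name>-wibr", "<name>-raid", and so on.

```python
import re
from collections import Counter

GROUP_RE = re.compile(r"^Resource Group:\s*(\S+)$")
STARTED_RE = re.compile(r"^(\S+)\s+\([^)]+\):\s+Started\s+(\S+)$")


def group_owners(crm_mon_text):
    """Map each resource group to the set of nodes running its members."""
    owners = {}
    current = None
    prefix = None
    for raw in crm_mon_text.splitlines():
        line = raw.strip()
        group = GROUP_RE.match(line)
        if group:
            current = group.group(1)
            owners[current] = set()
            # Assumption: members share the group name without "-group".
            if current.endswith("-group"):
                prefix = current[:-len("-group")]
            else:
                prefix = current
            continue
        started = STARTED_RE.match(line)
        if started and current and started.group(1).startswith(prefix + "-"):
            owners[current].add(started.group(2))
    return owners


def groups_per_node(crm_mon_text):
    """Count how many resource groups have members running on each node."""
    owners = group_owners(crm_mon_text)
    return Counter(node for nodes in owners.values() for node in nodes)


def split_groups(crm_mon_text):
    """Return groups whose started members are not all on exactly one node."""
    owners = group_owners(crm_mon_text)
    return sorted(name for name, nodes in owners.items() if len(nodes) != 1)
```

For the two groups shown in the example, groups_per_node would report one group on each node, and split_groups would return an empty list. A node with a count of zero after failback suggests that its resources are still running on the partner.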

20. Compare the USM and GEM firmware versions between the Sonexion 900 SSU OSS controllers to make certain they match. If they do not match, update the new controller so that its firmware version matches the version on the existing controller.

Refer to Leveling Sonexion USM Firmware, publication S-2545, for instructions on how to compare and update firmware versions. If the firmware versions match, go to the next step.

21. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the new controller.

Replace the 4U24 SSU Chassis

Prerequisites

Part number

101268200 - Base Power Controlled 6G 4U24, no PSUs, no Controllers

Time

2 hours

Interrupt level

Interrupt (requires taking the Lustre file system offline; perform a USM upgrade only if the firmware version is out of date)

Tools

● Phillips screwdriver (medium)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 kbps)

○ KVM with attached cable

Requirements

● The new chassis and the chassis being replaced must be on the same GEM version. If the GEM versions are different, this procedure may be unsuccessful.

● The size and weight of the 4U24 chassis require two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each 4U24 chassis.

About this task

Use this procedure to remove and replace a defective chassis in a Sonexion 4U24 SSU.

Subtasks:

● Install New SSU Chassis and Replace Components

● Power On SSU Components and Verify

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. If the location of the failed 4U24 chassis is not known, look for a Fault LED (amber) on the failed 4U24 chassis (front panel).

2. Log in to the primary MGMT node via SSH (user name admin and the customer's password):

[ADMIN]$ ssh -l admin primary_MGMT_node

3. Stop the Lustre file system:

[MGMT0]$ cscli unmount -f fsname

4. Check that the Lustre file system is stopped on all nodes:

[MGMT0]$ cscli fs_info

5. After verifying that the Lustre file system has stopped, power off the Sonexion, as described in Sonexion 900 Power On and Power Off Procedures.

6. Remove SSU OSS controllers from the SSU:

a. From the back of the rack, turn off the power switches on both PCMs in the chassis.

b. Unplug the two RJ-45 network cables from the top SSU OSS controller.

c. Unplug the QSFP+ cable from the top SSU OSS controller.

Unplug all SAS cables (there may be up to three SAS cables depending upon the number of ESUs).

Attach a label to each cable indicating the port (A, B, or C) and SSU OSS controller (Upper or Lower) to which it should be reconnected.

Wear ESD protection to perform the following steps.

d. Release the controller latch by grasping it between the thumb and forefinger and gently squeezing.

e. Using the latch as a handle, carefully remove the controller from the enclosure.

f. Repeat the controller removal steps for the second SSU OSS controller.

7. From the back of the rack, remove the PCMs:

a. Disconnect the power cord from one of the PCMs by removing the bale from the PCM and removing the cord.

b. Release the PCM latch by grasping it between the thumb and forefinger and gently squeezing it.

Figure 20. Releasing Latch on 4U24 PCM

c. Using the latch as a handle, carefully remove the PCM from the enclosure.

WARNING: Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.

Figure 21. Removing a 4U24 PCM

d. Repeat the PCM removal steps for the second PCM.

8. Remove disks as follows. Note the exact location of the drives, as they must be installed in the same order in the new 4U24 chassis.

a. If the anti-tamper lock is engaged, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key.

Perform these steps at the front of the rack.

b. Remove the lock key.

Figure 22. Releasing the Latch Handle

Release the disk by pressing the latch handle towards the handle hinge as shown above.

c. Gently pull the disk approximately 1 inch (25 mm) out of the drive slot, and then wait 30 seconds to allow the drive to finish spinning down.

Figure 23. Removing a disk

d. Fully remove the disk from the drive bay.

Mark the drive with its current drive slot number in the chassis so it can be reinstalled in the same slot in the new chassis. From the front of the rack, the drive slots are numbered 0 to 23 (left to right).

e. Repeat the disk drive removal steps for the remaining disks.
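
Because every disk must return to its original slot, it can help to keep a written or electronic record alongside the physical labels. The following Python sketch is illustrative only and is not part of the Sonexion tooling: the function names record_removal and misplaced are hypothetical, and the disk label (for example, a drive serial number) is an assumed identifier chosen by the technician.

```python
SLOT_COUNT = 24  # drive slots are numbered 0 to 23, left to right


def record_removal(slot_map, slot, disk_label):
    """Record which disk was removed from which slot of the old chassis."""
    if not 0 <= slot < SLOT_COUNT:
        raise ValueError(f"slot {slot} is outside 0..{SLOT_COUNT - 1}")
    if slot in slot_map:
        raise ValueError(f"slot {slot} was already recorded")
    slot_map[slot] = disk_label


def misplaced(removed, installed):
    """Return the slots whose disk in the new chassis differs from the record."""
    return sorted(slot for slot, label in removed.items()
                  if installed.get(slot) != label)
```

Calling misplaced with the removal record and the layout observed in the new chassis returns an empty list when every disk is back in its original slot.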

9. Remove the failed chassis from the rack:

a. Remove the left and right front flange caps by pulling the caps free.

b. Disconnect the chassis from the rack by removing the screw from the left and right flanges (now exposed after removing the flange caps).

c. With a second person, remove the chassis from the rack.

Install New SSU Chassis and Replace Components

Perform the following steps at the front of the rack.

10. Install the new chassis in the rack:

a. With the chassis on a bench, remove the left and right front flange caps by pulling the caps free.

The caps simply snap onto the flanges.

b. With a second person, move the 4U24 chassis into the rack. Carefully align the guide on each side of the chassis with the groove on the rail assembly and gently push the chassis completely into the rack.

c. Connect the chassis to the rack by installing a screw into the left and right flanges.

d. Install the flange caps by pressing them into position until they snap into place on the flanges.

11. In the following steps, install each disk drive in the same slot where it was located in the old 4U24 chassis. Perform these steps at the front of the rack.

IMPORTANT: Ensure that each disk is oriented so the drive handle opens from the left.

a. Verify that each disk handle is released and in the open position.

Figure 24. Disk with Handle in Open Position

b. Insert each disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops.

Figure 25. Installing a Disk

c. Seat the disk by pressing the handle latch and rotating it to the closed position.

There will be an audible click as the handle latch engages.

Figure 26. Seated disk

d. Verify that each disk is in the same orientation as the other disks in the enclosure.

e. Carefully insert the lock key into the lock socket and rotate it clockwise until the red indicator is visible in the opening above the key.

f. Remove the lock key.

g. Repeat the disk drive installation steps for the remaining disks.

12. Use the following steps to re-install each PCM, working at the back of the rack.

a. Carefully inspect the PCM for damage, especially to the rear connector. Avoid damaging the connector pins.

IMPORTANT: If the PCM is damaged, do not install it but obtain another PCM.

b. Verify that the power switch on each PCM is in the OFF position.

c. With the PCM handle in the open position, slide the PCM into the empty bay at the rear of the 4U24 enclosure.

Figure 27. Installing a PCM

d. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. A click should be heard as the handle latch engages.

Figure 28. Seating the PCM

e. Connect the power cord to the PCM.

f. Place the bale over and onto the power cord.

g. Repeat the PCM installation steps for the second PCM.

13. Re-install each SSU OSS controller:

a. Inspect the SSU OSS controller for damage, especially to the interface connector.

IMPORTANT: If the controller is damaged, do not install it but obtain another SSU OSS controller.

b. With the latch in the released (open) position, slide the SSU OSS controller into the enclosure until it completely seats and engages the latch.

c. Secure the OSS controller by closing the latch.

There will be an audible click as the latch engages.

d. Repeat the OSS controller installation steps for the second SSU OSS controller.

Power On SSU Components and Verify

14. Plug in the two RJ-45 network cables to the SSU OSS controller.

15. Plug in the QSFP+ cable to the SSU OSS controller.

16. Plug in the SAS cables to their original ports on the SSU OSS controller.

17. Turn on the power switches on both PCMs.

18. Verify that the indicator LEDs on the PCMs, SAS EBOD controllers and 4U24 chassis (front panel) are normal and blinking green.

The Module Fault and Logical Fault indicators should be off.

Figure 29. 4U24 Operator Control Panel

Table 2. Operator Control Panel: 4U24 LED Indicators

LED             State           Description
System Power    Steady Green    AC power is applied to the enclosure.
Module Fault    Steady Amber    Indicates one of the following:
                                ● Power Cooling Module fault
                                ● ESM fault
                                ● Over or under temperature fault condition
                                Refer to individual module fault LEDs.
Logical Fault   Steady Amber    Indicates failure of a disk drive.

19. Power on the Sonexion, as described in Sonexion 900 Power On and Power Off Procedures, Cray publication SR5-6132.

20. Compare the USM and GEM firmware versions between the new chassis and the MMU OSS controllers to verify that they match.

Refer to Leveling Sonexion USM Firmware, publication S-2545, for instructions on how to compare and update firmware versions.

21. Start the Lustre file system:

[MGMT0]$ cscli mount -f fsname

22. Verify that the Lustre file system is started on all nodes:

[MGMT0]$ cscli fs_info

23. After verifying that the Lustre file system has started, close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial cable from the primary MGMT server.

Replace a 4U24 ESU SAS EBOD Controller

Prerequisites

Part number

101229500 Sonexion 900 Series Application Controller 16GB (ESU-Ready) For SSU - FRU

Time

1.5 hours

Interrupt level

Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)

Tools

● Phillips screwdriver (medium)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 kbps)

○ KVM with attached cable

System Access Requirements

Root access is required to perform this procedure. To get root system access, contact Cray Support.

About this task

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

This procedure includes steps to replace the failed SAS EBOD controller, verify the operation of the new SAS EBOD controller, and return the Sonexion system to normal operation.

Perform the following steps at the back of the rack.

CAUTION: If passive copper cables are connected, they must not have a connection to a common ground or earth point.

Procedure

1. If the location of the failed SAS EBOD controller is not known, look for an amber Fault LED on the failed SAS EBOD controller and the Module Fault LED on the OCP (Operator Control Panel) on the 4U24 enclosure (front panel).

2. Fail over and shut down the OSS node hosted by the failed SAS EBOD controller.

a. Log in to the primary MGMT node:

[Client]$ ssh -l admin primary_MGMT_node

b. Determine whether the failover operation occurred:

[admin@n000]$ cscli fs_info

If a failover occurred, go to Step 3. If it did not occur, proceed to step 2c.

c. Fail over resources from the OSS node hosted on the failed SAS EBOD controller:

[admin@n000]$ cscli failover -n oss_nodename

Wait for the resources to fully fail over to the OSS node's HA partner. To confirm that the failover operation has completed, run:

[admin@n000]$ cscli fs_info

3. Shut down the affected OSS node:

[admin@n000]$ cscli power_manage -n oss_nodename --power-off

Wait for the OSS node to completely power off. To confirm that the power-off operation has completed, run:

[admin@n000]$ sudo pm -q

4. Remove the failed SAS EBOD controller and install the new replacement controller.

5. Disconnect each SAS cable from the failed SAS EBOD controller and attach a label indicating the controller and port to which it was attached, so it can be properly reconnected. Use "upper" or "lower" to indicate the controller, and "A" to indicate the port.

IMPORTANT: The SAS EBOD controllers are installed on top of each other in the enclosure. Depending on the cable clearance, it may be necessary to disconnect the SAS cables on both SAS EBOD controllers to access the failed unit. If the SAS cables are disconnected from the functioning SAS EBOD controller, be sure to attach an identifying label to each one (described above), so they can be properly reconnected after the new controller is installed.

6. Release the controller latch by grasping it between the thumb and forefinger and gently squeezing it.

Figure 30. EBOD I/O Latch Operation

7. Use the latch as a handle to carefully remove the failed controller from the enclosure.

Figure 31. Removing an SAS EBOD controller

8. Inspect the new SAS EBOD controller for damage, especially to the interface connector. If the controller is damaged, do not install it. Obtain another SAS EBOD controller.

9. With the latch in the released (open) position, slide the new SAS EBOD controller into the enclosure until it completely seats and engages the latch.

IMPORTANT: If passive copper cables are connected, they must not have a connection to a common ground or earth point.

Figure 32. Installing an SAS EBOD controller

10. Secure the EBOD controller by closing the latch. There will be an audible click as the latch engages.

The SAS EBOD controller may take up to one minute to re-initialize after cables are reconnected.

11. Plug in the SAS cables to their assigned ports on the SAS EBOD controller. See step 5 for more information.

12. Power on the OSS node hosted on the new SAS EBOD controller. On the primary MGMT node:

[admin@n000]$ cscli power_manage -n oss_nodename --power-on

Wait for the OSS node to come online; this may take a few minutes. On the primary MGMT node, confirm that the OSS node has come online:

[admin@n000]$ pdsh -a uname -r | dshbak -c

13. Verify that the LEDs have returned to normal.

On the new SAS EBOD controller:

● Fault LED is unlit

● Health LED is lit green

On the OCP (front panel of the 4U24 enclosure):

● Module Fault LED is unlit

Verify that the SAS Lane LED (port A) on the new SAS EBOD controller is ON and ready, with no traffic showing.

14. When the OSS node hosted on the new SAS EBOD controller is online, fail back resources from its HA partner:

[admin@n000]$ cscli failback -n oss_nodename

To confirm that resources have failed back to the affected node, run:

[admin@n000]$ cscli fs_info

15. Log in to the primary MGMT node.

[Client]$ ssh -l admin primary_MGMT_node

16. Compare the USM and GEM firmware versions on both SAS EBOD controllers in the affected ESU component to determine if they are the same.

If they are not, level the firmware of the new SAS EBOD controller so that it matches the version on the existing EBOD controller. Refer to Leveling Sonexion USM Firmware to compare and update firmware versions.

17. If the terminal connection (console or PC) is still active, close it and disconnect the serial cable from the new EBOD controller.

Replace the 4U24 ESU Chassis

Prerequisites

Part number

101268200 - Base Power Controlled 6G 4U24, no PSUs, no Controllers

Time

2 hours

Interrupt level

Interrupt (requires taking the Lustre file system offline; perform a USM upgrade only if the firmware version is out of date)

Tools

● Lock key T10T

● Phillips screwdriver (medium)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 kbps)

○ KVM with attached cable

Requirements

● The new chassis and the chassis being replaced must be on the same GEM version. If the GEM versions are different, this procedure may be unsuccessful.

● The size and weight of the 4U24 chassis require two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each 4U24 chassis.

About this task

Use this procedure to remove and replace a defective chassis in a Sonexion 900 4U24 ESU.

Subtasks:

● Install New ESU Chassis and Replace Components

● Restart the System

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. If the location of the failed 4U24 chassis is not known, look for a Fault LED (amber) on the failed 4U24 chassis (front panel).

2. Log in to the primary MGMT node via SSH (user name admin and the customer's password):

[ADMIN]$ ssh -l admin primary_MGMT_node

3. Stop the Lustre file system:

[admin@n000]$ cscli unmount -f fsname

4. Check that the Lustre file system is stopped on all nodes:

[admin@n000]$ cscli fs_info

5. After verifying that the Lustre file system has stopped, power off the Sonexion, as described in Sonexion 900 Power On and Power Off Procedures.

6. Remove the SAS EBOD controller:

a. Turn off the power switches on both PCMs in the chassis.

b. On one SAS EBOD controller, disconnect each SAS cable and attach a label indicating the EBOD controller and port to which the cable must be reconnected. Use upper or lower to indicate the EBOD controller and A, B, or C to indicate the port.

A sample label is shown below:

Upper A

c. Release the controller latch by grasping it between the thumb and forefinger and gently squeezing it.

Figure 33. EBOD I/O Latch Operation

d. Using the latch as a handle, carefully remove the controller from the enclosure.

Figure 34. Removing an SAS EBOD controller

e. Repeat the EBOD controller removal steps for the second SAS EBOD controller.

7. Remove the PCMs:

a. On one of the PCMs, disconnect the power cord by removing the bale from the PCM and removing the cord.

b. Release the PCM latch by grasping it between your thumb and forefinger and gently squeezing it.

Figure 35. Releasing Latch on a 4U24 PCM

c. Using the latch as a handle, carefully remove the PCM from the enclosure.

WARNING: Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.

Figure 36. Removing a 4U24 PCM

d. Repeat the PCM removal steps for the second PCM.

8. Remove each disk as follows. Note the exact location of the drives, as they must be installed in the same order in the new 4U24 chassis.

a. If the anti-tamper lock is engaged, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key.

b. Remove the lock key.

Figure 37. Releasing the Latch Handle

Release the disk by pressing the latch handle towards the handle hinge as shown in Figure 37.

c. Gently pull the disk approximately 1 inch (25 mm) out of the drive slot, and then wait 30 seconds to allow the drive to finish spinning down.

Figure 38. Removing a Disk

d. Fully remove the disk from the drive bay.

Mark the drive with its current drive slot number in the chassis so that it can be reinstalled in the same slot in the new chassis. From the front of the rack, the drive slots are numbered 0 to 23 (left to right).

e. Repeat the disk drive removal steps for the remaining disks.

9. From the front of the rack, remove the failed chassis:

a. Remove the left and right front flange caps by pulling the caps free.

b. Disconnect the chassis from the rack by removing the screw from the left and right flanges (now exposed after removing the flange caps).

c. With a second person, remove the chassis from the rack.

Install New ESU Chassis and Replace Components

Perform the following steps at the front of the rack.

10. Install the new chassis in the rack:

a. With the chassis on a bench, remove the left and right front flange caps by pulling the caps free. The caps simply snap onto the flanges.

b. With a second person, move the 4U24 chassis into the rack. Carefully align the guide on each side of the chassis with the groove on the rail assembly and gently push the chassis completely into the rack.

c. Connect the chassis to the rack by installing a screw into the left and right flanges.

d. Install the flange caps by pressing them into position. They snap into place on the flanges.

11. In the following steps, install each disk drive in the same slot where it was located in the old 4U24 chassis. Perform these steps at the front of the rack.

IMPORTANT: Ensure that each disk is oriented so the drive handle opens from the left.

a. On one disk, verify that the disk handle is released and in the open position.

Figure 39. Disk With Handle in Open Position

b. Insert the disk into its empty drive slot and gently slide the drive carrier into the enclosure until it stops.

Figure 40. Installing a Disk

c. Seat the disk by pressing the handle latch and rotating it to the closed position.

There will be an audible click as the handle latch engages.

Figure 41. Seated DDIC

d. Verify that each disk is in the same orientation as the other disks in the enclosure.

e. Carefully insert the lock key into the lock socket and rotate it clockwise until the red indicator is visible in the opening above the key.

f. Remove the lock key.

g. Repeat the disk drive installation steps for the remaining disks.

12. Use the following steps to re-install PCMs, working at the back of the rack.

a. Carefully inspect the PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If the PCM is damaged, do not install it; obtain another PCM.

b. Verify that the power switch on each PCM is in the OFF position.

c. With the PCM handle in the open position, slide the PCM into the empty bay at the rear of the 4U24 enclosure.

Figure 42. Installing a 4U24 PCM

d. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. A click should be heard as the handle latch engages.

Figure 43. Seating the 4U24 PCM

e. Connect the power cord to the PCM.

f. Place the bale over and onto the power cord.

g. Repeat the PCM installation steps for the second PCM.

13. Re-install each SAS EBOD controller:

a. Inspect the controller for damage, especially to the interface connector. If the controller is damaged, do not install it; obtain another SAS EBOD controller.

b. With the latch in the released (open) position, slide the SAS EBOD controller into the enclosure until it completely seats and engages the latch.

Figure 44. Installing an SAS EBOD controller

c. Secure the controller by closing the latch. There will be an audible click as the latch engages.

d. Repeat the controller installation steps for the second SAS EBOD controller.

Restart the System

14. Power on the components in the 4U24 chassis.

a. Plug in the two SAS cables to their original ports on the SAS EBOD controller. Use the cable labels (created in step 6.b on page 41) to ensure that the cables are connected to the proper ports.

b. Turn on the power switches on both PCMs.

c. Verify that the indicator LEDs on the PCMs, SAS EBOD controllers, and 4U24 chassis (front panel) are normal and lit green.

The Module Fault and Logical Fault indicators should be off.

15. Power on the system, as described in Sonexion 900 Power On and Power Off Procedures.

16. Compare the USM and GEM firmware versions between the new chassis and the Sonexion 900 ESU OSS controllers to make certain they match.

Figure 45. 4U24 Operator Control Panel

Table 3. Operator Control Panel: 4U24 LED Indicators

LED             State          Description
System Power    Steady green   AC power is applied to the enclosure.
Module Fault    Steady amber   Indicates one of the following: a Power Cooling Module fault, an ESM fault, or an over- or under-temperature fault condition. Refer to the individual module fault LEDs.
Logical Fault   Steady amber   Indicates failure of a disk drive.

Refer to Leveling Sonexion USM Firmware for instructions on how to compare and update firmware versions.

17. Start the Lustre file system:

[admin@n000]$ cscli mount -f fsname

18. Check that the Lustre file system is started on all nodes:

[admin@n000]$ cscli fs_info

19. After verifying that the Lustre file system has started, close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial cable from the primary MGMT server.

The 4U24 ESU chassis FRU procedure is complete.

Replace a 2.5-inch Disk (2U24 MMU)

Prerequisites

Part number

101229600 - 300GB 15K 2.5"

Time

1 hour

Interrupt level

Live (can be applied to a live system with no service interruption)

Tools

● Lock key (for disk)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

About this task

Use this procedure to remove and replace a failed disk in carrier (DDIC, or simply "disk") in the 2U24 Metadata Management Unit (MMU) enclosure. MDRAID is not supported on the MMU.

Subtasks:

● Remove and Install a 2.5-inch 2U24 Disk

● Verify 2U24 Hot Spare, Arrays, Statuses

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

Failed Hot Spare

If a hot spare fails, it must be replaced with a new drive, but the remove/replace procedure is easier because no disk rebuild is necessary. Instructions to replace a failed hot spare are provided in the following procedure.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. If the location of the failed disk is not known, do the following:

a. Look for the amber Drive Fault LED on the 2U24 Operator Control Panel (OCP) to find the drawer containing the faulty disk.

Figure 46. 4U24 Operator Control Panel

Table 4. Operator Control Panel: 4U24 LED Indicators

LED             State          Description
System Power    Steady green   AC power is applied to the enclosure.
Module Fault    Steady amber   Indicates one of the following: a Power Cooling Module fault, an ESM fault, or an over- or under-temperature fault condition. Refer to the individual module fault LEDs.
Logical Fault   Steady amber   Indicates failure of a disk drive.

b. When the faulty drawer is located, look for the amber Drive Fault LED on the failed disk.

The amber LED indicates a problem with the disk.

Figure 47. 2.5-inch Disk LEDs

When viewed from the front of the 2U24 enclosure, the slot numbers start from the left (slot 0) and continue left to right (slot 23). The dm_report utility reports drive data starting from slot 0.

2. Log in to the primary MGMT node via SSH (user name admin and the customer’s password).

[ADMIN]$ ssh -l admin primary_MGMT_node

3. Log in to the OSS node via SSH:

[MGMT0]$ ssh oss_node

4. If the location of the failed disk is already known, go to Remove and Install a 2.5-inch 2U24 Disk. Otherwise, verify the location of the failed disk by running the dm_report command:

[admin@n000]$ sudo dm_report

The dm_report command provides status for all drives in the enclosure. The various status types are:

● OK/Failed: for drives owned by the node where the command was run

● Foreign: for drives owned by the other node

● Hot Spare: for hot spare drives owned by either node

● Empty: for slots that do not contain a drive

5. Identify the faulty drive by examining the dm_report slot location and the disk drive with the amber Drive Fault LED illuminated. The faulty drive may also be designated as "Failed" in the dm_report output.
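
The slot check in steps 4 and 5 can be scripted. The following Python sketch assumes that each drive line in the dm_report output follows the "slot: N, wwn: ..., cap: ..., dev: ..., parts:N, status: ..." layout of the hot-spare sample line shown later in this procedure. The function names and the sample text are illustrative and are not part of the Sonexion software.

```python
import re

# Assumed line layout, based on the single dm_report sample line in this procedure.
DRIVE_LINE = re.compile(
    r"slot:\s*(?P<slot>\d+),\s*wwn:\s*(?P<wwn>\w+),\s*cap:\s*(?P<cap>\d+),"
    r"\s*dev:\s*(?P<dev>\w+),\s*parts:\s*(?P<parts>\d+),\s*status:\s*(?P<status>.+)"
)


def parse_dm_report(text):
    """Return one dict per drive line that matches the assumed layout."""
    drives = []
    for line in text.splitlines():
        match = DRIVE_LINE.search(line)
        if match:
            drive = match.groupdict()
            drive["slot"] = int(drive["slot"])
            drive["status"] = drive["status"].strip()
            drives.append(drive)
    return drives


def slots_with_status(drives, status):
    """List the slot numbers whose status equals the given string (case-insensitive)."""
    return [d["slot"] for d in drives if d["status"].lower() == status.lower()]


# Hypothetical sample: only the slot 5 line is taken from this document.
SAMPLE = """\
slot: 4, wwn: 5000c50047b5a000, cap: 450098159104, dev: sdl, parts:0, status: Failed
slot: 5, wwn: 5000c50047b5a81f, cap: 450098159104, dev: sdm, parts:0, status: Hot Spare
"""

if __name__ == "__main__":
    drives = parse_dm_report(SAMPLE)
    print("Failed slots:", slots_with_status(drives, "Failed"))
    print("Hot spare slots:", slots_with_status(drives, "Hot Spare"))
```

Always confirm a slot reported this way against the amber Drive Fault LED before removing a disk.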

Remove and Install a 2.5-inch 2U24 Disk

6. If the anti-tamper lock is engaged, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key.

7. Remove the lock key.

8. Release the disk by pressing the latch handle towards the handle hinge as shown in the following figure.

Figure 48. Releasing 2.5-inch Disk Latch Handle

9. Gently remove the disk approximately 1 inch (25 mm) from the drive slot, and then wait 30 seconds.

Figure 49. Removing a 2.5-inch Disk

10. Fully remove the disk from the drive bay.

11. Wait for the system to detect the missing drive.

On a quiescent system, it takes approximately 30 seconds for the missing drive to be detected, and longer on a busy system.

CAUTION: All drive slots must have disks or dummy carriers installed to maintain balanced airflow.

IMPORTANT:

● Ensure that the new disk is oriented so the drive handle opens downward.

● A disk cannot be installed if the anti-tamper lock is activated outside the enclosure.

12. Verify that the disk handle is released and in the open position.

Figure 50. 2.5-inch Disk with Handle in Open Position

13. Insert the new disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops.

Make certain the disk orients up and the handle opens from the left.

Figure 51. Installing a 2.5-inch Disk

14. Seat the disk by pressing the handle latch.

A click is audible as the handle latch engages.

Figure 52. Seated 2.5-inch Disk

15. Verify that the new disk is in the same orientation as the other disks in the enclosure.

16. Activate the anti-tamper locks.

a. Carefully insert the lock key into the lock socket.

b. Rotate the key clockwise until the red indicator is visible in the opening above the key.

c. Remove the lock key.

Verify 2U24 Hot Spare, Arrays, Statuses

17. Verify that the new disk is registered as the hot spare:

[admin@n000]$ sudo dm_report

Depending on the cluster's load and drive spin-up time, it may take a few minutes for the dm_report output to show the new disk registered as the hot spare.

18. If the new disk comes up as a hot spare, go to step 21 on page 54. If it comes up as anything else, continue to the following step.

19. If the new disk comes up as anything other than hot spare, clear the superblock information:

[admin@n000]$ sudo mdadm --zero-superblock --force /dev/sdXX

where XX is the SD device number.

20. Verify that the new disk is registered as the 'hot spare':

[admin@n000]$ sudo dm_report

After the zero superblock command the new drive should show as a hot spare.

slot: 5, wwn: 5000c50047b5a81f, cap: 450098159104, dev: sdm, parts:0, status: Hot Spare

21. Verify the state of the arrays in the 2U24 enclosure:

[admin@n000]$ sudo cat /proc/mdstat
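
In /proc/mdstat, each array reports its members inside square brackets as U (up) or _ (missing or failed), for example [UU] or [U_]. The following Python sketch flags arrays that report a missing member. The sample text is hypothetical, and the array and device names on a real MMU will differ.

```python
import re

ARRAY_HEADER = re.compile(r"^(md\d+)\s*:")   # e.g. "md64 : active raid1 ..."
MEMBER_FLAGS = re.compile(r"\[([U_]+)\]")     # e.g. "[UU]" or "[U_]"


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member flags contain an underscore."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        flags = MEMBER_FLAGS.search(line)
        if current and flags:
            if "_" in flags.group(1):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical /proc/mdstat content, for illustration only.
SAMPLE = """\
Personalities : [raid1]
md64 : active raid1 sdb[0] sdc[1]
      439548848 blocks super 1.2 [2/2] [UU]

md67 : active raid1 sdd[0] sde[1](F)
      439548848 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print("Degraded arrays:", degraded_arrays(SAMPLE))
```

On a live node, the same function can be given the contents of /proc/mdstat instead of the sample string.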

22. Verify that the OCP status and Drive Fault LEDs are normal.

a. Verify that the Module Fault LED on the Operator Control Panel (OCP) of the 2U24 enclosure is green.

b. Verify that no Drive Fault LEDs are illuminated.

If a recovery or rebuild is in progress on the MDRAID array, the Activity LEDs will be illuminated for each drive in the array. The Logical Fault LED on the OCP of the 2U24 enclosure will also be illuminated.

23. Log out of the MGMT node.

Replace a 2U24 MMU OSS Controller

Prerequisites

Part number

101228800, Sonexion 900 Series Application Controller 32GB (Additional GbE port)

Time

1.5 hours

Interrupt levels

● Remove/replace SSU controller: Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)

● USM firmware update needed: Interrupt (requires taking the Lustre file system offline. Perform a USM upgrade only if the firmware version is out of date)

Tools

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

● ESD strap

About this task

Use this procedure to remove and replace a failed OSS controller in the Sonexion 900 2U24 MMU.

Subtasks:

● Install 2U24 MMU OSS Controller

● Verify Function and State of 2U24 OSS Controller

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Units (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

IMPORTANT: The Sonexion architecture requires the lower Sonexion 900 MMU OSS controller to be installed upside-down in the enclosure. This causes the controller's markings to be upside-down and the cable sequence of the two RJ-45 network cables and the QSFP+ cable to be reversed. When a Sonexion 900 MMU OSS controller is replaced, reconnect any cables to the same ports as on the failed controller.

When reconnecting cables, verify the controller and port marked on the label against the controller and port in the enclosure.

Manual backups

Sonexion performs node backups (which are used to restore the node) at 3 AM. To perform this procedure when a backup is not available (for example, right after installation), it is necessary to force a backup. To force a backup, log in to the active MGMT node and run:

[ADMIN]$ ssh -l admin active_MGMT_node

[admin@n000]$ ssh nfsserv sudo /opt/xyratex/bin/backup_mgmt.sh

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. Determine the physical and logical location (hostname) of the failed controller in the OSS.

2. Log in to the active MGMT node via SSH (user name admin and the customer's password).

[ADMIN]$ ssh -l admin active_MGMT_node

3. From the active MGMT node, fail over the resources from the affected node to its HA partner:

[admin@n000]$ cscli failover -n nodes

Where nodes is the name of each node requiring failover.

For example, for a failure on node x04, fail over the node's resources to node x05 as follows:

[admin@n000]$ cscli failover -n x04

4. Log in to the active MGMT node.

[admin@n000]$ ssh active_MGMT_node

5. Display the status of the nodes using the crm_mon utility:

[admin@n000]$ sudo crm_mon -1r

[admin@snx11000n000 ~]$ sudo crm_mon -1r
============
Last updated: Tue May 21 07:32:30 2013
Last change: Tue May 21 07:29:33 2013 via cibadmin on snx11000n001
Stack: Heartbeat
Current DC: snx11000n000 (1243452b-8777-41b7-a721-9f34bc3162a1) - partition with quorum
Version: 1.1.6.1-3.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
43 Resources configured.
============
Online: [ snx11000n000 snx11000n001 ]
Full list of resources:
Clone Set: clone-gem-ipmi-stonith [gem-ipmi-stonith]
Started: [ snx11000n000 snx11000n001 ]
snx11000n001_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n001
snx11000n000_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n000
baton (ocf::heartbeat:baton): Started snx11000n000
snx11000n001_ibstat (ocf::heartbeat:ibstat): Started snx11000n001
snx11000n000_ibstat (ocf::heartbeat:ibstat): Started snx11000n000
Resource Group: snx11000n001_md66-group
snx11000n001_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n001_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n001_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n001_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n001
prm-httpd (lsb:httpd): Started snx11000n001
prm-mysql (lsb:mysqld): Started snx11000n001
prm-nfslock (lsb:nfslock): Started snx11000n001
prm-bebundd (lsb:bebundd): Started snx11000n001
prm-cerebrod (lsb:cerebrod): Started snx11000n001
prm-conman (lsb:conman): Started snx11000n001
prm-dhcpd (lsb:dhcpd): Started snx11000n001
prm-xinetd (lsb:xinetd): Started snx11000n001
Clone Set: cln-syslogng [prm-syslogng]
Started: [ snx11000n000 snx11000n001 ]
prm-nodes-monitor (lsb:nodes-monitor): Started snx11000n001
Clone Set: cln-ses_mon [prm-ses_monitor]
Started: [ snx11000n000 snx11000n001 ]
Clone Set: cln-nsca_passive_checks [prm-nsca_passive_checks]
Started: [ snx11000n000 snx11000n001 ]
Resource Group: grp-icinga
prm-icinga (lsb:icinga): Started snx11000n001
prm-nsca (lsb:nsca): Started snx11000n001
prm-npcd (lsb:npcd): Started snx11000n001
prm-repo-local (ocf::heartbeat:Filesystem): Started snx11000n001
prm-repo-remote (ocf::heartbeat:Filesystem): Started snx11000n000
prm-db2puppet (ocf::heartbeat:oneshot): Started snx11000n001
Clone Set: cln-puppet [prm-puppet]
Started: [ snx11000n000 snx11000n001 ]
prm-nfsd (ocf::heartbeat:nfsserver): Started snx11000n001
prm-vip-eth0-mgmt (ocf::heartbeat:IPaddr2): Started snx11000n001
prm-vip-eth0-nfs (ocf::heartbeat:IPaddr2): Started snx11000n001
Resource Group: snx11000n000_md64-group
snx11000n000_md64-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n000_md64-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n000_md64-stop (ocf::heartbeat:XYSTOP): Started snx11000n001
Resource Group: snx11000n000_md67-group
snx11000n000_md67-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n000_md67-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n000_md67-stop (ocf::heartbeat:XYSTOP): Started snx11000n001

6. After verifying that the failover has occurred, power off the failed controller as follows. From the active MGMT node, enter:

[admin@n000]$ cscli power_manage -n nodeXX --power-off

Where nodeXX is the name of the affected MGMT node.

If the failover has occurred and the failed controller is already powered off, proceed to the next step.

Install 2U24 MMU OSS Controller

Perform the following steps at the back of the rack. Wear an ESD strap before removing the OSS controllers.

7. Make a note of the new controller's eth0 MAC address, which is usually labeled on the device.

8. Plug in the two RJ-45 network cables to the new controller.

9. Plug in the QSFP+ cable to the new controller.

10. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel).

11. Open a terminal session with these settings:

Bits per second: 115200
Data bits: 8
Parity: None
Stop bits: 1
Flow control: None

The Function Keys are set to VT100+.

This serial connection allows monitoring the startup for any issues.

12. Completely insert the new controller into the MMU, until the locking lever engages and the unit is properly seated in the chassis.

13. Log in to the active MGMT node (user name admin and the customer's password):

[ADMIN]$ ssh -l admin active_MGMT_node

14. Update the MAC address in t0db to reflect the MAC of the new device (recorded in step 7 on page 58):

[admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='newmac' where hostname='nodename'"

For example:

[admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='00:50:CC:79:03:06' where hostname='snx11000n000'"
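
A mistyped MAC address in step 14 leaves the database entry pointing at a device that does not exist. The following Python sketch checks the address and host name format before assembling the command line used in this step. The function name is illustrative; the database, table, and column names are copied from the command above, and upper-casing the address simply mirrors the example and is an assumption about what the database expects.

```python
import re

MAC_PATTERN = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")
HOSTNAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*$")


def build_mac_update_command(new_mac, hostname):
    """Return the mysql command line for step 14, or raise ValueError."""
    if not MAC_PATTERN.match(new_mac):
        raise ValueError(f"not a colon-separated MAC address: {new_mac!r}")
    if not HOSTNAME_PATTERN.match(hostname):
        raise ValueError(f"unexpected characters in host name: {hostname!r}")
    statement = (
        f"update netdev set mac_address='{new_mac.upper()}' "
        f"where hostname='{hostname}'"
    )
    return f'sudo mysql t0db -e "{statement}"'


if __name__ == "__main__":
    # Values taken from the example in this step.
    print(build_mac_update_command("00:50:CC:79:03:06", "snx11000n000"))
```

The sketch only builds the command string; it does not connect to the database or run anything.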

15. Apply the new configuration:

[admin@n000]$ sudo /opt/xyratex/bin/beUpdatePuppet -s

16. Enable management restore:

[admin@n000]$ cscli restore_mgmt –enable

The expected output should be:

[admin@snx11000n101 ~]$ cscli restore_mgmt –enable
restore_mgmt: Done

17. Power on the replacement controller by pressing the power button on the back of the controller.

Figure 53. OSS Rear Panel

Verify Function and State of 2U24 OSS Controller

18. Use the serial connection to monitor the system recovery.

a. Wait for the login prompt and log in using user name admin and the customer's password.

b. Set the IPMI address to the address of the previous controller. Run the following set of commands in order:

[admin@n000]$ sudo ipmitool lan set 1 access off
[admin@n000]$ sudo ipmitool lan set 1 ipsrc static
[admin@n000]$ sudo ipmitool lan set 1 ipaddr ip_address
[admin@n000]$ sudo ipmitool lan set 1 netmask 255.255.0.0
[admin@n000]$ sudo ipmitool lan set 1 access on
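
Because the five ipmitool commands must be run in order and differ only in the address, they can be generated from a single validated value. The following Python sketch does that; the function name and the example address are illustrative, and the default netmask and channel number are the ones used in this step.

```python
import ipaddress


def ipmi_lan_commands(ip_address, netmask="255.255.0.0", channel=1):
    """Return the ipmitool command lines from step 18.b, in order."""
    # IPv4Address raises ValueError (AddressValueError) on malformed input.
    ipaddress.IPv4Address(ip_address)
    ipaddress.IPv4Address(netmask)
    prefix = f"sudo ipmitool lan set {channel}"
    return [
        f"{prefix} access off",
        f"{prefix} ipsrc static",
        f"{prefix} ipaddr {ip_address}",
        f"{prefix} netmask {netmask}",
        f"{prefix} access on",
    ]


if __name__ == "__main__":
    # 172.16.0.101 is a hypothetical address; use the previous controller's IPMI address.
    for command in ipmi_lan_commands("172.16.0.101"):
        print(command)
```

The sketch prints the commands for review and does not execute them.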

19. Verify that the newly installed controller has booted up as follows. From the MGMT node run:

[admin@n000]$ sudo pdsh -a uname | dshbak -c

Full bootup takes approximately 15 minutes. During bootup there is a period of approximately three minutes where it appears as though nothing is happening.

20. Log in to the newly installed controller, once it has rebooted, from the active MGMT node:

[admin@n000]$ ssh new_controller

21. Check ifconfig to make certain the ports present on the active MGMT node are present on the new node:

[admin@n000]$ sudo ifconfig

For example, if eth0 and eth1 are the current ports seen on the active MGMT node, the output would look similar to the following:

[admin@snx11000n000]$ sudo ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:CC:79:29:CE
     inet addr:172.16.2.3 Bcast:172.16.255.255 Mask:255.255.0.0
     inet6 addr: fe80::250:ccff:fe79:29ce/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:900 Metric:1
     RX packets:489744 errors:0 dropped:0 overruns:0 frame:0
     TX packets:277298 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:318452263 (303.6 MiB) TX bytes:45163100 (43.0 MiB)
     Interrupt:18 Memory:d0200000-d0920000
eth1 Link encap:Ethernet HWaddr 00:50:CC:79:29:CF
     inet addr:10.106.54.85 Bcast:10.106.55.255 Mask:255.255.252.0
     inet6 addr: fe80::250:ccff:fe79:29cf/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:900 Metric:1
     RX packets:47071 errors:0 dropped:0 overruns:0 frame:0
     TX packets:5531 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:3808228 (3.6 MiB) TX bytes:903450 (882.2 KiB)
     Interrupt:19 Memory:d0800000-d0820000

22. Verify that the new controller is online:

[admin@n000]$ sudo crm_mon -1r

When both MGMT nodes are online and still in the failed over state, the crm_mon -1r output (full output includes 3 RAID groups) looks as follows:

[admin@snx11000n000 ~]$ sudo crm_mon -1r
============
Last updated: Fri Apr 26 07:13:09 2013
Last change: Fri Apr 26 06:15:47 2013 via cibadmin on snx11000n000
Stack: Heartbeat
Current DC: snx11000n000 (291dd14a-472f-4f77-b00b-b2d654c0f3f5) - partition with quorum
Version: 1.1.6.1-3.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
43 Resources configured.
============
Online: [ snx11000n000 snx11000n001 ]
Full list of resources:
Clone Set: clone-gem-ipmi-stonith [gem-ipmi-stonith]
Started: [ snx11000n000 snx11000n001 ]
snx11000n001_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n001
snx11000n000_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n000
baton (ocf::heartbeat:baton): Started snx11000n000
snx11000n001_ibstat (ocf::heartbeat:ibstat): Started snx11000n001
snx11000n000_ibstat (ocf::heartbeat:ibstat): Started snx11000n000
Resource Group: snx11000n001_md66-group
snx11000n001_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n000
snx11000n001_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n000
snx11000n001_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n000
snx11000n001_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n000
prm-httpd (lsb:httpd): Started snx11000n000
prm-mysql (lsb:mysqld): Started snx11000n000
prm-nfslock (lsb:nfslock): Started snx11000n000
prm-bebundd (lsb:bebundd): Started snx11000n000
prm-cerebrod (lsb:cerebrod): Started snx11000n000
prm-conman (lsb:conman): Started snx11000n000
prm-dhcpd (lsb:dhcpd): Started snx11000n000
prm-xinetd (lsb:xinetd): Started snx11000n000
Clone Set: cln-syslogng [prm-syslogng]
Started: [ snx11000n001 snx11000n000 ]
prm-nodes-monitor (lsb:nodes-monitor): Started snx11000n000
Clone Set: cln-ses_mon [prm-ses_monitor]
Started: [ snx11000n001 snx11000n000 ]
Clone Set: cln-nsca_passive_checks [prm-nsca_passive_checks]
Started: [ snx11000n001 snx11000n000 ]
Resource Group: grp-icinga
prm-icinga (lsb:icinga): Started snx11000n000
prm-nsca (lsb:nsca): Started snx11000n000
prm-npcd (lsb:npcd): Started snx11000n000
prm-repo-local (ocf::heartbeat:Filesystem): Started snx11000n000
prm-repo-remote (ocf::heartbeat:Filesystem): Started snx11000n001
prm-db2puppet (ocf::heartbeat:oneshot): Started snx11000n000
Clone Set: cln-puppet [prm-puppet]
Started: [ snx11000n001 snx11000n000 ]
prm-nfsd (ocf::heartbeat:nfsserver): Started snx11000n000
prm-vip-eth0-mgmt (ocf::heartbeat:IPaddr2): Started snx11000n000
prm-vip-eth0-nfs (ocf::heartbeat:IPaddr2): Started snx11000n000
Resource Group: snx11000n000_md64-group
snx11000n000_md64-raid (ocf::heartbeat:XYRAID): Started snx11000n000
snx11000n000_md64-fsys (ocf::heartbeat:XYMNTR): Started snx11000n000
snx11000n000_md64-stop (ocf::heartbeat:XYSTOP): Started snx11000n000
Resource Group: snx11000n000_md67-group
snx11000n000_md67-raid (ocf::heartbeat:XYRAID): Started snx11000n000
snx11000n000_md67-fsys (ocf::heartbeat:XYMNTR): Started snx11000n000
snx11000n000_md67-stop (ocf::heartbeat:XYSTOP): Started snx11000n000
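
The check in step 22 comes down to confirming that both node names appear on the Online: line of the crm_mon -1r output. The following Python sketch extracts that list and reports any expected node that is absent. It assumes the "Online: [ ... ]" layout shown in the sample output, and the function names are illustrative.

```python
import re

ONLINE_LINE = re.compile(r"Online:\s*\[\s*([^\]]*)\]")


def online_nodes(crm_mon_text):
    """Return the node names listed on the first 'Online:' line, if any."""
    match = ONLINE_LINE.search(crm_mon_text)
    return match.group(1).split() if match else []


def missing_nodes(crm_mon_text, expected):
    """Return the expected nodes that are not reported online."""
    online = set(online_nodes(crm_mon_text))
    return [node for node in expected if node not in online]


# Excerpt from the sample output in this step.
SAMPLE = """\
============
Online: [ snx11000n000 snx11000n001 ]
Full list of resources:
"""

if __name__ == "__main__":
    expected = ["snx11000n000", "snx11000n001"]
    print("Online:", online_nodes(SAMPLE))
    print("Missing:", missing_nodes(SAMPLE, expected))
```

An empty "Missing" list only shows that both nodes are online; the resource lines still need to be reviewed to see where each resource is running.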

23. From the active MGMT node, fail back the resources to balance the load between the affected nodes:

[admin@n000]$ cscli failback -n nodes

Where nodes are the names of the node(s) that previously failed over. For example:

[admin@n000]$ cscli failback -n xx04

24. Log in to the active MGMT node via SSH.

[ADMIN]$ ssh -l admin active_MGMT_node

25. Display the status of the nodes using the crm_mon utility:

[admin@n000]$ sudo crm_mon -1r

Below is an example of the node status.

[admin@snx11000n000 ~]$ sudo crm_mon -1r
============
Last updated: Tue May 21 07:25:38 2013
Last change: Tue May 21 05:56:26 2013 via crm_resource on snx11000n000
Stack: Heartbeat
Current DC: snx11000n000 (1243452b-8777-41b7-a721-9f34bc3162a1) - partition with quorum
Version: 1.1.6.1-3.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
43 Resources configured.
============
Online: [ snx11000n000 snx11000n001 ]
Full list of resources:
Clone Set: clone-gem-ipmi-stonith [gem-ipmi-stonith]
Started: [ snx11000n000 snx11000n001 ]
snx11000n001_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n001
snx11000n000_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n000
baton (ocf::heartbeat:baton): Started snx11000n000
snx11000n001_ibstat (ocf::heartbeat:ibstat): Started snx11000n001
snx11000n000_ibstat (ocf::heartbeat:ibstat): Started snx11000n000
Resource Group: snx11000n001_md66-group
snx11000n001_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n001_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n001_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n001_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n001
prm-httpd (lsb:httpd): Started snx11000n000
prm-mysql (lsb:mysqld): Started snx11000n000
prm-nfslock (lsb:nfslock): Started snx11000n001
prm-bebundd (lsb:bebundd): Started snx11000n000
prm-cerebrod (lsb:cerebrod): Started snx11000n000
prm-conman (lsb:conman): Started snx11000n000
prm-dhcpd (lsb:dhcpd): Started snx11000n001
prm-xinetd (lsb:xinetd): Started snx11000n001
Clone Set: cln-syslogng [prm-syslogng]
Started: [ snx11000n000 snx11000n001 ]
prm-nodes-monitor (lsb:nodes-monitor): Started snx11000n000
Clone Set: cln-ses_mon [prm-ses_monitor]
Started: [ snx11000n000 snx11000n001 ]
Clone Set: cln-nsca_passive_checks [prm-nsca_passive_checks]
Started: [ snx11000n000 snx11000n001 ]
Resource Group: grp-icinga
prm-icinga (lsb:icinga): Started snx11000n000
prm-nsca (lsb:nsca): Started snx11000n000
prm-npcd (lsb:npcd): Started snx11000n000
prm-repo-local (ocf::heartbeat:Filesystem): Started snx11000n001
prm-repo-remote (ocf::heartbeat:Filesystem): Started snx11000n000
prm-db2puppet (ocf::heartbeat:oneshot): Started snx11000n000
Clone Set: cln-puppet [prm-puppet]
Started: [ snx11000n001 snx11000n000 ]
prm-nfsd (ocf::heartbeat:nfsserver): Started snx11000n001
prm-vip-eth0-mgmt (ocf::heartbeat:IPaddr2): Started snx11000n000
prm-vip-eth0-nfs (ocf::heartbeat:IPaddr2): Started snx11000n001
Resource Group: snx11000n000_md64-group
snx11000n000_md64-raid (ocf::heartbeat:XYRAID): Started snx11000n000
snx11000n000_md64-fsys (ocf::heartbeat:XYMNTR): Started snx11000n000
snx11000n000_md64-stop (ocf::heartbeat:XYSTOP): Started snx11000n000
Resource Group: snx11000n000_md67-group
snx11000n000_md67-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n000_md67-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n000_md67-stop (ocf::heartbeat:XYSTOP): Started snx11000n001

26. Compare the USM and GEM firmware versions between the Sonexion 900 MMU OSS controllers to make certain they match. If they do not match, update the new controller so that its firmware version matches the version on the existing controller.

Refer to Leveling Sonexion USM Firmware for instructions on how to compare and update firmware versions. If the firmware versions match, go to the next step.

27. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the new controller.

Replace the 2U24 MMU Chassis

Prerequisites

Part number

100853300 - Base Power Controlled 6G 2U24, no PSUs, no Controllers, with hub caps

Time

2 hours

Interrupt level

Interrupt (requires taking the Lustre file system offline. Perform a USM upgrade only if the firmware version is out of date)

Tools

● Lock key T10T

● Phillips screwdriver (medium)

● ESD strap

● One of the following:

○ Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)

○ KVM with attached cable

Requirements

● The new chassis and the chassis being replaced must be on the same GEM version. If the GEM versions are different, this procedure may be unsuccessful.

● The size and weight of the 2U24 chassis require two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move the 2U24 chassis.

About this task

Use this procedure to remove and replace a defective chassis in a 2U24 MMU on the Sonexion 900. Only the chassis is actually replaced; components from the original (failed) chassis are removed and reused in the new chassis.

Subtasks:

● Install New MMU Chassis and Replace Components

● Power On MMU Components and Verify

The Sonexion 900 uses three enclosure types:

● A 2U24 Metadata Management Unit (MMU) enclosure, which includes two Sonexion 900 MMU OSS controllers, two power cooling modules (PCM), and 16 DDICs.

● A 4U24 Scalable Storage Unit (SSU) enclosure, which includes two Sonexion 900 SSU OSS controllers, four PCMs, and 23 DDICs (disk drive in carrier, usually referred to simply as disks).

● Multiple 4U24 Expandable Storage Unit (ESU) enclosures, each of which includes two SAS EBOD controllers, two PCMs, and 21 disks.

In this procedure, only the defective chassis is replaced; all other components are reused in the new 2U24 chassis. This procedure includes steps to stop all client I/O and the file system, replace the failed 2U24 chassis, verify the operation of the new 2U24 chassis, and return the Sonexion system to normal operation.

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

Procedure

1. If the location of the failed 2U24 MMU chassis is not known, look for a Fault LED (amber) on the failed MMUchassis (front panel).

2. Log in to the primary MGMT node via SSH (user name admin and the customer's password):

[ADMIN]$ ssh -l admin primary_MGMT_node

3. Stop the Lustre file system:

[admin@n000]$ cscli unmount -f fsname

4. Verify that the Lustre file system is stopped on all nodes:

[admin@n000]$ cscli fs_info

5. After verifying the Lustre file system has stopped, power off the Sonexion 900 as described in Sonexion 900 Power On and Power Off Procedures.

6. Remove the MMU OSS controllers:

a. Turn off the power switches on both PCMs in the chassis.

b. Unplug the two RJ-45 network cables.

c. Unplug the QSFP+ cable from the top MMU OSS controller.

d. Release the controller latch by grasping it between the thumb and forefinger and gently squeezing.

e. Using the latch as a handle, carefully remove the controller from the enclosure.

Mark the OSS controller with its current slot (top or bottom) in the chassis so that it can be reinstalled in the same slot in the new chassis.

f. Repeat the controller removal steps for the second MMU OSS controller.

7. Remove the PCMs by performing the following steps at the back of the rack:

a. On one of the PCMs, disconnect the power cord by moving the bale towards the center of the PCM and removing the cord.

b. Release the module latch by grasping it between the thumb and forefinger and gently squeezing it.

Figure 54. Releasing Latch on MMU PCM

c. Using the latch as a handle, carefully remove the PCM from the enclosure.

WARNING: Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.

Figure 55. Removing an MMU PCM

d. Repeat the module removal steps for the second PCM.

8. Perform the following steps at the front of the rack to remove disks from the 2U24. Label each drive with its slot location, because the drives must be installed in the same order in the new 2U24 chassis. Wear an ESD strap before removing the disks.

a. If the anti-tamper lock is engaged, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key.

b. Remove the lock key.

Figure 56. Drive Carrier Unlocked and Locked

c. Release the disk by pressing down the latch and rotating the latch downward.

Figure 57. Releasing the Latch Handle

d. Gently remove the disk approximately 1 inch (25 mm) from the drive slot, then wait 30 seconds for the drive to stop spinning.

Figure 58. Removing a disk

e. Mark the drive with its current drive slot number in the chassis so it can be reinstalled in the same slot in the new chassis. From the front of the rack, the drive slots are numbered 0 to 23 (left to right).

f. Fully remove the disk from the drive bay.

g. Repeat the disk drive removal steps for the remaining disks.

9. Remove the failed chassis from the front of the rack:

a. Remove the left and right front flange caps by pulling the caps free.

b. Disconnect the chassis from the rack by removing the screw from the left and right flanges (now exposedafter removing the flange caps).

c. With a second person, remove the chassis from the rack.
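
Step 8 requires every disk to return to its original slot, and the slots are numbered 0 to 23. One possible way to back up the physical labels is to keep a small text record while removing the disks. The sketch below assumes a hypothetical record file with one "slot identifier" pair per line (for example, the slot number followed by whatever is written on the drive label) and flags slot numbers outside 0 to 23 or recorded twice. The file format and the function name check_slot_records are illustrative assumptions, not part of the Sonexion tooling.

```shell
# Validate a hand-kept record of removed disks.
# Each line of the file: "<slot> <label>", with slot in the range 0..23.
# Hypothetical helper and file format; adjust to local practice.
check_slot_records() {
    awk '
        BEGIN { bad = 0 }
        # Reject anything that is not a whole number between 0 and 23.
        $1 !~ /^[0-9]+$/ || $1 + 0 > 23 { print "bad slot: " $0; bad = 1; next }
        # Report a slot number that appears more than once.
        seen[$1 + 0]++ { print "duplicate slot: " $1; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

The function prints nothing and returns 0 when the record is consistent, so it can be run once after the last disk is removed and again before reinstalling disks in the new chassis.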

Install New MMU Chassis and Replace Components

Perform these steps at the front of the rack.

10. Install the new chassis in the rack:

a. With the chassis on a bench, remove the left and right front flange caps by pulling the caps free. The caps simply snap onto the flanges.

b. With a second person, move the 2U24 chassis into the rack. Carefully align the guide on each side of the chassis with the groove on the rail assembly and gently push the chassis completely into the rack.

c. Connect the chassis to the rack by installing a screw into the left and right flanges.

d. Install the flange caps by pressing them into position. They snap into place on the flanges.

11. Perform the following steps at the front of the rack to re-install disks. Install each drive in the same slot where it was located in the old 2U24 MMU chassis, oriented so that the drive handle opens downward. A disk cannot be installed if the anti-tamper lock is activated outside the enclosure.

a. Verify that the disk handle is released and in the open position.

Figure 59. 2.5-inch Disk with Handle in Open Position

b. Insert each disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops.

Make certain the disk orients up and the handle opens from the left.

Figure 60. Installing a 2.5-inch Disk

c. Seat the disk by pressing the handle latch and rotating it to the closed position.

There will be an audible click as the handle latch engages.

d. Verify that each disk is in the same orientation as the other disks in the enclosure.

e. Carefully insert the lock key into the lock socket and rotate it clockwise until the red indicator is visible in the opening above the key.

f. Remove the lock key.

g. Repeat the disk drive installation steps for the remaining disks.

12. To replace the PCMs, perform the following steps at the back of the rack. Carefully inspect each PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If a PCM is damaged, do not install it; obtain another PCM.

a. Verify that the power switch on each PCM is in the OFF position.

b. With the PCM handle in the open position, slide the PCM into the empty bay at the rear of the 2U24 enclosure.

Figure 61. Installing an MMU PCM

c. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. A click should be heard as the handle latch engages.

d. Connect a power cord to each PCM.

e. Place the bale over and onto the power cord.

f. Repeat the module installation steps for the second PCM.

13. Inspect the OSS controller for the 2U24 MMU for damage, especially to the interface connector. If the controller is damaged, do not install it; obtain another OSS controller.

a. With the latch in the released (open) position, slide the Sonexion 900 MMU OSS controller into the enclosure until it completely seats and engages the latch.

b. Secure the controller by closing the latch.

There will be an audible click as the latch engages.

c. Repeat the controller installation steps for the second OSS controller.

Power On MMU Components and Verify

14. Plug in the two RJ-45 network cables to the OSS controllers.

15. Plug in the QSFP+ cable to the OSS controllers.

16. Turn on the power switches on both PCMs.

17. Verify that the indicator LEDs on the OSS controllers are lit and the LEDs on the PCMs and 2U24 chassis (front panel) are blinking green.

Figure 62. 2U24 Operator's Panel

Table 5. Operator Control Panel: 4U24 LED Indicators

| LEDs | State | Description |
|---|---|---|
| System Power | Steady Green | AC power is applied to the enclosure. |
| Module Fault | Steady Amber | Indicates one of the following: Power Cooling Module fault; ESM fault; over or under temperature fault condition. Refer to individual module fault LEDs. |
| Logical Fault | Steady Amber | Indicates failure of a disk drive. |

18. Power on the Sonexion, as described in Sonexion 900 Power On and Power Off Procedures.

19. Compare the USM and GEM firmware versions between the new chassis and the Sonexion 900 MMU OSS controllers to make certain they match.

Refer to Leveling Sonexion USM Firmware for instructions on how to compare and update firmware versions.

20. Start the Lustre file system:

[admin@n000]$ cscli mount -f fsname

21. Verify that the Lustre file system is started on all nodes:

[admin@n000]$ cscli fs_info

22. After verifying that the Lustre file system has started, close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial cable from the primary MGMT server.
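
The stop and start steps in this procedure use the same pair of commands: a cscli unmount or cscli mount for the file system, followed by cscli fs_info to verify the result. As a memory aid only, the sketch below prints that command sequence for a given file system name without executing anything. The function name lustre_cmds is an illustrative assumption; only the cscli commands themselves come from this procedure.

```shell
# Print the cscli commands this procedure uses to stop or start Lustre.
# Nothing is executed; run the printed commands on the primary MGMT node.
# Hypothetical helper, not a Sonexion command.
lustre_cmds() {
    action=$1   # "stop" or "start"
    fsname=$2   # file system name, as in "cscli unmount -f fsname"
    case "$action" in
        stop)  echo "cscli unmount -f $fsname" ;;
        start) echo "cscli mount -f $fsname" ;;
        *)     echo "usage: lustre_cmds stop|start fsname" >&2; return 2 ;;
    esac
    # Both directions end with the same verification command.
    echo "cscli fs_info"
}
```

For example, lustre_cmds stop fsname prints the unmount command followed by the verification command, in the order they appear in steps 3 and 4.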

Replace a Management Switch

Prerequisites

Part number

● 101018600: Brocade GbE L2 24 Port Managed Switch MPN ICX6430-24 - Single PSU - Airflow=Power Supply to Port

● 101018700: Brocade GbE L2 48 Port Managed Switch MPN ICX6430-48 - Single PSU - Airflow=Power Supply to Port

Time

1.5 hours

Interruption level

Interrupt (requires taking the Lustre file system offline)

Tools

● Phillips screwdriver (#2)

● Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity and 1 stop bit)

● RS-232 to Ethernet serial cable

● ESD strap, boots, garment or other approved methods
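
The serial settings listed above (115.2 Kbps, 8 data bits, no parity, 1 stop bit) correspond to a common 8N1 configuration. On a Linux PC they can be applied with stty, roughly as sketched below. The device name /dev/ttyS0 and both function names are assumptions for illustration; the actual serial device depends on the PC and adapter in use.

```shell
# stty flags for 115.2 Kbps, 8 data bits (cs8), no parity (-parenb),
# and 1 stop bit (-cstopb).
serial_stty_args() {
    echo "115200 cs8 -parenb -cstopb"
}

# Apply the settings to a serial device.
# The default device name is an assumption; pass the real one as $1.
configure_serial() {
    dev=${1:-/dev/ttyS0}
    # -F selects the device (GNU coreutils stty). The unquoted
    # substitution is intentional so each flag becomes its own argument.
    stty -F "$dev" $(serial_stty_args)
}
```

A terminal program still has to be attached to the device afterwards; this sketch only sets the line parameters.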

About this task

Notes and Cautions

● Only trained service personnel should perform this procedure.

● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.

● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

CAUTION: The size and weight of the Brocade switch require two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each switch.

Procedure

1. If the location of the failed Management Switch in the rack is not known, check the status of the connected (cabled) ports or look for an error indicator. The LEDs on inactive ports are off. In an operational switch, all connected ports have valid links with green LEDs and no warning or error LEDs are lit.

2. Power off the Sonexion system.

3. At the back of the rack, disconnect all network cables from the failed switch.

Refer to the cable reference placard attached to the rack for the port connection layout.

On the lower switch, the mounting tabs might catch on the PDU as the switch is pulled out. It will be a tight fit, but the switch will slide out.

4. At the front of the rack, disconnect the power cord from the failed switch.

5. At the back of the rack, remove the four retaining pan-head screws from the front of the failed switch.

6. With a second person, carefully slide the failed switch out of the rack.

7. If the switch has not yet been unpacked, do the following:

a. Place the shipping cartons on a flat surface.

b. Cut all straps securing the cartons.

c. Unpack the switch and accessories from the cartons.

8. Using the Phillips head screws (provided), attach the mounting brackets (two total) to the sides of the new switch. One bracket attaches to each side of the switch (in the front).

Each mounting bracket requires four screws.

9. With a second person, slide the switch into the rack.

10. Align the mounting brackets and the rack holes. Using two pan-head screws with nylon washers, attach each bracket to the rack.

11. Connect the power cord to the power receptacle on the switch.

IMPORTANT: The switch does not have an ON/OFF control. The switch powers on when the power cord is plugged in and power is applied. Wait for the switch to power on and complete its boot cycle (approximately 5 minutes).

12. Once the switch has booted, reconnect the network cables. Reference the cable placard attached to the rack for the connection layout.

13. Check the status of the connected (cabled) ports.

IMPORTANT: Wait for links to be established on all connected ports (green LEDs). Do not go to the next step until all connected ports have valid links.

14. Power on the Sonexion 900, as described in Sonexion 900 Power On and Power Off Procedures.
