Advanced Diagnostics 2

Embed Size (px)

Citation preview

Diagnostics

Diagnostics

SP Diags (stinger)

Spdiag Tool (galaxy)

SunVTS

PC Check

HDT (sundiag replacement)

CSTH

Herd & EDAC

MCAT

Bonnie

Memtest

Web Pages

Decoder Tools

Day 2 PM and Acknowledgements

I have borrowed/stolen/copied* the following in this presentation.

Newisys decoder from Barry Wright

HDT from Bernward Schwartz

SGR from

http://panacea/twiki/bin/view/SGR/WebHome

SP Diags for V20/40z

Not to be confused with spdiag Tool

Bootable CD (nsv 2.2.0.6 or above required) or SP based

Enable diagnostic boot in BIOS for bootable CD

NSV installed on remote system and mounted locally by NFS

.

SP Diags for V20/40z

Install diags:

cp -r /mnt/cdrom/nsv_file /mnt/nsv/

cd /mnt/nsv/

unzip -a *.zip

chmod 777 /mnt/nsv/diags/NSV_version_number/scripts

chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc

Note:Now ensure nfs is enabled on server and can export file system

sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt

sp update diags -p /mnt/diags/DIAGS_version#

SP Diags for V20/40z

diags start (for standalone)

diags start -n (on-line nic,disk,mem)

diags get state (confirm diags are loaded)

diags get tests (list diagnostics tests)

diags run tests -av

diags run tests -av >/mnt/log/diags.log

diags terminate

Ensure diags and BIOS, drivers are compatible

Diags will fail to run otherwise

SP Diags for V20/40z

diags -h this will show all syntax

diags -a -v full test

Bootable CD

diags terminate -n

diags start -n

diags run tests -a -v >diags.out &

tail -f diags.out

The spdiag Tool (Galaxy)

SP based diagnostic

Test i2c , voltage , fans , temp

Stop ipmi /etc/init.d/ipmistack stop

/usr/local/bin/spdiag 1 g4 i2ctst

Reboot SP

PC Check

Supplemental/Tools CD and now boot menu

AMD based X2100,X2100M2,X2200M2 and all new X4x40 platforms

All Intel based platforms

Monitor and keyboard

Serial port

Scripts , burn-in tests , loopback

PC Check

Front Menu:

System Information menu

Advanced Diagnostics Tests

Immediate Burn-in Testing

Deferred Burn-in Testing

Create Diagnostic Partition

Show Results Summary

Print Results Report

PC Check

Burn-in Testing:

quick.tst - requires user input, no time-out

noinput.tst no user input, good first test

full.tst requires loopback & user input

Command Line:

Example pccheck cpu.tst /BD

pccheck /? - shows all flags

pccheck suncsi.tst /IS /BD /KS /MH 30 /HMD 1m /HDD 1m /SD 5m

SUNvts

What are you trying to test/replicate ?

Local or bootable CD-ROM

Galaxy 2.2 cd contains vts6.3

GUI or command line

Unsupported platforms:

/opt/SUNWvts/lib/conf/platform.conf

smbios | grep Product

Boot with graphics head

Edit tty boot console=ttya,ttya-mode=9600,8,n,1,-

SUNvts

HDT (Hardware Debug Tool)

PLEASE USE WITH CAUTION !!!

Will hang the host if OS running

Reboot SP after use

http://panacea/twiki/bin/view/Products/Galaxydiag

Additional tools:

/usr/local/bin/collectHostStatus.sh

-nohdtl disables hdt test

/usr/local/bin/collectDebugInfo.sh

Platform Specifics

On G4:

HDT uses some signals over the i2c bus => IPMI on the SP has to be shut down. SP should be rebooted when done with hdt diags.

JTAG chain goes through all CPU modules => All slots must have CPU or filler module inserted for HDT to work on G4

Direct access to all CPU's, default is cpu 0

Other Platforms:

Only CPU0 in JTAG chain

no i2c involved, only used for platform identification

From Bernward Schwarte presentation.

Getting Started & Cautions

hdt or hdtl?

Current hdt binary and some documentation at: http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/SpBasedHdtDiag under Galaxy->Pre-OS-Diagnostics

Copy to SP: scp hdt sunservice@:/coredump

ssh sunservice@ password: changeme

cd /coredump (or check /usr/local/bin for the built in copy)

Caution: All hdt commands stop CPU's. Some hdt commands will reset/power-cycle system.

All command line parameters are interpreted as hex values !

./hdt prints syntax of all available commands

./hdt pd 0 18 0

hdt leaves CPU in HDT-mode when exiting, use -e option to exit HDT-mode

Available Commands/Diagnostics

Basics:

Single HDT command: -h * Note: this is not -help

Access io- and memory space: -mr, -mw, -ir, -iw

Access CPU registers: -rd, -rr, -rw

Single step: -hs

Control:

Reset : -xr [b c]

Stop at reset: -xs [b c p]

resource init: -hi : sets up HT routing and resources

Power On/Off : -o [0 off, 1 cycle, 2 on]

set CPU: -c G4 only

Breakpoints: -bps -bpm -bpc

exit: -e

Diagnostics

Extended:

PCI configuration space access -pr, -pw -pd, -ps

Dump commands

Machine check: -dm

DIMM SPD: -dd

CMOS: -dc

SIO: -ds

Flash: -df

HT link testing: -a

Powercycles, stops at reset vector, sets all HT links, warm resets

HDT Not Working

Depending on System state HDT can be non-functional

To capture some system/error state:

Reset system and stop at reset vector: hdt -xs b

Init HT routing and PCI bridge enumeration hdt -hi

Dump Machine check and HT link status: hdt -dm -dl

hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.7.0-------------------------------------------------------hdtDiag: Error, HDT command failed, no CFF cpu 0hdtDiag: SysIdent: HDT access failinghdtDiag: defaulting to G12X

HDT

Check Versions 0.8.0 , 0.8.3 , 0.9.9, 1.3, 1.4.1 etc

./hdt -xs

./hdt -hi

./hdt -l -q or try ./hdt -l -a

./hdt -e

Reboot SP

CSTH (Continuous System Telemetry Harness)

Calls ipmitool to create a telemetry stream of:

volt,temp,current,fans and PSU variables

Collect data and submit for analysis to engineering:

./start-csth-ipmi [--interval ]

Example:

./start-csth-ipmi test-sp admin test.pass 60 &

./stop-csth-ipmi test-sp

CSTH (Example From an x4200)

HERD (Hardware Error Report Decode)

http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD

Hardware error report and decoding from mcelog or via the command line with kernel 2.6.4 or above

Installed as RPM on top of SLES and Red Hat

Be provide by Sun Microsystems

Will report errors to messages file and service processor (if applicable)

Same command line options as mcelog

Must be run on the same host as the machine that reported the errors when using the herd -e function.

HERD (Hardware Error Report Decode)

Example from console / logs:

Example of running herd manually (pre herd install):

Mar 5 18:03:01 va64-x2200c-gmp03 herd: HARDWARE ERROR. This is *NOT* a software problem! Mar 5 18:03:01 va64-x2200c-gmp03 herd: Please contact your hardware vendor Mar 5 18:03:01 va64-x2200c-gmp03 herd: CPU 0 4 northbridge Mar 5 18:03:01 va64-x2200c-gmp03 herd: TSC fcc73b11cf Mar 5 18:03:01 va64-x2200c-gmp03 herd: ADDR 142110 Mar 5 18:03:01 va64-x2200c-gmp03 herd: Northbridge Chipkill ECC error Mar 5 18:03:01 va64-x2200c-gmp03 herd: Chipkill ECC syndrome = 11ea Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit46 = corrected ecc error Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit57 = processor context corrupt Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit61 = error uncorrected Mar 5 18:03:01 va64-x2200c-gmp03 herd: bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic' Mar 5 18:03:01 va64-x2200c-gmp03 herd: STATUS b675410011080a13 MCGSTATUS 0

# herd -e 142110000000142110: Cpu Node 0, DIMM 2

EDAC (Kernel 2.6.20.xx and above)

2 examples of edac not working & working (x2200):

Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout)memory transaction type(generic read), mem or i/o(mem access), cache level(generic) > /var/log/mce

*Note this is not collected by sysreport

RedHat implemented as a daemon

See RedHat advisory RHEA-2006-0134-7

MCAT (Machine Check Analysis Tool)

Event Source 62 - WMIxWDM

Processor Number : 0

Bank Number : 4

Time Stamp (0x): 01C856C4 58A8C10D

Error Status (0x): D4714000 E1080A13

Error Address (0x): 00000000 A047BF50

Error Misc. (0x): 00000000 00000000

Single bit errors:

Correctable ECC error

Error address valid in MCi_ADDR

Error reporting enabled

Second error

Error valid Cont: >>

Bus Error Code:

Participation processor: Local node responded to the request (RES)

Time-out: Request did not time out

Memory transaction type: Generic read (RD)

I/O: DRAM memory access (MEM)

Cache level: Generic (LG)

North Bridge Error MC4:

Extended Error Code: 0x8 - ChipKill ECC Error

Error Code: 0x0A13

DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache

ChipKill Syndrome: 0xE1E2

Error address at 2564 MB

Takes input from a Windows Event Log entry and decodes the output:

MCAT Continued

This can be gathered by running ipmitool fru:

FRU Device Description : p0.fru (ID 6)

Product Manufacturer : ADVANCED MICRO DEVICES

Product Name : DUAL CORE AMD OPTERON(TM) 275

Product Part Number : 0F21

Product Version : 02

FRU Device Description : p0.d0.fru (ID 8)

Product Manufacturer : MICRON TECHNOLOGY

Product Name : 1024MB DDR 400 (PC3200) ECC

Product Part Number : 18VDDF12872G-40BD3

Product Version : 0300

Product Serial : D7010058

Continued: >>>

FRU Device Description : p0.d1.fru (ID 9)

Product Manufacturer : MICRON TECHNOLOGY

Product Name : 1024MB DDR 400 (PC3200) ECC

Product Part Number : 18VDDF12872G-40BD3

Product Version : 0300

Product Serial : D7010056

FRU Device Description : p0.d2.fru (ID 10)

Product Manufacturer : MICRON TECHNOLOGY

Product Name : 1024MB DDR 400 (PC3200) ECC

Product Part Number : 18VDDF12872G-40BD3

Product Version : 0300

Product Serial : D701A6F4

FRU Device Description : p0.d3.fru (ID 11)

Product Manufacturer : MICRON TECHNOLOGY

Product Name : 1024MB DDR 400 (PC3200) ECC

Product Part Number : 18VDDF12872G-40BD3

Product Version : 0300

Product Serial : D701A6EE

FRU Output for this failing platform:

Manual Diagnosis

Processor Number :0 - CPU 0 (If said 4 then it would be socket CPU4, not core 4).

Error address at 2564 MB (i.e. between 2 and 3 GBytes).

From the FRU information, each DIMM is 1 Gbyte.

The DIMMs are numbered for closest to CPU outwards based on mapping.

(DIMMs should be populated from outside inward but are mapped closest to CPU outwards).

The BIOS sets up memory from DIMM0/1 outwards.

Assuming "optimal defaults":

Our Opterons use a 128-bit wide data path. DIMM0 and DIMM1 are used in a pair.

These are single-rank DIMMs but they are all the same so is "chipselect interleaving".

The first 128KB are on DIMM0 and 1. The second 128KB are on DIMM2 and 3.

2564/128 = 20.03 ----> which is in DIMM0 and DIMM1 pair.

(Always replace Opteron platform DIMMs in pairs).

Windows reporting decode is performed as follows:

Manual Diagnosis

ChipKill Syndrome: 0xE1E2

Looking this up in the table 26 of the AMD BIOS And Kernel Writer's Guide shows this is symbol 0x1a

which according to the text above 26, this symbol maps to the upper 64-bits of the 128-bit data path.

DIMM0 from 00h-0fh provides the low 64-bits, DIMM1 from 10h-1fh provides the high 64-bits.

The check bits for the lower 64-bits is 20h-21h and the check bits for the upper 64-bits is 22h-23h

Technical documentation including the AMD BIOS and Kernel Writers Guide is available from AMD via:

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html

Remember though to download the correct document for your processor revision:

SingleDual core Opteron for x2100, x2200, x4100, x4200, x4500, x4600 etc is document family 0fh.

Quad code Opteron for supported platforms is document family 10h.

Manual diagnosis continued:

Manual Diagnosis

Chipkill Syndrome Table for 0Fh CPUs 0-63 data bits

Symbol 1h 2h 3h 4h 5h 6h 7h 8h 9h ah bh ch dh eh fh 00h e821 7c32 9413 bb44 5365 c776 2f57 dd88 35a9 a1ba 499b 66cc 8eed 1afe f2df 01h 5d31 a612 fb23 9584 c8b5 3396 6ea7 eac8 b7f9 4cda 11eb 7f4c 227d d95e 846f 02h 0001 0002 0003 0004 0005 0006 0007 0008 0009 000a 000b 000c 000d 000e 000f 03h 2021 3032 1013 4044 6065 7076 5057 8088 a0a9 b0ba 909b c0cc e0ed f0fe d0df 04h 5041 a082 f0c3 9054 c015 30d6 6097 e0a8 b0e9 402a 106b 70fc 20bd d07e 803f 05h be21 d732 6913 2144 9f65 f676 4857 3288 8ca9 e5ba 5b9b 13cc aded c4fe 7adf 06h 4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f 07h 74e1 9872 ec93 d6b4 a255 4ec6 3a27 6bd8 1f39 f3aa 874b bd6c c98d 251e 51ff 08h 15c1 2a42 3f83 cef4 db35 e4b6 f177 4758 5299 6d1a 78db 89ac 9c6d a3ee b62f 09h 3d01 1602 2b03 8504 b805 9306 ae07 ca08 f709 dc0a e10b 4f0c 720d 590e 640f 0ah 9801 ec02 7403 6b04 f305 8706 1f07 bd08 2509 510a c90b d60c 4e0d 3a0e a20f 0bh d131 6212 b323 3884 e9b5 5a96 8ba7 1cc8 cdf9 7eda afeb 244c f57d 465e 976f 0ch e1d1 7262 93b3 b834 59e5 ca56 2b87 dc18 3dc9 ae7a 4fab 642c 85fd 164e f79f 0dh 6051 b0a2 d0f3 1094 70c5 a036 c067 20e8 40b9 904a f01b 307c 502d 80de e08f 0eh a4c1 f842 5c83 e6f4 4235 1eb6 ba77 7b58 df99 831a 27db 9dac 396d 65ee c12f 0fh 11c1 2242 3383 c8f4 d935 eab6 fb77 4c58 5d99 6e1a 7fdb 84ac 956d a6ee b72f

Manual Diagnosis

Chipkill Syndrome Table for 0Fh CPUs 64-128 data bits

Symbol 1h 2h 3h 4h 5h 6h 7h 8h 9h ah bh ch dh eh fh 10h 45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f 11h 63e1 b172 d293 14b4 7755 a5c6 c627 28d8 4b39 99aa fa4b 3c6c 5f8d 8d1e eeff 12h b741 d982 6ec3 2254 9515 fbd6 4c97 33a8 84e9 ea2a 5d6b 11fc a6bd c87e 7f3f 13h dd41 6682 bbc3 3554 e815 53d6 8e97 1aa8 c7e9 7c2a a16b 2ffc f2bd 497e 943f 14h 2bd1 3d62 16b3 4f34 64e5 7256 5987 8518 aec9 b87a 93ab ca2c e1fd f74e dc9f 15h 83c1 c142 4283 a4f4 2735 65b6 e677 f858 7b99 391a badb 5cac df6d 9dee 1e2f 16h 8fd1 c562 4ab3 a934 26e5 6c56 e387 fe18 71c9 3b7a b4ab 572c d8fd 924e 1d9f 17h 4791 89e2 ce73 5264 15f5 db86 9c17 a3b8 e429 2a5a 6dcb f1dc b64d 783e 3faf 18h 5781 a9c2 fe43 92a4 c525 3b66 6ce7 e3f8 b479 4a3a 1dbb 715c 26dd d89e 8f1f 19h bf41 d582 6ac3 2954 9615 fcd6 4397 3ea8 81e9 eb2a 546b 17fc a8bd c27e 7d3f 1ah 9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf 1bh cce1 4472 8893 fdb4 3155 b9c6 7527 56d8 9a39 12aa de4b ab6c 678d ef1e 23ff 1ch a761 f9b2 5ed3 e214 4575 1ba6 bcc7 7328 d449 8a9a 2dfb 913c 365d 688e cfef 1dh ff61 55b2 aad3 7914 8675 2ca6 d3c7 9e28 6149 cb9a 34fb e73c 185d b28e 4def 1eh 5451 a8a2 fcf3 9694 c2c5 3e36 6a67 ebe8 bfb9 434a 171b 7d7c 292d d5de 818f 1fh 6fc1 b542 da83 19f4 7635 acb6 c377 2e58 4199 9b1a f4db 37ac 586d 82ee ed2f

Manual Diagnosis

ECC Syndrome Table (for completion) for 0Fh CPUs(Single Error Correction, Double Error Detection):

n=0 n=1 n=2 n=3 n=4 n=5 n=6 n=7Bit (0+n) ce cb d3 d5 d6 d9 da dcBit (8+n) 23 25 26 29 2a 2c 31 34Bit (16+n) 0e 0b 13 15 16 19 1a 1cBit (24+n) e3 e5 e6 e9 ea ec f1 f4Bit (32+n) 4f 4a 52 54 57 58 5b 5dBit (40+n) a2 a4 a7 a8 ab ad b0 b5Bit (48+n) 8f 8a 92 94 97 98 9b 9dBit (56+n) 62 64 67 68 6b 6d 70 75Bit (64+n) 01 02 04 08 10 20 40 80*Typically used for single DIMM configurations

Other Tools/Diags (Un-supported)

Bonnie

Benchmark to measure performance of filesystem

http://www.textuality.com/bonnie/

Memtest86+

Standalone bootable diagnostic

http://www.memtest.org/

http://www.memtest86.com/ Original version

Other memory tool

http://people.redhat.com/dledford/memtest.html

http://sourceforge.net/

Netperf or ttcp - google for them - network tools

SGR

Situation appraisal Recognise a problem

Problem Analysis - Find True Cause

http://systems-tsc/twiki/pub/SGR/SgrtOnlineHelp/PA-guide.pdf

The Steps in FTC are:

* Define a Problem Statement

* Describe the problem with a Problem Specification

* Develop Possible Causes from either Experience or Differences and Changes

* Identify the Most Probable Cause

* Test the Most Probable Cause against the Problem Specification

* Verify the Most Probable Cause

Newisys MCE Decoder v20/40z

What to gather from inventory get all -v1. How many CPU's?2. How many Dimms per CPU?3. What is the part number of the Dimm?

NOTE:This is for V20/40z ONLY and only works on Northbridge Errors

Details from CPU0 explained

Here you see 4 identical Dimms on CPU0.

The Dimm Manufacture part # is: 36VDDF25672G-40BD2

Name Type OEM Manufacture Date Hardware Revision Part #

CPU 0 DIMM 0 memory 2cffffffffffffff 2005-04-16 0200 36VDDF25672G-40BD2

CPU 0 DIMM 1 memory 2cffffffffffffff 2005-04-16 0200 36VDDF25672G-40BD2

CPU 0 DIMM 2 memory 2cffffffffffffff 2005-03-19 0200 36VDDF25672G-40BD2

CPU 0 DIMM 3 memory 2cffffffffffffff 2005-03-19 0200 36VDDF25672G-40BD2

DDR 0 VRM memvrm S-SCI448 2005-05-27 A01 S01479

CPU 0 VRM vrm NA

Determine Type & Rank of Dimm

** Dimms can be single rank or dual rank. For a description of the differences see:

http://pts-platform.uk/twiki/bin/view/Products/ProdFAQv2040z or http://en.wikipedia.org/wiki/DIMM#Ranking

Browse to the Qualified Memory page:

http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Stinger/StingerQualifiedMemory

Compare your DIMM Manufacture part number to the list: 36VDDF25672G-40BD2

This equates to a 2GB Micron Dual Rank DIMM:

Micron:

512MB: MT18VDDF6472G-40BG3 Die: G Single Rank SPD 1.0

1GB: MT18VDDF12872G-40BD3 Die: D Single Rank SPD 1.0

2GB: MT36VDDF25672G-40BD2 Die: D Dual Rank SPD 1.0

Now we are ready to populate the Memory Decode Tool

Warning! Decode Tool is Sun Internal

https://supportcenter.newisys.com/edbug/edbug_int.pl?auth=dfqsdftqw11p4jdvhasovygm82cbcfrk

This link cannot be shared with customers.

It is internal for Sun use only.

The link has the account and password in it.

Populate the Decode Tool

Information for Decode Tool

Enter the CPU that has the machine check: (From the Error)0, 1, 2, or 3

Enter the platform type: 2100 = V20z4300 = V40z

Enter the machine check status: (From the Error)Enter the machine check address: (From the Error)Specify which CPUs have DIMMs: (From inventory ger all -v)Specify which DIMMs are populated on each CPU: (From Inventory get all -v)Specify the DIMM type: (Rank from Qualified Memory Page)BIOS defaults: Leave this at the default (Place a in DIMM interleaving, 128 bit DIMM interface, and Chipkill ECC enabled. No in Node interleaving)

Result Output

Only one error is presentError details:K8_CPU-0 is reporting this corrected error: DRAM chipkill ECC error found by scrubber The DRAM error was at address '00000000 9C6A0B30' (2 GB range) This error is related to DIMM 1 on K8_CPU-0 The ECC syndrome ('5E34'x) maps to a correctable error at data bit 66 Within the DIMM, this would be an error at physical bit 2 Processor was responding to another source of the transaction Transaction was a read Error classification: Error type: DRAM ECC Error severity: Corrected Error enabled: yes Error recovered: yes Possible sympathy: no Error address: '000000009C6A0B30'x Address type: Physical

Anything Else........

Newisys Machine Check (northbridge only)

V20/40z only

http://systems-tsc/twiki/pub/Products/ProdTroubleshootingV20z/V20z-V40z-Memory-DIMM-Error-Decode-Final.sxi

Windows Debugging

http://www.microsoft.com/whdc/devtools/debugging/default.mspx

MCAT

http://www.amd.com/us-en/assets/content_type/utilities/mcatsetup.exe

Machine Check Analysis Tool (MCAT) is a command line utility that takes Windows System Event Log (.evt) file as an argument and decodes the MCA Error logs into human readable format. MCAT can alternatively take in MCE Error information as raw register hexadecimal values as command line argument as well.

Diagnostics Complete!

Sun Microsystems, Inc.

Page

Click to edit the title text format

Click to edit the outline text format

Second Outline Level

Sun Confidential: Internal Only

Click to edit the notes format

Page

Click to edit the title text format

Presenters Name

Presenters Title

Presenters Company

Click to edit the notes format

Page