Picking Up the Pieces After Your LINUX System Crashes

An Administrator's Overview to Gathering Crash Data

Alan Boda Hewlett-Packard Company

8/15/2006, Alan Boda - HP

Introduction

Did my system crash or hang?
Why did my system crash or hang?
What do I save?
What should I do next time?
What should I do now?

What we will be discussing

System Admin goals
Crash analogy
Difference between a crash and a hang
Environment Scenarios
Tools to have in place now
What data to gather and how to gather it
What to do before and after the reboot
Reconstructing the crash scene
Ways to look at gathered data

What we won't be discussing

Crash dump analysis
Tool installation and configuration details
System or Application Performance Tuning

Goals as System Administrator

Prepare system so it can tell you what happened if it hangs or crashes
Reconstruct the Crash/Hang Scene
Develop emergency procedures

[Figure labels: LEDs, updates, dumps, profilers, track record, logs, diagnostics, environment, sar]

Car Crash Analogy

Accident Reconstruction Consultants
Evidence and Clues
Making the accident scene tell its story
More clues = clearer picture
Goal: Prep system to tell you what happened

Car Crash vs. System Crash

skid marks: performance degradation
weather: system environment changes
blown tire: failures in storage, fan, power supply, etc.
eye witnesses: system administrator or user who saw the failure or hang
survivors: Do any processes respond? Login response? Db/sql query response? Ping response?
black box: system activity data; message logs; system management log entries; profiler tool data; dumps
physical evidence: damaged hardware; diagnostic LEDs; physical or remote console messages
age of vehicle: New system installation? Extended system/application track record?
service records: new package installations; kernel or package upgrades; system h/w upgrade records
travel plan: scheduled tests, changes, maintenance logs

System Hang vs. Crash

System Hang
– Partially responsive
– Resource deficiency
– Not responsive, but no crash
– Runaway high priority process
– Bug in driver's interrupt handling code
Oops
System Crash
– Nonresponsive system
– Panic due to logical error

Different Environment Scenarios

Lights Out environment?
Cluster environment?
Database server?

System Tools to Configure Now!

SysRq

aka magic keys
used during hang/freeze situations
Alt-SysRq-<command key> sequence
provides memory, stack trace, process info
commands to sync disks, crash system
logs to messages and netdump (RHEL) – best effort
RHEL and SLES kernels have SysRq configured but not enabled. To verify if enabled:
# cat /proc/sys/kernel/sysrq
(1 = enabled, 0 = disabled)
Security risk
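The verification step above can be wrapped in a small helper. This is a minimal sketch, not part of the original slides: the function name sysrq_state is hypothetical, and the handling of values other than 0 and 1 is an assumption (newer kernels treat them as a bitmask of allowed functions, so the sketch only reports them).

```shell
#!/bin/sh
# Hypothetical helper: translate a kernel.sysrq value into a readable state.
# Per the slide: 1 = enabled, 0 = disabled. Any other value is reported
# as-is (assumption: it may be a bitmask on newer kernels).
sysrq_state() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "other ($1)" ;;
    esac
}

# Report the live setting only when the proc file is readable.
sysrq_file=/proc/sys/kernel/sysrq
if [ -r "$sysrq_file" ]; then
    echo "SysRq is $(sysrq_state "$(cat "$sysrq_file")")"
fi
```

Running the script on a default RHEL or SLES install of that era would be expected to report the disabled state, which is the cue to apply the configuration step.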

SysRq Configuration

Must set /proc/sys/kernel/sysrq to 1
# echo 1 > /proc/sys/kernel/sysrq
-or-
# sysctl -w kernel.sysrq=1
# sysctl -p
To retain setting across reboots
RHEL - Edit /etc/sysctl.conf:
kernel.sysrq = 1
SLES - Edit /etc/sysconfig/sysctl:
ENABLE_SYSRQ="yes"
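For the RHEL case, the persistent setting can be applied without hand-editing. The following is a sketch under stated assumptions: set_sysctl_conf is a hypothetical helper name, and it only handles the "key = value" format of /etc/sysctl.conf, not the SLES ENABLE_SYSRQ variable, which lives in a different file with a different syntax.

```shell
#!/bin/sh
# set_sysctl_conf FILE KEY VALUE (hypothetical helper)
# Ensure FILE holds exactly one active "KEY = VALUE" line: existing entries
# for KEY are dropped, commented lines are kept, and the new entry is appended.
set_sysctl_conf() {
    conf=$1; key=$2; value=$3
    tmp="$conf.tmp.$$"
    [ -f "$conf" ] || : > "$conf"
    awk -v k="$key" '{
        name = $0
        sub(/=.*/, "", name)                 # text before the first "="
        gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding whitespace
        if (name != k) print
    }' "$conf" > "$tmp" || return 1
    echo "$key = $value" >> "$tmp"
    cat "$tmp" > "$conf" && rm -f "$tmp"     # rewrite in place, keeping the file mode
}
```

A typical call would be set_sysctl_conf /etc/sysctl.conf kernel.sysrq 1 as root, followed by sysctl -p to load the file. Trying it on a copy of the file first is a reasonable precaution.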

Sample SysRq Output

SysRq : Show Regs
Pid/TGid: 0/0, comm: swapper
EIP: 0060:[<c0109129>] CPU: 1
EIP is at default_idle [kernel] 0x29 (2.4.21-20.ELsmp)
ESP: 080b:c01091c2 EFLAGS: 00000246 Tainted: P
EAX: 00000000 EBX: c0109100 ECX: c043bc80 EDX: c9b20000
ESI: c9b20000 EDI: c9b20000 EBP: c0109100 DS: 0068 ES: 0068 FS: 0000 GS:0000
CR0: 8005003b CR2: b729f000 CR3: 376c9900 CR4: 000006f0
Call Trace: [<c01091c2>] cpu_idle [kernel] 0x42 (0xc9b21fb0)
[<c01291c3>] call_console_drivers [kernel] 0x63 (0xc9b21fc4)
[<c01294f3>] printk [kernel] 0x153 (0xc9b21ffc)
Zone:Normal freepages:108783 min: 1279 low: 4544 high: 6304
Zone:HighMem freepages:1209405 min: 255 low: 20990 high: 31485
Free pages: 1321089 (1209405 HighMem)
( Active: 78806/14876, inactive_laundry: 4493, inactive_clean: 0, free:1321089)
…

sysstat

package containing iostat, sadc, sar, mpstat
System activity data collected
snapshots taken every 10 minutes
saves 7 days of reports by default (RHEL)
to verify:
# rpm -qa | grep sysstat
sysstat-5.0.1-35.4
# chkconfig --list | grep sysstat
sysstat 0:off 1:on 2:on 3:on 4:on 5:on 6:off

Contents of /var/log/sa (RHEL)

# ls -l /var/log/sa
total 4060
-rw-r--r-- 1 root root 207600 Jan 20 23:50 sa20
-rw-r--r-- 1 root root 207600 Jan 21 23:50 sa21
-rw-r--r-- 1 root root 207600 Jan 22 23:50 sa22
-rw-r--r-- 1 root root 207600 Jan 23 23:50 sa23
-rw-r--r-- 1 root root 207600 Jan 24 23:50 sa24
-rw-r--r-- 1 root root 207600 Jan 25 23:50 sa25
-rw-r--r-- 1 root root 207600 Jan 26 23:50 sa26
-rw-r--r-- 1 root root 207600 Jan 27 23:50 sa27
-rw-r--r-- 1 root root 88080 Jan 28 10:00 sa28
-rw-r--r-- 1 root root 287976 Jan 20 23:53 sar20
-rw-r--r-- 1 root root 287976 Jan 21 23:53 sar21
-rw-r--r-- 1 root root 287976 Jan 22 23:53 sar22
-rw-r--r-- 1 root root 287976 Jan 23 23:53 sar23
-rw-r--r-- 1 root root 287976 Jan 24 23:53 sar24
-rw-r--r-- 1 root root 287976 Jan 25 23:53 sar25
-rw-r--r-- 1 root root 287976 Jan 26 23:53 sar26
-rw-r--r-- 1 root root 287976 Jan 27 23:53 sar27

Contents of /var/log/sa (SLES)

# ls /var/log/sa
.              sa.2006_01_13  sa.2006_01_24   sar.2006_01_10  sar.2006_01_21
..             sa.2006_01_14  sa.2006_01_25   sar.2006_01_11  sar.2006_01_22
sa.2006_01_04  sa.2006_01_15  sa.2006_01_26   sar.2006_01_12  sar.2006_01_23
sa.2006_01_05  sa.2006_01_16  sa.2006_01_27   sar.2006_01_13  sar.2006_01_24
sa.2006_01_06  sa.2006_01_17  sa.2006_01_28   sar.2006_01_14  sar.2006_01_25
sa.2006_01_07  sa.2006_01_18  sar.2006_01_04  sar.2006_01_15  sar.2006_01_26
sa.2006_01_08  sa.2006_01_19  sar.2006_01_05  sar.2006_01_16  sar.2006_01_27
sa.2006_01_09  sa.2006_01_20  sar.2006_01_06  sar.2006_01_17
sa.2006_01_10  sa.2006_01_21  sar.2006_01_07  sar.2006_01_18
sa.2006_01_11  sa.2006_01_22  sar.2006_01_08  sar.2006_01_19
sa.2006_01_12  sa.2006_01_23  sar.2006_01_09  sar.2006_01_20

Sample sar report

# more sar20
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 2006-01-20

00:00:00 proc/s
00:10:00 0.03
00:20:00 0.01
00:30:00 0.01
00:40:00 0.01
00:50:00 0.01
01:00:00 0.01
01:10:00 0.03
01:20:00 0.01
01:30:00 0.01
01:40:00 0.01
01:50:00 0.01
02:00:00 0.01
02:10:00 0.03
02:20:00 0.01
02:30:00 0.01
02:40:00 0.01

System Management Tools

snmp-based tools
Examples:
– IBM: IBM Director Agents
– Dell: OpenManage Server Administrator
– HP: Insight Manager and Agents
Agents monitor and log to system logs
Predictive fault (if supported by driver)

Other tools

Special situations
vendor-specific
cron script to gather
– /proc/meminfo
– top
– /proc/slabinfo
– vmstat
– netstat
– interrupt
– lsof
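A cron-driven gathering script of the kind listed above might look like the following. It is only a sketch: the function names, output layout, and tool options are assumptions, and "interrupt" is read here as /proc/interrupts. Tools that are not installed, and /proc files the caller cannot read, are skipped.

```shell
#!/bin/sh
# save_cmd DEST NAME COMMAND [ARGS...]: write COMMAND's output to
# DEST/NAME.txt, silently skipping tools that are not installed.
save_cmd() {
    dest=$1; name=$2; shift 2
    command -v "$1" >/dev/null 2>&1 || return 0
    "$@" > "$dest/$name.txt" 2>&1 </dev/null
    return 0
}

# gather_snapshot OUTDIR: create OUTDIR/<timestamp>/, fill it with the
# items from the slide, and print the directory name.
gather_snapshot() {
    snap="$1/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$snap" || return 1
    for f in meminfo slabinfo interrupts; do
        if [ -r "/proc/$f" ]; then cat "/proc/$f" > "$snap/$f.txt"; fi
    done
    save_cmd "$snap" top top -b -n 1       # one batch-mode iteration
    save_cmd "$snap" vmstat vmstat
    save_cmd "$snap" netstat netstat -an
    save_cmd "$snap" lsof lsof -n          # -n: skip host name lookups
    echo "$snap"
}
```

Saved as a script that ends with a call such as gather_snapshot /var/log/snapshots (a hypothetical path), it could be scheduled from root's crontab at whatever interval the situation calls for. Old snapshot directories would need separate pruning.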

Crash Dump Issues

Inconsistent crash dump methods
Standard kernel
deadlocks
resources
network throughput for network-based dumps
assumes trusted kernel state
where to dump
ASR interference

Crash Dump tools

netdump - RHEL
diskdump - RHEL
LKCD - SLES
mkdump
kdump

Netdump

dumps to remote disk
nic must support polled operation
log file of panic, oops and other SysRq output
Verify:
– # service netdump status
– # service netdump-server status
– Check /etc/sysconfig/netdump
DEV=eth0 (or other nic)
NETDUMPADDR={netdump-server IP}

Diskdump

dumps to local disk
limited controllers
dump levels
Available as of RHEL 3 U3
Verify:
– # service diskdump status
– # cat /etc/sysconfig/diskdump

LKCD

dumps to local disk
can also dump to netdump-server (default)
different dump levels
Verify:
– # lkcd query

mkdump

minikernel dump (based on mkexec)
OpenSource
Uses netdump and LKCD dump format

kdump

kexec-based kernel crash dump mechanism
OpenSource
Use crash to analyze dump file

Dump Suggestions

Disk-based dumps
Network-based dumps
Automatic Server Recovery (ASR) timeouts
Synchronize time
Best effort

Test out Dump

Enable the magic sysrq key
# sysctl -w kernel/sysrq=1
Enable panic_on_oops
# sysctl -w kernel/panic_on_oops=1
netdump: check to see if netlog is working
# echo h > /proc/sysrq-trigger
netdump: Test SysRq writes to netdump log file
# echo m > /proc/sysrq-trigger
Sync all mounted file systems
# echo s > /proc/sysrq-trigger
Crash the system
– # echo c > /proc/sysrq-trigger (RHEL)
– # echo d > /proc/sysrq-trigger (SLES)
– crash.c (RHEL)
diskdump – check /var/crash/127.0.0.1-<date>
lkcd – check /var/log/dump/

System Snapshot

Take snapshot of working system now
Run normal working load while taking snapshot
Will discuss tools one can use shortly

Now that the System has Crashed or Hung

Gathering the clues

Assume that tools have been preconfigured!

Before You Reboot

Don't delay!
Console messages?
ping?
telnet, ssh, or rsh?
If db server, query response?
SysRq keys
LED's?
Physical environment changes?
Dump?
Time of hang or crash
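Because the time of the hang and the state of each "survivor" are easy to lose in the rush to reboot, it can help to record the checks as they are made. The sketch below is an illustration only: probe and triage are hypothetical names, and the ping and ssh timeouts are assumed values to adjust per site.

```shell
#!/bin/sh
# probe LABEL COMMAND [ARGS...]: run one check and print a timestamped
# "ok" or "FAILED" line, so the results can be appended to a log.
probe() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then result=ok; else result=FAILED; fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $label: $result"
}

# triage HOST: the network checks from the slide, run from another machine.
triage() {
    host=$1
    probe "ping $host" ping -c 1 -W 5 "$host"
    probe "ssh $host" ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

From an admin workstation, something like triage dbserver1 >> hang-notes.log (both names hypothetical) would leave a timestamped record to set beside console messages and LED observations.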

After the Reboot

kernel (uname -a)
loaded modules (lsmod)
bus information (lspci -w)
boot information (dmesg)
system logs (/var/log/*)
memory (/proc/meminfo)
cpu (/proc/cpuinfo)
disk (/proc/scsi/scsi)
disk partition (/proc/partitions)
installed rpm's (/var/log/rpminfo, "rpm -qa")
time of hang or crash
cpu details (dmidecode)
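The items above lend themselves to a single collection pass right after the system comes back. This is a rough sketch, not the tooling the slides describe: collect_post_reboot and the output file names are hypothetical, lspci is run without options, and commands that are not installed are skipped. Some tools (dmidecode, for instance) generally need root.

```shell
#!/bin/sh
# collect_post_reboot OUTDIR: save each command's output as
# OUTDIR/<name>.txt and copy the /proc files named on the slide.
collect_post_reboot() {
    out=$1
    mkdir -p "$out" || return 1
    while IFS='|' read -r name cmd; do
        tool=${cmd%% *}                       # first word of the command
        command -v "$tool" >/dev/null 2>&1 || continue
        sh -c "$cmd" > "$out/$name.txt" 2>&1 </dev/null
    done <<'EOF'
uname|uname -a
lsmod|lsmod
lspci|lspci
dmesg|dmesg
rpm-qa|rpm -qa
dmidecode|dmidecode
EOF
    for f in meminfo cpuinfo scsi/scsi partitions; do
        if [ -r "/proc/$f" ]; then
            cat "/proc/$f" > "$out/proc-$(echo "$f" | tr / -).txt"
        fi
    done
    date > "$out/collected-at.txt"            # when the data was gathered
}
```

Calling collect_post_reboot /root/crash-$(date +%Y%m%d) would produce one directory per incident. The system logs under /var/log are left in place here and would be copied separately.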

Snapshot Tools for After Reboot

sysreport (RHEL)
sitar (SLES)
config.sh (SLES)
cfg2html (OpenSource)
h/w diagnostic tools (vendor-specific)

sysreport

Verify: rpm -qa | grep sysreport
What does it generate?
# ls -w 50
boot     free            ksyms    mount   rpm-Va
date     hardware.py     lib      proc    uname
df       hostname        ls-boot  ps      uptime
etc      ifconfig        lsmod    pstree  var
fdisk-l  installed-rpms  lspci    route

sitar

Generates various reports detailing
– add-on's
– installed packages
– system info
– yast installed packages
Reports created in different formats

sitar-generated files

# sitar
# ls /tmp/sitar-fwills.america.cpqcorp.net-2006020104/
.
..
sitar-addon-fwills.america.cpqcorp.net-yast2.sel
sitar-fwills.america.cpqcorp.net-yast1.sel
sitar-fwills.america.cpqcorp.net.html
sitar-fwills.america.cpqcorp.net.sdocbook.xml
sitar-fwills.america.cpqcorp.net.tex
sitar-sles-fwills.america.cpqcorp.net-yast2.sel

Sitar .html report

fwills.america.cpqcorp.net, Wed Feb 1 04:48:57 2006

Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux SUSE LINUX Enterprise Server 9 (i586)

Table of Contents 1. General Information 2. CPU

1. General Information

Hostname fwills.america.cpqcorp.net

Operating System SUSE LINUX Enterprise Server 9 (i586)

UName Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux

Date Wed Feb 1 04:48:57 2006

Main Memory 385976 KByte

Cmdline root=/dev/sda2 vga=0x317 selinux=0 resume=/dev/sda3 elevator=cfq splash=silent

Load 0.00 0.00 0.00 1/79 2084

Uptime (minutes hours days) 181715 3028 124

Idletime (minutes hours days) 37934 632 26

2. CPU

config.sh

What does it generate?
# ls -w 50
.              iscsi.txt     performance.txt
..             lvm.txt       rcd.txt
boot.txt       messages.txt  release.txt
chkconfig.txt  modules.txt   rpm.txt
config.sh.txt  mpio.txt      rug.txt
cron.txt       ncp.txt       scsi.txt
env.txt        network.txt   siga.txt
evms.txt       nss.txt       softraid.txt
hwinfo.txt     pam.txt       y2log.txt

Tools to view sysstat data

sar
isag
sarcheck

Sample sar commands

# sar -u 2 4
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 02/01/2006

03:01:50 PM  CPU  %user  %nice  %system  %iowait  %idle
03:01:52 PM  all   0.50   0.00     0.25     0.75  98.49
03:01:54 PM  all   0.00   0.00     0.50     0.00  99.50
03:01:56 PM  all   0.00   0.00     0.00     0.25  99.75
03:01:58 PM  all   0.25   0.00     0.50     0.00  99.25
Average:     all   0.19   0.00     0.31     0.25  99.25

# cd /var/log/sa
# sar -A -f sa01 > sar01-new
# ls -l sa*01*
-rw-r--r-- 1 root root 132720 Feb 1 15:10 sa01
-rw-r--r-- 1 root root 181575 Feb 1 15:08 sar01-new
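The sar -A -f conversion shown above can be repeated for every daily file in one pass, which is handy when the crash day's text report was never generated. This is a minimal sketch assuming the RHEL naming (saDD); the function name export_sa_reports and the SAR override variable are hypothetical, and the SLES sa.YYYY_MM_DD names would need a different glob.

```shell
#!/bin/sh
# export_sa_reports DIR: run "sar -A -f" on every daily binary file
# DIR/saDD and write the text report as DIR/sarDD-new, as on the slide.
# SAR may be set to use a specific sar binary (defaults to "sar").
export_sa_reports() {
    dir=$1
    for sa in "$dir"/sa[0-9][0-9]; do
        [ -f "$sa" ] || continue      # no match: the glob stays literal
        day=${sa##*/sa}               # ".../sa01" -> "01"
        ${SAR:-sar} -A -f "$sa" > "$dir/sar$day-new"
    done
}
```

For example, export_sa_reports /var/log/sa leaves the existing sarDD reports untouched and writes the regenerated ones beside them with the -new suffix.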

Reconstructing the Crash Scene

Check logs

/var/log/messages
– Search for kernel load entry
– Work backwards and look for:
Errors or Warnings
Oops messages with trace output
Other log files
df output
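The "find the kernel load entry, then work backwards" step can be partly automated. The sketch below is an assumption-laden illustration: lines_before_boot is a hypothetical name, and the default marker (the syslogd restart line that appears at boot in logs of this vintage) is a guess that should be replaced with whatever line marks a boot on the system at hand.

```shell
#!/bin/sh
# lines_before_boot LOGFILE [COUNT] [MARKER]
# Print the COUNT lines (default 50) that precede the last line matching
# MARKER, i.e. the final messages logged before the most recent boot.
lines_before_boot() {
    log=$1
    count=${2:-50}
    marker="${3:-syslogd .*: restart}"
    # Line number of the last boot marker in the log.
    boot=$(grep -n -- "$marker" "$log" | tail -n 1 | cut -d: -f1)
    [ -n "$boot" ] || return 1        # no boot marker found
    end=$((boot - 1))
    [ "$end" -ge 1 ] || return 0      # marker is the first line
    start=$((end - count + 1))
    [ "$start" -ge 1 ] || start=1
    sed -n "${start},${end}p" "$log"
}
```

Piping the result through grep -i for "oops", "error", or "warn" narrows it further, for example lines_before_boot /var/log/messages 200 | grep -i oops.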

Oops

Oct 30 00:05:34 karp kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
Oct 30 00:05:34 karp kernel:  printing eip:
Oct 30 00:05:34 karp kernel: c011ec5d
Oct 30 00:05:34 karp kernel: *pde = 2aefa001
Oct 30 00:05:34 karp kernel: Oops: 0000
Oct 30 00:05:34 karp kernel: Kernel 2.4.9-e.38enterprise
Oct 30 00:05:34 karp kernel: CPU:    1
Oct 30 00:05:34 karp kernel: EIP:    0010:[get_module_list+61/816]    Tainted: P
Oct 30 00:05:34 karp kernel: EIP:    0010:[<c011ec5d>]    Tainted: P
Oct 30 00:05:34 karp kernel: EFLAGS: 00010246
Oct 30 00:05:34 karp kernel: EIP is at get_module_list [kernel] 0x3d
Oct 30 00:50:00 karp syslogd 1.4.1: restart.
Oct 30 00:50:00 karp syslog: syslogd startup succeeded
Oct 30 00:50:00 karp kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 30 00:50:00 karp kernel: Inspecting /boot/System.map-2.4.9-e.38enterprise
Oct 30 00:50:00 karp syslog: klogd startup succeeded

ksymoops

>>EIP; c0113f8c <sys_init_module+49c/4d0>
Trace; c011d3f5 <sys_mremap+295/370>
Trace; c011af5f <do_generic_file_read+5bf/5f0>
Trace; c011afe9 <file_read_actor+59/60>
Trace; c011d2bc <sys_mremap+15c/370>
Trace; c010e80f <do_sigaltstack+ff/1a0>
Trace; c0107c39 <overflow+9/c>
Trace; c0107b30 <tracesys+1c/23>
Trace; 00001000 Before first symbol

SAR Data Example

15:20:00  dentunusd  file-sz  %file-sz  inode-sz  super-sz  %super-sz  dquot-sz  %dquot-sz  rtsig-sz  %rtsig-sz
15:30:00    1866554     2728      2.08   2055024         0       0.00         0       0.00         1       0.10
15:40:01    1866909     2785      2.12   2055020         0       0.00         0       0.00         1       0.10
…
17:20:00    1870217     2786      2.13   2055019         0       0.00         0       0.00         1       0.10
17:30:00    1870516     2762      2.11   2055022         0       0.00         0       0.00         1       0.10
17:40:00    1870848     2785      2.12   2055019         0       0.00         0       0.00         1       0.10
17:50:00    1569671     2156      1.64   1743619         0       0.00         0       0.00         1       0.10
18:00:00    1570730     1984      1.51   1744880         0       0.00         0       0.00         1       0.10
18:10:00    1571240     1792      1.37   1745241         0       0.00         0       0.00         1       0.10
18:20:00    1571768     1510      1.15   1745796         0       0.00         0       0.00         1       0.10
18:30:00    1572100     1483      1.13   1745826         0       0.00         0       0.00         1       0.10
18:40:00    1573175       16      0.01   1747980         0       0.00         0       0.00         1       0.10

ISAG View of Sar Data

The Question of Debuggers

Torvalds quote:

"I do see some good points in a kernel debugger, but I have yet to be convinced that the good things outweigh the bad. The only valid uses of debuggers is to get a stack backtrace and a register dump, imho, and that is what you get from a kernel panic anyway (and the ksymoops.cc program will actually make it readable for others than just me ;-)

I'm afraid that I've seen too many people fix bugs by looking at debugger output, and that almost inevitably leads to fixing the symptoms rather than the underlying problems."

Ref: http://www.ussg.iu.edu/hypermail/linux/kernel/9510/0103.html

In The Dumps

Recover the vmcore
Tools to analyze:
– netdump / diskdump: use crash
– LKCD: use lcrash or crash
– mkdump: use lcrash or crash
Key items: process stacks, system calls
Requirements needed from crashed system

Netdump

Check netdump log file first
– Oops or panic messages
– loaded modules
– SysRq memory, trace, process info
– Stack trace
Use "crash" on vmcore
syslog
Ref: /usr/share/doc/netdump-*

Netdump-server Files

# ls -l /var/crash/16.113.5.104-2003-12-15-12:21
total 141108
-rw------- 1 netdump netdump 63067 Dec 15 2003 log
-rw------- 1 netdump netdump 134205440 Dec 15 2003 vmcore

# du -sk /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
131192 /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
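On a netdump server that receives dumps from several clients, a quick check of the newest dump directory confirms that both files arrived. The following is a sketch only: latest_dump is a hypothetical name, /var/crash is taken from the listing above, and "newest" is judged by directory modification time rather than by parsing the date in the directory name.

```shell
#!/bin/sh
# latest_dump [BASE]: print the most recently modified dump directory
# under BASE (default /var/crash) and report on its "log" and "vmcore".
latest_dump() {
    base=${1:-/var/crash}
    newest=$(ls -td "$base"/*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || { echo "no dump directories under $base"; return 1; }
    newest=${newest%/}
    echo "$newest"
    for f in log vmcore; do
        if [ -s "$newest/$f" ]; then
            echo "  $f: present ($(du -sk "$newest/$f" | cut -f1) KB)"
        else
            echo "  $f: missing or empty"
        fi
    done
}
```

A vmcore much smaller than the client's physical memory may indicate an interrupted transfer, which ties back to the ASR timeout concern raised earlier in the deck.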

Netdump-server log

# more log
Oops: 0002
Kernel 2.4.9-e.3
CPU: 0
EIP: 0010:[<c8a44076>] Tainted: P
EFLAGS: 00010282
EIP is at init_module [crash] 0x16
eax: 00000013 ebx: c8a44000 ecx: 00000000 edx: c543e000
esi: 00000000 edi: 00000000 ebp: c3149f28 esp: c3149f20
ds: 0018 es: 0018 ss: 0018
Process insmod (pid: 5619, stackpage=c3149000)
Stack: 00000000 00000060 00000060 c0118eb5 00000000 c36fb000 00000098 c35c6000
       00000060 ffffffea 00000005 c468b740 00000060 c8a3f000 c8a44060 000002e8
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace: [<c0118eb5>] sys_init_module [kernel] 0x535
[<c8a44060>] init_module [crash] 0x0
[<c0106f03>] system_call [kernel] 0x33

Code: c6 05 00 00 00 00 00 b8 00 00 00 00 c9 c3 63 72 61 73 68 69
< netdump activated - performing handshake with the client. >

Process: 5619, { insmod}
…

Disk Dump

check for vmcore in /var/crash/127.0.0.1-<date>
use same crash tool as for netdump to debug
Ref: /usr/share/doc/diskdumputils-*/

LKCD

check /var/log/dump/n
Use lcrash or crash to analyze vmcore
Ref: /usr/share/doc/packages/lkcdutils/

What next?

H/W vendor
System Service Provider
OS vendor
OpenSource Community

Summary

Tools to use to prep system
Reconstruction of hang / crash scene
Before a reboot
After the reboot
Methods to approach looking at data

Questions & Answers

Make your system talk

Prepare your system now so it can tell you what happened!

Appendix – More Information

http://www.novell.com/coolsolutions/tools/16106.html -- SLES config.sh
http://come.to/cfg2html -- cfg2html utility to gather system information
http://www.linuxtroubleshooting.com/wiki/index.php?title=Main_Page -- Linux troubleshooting tools
http://www.volny.cz/linux_monitor/isag/ -- isag
http://rpmfind.net//linux/RPM/contrib/noarch/noarch/isag-4.1.1-1.noarch.html -- isag
http://www.sarcheck.com/sclinux.htm -- sarcheck
http://linuxgazette.net/issue59/nazario.html -- good dmesg description
http://lkcd.sourceforge.net -- lkcd
http://lkcd.sourceforge.net/doc/lcrash.pdf -- lcrash HOWTO
http://lkcd.sourceforge.net/doc/lkcd_tutorial.pdf -- good lkcd tutorial
/usr/share/doc/packages/lkcdutils/README.SuSE -- LKCD setup
http://www.novell.com/coolsolutions/feature/14813.html -- SLES lkcd
http://support.novell.com/cgi-bin/search/searchtid.cgi?10099561.htm -- SLES lkcd howto
/usr/share/doc/diskdumputils-*/README -- diskdump setup
http://www.redhat.com/support/wpapers/redhat/netdump/ -- netdump
/usr/share/doc/netdump*/README* -- netdump / netdump-server
http://www.linuxforums.org/forum/peripherals-hardware/35963-cpu-naming-schemes-x86-386-486-586-amd-64-ia64-em64t.html?highlight=naming+schemes -- good cpu chip reference
http://mkdump.sourceforge.net -- mkdump
http://lse.sourceforge.net/kdump/ -- kdump
http://www.linuxdevcenter.com/lpt/a/1319 -- "Linux System Failure Post-Mortem", by Jennifer Vesperman (O'Reilly Network)
http://www.die.net/doc/linux/man/man5/proc.5.html -- manpage for /proc details
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0509wright/?ca=dgr-lnxw06DB2Linux -- good article on Linux memory utilization
http://www.ataassociates.com/Process.htm -- accident reconstruction

Appendix - Vocabulary

AMD64/EM64T – Similar X86 architectures w/ 64 bit mem registers, collectively known as X86_64
ARC – Accident Reconstruction Consultant
ASR – Automatic Server Recovery
IA64 – CPU based on 64-bit Itanium chipset
ISAG – Interactive System Activity Grapher
lkcd – Linux Kernel Crash Dump utility
mkdump – minikernel dump utility
RHEL – Red Hat Enterprise Linux
SAR – System Activity Report
SLES – SuSE Linux Enterprise Server
SysRq (aka magic keys) – key sequence intercepted by kernel to perform certain operations
x86 – CPU based on Intel 80x86 chipset

Appendix – Pre-Crash Check List

Enable SysRq
Enable sysstat
Enable system management tools
Develop emergency procedures
Train staff in emergency procedures
Configure and enable dump utility
Take system snapshot on loaded/running system
Setup remote console access
