Picking Up the Pieces after your LINUX System Crashes
An Administrator's Overview to Gathering Crash Data
Alan Boda Hewlett-Packard Company
8/15/2006Alan Boda - HP2
Introduction
Did my system crash or hang?
Why did my system crash or hang?
What do I save?
What should I do next time?
What should I do now?

What we will be discussing
System Admin goals
Crash analogy
Difference between a crash and a hang
Environment Scenarios
Tools to have in place now
What data to gather and how to gather it
What to do before and after the reboot
Reconstructing the crash scene
Ways to look at gathered data

What we won’t be discussing
Crash dump analysis
Tool installation and configuration details
System or Application Performance Tuning

Goals as System Administrator
Prepare system so it can tell you what happened if it hangs or crashes
Reconstruct the Crash/Hang Scene
Develop emergency procedures
[Figure: sources of evidence: LEDs, updates, dumps, profilers, track record, logs, diagnostics, environment, sar]

Car Crash Analogy
Accident Reconstruction Consultants
Evidence and Clues
Making the accident scene tell its story
More clues = clearer picture
Goal: Prep system to tell you what happened

Car Crash vs. System Crash
skid marks: performance degradation
weather: system environment changes
blown tire: failures in storage, fan, power supply, etc.
eyewitnesses: system administrator or user who saw the failure or hang
survivors: Do any processes respond? Login response? DB/SQL query response? Ping response?
black box: system activity data; message logs; system management log entries; profiler tool data; dumps
physical evidence: damaged hardware; diagnostic LEDs; physical or remote console messages
age of vehicle: New system installation? Extended system/application track record?
service records: new package installations; kernel or package upgrades; system h/w upgrade records
travel plan: scheduled tests, changes, maintenance logs

System Hang vs. Crash
System Hang
– Partially responsive
– Resource deficiency
– Not responsive, but no crash
– Runaway high priority process
– Bug in driver’s interrupt handling code
Oops
System Crash
– Nonresponsive system
– Panic due to logical error

Different Environment Scenarios
Lights Out environment?
Cluster environment?
Database server?

System Tools to Configure Now!

SysRq
aka magic keys
used during hang/freeze situations
Alt-SysRq-<command key> sequence
provides memory, stack trace, process info
commands to sync disks, crash system
logs to messages and netdump (RHEL)
– best effort
RHEL and SLES kernels have SysRq configured but not enabled. To verify if enabled:
# cat /proc/sys/kernel/sysrq
(1 = enabled, 0 = disabled)
Security risk
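
The check above can be wrapped in a small helper. This is a minimal sketch, not part of the original slides: the function name `sysrq_state` and its optional path argument are hypothetical, and it assumes that a value greater than 1 is a bitmask of permitted SysRq functions, as on later 2.6 kernels.

```shell
#!/bin/sh
# sysrq_state [file]: report whether SysRq is enabled.
# The function name and the optional path argument are illustrative only.
sysrq_state() {
    file="${1:-/proc/sys/kernel/sysrq}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        # Assumption: any other value is a bitmask of allowed functions.
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}
```

Calling `sysrq_state` with no argument reads the live setting; passing a file path makes it easy to rehearse on a copy.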

SysRq Configuration
Must set /proc/sys/kernel/sysrq to 1
# echo 1 > /proc/sys/kernel/sysrq
-or-
# sysctl -w kernel.sysrq=1
# sysctl -p
To retain setting across reboots
RHEL - Edit /etc/sysctl.conf:
kernel.sysrq = 1
SLES - Edit /etc/sysconfig/sysctl:
ENABLE_SYSRQ="yes"
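
The RHEL-style edit can be made repeatable with a short script. The following is a sketch under stated assumptions: the function name `persist_sysrq` is hypothetical, GNU `sed` is assumed for the `-i` option, and only the `kernel.sysrq` line of a sysctl.conf-style file is handled (the SLES `ENABLE_SYSRQ` variable is not).

```shell
#!/bin/sh
# persist_sysrq [file]: make sure the file contains "kernel.sysrq = 1".
# Hypothetical helper; assumes GNU sed (-i edits the file in place).
persist_sysrq() {
    conf="${1:-/etc/sysctl.conf}"
    if grep -Eq '^[[:space:]]*kernel\.sysrq[[:space:]]*=' "$conf" 2>/dev/null; then
        # A setting already exists: rewrite it rather than adding a duplicate.
        sed -i -E 's/^[[:space:]]*kernel\.sysrq[[:space:]]*=.*/kernel.sysrq = 1/' "$conf"
    else
        echo 'kernel.sysrq = 1' >> "$conf"
    fi
}
```

Running it twice leaves a single `kernel.sysrq = 1` line, so it is safe to include in a provisioning script; `sysctl -p` then applies the file to the running kernel.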

Sample SysRq Output
SysRq : Show Regs
Pid/TGid: 0/0, comm: swapper
EIP: 0060:[<c0109129>] CPU: 1
EIP is at default_idle [kernel] 0x29 (2.4.21-20.ELsmp)
ESP: 080b:c01091c2 EFLAGS: 00000246 Tainted: P
EAX: 00000000 EBX: c0109100 ECX: c043bc80 EDX: c9b20000
ESI: c9b20000 EDI: c9b20000 EBP: c0109100 DS: 0068 ES: 0068 FS: 0000 GS:0000
CR0: 8005003b CR2: b729f000 CR3: 376c9900 CR4: 000006f0
Call Trace: [<c01091c2>] cpu_idle [kernel] 0x42 (0xc9b21fb0)
[<c01291c3>] call_console_drivers [kernel] 0x63 (0xc9b21fc4)
[<c01294f3>] printk [kernel] 0x153 (0xc9b21ffc)
Zone:Normal freepages:108783 min: 1279 low: 4544 high: 6304
Zone:HighMem freepages:1209405 min: 255 low: 20990 high: 31485
Free pages: 1321089 (1209405 HighMem)
( Active: 78806/14876, inactive_laundry: 4493, inactive_clean: 0, free:1321089)
…

sysstat
package containing iostat, sadc, sar, mpstat
System activity data collected
snapshots taken every 10 minutes
saves 7 days of reports by default (RHEL)
to verify:
# rpm -qa | grep sysstat
sysstat-5.0.1-35.4
# chkconfig --list | grep sysstat
sysstat 0:off 1:on 2:on 3:on 4:on 5:on 6:off

Contents of /var/log/sa (RHEL)
# ls -l /var/log/sa
total 4060
-rw-r--r-- 1 root root 207600 Jan 20 23:50 sa20
-rw-r--r-- 1 root root 207600 Jan 21 23:50 sa21
-rw-r--r-- 1 root root 207600 Jan 22 23:50 sa22
-rw-r--r-- 1 root root 207600 Jan 23 23:50 sa23
-rw-r--r-- 1 root root 207600 Jan 24 23:50 sa24
-rw-r--r-- 1 root root 207600 Jan 25 23:50 sa25
-rw-r--r-- 1 root root 207600 Jan 26 23:50 sa26
-rw-r--r-- 1 root root 207600 Jan 27 23:50 sa27
-rw-r--r-- 1 root root 88080 Jan 28 10:00 sa28
-rw-r--r-- 1 root root 287976 Jan 20 23:53 sar20
-rw-r--r-- 1 root root 287976 Jan 21 23:53 sar21
-rw-r--r-- 1 root root 287976 Jan 22 23:53 sar22
-rw-r--r-- 1 root root 287976 Jan 23 23:53 sar23
-rw-r--r-- 1 root root 287976 Jan 24 23:53 sar24
-rw-r--r-- 1 root root 287976 Jan 25 23:53 sar25
-rw-r--r-- 1 root root 287976 Jan 26 23:53 sar26
-rw-r--r-- 1 root root 287976 Jan 27 23:53 sar27

Contents of /var/log/sa (SLES)
# ls /var/log/sa
.              sa.2006_01_13  sa.2006_01_24   sar.2006_01_10  sar.2006_01_21
..             sa.2006_01_14  sa.2006_01_25   sar.2006_01_11  sar.2006_01_22
sa.2006_01_04  sa.2006_01_15  sa.2006_01_26   sar.2006_01_12  sar.2006_01_23
sa.2006_01_05  sa.2006_01_16  sa.2006_01_27   sar.2006_01_13  sar.2006_01_24
sa.2006_01_06  sa.2006_01_17  sa.2006_01_28   sar.2006_01_14  sar.2006_01_25
sa.2006_01_07  sa.2006_01_18  sar.2006_01_04  sar.2006_01_15  sar.2006_01_26
sa.2006_01_08  sa.2006_01_19  sar.2006_01_05  sar.2006_01_16  sar.2006_01_27
sa.2006_01_09  sa.2006_01_20  sar.2006_01_06  sar.2006_01_17
sa.2006_01_10  sa.2006_01_21  sar.2006_01_07  sar.2006_01_18
sa.2006_01_11  sa.2006_01_22  sar.2006_01_08  sar.2006_01_19
sa.2006_01_12  sa.2006_01_23  sar.2006_01_09  sar.2006_01_20
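
The two listings show different naming schemes for the daily data files: `saDD` on RHEL and `sa.YYYY_MM_DD` on SLES. A small sketch of a helper that builds the path for a given day follows; the function name `sa_path` and its argument order are hypothetical, and the naming rules are taken only from the listings above.

```shell
#!/bin/sh
# sa_path <rhel|sles> <YYYY> <MM> <DD>: print the daily sysstat data file path.
# Hypothetical helper; naming follows the directory listings shown in the slides.
sa_path() {
    case "$1" in
        rhel) echo "/var/log/sa/sa${4}" ;;
        sles) echo "/var/log/sa/sa.${2}_${3}_${4}" ;;
        *)    echo "unknown distribution: $1" >&2; return 1 ;;
    esac
}
```

For example, `sar -f "$(sa_path rhel 2006 01 20)"` would read the data file for 20 January on a RHEL system laid out as shown.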

Sample sar report
# more sar20
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 2006-01-20
00:00:00 proc/s
00:10:00 0.03
00:20:00 0.01
00:30:00 0.01
00:40:00 0.01
00:50:00 0.01
01:00:00 0.01
01:10:00 0.03
01:20:00 0.01
01:30:00 0.01
01:40:00 0.01
01:50:00 0.01
02:00:00 0.01
02:10:00 0.03
02:20:00 0.01
02:30:00 0.01
02:40:00 0.01

System Management Tools
snmp-based tools
Examples:
– IBM: IBM Director Agents
– Dell: OpenManage Server Administrator
– HP: Insight Manager and Agents
Agents monitor and log to system logs
Predictive fault (if supported by driver)

Other tools
Special situations
vendor-specific
cron script to gather
– /proc/meminfo
– top
– /proc/slabinfo
– vmstat
– netstat
– interrupt
– lsof
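
A cron-driven collector for the items listed above might look like the following. This is a sketch, not the vendor script the slide refers to: the function name `collect_snapshot`, the output layout, and the choice of `/proc/interrupts` for the "interrupt" item are assumptions, and `lsof` is omitted to keep each run short.

```shell
#!/bin/sh
# collect_snapshot <basedir>: save a timestamped set of system statistics.
# Hypothetical helper; prints the directory it created.
collect_snapshot() {
    outdir="$1/snapshot-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$outdir" || return 1
    date > "$outdir/timestamp"
    # Copy the /proc files that are readable (slabinfo usually needs root).
    for f in /proc/meminfo /proc/slabinfo /proc/interrupts; do
        if [ -r "$f" ]; then
            cat "$f" > "$outdir/$(basename "$f")" 2>/dev/null
        fi
    done
    # Run each tool only if it is installed.
    if command -v vmstat >/dev/null 2>&1; then
        vmstat 1 2 > "$outdir/vmstat" 2>&1
    fi
    if command -v netstat >/dev/null 2>&1; then
        netstat -an > "$outdir/netstat" 2>&1
    fi
    if command -v top >/dev/null 2>&1; then
        top -b -n 1 > "$outdir/top" 2>&1
    fi
    echo "$outdir"
}
```

Saved as a script and called from cron at a fixed interval, this leaves a trail of snapshots that can be compared with the last one taken before a hang.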

Crash Dump Issues
Inconsistent crash dump methods
Standard kernel
deadlocks
resources
network throughput for network-based dumps
assumes trusted kernel state
where to dump
ASR interference

Crash Dump tools
netdump - RHEL
diskdump - RHEL
LKCD - SLES
mkdump
kdump

Netdump
dumps to remote disk
nic must support polled operation
log file of panic, oops and other SysRq output
Verify:
– # service netdump status
– # service netdump-server status
– Check /etc/sysconfig/netdump
DEV=eth0 (or other nic)
NETDUMPADDR={netdump-server IP}

Diskdump
dumps to local disk
limited controllers
dump levels
Available as of RHEL 3 U3
Verify:
– # service diskdump status
– # cat /etc/sysconfig/diskdump

LKCD
dumps to local disk
can also dump to netdump-server (default)
different dump levels
Verify:
– # lkcd query

mkdump
minikernel dump (based on mkexec)
OpenSource
Uses netdump and LKCD dump format

kdump
kexec-based kernel crash dump mechanism
OpenSource
Use crash to analyze dump file

Dump Suggestions
Disk-based dumps
Network-based dumps
Automatic Server Recovery (ASR) timeouts
Synchronize time
Best effort

Test out Dump
Enable the magic sysrq key
# sysctl -w kernel/sysrq=1
Enable panic_on_oops
# sysctl -w kernel/panic_on_oops=1
netdump: check to see if netlog is working
# echo h > /proc/sysrq-trigger
netdump: Test SysRq writes to netdump log file
# echo m > /proc/sysrq-trigger
Sync all mounted file systems
# echo s > /proc/sysrq-trigger
Crash the system
– # echo c > /proc/sysrq-trigger (RHEL)
– # echo d > /proc/sysrq-trigger (SLES)
– crash.c (RHEL)
diskdump – check /var/crash/127.0.0.1-<date>
lkcd – check /var/log/dump/
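
Because one of these commands deliberately crashes the machine, it can help to route the test sequence through a wrapper that refuses the crash key unless it is explicitly confirmed. The sketch below is an assumption-laden illustration: the function name `sysrq_send`, the `SYSRQ_TRIGGER` variable, and the `--yes-crash` flag are all hypothetical, and only the `h`, `m`, `s` and `c` keys from the slide are accepted.

```shell
#!/bin/sh
# The target path can be overridden so the helper can be rehearsed
# against a scratch file instead of the real trigger.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# sysrq_send <key> [--yes-crash]: write one SysRq command key to the trigger.
# Hypothetical helper; must be run as root against the real trigger file.
sysrq_send() {
    key="$1"
    case "$key" in
        h|m|s) ;;   # help, memory info, sync: safe on a running system
        c)
            # 'c' crashes the system, so demand an explicit confirmation.
            if [ "$2" != "--yes-crash" ]; then
                echo "refusing to send 'c' without --yes-crash" >&2
                return 1
            fi
            ;;
        *)
            echo "unsupported key: $key" >&2
            return 1
            ;;
    esac
    echo "$key" > "$SYSRQ_TRIGGER"
}
```

A test run on RHEL would then be `sysrq_send h`, `sysrq_send m`, `sysrq_send s`, and finally `sysrq_send c --yes-crash` once the earlier steps have shown up in the netdump log.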

System Snapshot
Take snapshot of working system now
Run normal working load while taking snapshot
Will discuss tools one can use shortly

Now that the System has Crashed or Hung
Gathering the clues
Assume that tools have been preconfigured!

Before You Reboot
Don’t delay!
Console messages?
ping?
telnet, ssh, or rsh?
If db server, query response?
SysRq keys
LED’s?
Physical environment changes?
Dump?
Time of hang or crash

After the Reboot
kernel (uname -a)
loaded modules (lsmod)
bus information (lspci -w)
boot information (dmesg)
system logs (/var/log/*)
memory (/proc/meminfo)
cpu (/proc/cpuinfo)
disk (/proc/scsi/scsi)
disk partition (/proc/partitions)
installed rpm’s (/var/log/rpminfo, “rpm -qa”)
time of hang or crash
cpu details (dmidecode)
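
The list above can be gathered in one pass right after the reboot. The following is a minimal sketch: the function names `save_cmd` and `gather_postboot` and the output file names are hypothetical, `lspci -vv` is used as the verbose form, and copying `/var/log/*` is left out to keep the example short.

```shell
#!/bin/sh
# save_cmd <outfile> <command...>: run a command if present and keep its output.
# Hypothetical helper used by gather_postboot below.
save_cmd() {
    outfile="$1"
    shift
    if command -v "$1" >/dev/null 2>&1; then
        # Record a failure (for example, missing privileges) instead of aborting.
        "$@" > "$outfile" 2>&1 || echo "exit status $?" >> "$outfile"
    else
        echo "$1: not installed" > "$outfile"
    fi
}

# gather_postboot <dir>: collect post-reboot system facts into <dir>.
gather_postboot() {
    out="$1"
    mkdir -p "$out" || return 1
    save_cmd "$out/uname.txt"     uname -a
    save_cmd "$out/lsmod.txt"     lsmod
    save_cmd "$out/lspci.txt"     lspci -vv
    save_cmd "$out/dmesg.txt"     dmesg
    save_cmd "$out/rpm-qa.txt"    rpm -qa
    save_cmd "$out/dmidecode.txt" dmidecode
    # /proc/meminfo becomes _proc_meminfo.txt, and so on.
    for f in /proc/meminfo /proc/cpuinfo /proc/scsi/scsi /proc/partitions; do
        if [ -r "$f" ]; then
            cat "$f" > "$out/$(echo "$f" | tr / _).txt"
        fi
    done
    return 0
}
```

Running `gather_postboot /root/crash-$(date +%Y%m%d)` as root keeps everything for one incident in a single directory that can be handed to a vendor.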

Snapshot Tools for After Reboot
sysreport (RHEL)
sitar (SLES)
config.sh (SLES)
cfg2html (OpenSource)
h/w diagnostic tools (vendor-specific)

sysreport
Verify: rpm -qa | grep sysreport
What does it generate?
# ls -w 50
boot     free            ksyms    mount   rpm-Va
date     hardware.py     lib      proc    uname
df       hostname        ls-boot  ps      uptime
etc      ifconfig        lsmod    pstree  var
fdisk-l  installed-rpms  lspci    route

sitar
Generates various reports detailing
– add-on's
– installed packages
– system info
– yast installed packages
Reports created in different formats

sitar-generated files
# sitar
# ls /tmp/sitar-fwills.america.cpqcorp.net-2006020104/
.
..
sitar-addon-fwills.america.cpqcorp.net-yast2.sel
sitar-fwills.america.cpqcorp.net-yast1.sel
sitar-fwills.america.cpqcorp.net.html
sitar-fwills.america.cpqcorp.net.sdocbook.xml
sitar-fwills.america.cpqcorp.net.tex
sitar-sles-fwills.america.cpqcorp.net-yast2.sel

Sitar .html report
fwills.america.cpqcorp.net, Wed Feb 1 04:48:57 2006
Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux SUSE LINUX Enterprise Server 9 (i586)
Table of Contents 1. General Information 2. CPU
…
1. General Information
Hostname fwills.america.cpqcorp.net
Operating System SUSE LINUX Enterprise Server 9 (i586)
UName Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux
Date Wed Feb 1 04:48:57 2006
Main Memory 385976 KByte
Cmdline root=/dev/sda2 vga=0x317 selinux=0 resume=/dev/sda3 elevator=cfq splash=silent
Load 0.00 0.00 0.00 1/79 2084
Uptime (minutes hours days) 181715 3028 124
Idletime (minutes hours days) 37934 632 26
2. CPU

config.sh
What does it generate?
# ls -w 50
.              iscsi.txt     performance.txt
..             lvm.txt       rcd.txt
boot.txt       messages.txt  release.txt
chkconfig.txt  modules.txt   rpm.txt
config.sh.txt  mpio.txt      rug.txt
cron.txt       ncp.txt       scsi.txt
env.txt        network.txt   siga.txt
evms.txt       nss.txt       softraid.txt
hwinfo.txt     pam.txt       y2log.txt

Tools to view sysstat data
sar
isag
sarcheck

Sample sar commands
# sar -u 2 4
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 02/01/2006
03:01:50 PM  CPU  %user  %nice  %system  %iowait  %idle
03:01:52 PM  all   0.50   0.00     0.25     0.75  98.49
03:01:54 PM  all   0.00   0.00     0.50     0.00  99.50
03:01:56 PM  all   0.00   0.00     0.00     0.25  99.75
03:01:58 PM  all   0.25   0.00     0.50     0.00  99.25
Average:     all   0.19   0.00     0.31     0.25  99.25

# cd /var/log/sa
# sar -A -f sa01 > sar01-new
# ls -l sa*01*
-rw-r--r-- 1 root root 132720 Feb 1 15:10 sa01
-rw-r--r-- 1 root root 181575 Feb 1 15:08 sar01-new
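
When scanning a long `sar -u` report for the busiest interval, a short filter can pick out the sample with the lowest %idle. This is a sketch only: the function name `min_idle` is hypothetical, and it assumes the layout shown above, where %idle is the last column and the CPU column reads `all`.

```shell
#!/bin/sh
# min_idle: read "sar -u" text on stdin and print the timestamp and value
# of the sample with the lowest %idle. Hypothetical helper; assumes %idle
# is the last column and skips the "Average:" summary row.
min_idle() {
    awk '
        $1 != "Average:" && / all / {
            idle = $NF + 0
            if (!seen || idle < min) { min = idle; ts = $1; seen = 1 }
        }
        END { if (seen) printf "%s %.2f\n", ts, min }
    '
}
```

It can be fed either live output (`sar -u 2 4 | min_idle`) or a saved report (`sar -u -f sa01 | min_idle`).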

Reconstructing the Crash Scene

Check logs
/var/log/messages
– Search for kernel load entry
– Work backwards and look for:
Errors or Warnings
Oops messages with trace output
Other log files
df output

Oops
Oct 30 00:05:34 karp kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
Oct 30 00:05:34 karp kernel: printing eip:
Oct 30 00:05:34 karp kernel: c011ec5d
Oct 30 00:05:34 karp kernel: *pde = 2aefa001
Oct 30 00:05:34 karp kernel: Oops: 0000
Oct 30 00:05:34 karp kernel: Kernel 2.4.9-e.38enterprise
Oct 30 00:05:34 karp kernel: CPU: 1
Oct 30 00:05:34 karp kernel: EIP: 0010:[get_module_list+61/816] Tainted: P
Oct 30 00:05:34 karp kernel: EIP: 0010:[<c011ec5d>] Tainted: P
Oct 30 00:05:34 karp kernel: EFLAGS: 00010246
Oct 30 00:05:34 karp kernel: EIP is at get_module_list [kernel] 0x3d
Oct 30 00:50:00 karp syslogd 1.4.1: restart.
Oct 30 00:50:00 karp syslog: syslogd startup succeeded
Oct 30 00:50:00 karp kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 30 00:50:00 karp kernel: Inspecting /boot/System.map-2.4.9-e.38enterprise
Oct 30 00:50:00 karp syslog: klogd startup succeeded
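
The log excerpt shows the pattern to look for: Oops lines followed by a `syslogd ... restart` entry that marks the reboot. A rough sketch of automating the "work backwards" step follows; the function name `events_before_restart` and the search patterns are assumptions, and the restart marker is taken from the syslogd 1.4.1 message format shown above.

```shell
#!/bin/sh
# events_before_restart <logfile>: print the Oops/panic lines logged just
# before the most recent syslogd restart. Hypothetical helper; the patterns
# are illustrative and may need extending for other kernels.
events_before_restart() {
    awk '
        # A restart marker closes the current boot session.
        /syslogd .*restart/ { saved = buf; buf = ""; next }
        # Remember suspicious lines seen during the current session.
        /Oops|Unable to handle kernel|[Pp]anic/ { buf = buf $0 "\n" }
        END { printf "%s", saved }
    ' "$1"
}
```

Run against `/var/log/messages`, it prints nothing if the session before the last restart ended cleanly, which is itself a useful clue.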

ksymoops
>>EIP; c0113f8c <sys_init_module+49c/4d0>
Trace; c011d3f5 <sys_mremap+295/370>
Trace; c011af5f <do_generic_file_read+5bf/5f0>
Trace; c011afe9 <file_read_actor+59/60>
Trace; c011d2bc <sys_mremap+15c/370>
Trace; c010e80f <do_sigaltstack+ff/1a0>
Trace; c0107c39 <overflow+9/c>
Trace; c0107b30 <tracesys+1c/23>
Trace; 00001000 Before first symbol

SAR Data Example
15:20:00  dentunusd  file-sz  %file-sz  inode-sz  super-sz  %super-sz  dquot-sz  %dquot-sz  rtsig-sz  %rtsig-sz
15:30:00    1866554     2728      2.08   2055024         0       0.00         0       0.00         1       0.10
15:40:01    1866909     2785      2.12   2055020         0       0.00         0       0.00         1       0.10
…
17:20:00    1870217     2786      2.13   2055019         0       0.00         0       0.00         1       0.10
17:30:00    1870516     2762      2.11   2055022         0       0.00         0       0.00         1       0.10
17:40:00    1870848     2785      2.12   2055019         0       0.00         0       0.00         1       0.10
17:50:00    1569671     2156      1.64   1743619         0       0.00         0       0.00         1       0.10
18:00:00    1570730     1984      1.51   1744880         0       0.00         0       0.00         1       0.10
18:10:00    1571240     1792      1.37   1745241         0       0.00         0       0.00         1       0.10
18:20:00    1571768     1510      1.15   1745796         0       0.00         0       0.00         1       0.10
18:30:00    1572100     1483      1.13   1745826         0       0.00         0       0.00         1       0.10
18:40:00    1573175       16      0.01   1747980         0       0.00         0       0.00         1       0.10
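
In the table above, the interesting rows are the ones where a counter changes abruptly between two samples, such as `dentunusd` at 17:50:00. A filter along the following lines can flag those rows; it is a sketch, with the function name `flag_jumps` and its percentage threshold chosen purely for illustration.

```shell
#!/bin/sh
# flag_jumps <column> <percent>: read sar text on stdin and print every
# sample whose value in <column> differs from the previous sample by more
# than <percent> percent. Hypothetical helper; header and "…" rows are
# skipped because their field is not numeric.
flag_jumps() {
    awk -v col="$1" -v pct="$2" '
        $col ~ /^[0-9.]+$/ {
            if (have && prev > 0) {
                change = ($col - prev) / prev * 100
                if (change > pct || change < -pct)
                    printf "%s %s -> %s (%.1f%%)\n", $1, prev, $col, change
            }
            prev = $col
            have = 1
        }
    '
}
```

With the data above, `flag_jumps 2 10` reports the 17:50:00 sample, and `flag_jumps 3 50` reports the collapse of `file-sz` at 18:40:00.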

ISAG View of Sar Data

The Question of Debuggers
Torvalds quote:
“I do see some good points in a kernel debugger, but I have yet to be convinced that the good things outweigh the bad. The only valid uses of debuggers is to get a stack backtrace and a register dump, imho, and that is what you get from a kernel panic anyway (and the ksymoops.cc program will actually make it readable for others than just me ;-)
I'm afraid that I've seen too many people fix bugs by looking at debugger output, and that almost inevitably leads to fixing the symptoms rather than the underlying problems.”
Ref: http://www.ussg.iu.edu/hypermail/linux/kernel/9510/0103.html

In The Dumps
Recover the vmcore
Tools to analyze:
– netdump / diskdump: use crash
– LKCD: use lcrash or crash
– mkdump: use lcrash or crash
Key items: process stacks, system calls
Requirements needed from crashed system

Netdump
Check netdump log file first
– Oops or panic messages
– loaded modules
– SysRq memory, trace, process info
– Stack trace
Use “crash” on vmcore
syslog
Ref: /usr/share/doc/netdump-*

Netdump-server Files
# ls -l /var/crash/16.113.5.104-2003-12-15-12:21
total 141108
-rw------- 1 netdump netdump 63067 Dec 15 2003 log
-rw------- 1 netdump netdump 134205440 Dec 15 2003 vmcore

# du -sk /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
131192 /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
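
On a netdump server that collects dumps from several clients, it can be handy to list every vmcore with its size in one step. A minimal sketch follows, assuming the `/var/crash/<client>-<date>/vmcore` layout shown above; the function name `list_vmcores` is hypothetical.

```shell
#!/bin/sh
# list_vmcores [dir]: print the size in KB and path of every vmcore file
# found under <dir> (default /var/crash). Hypothetical helper.
list_vmcores() {
    dir="${1:-/var/crash}"
    find "$dir" -type f -name vmcore -exec du -sk {} \;
}
```

The output has the same form as the `du -sk` line on the slide, one line per dump.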

Netdump-server log
# more log
Oops: 0002
Kernel 2.4.9-e.3
CPU: 0
EIP: 0010:[<c8a44076>] Tainted: P
EFLAGS: 00010282
EIP is at init_module [crash] 0x16
eax: 00000013 ebx: c8a44000 ecx: 00000000 edx: c543e000
esi: 00000000 edi: 00000000 ebp: c3149f28 esp: c3149f20
ds: 0018 es: 0018 ss: 0018
Process insmod (pid: 5619, stackpage=c3149000)
Stack: 00000000 00000060 00000060 c0118eb5 00000000 c36fb000 00000098 c35c6000 00000060 ffffffea 00000005 c468b740 00000060 c8a3f000 c8a44060 000002e8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace: [<c0118eb5>] sys_init_module [kernel] 0x535
[<c8a44060>] init_module [crash] 0x0
[<c0106f03>] system_call [kernel] 0x33
Code: c6 05 00 00 00 00 00 b8 00 00 00 00 c9 c3 63 72 61 73 68 69
< netdump activated - performing handshake with the client. >
Process: 5619, { insmod}
…

Disk Dump
check for vmcore in /var/crash/127.0.0.1-<date>
use same crash tool as for netdump to debug
Ref: /usr/share/doc/diskdumputils-*/

LKCD
check /var/log/dump/n
Use lcrash or crash to analyze vmcore
Ref: /usr/share/doc/packages/lkcdutils/

What next?
H/W vendor
System Service Provider
OS vendor
OpenSource Community

Summary
Tools to use to prep system
Reconstruction of hang / crash scene
Before a reboot
After the reboot
Methods to approach looking at data

Questions & Answers

Make your system talk
Prepare your system now so it can tell you what happened!

Appendix – More Information
http://www.novell.com/coolsolutions/tools/16106.html -- SLES config.sh
http://come.to/cfg2html -- cfg2html utility to gather system information
http://www.linuxtroubleshooting.com/wiki/index.php?title=Main_Page -- Linux troubleshooting tools
http://www.volny.cz/linux_monitor/isag/ -- isag
http://rpmfind.net//linux/RPM/contrib/noarch/noarch/isag-4.1.1-1.noarch.html -- isag
http://www.sarcheck.com/sclinux.htm -- sarcheck
http://linuxgazette.net/issue59/nazario.html -- good dmesg description
http://lkcd.sourceforge.net -- lkcd
http://lkcd.sourceforge.net/doc/lcrash.pdf -- lcrash HOWTO
http://lkcd.sourceforge.net/doc/lkcd_tutorial.pdf -- good lkcd tutorial
/usr/share/doc/packages/lkcdutils/README.SuSE -- LKCD setup
http://www.novell.com/coolsolutions/feature/14813.html -- SLES lkcd
http://support.novell.com/cgi-bin/search/searchtid.cgi?10099561.htm -- SLES lkcd howto
/usr/share/doc/diskdumputils-*/README -- diskdump setup
http://www.redhat.com/support/wpapers/redhat/netdump/ -- netdump
/usr/share/doc/netdump*/README* -- netdump / netdump-server
http://www.linuxforums.org/forum/peripherals-hardware/35963-cpu-naming-schemes-x86-386-486-586-amd-64-ia64-em64t.html?highlight=naming+schemes -- good cpu chip reference
http://mkdump.sourceforge.net -- mkdump
http://lse.sourceforge.net/kdump/ -- kdump
http://www.linuxdevcenter.com/lpt/a/1319 -- “Linux System Failure Post-Mortem”, by Jennifer Vesperman (O’Reilly Network)
http://www.die.net/doc/linux/man/man5/proc.5.html -- manpage for /proc details
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0509wright/?ca=dgr-lnxw06DB2Linux -- good article on Linux memory utilization
http://www.ataassociates.com/Process.htm -- accident reconstruction

Appendix - Vocabulary
AMD64/EM64T – Similar X86 architectures w/ 64 bit mem registers collectively known as X86_64
ARC – Accident Reconstruction Consultant
ASR – Automatic Server Recovery
IA64 – CPU based on 64-bit Itanium chipset
ISAG – Interactive System Activity Grapher
lkcd – Linux Kernel Crash Dump utility
mkdump – minikernel dump utility
RHEL – Red Hat Enterprise Linux
SAR – System Activity Report
SLES – SuSE Linux Enterprise Server
SysRq (aka magic keys) – key sequence intercepted by kernel to perform certain operations
x86 – CPU based on Intel 80x86 chipset

Appendix – Pre-Crash Check List
Enable SysRq
Enable sysstat
Enable system management tools
Develop emergency procedures
Train staff in emergency procedures
Configure and enable dump utility
Take system snapshot on loaded/running system
Setup remote console access