Windows NT 4.0 Setup and Debugging
Joseph West, Sr Technology Specialist
Agenda
Setup (build overview)
  – Three phases of Setup
    • Character-Based Setup
    • Boot from Character-Based to GUI-Based Setup
    • GUI-Based Setup
Troubleshooting
  – (Blue Screens & Stop Codes)
Latest information for NT 4.0
  – SP4
Hardware Compatibility List
How important is it
Support parameters
http://www.microsoft.com/hwtest/
http://support.microsoft.com/
Character-Based Setup
Gathering of System Architecture Information
CPU Type
Motherboard Architecture
Hard Drive Controllers
File Systems
Disk Free Space
Memory
Info Gathered is Required for Basic System Initialization
‘Failure to Detect’ will lead to failure of Setup
Unsupported components and enhancements
  – PCI 2.1
  – Special Bus Drivers
  – Caching Chips for Burst Mode
Boot from Character-Based to GUI-Based Setup
Windows NT Kernel is loaded completely for the first time
  – Finds a valid Hard Drive
  – Polls Adapters and tests Bus
Most likely point of failure
  – Drivers are loaded into Memory and Multi-threading is initialized
GUI-Based Setup
Install secondary Drivers
Create Accounts
  – Machine and Administrator
Configure Network Settings
Build final System Tree and Registry
Troubleshooting Character-Based Setup
NTHQ Tool
Located in Support Directory
Purpose is to show all hardware peripheral settings
Works with PCI, PnP and Legacy peripherals
Troubleshooting Character-Based Setup
NTHQ Demo
Troubleshooting Character-Based Setup
Unsupported Controller and BIOS Enhancements
32-bit I/O
Enhanced Drive Access
Multiple Block Access or Rapid IDE
Power Management Features
Troubleshooting Character-Based Setup
Setup Hangs During Initial Boot
Disable CD-Boot capability before installing
  – Needs to be done at both the Controller and BIOS levels
Troubleshooting Character-Based Setup
Setup Cannot Find Hard Drive
Scan System for Viruses
Make certain there is a valid Boot Sector on the Hard Drive
Troubleshooting Character-Based Setup
Setup Cannot Find Hard Drive
If Hard Drive Controller is SCSI
  – Are devices properly terminated
  – Is SCSI BIOS enabled on the first Controller (if at all)
  – On secondary Controllers, make certain BIOS is disabled
  – Partition and format using current Controller
Troubleshooting Character-Based Setup
Setup Cannot Find Hard Drive
If Hard Drive Controller is IDE or EIDE
  – Make certain drive is on primary Controller Channel
  – Make certain drive is jumpered correctly
    • (i.e., Master, Slave, Independent)
Troubleshooting Character-Based Setup
Setup Does Not Detect Hard Drive Controller Correctly
Manually select Controller type
Make certain that an NT 4.0 driver is being loaded
Use NTHQ Tool to check for correct IRQ and Memory addressing
Troubleshooting Character-Based Setup
Setup Cannot Find a Valid Partition
If Windows 95 is on the system, back up and Fdisk the Hard Drive (no support for FAT32)
  – Recreate Partitions and Format with DOS 6.22
Restore Windows 95 and proceed with Windows NT installation
Make certain that correct HAL is being loaded
Troubleshooting Failure to Reboot From Character-Based to GUI-Based Setup
Stop Messages
Record Hex Value (0x1E, 0x7B, etc.)
Record Values in parentheses
Record component where failure occurred
Note where in Boot Process error occurred
Call PSS (installation support)
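The hex value and the values in parentheses can be pulled out of a transcribed stop line mechanically before calling support. The following is a minimal Python sketch, not a Microsoft tool; the function name `parse_stop_line` and the accepted line layout are assumptions made for illustration, based on the sample stop line shown later in this deck.

```python
import re

# Accepts lines shaped like:
#   STOP 0x0000001E (0xC0000005, 0xFDE38AF9, 0x00000001, 0x7E8B0EB4) NAME
# The symbolic name after the closing parenthesis is optional.
STOP_RE = re.compile(
    r"stop:?\s*(0x[0-9a-f]+)\s*\(\s*([^)]*?)\s*\)\s*([A-Z0-9_]+)?",
    re.IGNORECASE,
)

def parse_stop_line(line):
    """Return (code, parameters, name) for a stop line, or None if no match.

    parse_stop_line is a hypothetical helper, not part of any NT tool.
    """
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    # Parameters may be separated by commas, spaces, or both.
    params = [int(p, 16) for p in re.split(r"[,\s]+", match.group(2)) if p]
    return code, params, match.group(3)

if __name__ == "__main__":
    sample = ("Stop 0x0000001E ( 0xC0000005, 0xFDE38AF9, "
              "0x00000001, 0x7E8B0EB4 )KMODE_EXCEPTION_NOT_HANDLED")
    code, params, name = parse_stop_line(sample)
    print(f"0x{code:08X}", [f"0x{p:08X}" for p in params], name)
```

Keeping the code and parameters as integers makes them easy to compare against a table of known stop codes or to paste into a support request in a consistent format.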
Troubleshooting Failure to Reboot
Stop Messages Which can be Solved in the Field
0x7B (0x4,0,0,0) or 0x8B
  – Indicates problem with Master Boot Record
  – Scan for Viruses
  – Confirm correct Controller driver is loaded
  – Refresh Master Boot Record
Troubleshooting Failure to Reboot
After Reboot, Video Remains “Black”
Check for devices using IRQs 2, 9 or 12 (PCI)
Scan Hard Drive for Viruses
Troubleshooting Failure to Reboot
Stop Messages Which can be Solved in the Field
0x1E or 0xA
  – Disable any Third-party services or drivers which were loaded prior to Upgrade
  – Use NTHQ to confirm appropriate Memory and IRQ settings
Troubleshooting GUI-Based Setup Issues
Setup Will Not Read From CD-ROM Drive
Make certain CD is on HCL
Copy I386 directory to the Hard Drive and start again from the beginning
Make certain that the Controller and/or Hard Drive is correctly configured
Troubleshooting GUI-Based Setup Issues
If Setup Fails During Copy of Files to Hard Drive
Disable all external Caches in BIOS
Make certain Hard Drives are terminated correctly; active termination preferred
Setup Enhancements in Windows NT 4.0
Bootable CD-ROM
Supports only El Torito Specification
Can only be used in ‘No Emulation Mode’
Must be supported by both System and SCSI BIOS
Setup Enhancements in Windows NT 4.0
Winnt Character-Based Setup Logging
Using Winnt or Winnt32 /L:
  – Logs all actions during character-based setup to find last successful action
  – Helps to isolate where setup halted without requiring special DLLs
Setup Enhancements in Windows NT 4.0
Restartable GUI-Based Setup
If the machine fails during GUI-mode Setup, the problem can be fixed and Setup will continue after the reboot
Agenda
Setup (build overview)
  – Three phases of Setup
    • Character-Based Setup
    • Boot from Character-Based to GUI-Based Setup
    • GUI-Based Setup
Troubleshooting
  – (Blue Screens & Stop Codes)
Latest information for NT 4.0
  – SP4
You’re Up and Running, But ...
Debugging (the connection)
Connect
  – Modem, Null-modem cable, LAN
Boot.ini
  – /debug /debugport=com1 /baudrate=19200
Symbols
  – Retail NT CD, in the support\debug\[platform]\symbols sub-directory
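To show where the three Boot.ini switches go, here is a small Python sketch that appends them to one operating-system entry. The helper name `add_debug_switches` and the sample ARC path are assumptions for illustration only; the entry on a real machine will differ and should be copied from that machine's own Boot.ini.

```python
def add_debug_switches(entry, port="com1", baud=19200):
    """Append kernel-debugger switches to one [operating systems] line.

    add_debug_switches is a hypothetical helper. Switches already present
    (compared case-insensitively) are not added a second time.
    """
    wanted = ["/debug", f"/debugport={port}", f"/baudrate={baud}"]
    present = entry.lower().split()
    missing = [switch for switch in wanted if switch not in present]
    return " ".join([entry.rstrip()] + missing)

if __name__ == "__main__":
    # Illustrative entry only; not taken from the presentation.
    entry = r'multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00"'
    print(add_debug_switches(entry))
```

The baud rate chosen here has to match the rate configured on the debugger host at the other end of the modem or null-modem connection.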
Interpreting Blue Screens
The error code and parameters at the top of the screen
The list of modules that have successfully loaded and initialized in the middle of the screen
The list of modules that are currently on the stack at the bottom of the screen
Stop Codes
Note: For a complete listing of stop codes, see the Windows NT Workstation 4.0 Resource Kit, Chapter 39, “Windows NT Debugger”, or article Q142657 on http://support.microsoft.com
Common Stop Codes
0xA
0x1E
0x24
0x3F
0x50
0x7B
0x7F
0xC000021A
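For quick reference, the eight codes can be kept in a small lookup table. The Python sketch below uses the symbolic names given on the individual stop-code slides in this deck; the helper `describe_stop` is a hypothetical name, and the table is deliberately limited to these eight codes rather than being a complete list.

```python
# Symbolic names as given on the individual stop-code slides.
STOP_NAMES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000024: "NTFS_FILE_SYSTEM",
    0x0000003F: "NO_MORE_SYSTEM_PTES",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x0000007F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xC000021A: "FATAL_SYSTEM_ERROR",
}

def describe_stop(code):
    """Return '0xXXXXXXXX NAME' for an int or a hex string such as '0x7b'."""
    if isinstance(code, str):
        code = int(code, 16)
    name = STOP_NAMES.get(code, "unknown stop code")
    return f"0x{code:08X} {name}"

if __name__ == "__main__":
    for short_form in ("0xA", "0x7b", "0xC000021A"):
        print(describe_stop(short_form))
```

Short forms such as 0xA and the full eight-digit form 0x0000000A resolve to the same entry because the key is the numeric value, not the text.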
0xA
0x0000000A IRQL_NOT_LESS_OR_EQUAL
Description
  – An attempt was made to touch paged-out memory at a process interrupt request level (IRQL) that is too high. Code that runs at higher interrupt levels can’t touch paged-out memory because paging would be too expensive. If a pageable page is not committed but its virtual address range is still in the translation buffer, high-IRQL code can get away with touching it. If the system is stressed, however, the memory manager will likely have paged that page out, and the bugcheck occurs when the page-in is attempted. This is why certain bugs tend not to show up on developers’ machines, which are less stressed than production systems.
Typical Scenarios
  – System configuration changes, virus scanners, other file I/O filters.
0x1E
0x0000001E KMODE_EXCEPTION_NOT_HANDLED
Description
  – Essentially, this bugcheck identifies an error that occurred in a section of code where no error detection routines were in place. Most exceptions are generated directly in the section of code that is executing. In this case, the error was not trapped in the code that was executing, so it was allowed to fall through to this default error handler. This makes the error a very common exception. The actual instruction fault is usually similar to a STOP 0xA, that is, a memory access violation.
Typical Scenarios
  – Invalid or obsolete third-party driver or system service, Microsoft driver or system service bug, file I/O filter drivers.
0x24
0x00000024 NTFS_FILE_SYSTEM
Description
  – A STOP 0x24 is the result of NTFS code that detects a problem with the structure of the NTFS file system. This is not a cut-and-dried exception code, and debugging it is sometimes difficult. Disk corruption can generate a STOP 0x23 (FAT_FILE_SYSTEM) or a STOP 0x24. However, any process involved in reading or writing data on a FAT or NTFS file system could cause the disk data to appear corrupted. Therefore SCSI and IDE drivers, as well as the disk structure itself (hard errors, i.e. bad blocks), can be suspect. The file system calls this bugcheck in multiple places, which helps identify the actual source line that generated it. This bugcheck can also be caused by I/O filter drivers (resource hangs, race conditions, etc.). After the above is eliminated, lower-level constructs such as file system synchronization objects, SCB attributes, etc. need to be examined by the debug engineer.
Typical Scenarios
  – This bugcheck is encountered when the NTFS file system is corrupted or the hard drive has a bad block.
0x3F
0x0000003F NO_MORE_SYSTEM_PTES
Description
  – This stop isn’t as common as most of the others in this section, but a good explanation is warranted. A STOP 0x3F is the result of a system doing lots of I/O, thereby fragmenting the system PTEs. The bugcheck occurs not because the system is out of PTEs, but because a driver requests a huge chunk of memory that can’t be satisfied because a contiguous block that big isn’t available.
Typical Scenarios
  – Often video drivers will allocate large amounts of kernel memory that must succeed. Some backup programs do the same.
  – For these situations, consult a PSS engineer for the Registry hack that allows the increase of total system PTEs.
0x50
0x00000050 PAGE_FAULT_IN_NONPAGED_AREA
Description
  – A STOP 0x50 is caused when a memory region that is not supposed to be paged out (usually for performance reasons) is paged out. This stop can be caused by a variety of problems, including corrupt NTFS volumes, bad network packet data, and in general kernel mode drivers that corrupt memory. Drivers that free an MDL without communicating it to all portions of the driver can also cause it. Other causes include Disk, Controller, and Disk Driver problems.
Typical Scenarios
  – Usually third-party kernel mode drivers munging memory, or reading beyond allowable memory. Also, when the file system is pushed to the tested limits (large Mac volumes), bugs in NTFS are exposed that result in this STOP. This STOP can occur due to interaction problems between SCSI Controller firmware and Hard Drive firmware.
0x7B
0x0000007B INACCESSIBLE_BOOT_DEVICE
Description
  – During the initialization of the I/O system, the driver for the boot device may have failed to initialize the device that the system is attempting to boot from, or the file system that is supposed to read that device may have either failed its initialization or simply not recognized the data on the boot device as a file system structure.
  – If this is the initial setup of the system, this error may have occurred because the system was installed on an unsupported Hard Disk or SCSI Controller.
  – This error can also be caused by the installation of a new SCSI Adapter or Hard Disk Controller or by repartitioning the Hard Disk with the System Partition.
Typical Scenarios
  – Virus
  – LBA type problems, MBR type problems, SCSI Controller/Hard Drive geometry issues, etc.
0x7F
0x0000007F UNEXPECTED_KERNEL_MODE_TRAP
Description
  – This error means a trap occurred in kernel mode, either a kind of trap that the kernel is not allowed to have or catch (a bound trap), or a kind of trap that is always instant death (double fault).
Typical Scenarios
  – Hardware, kernel mode drivers that manipulate critical system data in an untimely fashion.
  – This STOP most often is the result of the processor taking a double fault: 0x7F (8,0,0,0). Note that these parameters can also show up for a modern software issue involving Netmon (bhnt.sys).
0xC000021A
0xC000021A FATAL_SYSTEM_ERROR
Description
  – This is a typical description that accompanies this error: The Windows Subsystem System process terminated unexpectedly with a status of (0x6130F2B6 0x01B6FBA4). The system has been shutdown.
  – The failing process sometimes is listed in the blue screen itself.
  – This bugcheck occurs when a user-mode subsystem such as Winlogon or CSRSS is fatally compromised such that security cannot be guaranteed. The Operating System makes a transition into kernel mode and throws this exception.
Typical Scenarios
  – A typical cause of this crash would be an extensible perfmon counter that overwrites its Winlogon shared data buffer (Q171033), and in general any access violation that compromises a user-mode subsystem.
Break
Agenda
Setup (build overview)
  – Hardware Compatibility List
  – Three Phases of Setup
  – Character-Based Setup
  – Boot from Character-Based to GUI-Based Setup
  – GUI-Based Setup
Troubleshooting
  – (Blue Screens & Stop Codes)
Latest Information for NT 4.0
  – SP4
A Day in the Life
Video
NT4 Service Pack 4
Contents
  – Hotfixes for important customer-reported problems
  – Resource and memory leak bugfixes from NT5
  – 30+ support, diagnostic and repair tools from the NT Resource Kit are included on the SP4 CD-ROM
  – Event log entries for clean and dirty shutdown
Process Improvements
  – Dedicated Service Pack test team
  – Beta Program for Service Packs
  – Improving the Knowledge Base, depth and ease of use
  – Slipstreaming Service Packs into OEM releases
Resource / Memory Leaks
Problem
  – Leaks lead to hung systems and bluescreen crashes
  – Some customers do “preventive reboots”
  – Difficult to stop or kill the offending process
Solutions
  – Fix leaks: several hundred in NT5, key fixes in NT4 SP4
  – Job objects in NT5: set memory limits on a collection of processes
  – Visual Studio adding leak checking to MFC and CRT
Next Work Items
  – Better leak detection
  – Logging in under low resource conditions
  – Stopping and killing processes
Bugchecks (Blue Screens)
Kernel mode code detected a serious error
  – Blue screens are still frequent and very hard to diagnose
  – Crash dumps take too long on large memory systems
Prevention
  – Find and fix bugs in our code
  – Review all calls to KeBugCheck by NT5 RTM
Improve diagnosis
  – Reduced clutter on the blue screen, focus on key data, and add hints
  – Crash dumps are now dramatically faster in NT5
  – Developing comprehensive crashdump analysis tools for NT4 and NT5
Bugchecks (Blue Screens)
Stop 0x0000001E (0xC0000005, 0xFDE38AF9, 0x00000001, 0x7E8B0EB4)
KMODE_EXCEPTION_NOT_HANDLED
Address <x> has base at <x> - <filename> <manufacturer> <version>
If this is the first time you've seen this Stop error screen, restart your computer. If this screen appears again, follow these steps:
Check to make sure any new hardware or software is properly installed. If this is a new installation, ask your hardware or software manufacturer for any Windows NT updates you might need.
If problems continue, disable or remove any newly installed hardware or software. Disable BIOS memory options such as caching or shadowing. If you need to use Safe Mode to remove or disable components, restart your computer, press F8 to select Advanced Startup Options, and then select Safe Mode.
Refer to your Getting Started manual for more information on troubleshooting Stop errors.
3rd Party Drivers
Problem
  – One of the most common complaints from PSS
  – Source of pool corruption, difficult to diagnose
Solution
  – DDK driver samples and documentation are improved in NT5
  – Enhanced driver testing in NT4 and NT5, including pool corruption tests
  – NT5 will have driver signing, “warning” level by default
  – WDM drivers will drive higher quality
  – We are testing major third-party anti-virus software regularly
Unnecessary Reboots in NT5
Problem
  – Hardware and software configuration and maintenance
Solutions
  – Fixed 50 software configuration cases which required a reboot in NT4. Key fixes include:
    • Adding, removing and configuring network protocols; changing IP addresses
    • Reconfiguring settings on PCI and other PnP hardware
  – Reboots still required for some rare cases
    • Machine name change, domain membership changes, system locale and system font changes, service pack installation
  – Hardware reconfiguration by clustering solutions in NTS/E
  – Where possible, hotfixes will avoid requiring a reboot
Diagnosis and Recovery
Recovery Involves
  – Detection (hard with a hung application or server)
  – Diagnosis (need good tools, need parallel installs, bad error messages)
  – System Recovery (chkdsk, crash dump biggest time hits)
  – Application recovery (SQL, Exchange Store, etc.)
We are delivering
  – 30+ of the most critical support, diagnostic, and repair tools in SP4 and NT5 B2
  – Fixing 35 worst error messages by B2+30, then next 200 as time allows
  – NT5 Safe-mode Boot today and Floppy Boot by NT5 RTM
    • Both support NTFS
  – Web-based troubleshooter for most common bluescreens
  – Online chkdsk post NT5
NT Test Initiatives
Long duration Server stress
  – 10 Servers running stress for a month+ starting at NT5 Beta 2
  – Mix of stress including BackOffice, IIS, Client/Server, etc.
  – Specifically watching for memory and resource leaks
Improved driver testing for NT4 and NT5
  – Catch pool corruption
  – Fault injection
Better integration testing of Server applications
  – BackOffice applications: Exchange, SQL Server
  – Using automated scripts from BackOffice teams
  – Testing with Oracle, SAP R/3, Lotus Notes
  – 100 Top Server Applications from Tier 1 RDP customers
Expanded tests for customer configurations
  – RDP Customer configurations, ISP
Resource Kit Tools
Network Diagnostic and Support Tools
  – nettest - quickly determine whether the local user's network is configured properly (IDW)
Applications, Service Problems and Memory Leaks
  – memsnap - detection of memory and resource leaks over time (dump directory)
Disk Problems
  – fixacls - resets ACLs on system files to installation defaults, fixes users who hose their ACLs
Debugger Tools
  – debug wizard - easy setup of debuggers for customers
Other
  – windiff - file compare utility, critical for many situations (reskit)
Event Log Analyst
Prototype tool for collecting and analyzing event log reliability data
Designed for collecting reliability trend data from an entire datacenter in a few hours
  – Collected data from 800+ CDC servers in 5 hours
  – Analysis is manual with Excel, less than 3 hours
Provides trend analysis of reboots, bugchecks, and Dr Watsons
Event Log Analyst
[Diagram: a Local Area Network with SERVER-01, SERVER-02 through SERVER-N and a Collection Server. Labels: Server List, Server Info, Reboot Info, Bugcheck Info, Dr Watson Info.]
Event Log Analyst Metrics
Mean time between reboots
Mean time between bugchecks
Mean time between Dr Watsons
Trend analysis of reboots/server-year
Trend analysis of bugchecks/server-year
Trend analysis of Dr Watsons/server-year
Bugcheck distribution
Dr Watson distribution
SP4 Only: Availability percentage
SP4 Only: Mean time to repair
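The presentation does not define how Event Log Analyst computes these metrics, so the Python sketch below uses common textbook definitions as an assumption: mean time between events is uptime divided by event count, the per-server-year rate divides the event count by observed server-years, availability is uptime over observed time, and mean time to repair is downtime divided by event count. The function name and its parameters are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24

def reliability_metrics(observed_hours, event_count, downtime_hours=0.0):
    """Compute illustrative reliability metrics for one event type.

    observed_hours is the total server-hours covered by the collected logs
    (summed over all servers); event_count is the number of reboots,
    bugchecks, or Dr Watsons found. These definitions are assumptions,
    not the tool's documented formulas.
    """
    uptime_hours = observed_hours - downtime_hours
    server_years = observed_hours / HOURS_PER_YEAR
    return {
        "mean_time_between_hours": uptime_hours / event_count if event_count else None,
        "events_per_server_year": event_count / server_years,
        "availability_percent": 100.0 * uptime_hours / observed_hours,
        "mean_time_to_repair_hours": downtime_hours / event_count if event_count else None,
    }

if __name__ == "__main__":
    # Made-up numbers: 10 servers observed for 876 hours each, 4 reboots,
    # 2 hours of total downtime.
    for key, value in reliability_metrics(10 * 876, 4, downtime_hours=2.0).items():
        print(key, value)
```

Normalizing by server-years lets datacenters of different sizes and different collection windows be compared on the same scale.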
Tools for NT4 SP4 and NT5
Network Diagnostic and Support Tools
  – browstat - only useful tool for diagnosing browser problems (reskit)
  – dhcpcmd - useful for fixing DHCP issues (reskit)
  – dnscmd - diagnose and repair DNS problems (reskit)
  – eseutil - used for WINS and DHCP database diagnosis and repair
  – nettest - quickly determine whether the local user's network is configured properly (IDW)
  – winscl - diagnose and repair WINS (reskit)
  – winsadd - command line tool for batching static and dynamic entries in WINS
  – nltest - used for resetting secure channels, diagnosing and fixing trust problems (reskit)
Tools for NT4 SP4 and NT5
Applications, Service Problems and Memory Leaks
  – depends - display and troubleshoot application dependency problems (IDW)
  – tlist - list running processes, used in conjunction with kill (reskit)
  – kill - forcibly terminate processes (reskit)
  – memsnap - detection of memory and resource leaks over time (dump directory)
  – pmon - detection of memory and resource leaks over time (reskit)
  – pviewer - gather extended information about running processes (reskit)
  – reg - registry utility, used for diagnosis and repair of many types of issues
Tools for NT4 SP4 and NT5
Disk Problems
  – disksave - saves and restores the MBR (reskit)
  – fixacls - resets ACLs on system files to installation defaults, fixes users who hose their ACLs
  – ftedit - used daily to help customers repair fault tolerant volumes (reskit)
Debugger Tools
  – gflags - set global flags needed for various kinds of debugging (IDW)
  – remote - allow remote debugging by PSS (reskit)
  – debug wizard - easy setup of debuggers for customers
  – all standard debuggers - already ship in /support dir
Tools for NT4 SP4 and NT5
Other
  – uptomp - update system from uniproc to multiproc (reskit)
  – robocopy - used daily by PSS during support calls, easiest way to move large amounts of data around very quickly
  – shutdown - remote shutdown of systems (reskit)
  – ntevntlg.mdb & ntmsgs.hlp - better error message docs (reskit)
  – windiff - file compare utility; critical for many situations (reskit)
  – dumpel - dump event log messages from local or remote systems (reskit)
  – list - used daily by PSS for reviewing exceedingly large log files, etc.
Summary
Best Practices matter
  – Mature, disciplined planning & procedures
  – Design, Implement, Test
  – Configuration & Operational control
Technology matters
  – OS system services
  – UPS, RAID, ECC Memory, multi-homing
  – Cluster Services
We can deliver availability with Windows NT today
Microsoft is investing heavily in availability
References and Resources
http://www.microsoft.com/ntserver/
http://www.microsoft.com/ntworkstation/
http://www.microsoft.com/windowsnt5/
http://www.microsoft.com/hwtest/
http://support.microsoft.com/
http://support.microsoft.com/support/kb/articles/q103/0/59.asp
  – Descriptions of Bug Codes for Windows NT
References and Resources
Inside Windows NT, Second Edition, David A. Solomon, MS Press, 1998
Windows NT Workstation 4.0 Resource Kit
  – Chapter 19: “What Happens When You Start Your Computer”
  – Chapter 21: “Troubleshooting Startup and Disk Problems”
  – Chapter 36: “General Troubleshooting”
  – Chapter 39: “Windows NT Debugger”, or article Q142657
Supporting Windows NT Server in the Enterprise, MS Press, 1998
  – Chapter 7: “Troubleshooting Tools and Methods”
Questions?