AIX 5L Internals Student Guide Version 20001015 IBM Web Server Knowledge Channel

AIX 5L Internals

Student Guide
Version 20001015

IBM Web Server Knowledge Channel

Student Guide Draft Version for review, Sunday, 15. October 2000, title.fm

Trademarks

IBM® is a registered trademark of International Business Machines Corporation. UNIX is a registered trademark in the United States, other countries, or both and is licensed exclusively through X/Open Company Limited.

<<< list any other Trademarks used in the course materials >>>

July 2000 Edition

The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2000. All rights reserved. This document may not be reproduced in whole or in part without the prior written permission from IBM. Information in this course is subject to change without notice.

Web Server Knowledge Channel
Technical Education

© Copyright IBM Corp. 2000. Version 20001015. Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Contents

Kernel Overview
  Kernel Overview ... 1-2
  Kernel states ... 1-8
  Kernel exercise ... 1-10
  Kernel Limits ... 1-12
  Kernel Limits Exercises ... 1-16
  64-bit Kernel base enablement ... 1-17
  64-bit Kernel stack ... 1-24
  CPU big- and little-endian ... 1-26
  Multi Processor dependent designs ... 1-28
  Command and Utility compatibility for 32-bit and 64-bit kernels ... 1-29
  Exceptions ... 1-30
  Interrupts ... 1-33
  Interrupt handling in AIX ... 1-35
  Handling CPU state information at interrupt ... 1-36
  Handling CPU state information at interrupt ... 1-37

IA-64 Hardware Overview
  IA-64 Hardware Overview ... 2-2
  IA-64 formats ... 2-3
  IA-64 memory ... 2-5
  IA-64 Instructions ... 2-8
  IA-64 Registers ... 2-12
  IA-64 Operations ... 2-19

Power Hardware Overview
  Power Hardware Overview ... 3-2
  Power CPU Overview ... 3-8
  64 bit CPU Overview ... 3-15

SMP Hardware Overview
  SMP Hardware Overview ... 4-2

Configuring System Dumps on AIX 5L
  About This Lesson ... 5-3
  System Dump Facility in AIX5L ... 5-5
  Configuring for System Dumps ... 5-7
  Obtaining a Crash Dump ... 5-16
  Dump Status and completion codes ... 5-17
  dumpcheck utility ... 5-19
  Verify the dump ... 5-21
  Packaging the dump ... 5-26

Introduction to Dump Analysis Tools
  About This Lesson ... 6-2
  System Dump Analysis Tools ... 6-6
  dump components ... 6-7
  Dump creation process ... 6-8
  Component dump routines ... 6-9
  bosdebug command ... 6-10

  Memory Overlay Detection System ... 6-11
  System Hang Detection ... 6-14
  truss command ... 6-20
  KDB kernel debugger ... 6-23
  kdb command ... 6-25
  KDB miscellaneous sub commands ... 6-26
  KDB dump/display/decode sub commands ... 6-29
  KDB modify memory sub commands ... 6-33
  KDB trace sub commands ... 6-36
  KDB break point and step sub commands ... 6-38
  KDB name list/symbol sub commands ... 6-42
  KDB watch break point sub commands ... 6-43
  KDB machine status sub commands ... 6-45
  KDB kernel extension loader sub commands ... 6-47
  KDB address translation sub commands ... 6-49
  KDB process/thread sub commands ... 6-50
  KDB Kernel stack sub commands ... 6-58
  KDB LVM sub commands ... 6-60
  KDB SCSI sub commands ... 6-62
  KDB memory allocator sub commands ... 6-65
  KDB file system sub commands ... 6-69
  KDB system table sub commands ... 6-72
  KDB network sub commands ... 6-77
  KDB VMM sub commands ... 6-80
  KDB SMP sub commands ... 6-86
  KDB data and instruction block address translation sub commands ... 6-87
  KDB bat/brat sub commands ... 6-89
  IADB kernel debugger ... 6-90
  iadb command ... 6-92
  IADB break point and step sub commands ... 6-93
  IADB dump/display/decode sub commands ... 6-96
  IADB modify memory sub commands ... 6-100
  IADB name list/symbol sub commands ... 6-105
  IADB watch break point sub commands ... 6-106
  IADB machine status sub commands ... 6-108
  IADB kernel extension loader sub commands ... 6-110
  IADB address translation sub commands ... 6-111
  IADB process/thread sub commands ... 6-112
  IADB LVM sub commands ... 6-114
  IADB SCSI sub commands ... 6-115
  IADB memory allocator sub commands ... 6-116
  IADB file system sub commands ... 6-117
  IADB system table sub commands ... 6-118
  IADB network sub commands ... 6-119
  IADB VMM sub commands ... 6-120
  IADB SMP sub commands ... 6-122
  IADB block address translation sub commands ... 6-123
  IADB bat/brat sub commands ... 6-124
  IADB miscellaneous sub commands ... 6-125

  Exercise ... 6-127

Process Management
  Process Management Fundamentals ... 7-2
  Process operations fork() system call ... 7-3
  Process operations exec() system call ... 7-8
  Process operations exec() system call ... 7-10
  Process operations exit system call ... 7-12
  Process operations, wait() system call ... 7-13
  Kernel Processes ... 7-16
  Thread Fundamentals ... 7-17
  AIX Thread ... 7-19
  Thread Concepts ... 7-21
  Threads Models ... 7-22
  Thread states ... 7-25
  Thread Management ... 7-27
  Process swapping ... 7-28
  Thread Scheduling ... 7-29
  The Dispatcher ... 7-33
  AIX run queues ... 7-36
  Process and Threads data structures ... 7-39
  Process and Threads data structures addresses ... 7-43
  What is new in AIX 5 ... 7-48
  Signals ... 7-50
  Signal handling ... 7-51
  Signals ... 7-53
  Exercises ... 7-57

Memory Management
  Overview of Virtual Memory Management ... 8-2
  Memory Management Definitions ... 8-3
  Demand Paging ... 8-5
  Memory Objects ... 8-7
  Memory Object types ... 8-8
  Page Mapping ... 8-9
  Page Not In Hardware Frame Table ... 8-11
  Page on Paging Space ... 8-13
  Loading Pages From The Filesystem ... 8-17
  Filesystem I/O ... 8-18
  Free Memory and Page Replacement ... 8-19
  vmtune ... 8-21
  Fatal Memory Exceptions ... 8-22
  Memory Objects (Segments) ... 8-23
  Shared Memory segments ... 8-25
  shmat Memory Services ... 8-26
  Memory Mapped Files ... 8-28

IA-64 Virtual Memory Manager
  IA-64 Addressing Introduction ... 9-2
  Regions ... 9-3
  Region Registers ... 9-4
  Address Translation ... 9-5

  Single vs. Multiple Address Space ... 9-7
  AIX 5L Region Usage ... 9-8
  Memory Protection ... 9-10
  LP64 Address Space ... 9-14
  ILP32 Address Space ... 9-15
  Exercise ... 9-16

LVM
  Logical Volume Manager overview ... 11-3
  Data Integrity and LVM Mirroring ... 11-12
  LVM Striping ... 11-15
  LVM Performance ... 11-17
  Physical disk layout Power ... 11-21
  VGSA structure ... 11-30
  Physical disk layout IA-64 ... 11-31
  LVM Passive Mirror Write Consistency ... 11-36
  AIX 5 LVM Hot Spare Disk in a Volume group ... 11-40
  LVM Hot spot management ... 11-42
  LVM split mirror AIX 4.3.3 ... 11-45
  LVM Variable logical track group (LTG) ... 11-46
  LVM command overview ... 11-47
  LVM Problem Determination ... 11-48
  Trace LVM commands with the trace command ... 11-51
  LVM Library calls ... 11-56
  logical volume device driver LVMDD ... 11-57
  Disk Device Calls ... 11-58
  Disk low level Device Calls such as SCSI calls ... 11-60
  Exercises ... 11-61

Enhanced Journaled File System
  J2 - Enhanced Journaled File System ... 12-2
  Aggregate ... 12-3
  Allocation Groups ... 12-7
  Filesets ... 12-8
  Extents ... 12-10
  Binary Trees of Extents ... 12-12
  inodes ... 12-15
  File Data Storage ... 12-19
  fsdb Utility ... 12-23
  Exercise 1 - fsdb ... 12-24
  Directory ... 12-27
  Directory Examples ... 12-31
  Exercise 2 - Directories ... 12-35

Logical and Virtual File Systems
  General File System Interface ... 13-2
  Logical File System ... 13-4
  User File Descriptor Table ... 13-6
  System File Table ... 13-7
  Virtual File System ... 13-9
  Vnode/vfs interface ... 13-10

  Vnodes ... 13-11
  vfs and vmount ... 13-12
  File and Filesystem Operations ... 13-14
  gfs ... 13-15
  vnodeops ... 13-16
  vfsops ... 13-17
  The Gnode ... 13-18
  Exercise 1 ... 13-20
  Lab Exercise 1 ... 13-21
  Lab Exercise 2 ... 13-26

AIX 5L boot
  What is boot ... 14-2
  Various Types of boot ... 14-3
  Systems types and Kernel images ... 14-5
  RAMFS and prototype files ... 14-6
  Boot Image Creation ... 14-8
  AIX 5L Distributions ... 14-10
  Checkpoint ... 14-11
  Instructor Notes ... 14-12
  The Power Boot Mechanism ... 14-13
  Power boot disk layout ... 14-14
  AIX 5L Power boot record ... 14-16
  Instructor Notes ... 14-20
  Power boot images structures ... 14-21
  RSPC boot image hints header ... 14-22
  CHRP Boot image ELF structure ... 14-24
  CHRP boot image ELF structure - Continued ... 14-25
  CHRP boot image ELF structure - Continued ... 14-26
  Exercise ... 14-27
  Instructor Notes ... 14-28
  Power ROS and Softros ... 14-30
  IPLCB on Power ... 14-31
  Checkpoint ... 14-33
  Instructor Notes ... 14-34
  The IA-64 Boot Mechanism ... 14-35
  IA-64 boot disk layout ... 14-37
  EFI boot manager and boot maintenance manager overview ... 14-39
  EFI Shell Overview ... 14-40
  IA-64 Boot Loader ... 14-43
  IA-64 Initial Program Load Control Block ... 14-44
  Checkpoint ... 14-45
  Instructor Notes ... 14-46
  Hard Disk Boot process (rc.boot Phase I) ... 14-47
  Hard Disk Boot process (rc.boot Phase II) ... 14-48
  Hard Disk Boot process (rc.boot Phase III) ... 14-49
  CDROM Boot process (rc.boot Phases I, II and III) ... 14-50
  Tape Boot process (rc.boot Phases I, II and III) ... 14-51
  Network Boot process (rc.boot Phases I, II and III) ... 14-52
  Common Boot process (rc.boot Phase III) ... 14-53

  Network boot $RC_CONFIG files ... 14-54
  The init process ... 14-56
  ODM Structure and usage ... 14-57
  boot and installation logging facilities ... 14-63
  Debugging boot problems using KDB ... 14-65
  Debugging boot problems using IADB ... 14-67
  Packaging Changes ... 14-69
  Checkpoint ... 14-71
  Instructor Notes ... 14-72

/proc Filesystem Support

/proc Filesystem Support
Types of Files
The as File
The ctl File
The status File
The psinfo file
The map File
The cred File
The sigact File
lwp/lwpctl file
The lwp/lwpstatus File
The lwp/lwpsinfo File
Control Messages
PCSTOP, PCDSTOP, and PCWSTOP
PCRUN
PCSTRACE
PCCSIG
PCKILL, PCUNKILL
PCSHOLD
PCSFAULT
Directories
Code Example


Unit 1. Kernel Overview

This overview describes the concepts used in the AIX 5L kernel.

What You Should Be Able to Do

After completing this unit, you should be able to:

• Identify major components of the kernel.

• Identify the major differences between AIX 5L and previous versions of AIX.

• Determine what kernel to use.

• Determine what the kernel limits are.

• Find out if a thread is in user or kernel mode.

• Define the kernel address layout.

• Describe the steps the kernel takes in handling an interrupt or exception.


Kernel Overview

Introduction

Up until AIX 5L, the kernel was a 32-bit kernel for the Power architecture only. AIX version 4.3 introduced 64-bit application enablement on Power: the kernel was still 32-bit, but a 64-bit environment was available through a kernel extension. AIX 5L now features both a 32-bit and a 64-bit kernel on Power systems, and a 64-bit kernel on the IA-64 architecture.

This overview describes the concepts used in the kernel in general and in the 64-bit kernel specifically.

Kernel description

The kernel is the base program of the computer. It is an intermediary between the applications and the computer hardware. There is no need for applications to have specific knowledge of any kind of hardware. Processes, that is, programs in execution or running programs, just ask for a generic task to complete (like ‘give me this file’) and the kernel will go out and get it.

The kernel is the first and most important program on the computer. It can access things other programs cannot. It can create and destroy processes, and it controls the way programs run. Resource usage is balanced by the kernel in order to keep everybody happy.

Functions of the kernel

The kernel provides the system with the following functions:

• Create, manage and delete processes.

• Schedule and balance resources.

• Provide access to devices.

• Handle asynchronous events.

The kernel manages resources so they can be shared simultaneously among many processes and users. Resources can be physical, like the CPU, the memory or an adapter, or they can be virtual, like a lock or a slot in the process table.

Uniprocessor support

The 64-bit kernel is aimed at the high-end server environment and multiprocessor hardware. As a result, it is optimized strictly for the multiprocessor environment and no separate uniprocessor version is provided.


64-bit vs. 32-bit kernel

The primary purpose of the 64-bit AIX kernel is to address the fundamental need for workload scalability. This is achieved through a kernel address space which is large enough to support increases in software resources.

The demands placed on the system software by customer applications will soon outstrip the existing AIX 32-bit kernel because of the 32-bit kernel’s limited address space. At 4 GB, this address space is simply too small to handle efficiently the amount of software resources needed to support the projected 2001 workloads and hardware. In fact, a number of software resource pools within the 32-bit kernel are already under pressure from today’s application workloads.

32-bit kernel life time

Customers have made and will continue to make significant investment in 32-bit RS/6000 hardware systems and need system software that protects this investment. Thus, AIX also offers a 32-bit kernel. The RS/6000 software plan is to eventually drop support for the 32-bit kernel. However, support will not be withdrawn before 2002, and not until after the initial 64-bit kernel release. This process is driven by end-of-life plans for 32-bit hardware systems, as well as the fact that customers require a bridge period during which both the 32-bit and 64-bit kernels are available for 64-bit hardware systems and offer the same basic functionality. This period is needed to ease migration to the 64-bit kernel.

Compatibility

Customers need system software that protects their investment in existing applications and provides binary and source compatibility. AIX 5L will therefore maintain support for existing 32-bit applications.


Kernels supported by hardware platform

The table below shows which kernels are supported on different systems. In general, a 64-bit kernel and application can only run on 64-bit hardware, but 64-bit hardware can execute 32- and 64-bit kernels and applications.

Currently, RS/6000 systems use the CPU types listed in the second table; only the PowerPC 604e CPU is 32-bit.

Binary compatibility and limitations

The 64-bit kernel offers binary compatibility to existing applications for both 32-bit and 64-bit applications. However, it does not extend to the minority of applications that are built non-shared or have intimate knowledge of internal details, such as programs accessing /dev/kmem or /dev/mem. This is consistent with the general AIX policy for these two classes of applications.

In addition, binary compatibility will not be provided to applications that are dependent on existing kernel extensions that are not ported to the 64-bit kernel environment. Only 64-bit kernel extensions will be supported. This direction is taken to avoid the significant cost of providing 32-bit kernel extension support under the 64-bit kernel, and is consistent with the directions taken by other UNIX vendors such as SUN, HP, DEC and SGI. On the plus side, this direction also forces kernel extensions to migrate to the more scalable and strategic 64-bit environment (to better face the next century).


Kernel          32-bit Power                    64-bit Power            Intel IA64
32-bit Kernel   32-bit applications             32-bit applications
64-bit Kernel   Not supported; 64-bit kernel    32-bit applications     32-bit applications
                is not supported at 32-bit      64-bit applications     32-bit applications
                CPUs

CPU             Type
PowerPC 604e    32-bit
Power3-II       64-bit
RS64 II         64-bit
RS64 III        64-bit
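The support rules stated in this section (a 64-bit kernel needs 64-bit hardware, 64-bit Power hardware runs either kernel, and IA-64 has a 64-bit kernel only) can be written down as a small lookup. This is an illustrative sketch only; the enum and function names are hypothetical and are not part of any AIX header.

```c
#include <stdbool.h>

/* Hypothetical names, used only to restate the support matrix. */
enum platform { PLATFORM_POWER_32BIT, PLATFORM_POWER_64BIT, PLATFORM_IA64 };
enum kernel   { KERNEL_32BIT, KERNEL_64BIT };

/* Returns true if the given kernel is supported on the given hardware. */
bool kernel_supported(enum platform hw, enum kernel k)
{
    switch (hw) {
    case PLATFORM_POWER_32BIT:
        return k == KERNEL_32BIT;   /* 64-bit kernel needs a 64-bit CPU */
    case PLATFORM_POWER_64BIT:
        return true;                /* both kernels run on 64-bit Power */
    case PLATFORM_IA64:
        return k == KERNEL_64BIT;   /* only a 64-bit kernel on IA-64 */
    }
    return false;
}
```

For example, kernel_supported(PLATFORM_POWER_32BIT, KERNEL_64BIT) is false, which matches the "Not supported" cell of the table.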


Compatibility for kernel extensions

There is no change to the compatibility provided for 32-bit kernel extensions under the 32-bit kernel. 64-bit kernel extensions will not be supported under the 32-bit kernel.

Compatibility for system calls

One important aspect of binary compatibility involves the required functional behavior of system call APIs when they are supplied with invalid user addresses. Under today’s 32-bit kernel, this behavior differs for 32-bit and 64-bit applications. For 32-bit applications, APIs return an error (that is, the EFAULT errno) to the application if presented with an invalid address. This is because all user space accesses made under an API take place inside the kernel, under the protection of kernel exception handling. For 64-bit applications, an invalid user address causes a signal (SIGSEGV) to be sent to the application. This occurs because structure reshaping is done in the supporting API libraries, and it is the user mode library routine that accesses the invalid user (structure) address.

Today’s kernel behavior is preserved by the 64-bit kernel for 32-bit applications but not for 64-bit applications. This is because the behavior for 64-bit applications under the 32-bit kernel will be changed and made consistent with that now provided for 32-bit applications. This is done for a number of reasons.

First, it is difficult to fully preserve the present behavior for 64-bit applications. Reshaping is not required for these applications under the 64-bit kernel, so there will be no library accesses. Signals could be sent as part of kernel exception handling, but it would be hard to produce the same signal context as is seen under the 32-bit kernel.

Next, the functional behaviors of 32-bit and 64-bit applications should only differ in places where there are fundamental application differences, like address space layout. Introducing different behaviors in other places only complicates matters for application writers.

Finally, both the errno and signal behaviors are allowable under the standards, but the errno behavior offers a more friendly application programming model.

In order to provide a consistent behavior across kernels and applications, all structure reshaping is performed inside both kernels for both application types.


Source compatibility

Source code compatibility is preserved for applications and 32-bit kernel extensions. Consistent with general AIX policy, this extends to makefiles (build mechanisms), but not to the small set of applications that rely upon shipped header file contents that are provided only for use by the kernel. Programs accessing /dev/mem or /dev/kmem serve as an example of such applications.

32-bit vs. 64-bit kernel Performance on Power

The 64-bit kernel is intended to increase scalability of the RS/6000 product family and is optimized for running 64-bit applications on the upcoming Gigaprocessor systems (Power4, which will be announced in 2001). The performance of 64-bit applications running on the 64-bit kernel on Gigaprocessor-based systems is better than if the same application were running on the same hardware with the 32-bit kernel. This is because the 64-bit kernel allows 64-bit applications to be supported without requiring system call parameters to be remapped or reshaped. The 64-bit kernel may also be compiler-optimized specifically for the Gigaprocessor system, whereas the 32-bit kernel may be optimized for a more general platform.

32-bit application Performance on 32-bit and 64-bit kernels

The 64-bit kernel will also be optimized for 32-bit applications (to the extent possible). This is because 32-bit applications now dominate the application space and will continue to do so for some time. In fact, performance trade-offs involving 32-bit versus 64-bit applications should be made in favor of 32-bit applications. However, 32-bit applications will typically perform worse on the 64-bit kernel than on the 32-bit kernel, because system call parameters must be reshaped for 32-bit applications on the 64-bit kernel.

64-bit application and 64-bit kernel performance on non-Gigaprocessor systems

The performance of 64-bit applications under the 64-bit kernel on non-Gigaprocessor systems may be lower than that of the same applications on the same hardware under the 32-bit kernel. This is because the non-Gigaprocessor systems are intended as a bridge to Gigaprocessor systems and lack some of the support that is needed for optimal 64-bit kernel performance. Efforts should be made to optimize 64-bit kernel performance for non-Gigaprocessor systems, but performance trade-offs are made in favor of the Gigaprocessor.


32-bit and 64-bit kernel extension performance on Gigaprocessor systems

The performance of 64-bit kernel extensions on Gigaprocessor systems should be the same or better than their 32-bit counterparts on the same hardware. However, the performance of the 64-bit kernel extension on non-Gigaprocessor machines may be less than 32-bit kernel extensions on the same hardware. This flows from the fact that 64-bit kernels are optimized for Gigaprocessor systems.

Kernel characteristics

Since the kernel is itself a program, it behaves almost like any other program. Its features are:

• Preemptable

• Pageable

• Segmented

• 64-bit

• Dynamically loadable

Preemptable means that the kernel can be in the middle of a system call and be interrupted by a more important task. The preemption causes a context switch to another thread inside the kernel.

Some parts of the kernel are pageable, which means they are not needed in memory all the time, and can be paged to paging space.

Both the 32-bit kernel and the 64-bit kernel implement virtual address translation by using segments. In previous versions of AIX, segment registers were used to map segments to thread contexts. Now segment tables are being used.
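As an illustration of segmented addressing, the sketch below splits an effective address into a segment number and an offset within that segment. It assumes the 256 MB segment size of the PowerPC architecture, which this section does not state, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 256 MB segments, so the low 28 bits are the offset. */
#define SEGMENT_SHIFT       28u
#define SEGMENT_OFFSET_MASK ((UINT64_C(1) << SEGMENT_SHIFT) - 1)

/* Segment number: everything above the 28-bit offset. */
uint64_t segment_number(uint64_t effective_address)
{
    return effective_address >> SEGMENT_SHIFT;
}

/* Byte offset of the address within its segment. */
uint64_t segment_offset(uint64_t effective_address)
{
    return effective_address & SEGMENT_OFFSET_MASK;
}
```

With these definitions, the 32-bit address 0x2FF3B400 falls in segment 2 at offset 0x0FF3B400. A 32-bit address yields only 16 possible segment numbers, which is why a small set of segment registers was sufficient there, while a 64-bit address space needs a table.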

The kernel can be dynamically extended with extra functionality.


Kernel states

kernel system diagram

Roughly there are three distinct layers:

• The user level

• The kernel level

• The hardware level

This diagram shows how the kernel is the interface between the user level and the hardware. Applications live at the user level, and they can only access hardware, like a disk or printer, through the kernel.

Process execution modes

Processes can run in two different execution modes: kernel mode and user mode. These modes are also referred to as Supervisor State and Problem State.


[Figure: kernel system diagram. User level: user programs and libraries. Kernel level: system call interface (entered through a trap on Power), file subsystem, buffer cache, character and block device drivers, process control subsystem (inter-process communication, scheduler, memory management), hardware control. Hardware level: hardware.]


User mode protection domain

A process running in user mode can only affect its own execution environment and runs in the processor’s unprivileged state. In user mode, a process has read/write access to the user data process private segment and the shared library data segment. It also has access to the shared memory segments using the shared memory functions. The process in user mode has read access to the user text and shared library text segment.

User mode processes can still use kernel functions by means of system calls. Functions that directly or indirectly invoke system calls are typically provided by programming libraries, which give access to operating system functions.

Kernel mode protection domain

Code running in this mode has read/write access to global kernel space and access to kernel data in the process private segment when running within the process context. Code in interrupt handlers, the base kernel and kernel extensions runs in kernel mode. If a program running in kernel mode needs to access user data, a kernel service is used to do so. Programs running in kernel mode can use kernel services, can access global system data, are exempt from all security restraints, and run in the processor’s privileged state.

In short:

User mode or problem state:

• User programs and applications run in this mode.

• Kernel data and global structures are protected from access/modification.

Kernel mode or supervisor state:

• Kernel and kernel extensions run in this mode.

• Can access or modify anything.

• Certain instructions limited to supervisor state only.

The kernel state is part of the thread state, so this information is typically kept in the thread’s Machine State area (MST).


Kernel exercise

Exercise: figuring out thread state on Power

Look at the value of the Machine State Register (MSR) for the thread of interest:

# echo "mst <thread slot>" | kdb | grep msr

iar : 0000000000009444 msr : A0000000000010B2 cr : 31384935

From /usr/include/sys/machine.h :

#define MSR_PR 0x4000 /* Problem state */

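MSR_PR is the mask 0x4000, that is, the 15th bit of the MSR counting from the least significant end. If this bit is set, the thread is running in user mode; this is the case when the fourth nibble from the right is 4, 5, 6, 7, C, D, E or F. In the sample output above that nibble is 1, so the thread was in kernel mode.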
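The same check can be expressed in a few lines of C. This is a minimal sketch: the MSR_PR value is the one quoted from /usr/include/sys/machine.h above, while the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_PR 0x4000 /* Problem state, as quoted from sys/machine.h */

/* Returns true if the saved MSR value shows user mode (problem state). */
bool msr_is_user_mode(uint64_t msr)
{
    return (msr & MSR_PR) != 0;
}
```

For the MSR value from the kdb output, A0000000000010B2, the function returns false, so that thread was running in kernel mode.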


Exercise: figuring out thread state on IA-64

Look at the value of the Interrupt Processor State Register (IPSR) for the thread of interest.

On an interrupt, and if PSR.ic (Interrupt Collection) is 1, the IPSR receives the value of the PSR. The IPSR, IIP and IFS are used to restore the processor state on a Return From Interrupt (rfi). The IPSR has the same format as PSR. IPSR.ri is set to 0, after any interruption from the IA-32 instruction set.

# iadb

(0)> ut -t <thread-ID>

*ut_save: 0x0003ff002ff3b400 *ut_rsesave: 0x0003ff002ff3bf50

System call state: ut_psr: 0x00001053080ee030

... more stuff...

(0)>mst 0x0003ff002ff3b400

mst at address 0003FF002FF3B400

prev : 0000000000000000 intpri : INTBASE

stackfix : 0000000000000000 backt :

kjmpbuf : 0000000000000000 emulator : NO

excbranch : E000000000020A80 excp_type : EXTINT(10)

ipsr : 00001010080AE030 isr : 0000000000000000

iip : E00000000000B970 ifa : E000009729F4F22A

iipa : E00000000000B960 ifs : 8000000000000716

iim : 00000000000000F4 fpowner : LOW/HIGH

fpsr : 0009804C0270033F fpeu : YES

... tons of more stuff ...

(0)> q

From /usr/include/sys/machine.h :

#define PSR_PK 15

00001010080AE030 (HEX) =

100000001000000001000000010101110000000110000 (Binary)

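Bit 15 is set (the fourth hex digit from the right is E), which means that the thread has the Protection Key bit set; the exercise takes this to indicate problem state. The privilege level itself is held in the PSR.cpl field (bits 32-33, where 0 is the most privileged level).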
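The hex-to-binary conversion and the bit test can be reproduced with the short sketch below. PSR_PK is the bit position quoted from the header above; the function names are hypothetical helpers, not part of iadb or any AIX header.

```c
#include <stddef.h>
#include <stdint.h>

#define PSR_PK 15 /* bit position, as quoted from sys/machine.h */

/* Returns the value (0 or 1) of one bit of a saved PSR/IPSR value. */
int psr_bit(uint64_t psr, unsigned bit)
{
    return (int)((psr >> bit) & 1u);
}

/*
 * Writes value in binary, without leading zeros, into buf.
 * buf should hold at least 65 bytes; output is truncated otherwise.
 */
char *to_binary(uint64_t value, char *buf, size_t size)
{
    size_t len = 0;
    int bit = 63;

    if (size == 0)
        return buf;

    /* Skip leading zero bits, but always keep the last digit. */
    while (bit > 0 && ((value >> bit) & 1u) == 0)
        bit--;

    for (; bit >= 0 && len + 1 < size; bit--)
        buf[len++] = ((value >> bit) & 1u) ? '1' : '0';

    buf[len] = '\0';
    return buf;
}
```

Calling to_binary() on 0x00001010080AE030 yields the 45-digit string shown in the exercise, and psr_bit(0x00001010080AE030, PSR_PK) returns 1.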


Kernel Limits

Most of the settings in the kernel are dynamic and do not need to be tuned. Their maximum values have been chosen so that they should never be reached during normal system usage. Some of the maximums could technically be even higher.

The following table lists kernel system limits as of AIX 5L Version 5.0.


Semaphores                                  32-bit kernel   64-bit kernel
Maximum number of semaphore IDs             131072          131072
Maximum semaphores per semaphore ID         65535           65535
Maximum operations per semop call           1024            1024
Maximum undo entries per process            1024            1024
Size in bytes of undo structure             8208            8216
Semaphore maximum value                     32767           32767
Adjust on exit maximum value                16384           16384

Message Queues                              32-bit kernel   64-bit kernel
Maximum message size                        4 MB            4 MB
Maximum bytes on queue                      4 MB            4 MB
Maximum number of message queue IDs         131072          131072
Maximum messages per queue ID               524288          524288


Several kernel parameters affect the availability of semaphores (semaem, semmap, semmni, semmns, semmnu, semume). Check their values on the running system, and keep in mind that other applications can also affect the availability of semaphores.


Shared Memory                                       32-bit kernel   64-bit kernel
Maximum region size                                 2 GB            2 GB
Minimum segment size                                1               1
Maximum number of shared memory IDs                 131072          131072
Maximum number of segments per process              11              268435465

LVM                                                 32-bit kernel   64-bit kernel
Maximum number of VGs                               255             4095
Maximum number of PP’s per hdisk                    1016            1016
Maximum number of LVs                               256             512
Maximum number of major numbers (see note 1)        65535           1073741823
Maximum number of VMM-mapped devices (see note 2)   1024            1024
Maximum number of disks per VG                      32              128


Filesystems                                 JFS             JFS2
Maximum file system size (see note 3)       1 TB            32 PB
Maximum file size (see note 4)              64 GB           32 PB
Maximum size of log device                  256 MB          32 PB
Maximum number of file system inodes        2^24            Unlimited
Maximum number of file system fragments     2^28            N/A
Maximum number of hard links                32767           32767

Miscellaneous                               32-bit kernel   64-bit kernel
Maximum number of processes per system      131072          131072
Maximum number of threads per system        262143          262143
Maximum number of open files per system     1000000         Unlimited (resource bound)
Maximum number of open files per process    32767           32767
Maximum number of threads per process       32767           32767
Maximum number of processes per user        131072          131072
Maximum physical memory size                4 GB            1 TB
Minimum physical memory size                32 MB           256 MB
Maximum value for the wall                  1 GB            4 GB


Notes:

1. Each volume group takes one major number; some are reserved for the OS and for other device drivers. Run "lvlstmajor" to see the range of free major numbers; rootvg always uses 10.

2. VMM-mapped devices are mounted JFS/CDRFS file systems, open JFS log devices, and paging spaces. Of 512, 16 are pre-reserved for paging spaces. These devices are indexed through the kernel’s Page Device Table (PDT), which is a fixed-size array.

3. To achieve 1 TB, the file system must be created with nbpi=65536 or higher and frag=4096.

4. To achieve files of around 64 GB, the file system must be created with the -a bf=true flag, and the application must support files greater than 2 GB.
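The values in note 3 follow from the JFS limits in the table, as the arithmetic below illustrates. The sketch assumes that the maximum file system size is bounded both by the inode count times the bytes-per-inode value and by the fragment count times the fragment size; the guide does not spell out this relationship, and the function and macro names are hypothetical.

```c
#include <stdint.h>

#define JFS_MAX_INODES    (UINT64_C(1) << 24) /* from the limits table */
#define JFS_MAX_FRAGMENTS (UINT64_C(1) << 28) /* from the limits table */

/*
 * Upper bound on JFS file system size in bytes for a given
 * bytes-per-inode value and fragment size (assumed model).
 */
uint64_t jfs_size_limit(uint64_t bytes_per_inode, uint64_t fragment_size)
{
    uint64_t by_inodes    = JFS_MAX_INODES * bytes_per_inode;
    uint64_t by_fragments = JFS_MAX_FRAGMENTS * fragment_size;

    return by_inodes < by_fragments ? by_inodes : by_fragments;
}
```

With 65536 bytes per inode and 4096-byte fragments, both products are 2^40 bytes, which is the 1 TB maximum from the table. With a smaller bytes-per-inode value such as 4096, the inode limit caps the file system at 2^36 bytes (64 GB) under this model.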


Kernel Limits Exercises

Checking kernel values

The purpose of this exercise is to find actual limits or settings in a running kernel. From the file /usr/include/sys/msginfo, we obtain the structure msginfo, which holds four integers. To list its contents in the running kernel, we use kdb on Power and iadb on the IA-64 platform. On both systems, we display 16 bytes, which equals four integers.

/*

* Message information structure.

*/

struct msginfo {

int msgmax, /* max message size */

msgmnb, /* max # bytes on queue */

msgmni, /* # of message queue identifiers */

msgmnm; /* max # messages per queue identifier */

};

Power:

# kdb

(0)> d msginfo

msginfo+000000: 0040 0000 0040 0000 0002 0000 0008 0000

                msgmax    msgmnb    msgmni    msgmnm

IA-64:

# iadb

> d msginfo 4 4

e00000000415cfb0: 00400000 00400000 00020000 00080000

                  msgmax   msgmnb   msgmni   msgmnm
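To compare the dump with the limits table, the 16 bytes can be decoded into four 32-bit values. The sketch below assumes the bytes appear in memory order on a big-endian system, as in the kdb output on Power; the struct and function names are hypothetical and deliberately differ from the kernel's struct msginfo.

```c
#include <stdint.h>

/* Decoded copy of the four msginfo fields (hypothetical type). */
struct msginfo_values {
    uint32_t msgmax; /* max message size */
    uint32_t msgmnb; /* max # bytes on queue */
    uint32_t msgmni; /* # of message queue identifiers */
    uint32_t msgmnm; /* max # messages per queue identifier */
};

/* Reads one big-endian 32-bit integer from four bytes. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decodes a 16-byte dump of msginfo taken in memory order. */
struct msginfo_values decode_msginfo(const unsigned char dump[16])
{
    struct msginfo_values v;

    v.msgmax = be32(dump);
    v.msgmnb = be32(dump + 4);
    v.msgmni = be32(dump + 8);
    v.msgmnm = be32(dump + 12);
    return v;
}
```

Decoding the bytes shown above gives msgmax = msgmnb = 4194304 (4 MB), msgmni = 131072 and msgmnm = 524288, which agrees with the Message Queues table.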


64-bit Kernel base enablement


Several components of base enablement support are provided to make it possible for kernel subsystems and kernel extensions to run in 64-bit mode and use a large address space.

State management support

Support is provided for saving and restoring 64-bit kernel context, including full 64-bit GPR contents. This support also extends to the area of kernel exception handling where setjmpx() and longjmpx() must deal with 64-bit kernel context. In addition, state management is extended to include the 64-bit kernel address space as part of the kernel context.

Temporary attachment

The 64-bit kernel provides kernel subsystems and kernel extensions with the capability to change the contents of the kernel address space. This includes the capability to change segments within the address space temporarily for a specific thread of execution and is consistent with the segmented virtual memory architecture of the hardware and the legacy 32-bit kernel programming model.

A total of four concurrent temporary attachments will be supported under a single thread of execution. This limitation is consistent with the limitation imposed by the 32-bit kernel and is made to restrict the amount of kernel state that must be saved and restored at context switch.
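The effect of the four-attachment limit can be pictured as a small fixed-size table kept in each thread's saved state. The sketch below is purely illustrative: the structure and function names are hypothetical and do not correspond to actual AIX kernel services.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TEMP_ATTACH 4 /* limit stated in the text */

/* Hypothetical per-thread record of temporary attachments. A fixed,
 * small array keeps the state saved at context switch bounded. */
struct temp_attach_state {
    uint64_t segment_id[MAX_TEMP_ATTACH];
    bool     in_use[MAX_TEMP_ATTACH];
};

/* Attaches a segment; returns the slot used, or -1 if all four
 * slots are already taken. */
int temp_attach(struct temp_attach_state *t, uint64_t segment_id)
{
    for (int slot = 0; slot < MAX_TEMP_ATTACH; slot++) {
        if (!t->in_use[slot]) {
            t->in_use[slot] = true;
            t->segment_id[slot] = segment_id;
            return slot;
        }
    }
    return -1;
}

/* Detaches the segment in the given slot; returns 0 on success and
 * -1 if the slot is invalid or not in use. */
int temp_detach(struct temp_attach_state *t, int slot)
{
    if (slot < 0 || slot >= MAX_TEMP_ATTACH || !t->in_use[slot])
        return -1;
    t->in_use[slot] = false;
    return 0;
}
```

In this model a fifth attachment fails until one of the four slots is released, and a context switch never has to save more than four entries per thread.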

Global attachment

While the temporary attachment model is maintained, the 64-bit kernel also provides a model under which subsystem data is placed within the global kernel address space and made visible to all kernel code for the entire life of its usefulness, rather than temporarily attaching segments as needed and in the context of a single thread.

This global attachment model does more than allow the 64-bit kernel to provide sufficient space for subsystems to place their data in the global kernel heap. Rather, it includes the capability to place subsystem segments within the global address space. This capability is needed for two reasons:

• Different memory characteristics

• Data organized around segment

Some subsystems require virtual memory characteristics that are different from those of the kernel heap. For the most part, these characteristics are defined at the segment level and typically must be reflected by segment types that are different from those used for the kernel heap. Also, some subsystems organize their data around segments and require sizes and alignments that are inappropriate for the kernel heap.

The global attachment model is of importance for a number of reasons. First, it is more scalable than the temporary attachment model. This is particularly true for subsystems that require large portions of their data to be accessible at the same time for a single operation. As the volume of this data increases to meet workload or hardware requirements, the temporary attachment model proves impractical for these subsystems, as increasing numbers of segments must be attached and detached. An example of such a subsystem is the VMM, where page fault resolution and virtual memory kernel services require access to all page frames and segment descriptors.

The global attachment model is also of value in cases where only a small number of subsystem segments are involved. Segments are attached to the global kernel address space only once, typically at subsystem initialization, and are accessible from then on without requiring individual subsystem operations to incur the path length cost of segment attachment. This is not to say that the global attachment model is without its own path length costs; specifically, use of this model may result in more segment lookaside buffer (SLB) reloads. This is because it provides no opportunity to prime the SLB table with virtual segment IDs (VSIDs) for soon-to-be-accessed segments. Rather, it relies upon the caching nature of the SLB table and updates SLBs with new VSIDs only when satisfying reload faults. This differs from the temporary attachment model, where VSIDs are placed in the SLB as part of segment attachment.

Finally, this model simplifies the general kernel programming model. Subsystems are not required to deal with the complexity of segments, segment offsets or segment attachments in accessing their data. Rather, data accesses are made simply and naturally using addresses within the flat kernel address space.

The specific subsystem segments that will be placed in the kernel address space under the global attachment model include:

• Kernel Heap

Although traditionally part of the global address space, the kernel heap segments will be placed in this space through global attachment.

• File System Segments

The global segments used to hold the file and inode tables will be provided through global attachment.

• mbuf Segments

The mbuf pool has long been a part of global space and this will continue under the 64-bit kernel.

• VMM Segments

These segments are privately attached in the legacy 32-bit kernel and hold the software page frame table, segment control blocks, paging device table, file system lockwords, external page tables, and address space map entries.

• Process and Thread Tables

Global attachment is used for the segments required for the globally addressable process and thread tables.

All segments added to the global kernel address space through global attachment will be strictly read/write for the kernel and no-access for users. In addition, unaligned accesses to these segments will not be supported and will result in a protection exception.

Data isolation

While placing subsystem data in the global kernel address space provides significant benefits, it eliminates the data isolation that is provided by the temporary attachment model. Under the temporary attachment model, data is typically made accessible only while running subsystem code and is not generally exposed to other subsystems. Unrelated interrupt handlers may gain access to data by interrupting subsystem code. However, this exposure is more limited than that which occurs by placing data in global space, where all kernel code has access.

Isolation is critical for some classes of subsystem data. As a result, not all subsystem data should be placed in the global kernel address space. In particular, file systems should continue to use temporary attachments to provide isolation for user data.

Kernel address space layout

The kernel address space layout preserves the existing 32-bit and 64-bit user address layouts that are now found under the legacy 32-bit kernel. In addition, a common global kernel and per-process user address space is provided. This is required for a number of performance reasons:

• Efficient transition between kernel and user mode

• Preservation of SLBs

• Reduced complexity

• Single per-process segment table

To begin, a common address space improves the efficiency of transition between kernel and user mode, since there is no need to switch address spaces. Next, it preserves SLBs. This is because the segments within the user and kernel address space are common, so there is no need to use separate SLBs or perform SLB invalidation at user/kernel transitions. Also, a common address space reduces the complexity and path length that is associated with kernel access to user space. There is no longer a need for the kernel to gain addressability to segments from a separate user address space in performing accesses or to serialize accesses against changes in the user address space. Rather, user segments are already in place and properly serialized in the common address space. Finally, the common address space supports the efficiency of a single per-process segment table.

Temporary attachments are not included as part of the common address space. This is for a number of reasons. First, data isolation would be impacted for temporary attachments if they were placed in the common address space. This is because the attached data would be accessible in the kernel by all threads of a process rather than only by the thread that performed the temporary attachment. Second, it would be inefficient for the common address space to include temporary attachments. This is due to the fact that changes to the common address space would have to be serialized among all threads of a process.

I/O space mapping

The 64-bit kernel supports I/O space at locations below and above 4 GB within the hardware system memory map. Under the 64-bit kernel, I/O space is virtually mapped through the page translation hardware and made accessible through segments on all supported hardware system implementations. In the legacy 32-bit kernel on current hardware systems, I/O space virtual access is achieved through block address translation (BAT) registers, but this capability is not provided by the Gigaprocessor hardware.

Performance when accessing I/O addresses

The capability to place portions of I/O space within the global kernel address space must be provided to allow temporary attachment overhead to be avoided. This capability is built upon the global attachment model. Along with services to support this, other services are provided that allow portions of I/O space to be temporarily attached. However, these services form an I/O space temporary attachment model that is slightly different from the one now found under the 32-bit kernel. Specifically, I/O space mappings must be created prior to any temporary attachments and destroyed once all temporary attachments are complete. These mapping operations are performed by individual device drivers through new services and typically occur at the time of device configuration and de-configuration. Compare this to the existing model under the 32-bit kernel, where no separate mapping operations are present.

I/O mapping in 64-bit kernel mode

The mapping operations are provided under the 64-bit kernel model for a number of reasons. The first is performance. While the 32-bit kernel model does not require I/O space to be mapped before it is attached, it does require each temporary attachment to perform some level of mapping. Under the 64-bit kernel model, each device driver maps its portion of I/O space once at initialization time and incurs no additional mapping overhead in performing temporary attachments. Next, the presence of the mapping operations provides efficient use of system resources. I/O space is mapped in virtual memory through the page table and segments under the 64-bit kernel, and these system resources are only consumed for portions of I/O space that are actually in use. In the absence of mapping operations, the 64-bit kernel itself would have to map all of I/O space into virtual memory and possibly waste resources for unused portions. In addition to potentially wasting resources, charging the kernel with the responsibility of mapping I/O space would lead to arbitrary layouts of I/O space in virtual memory and would not support data isolation. Finally, the interfaces for performing temporary attachments are simplified, as no I/O mapping information must be specified. This implies new interfaces for attaching to and detaching from I/O space.

The new I/O space temporary attachment model and its supporting services are provided not only under the 64-bit kernel but under the 32-bit kernel as well. This is required to ease the migration of 32-bit device drivers to the 64-bit kernel environment and to make it simpler to maintain 32-bit and 64-bit versions of a single device driver.

Rather than placing their respective portions of I/O space in the global kernel address space, most device drivers should continue to access I/O space through temporary attachments. This is because a large proportion of these accesses occur under interrupts and would more than likely miss the SLB table if the accesses were performed using the global attachment model. While the temporary attachment model adds overhead to I/O space accesses, it typically avoids the SLB miss performance penalty by priming the SLB table.

LP64 C language data model

The 64-bit kernel uses the LP64 C language data model, in which long and pointer types are 64 bits wide. This data model was chosen for a number of reasons. First, the LP64 data model is also used by 64-bit AIX applications, and this allows the 64-bit kernel to support these applications in a straightforward manner. Of the prevailing 64-bit data models, including ILP64 and LLP64, the LP64 data model is most consistent with the ILP32 data model used by 32-bit applications. This consistency simplifies 32-bit application support under the 64-bit kernel and allows 32-bit and 64-bit applications to be supported in fairly common ways. Next, LP64 has been chosen as the data model for the 64-bit kernel implementations provided by key UNIX vendors, including SGI, Sun, and HP. Use of a common data model simplifies matters for ISVs and enables AIX to use industry-wide solutions to some problems. Finally, the 64-bit kernel requires no new compiler functionality and can use the existing 64-bit mode compiler.
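The data models named above differ only in the widths of int, long, and pointer types. The following is a minimal sketch, not part of the original course material, that classifies a model from those three widths; the function names data_model_name and host_data_model are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Classify a C data model from the widths (in bits) of int, long,
 * and pointer types. Hypothetical helper for illustration. */
const char *data_model_name(size_t int_bits, size_t long_bits, size_t ptr_bits)
{
    assert(int_bits > 0 && long_bits > 0 && ptr_bits > 0);

    if (int_bits == 32 && long_bits == 32 && ptr_bits == 32)
        return "ILP32";   /* int, long, pointer all 32-bit */
    if (int_bits == 32 && long_bits == 64 && ptr_bits == 64)
        return "LP64";    /* long and pointer widen, int stays 32-bit */
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 64)
        return "LLP64";   /* only pointers (and long long) are 64-bit */
    if (int_bits == 64 && long_bits == 64 && ptr_bits == 64)
        return "ILP64";   /* int widens as well */
    return "other";
}

/* Report the data model of the compiler mode this file is built in. */
const char *host_data_model(void)
{
    return data_model_name(sizeof(int) * CHAR_BIT,
                           sizeof(long) * CHAR_BIT,
                           sizeof(void *) * CHAR_BIT);
}
```

Moving from ILP32 to LP64 leaves int unchanged and widens only long and pointers, which is one way to read the statement that LP64 is the 64-bit model most consistent with ILP32.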

Register conventions

The register conventions used in the 64-bit kernel environment are the same as those used in the 64-bit application environment. This means that general purpose register 13 will be reserved for operating system use.

64-bit Kernel stack

Kernel stack

64-bit code has greater stack requirements than 32-bit code. This is for two reasons. First, the amount of stack space required to hold subroutine linkage information increases for 64-bit code, since this information is made up of register and pointer values, and these values are larger 64-bit quantities. Second, long and pointer values are 64-bit quantities for 64-bit code and consume more space when maintained as stack variables.

The larger stack requirements of 64-bit code also mean that stack-related sizes under the 64-bit kernel are increased over those of the 32-bit kernel. In fact, most existing stack sizes will double.

Minimum stack size

Under the 64-bit kernel, the components of the common subroutine linkage, such as the link register and TOC pointer, are 64-bit quantities. As a result, the minimum stack frame size is 112 bytes.
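The 112-byte figure can be checked with a little arithmetic. The sketch below assumes the conventional PowerPC frame layout, which the text does not spell out: a six-slot linkage area (back chain, CR save, LR save, two reserved slots, TOC save) followed by an eight-slot parameter save area for the register arguments. The name min_frame_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LINKAGE_SLOTS = 6,  /* assumed: back chain, CR, LR, 2 reserved, TOC */
    PARAM_SLOTS   = 8   /* assumed: home locations for 8 register arguments */
};

/* Minimum stack frame size in bytes for a given slot width
 * (4 bytes for 32-bit code, 8 bytes for 64-bit code). */
size_t min_frame_bytes(size_t slot_bytes)
{
    assert(slot_bytes == 4 || slot_bytes == 8);
    return (LINKAGE_SLOTS + PARAM_SLOTS) * slot_bytes;
}
```

Under these assumptions, min_frame_bytes(8) gives 14 slots of 8 bytes, or 112 bytes, and the same layout with 4-byte slots gives 56 bytes, which is consistent with the statement that most stack sizes double.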

Process context stack size

Consistent with the 32-bit kernel, the kernel stacks for use in process context are 96 KB in size. This size should prove to be sufficient for the 64-bit kernel, since it has been found to be twice what is actually needed for the 32-bit kernel.

Interrupt stack size

The interrupt stack will be 8 KB in size under the 64-bit kernel. This size is clearly warranted, since some interrupt handlers find the 4 KB interrupt stack size of the 32-bit kernel to be insufficient.

Dynamic resource pools

To allow scalability, resource pools are allocated dynamically from the kernel heap and through separately created segments intended for this purpose. This means that some existing resource pools, like the shared memory, message queue, and semaphore ID pools, are relocated from the kernel BSS.

Kernel heap

The kernel heap is the home of most kernel data structures and is sufficiently large to allow subsystems to scale fixed resource pools while, at the same time, providing adequate space for dynamically allocated resources. To provide this, the kernel heap is expanded to encompass a larger number of segments and placed above 4 GB within the global kernel address space to accommodate its larger size.

While the kernel heap is extended and moved above 4 GB, the interfaces provided for allocating from and freeing to this heap are the same as those provided under the 32-bit kernel. The use of these interfaces is pervasive, so common interfaces ease the 64-bit kernel porting effort for kernel subsystems and kernel extensions and make it simpler to support both kernels.

The kernel heap is now expanded to 16 segments, for a total of about 4 GB of allocatable space. This is more than eight times larger than the space available under the 32-bit kernel.

Allocation requests are only limited in size by the amount of available heap space, rather than by some arbitrary limit. This means that the segments that make up the kernel heap are laid out contiguously within the address space, and requests for more than a segment size worth of data are granted if sufficient free space is available. It also means that a request can be satisfied with space that crosses segment boundaries.
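The heap size and the idea of a request crossing segment boundaries can be illustrated with a short sketch. It assumes the 256 MB segment size of the PowerPC architecture, which is implied by but not stated in the text; the function names are hypothetical and the code models only the address arithmetic, not a real allocator.

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_BYTES (UINT64_C(1) << 28)  /* assumed: 256 MB per segment */
#define HEAP_SEGMENTS 16                   /* from the text */

/* Total size of a heap made of contiguous segments. */
uint64_t heap_bytes(void)
{
    return HEAP_SEGMENTS * SEGMENT_BYTES;
}

/* Number of segments touched by a request of `size` bytes that
 * starts at byte `offset` from the beginning of the heap. */
unsigned segments_spanned(uint64_t offset, uint64_t size)
{
    assert(size > 0 && offset + size <= heap_bytes());

    uint64_t first = offset / SEGMENT_BYTES;
    uint64_t last  = (offset + size - 1) / SEGMENT_BYTES;
    return (unsigned)(last - first + 1);
}
```

With these assumptions the heap is exactly 16 x 256 MB = 4 GB, and a request that starts one byte before a segment boundary and is two bytes long spans two segments, which is only possible because the segments are laid out contiguously.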

A separate global heap reserved for the loader is provided in segment zero (that is, the kernel segment). This heap is used to hold the system call table and svc_instructions code for 32-bit applications and must be placed in segment zero, because it is the only global segment that is mapped into the 32-bit user address space. This heap is also used to hold the system call table for 64-bit applications and loader sections for kernel extensions. This data is located in the loader heap because it must be readable in user mode. This type of access is not supported for the kernel heap.

CPU big- and little-endian

Memory view for big and little endian systems

Although both the Power and IA-64 architectures support big-endian and little-endian implementations, the endianness of AIX 5L differs between the two platforms. AIX 5L for IA-64 is little-endian, and AIX 5L for PowerPC is big-endian.

Logically, in multi-digit numbers, leftmost digits are more significant, and rightmost least. For example, in the four-digit number 8472, the 4 is more significant than the 7.

System memory can be viewed in two ways. The example shows a 100-byte memory seen both ways. Try to write the number 1234567890 at addresses 0-9 in both figures. What is the digit in the byte at address two?

[Figure: a 100-byte memory, addresses 00 to 99, drawn two ways.]

Register and memory byte order

Computers address memory in bytes while manipulating data in words (of multiple bytes). When a word is placed in memory, starting from the lowest address, there are only two options: Either place the least significant byte first (known as little-endian) or place the most significant byte first (known as big-endian).

In the register layout shown in the figure, “a” is the most significant byte, and “h” is the least significant byte. The figure also shows the byte order in memory. On big-endian systems, the most significant byte is placed at the lowest memory address. On little-endian systems, the least significant byte is placed at the lowest memory address.

Power, PowerPC, most RISC-based computers, IBM 370 computers, and the Internet protocol (IP) are examples that use the big-endian data layout. Intel processors, Compaq Alpha processors, and some networking hardware are examples that use the little-endian data layout.

register (bit 63 on the left, bit 0 on the right):

a b c d e f g h

big-endian memory

address: 0 1 2 3 4 5 6 7

content: a b c d e f g h

little-endian memory

address: 0 1 2 3 4 5 6 7

content: h g f e d c b a
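The two byte orders described in this section can be reproduced in a few lines of C. This is a minimal sketch for illustration; the function names are hypothetical, and the stores use shifts so that the result does not depend on the machine the code runs on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value with the most significant byte at the
 * lowest address (big-endian layout). */
void store_big_endian(uint64_t value, unsigned char out[8])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++)
        out[i] = (unsigned char)(value >> (56 - 8 * i));
}

/* Store a 64-bit value with the least significant byte at the
 * lowest address (little-endian layout). */
void store_little_endian(uint64_t value, unsigned char out[8])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Return 1 if the machine running this code is little-endian,
 * 0 if it is big-endian. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);  /* byte at the lowest address */
    return first == 1;
}
```

Storing the value 0x6162636465666768, whose bytes are the ASCII codes for “a” through “h”, yields “abcdefgh” in the big-endian buffer and “hgfedcba” in the little-endian buffer, matching the figure.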

Multi Processor dependent designs

Kernel lock

The kernel lock is not supported under the 64-bit kernel. This lock was originally provided to allow subsystems to deal with the pre-emptive nature of the AIX kernel on uniprocessor hardware, and was later used as a means of ensuring correctness for non-MP-safe subsystems on MP hardware. At a minimum, all 64-bit kernel subsystems and kernel extensions must be MP-safe, with most required to be MP-efficient to meet performance requirements. As a result, the kernel lock is no longer required.

Device funneling

Under the 64-bit kernel, no support will be provided for device funneling. This means that all device drivers must be MP-safe and identify themselves as such when registering devices and interrupt handlers.

Device funneling was originally provided under the 32-bit kernel so that non-MP-safe device drivers could run correctly on multi-processor hardware with no change. However, all device drivers must change to some extent under the 64-bit kernel and this provides the opportunity to simplify the 64-bit kernel by not providing device funneling support and requiring additional changes for the set of device drivers that are not MP-safe.

Of the existing IBM Austin-owned device drivers, only the X.25 and graphics device drivers are not MP-safe. However, this is of no concern, since X.25 will not be provided under the 64-bit kernel and the (new) graphics drivers that will be provided in the time frame of the 64-bit kernel will be MP-safe.

Command and Utility compatibility for 32-bit and 64-bit kernels

Commands and utilities

A number of AIX-supplied commands and utilities deal directly with kernel details and require different implementations under the different kernels. Commands based upon /dev/kmem or /dev/mem serve as an example.

While two different implementations may be required, the AIX-supplied commands and utilities must use a common binary. This is required to support a common system base and means that a single binary front-end must be used, but does not dictate that only a single binary be used. In fact, two binaries make sense in cases where kernel data structures are used (as in vmstat) and these data structures have different sizes or formats under 32-bit and 64-bit compilations. Rather than duplicating data structures for a single binary, both a 32-bit and a 64-bit binary version are provided; one of these serves as a front-end and executes the other when the bit-ness of the kernel does not match its own. This implementation ensures that each utility has one common command interface under both the 32-bit and 64-bit kernels.
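The front-end decision described above can be sketched as a small selection function. This is an illustration only: the function name binary_to_exec and its parameters are hypothetical, and the sketch deliberately leaves out how a real front-end would discover the kernel's bit-ness and execute the other binary, since the text does not name those interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Decide which binary a front-end should execute.
 *
 * kernel_bits: bit-ness of the running kernel (32 or 64)
 * own_bits:    bit-ness of the front-end itself (32 or 64)
 * path32/64:   hypothetical paths of the two binary versions
 *
 * Returns NULL when the front-end already matches the kernel and
 * should do the work itself, otherwise the path of the binary
 * whose bit-ness matches the kernel. */
const char *binary_to_exec(int kernel_bits, int own_bits,
                           const char *path32, const char *path64)
{
    assert(kernel_bits == 32 || kernel_bits == 64);
    assert(own_bits == 32 || own_bits == 64);

    if (kernel_bits == own_bits)
        return NULL;
    return kernel_bits == 64 ? path64 : path32;
}
```

Either binary can act as the front-end under this scheme: a 32-bit front-end running under the 64-bit kernel hands off to the 64-bit version, and in the matching case no second program is started.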

Exceptions

Exceptions and interrupts distinction

The distinction between the terms "exception" and "interrupt" is often blurred. The bulk of AIX documentation refers to both classes generically as "interrupts," while the hardware documentation (like the PowerPC 60x User’s Manuals) makes the distinction. We will try to keep the terms separate.

Definition of exceptions

Exceptions are synchronous events that are normally caused by the process doing something illegal.

An exception is a condition caused by a process attempting to perform an action that is not allowed, such as writing to a memory location not owned by the process, or trying to execute illegal operations. For illegal operations, the kernel traps the offending action and delivers a signal to the process that caused the exception (or crashes, if the process was in kernel mode). Exceptions can also be caused by a page fault. A page fault is a reference to a virtual memory location for which the associated real data is not in physical memory.

Determine the action taken on an exception

The result of an exception is either to send a signal to the process or to crash the machine. The decision is based upon what kind of exception occurred and whether the process was executing in user mode or kernel mode:

• Exceptions are caused within the context of a process.

• A process may NOT decide how to react to the exception.

• Exception handlers are kernel code and run without regard to the process, except to cleanly handle the exception generated by the process.

• Some exceptions result in the death of the process.

• Some exception types can be found in <sys/m_except.h>

A process can decide how to respond to the signal generated by the exception in certain cases. For example, a process can decide to catch the signal for SIGILL, which is delivered when a process in user mode executes an illegal instruction.
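Catching the signal for SIGILL, as mentioned above, can be sketched with the standard POSIX sigaction() interface. The helper names catch_sigill and sigill_was_seen are hypothetical. The handler only records that the signal arrived; a handler for a genuine illegal instruction would also have to avoid returning to the faulting instruction, which this sketch does not attempt.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t sigill_seen = 0;

static void on_sigill(int signo)
{
    (void)signo;
    sigill_seen = 1;
}

/* Install the SIGILL handler. Returns 0 on success, -1 on failure. */
int catch_sigill(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGILL, &sa, NULL);
}

/* Return nonzero once the handler has run at least once. */
int sigill_was_seen(void)
{
    return sigill_seen != 0;
}
```

A program could call catch_sigill() at startup and then use raise(SIGILL) to deliver the signal to itself for testing, without executing an actual illegal instruction.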

An exception is also a mechanism to change to supervisor state as a result of:

• Program errors

• Unusual conditions

• Program requests

Branching to exception handlers

After an exception, the system switches to supervisor state and branches to an exception handler routine. The branch address is found from the content of a specific memory location called a "vector."

Examples of exception vectors:

• System reset

• Machine check

• Data storage interrupt (DSI)

• Instruction storage interrupt (ISI)

• Alignment

• Program (invalid instruction or trap instruction)

• Floating-point unavailable

• Decrementer

• System call
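As an illustration of how the vectors in this list are told apart, the sketch below pairs each one with a vector offset. The offsets are an assumption: they are the values defined by the classic 32-bit PowerPC architecture and do not appear in this guide, and the table and the function vector_name are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

struct exception_vector {
    unsigned    offset;  /* assumed classic PowerPC vector offset */
    const char *name;    /* name as used in the list above */
};

static const struct exception_vector vectors[] = {
    { 0x100, "System reset" },
    { 0x200, "Machine check" },
    { 0x300, "Data storage interrupt (DSI)" },
    { 0x400, "Instruction storage interrupt (ISI)" },
    { 0x600, "Alignment" },
    { 0x700, "Program" },
    { 0x800, "Floating-point unavailable" },
    { 0x900, "Decrementer" },
    { 0xC00, "System call" },
};

/* Look up the name of the exception at a vector offset.
 * Returns NULL if the offset is not in the table. */
const char *vector_name(unsigned offset)
{
    for (size_t i = 0; i < sizeof vectors / sizeof vectors[0]; i++)
        if (vectors[i].offset == offset)
            return vectors[i].name;
    return NULL;
}
```

The table covers only the vectors named in the list, so a lookup for any other offset returns NULL.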

System reset exception

The system reset exception is used when a system reset is initiated by the system administrator. This generally causes a "soft" reboot of the system.

Machine check exception

The machine check exception is generated when a hardware machine check occurs. This generally indicates either a hardware bus error or bad real address access. If a machine check occurs with the ME bit off, then a machine checkstop occurs. Generally, a machine check exception causes a kernel crash dump to be generated. A machine checkstop causes no kernel crash dump to be generated, though a checkstop record is generated.

Data storage exception

Data storage interrupt (DSI) and instruction storage interrupt (ISI) exceptions are caused by hardware not being able to find a translation for an instruction fetch or load/store operation. These generally result in a page fault.

Alignment exception

Alignment exceptions are generated when an instruction generates an unaligned memory operation that cannot be completed by the hardware. Which unaligned operations cannot be handled by the hardware is processor dependent. This exception generally results in AIX performing the unaligned operation with special purpose code.

Invalid instruction exception

The program exception is generated when an illegal instruction or trap instruction is executed. This is generally caused by debugger breakpoints in a process being hit. This exception generally results in a call to an application or kernel debugger.

Floating point unavailable exception

The floating point unavailable exception is caused when a thread executes a floating point instruction when floating point operations are not allowed. This generally indicates that a thread has not executed any floating point instructions yet or that another thread’s floating point data is currently in the processor’s floating point registers. AIX does not save a thread’s floating point register values until it first uses the floating point registers. On UP systems, AIX does not save off floating point registers for the currently running thread when another thread is dispatched. Often, no other thread will use the floating point registers before the thread is again dispatched. This saves AIX having to save and restore the floating point registers on every thread dispatch.

Decrementer exception

The decrementer exception is caused when the decrementer register has reached the value zero. This indicates that a timer operation has completed.

System call exception

The system call exception occurs whenever a thread executes a system call.


Interrupts

Description of interrupts

Interrupts are asynchronous events that may be generated by the system or a device and that "interrupt" the execution of the current process. Interrupts usually occur when a process is running and some asynchronous event occurs, such as disk I/O completion or a clock tick.

The event usually has nothing to do with the current running process. The kernel immediately preempts the current running process to handle the interrupt. The state of the machine is saved on the stack and the interrupt is handled. The user process has no knowledge that the interrupt occurred.

Interrupts are one of the major reasons that AIX cannot be a hard real-time system. No guarantee can be made as to how long it may take for some action to occur as it may get interrupted any number of times during the action.

Interrupts are caused outside the context of a process. In general, a process may NOT decide how to react to the interrupt. Interrupt handlers are kernel code and run without regard to the process, unless the nature of the interrupt is to update some process-related structure, statistics, and so on.

Interrupt levels

Each interrupt has a level and an associated priority. The level is a value that is used to differentiate between interrupts; the priority ranks the importance of each one.

Devices with interrupt facilities, such as adapter cards, have an associated interrupt level. When the system receives an interrupt with that level, AIX knows that it was caused by a device at that level.

In AIX, devices may share interrupt levels such that more than one adapter may share the same level.

Controlling Interrupts

A kernel process can disable some or all types of interrupts for short periods. The interrupted process will safely return to continue execution.

Some interrupt types can be found in <sys/m_intr.h>.

Most interrupts are not concerned with which process is getting interrupted. The major counterexample is the clock interrupt, which is used to update the run-time statistics for the currently running process.


Critical sections

A critical section is a section of code that must be executed without any break, for example, code that examines a data value and then changes it based on that value. A kernel process would disable interrupts across a critical section to ensure that the section is executed without breaks.
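The examine-then-change case can be illustrated with a small model. This is a sketch in Python under simplifying assumptions, not kernel code: the `Cpu` class, its single enable flag, and the `disable`/`restore` names are hypothetical, and a real kernel works with interrupt priorities rather than one flag.

```python
class Cpu:
    """Toy single-processor model with one interrupt enable flag."""

    def __init__(self):
        self.enabled = True
        self.pending = []    # interrupts raised while disabled
        self.counter = 0     # shared data touched by both code paths
        self.trace = []

    def raise_interrupt(self, handler):
        if self.enabled:
            handler(self)
        else:
            self.pending.append(handler)   # deferred until re-enabled

    def disable(self):
        previous, self.enabled = self.enabled, False
        return previous

    def restore(self, previous):
        self.enabled = previous
        while self.enabled and self.pending:
            self.pending.pop(0)(self)


def handler(cpu):
    cpu.trace.append("handler")
    cpu.counter += 100


cpu = Cpu()
old = cpu.disable()               # enter the critical section
value = cpu.counter               # examine
cpu.trace.append("examined")
cpu.raise_interrupt(handler)      # arrives mid-section, so it is deferred
cpu.counter = value + 1           # change, based on the examined value
cpu.trace.append("changed")
cpu.restore(old)                  # leave the section; the handler runs now
```

Had the handler run between the examine and change steps, its update of 100 would have been overwritten by `value + 1`. With interrupts disabled across the section, both updates survive.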

Out of order instruction sets and Interrupts

On modern processors, such as POWER and IA-64, many instructions are in execution at one time. When a hardware interrupt occurs, the instructions before the interrupt point are executed to completion, and any following instructions are terminated with no effect on the processor registers or memory; results from instructions executed out of order are discarded. This is what is meant by "interrupts are guaranteed to occur between the execution of instructions." The processor makes sure that the effect of its operations is equivalent to an interrupt occurring between the execution of two instructions.


Interrupt handling in AIX

Interrupt handling

When an interrupt is received, AIX performs several steps to handle the interrupt properly:

• Saves the current state of the machine.

• Determines the real handler for the interrupt.

• Calls that handler to "service" the interrupt.

• Restores the machine state if and when the handler completes.

Interrupt priorities

Interrupt priorities have no relationship to process and thread scheduling priorities.

AIX associates priorities with each type of interrupt. A lower priority number means a more favored interrupt. Interrupt processing can itself be interrupted, but only by a more favored (lower priority number) interrupt.

Interrupt routines usually allow themselves to be interrupted by more favored interrupts, but refuse to take less favored interrupts. However, interrupt routines and other programs running in kernel mode can manually raise or lower their interrupt priority. This is called "disabling or enabling interrupts." The reason for this is that a highly prioritized handler, such as a disk handler, must complete its work before new data arrives, and it must not be delayed by less favored interrupts.
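The preemption rule (a lower priority number is more favored, and only a more favored interrupt can interrupt the current one) can be written down directly. The following Python sketch is illustrative only: the priority numbers and the `BASE` value are hypothetical, not AIX constants.

```python
BASE = 100  # hypothetical priority number for base (non-interrupt) level


def may_preempt(current, incoming):
    """Lower number means more favored; only a more favored interrupt preempts."""
    return incoming < current


def deliver(stack, incoming):
    """stack[-1] is the priority now running. Returns True if taken at once."""
    if may_preempt(stack[-1], incoming):
        stack.append(incoming)   # nest: the new handler runs on top
        return True
    return False                 # stays pending until the priority drops


stack = [BASE]
took_first = deliver(stack, 5)    # taken: 5 is more favored than base level
took_second = deliver(stack, 8)   # refused: 8 is less favored than 5
took_third = deliver(stack, 1)    # taken: 1 is more favored than 5, so it nests
```

After these three events the handler at priority 1 is running on top of the handler at priority 5, and the priority 8 interrupt is still waiting.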


Handling CPU state information at interrupt

Saving and restoring machine state

AIX maintains a set of machine state save (mstsave) areas. Each processor has a pointer to the mstsave area it should use when the next interrupt occurs. This pointer is called the current save area, or csa pointer. When state needs to be saved, AIX:

• Saves almost all registers into the mstsave pointed to by this processor’s csa.

• Gets the next available mstsave area from this processor’s pool.

• Links just-saved mstsave to new mstsave.

• Updates this processor’s csa to point to a new area.

When an interrupt handler returns, AIX must restore the machine state that was in effect when the interrupt occurred. AIX does this by:

• Reloading registers from the processor’s previous mstsave area.

• Setting the processor’s csa pointer to the (now unused) previous mstsave area.

• If returning to base interrupt level, generally rerunning the dispatcher to determine which thread to resume, because the interrupt might have made another thread runnable.
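The save and restore steps above can be sketched as a linked chain of save areas. This Python model is an illustration only: the `Mstsave` and `Processor` classes, the single `pc` register, and returning a released area to the pool are assumptions, not the AIX data structures.

```python
class Mstsave:
    def __init__(self):
        self.prev = None    # link to the previously used save area
        self.regs = None    # saved register image (may hold stale data)


class Processor:
    def __init__(self, pool_size=4):
        self.pool = [Mstsave() for _ in range(pool_size)]
        self.csa = self.pool.pop()       # area used by the next interrupt
        self.regs = {"pc": 0}

    def take_interrupt(self, handler_pc):
        self.csa.regs = dict(self.regs)  # save registers into the csa area
        new = self.pool.pop()            # get the next available area
        new.prev = self.csa              # link it to the just-saved area
        self.csa = new                   # csa now points to the new area
        self.regs = {"pc": handler_pc}

    def return_from_interrupt(self):
        done, previous = self.csa, self.csa.prev
        self.regs = dict(previous.regs)  # reload from the previous area
        self.csa = previous              # previous area is unused again
        done.prev = None
        self.pool.append(done)           # assumption: area returns to the pool

    def history(self):
        """Saved pc values, most recent first (the interrupt history stack)."""
        out, area = [], self.csa.prev
        while area is not None:
            out.append(area.regs["pc"])
            area = area.prev
        return out


cpu = Processor()
cpu.regs = {"pc": 0x1000}          # a thread running at base level
cpu.take_interrupt(0x500)          # first interrupt
cpu.take_interrupt(0x900)          # nested, more favored interrupt
nested = cpu.history()
cpu.return_from_interrupt()
after_first_return = cpu.regs["pc"]
cpu.return_from_interrupt()
final_pc = cpu.regs["pc"]
```

Following the `prev` links from the csa area yields the interrupted contexts in most-recent-first order, ending with the base level state.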


mstsave area description

Because the mstsave (machine state) areas are linked together, the mstsave areas provide an interrupt history stack.

Whenever AIX receives an interrupt that is of higher priority than what it is currently doing, it must save the state of the machine into an mstsave area. The csa (Current Save Area) pointer points to an unused mstsave area that AIX can use if another, higher-priority interrupt comes in. This area may contain stale data from being used for a previously-handled interrupt, but its prev pointer always points to the previous mstsave area (or is null if there aren’t any more in use at that time).

These areas are linked together from most-recently to least-recently used, so this means that they go from higher to lower interrupt priority. At the end of the mstsave chain is the mstsave area for the base interrupt level. This mstsave area contains the state of the machine when it was last doing something other than interrupt processing (that is, the machine state when the oldest interrupt that we are currently processing came in).

Size limitation on mstsave area and interrupt stack

The stack used by an interrupt handler is kept in the same page as the mstsave area. This limits the stack to 4 KB on the 32-bit kernel and 8 KB on the 64-bit kernel, minus the size of the mstsave area. Using this area for the stack ensures that the stack is pinned, which is required for interrupt handlers.


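[Figure: the mstsave chain. The csa pointer points to an unused mstsave area (the next interrupt goes here). Its prev pointer leads to the mstsave area of the high priority interrupt, then to that of the low priority interrupt, and finally to the mstsave area of the base interrupt level.]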


Saving base level machine state

The user64 area is only used when the process is a 64-bit process in a 32-bit kernel. If the user64 area is being used it is initialized and pinned. The area is created when a process calls exec() for a 64-bit executable. It is destroyed when a 64-bit process exits or calls exec() for a 32-bit executable.

The portion of the base level state save area that contains the 32-bit registers is unused for 64-bit processes.

On a 32-bit kernel, only the base level state save (MST) area needs to have a 64-bit register state save area (user64) associated with it. Since all interrupt handlers run in 32-bit kernel mode, all state save areas other than the base level state save area need to save only 32-bit state (even on 64-bit hardware). On a 64-bit kernel, all MST areas are 64-bit.

The thread’s base level state save area is in the initial thread’s uthread block.

The initial thread’s uthread block is in the process’s ublock.

In the 32-bit kernel, there is also the user64 area, which is used to save the 64-bit user registers for 64-bit processes.

[Figure: the process ublock, containing the initial thread’s uthread block (with the base level mst save area), the user area, and user64 (32-bit kernel only).]


Unit 2. IA-64 Hardware Overview

This unit describes: The IA-64 hardware architecture.

What You Should Be Able to Do

• list the registers available to programs

• describe how EPIC improves performance


IA-64 Hardware Overview

Introduction to IA-64

IA-64 is Intel’s 64-bit architecture, based on the Explicitly Parallel Instruction Computing (EPIC) design philosophy. These are the IA-64 goals:

• Overcome the limitations of today’s architectures.

• Provide world class floating point performance.

• Support large memory needs with 64-bit addressability.

• Protect existing investments with IA-32 compatibility.

• Support growing high-end application workloads for e-business, enterprise, and technical computing.

Performance

IA-64 increases performance by using available compile-time information to reduce current performance limiters, thereby moving some of the performance burden from the microarchitecture to the compiler. This enables designing simpler processors, which are more likely to achieve higher frequencies.

To achieve improved performance, IA-64 code:

• Increases instruction level parallelism (ILP)

• Improves branch handling

• Hides memory latencies

• Supports modular code

IA-64 increases ILP by providing more architectural resources: large register files, and a 3-instruction wide word.

The architecture also enables the compiler/assembly writer to explicitly indicate parallelism.

Branch handling is improved by providing the means to minimize branches in the code, increasing the branch prediction rate for the remaining branches, and providing specific support for typical branches.

Memory latency is reduced by allowing the compiler to schedule loads earlier in the code and enabling memory hierarchy cache management.

IA-64 supports the current compiler trend to produce modular code by providing specific hardware support for function calls and returns.


IA-64 formats

Data types

The following data types are supported:

• Integer: 1, 2, 4 and 8 byte(s)

• Floating-point: single, double and double-extended formats

• Pointers: 8 bytes

The basic IA-64 data type is 8 bytes. Apart from a few exceptions, all integer operations are on 64-bit data, and registers are always written as 64 bits. Therefore, 1, 2 and 4 byte operands loaded from memory are zero-extended to 64 bits.


[Figure: Integer data types of 1, 2, 4, and 8 bytes (bits 7-0, 15-0, 31-0, and 63-0). Floating-point data types of 32, 64, and 80 bits (bits 31-0, 63-0, and 79-0).]


Instruction format

A typical IA-64 instruction is a three operand instruction, with the following syntax:

[(qp)] mnemonic[.comp1][.comp2] dests = srcs

Some examples of different IA-64 instructions:

Simple Instruction

add r1 = r2, r3

Predicated instruction

(p4)add r1 = r2, r3

Instruction with immediate

add r1 = r2, r3, 1

Instruction with completer

cmp.eq p3, p4 = r2, r4

(qp): A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true. Some instructions cannot be predicated.

mnemonic: A unique name identifying the instruction.

[.comp1][.comp2]: Some instructions may include one or more completers. Completers indicate optional variations on the basic mnemonic.

dests, srcs: Most IA-64 instructions have at least two source operands and a destination operand. Source operands are used as input. Typically, the source operands are registers or immediates. The destination operand(s) is typically a register to which the result is written.
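The effect of the qualifying predicate can be modeled with a toy register file. This Python sketch is not an IA-64 simulator: the `execute` function and the list-based register and predicate files are hypothetical, and only a three-operand instruction of the form (qp) op dest = src1, src2 is covered.

```python
import operator

MASK64 = (1 << 64) - 1


def execute(predicates, registers, qp, op, dest, src1, src2):
    """Model (qp) op dest = src1, src2 on a toy register file."""
    if not predicates[qp]:
        return False   # predicate is false: the instruction acts as a NOP
    registers[dest] = op(registers[src1], registers[src2]) & MASK64
    return True


predicates = [True] + [False] * 63   # p0 is always true
registers = [0] * 128
registers[2], registers[3] = 40, 2

# (p4) add r1 = r2, r3 with p4 false: r1 is left unchanged
skipped = execute(predicates, registers, 4, operator.add, 1, 2, 3)
r1_after_skip = registers[1]

# the same instruction once p4 is true
predicates[4] = True
executed = execute(predicates, registers, 4, operator.add, 1, 2, 3)
```

An instruction written without a predicate corresponds to calling `execute` with `qp` set to 0, which always executes.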


IA-64 memory

Memory organization

IA-64 defines a single, uniform, linear address space of 2^64 bytes which is divided into 8 regions of size 2^61. A single space means that both data and instructions share the same memory range. Uniform means that there are no address regions with predefined functionality. Linear means that the address space contains no segments; all 2^64 bytes are consecutive.

All code is stored in little-endian byte order in memory. Data is typically stored in little-endian byte order as well, but IA-64 also provides support for big-endian data and operating systems.

Data is moved between registers and memory strictly through the load (ld) and store (st) instructions. IA-64 supports loads and stores of all data types. Because registers are always written as 64 bits, loads of smaller operands are zero-extended. Stores always write the exact number of bytes for the required format.

The size of the memory location is specified in the opcode as a number of bytes:

• st1/ld1 = byte (8 bits)

• st2/ld2 = halfword (16 bits)

• st4/ld4 = word (32 bits)

• st8/ld8 = doubleword (64 bits)

Examples:

// Loads 32 bits from address 4 + r30 into r31; the high 32 bits of r31 are cleared

add r31 = 4, r30 ;;

ld4 r31 = [r31]

// Stores 64 bits from r3 to address r29 - 8

add r24 = -8, r29 ;;

st8 [r24] = r3

// Loads 8 bits from address 0x27 + r1 into r3

add r2 = 0x27, r1 ;;

ld1 r3 = [r2]
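The zero-extension of loads and the exact-width behavior of stores can be checked with a small memory model. This is a Python sketch, not IA-64 code: the `load` and `store` helpers and the 16-byte `memory` buffer are hypothetical, and little-endian data is assumed.

```python
def load(memory, address, size):
    """Model ld1/ld2/ld4/ld8: read `size` bytes, zero-extended to 64 bits."""
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4, or 8")
    return int.from_bytes(memory[address:address + size], "little")


def store(memory, address, size, value):
    """Model st1/st2/st4/st8: write exactly the `size` low-order bytes."""
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4, or 8")
    low = value & ((1 << (8 * size)) - 1)
    memory[address:address + size] = low.to_bytes(size, "little")


memory = bytearray(16)
store(memory, 0, 8, 0xFFFFFFFFFFFFFFFF)   # st8: all eight bytes written
store(memory, 8, 4, 0x1122334480000001)   # st4: only the low four bytes written

r31 = load(memory, 0, 4)   # ld4: the upper 32 bits of the result are zero
r3 = load(memory, 0, 1)    # ld1: the upper 56 bits of the result are zero
r4 = load(memory, 8, 8)    # ld8: shows that st4 left bytes 12 to 15 untouched
```

Note that the loaded values are never sign-extended: a 4-byte value with its top bit set still reads back as a positive 64-bit quantity.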


Region Usage

On IA64, the 64-bit linear address space consists of 8 regions of size 2^61, with the upper 3 bits of the address selecting a virtual region, a physical region register, and an associated region identifier. The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation such that in order to share the same address translation, the same RID must be used. The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID.

For example, a process’s private storage resides within a region whose RID is mapped only by that process. Therefore, address space usage is in a large part determined by assigning the desired sharing semantics to each of the 8 virtual regions and mapping the appropriate objects into those regions that require those semantics.
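Because the upper 3 bits of an address select the virtual region, splitting a 64-bit address into a region number and an offset within the region is simple bit arithmetic. The Python sketch below illustrates this; the function name `split_address` and the sample addresses are hypothetical.

```python
OFFSET_BITS = 61                      # each region spans 2^61 bytes
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def split_address(address):
    """Return (virtual region number, offset within region)."""
    address &= (1 << 64) - 1          # treat the address as 64 bits wide
    return address >> OFFSET_BITS, address & OFFSET_MASK


vrn_high, offset_high = split_address(0xE000000000001000)
vrn_one, offset_one = split_address(0x2000000000000000)
vrn_zero, offset_zero = split_address(0x1FFFFFFFFFFFFFFF)
```

The first address falls in region 7 at offset 0x1000, the second is the first byte of region 1, and the third is the last byte of region 0.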

There are two important properties associated with this region usage. First, the mapping of objects to regions is many-to-one. That is, multiple objects map into a single region. Second, mapping the same object to different regions results in aliases. This is a distinct difference from the POWER architecture, where an object (a.k.a. SID) is addressed the same regardless of the virtual address used. Aliases imply additional address translations on IA64 and thus a likelihood of decreased performance, so their use should be minimized.

Another significant departure from AIX is that the majority of the 64-bit address space is managed using Single Address Space (SAS) semantics. This is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object all accesses must be made through a common global address. Such a semantic is possible by virtue of the IA64 protection keys which provide additional access control beyond address translations. So, a process that maps a region only has accessibility to those objects within that region for which it has the appropriate protection key. Note that AIX manages some parts of the process address space as SAS -- for example, the shared library text segment contains mappings whose addresses are common across all processes. The AIX use of the SAS style of management is minimal because the POWER architecture provides for sharing on a segment basis regardless of the virtual address used to map the segment. To achieve the same degree of sharing on IA64 a shared object must be mapped at a global address.


In addition to the sharing semantics there are additional properties that influence the location of objects within regions. First, to preserve the flat-address space with a logical boundary between user and kernel space it is useful to place user and kernel objects at opposite ends of the address space whenever feasible. Next, the IA64 architecture provides for multiple page sizes and a preferred page size per region so objects with similar page size requirements are most naturally colocated within the same region. Finally, certain object types such as executable text have properties and uses which mandate that they be isolated to a separate region.

Given these general guidelines, the following table shows the selected region usage and subsequent sections describe each region use in greater detail. These selections provide for 4 regions dedicated to user space and 3 for kernel for the initial release.

VRN  Style    Name     Example Uses
0    MAS      Private  process data, stack, heap, mmap, ILP32 shared library
                       Private text, ILP32 main text, u-block, kernel thread
                       stacks/msts
1    SAS/MAS  Text     LP64 shared library text, LP64 main text
2    SAS               LP64 shmat
3    SAS               LP64 shmat w/ large superpage
4    n/a               reserved
5    SAS      Temp     kernel temporary attach, global buffer pool
6    SAS      Kernel2  kernel global w/ large page size
7    SAS      Kernel   kernel global

Table: Virtual Region Usage


IA-64 Instructions

Instruction level parallelism (ILP)

IA-64 enables improving instruction level parallelism (ILP) by:

• Enabling the compiler/assembly writer to explicitly indicate parallelism.

• Providing a three-instruction-wide word, called a bundle, that facilitates parallel processing of instructions.

• Providing a large number of registers, enabling using different registers for different variables and avoiding register contention.

IA-64 instructions are bound in instruction groups. An instruction group is a set of instructions which do not have read-after-write (RAW) or write-after-write (WAW) dependencies between them and may execute in parallel. In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources.

An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;) placed in the code by the assembly writer or compiler. An instruction group may also end dynamically during run-time by a taken branch.

Instruction groups reduce the need to optimize the code for each new microarchitecture. Processors with additional resources will take advantage of the existing ILP in the instruction group.
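The dependency rule for instruction groups can be expressed as a simple check. The Python sketch below is a simplified illustration: the function `first_violation` and the (destination, sources) tuple encoding are hypothetical, and any special cases the architecture may allow are ignored.

```python
def first_violation(group):
    """group: list of (dest, sources) pairs in program order.

    Returns (index, kind) for the first RAW or WAW dependency found
    inside the group, or None if the group is free of both.
    """
    written = set()
    for index, (dest, sources) in enumerate(group):
        if any(source in written for source in sources):
            return index, "RAW"    # reads a register written earlier in the group
        if dest in written:
            return index, "WAW"    # writes a register written earlier in the group
        written.add(dest)
    return None


# add r1 = r2, r3 and add r4 = r5, r6 are independent
independent = first_violation([("r1", ["r2", "r3"]), ("r4", ["r5", "r6"])])

# add r31 = 4, r30 followed by ld4 r31 = [r31] needs a cycle break between them
raw = first_violation([("r31", ["r30"]), ("r31", ["r31"])])

# two writes to r1 in the same group
waw = first_violation([("r1", ["r2"]), ("r1", ["r3"])])
```

A code generator would place a cycle break (;;) before the instruction at the reported index so that the dependent instruction starts a new instruction group.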


[Figure: Parallel Instruction Processing on an IA-64 processor]


Instruction groups and bundles

Instruction groups are composed of 41-bit instructions contained in bundles. Each bundle contains three instructions, and a template field, which are set during code generation, by a compiler, or the assembler. The code generation process ensures instruction group assignment without RAW or WAW dependency violations within the instruction group.

The template field maps each instruction to an execution unit. This allows the processor to dispatch all three instructions in parallel.

Bundles are aligned at 16-byte boundaries.
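Three 41-bit instructions plus the template add up to exactly 128 bits, which is why bundles fit on 16-byte boundaries. The following Python sketch packs and unpacks a bundle; the helper names are hypothetical, and the field positions (template in bits 4-0, slot 0 in bits 45-5, slot 1 in bits 86-46, slot 2 in bits 127-87) are taken from the bundle structure figure in this section.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Combine a template and three instruction slots into a 128-bit value."""
    bundle = template & TEMPLATE_MASK
    for index, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [slot 0, slot 1, slot 2])."""
    template = bundle & TEMPLATE_MASK
    slots = [(bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
             for index in range(3)]
    return template, slots


def is_bundle_aligned(address):
    return address % 16 == 0


bundle = pack_bundle(0x10, [0x1, 0x2, SLOT_MASK])
template, slots = unpack_bundle(bundle)
```

The slot and template values used here are arbitrary bit patterns, not real instruction encodings.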

Template

The template field can end the instruction group either at the end of the bundle or in the middle of the bundle.


[Figure: Bundle structure. Bits 127-87 hold instruction slot 2, bits 86-46 hold instruction slot 1, bits 45-5 hold instruction slot 0, and bits 4-0 hold the template.]

The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are listed below, where:

• M is a memory function

• I is an integer function

• F is a floating point function

• B is a branch function

• L is a function involving a long immediate

• "s" indicates a stop

* L+X is an extended type that is dispatched to the I-unit.

Templates without a stop at the end of the bundle: MII, MIsI, MLX*, MMI, MsMI, MFI, MMF, MIB, MBB, BBB, MMB, MFB

Templates with a stop at the end of the bundle: MIIs, MIsIs, MLXs*, MMIs, MsMIs, MFIs, MMFs, MIBs, MBBs, BBBs, MMBs, MFBs


Instruction set

A basic IA-64 instruction has the following syntax:

[qp] mnemonic[.comp] dest=srcs

Where:

qp: Specifies a qualifying predicate register. The value of the qualifying predicate determines whether the results of the instruction are committed in hardware or discarded. When the value of the predicate register is true (1), the instruction executes, its results are committed, and any exceptions that occur are handled as usual. When the value is false (0), the results are not committed and no exceptions are raised. Most IA-64 instructions can be accompanied by a qualifying predicate.

mnemonic: Specifies a name that uniquely identifies an IA-64 instruction.

comp: Specifies one or more instruction completers. Completers indicate optional variations on a base instruction mnemonic. Completers follow the mnemonic and are separated by periods.

dest: Represents the destination operand(s), which is typically the result value(s) produced by an instruction.

srcs: Represents the source operands. Most IA-64 instructions have at least two input source operands.


Branch instructions

All instructions beginning with “br.” are branches. The IA-64 architecture provides three branch types:

• Relative direct branches, using a 21-bit displacement that is added to the instruction pointer of the bundle containing the branch.

• Long branches, which go to an explicit address by using a 60-bit displacement from the current instruction pointer.

• Indirect branches, using 64-bit addresses in the branch registers.

IA-64 allows multiple branches to be evaluated in parallel. The first branch that is predicated true is taken.

Extended mnemonics are defined by the assembler to cover most combinations: br.cond, br.call, br.ia, br.ret, br.cloop, br.ctop, br.cexit

Branch prediction hints can be provided with branch hints as part of a branch instruction, or with separate Branch Predict instructions (brp).


IA-64 Registers

Registers

IA-64 provides several register files that are visible to the programmer:

• 128 General registers

• 128 Floating-point registers

• 64 Predicate registers

• 8 Branch registers

• 128 Application registers

• Instruction Pointer (IP) register

Registers are referred to by a mnemonic denoting the register type and a number. For example, general register 32 is named r32.


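[Figure: the register files. General registers gr0-gr127 (bits 63-0; gr0 reads as 0), floating-point registers fr0-fr127 (bits 81-0; fr0 reads as 0.0), predicate registers pr0-pr63 (1 bit each), branch registers br0-br7 (bits 63-0), application registers ar0-ar127 (bits 63-0), and the instruction pointer (bits 63-0).]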


General registers


IA-64 provides 128 64-bit general purpose registers for all integer and multimedia computation.

• Register gr0 is a read-only register and is always zero (0).

• 32 registers are static and global to the process.

• 96 registers are stacked. These registers are for argument passing and local register stack frame. A portion of these registers can also be used for software pipelining.

Each register has an associated NaT bit, indicating whether the value stored in the register is valid.

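[Figure: general registers gr0-gr127, bits 63-0, each with a NaT bit. gr0 reads as 0 and its NaT bit is 0. Registers gr0-gr31 are static and gr32-gr127 are stacked.]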


Floating-point registers


IA-64 provides 128 82-bit floating-point registers, for floating-point computations. All floating-point registers are globally accessible within the process. There are:

• 32 static floating-point registers

• 96 rotating floating-point registers, for software pipelining

The first two registers (fr0 and fr1) are read-only:

• fr0 is read as +0.0

• fr1 is read as +1.0.

Each register contains three fields:

• 64-bit significand field

• 17-bit exponent field

• 1-bit sign field.
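The three fields can be illustrated with a minimal Python sketch that packs and unpacks an 82-bit register image. The bit positions (sign in bit 81, exponent in bits 80-64, significand in bits 63-0) are read from the register figure; the exponent bias of 0xFFFF and the helper names are assumptions of this sketch and are not stated in this guide.

```python
# Assumed layout: bit 81 = sign, bits 80..64 = exponent, bits 63..0 = significand.
# The bias value is an assumption of this sketch, not taken from this guide.
EXPONENT_BIAS = 0xFFFF


def pack_fr(sign, exponent, significand):
    """Build an 82-bit register image from its three fields."""
    return (sign << 81) | (exponent << 64) | significand


def unpack_fr(raw):
    """Split an 82-bit register image into (sign, exponent, significand)."""
    sign = (raw >> 81) & 0x1
    exponent = (raw >> 64) & 0x1FFFF          # 17 bits
    significand = raw & 0xFFFFFFFFFFFFFFFF    # 64 bits
    return sign, exponent, significand


def fr_value(raw):
    """Value of a normal register image; special encodings are ignored."""
    sign, exponent, significand = unpack_fr(raw)
    if exponent == 0 and significand == 0:
        return -0.0 if sign else 0.0
    # The significand carries an explicit integer bit in bit 63.
    magnitude = significand / 2**63 * 2.0 ** (exponent - EXPONENT_BIAS)
    return -magnitude if sign else magnitude


fr0 = pack_fr(0, 0, 0)                      # reads as +0.0
fr1 = pack_fr(0, EXPONENT_BIAS, 1 << 63)    # reads as +1.0
print(fr_value(fr0), fr_value(fr1))
```

The field widths add up to 1 + 17 + 64 = 82 bits, which is why the registers are 82 bits wide.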

[Figure: floating-point registers fr0-fr127, each 82 bits wide (bits 81-0); fr0 holds +0.0 and fr1 holds +1.0.]


Predicate registers

64 one-bit predicate registers control the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:

• validating/invalidating instructions

• eliminating branches in if/then/else logic blocks

There are:

• 16 static predicate registers

• 48 rotating predicate registers for controlling software pipelining

Instructions that are not explicitly preceded by a predicate default to the first predicate register, pr0, which is read-only and is always true (1).

Whenever a program encounters a conditional branch, such as an ‘if-then-else’ construct, the outcome of the condition determines which path is executed. Branch prediction is the traditional solution: the processor predicts which path will be taken and starts executing that path in advance. Of course, if the prediction is wrong, there is a performance penalty, because the work done on the predicted path is discarded and the other path has to be executed.

With predication, IA-64 can execute the instructions of both paths in parallel, each guarded by a predicate register. Instructions whose predicate is false (0) have no effect. This way the processor executes both paths without the penalty of a mispredicted branch.

[Figure: predicate registers pr0-pr63, each one bit wide. pr0-pr15 are static, pr16-pr63 rotate.]


Branch registers

Eight 64-bit branch registers are used to specify the branch target addresses for indirect branches.

The branch registers streamline call/return branching.

[Figure: branch registers br0-br7, each 64 bits wide (bits 63-0).]

IA-64 improves branch handling by:

• providing the means to minimize branches in the code through the use of qualifying predicates

• providing support for special branch instructions

A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.

Predication enables you to convert a control dependency to a data dependency, thus eliminating branches in the code. An instruction is control dependent if it depends on a branch instruction to execute. Two instructions are data dependent if the first produces a result that is used by the second, or if the second depends on the first through a third instruction. Dependent instructions cannot be executed in parallel, and their execution sequence cannot be changed.
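The effect of qualifying predicates can be illustrated with a small Python model. This is only a sketch of the idea, not IA-64 code: the function names are hypothetical, and the assembly-like text in the comments is an informal approximation.

```python
def abs_diff_with_branch(a, b):
    """Conventional form: a conditional branch selects one of two paths."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_with_predicates(a, b):
    """Predicated form: no branch; both instructions are issued."""
    # One compare sets a predicate and its complement.
    p1 = a > b          # roughly: cmp.gt p1, p2 = a, b
    p2 = not p1
    result = None
    # Each instruction takes effect only if its qualifying predicate is true;
    # otherwise it behaves as a NOP and leaves the result unchanged.
    result = a - b if p1 else result    # roughly: (p1) sub result = a, b
    result = b - a if p2 else result    # roughly: (p2) sub result = b, a
    return result


print(abs_diff_with_branch(3, 10), abs_diff_with_predicates(3, 10))
```

The control dependency on the branch has become a data dependency on p1 and p2, so the two subtractions are independent of each other and can be issued together.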


Application registers

128 special purpose registers are used for various functions. Some of the more commonly used application registers have assembler aliases. For example, ar66 is used as the Epilogue Counter (EC) and is called ar.ec.

[Figure: application registers ar0-ar127, each 64 bits wide (bits 63-0). The named registers are:]

• ar0-ar7: KR0-KR7

• ar16: RSC

• ar17: BSP

• ar18: BSPSTORE

• ar19: RNAT

• ar32: CCV

• ar36: UNAT

• ar40: FPSR

• ar44: ITC

• ar64: PFS

• ar65: LC

• ar66: EC

Instruction pointer (IP)

The 64-bit instruction pointer holds the address of the bundle that contains the currently executing instruction. The IP can be read directly, but it cannot be written directly; it is incremented as instructions are executed, and branch instructions set it to a new value. The IP is always 16-byte aligned.


Register validity

When data has to travel from memory to the processor, there is always a delay before it arrives. This delay is called ‘memory latency’. To hide it, the processor tries to read the memory ahead of time.

If data has been read in advance and other data is then written to that same location, the data already read becomes invalid.

Speculative memory access creates a need to delay exception handling. This is enabled by propagating exception conditions.

Each general register has a corresponding NaT (Not a Thing) bit. The NaT bits enable propagating validity/invalidity of a speculative load result.

Floating-point registers use a special instance of pseudo-zero, called NaTVal. NaTVal is a floating-point register value used to propagate valid/invalid results of speculative loads of floating-point data.

[Figure: general registers gr0-gr127, each 64 bits wide (bits 63-0) with an additional NaT bit.]


IA-64 Operations

Software pipelining loops

Loop performance is traditionally improved through software techniques. However, these techniques entail significant additional code:

• Loop unrolling requires multiple copies of the original loop in the unrolled loop. The loop instructions are replicated and the end code adjusted to eliminate the branch.

• Software pipelining requires adding prolog code to fill the execution pipe and epilog code that drains it. Software pipelining is a method that enables the processor to execute, at any given time, several instructions in various stages of the loop.

IA-64 provides hardware support for software pipelining loops, eliminating the need for additional prolog and epilog code through the use of:

• special branch instructions

• Loop count (LC) and epilogue count (EC) application registers

• rotating registers

Rotating registers are registers which are rotated by one register position on each loop execution. The logical names of the registers are rotated in a wrap-around fashion, so that logical register X is logical register X+1 after one rotation. The predicate, floating-point and general registers can be rotated.
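The wrap-around renaming can be modelled in a few lines of Python. This is a minimal sketch: the class, its size parameter and the `rrb` rename-base field are names chosen for this illustration, and the model ignores which part of each register file actually rotates.

```python
class RotatingFile:
    """Toy model of a rotating register region."""

    def __init__(self, size):
        self.physical = [None] * size
        self.rrb = 0    # rename base, adjusted on every rotation (hypothetical name)

    def _index(self, logical):
        # Logical register numbers are mapped to physical slots modulo the size.
        return (logical + self.rrb) % len(self.physical)

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def read(self, logical):
        return self.physical[self._index(logical)]

    def rotate(self):
        # After a rotation, the value that was in logical X is in logical X+1.
        self.rrb = (self.rrb - 1) % len(self.physical)


regs = RotatingFile(8)
regs.write(0, "stage-1 result")
regs.rotate()
print(regs.read(1))   # the same physical register, now named logical 1
```

No data is copied on a rotation; only the mapping from logical names to physical registers changes, which is what lets each loop iteration use the same instruction text while working on a different stage.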

IA-64 provides support for special branch instructions. One example is the br.cloop instruction, used for simple counted loops.

The cloop branch instruction uses the LC application register, and not a qualifying predicate, to determine the branch condition.

The cloop branch checks whether the LC register is zero. If it is not, it decrements LC and the branch is taken. After the last iteration LC is zero and the branch is not taken, avoiding a branch misprediction.
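The branch rule described above can be traced with a small Python model. It is a sketch only: `counted_loop` is a hypothetical helper, and initialising LC to the trip count minus one is an assumption that follows from the loop body running once before the first check.

```python
def counted_loop(iterations, body):
    """Run body() `iterations` times (at least once), following the cloop rule."""
    lc = iterations - 1             # assumed initial LC value: trip count minus one
    while True:
        body()
        # Model of br.cloop: if LC is not zero, decrement it and take the branch.
        if lc != 0:
            lc -= 1
            continue
        break                       # LC is zero: fall through, the loop ends
    return lc


calls = []
counted_loop(5, lambda: calls.append(len(calls)))
print(len(calls))
```

Because the branch decision depends only on LC, the hardware knows in advance which execution of the branch is the last one.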


Reduced memory access costs

As current processors increase in speed and parallelism, more scheduling opportunities are lost while memory is accessed.

IA-64 allows you to eliminate many memory accesses through the use of large register files to manage work in progress, and by allowing better control of the memory hierarchy.

Furthermore, the cost of the remaining memory accesses is dramatically reduced by moving load instructions earlier in the code. This hides memory latency, which is the time between the issue of a load instruction and the moment when the result of this instruction can be used. The data then arrives in time, and the processor does not stall. Memory latency is hidden through the use of:

• Data speculation - the execution of an operation before its data dependency is resolved.

• Control speculation - the execution of an instruction before its control dependency is resolved.

The large number of registers in IA-64 enables multiple computations to be performed without having to store temporary data in memory. This reduces the number of memory accesses.

[Figure: Hiding memory latency. A load (ld) that follows a dependency is moved above it as an early load, and a validity check is left at the original position of the load.]


Memory access is supported through the load (ld) and store (st) instructions. All other integer, floating-point and branch instructions use the registers as operands.

IA-64 enables you to hide the memory latencies of the remaining load instructions by placing speculative loads ahead of barriers in the code. This minimizes the stall caused by memory latency and creates more opportunities for parallelism. When you use speculative loads, error/exception detection is deferred until the final result is actually required:

• If no error/exception is detected the latency is hidden.

• If an error/exception is detected then memory accesses and dependent instructions must be redone by an exception handler.

IA-64 provides an advanced load instruction (ld.a) that allows you to move potentially data dependent loads earlier in the code.

To verify the data speculation, a check load instruction (ld.c) must be placed at the location of the original load instruction.

If the contents of the memory address have not changed since the advanced load, the speculation succeeded, and the memory latency is hidden. If the contents of the memory address have been changed by a store instruction, the ld.c instruction repeats the load.

Data speculation does not defer exceptions. For example page faults are taken immediately.
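The interplay of ld.a, an intervening store and ld.c can be modelled as follows. This is a simplified Python sketch: the `Machine` class and its `tracked` table are hypothetical stand-ins for whatever the hardware uses to remember advanced loads, and register and address values are arbitrary.

```python
class Machine:
    """Toy model of data speculation with an advanced load and a check load."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.tracked = {}    # register name -> address of its advanced load
        self.loads = 0       # number of real memory reads performed

    def _load(self, addr):
        self.loads += 1
        return self.memory[addr]

    def ld_a(self, reg, addr):
        """Advanced load: read early and remember which address was read."""
        self.tracked[reg] = addr
        return self._load(addr)

    def st(self, addr, value):
        """Store: any advanced load from the same address becomes invalid."""
        self.memory[addr] = value
        self.tracked = {r: a for r, a in self.tracked.items() if a != addr}

    def ld_c(self, reg, addr, current):
        """Check load: keep the early value if still valid, otherwise reload."""
        if self.tracked.get(reg) == addr:
            return current
        return self._load(addr)


m = Machine({0x100: 1, 0x200: 2})
r4 = m.ld_a("r4", 0x100)        # load moved above the store
m.st(0x200, 9)                  # store to a different address
r4 = m.ld_c("r4", 0x100, r4)    # speculation succeeded, no second load
print(r4, m.loads)
```

If the store had gone to address 0x100 instead, the check load would have read memory again and returned the stored value.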

Also, IA-64 provides a control-speculative load instruction (ld.s), which executes the load while speculating the results of the governing branch. Control-speculative loads are also referred to as speculative loads.

To verify the load, a check instruction (chk.s) is placed at the location of the original load. IA-64 uses a NaT bit/NaTVal, to track the success of the load. If the NaT bit/NaTVal indicates a deferred exception, the chk.s instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by a compiler or assembly writer.

If the load is successful, the speculation succeeded, and the memory latency is hidden.
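A small Python model can illustrate how the NaT bit defers an exception until the check. It is a sketch under simplifying assumptions: the `Reg` class, the dictionary used as memory and the use of a missing key to stand for a faulting access are all hypothetical.

```python
class Reg:
    """A general register value together with its NaT bit."""

    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat


def ld_s(memory, addr):
    """Speculative load: a failing access sets NaT instead of raising."""
    if addr not in memory:
        return Reg(0, nat=True)
    return Reg(memory[addr])


def add(a, b):
    """The NaT bit propagates through dependent instructions."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)


def chk_s(reg, recovery):
    """Check: run the correction code only if a deferred exception is recorded."""
    return recovery() if reg.nat else reg


def compute(memory, addr, use_it):
    r1 = ld_s(memory, addr)         # load hoisted above the governing branch
    r2 = add(r1, Reg(1))            # dependent instruction, also hoisted
    if not use_it:
        return None                 # result not needed, so nothing is checked
    # Original location of the load: verify, and redo the work on failure.
    r2 = chk_s(r2, lambda: add(Reg(memory[addr]), Reg(1)))
    return r2.value


print(compute({8: 41}, 8, True))
print(compute({}, 8, False))        # bad address, but no exception is raised
```

When the result is needed and the load failed, the correction code repeats the load non-speculatively, and only then does the exception (here a `KeyError`) surface.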


IA-64 also provides a combined speculation load (ld.sa), which enables placing a load before both a control barrier and a data barrier. Use this type of speculative load to advance a load around a procedure call.

To verify the speculation, a special check instruction (chk.a) is placed at the location of the original load instruction. If the load is successful, the speculation succeeded, and the memory latency is hidden.

If an exception was generated, or the data was invalidated, the chk.a instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by a compiler or assembly writer.

Procedure calls

The traditional use of a procedure stack in memory for procedure call management demands a large overhead. IA-64 uses the general register stack for procedure call management, thus eliminating the frequent memory accesses. The general register stack consists of 96 general registers, starting at r32, used to pass parameters to the called procedure and store local variables for the currently executing procedure. The new structure of a register stack allows:

• the caller procedure to pass parameters through registers to the called procedure

• dynamic allocation of local registers for the currently executing procedure

• allocating a maximum of 96 logical registers for each function

[Figure: procedure call handling, IA-32 compared with IA-64]

IA-32:

• Procedure A: call B ...

• Procedure B: save current register state ... restore previous register state, return

IA-64:

• Procedure A: call B

• Procedure B: alloc (no save!) ... (no restore!) return


When a procedure call is executed, the called procedure receives a procedure frame which contains the output registers of the caller as input.

The called procedure can resize the frame to include its own input, local and output area, using the alloc instruction. For each subsequent call, this sequence is repeated, and a new procedure frame is created.

When the procedure returns, the processor unwinds the register stack, the current frame is released, and the previous procedure’s frame is restored.


The general register stack is divided into two subsets:

• Static: The first 32 physical registers (r0-r31) are permanent registers, visible to all procedures, in which global variables are placed.

• Stacked: The other 96 physical registers behave like a stack. The procedure code allocates up to 96 input and output registers for a procedure frame. An integral mechanism ensures that a stack overflow or underflow never occurs.

As each procedure frame is allocated, the previous frame is hidden, and the first register in the frame is renamed as logical register r32.

Using small register frames eliminates or reduces the need for saving and restoring registers to and from memory, when allocating a new register stack frame.
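The frame renaming can be sketched in Python. This model is illustrative only: the class and the field names `base`, `sol` (size of inputs plus locals) and `sof` (size of the whole frame) are choices made for this sketch, and overflow handling is left out.

```python
class RegisterStack:
    """Toy model of stacked general registers and procedure frames."""

    FIRST_STACKED = 32

    def __init__(self):
        self.physical = {}   # physical slot -> value
        self.base = 0        # physical slot currently named r32
        self.sol = 0         # inputs + locals in the current frame
        self.sof = 0         # total size of the current frame
        self.saved = []      # frame markers of the callers

    def alloc(self, inputs, locals_, outputs):
        """Resize the current frame, as the alloc instruction does."""
        self.sol = inputs + locals_
        self.sof = self.sol + outputs
        assert self.sof <= 96, "a frame holds at most 96 registers"

    def _slot(self, reg):
        offset = reg - self.FIRST_STACKED
        assert 0 <= offset < self.sof, "register outside the current frame"
        return self.base + offset

    def write(self, reg, value):
        self.physical[self._slot(reg)] = value

    def read(self, reg):
        return self.physical[self._slot(reg)]

    def call(self):
        """The callee's frame starts at the caller's first output register."""
        self.saved.append((self.base, self.sol, self.sof))
        self.base += self.sol
        self.sof -= self.sol     # the callee initially sees only the outputs
        self.sol = 0

    def ret(self):
        """Release the current frame and restore the caller's frame."""
        self.base, self.sol, self.sof = self.saved.pop()


rs = RegisterStack()
rs.alloc(0, 2, 1)            # procedure A: r32-r33 local, r34 output
rs.write(32, "A local")
rs.write(34, "argument")
rs.call()                    # procedure B is entered
print(rs.read(32))           # A's r34 is B's r32
rs.ret()
print(rs.read(32))           # A's own r32 is visible again
```

Parameters are passed without any memory traffic: the caller's output register and the callee's r32 are the same physical register under two names.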

[Figure: the general register stack. gr0-gr31 are the global registers; gr32-gr127 are the stacked registers, part of which form the current procedure frame.]


Register stack engine


Using a register stack reduces the need to perform memory saves. However, when a procedure tries to use more physical registers than remain on the stack, a register stack overflow could occur.

IA-64 uses a hardware mechanism called a Register Stack Engine (RSE), which operates transparently in the background, to ensure that an overflow does not occur, and that the contents of the registers are always available. The RSE is not visible to the software.

When the stack fills up, the RSE saves logical registers to memory, thus freeing them. The stored registers are restored in the same way when necessary.

Through this mechanism, the RSE gives software the appearance of an unlimited number of physical registers for allocation.
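The spill and fill behaviour can be illustrated with a toy Python model. It is a sketch only: the class, the `backing_store` list standing in for memory, and the policy of spilling the oldest registers first are assumptions of this illustration, and a small physical size is used to force an overflow.

```python
class RegisterStackEngine:
    """Toy model: frames live in a fixed set of physical registers."""

    def __init__(self, physical_size=96):
        self.physical_size = physical_size
        self.resident = []        # values held in physical registers, oldest first
        self.backing_store = []   # values spilled to memory, oldest first
        self.frames = []          # sizes of the active frames

    def call(self, values):
        """Allocate a frame, spilling the oldest registers if it does not fit."""
        assert len(values) <= self.physical_size
        while len(self.resident) + len(values) > self.physical_size:
            self.backing_store.append(self.resident.pop(0))
        self.resident.extend(values)
        self.frames.append(len(values))

    def ret(self):
        """Release the current frame and refill the caller's frame if needed."""
        size = self.frames.pop()
        del self.resident[len(self.resident) - size:]
        needed = self.frames[-1] if self.frames else 0
        while len(self.resident) < needed:
            # The most recently spilled register is restored first.
            self.resident.insert(0, self.backing_store.pop())

    def current_frame(self):
        size = self.frames[-1]
        return self.resident[len(self.resident) - size:]


rse = RegisterStackEngine(physical_size=8)
rse.call([1, 2, 3, 4, 5])         # procedure A
rse.call([6, 7, 8, 9, 10])        # procedure B does not fit: A is partly spilled
print(rse.backing_store)
rse.ret()                         # back in A: the spilled registers are refilled
print(rse.current_frame())
```

Neither procedure contains any save or restore code; the spilling and filling happen outside the program, which is the sense in which the register supply appears unlimited.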

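[Figure: the register stack engine. gr0-gr31 are the global registers; gr32-gr127 are the stacked registers holding the procedure frame; the RSE moves stacked registers to and from memory.]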


Floating point and multimedia

IA-64 provides high floating-point performance with full IEEE floating-point support for single, double, and double-extended formats.

Special support is also provided for multimedia, or data-parallel, applications:

• integer data and SIMD computations, similar to the MMX[tm] technology.

• floating-point data and SIMD-FP computations, similar to IA-32 Streaming SIMD Extensions.

These floating-point features help improve IA-64 floating-point performance:

• 128 floating-point registers.

• A multiply and accumulate instruction (fma), with four different floating-point registers for operands (f=a * b + c). This instruction enables performing a multiply and add in the same number of cycles as one add or multiply instruction.

• Load and store to and from memory. You can also load from memory into two floating-point registers.

• Data transfer between floating-point and general registers.

• A status register with multiple status fields, which enables speculation on floating-point operations.

• Quick conversion from integer to floating-point and vice-versa.

• Rotating floating-point registers.


IA-64 provides 128 82-bit floating-point registers. However, the floating-point data type is 80 bits.

Intermediate computation values can contain 82 bits. This enables software divide and square root computation, comparable to hardware functions, while taking advantage of wide machines. These fast software divides and square roots result in valid 80-bit IEEE values.


Integer multimedia is provided by defining a set of instructions which treat the general registers as 8x8, 4x16, or 2x32 bit elements, and by providing specific instructions for operating on these data elements. IA-64 multimedia support is semantically compatible with the MMX[tm] Technology. The following types of instructions are provided:

• Addition and subtraction (including 3 forms of saturating arithmetic)

• Multiplication

• Left shift, signed and unsigned right shift

• Pack and unpack to convert between different element sizes.

Floating-point multimedia is provided through a set of instructions which treat the floating-point registers as 2x32 bit elements.

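[Figure: parallel add on a general register (bits 63-0). Operands a3 a2 a1 a0 and b3 b2 b1 b0 produce a3+b3 a2+b2 a1+b1 a0+b0.]

[Figure: floating-point register. Sign in bit 81, exponent in bits 80-64, significand in bits 63-0.]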


For floating-point multimedia operations the floating-point register is divided as shown in the graphic below.

IA-64 provides four separate status fields (sf0-sf3) enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.

The FPSR contains the four status fields and a traps field that enable masking the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits which must be 0.
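The field layout can be made concrete with a short Python decoder. The positions used here are an assumption read from the register figure (a 6-bit traps field in the low bits, then four 13-bit status fields sf0 to sf3, then 6 reserved bits at the top); the function name is hypothetical.

```python
def decode_fpsr(fpsr):
    """Split a 64-bit FPSR value into its fields (assumed layout, see above)."""
    fields = {"traps": fpsr & 0x3F}                         # 6 bits
    for i in range(4):
        fields[f"sf{i}"] = (fpsr >> (6 + 13 * i)) & 0x1FFF  # 13 bits each
    fields["reserved"] = (fpsr >> 58) & 0x3F                # 6 bits, must be 0
    return fields


value = 0x3F | (0x1 << 6) | (0x2 << 19) | (0x3 << 32) | (0x1FFF << 45)
print(decode_fpsr(value))
```

The widths account for the whole register: 6 + 4 x 13 + 6 = 64 bits.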

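[Figure: floating-point register used for multimedia operations. Sign in bit 81, exponent in bits 80-64, one single-precision FP value in bits 63-32 and one in bits 31-0.]

[Figure: Floating-point status register (bits 63-0). Fields: traps (6 bits), sf0, sf1, sf2 and sf3 (13 bits each), and 6 reserved bits.]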


Multimedia instructions

Multimedia instructions treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. They operate on each element independently and in parallel. The elements are always aligned on their natural boundaries within a general register. Most multimedia instructions are defined to operate on multiple element sizes. Three classes of multimedia instructions are defined: arithmetic, shift and data arrangement.
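Element-wise operation on a general register can be illustrated with a Python model of a parallel add. This is a sketch of the idea, using modular (wrap-around) arithmetic only; the function name is hypothetical and saturating forms are not modelled.

```python
def parallel_add(a, b, width):
    """Add two 64-bit values as independent elements of `width` bits each."""
    lanes = 64 // width
    mask = (1 << width) - 1
    result = 0
    for i in range(lanes):
        shift = i * width
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Keep only `width` bits so a carry never spills into the next element.
        result |= (lane_sum & mask) << shift
    return result


a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
print(hex(parallel_add(a, b, 16)))          # four 16-bit elements
print(hex((a + b) & 0xFFFF_FFFF_FFFF_FFFF)) # ordinary 64-bit add, for comparison
```

In the lowest element 0xFFFF + 1 wraps to 0 and the neighbouring element is unaffected, whereas the ordinary 64-bit add carries into it.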

Processor Abstraction Layer (PAL)

IA-64 firmware consists of three major components:

• Processor Abstraction Layer (PAL)

• System Abstraction Layer (SAL)

• Extensible Firmware Interface (EFI) layer

PAL provides a consistent firmware interface to abstract processor implementation-specific features.

The System Abstraction Layer (SAL) is a firmware layer which isolates the operating system and other higher level software from implementation differences in the platform, while PAL is the firmware layer that abstracts the processor implementation.

[Figure: IA-64 firmware layers. From top to bottom: operating system software, Extensible Firmware Interface (EFI), Platform/System Abstraction Layer (SAL), Processor Abstraction Layer (PAL), processor (hardware) and platform (hardware). Labels on the connections: OS boot handoff; EFI procedure calls; OS boot selection; SAL procedure calls; transfers to OS entry points for hardware events; access to platform resources; PAL procedure calls; transfers to SAL entry points; non-performance-critical hardware events, e.g. reset, machine checks; performance-critical hardware events, e.g. interrupts; instruction execution; interrupts, traps and faults.]


Interrupts

Interrupts are events that occur during IA-32 or IA-64 instruction processing, causing the flow control to be passed to an interrupt handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interrupt processing, a return from interrupt (rfi) is executed which restores the saved processor state. Execution then proceeds with the interrupted IA-32 or IA-64 instruction.

From the viewpoint of response to interrupts, the processor behaves as if it were not pipelined. That is, it behaves as if a single IA-64 instruction (along with its template) is fetched and then executed; or as if a single IA-32 instruction is fetched and then executed. Any interrupt conditions raised by the execution of an instruction are handled at execution time, in sequential instruction order. If there are no interrupts, the next IA-64 instruction and its template, or the next IA-32 instruction, are fetched.

Interrupt definitions

Depending on how an interrupt is serviced, interrupts are divided into: IVA-based interrupts and PAL-based interrupts.

• IVA-based interrupts are serviced by the operating system. IVA-based interrupts are vectored to the Interrupt Vector Table (IVT) pointed to by CR2, the IVA control register.

• PAL-based interrupts are serviced by PAL firmware, system firmware, and possibly the operating system. PAL-based interrupts are vectored through a set of hardware entry points directly into PAL firmware.

Interrupts are divided into four types: Aborts, Interrupts, Faults, and Traps.

Aborts

A processor has detected a Machine Check (internal malfunction), or a processor reset. Aborts can be either synchronous or asynchronous with respect to the instruction stream. The abort may cause the processor to suspend the instruction stream at an unpredictable location with partially updated register or memory state. Aborts are PAL-based interrupts.


Machine Checks (MCA)

A processor has detected a hardware error which requires immediate action. Based on the type and severity of the error the processor may be able to recover from the error and continue execution. The PALE_CHECK entry point is entered to attempt to correct the error.

Processor Reset (RESET)

A processor has been powered-on or a reset request has been sent to it. The PALE_RESET entry point is entered to perform processor and system self-test and initialization.

External device Interrupts

An external or independent entity (e.g. an I/O device, a timer event, or another processor) requires attention. Interrupts are asynchronous with respect to the instruction stream. All previous IA-32 and IA-64 instructions appear to have completed. The current and subsequent instructions have no effect on machine state. Interrupts are divided into Initialization interrupts, Platform Management interrupts, and External interrupts. Initialization and Platform Management interrupts are PAL-based interrupts; external interrupts are IVA-based interrupts.

Initialization Interrupts (INIT)

A processor has received an initialization request. The PALE_INIT entry point is entered and the processor is placed in a known state.

Platform Management Interrupts (PMI)

A platform management request to perform functions such as platform error handling, memory scrubbing, or power management has been received by a processor. The PALE_PMI entry point is entered to service the request. Program execution may be resumed at the point of interrupt. PMIs are distinguished by unique vector numbers. Vectors 0 through 3 are available for platform firmware use and are present on every processor model. Vectors 4 and above are reserved for processor firmware use. The size of the vector space is model specific.


External Interrupts (INT)

A processor has received a request to perform a service on behalf of the operating system. Typically these requests come from I/O devices, although the requests could come from any processor in the system including itself. The External Interrupt vector is entered to handle the request. External Interrupts are distinguished by unique vector numbers in the range 0, 2, and 16 through 255. These vector numbers are used to prioritize external interrupts. Two special cases of External Interrupts are Non-Maskable Interrupts and External Controller Interrupts.

Non-Maskable Interrupts (NMI)

Non-Maskable Interrupts are used to request critical operating system services. NMIs are assigned external interrupt vector number 2.

External Controller Interrupts (ExtINT)

External Controller Interrupts are used to service Intel 8259A-compatible external interrupt controllers. ExtINTs are assigned locally within the processor to external interrupt vector number 0.
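The vector number assignments given above can be summarised in a few lines of Python. This is a sketch for orientation only; the function name and the returned strings are hypothetical.

```python
def classify_external_vector(vector):
    """Name the kind of external interrupt that a vector number denotes."""
    if vector == 0:
        return "ExtINT"     # external controller interrupt
    if vector == 2:
        return "NMI"        # non-maskable interrupt
    if 16 <= vector <= 255:
        return "INT"        # ordinary external interrupt
    return "not an external interrupt vector"


for v in (0, 2, 5, 16, 255):
    print(v, classify_external_vector(v))
```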

Faults

The current IA-64 or IA-32 instruction requests an action which cannot or should not be carried out, or system intervention is required before the instruction is executed. Faults are synchronous with respect to the instruction stream. The processor completes state changes that have occurred in instructions prior to the faulting instruction. The faulting and subsequent instructions have no effect on machine state. Faults are IVA-based interrupts.


Traps

The IA-32 or IA-64 instruction just executed requires system intervention. Traps are synchronous with respect to the instruction stream. The trapping instruction and all previous instructions are completed. Subsequent instructions have no effect on machine state. Traps are IVA-based interrupts.

Figure: classification of IA-64 interruptions into Aborts (RESET, MCA), Interrupts (INIT, PMI, INT (NMI, ExtINT, ...)), Faults, and Traps, each marked as either a PAL-based or an IVA-based interrupt.

Interrupt programming model

When an interrupt event occurs, hardware saves the minimum processor state required to enable software to resolve the event and continue. The state saved by hardware is held in a set of interrupt resources, and together with the interrupt vector gives software enough information to either resolve the cause of the interrupt, or surface the event to a higher level of the operating system. Software has complete control over the structure of the information communicated, and the conventions between the low-level handlers and the high-level code. Such a scheme allows software rather than hardware to dictate how to best optimize performance for each of the interrupts in its environment. The same basic mechanisms are used in all interrupts to support efficient IA-64 low-level fault handlers for events such as a TLB fault, speculation fault, or a key miss fault.

On an interrupt, the state of the processor is saved to allow an IA-64 software handler to resolve the interrupt with minimal bookkeeping or overhead. The banked general registers provide an immediate set of scratch registers to begin work. For low-level handlers (e.g. TLB miss) software need not open up register space by spilling registers to either memory or control registers.

Upon an interrupt, asynchronous events such as external interrupt delivery are disabled automatically by hardware to allow IA-64 software to either handle the interrupt immediately or to safely unload the interrupt resources and save them to memory. Software will either deal with the cause of the interrupt and return with rfi to the point of the interrupt, or it will establish a new environment and spill processor state to memory to prepare for a call to higher-level code. Once enough state has been saved (such as the IIP, IPSR, and the interrupt resources needed to resolve the fault) the low-level code can re-enable interrupts by restoring the PSR.ic bit and then the PSR.i bit. Since there is only one set of interrupt resources, software must save any interrupt resource state the operating system may require prior to unmasking interrupts or performing an operation that may raise a synchronous interrupt (such as a memory reference that may cause a TLB miss).

PSR.ic Interrupt state collection bit

The PSR.ic (interrupt state collection) bit supports an efficient nested interrupt model. Under normal circumstances the PSR.ic bit is enabled. When an interrupt event occurs, the various interrupt resources are overwritten with information pertaining to the current event. Prior to saving the current set of interrupt resources, it is often advantageous in a miss handler to perform a virtual reference to an area which may not have a translation. To prevent the current set of resources from being overwritten on a nested fault, the PSR.ic bit is cleared on any interrupt. This will suppress the writing of critical interrupt resources if another interrupt occurs while the PSR.ic bit is cleared. If a data TLB miss occurs while the PSR.ic bit is zero, then hardware will vector to the Data Nested TLB fault handler.
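The effect of PSR.ic on a nested miss can be illustrated with a toy model. The sketch below is written in Python purely for illustration; the class, its attributes, and the handler names are hypothetical stand-ins, and a single `resources` field takes the place of the real interrupt resources such as IIP and IPSR.

```python
class ToyProcessor:
    """Toy model of the PSR.ic behaviour described above (not an IA-64 simulator)."""

    def __init__(self):
        self.psr_ic = 1        # interrupt state collection is normally enabled
        self.resources = None  # hypothetical stand-in for IIP, IPSR, and so on
        self.entered = None    # name of the handler that was entered last

    def interrupt(self, event, vector):
        if self.psr_ic:
            # Collection is on: record the event, then hardware clears PSR.ic.
            self.resources = event
            self.psr_ic = 0
            self.entered = vector
        elif vector == "Data TLB":
            # Nested data TLB miss: resources stay intact, a different vector is used.
            self.entered = "Data Nested TLB"
        else:
            # Any other nested interrupt: resources are still not overwritten.
            self.entered = vector

    def restore_ic(self):
        # Software sets PSR.ic again once it has saved what it needs.
        self.psr_ic = 1


cpu = ToyProcessor()
cpu.interrupt("miss at address A", "Data TLB")
cpu.interrupt("miss at address B", "Data TLB")   # nested, PSR.ic is 0
print(cpu.entered, "|", cpu.resources)
```

In the model, the second miss leaves the state recorded for the first miss untouched, which is the property a low-level miss handler relies on.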

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

© Copyright IBM Corp. 2000 Unit . -1

Draft Version for Review October 15, 2000 12:13 pm Instructor Guidepower_hardware_overview.fm

Unit 3. Power Hardware Overview

Objectives

The objectives for this lesson are:

• Provide an overview of the e-server p series systems and their processors.

• List the registers available to the program and describe the internal use.

Power Hardware Overview

e-server p-series or RS/6000 introduction

This section introduces the RS/6000, giving a brief history of the products, an overview of the RS/6000 design, and a description of key RS/6000 technologies.

The RS/6000 family combines the benefits of UNIX computing with IBM's leading-edge RISC technology in a broad product line: from powerful desktop workstations ideal for mechanical design, to workgroup servers for departments and small businesses, to enterprise servers for medium to large companies for ERP and server consolidation applications, up to massively parallel RS/6000 SP systems that can handle demanding scientific and technical computing, business intelligence, and Web serving tasks. Along with AIX, IBM's award-winning UNIX operating system, and HACMP, the leading high-availability clustering solution, the RS/6000 platform provides the power to create change and has the flexibility to manage it with a wide variety of applications that provide real value.

RS/6000 History

The first RS/6000 was announced in February 1990 and shipped in June 1990. Since then, over 1,100,000 systems have shipped to over 132,000 customers.

The next figure summarizes the history of the RS/6000 product line, classified by machine type. For each machine type, the I/O bus architecture and range of processor clock speeds are indicated. The figure shows the following:

• In the past, RS/6000 I/O buses were based on the Micro Channel Architecture (MCA). Today, RS/6000 I/O buses are based on the industry-standard Peripheral Component Interconnect (PCI) architecture.

• Processor speed, one key element of RS/6000 system performance, has increased dramatically over time.

• There have been many machine types over the entire RS/6000 history. In recent years, there has been considerable effort to reduce the complexity of the model offerings without creating gaps in the market coverage.

RS/6000 history

RISC CPU (models 320, 520)

The RISC CPU was the first CPU for the RS/6000 series of systems. The CPU consists of four chips and runs at a speed of 33 MHz. It had outstanding floating-point performance for its time. The CPU was used in the 7012 and 7013 systems, models 320 - 380 and 520 - 580.

RISC II CPU (models 390, 590)

The RISC II has enhanced features over the first RISC design and runs at up to 200 MHz. The CPU was used in the 7012 and 7013 systems, models 390 and 590.

Figure: RS/6000 product line history by machine type, on a timeline from 1990 to 2000 (today):

• 7017 (125 to 450 MHz): PCI Enterprise Servers
• 7025 (166 to 500 MHz): PCI Workgroup Servers - Deskside Systems
• 7026 (166 to 500 MHz): PCI Workgroup Servers - Rack Systems
• 7043 (166 to 375 MHz): PCI Workstations & Workgroup Servers
• 7024 (100 to 233 MHz): PCI Deskside Systems
• 7012 (20 to 200 MHz): Micro Channel Desktop Systems
• 7009 (80 to 120 MHz): Micro Channel Compact Servers
• 7013 (20 to 200 MHz): Micro Channel Deskside Systems
• 7006 (80 to 120 MHz): Micro Channel Entry Desktops
• 7248 (100 to 133 MHz): PCI Workstations
• 7011 (33 to 80 MHz): Micro Channel Workstations
• SP1, SP2, SP: All Node Types
• 7015 (25 to 200 MHz): Micro Channel Rack Systems
• 7044 (333 to 400 MHz): PCI Workstations & Workgroup Servers
• 7046 (375 MHz): PCI Workgroup Servers - Rack Systems

PowerPC and Power2 CPU family

PowerPC CPUs started as a joint effort between Motorola, Apple, and IBM. The family consists of PowerPC, PPC601, PPC604, and PPC604e. These CPUs are very close to those produced by Motorola and used in Apple systems. Currently the PPC604e CPU is used in models F50, B50, and 43P.

Power3 and Power3-II CPUs

The POWER3 microprocessor introduces a new generation of 64-bit processors especially designed for high performance and visual computing applications. POWER3 processors replace the POWER2 and the POWER2 Super Chips (P2SC) in high-end RS/6000 workstations and SP nodes. The RS/6000 44P 7044 Model 270 workstation features the POWER3-II microprocessor, as do the POWER3-II based SP nodes.

The POWER3 implementation of the PowerPC architecture provides significant enhancements compared to the POWER2 architecture. The SMP-capable POWER3 design allows for concurrent operation of fixed-point instructions, load/store instructions, branch instructions, and floating-point instructions. Compared to the P2SC, which reaches its design limits at a clock frequency of 160 MHz, POWER3 is targeting up to 600 MHz by exploiting more advanced chip manufacturing processes, such as copper technology. The first POWER3-based system, the RS/6000 43P 7043 Model 260, runs at 200 MHz, as do the POWER3 wide and thin nodes for the SP.

Features of the POWER3 that exceed its predecessor (P2SC) include:

• A second load-store unit
• Improved memory access speed
• Speculative execution

Figure: POWER3 block diagram. A branch/dispatch unit (branch history table with 2048 entries, branch target cache with 256 entries) feeds two floating-point units (FPU1, FPU2), three fixed-point units (FXU1, FXU2, FXU3), and two load/store units (LS1, LS2). The instruction cache (IU, 32 KB, 128-way) and the data cache (DU, 64 KB, 128-way) each have a memory management unit. The bus interface unit (BIU), with L2 control and clock, connects to the direct-mapped L2 cache (1 - 16 MB) over a 32-byte path at 200 MHz (6.4 GB/s) and to the 6XX bus over a 16-byte path at 100 MHz (1.6 GB/s). CPU registers: 32 x 64-bit integer (fixed point) and 32 x 64-bit FP (floating point). Register buffers for register renaming: 24 FP, 16 integer.

RS64 and RS64 II CPUs

The RS64 microprocessor, based on the PowerPC Architecture, was designed for leading-edge performance in OLTP, e-business, BI, server consolidation, SAP, Notesbench, and Web serving for the commercial and server markets. It is the basis for at least four generations of RS/6000 and AS/400 enterprise server offerings.

The RS64 processor focuses on commercial performance, with emphasis on conditional branches with a zero or one cycle incorrect branch predict penalty. It contains 64 KB L1 instruction and data caches, and has one cycle load support, four superscalar fixed-point pipelines, and one floating-point pipeline. There is an on-board bus interface unit (BIU) that controls both the 32 MB L2 bus interface and the memory bus interface.

RS64 and RS64 II are defined by the following specifications:

• 125 MHz RS64/262 MHz RS64 II on the RS/6000 Model S70
• 262 MHz RS64 II on the RS/6000 Model S70 Advanced
• 340 MHz RS64 II on the RS/6000 Model H70
• 64 KB on-chip L1 instruction cache
• 64 KB on-chip four-way set associative data cache
• 32 MB L2 cache
• Superscalar design with integrated integer, floating-point, and branch units
• Support for up to 64-way SMP configurations (currently 12-way)
• 128-bit data bus
• 64-bit real memory addressing
• Real memory support for up to one terabyte (2^40 bytes)
• CMOS 6S2 using a 162 mm² die, 12.5 million transistors

Figure: RS64 block diagram. A branch/dispatch unit feeds a simple/complex fixed-point unit, a simple fixed-point unit, a floating-point unit, and a load/store unit. The instruction cache (IU) and the data cache (DU) each have a memory management unit. The bus interface unit (BIU), with L2 control and clock, connects to the L2 cache (1 - 32 MB) and to the 6XX bus; the data paths in the diagram are labeled 32 bytes and 16 bytes wide.

RS64 III

The RS64 III processor is designed to perform applications that place heavy demands on system memory. The RS64 III architecture addresses both the need for very large working sets and low latency. Latency is measured by the number of CPU cycles that elapse before requested data or instructions can be utilized by the processor.

The RS64 III processors combine IBM advanced copper chip technology with a redesign of critical timing paths on the chip to achieve greater throughput. The L1 instruction and data caches have been doubled to 128 KB each. New circuit design techniques were used to maintain the one cycle load-to-use latency for the L1 data cache.

L2 cache performance on the RS64 III processor has been significantly improved. Each processor has an on-chip L2 cache controller and an on-chip directory of L2 cache contents. The cache is four-way set associative. This means that directory information for all four sets is accessed in parallel. Greater associativity results in more cache hits and lower latency, which improves commercial performance.

Using a technique called Double Data Rate (DDR), the new 8 MB SRAM used for L2 is capable of transferring data twice during each clock cycle. The L2 interface is 32 bytes wide and runs at 225 MHz (half processor speed), but, because of the use of DDR, it provides 14.4 GBps of throughput.
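The 14.4 GBps figure follows from the interface width, the clock rate, and the two transfers per cycle that DDR allows. A minimal Python sketch of the arithmetic, with a hypothetical helper name and decimal units (1 GB = 1000 MB) assumed:

```python
def bus_bandwidth_gbps(width_bytes, clock_mhz, transfers_per_cycle=1):
    """Peak bandwidth in GB/s: bytes per transfer x million cycles per second x transfers per cycle."""
    megabytes_per_second = width_bytes * clock_mhz * transfers_per_cycle
    return megabytes_per_second / 1000


# RS64 III L2 interface: 32 bytes wide, 225 MHz, two transfers per cycle (DDR)
print(bus_bandwidth_gbps(32, 225, transfers_per_cycle=2))
```

Without DDR the same interface would peak at half that rate, 7.2 GB/s.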

In summary, the RS64 III features include:

• 128 KB on-chip L1 instruction cache
• 128 KB on-chip L1 data cache with one cycle load-to-use latency
• On-chip L2 cache directory that supports up to 8 MB of off-chip L2 SRAM memory
• 14.4 GBps L2 cache bandwidth
• 32 byte on-chip data buses
• 4-way superscalar design
• Five stage deep pipeline
• The Model S80 uses the 450 MHz RS64 III 64-bit copper-chip technology
• The Model M80 uses the 500 MHz RS64 III 64-bit copper-chip technology
• The Model F80 and the H80 use 450 or 500 MHz RS64 III 64-bit copper-chip technology

Power4 or Gigaprocessor Copper SOI CPU

POWER4 is a new processor initiative from IBM. It is comprised of two 64-bit 1 GHz five issue superscalar cores that have a triple level cache hierarchy. It has a 10 GBps main memory interface with a 45 GBps multiprocessor interface. IBM is utilizing the 0.18 micron copper silicon-on-insulator technology in its manufacture. The targeted market is the Enterprise Server or servers in e-business. It is currently in the design stage.

System Bus information

All current systems in the RS/6000 family are equipped with PCI buses. The PCI architecture provides an industry-standard specification and protocol that allows multiple adapters access to system resources through a set of adapter slots.

Each PCI bus has a limit on the number of slots (adapters) it can support. Typically, this can range from two to six. To overcome this limit, the system design can implement multiple PCI buses. Two different methods can be used to add PCI buses in a system:

• Secondary PCI bus: The simplest method to add PCI slots when designing a system is to add a secondary PCI bus. This bus is bridged onto a primary bus using a PCI-to-PCI bridge chip.

• Multiple primary PCI buses: Another method of providing more PCI slots is to design the system with two or more primary PCI buses. This design requires a more sophisticated I/O interface with the system memory.

Power CPU Overview

32-bit hardware characteristics

32-bit Power and PowerPC processors all have the following features in common:

User registers

• 32 general-purpose integer registers, each 32 bits wide (GPRs)
• 32 floating-point registers, each 64 bits wide (FPRs)
• A 32-bit Condition Register (CR)
• A 32-bit Link Register (LR)
• A 32-bit Count Register (CTR)

System registers

• 16 Segment Registers (SRs)
• A Machine State Register (MSR)
• A Data Address Register (DAR)
• Two Save and Restore Registers (SRRs)
• 4 special purpose (SPRG) registers (PowerPC only)

All instructions are 32 bits long. The Data Address Register contains the memory address that caused the last memory-related exception.

SRRs are used to save information when an interrupt occurs:

• SRR0 points to the instruction that was running when the interrupt occurred

• SRR1 contains the contents of the MSR when the interrupt occurred

SPRGs are used for general operating system purposes that require per-processor temporary storage. They provide fast state saves and support for multiprocessing environments.

General purpose registers

General Purpose Registers (GPRs), often just called Rs, are used for loads, stores, and integer calculations. No memory-to-memory operations are provided; data always has to go through registers.

Condition register

The condition register (CR) contains bits set by the results of compare instructions. It is treated as eight 4-bit registers. The bits are used to test for less-than, greater-than, equal, and overflow conditions.
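Splitting the 32-bit CR into its eight 4-bit fields can be sketched as follows. This Python fragment is an illustration only. It assumes the usual PowerPC convention, which the text above does not spell out, that CR0 occupies the most significant four bits and that the bits within a field are ordered less-than, greater-than, equal, overflow. The function name and dictionary keys are hypothetical.

```python
def cr_field(cr: int, n: int) -> dict:
    """Return the four condition bits of field CRn from a 32-bit CR value."""
    if not 0 <= n <= 7:
        raise ValueError("the CR has fields CR0 through CR7")
    # Assumption: CR0 is the most significant nibble, CR7 the least significant.
    nibble = (cr >> (28 - 4 * n)) & 0xF
    return {
        "LT": bool(nibble & 0x8),
        "GT": bool(nibble & 0x4),
        "EQ": bool(nibble & 0x2),
        "OV": bool(nibble & 0x1),
    }


print(cr_field(0x80000000, 0))   # less-than set in CR0
print(cr_field(0x00000002, 7))   # equal set in CR7
```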

Link register

The link register (LR) is set by some branch instructions. It then holds the address of the instruction that follows the branch. It is typically used in subroutine calls to find out where to return to.

Count register

The Count Register (CTR) has two uses:

• It can be decremented, tested, and used to decide whether to take a branch, all from one branch instruction
• It can contain the target address for a branch instruction

Machine state register

The MSR controls many of the current operating characteristics of the processor. Among them are:

• Privilege level (Supervisor vs. Problem, or Kernel vs. User)
• Addressing modes (virtual vs. real)
• Interrupt enabling
• Little-endian vs. big-endian mode

Instruction set

A single instruction generally modifies only one register or one memory location. Exceptions to this are “multiple” and “update” operations.

The format of an instruction is:

• An opcode mnemonic
• An optional set of option bits
• 0, 1, 2, or 3 registers
• 0 or 1 memory locations, expressed as an offset added to/subtracted from a register

The first two may be combined into an “extended mnemonic”. As an example of the memory location format, 24(r3) means the address in r3 + 24. General Purpose Registers are named “r0” - “r31”.

Although most instructions are the same, the mnemonics for POWER and PowerPC are often different. POWER mnemonics are generally simpler and shorter, while PowerPC mnemonics are longer but more explicit. These differences exist because PowerPC was developed with 64-bit in mind. Note: the actual opcodes generated by the assembler for these instructions are identical.
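The offset-plus-register form of a memory operand can be made concrete with a small parser. The Python sketch below assumes the conventional spelling `offset(rN)`, for example `24(r3)`, and a 32-bit processor; the function name and the list used as a register file are hypothetical.

```python
import re

# offset(rN): decimal or 0x-prefixed hexadecimal offset, optionally negative
_OPERAND = re.compile(r"^\s*(-?(?:0x[0-9a-fA-F]+|\d+))\(r(\d+)\)\s*$")


def effective_address(operand: str, gprs: list) -> int:
    """Compute the address named by an operand such as '24(r3)'."""
    match = _OPERAND.match(operand)
    if match is None:
        raise ValueError(f"not an offset(register) operand: {operand!r}")
    text, reg = match.group(1), int(match.group(2))
    if reg > 31:
        raise ValueError("registers are r0 through r31")
    offset = int(text, 16) if "0x" in text else int(text, 10)
    # Simplification: real hardware treats r0 specially as a base register.
    return (gprs[reg] + offset) & 0xFFFFFFFF   # wrap to 32 bits


gprs = [0] * 32
gprs[3] = 0x1000
print(hex(effective_address("24(r3)", gprs)))
```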

Register to register operations

These types of operations will always have at least 2 registers listed, where the first is the target for the result of the instruction, and the others provide the input to the operation.

Immediate operations are shown as a register with an offset. “Immediate” means that a constant value is involved. The value is built right into the instruction.

Examples:

• or r3, r4, r5 # Logical ORs r4 and r5, result into r3
• addi r1, 0x48(r1) # Adds 0x48 to r1, result into r1

Register to memory operations

Register-Memory Operations always have one register and one memory location. The register is always listed first.

The size of the memory location is specified in the opcode:

• b = byte (8 bits)
• h = halfword (16 bits)
• w = word (32 bits)
• d = doubleword (64 bits)

All opcodes beginning with “l” are loads and all opcodes beginning with “st” are stores.

Register to memory operation examples

Examples:

• lwz r31, 4(r30) # Loads 32 bits from address 4+r30 into r31. High 32 bits cleared on 64-bit processor
• std r3, -8(r29) # Stores 64 bits from r3 to address r29 - 8. Invalid operation on 32-bit processor
• lbz r0, 0x27(r1) # Loads 8 bits from address 0x27+r1 into r0. Top 24/56 bits are cleared
• sth r3, 0x56(r1) # Stores low 16 bits from r3 to address 0x56+r1

Notice that the load instructions also have a “z” in their mnemonics. The “z” stands for “zero,” and is intended to make clear that these instructions clear any bits in the target register that were not actually copied from memory.

In case you were wondering, there are load instructions without the “z”. lwa and lha are “algebraic” loads. This means that the value being loaded is sign-extended to fill out the rest of the register. This is used when loading a signed value - if a halfword had a negative value, lhz would make it a positive, but lha would preserve the value’s “negativeness.”
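The difference between a zero-extending and an algebraic (sign-extending) halfword load can be shown with plain integer arithmetic. This Python sketch only mimics the effect on the target register; the function names mirror the mnemonics, and the `as_signed` helper and the 32-bit default register width are assumptions made for the illustration.

```python
def lhz(halfword: int) -> int:
    """Zero-extending load: upper register bits are cleared."""
    return halfword & 0xFFFF


def lha(halfword: int, reg_bits: int = 32) -> int:
    """Algebraic load: bit 15 of the halfword is copied into all upper bits."""
    value = halfword & 0xFFFF
    if value & 0x8000:
        value |= ((1 << reg_bits) - 1) & ~0xFFFF
    return value


def as_signed(reg: int, reg_bits: int = 32) -> int:
    """Interpret a register image as a two's-complement signed number."""
    return reg - (1 << reg_bits) if reg >> (reg_bits - 1) else reg


minus_two = 0xFFFE   # the halfword encoding of -2
print(as_signed(lhz(minus_two)), as_signed(lha(minus_two)))
```

The zero-extended register reads as a large positive number, while the algebraic load keeps the value negative.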

Compare instructions

There are four variations of compare instructions, all beginning with “cmp”. They compare two values:

• Register and register, or
• Register and immediate value (i.e. constant value)

The result of the comparison is placed in the Condition Register (CR), where the various bits that can be set are:

• LT = less than
• GT = greater than
• EQ = equal
• OV = overflow (strictly the summary overflow bit, SO; it is not the carry bit)

Branch instructions

All instructions beginning with a “b” are branches. They change the address for the next instruction to be run.

They have three addressing modes:

• Absolute - goes to an explicit address
• Relative - target address is an offset from current instruction address
• Register - Only two registers can contain a branch target: Count (CTR) and Link (LR)

Branches can be conditional: whether the branch is taken depends upon whether the option bit matches the specified bit in the CR. A branch instruction can specify which CR to use, where CR0 is assumed unless otherwise specified. Extended mnemonics are defined by the assembler to cover most combinations.

The conditional branch instruction is central to any computer architecture. However, most architectures (including POWER and PowerPC) avoid putting comparisons directly into their branch instructions (to keep things simple). They provide compare instructions that set “condition bits.” These bits are what are used on branch instructions to make the actual decision.

The assembler (and crash’s disassembler) provides extended mnemonics that combine a type of branch and the condition register bit that determines whether the branch is taken. Another bit in the branch opcode determines whether the CR bit must be on or off for the branch to take place. This bit is also incorporated into the extended mnemonics (the “not” versions of the branches). For maximum flexibility, the assembler usually also allows you to specify the “not” cases as the logically-opposite case. For example, bnl (branch not less than) can also be written as bge (branch greater than or equal to). Either case is still saying, “branch if the LT bit is turned off.”

Examples:

• blt 0x38c00 # Branches to address 38c00 if LT bit is on in CR0
• bge cr3, <target address> # Branches if LT bit is off in CR3
• bnelr cr7 # Branches to address in LR if EQ bit is off in CR7
• blea cr2, 0x3600 # Branches to absolute address 0x3600 if GT bit is off in CR2
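How an extended mnemonic reduces to "test one CR bit for on or off" can be sketched in a few lines. The Python table below covers only a handful of mnemonics and is an illustration, not the assembler's definition. It assumes that CR0 is the most significant 4-bit field of the CR and that LT, GT, and EQ are the first three bits of each field. All names in the code are hypothetical.

```python
# Extended mnemonic -> (CR bit tested, state required for the branch to be taken)
CONDITIONS = {
    "blt": ("LT", True),  "bge": ("LT", False), "bnl": ("LT", False),
    "bgt": ("GT", True),  "ble": ("GT", False), "bng": ("GT", False),
    "beq": ("EQ", True),  "bne": ("EQ", False),
}
BIT_MASK = {"LT": 0x8, "GT": 0x4, "EQ": 0x2}


def branch_taken(mnemonic: str, cr: int, field: int = 0) -> bool:
    """Decide whether a conditional branch is taken for a given 32-bit CR value."""
    bit, wanted = CONDITIONS[mnemonic]
    nibble = (cr >> (28 - 4 * field)) & 0xF   # assumption: CR0 is the top nibble
    return bool(nibble & BIT_MASK[bit]) == wanted


cr = 0x80000000                      # LT set in CR0, everything else clear
print(branch_taken("blt", cr))       # LT is on in CR0
print(branch_taken("bge", cr))       # same test as bnl
print(branch_taken("bne", cr, 7))    # EQ is off in CR7
```

Because bnl and bge share one table entry, the sketch also shows why the two spellings produce the same decision.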

Trap instructions

Most mnemonics beginning with a “t” are traps, and generate a program exception if the specified condition is met. There are two variations of the trap instruction :

• t or tw - compares two registers, traps if specified comparison is true

• ti or twi - compares register to immediate value instead

“w” mnemonics are the PowerPC indication that these trap instructions are working on 32-bit values. As with branches, there are extended mnemonics defined to provide various traps. In this context ‘lt’, ‘gt’, ‘eq’, etc. have the same meaning as on branch mnemonics.

Examples:

• tweq r3, r4 # Traps if r3 equals r4
• twnei r31, 0 # Traps if r31 is not equal to 0

Trap instructions are the only instructions in this architecture that perform a comparison and take some action, all in one instruction. They do not set or use condition register bits.

Special register operations

The Special Purpose Registers (SPRs) can only be copied to or from GPRs.

• mfspr r3, 8 # Copies SPR 8 into r3
• mtspr 9, r3 # Copies r3 into SPR 9

Extended mnemonics are defined to cover common SPRs:

• mflr r3 # Copies the LR (SPR 8) into r3
• mtctr r3 # Copies r3 into the CTR (SPR 9)

Interrupt vectors

Interrupt vectors are addresses of short sections of code that save the state of the processor and then branch to a handler routine. Some examples are:

• system reset - vector 0x100
• machine check - vector 0x200
• data storage interrupt (DSI) - vector 0x300
• instruction storage interrupt (ISI) - vector 0x400
• external interrupt - vector 0x500
• alignment - vector 0x600
• program (invalid instruction or trap instruction) - vector 0x700
• floating-point unavailable - vector 0x800
• decrementer - vector 0x900
• system call - vector 0xc00

There are some exceptions unique to each type of processor.

64 bit CPU Overview

64-bit hardware characteristics

With full hardware 32-bit binary compatibility as the baseline, the features that characterize a PowerPC processor as 64-bit include:

• 64-bit general registers
• 64-bit instructions for loading and storing 64-bit data operands, and for performing 64-bit arithmetic and logical operations.
• two execution modes: 32-bit and 64-bit. Whereas 32-bit processors have implicitly only one mode of operation, 32-bit execution mode on a 64-bit processor causes instructions and addressing to behave the same as on a 32-bit processor. As a separate mode, 64-bit execution mode creates a true 64-bit environment, with 64-bit addressing and instruction behavior.
• 64-bit physical memory addressing facilities
• additional supervisor instructions, as needed to set up and control the execution mode.

A key feature the PowerPC 64-bit architecture provides is execution mode on a per-process level, helping AIX to create, at the system level, a mixed environment of concurrent 32-bit and 64-bit processes.

A bit in the Machine State Register (MSR) controls 32-bit or 64-bit execution mode:

• Allows support for 32-bit processes on 64-bit hardware
• Allows the kernel to run in 32-bit mode
• Portions of the VMM run in 64-bit mode on 64-bit hardware (to address the large tables that represent large virtual memory)
• 32-bit mode on 64-bit hardware looks exactly like 32-bit hardware (ensures binary compatibility for 32-bit applications)
• 32-bit instructions use only the bottom 32 bits of registers for data or addresses
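The last point can be illustrated with a small model of address generation. This is a sketch only, not processor code: the function name effective_address and the boolean mode flag are hypothetical stand-ins for the MSR mode bit, and the only behavior modeled is that 32-bit mode uses the bottom 32 bits of the result.

```python
MASK_64 = (1 << 64) - 1
MASK_32 = (1 << 32) - 1


def effective_address(base: int, displacement: int, sixty_four_bit_mode: bool) -> int:
    """Model base + displacement addressing in either execution mode.

    The sum is always computed in a 64-bit register; in 32-bit mode only
    the bottom 32 bits are used as the address.
    """
    ea = (base + displacement) & MASK_64
    return ea if sixty_four_bit_mode else ea & MASK_32


if __name__ == "__main__":
    # The same register contents give different addresses in the two modes.
    print(hex(effective_address(0xFFFF_FFFF, 1, sixty_four_bit_mode=True)))
    print(hex(effective_address(0xFFFF_FFFF, 1, sixty_four_bit_mode=False)))
```

In the model, an address that carries past bit 31 wraps to zero in 32-bit mode, which is what a 32-bit application expects.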


Segment table

The 64-bit virtual address space is represented with a segment table, which acts as an in-memory set-associative cache of the 256 most recently used segment number to segment ID mappings. The current segment table is pointed to by the 64-bit Address Space Register (ASR). The ASR has a valid bit that indicates whether a segment table is valid. In 32-bit mode on 64-bit processors, this bit is used to indicate that the segment table is not being used.

IBM "bridge extensions" to the PowerPC 64-bit architecture allow segment register operations to work in 32-bit mode. This allows the kernel to continue to manipulate segment registers: the "bridge extensions" are used to load and store "segment registers" instead of using the segment table.

A Segment Lookaside Buffer (SLB) is used to cache recently used segment number to segment ID mappings. This is similar to the Translation Lookaside Buffer (TLB) used for page to frame translations. The SLB is similar to the segment table but smaller and faster (on chip, not in memory).
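The relationship between the SLB and the segment table can be pictured as a small cache in front of a larger table. The sketch below is a toy model under stated assumptions: the class name SegmentLookup, the four-entry SLB size, and the least-recently-used replacement policy are all illustrative choices, not a description of any particular processor.

```python
from collections import OrderedDict


class SegmentLookup:
    """Toy model: a small on-chip SLB in front of an in-memory segment table."""

    def __init__(self, segment_table: dict, slb_entries: int = 4):
        self.segment_table = segment_table  # segment number -> segment ID
        self.slb = OrderedDict()            # recently used mappings only
        self.slb_entries = slb_entries      # assumed size, for illustration
        self.slb_hits = 0
        self.slb_misses = 0

    def translate(self, segment_number: int) -> int:
        if segment_number in self.slb:
            self.slb_hits += 1
            self.slb.move_to_end(segment_number)  # mark as most recently used
            return self.slb[segment_number]
        # SLB miss: fall back to the slower segment table in memory.
        self.slb_misses += 1
        segment_id = self.segment_table[segment_number]
        self.slb[segment_number] = segment_id
        if len(self.slb) > self.slb_entries:
            self.slb.popitem(last=False)  # evict the least recently used entry
        return segment_id


if __name__ == "__main__":
    lookup = SegmentLookup({1: 0x100, 2: 0x200})
    lookup.translate(1)
    lookup.translate(1)
    print(lookup.slb_hits, lookup.slb_misses)
```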


Unit 4. SMP Hardware Overview

Objectives

The objectives for this lesson are:

• list the three types of multiprocessor design

• describe what is meant by MP safe


SMP Hardware Overview

Symmetric multiprocessing

On uniprocessor systems, bottlenecks exist in the form of the address and data bus, which restricts transfers to one at a time, and the program counter, which forces instructions to be executed in strict sequence. Some performance improvement was achieved by constantly improving the speeds of these uniprocessor machines.

With symmetric multiprocessing, more than one CPU works together.

There are several categories of MP systems, depending on whether the CPUs share resources or have their own resources (like memory, operating system, I/O channels, control units, files and devices), how they are connected (whether in a single machine sharing a single bus or in different machines), and whether all processors are functionally equal or some are specialized.

Types of Multiprocessors:

• Loosely-coupled MP

• Tightly-coupled MP

• Symmetric MP

Loosely coupled MP

Has different systems on a communication link, with the systems functioning independently and communicating when necessary.

The separate systems can access each other’s files and maybe even download tasks to a lightly loaded CPU to achieve some load balancing.

Tightly coupled MP

Uses a single storage shared by the various processors and a single operating system that controls all the processors and system hardware.

Symmetric MP

All of the processors are functionally equivalent and can perform I/O and computation.


Multiprocessor organization

In order to have all CPUs work together, there must be some sort of organization. There are three ways to do that:

• Master/slave multiprocessing organization.

• Separate executives organization.

• Symmetric multi-processing organization.

Master/slave organization

One processor is designated as the master and the others are the slaves. The master is a general purpose processor and performs input/output as well as computation. The slave processors perform only computation.

The processors are considered asymmetric (not equivalent) since only the master can do I/O as well as computation. Utilization of a slave may be poor if the master does not service slave requests efficiently enough.

Another disadvantage is that I/O-bound jobs may not run efficiently, since only the master does I/O.

Separate executives organization

With this organization each processor has its own operating system and responds to interrupts from users running on that processor. A process is assigned to run on a particular processor and runs to completion.

It is possible for some of the processors to remain idle while other processors execute lengthy processes. Some tables are global to the entire system and access to these tables must be carefully controlled. Each processor controls its own dedicated resources, such as files and I/O devices.

Symmetric multiprocessing organization

All of the processors are functionally equivalent and can perform I/O and computation. The operating system manages a pool of identical processors, any one of which may be used to control any I/O device or reference any storage unit. Conflicts between processors attempting to access the same storage at the same time are ordinarily resolved by hardware. Multiple tables in the kernel can be accessed by different processes simultaneously. Conflicts in access to systemwide tables are ordinarily resolved by software. A process may be run at different times by any of the processors and, at any given time, several processors may execute operating system functions in kernel mode.


Multiprocessor definitions

There are two ways of identifying separate processors. You can identify them by:

• the physical CPU number

• the logical CPU number

The lowest number is ‘0’ on Power systems, but ‘1’ on IA-64.

The physical numbers identify all processors on the system, regardless of their state, whereas the logical numbers identify enabled processors only. The Object Data Manager (ODM) names for processors are based on physical numbers with the prefix /proc. The table below illustrates this naming scheme for a three-processor Power system.


ODM name    Physical number    Logical number    Processor state
/proc0      0                  0                 Enabled
/proc1      1                  -                 Disabled
/proc2      2                  1                 Enabled
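The numbering rule in the table can be reproduced with a few lines of code. This is a minimal sketch, assuming only what the lesson states: physical numbers cover every processor, and logical numbers are handed out in order to enabled processors. The function name logical_cpu_numbers and its parameters are hypothetical.

```python
def logical_cpu_numbers(enabled_by_physical: list[bool], first_number: int = 0) -> dict[int, int]:
    """Map physical CPU numbers to logical CPU numbers.

    enabled_by_physical[i] tells whether the i-th processor is enabled.
    first_number is 0 for Power systems and 1 for IA-64, per the lesson.
    Disabled processors get no logical number and are left out of the result.
    """
    mapping = {}
    next_logical = first_number
    for physical, enabled in enumerate(enabled_by_physical, start=first_number):
        if enabled:
            mapping[physical] = next_logical
            next_logical += 1
    return mapping


if __name__ == "__main__":
    # Three-processor Power system from the table: /proc1 is disabled.
    print(logical_cpu_numbers([True, False, True]))
```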


Funneling

In order to run some uniprocessor device drivers unchanged even though they are not ‘thread-safe’ or ‘MP safe’, their execution has to be “funneled” through one specific processor, which is called the MP master. Funneled code runs only on the master processor; therefore, the existing uniprocessor serialization is sufficient.

One processor is known as the default, or master, processor, and this concept is used for funneling. It is not a master processor in the sense of master/slave processing; the term is used only to designate which processor is the default processor. It is defined by the value of MP_MASTER in the <sys/processor.h> file.

Note: Funneling is NOT supported by the 64-bit kernel.

Funneling has the following characteristics :

• Interrupts for a funneled device driver will be routed to the MP master CPU.

• Funneling is intended to support third-party device drivers and low-throughput device drivers.

• The base kernel will provide binary compatibility for these device drivers.

• Funneling only works if all references to the device driver are through the device switch table.

MP safe

MP safe code will run on any processor. It has been modified to prevent resource clashes by adding locking code that serializes its execution.

MP efficient

MP efficient code is MP safe code that also uses data locking mechanisms to serialize access to data rather than to the code as a whole. This makes it easier to spread whatever the code does across the available CPUs.

MP efficient code is intended for high-throughput device drivers.
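The difference between the two styles can be sketched with ordinary user-level threads. This is an analogy only, not kernel code: the class names are hypothetical, and Python threads stand in for processors. The first class serializes all execution behind one lock (the MP safe style); the second locks each data item separately (the MP efficient style), so updates to different items do not wait on each other.

```python
import threading


class CoarseCounters:
    """'MP safe' style: a single lock serializes every update."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self._counts = {name: 0 for name in names}

    def increment(self, name):
        with self._lock:
            self._counts[name] += 1

    def value(self, name):
        with self._lock:
            return self._counts[name]


class FineGrainedCounters:
    """'MP efficient' style: each data item is protected by its own lock."""

    def __init__(self, names):
        self._locks = {name: threading.Lock() for name in names}
        self._counts = {name: 0 for name in names}

    def increment(self, name):
        with self._locks[name]:
            self._counts[name] += 1

    def value(self, name):
        with self._locks[name]:
            return self._counts[name]


def hammer(counters, name, times):
    """Worker: update one counter repeatedly."""
    for _ in range(times):
        counters.increment(name)


if __name__ == "__main__":
    counters = FineGrainedCounters(["disk", "net"])
    workers = [threading.Thread(target=hammer, args=(counters, n, 1000))
               for n in ("disk", "net", "disk")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counters.value("disk"), counters.value("net"))
```

Both classes give correct totals; the design choice is only about how much unrelated work is forced to wait.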


Unit 5. Configuring System Dumps on AIX 5L

This lesson describes how to configure and take system dumps on a node running the AIX5L operating system.

What You Should Be Able to Do

After completing this unit, you should be able to:

• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file


About This Lesson

Purpose

This lesson describes how to configure and take system dumps on a node running the AIX5L operating system.

Objectives

At the completion of this lesson, you will be able to:

• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file

Table of contents

This lesson covers the following topics:

• About This Lesson
• System Dump Facility in AIX5L
• Configuring for System Dumps
• Obtaining a Crash Dump
• Dump Status and completion codes
• dumpcheck utility
• Verify the dump
• Packaging the dump


Estimated length

This lesson takes approximately 1 hour to complete.

Accountability

You will be able to measure your progress with the following:

• Exercises using your lab system.

• Check-point activity

• Lesson review

Reference

Redbooks

Organization of this lesson

This lesson consists of information followed by exercises that allow you to practice what you’ve just learned. Sometimes, as the information is being presented, you are required to do something - pull down a menu, enter a response, etc. This symbol, in the left hand side-head, is an indication that an action is required.


System Dump Facility in AIX5L

Introduction

An AIX5L system can generate a system dump (or crash dump) when it encounters a severe system error, such as an exception in kernel mode that was unexpected or that the kernel cannot handle. A dump can also be initiated by the system administrator when the system has hung.

When an unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel segment 0 as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions. The system dump is a snapshot of the operating system state at the time of the crash or manually initiated dump.

The system dump facility provides a mechanism to capture sufficient information about the AIX5L kernel for later analysis. Once the preserved image is written to disk, the system can be rebooted and returned to production.

Analysis of the dump can then be done by a skilled kernel specialist at a convenient time and location, on another machine away from the production machine.

Process

The process of taking a system dump is illustrated in the following chart. The process involves two stages. In stage one, the contents of memory are copied to a temporary disk location. In stage two, AIX5L is booted and the memory image is moved to a permanent location in the /var/adm/ras directory.


1. AIX5L in production
2. System panics
3. Stage 1: Memory dumper runs
   - memory is copied to the disk location specified in the SWservAt ODM object class
4. System is booted
5. Stage 2: copycore copies the dump into /var/adm/ras
   - copycore is started in rc.boot


Configuring for System Dumps

Introduction

When the operating system is installed, parameters regarding the dump device are configured with default settings. To ensure that a system dump is taken successfully, the system dump parameters need to be configured properly.

The system dump parameters are stored in system configuration objects within the SWservAt ODM object class. Objects within the SWservAt object class define where and how a system dump should be handled.

SWservAt object class

The SWservAt ODM object class is stored in the /etc/objrepos directory. Objects included within the object class are:

name            default             description
tprimary        /dev/hd6            Defines the temporary primary dump device. By default this is the primary paging space logical volume, hd6.
primary         /dev/hd6            Defines the permanent primary dump device. By default this is the primary paging space logical volume, hd6.
tsecondary      /dev/sysdumpnull    Defines the temporary secondary dump device. By default this is the device sysdumpnull.
secondary       /dev/sysdumpnull    Defines the permanent secondary dump device. By default this is the device sysdumpnull.
autocopydump    /var/adm/ras        Defines the directory the dump is copied to at system boot.
forcecopydump   TRUE                TRUE - If the copy to the copy directory fails, the system boot process will bring up a utility to copy the dump to removable media.
enable_dump     FALSE               FALSE - Disables the ability to force a system dump using the dump key sequence or the reset button on systems without a key mode switch.
dump_compress   OFF                 OFF - Specifies that dumps will not be compressed.

Each object can be changed with the use of the sysdumpdev command.


sysdumpdev

The sysdumpdev command changes the settings of SWservAt objects. The command provides you with the ability to:

• Estimate the size of the system dump
• Select the primary and secondary dump devices
• Select the directory the dump will be copied to at boot
• Display information from the previous dump invocation
• Determine if a new system dump exists
• Display current dump settings

Dump Device selection rules

When selecting the primary or secondary dump device the following rules must be observed:

• A mirrored paging space may be used as a dump device.
• Do not use a diskette drive as your dump device.
• If you use a paging device, only use hd6, the primary paging device.


Preparing for a system dump

To ensure that a system dump will be successfully captured, complete the following steps:


Step Action

1. Estimate the size of the dump. This can be done through smit by following the fast path:

# smit dump_estimate

Or, using the sysdumpdev command:

# sysdumpdev -e

(With Compression turned on)

0453-041 Estimated dump size in bytes:11744051

(With Compression turned off)

0453-041 Estimated dump size in bytes:58720256

Using the above example, the dump will require 12MB (with compression on) or 59MB (with compression turned off) of device storage. This value can change based on the activity of the system. It is best to run this command when the machine is under its heaviest workload. Size the dump device at four times the value reported by the sysdumpdev command in order to handle a system dump during peak system activity.

IA-64 Systems - Compression must be turned off to gather a valid system dump. (Errata)

DUMPSPACE requirement for this system:

______MB * 4 = ______MB

Note: On AIX5L a new utility called dumpcheck has been created to monitor the system and verify that, if a system dump were to occur, the resources are properly configured to hold the system dump. The utility is run as a cron job, and is located in the /usr/lib/ras directory. The time at which the command is scheduled to run should be adjusted to when the peak system load is expected. Any warnings will be logged in the error log.
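The sizing rule in step 1 (round the estimate up to whole megabytes, then multiply by four) can be checked with a short calculation. This is only a sketch of the arithmetic in the text: the function name dump_space_mb is hypothetical, and decimal megabytes are assumed because that is how the example turns 58720256 bytes into 59MB.

```python
import math

BYTES_PER_MB = 1_000_000  # decimal MB, matching the rounding used in the example
SAFETY_FACTOR = 4         # "four times the value reported" from step 1


def dump_space_mb(estimated_bytes: int) -> int:
    """Return the recommended dump device size in MB for a sysdumpdev -e estimate."""
    estimate_mb = math.ceil(estimated_bytes / BYTES_PER_MB)
    return estimate_mb * SAFETY_FACTOR


if __name__ == "__main__":
    print(dump_space_mb(58720256))  # estimate with compression turned off
    print(dump_space_mb(11744051))  # estimate with compression turned on
```

For the two estimates shown in step 1 this gives 236MB without compression and 48MB with compression.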


2. Create a primary dump device named dumplv. Calculate the required number of PPs for the dump device. Get the PP size of the volume group by using the lsvg command:

# lsvg rootvg
VOLUME GROUP:   rootvg              VG IDENTIFIER:  db1010a
VG STATE:       active              PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write          TOTAL PPs:      1626 (26016 megabytes)
MAX LVs:        256                 FREE PPs:       1464 (23424 megabytes)
LVs:            11                  USED PPs:       162 (2592 megabytes)
OPEN LVs:       8                   QUORUM:         2
TOTAL PVs:      3                   VG DESCRIPTORS: 3
STALE PVs:      0                   STALE PPs:      0
ACTIVE PVs:     3                   AUTO ON:        yes
MAX PPs per PV: 1016                MAX PVs:        32
LTG size:       128 kilobyte(s)     AUTO SYNC:      no
HOT SPARE:      no

Determine the necessary number of PPs by dividing the estimated size of the dump by the PP size. For example:

236MB (59*4) / 16MB = 14.75 (required number is 15)

Create a logical volume of the required size, for example:

# mklv -y dumplv -t sysdump rootvg 15
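The partition count in step 2 is a round-up division, which is easy to get wrong by truncating. The sketch below reproduces the worked example; the function name required_pps is hypothetical.

```python
import math


def required_pps(dump_space_mb: int, pp_size_mb: int) -> int:
    """Number of physical partitions needed to hold dump_space_mb, rounded up."""
    return math.ceil(dump_space_mb / pp_size_mb)


if __name__ == "__main__":
    pps = required_pps(236, 16)  # 236MB needed, 16MB PP size from lsvg
    print(pps, "partitions,", pps * 16, "MB total")
```

With the values from the example, 14.75 rounds up to 15 partitions, giving a 240MB logical volume.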


3. Verify the size of the device /dev/dumplv. Enter the following command:

# lslv dumplv
LOGICAL VOLUME:     dumplv              VOLUME GROUP:   rootvg
LV IDENTIFIER:      e59bd8              PERMISSION:     read/write
VG STATE:           active/complete     LV STATE:       opened/syncd
TYPE:               dump                WRITE VERIFY:   off
MAX LPs:            512                 PP SIZE:        16 megabyte(s)
COPIES:             1                   SCHED POLICY:   parallel
LPs:                15                  PPs:            15
STALE PPs:          0                   BB POLICY:      relocatable
INTER-POLICY:       minimum             RELOCATABLE:    no
INTRA-POLICY:       middle              UPPER BOUND:    32
MOUNT POINT:        N/A                 LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV?: yes

In this example, the dumplv logical volume contains 15 partitions of 16MB each, giving a total size of 240MB.

4. Assign the primary dump device by using the sysdumpdev command with the -p flag (the -s flag selects the secondary device):

# sysdumpdev -P -p /dev/dumplv
primary              /dev/dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF


5. Create a secondary dump device. The secondary dump device is used to back up the primary dump device. If an error occurs during a system dump to the primary dump device, the system attempts to dump to the secondary device (if it is defined).

Create a logical volume of the required size, for example:

# mklv -y hd7 -t sysdump rootvg 15

6. Assign the secondary dump device by using the sysdumpdev command:

# sysdumpdev -s /dev/hd7 -P
primary              /dev/dumplv
secondary            /dev/hd7
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF
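The settings listing printed by sysdumpdev in steps 4 and 6 is easy to turn into a dictionary when a script needs to verify the configuration. This is a minimal sketch that assumes the layout shown in this lesson (a label of one or more words followed by a single-word value); the function name parse_dump_settings is hypothetical and the sample text is copied from step 6.

```python
SAMPLE_OUTPUT = """\
primary              /dev/dumplv
secondary            /dev/hd7
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF
"""


def parse_dump_settings(text: str) -> dict[str, str]:
    """Split each line into a label and its final whitespace-separated value."""
    settings = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The value is the last word on the line; everything before it is the label.
        label, value = line.rsplit(None, 1)
        settings[label.strip()] = value
    return settings


if __name__ == "__main__":
    settings = parse_dump_settings(SAMPLE_OUTPUT)
    if settings["secondary"] == "/dev/sysdumpnull":
        print("warning: no secondary dump device configured")
    print(settings)
```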


7. Verify the size of the filesystem containing the copy directory is large enough to handle a crash dump. Check the size of the copy directory filesystem with the following command:

# df -k /var
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd9var         32768     31268    5%      143    64% /var

In this example the /var filesystem is 32MB. The size attribute of chfs is given in 512-byte blocks, so 240MB corresponds to 491520 blocks. To increase the size of the /var filesystem to 240MB, use the following command:

# chfs -a size=491520 /var
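The check in step 7 amounts to comparing the Free column of df -k with the expected dump size. The sketch below does that comparison on the sample output from this step. It assumes the column layout shown here, and the function names free_kb_from_df and copy_dir_has_room are hypothetical.

```python
SAMPLE_DF = """\
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd9var         32768     31268    5%      143    64% /var
"""


def free_kb_from_df(df_output: str) -> int:
    """Return the Free column (in 1024-byte blocks) from a one-filesystem df -k listing."""
    data_line = df_output.strip().splitlines()[1]  # line 0 is the header
    return int(data_line.split()[2])               # columns: device, size, free, ...


def copy_dir_has_room(df_output: str, dump_bytes: int) -> bool:
    """True if the filesystem has at least dump_bytes of free space."""
    return free_kb_from_df(df_output) * 1024 >= dump_bytes


if __name__ == "__main__":
    # Uncompressed estimate from step 1.
    print(copy_dir_has_room(SAMPLE_DF, 58720256))
```

With the 32MB /var filesystem shown above, the uncompressed estimate from step 1 does not fit, which is why the filesystem is enlarged.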

Note: The default copy directory is /var/adm/ras. The rc.boot script is coded to check and mount the /var filesystem to support the copy of the system dump out of the dump device. If an alternate location is selected, modification of /sbin/rc.boot may be necessary. You will also be required to update the RAM filesystem with the bosboot command.

Portion of /sbin/rc.boot:

# Mount /var for copycore
echo "rc.boot: executing \"fsck -fp var\"" \
    >>/../tmp/boot_log
fsck -fp /var
echo "rc.boot: executing \"mount /var\"" \
    >>/../tmp/boot_log
mount /var
[ $? -ne 0 ] && loopled 0x518

# retrieve dump
echo "rc.boot: executing \"copycore\"" \
    >>/../tmp/boot_log
copycore
umount /var


8. Configure the force copy flag. If paging space is being used as a dump device, the force copy flag must be set to TRUE. This forces the system boot sequence into menus that allow the dump to be copied to external media if the copy to the copy directory fails. This utility gives you the opportunity to save the crash dump to removable media if the default copy directory is full or unavailable. To set the flag to TRUE, use the following command:

# sysdumpdev -PD /var/adm/ras

9. Configure the allow system dump flag. To enable the reset button or dump key sequence to force a dump sequence with the key in the normal position, or on a machine without a key mode switch, the allow system dump flag must be set to TRUE. To set the flag TRUE, use the following command:

# sysdumpdev -KP

10. Configure the compression flag. To enable compression of the system dump prior to being written to the dump device, the compression flag must be set to ON. To set the flag to ON, use the following command:

# sysdumpdev -CP

IA-64 Systems - Compression must be turned off to gather a valid system dump (Errata):

# sysdumpdev -cP

Note: Turning the compression flag on will cause the dump to be saved in a compressed form on the primary dump device. Also, the copycore utility will generate a compressed vmcore file, vmcore.x.Z.
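Steps 8 through 10 each map one desired setting to one sysdumpdev invocation. The sketch below collects that mapping in one place. It only builds argument lists and does not run anything; the function name sysdumpdev_args and its parameters are hypothetical, and the flags are the ones shown in the steps above.

```python
def sysdumpdev_args(copy_directory=None, always_allow_dump=False, compression=None):
    """Build the sysdumpdev command lines for the settings in steps 8-10.

    copy_directory: directory for a forced copy (step 8), or None to skip.
    always_allow_dump: True to set the allow system dump flag (step 9).
    compression: True for ON, False for OFF (step 10), None to leave unchanged.
    """
    commands = []
    if copy_directory is not None:
        commands.append(["sysdumpdev", "-PD", copy_directory])
    if always_allow_dump:
        commands.append(["sysdumpdev", "-KP"])
    if compression is True:
        commands.append(["sysdumpdev", "-CP"])
    elif compression is False:
        commands.append(["sysdumpdev", "-cP"])
    return commands


if __name__ == "__main__":
    # Example: an IA-64 system, where compression must be turned off.
    for argv in sysdumpdev_args("/var/adm/ras", always_allow_dump=True, compression=False):
        print(" ".join(argv))
```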


11. Configure the system for autorestart. A useful system attribute is autorestart. If autorestart is TRUE, the system will automatically reboot after a crash. This is useful if the machine is physically distant or often unattended. To list the system attributes, use the following command:

# lsattr -El sys0

To set autorestart to TRUE, use SMIT by following the fast path:

# smit chgsys

Or use the command:

# chdev -l sys0 -a autorestart=true


Obtaining a Crash Dump

Introduction

AIX5L has been designed to automatically collect a system crash dump following a system panic. This section discusses the operator controls and procedures that are used to obtain a system dump.

User initiated dumps

Under unattended hang conditions, or for other debugging purposes, the system administrator may use different techniques to force a dump:

• Use the sysdumpstart -p command (primary dump device) or the sysdumpstart -s command (secondary dump device).

• Use the dump key sequence:

PowerPC - Press the Ctrl-Alt-1 key sequence to write the dump information to the primary dump device, or press the Ctrl-Alt-2 key sequence to write the dump information to the secondary dump device.

IA-64 - Press the Ctrl-Alt-NUMPAD1 key sequence to write the dump information to the primary dump device, or press the Ctrl-Alt-NUMPAD2 key sequence to write the dump information to the secondary dump device.

• Start a system dump with the Reset button by doing the following (this procedure works for all system configurations and will work in circumstances where other methods for starting a dump will not):

Step Action

1. Turn the machine’s mode switch to the Service position, or set Always Allow System Dump to TRUE.

2. Press the Reset button. The system writes the dump information to the primary dump device.

Dump Status and completion codes

Progression status codes

A system crash will cause a number of status codes to be displayed. When a system has crashed, the LEDs will display a flashing 888. The system may display the code 0c9 for a short period of time, indicating a system dump is in progress. When the dump is complete, the dump status code will change to 0c0 if the system was able to dump successfully.

If the Low-Level Debugger (LLDB) is enabled, a c20 will appear in the LEDs, and an ASCII terminal connected to the s1 or s2 serial port will show an LLDB screen. Typing quit dump will initiate a dump.

During the dump process, the following progression status codes may be seen on the LED or LCD displays:

LED code      sysdumpdev status   Description

0c0           0                   Dump successful
0c1           -4                  I/O error during dump.
0c4           -2                  Dump device is too small. Partial dump taken.
0c5           -3                  Internal dump error. It shows only when the dump facility itself fails. This does not include the failure of dump component routines.
0c8           -1                  No dump device defined.
0c2           N/A                 User-initiated dump in progress.
0c6           N/A                 User-initiated dump in progress to secondary dump device.
0c9           N/A                 System initiated dump in progress.
0cc           N/A                 Dump process switched to secondary dump device.
Flashing 888  N/A                 System has crashed
102           N/A                 This value indicates an unexpected system halt.
nnn           N/A                 This value is the cause of the system halt (reason code)
000           N/A                 Unexpected system interrupt (hardware related)
2xx           N/A                 Machine check

Error log If the dump was lost or did not save during system boot, the error log can help determine the nature of the problem that caused the dump. To check the error log, use the errpt command.

Create a user initiated dump

Create a test dump by entering the following command:

Step Action

1. # sysdumpstart -p

IA-64 Systems - For a dump that is approximately 120MB in size wait for approximately 15 minutes before shutting down the machine.

2. Reboot the system.
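The dump status codes tabulated earlier in this section lend themselves to a small lookup when LED readings are recorded in operator logs. The following Python sketch copies the 0cx codes and their sysdumpdev status values from that table. The names DUMP_LED_CODES, describe_led and is_completion_code are illustrative, and treating a code as a completion code when it has a sysdumpdev status is an interpretation of the table, not a documented rule.

```python
# LED code -> (sysdumpdev status or None, description), copied from the table.
DUMP_LED_CODES = {
    "0c0": (0, "Dump successful"),
    "0c1": (-4, "I/O error during dump"),
    "0c4": (-2, "Dump device is too small. Partial dump taken"),
    "0c5": (-3, "Internal dump error"),
    "0c8": (-1, "No dump device defined"),
    "0c2": (None, "User-initiated dump in progress"),
    "0c6": (None, "User-initiated dump in progress to secondary dump device"),
    "0c9": (None, "System initiated dump in progress"),
    "0cc": (None, "Dump process switched to secondary dump device"),
}


def describe_led(code):
    """Return the description for a dump LED code, case-insensitively."""
    entry = DUMP_LED_CODES.get(code.lower())
    return entry[1] if entry else "Not a dump status code"


def is_completion_code(code):
    """Assumption: codes that carry a sysdumpdev status mark the end of a dump."""
    entry = DUMP_LED_CODES.get(code.lower())
    return entry is not None and entry[0] is not None


print(describe_led("0C4"))
print(is_completion_code("0c9"))
```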

dumpcheck utility

Description The /usr/lib/ras/dumpcheck utility is used to check the disk resources used by the system dump facility. The command logs an error if either the largest dump device is too small to receive the dump or there is insufficient space in the copy directory when the dump device is a paging space.

Requirements In order to be effective, the dumpcheck utility must be:

• Enabled. To verify that dumpcheck has been enabled, use the following command:

# crontab -l | grep dumpcheck

0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1

To enable the dumpcheck utility, use the -t flag. This will create an entry in the root crontab if none exists. For example, to set the dumpcheck utility to run at 2PM:

# /usr/lib/ras/dumpcheck -t "0 14 * * *"

• Run at a time when the system is heavily loaded, in order to find the maximum size the dump will take. The default time is set for 3PM.

dumpcheck overview

The dumpcheck utility will do the following when enabled:

• Estimate the dump or compressed dump size using sysdumpdev -e
• Find the dump logical volumes and copy directory using sysdumpdev -l
• Estimate the primary and secondary dump device sizes
• Estimate the copy directory free space
• If the dump device is a paging space, dumpcheck will verify that the free space in the copy directory is large enough to copy the dump
• If the dump device is a logical volume, dumpcheck will verify that it is large enough to contain a dump
• If the dump device is a tape, dumpcheck will exit without message.

Any time a problem is found, dumpcheck logs an entry in the error log and, if the -p flag is present, writes a message to stdout. When dumpcheck is run from crontab, this means the message is mailed to the root user.
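The checks listed above can be summarized as a small decision function. The following Python sketch is a simplified model of that logic, not the actual dumpcheck implementation. The function name dumpcheck_problems, the device tuples and the message strings are all hypothetical, and the sizes in the example are arbitrary.

```python
def dumpcheck_problems(estimated_kb, primary, secondary=None, copy_dir_free_kb=0):
    """Model of the dumpcheck decisions.

    primary and secondary are (kind, size_kb) tuples, where kind is one of
    'lv', 'paging' or 'tape'. Returns a list of problem descriptions.
    """
    devices = [dev for dev in (primary, secondary) if dev is not None]
    # A tape dump device ends the check without any message.
    if any(kind == "tape" for kind, _ in devices):
        return []
    problems = []
    largest_kb = max(size for _, size in devices)
    if largest_kb < estimated_kb:
        problems.append("largest dump device is too small")
    # The copy directory only matters when the dump goes to paging space.
    if primary[0] == "paging" and copy_dir_free_kb < estimated_kb:
        problems.append("copy directory is too small")
    return problems


print(dumpcheck_problems(65536, ("lv", 8192)))
print(dumpcheck_problems(65536, ("paging", 131072), copy_dir_free_kb=1024))
```

In this model an empty list means no error log entry would be written.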

Error log entry sample

The following is an example of an error log entry created by the dumpcheck utility because of lack of space in the primary and secondary dump devices:

----------------------------------------------------
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE

Date/Time: Tue Aug 15 09:49:41 CDT
Sequence Number: 45
Machine Id: 000714834C00
Node Id: wcs2
Class: O
Type: PEND
Resource Name: dumpcheck

Description
The largest dump device is too small.

Probable Causes
Neither dump device is large enough to accommodate a system dump at this time.

Recommended Actions
Increase the size of one or both dump devices.

Detail Data
Largest dump device
testdump
Largest dump device size in kb
8192
Current estimated dump size in kb
65536
----------------------------------------------------
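The two sizes in the Detail Data show how much the dump device must grow. The following Python sketch extracts them from the text of such an entry and computes the shortfall. The function name dump_shortfall_kb is illustrative, and the regular expressions assume only that each label is followed by its numeric value, whether on the same line or on the next one.

```python
import re


def dump_shortfall_kb(entry_text):
    """Return how many KB the largest dump device is short, or None if not found."""
    device = re.search(r"Largest dump device size in kb\s*(\d+)", entry_text)
    needed = re.search(r"Current estimated dump size in kb\s*(\d+)", entry_text)
    if not device or not needed:
        return None
    return max(0, int(needed.group(1)) - int(device.group(1)))


detail = """Largest dump device size in kb
8192
Current estimated dump size in kb
65536"""
print(dump_shortfall_kb(detail))  # 65536 - 8192 = 57344
```

For the sample entry, the dump device would need to grow by at least 57344 KB to hold the estimated dump.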

Verify the dump

Description Before submitting a dump to IBM for analysis, it is important to verify that the dump is valid and readable.

Locating the dump

To locate the dump issue the following command:

# sysdumpdev -L

The following output shows a good dump:

0453-039

Device name: /dev/dumplv

Major device number: 10

Minor device number: 2

Size: 8837632 bytes

Uncompressed Size: 32900935 bytes

Date/Time: Fri Sep 22 13:01:41 PDT 2000

Dump status: 0

dump completed successfully

Dump copy filename: /var/adm/ras/vmcore.0.Z

In this case a valid dump was safely saved by the system in the /var/adm/ras directory.

The following case shows the command output when the copy failed. Presumably the dump is available on the external media device, for example, tape.

0453-039

Device name: /dev/dumplv

Major device number: 10

Minor device number: 2

Size: 8837632 bytes

Uncompressed Size: 32900935 bytes

Date/Time: Fri Sep 22 13:01:41 PDT 2000

Dump status: 0

dump completed successfully

0481-195 Failed to copy the dump from /dev/dumplv to /var/adm/ras.

0481-198 Allowed the customer to copy the dump to external media.

Note: A dump saved on Initial Program Load (IPL) to external media is not sufficient for analysis. Additional files are required.
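The difference between the two cases above can be detected by a script. The following Python sketch parses the "label: value" lines of sysdumpdev -L output and reports whether a copy of the dump was saved. The function names and the shortened sample strings are illustrative. Only labels that appear in the output shown above are used.

```python
def parse_sysdumpdev(output):
    """Collect the 'label: value' lines of sysdumpdev -L output into a dict."""
    info = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            info[label.strip()] = value.strip()
    return info


def dump_was_copied(info):
    """A status of 0 plus a copy filename matches the good dump case."""
    return info.get("Dump status") == "0" and "Dump copy filename" in info


GOOD = """Device name: /dev/dumplv
Size: 8837632 bytes
Uncompressed Size: 32900935 bytes
Dump status: 0
Dump copy filename: /var/adm/ras/vmcore.0.Z"""

FAILED_COPY = """Device name: /dev/dumplv
Dump status: 0
0481-195 Failed to copy the dump from /dev/dumplv to /var/adm/ras."""

good = parse_sysdumpdev(GOOD)
print(dump_was_copied(good))
print(dump_was_copied(parse_sysdumpdev(FAILED_COPY)))

# Compressed size as a fraction of the uncompressed size.
ratio = int(good["Size"].split()[0]) / int(good["Uncompressed Size"].split()[0])
print(round(ratio, 3))
```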

Dump analysis tools

To verify the dump is valid, the dump must be examined by a kernel debugger. The kernel debugger used to validate the dump depends on the system architecture. If the system is running on Power PC, the debugger is kdb. The kernel debugger for IA-64 platforms is iadb.

Verifying the dump

The following procedure should be used to verify the dump:

Step Action

1. Locate the crash dump:

# sysdumpdev -L

0453-039

Device name: /dev/dumplv

Major device number: 10

Minor device number: 2

Size: 8837632 bytes

Uncompressed Size: 32900935 bytes

Date/Time: Fri Sep 22 13:01:41 PDT 2000

Dump status: 0

dump completed successfully

Dump copy filename: /var/adm/ras/vmcore.0.Z

2. Change directory to the dump location. In the above example:

# cd /var/adm/ras

3. Decompress the vmcore file if necessary:

# uncompress vmcore.0.Z

4. Start the kernel debugger:

Power PC:

# kdb /var/adm/ras/vmcore.0
The specified kernel file is a UP kernel
vmcore.1 mapped from @ 70000000 to @ 71fdba81
Preserving 880793 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (4096 bytes) allocated
Component Names:
 1) dmp_minimal [5 entries]
. . . .
Dump analysis on CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)
Processing symbol table..........................done
(0)>

IA-64:

# iadb /var/adm/ras/vmcore.0
symbol capture using file: /unix
iadb: Probing a live system, with memfd as :4
Current Context:
 cpu:0x1, thread slot: 77, process Slot: 51, ad space: 0x8e44
 thrd ptr: 0xe00000972a13b000, proc ptr: e00000972a12e000
 mst at:3ff002ff3b400
(1)>

5. Issue the stat subcommand to verify the details of the dump. Ensure the values are consistent with the dump that was taken.

Power PC:

(0)> stat
SYSTEM_CONFIGURATION:
CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. kca41
release... 0
version... 5
machine... 000930134C00
nid....... 0930134C
time of crash: Thu Oct 5 10:37:57 2000
age of system: 3 min., 11 sec.
xmalloc debug: disabled

IA-64:

(1)> stat
SYSTEM_CONFIGURATION:
IA64 machine with 2 cpu(s) (64-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. kca40
hostname.. kca40.hil.sequent.com
release... 0
version... 5
machine... 000000004C00
nid....... 0000004c
current time: Fri Oct 6 12:20:56 2000
age of system: 1 day, 1 hr., 1 min., 43 sec.
xmalloc debug: disabled

6. Exit the kernel debugger:

Power PC:

(0)> q

IA-64:

(1)> q

Packaging the dump

Overview Once a valid dump has been identified, the next step is to package the dump to be sent in for analysis.

Packaging the dump

The following procedure will automatically collect the required files pertaining to the system dump:

Step Action

1. Compress the vmcore file:

# compress /var/adm/ras/vmcore.0

2. Gather all of the files and information regarding the dump using the following command:

# snap -Dkg
Checking space requirement for general information....................................................... done.
Checking space requirement for kernel information.......... done.
Checking space requirement for dump information..... done.
Checking for enough free space in filesystem... done.
********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/dump directory tree... done.
Creating /tmp/ibmsupt/kernel directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt
Gathering general system information........................done.
Gathering kernel system information........... done.
Gathering dump system information...... done.

3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:

# snap -o /dev/rmt0

Once this command completes, the tape can be removed and sent in for analysis. Write protect the tape and label it appropriately.

Packaging a dump stored on external media

A dump saved to external media needs to be gathered with other files to provide a dump which is readable. To gather and pack the files, use the following steps:

Step Action

1. Create a skeleton directory to contain the dump information.

# snap -D

This will fail stating the dump device is no longer valid. Overcome this by restoring the dump from the media used on IPL to save the dump.

2. Restore the dump from external media. For example, a dump saved to the /dev/rmt0 device is restored by commands:

# cd /tmp/ibmsupt/dump
# tar -xvf /dev/rmt0
# mv dump_file dump

3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:

# snap -o /dev/rmt0

Once this command completes, the tape can be removed and sent in for analysis. Write protect the tape and label appropriately

Unit 6. Introduction to Dump Analysis Tools

This lesson describes the different tools that are available to debug a system dump taken from an AIX5L system.

What You Should Be Able to Do

After completing this unit, you should be able to:

• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers

About This Lesson

Purpose This lesson describes the different tools that are available to debug a system dump taken from an AIX5L system.

Prerequisites You should have completed the following lesson:

• Configuring System Dumps on AIX5L

Objectives At the completion of this lesson, you will be able to:

• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers

Table of contents

This lesson covers the following topics:

• About This Lesson
• System Dump Analysis Tools
• dump components
• Dump creation process
• Component dump routines
• bosdebug command
• Memory Overlay Detection System
• System Hang Detection
• truss command
• KDB kernel debugger
• kdb command
• KDB miscellaneous sub commands
• KDB dump/display/decode sub commands
• KDB modify memory sub commands
• KDB trace sub commands
• KDB break point and step sub commands
• KDB name list/symbol sub commands
• KDB watch break point sub commands
• KDB machine status sub commands
• KDB kernel extension loader sub commands
• KDB address translation sub commands
• KDB process/thread sub commands
• KDB Kernel stack sub commands
• KDB LVM sub commands
• KDB SCSI sub commands
• KDB memory allocator sub commands
• KDB file system sub commands
• KDB system table sub commands
• KDB network sub commands
• KDB VMM sub commands
• KDB SMP sub commands
• KDB data and instruction block address translation sub commands
• KDB bat/brat sub commands
• IADB kernel debugger
• iadb command
• IADB break point and step sub commands
• IADB dump/display/decode sub commands
• IADB modify memory sub commands
• IADB name list/symbol sub commands
• IADB watch break point sub commands
• IADB machine status sub commands
• IADB kernel extension loader sub commands
• IADB address translation sub commands
• IADB process/thread sub commands
• IADB LVM sub commands
• IADB SCSI sub commands
• IADB memory allocator sub commands
• IADB file system sub commands
• IADB system table sub commands
• IADB network sub commands
• IADB VMM sub commands
• IADB SMP sub commands
• IADB block address translation sub commands
• IADB bat/brat sub commands
• IADB miscellaneous sub commands
• Exercise

Estimated length

This lesson takes approximately 1.5 hours to complete.

Accountability You will be able to measure your progress with the following:

• Exercises using your lab system.

• Check-point activity

• Lesson review

Reference

• AIX5L docs

Organization of this lesson

This lesson consists of information followed by exercises that allow you to practice what you’ve just learned. Sometimes, as the information is being presented, you are required to do something - pull down a menu, enter a response, etc. This symbol, in the left hand side-head, is an indication that an action is required.

System Dump Analysis Tools

Introduction AIX5L introduces new debugging tools. The main change from previous releases of AIX is that the crash command has been replaced by:

• IADB and KDB kernel debuggers for live systems debugging
• iadb and kdb commands for system image analysis

In addition, the following tools and commands are available to assist you with debugging:

• bosdebug
• Memory Overlay Detection System (MODS)
• System Hang Detection
• truss

Typographic conventions

In the following sections we will use uppercase IADB and KDB when speaking about the live kernel debuggers, and lowercase iadb and kdb when speaking about the commands.

dump components

Introduction In AIX5L, a dump image is not actually a full image of the system memory but a set of memory areas dumped by the dump process.

The Master dump Table

A master dump table entry is a pointer to a function, provided by a kernel extension, that will be called by the kernel dump routine when a system dump occurs. Each of these functions must return a pointer to a component dump table structure. Both the functions and the component dump table entries must reside in pinned global memory. The functions are registered with the kernel using the dmp_add kernel service and unregistered using the dmp_del kernel service. Kernel specific areas are pre-loaded by kernel initialization.

Component dump tables

Component dump tables are structures of type struct cdt. Component dump tables are returned by the dmp registered functions when the dump process starts. Each one is a structure made of:

• a CDT Header
• an array of CDT entries

CDT Header The CDT Header contains:

• A magic number that can be one of:
  • DMP_MAGIC_32 for 32-bit CDT
  • DMP_MAGIC_VR for 32-bit CDT that may contain virtual or real addresses
  • DMP_MAGIC_64 for 64-bit CDT
• the component dump name
• the length of component dump table

CDT entries CDT entries in the component dump tables will be one of cdt_entry64, cdt_entry_vr or cdt_entry32 according to the DMP_MAGIC number, as defined in /usr/include/sys/dump.h.
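The relationship between the CDT header, its magic number and the entry type can be pictured with a small model. The following Python sketch is only an illustration of the layout described above. The class and field names (CdtHeader, CdtEntry, address, length) are hypothetical and do not reproduce the C declarations in /usr/include/sys/dump.h, and the mapping from magic name to entry type is inferred from the names themselves.

```python
from dataclasses import dataclass, field

# Symbolic stand-ins: the real magic values are defined in sys/dump.h.
ENTRY_TYPE_BY_MAGIC = {
    "DMP_MAGIC_32": "cdt_entry32",
    "DMP_MAGIC_VR": "cdt_entry_vr",
    "DMP_MAGIC_64": "cdt_entry64",
}


@dataclass
class CdtHeader:
    magic: str    # one of the DMP_MAGIC_* names
    name: str     # component dump name
    length: int   # length of the component dump table


@dataclass
class CdtEntry:
    name: str     # hypothetical: label for the data area
    address: int  # hypothetical: start of the data area
    length: int   # hypothetical: size of the data area in bytes


@dataclass
class ComponentDumpTable:
    header: CdtHeader
    entries: list = field(default_factory=list)

    def entry_type(self):
        """Name of the entry structure implied by the header's magic number."""
        return ENTRY_TYPE_BY_MAGIC[self.header.magic]


cdt = ComponentDumpTable(CdtHeader("DMP_MAGIC_64", "example", 0))
cdt.entries.append(CdtEntry("example_area", 0x2000, 0x3000))
print(cdt.entry_type())
```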

Dump creation process

Introduction This section will describe the dump process.

Process overview

The following steps will be used to write a dump to the dump device:

Step Action

1 Interrupts are disabled

2 0c9 or 0c2 are written to the LED display, if present

3 Header information about the dump is written to the dump device

4 The kernel steps through each entry in the master dump table, calling each component dump routine twice:
• Once to indicate that the kernel is starting to dump this component (1 is passed as a parameter)
• Again to say that the dump process is complete (2 is passed)
After the first call to a component dump routine, the kernel processes the CDT that was returned. For each CDT entry, the kernel:
• Checks every page in the identified data area to see if it is in memory or paged out
• Builds a bitmap indicating each page's status
• Writes a header, the bitmap, and those pages which are in memory to the dump device

5 Once all dump routines have been called, the kernel enters an infinite loop, displaying 0c0 or flashing 888
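Step 4 of the process can be illustrated with a short simulation. The following Python sketch is a user-level model, not kernel code. The function dump_component, the tuple layout (name, address, length) for CDT entries, the example routine and the 4096-byte page size are all assumptions made for illustration.

```python
def dump_component(routine, resident_pages, page_size=4096):
    """Simulate step 4 for one master dump table entry."""
    cdt = routine(1)  # first call: the kernel is starting to dump this component
    records = []
    for name, address, length in cdt:
        first = address // page_size
        last = (address + length - 1) // page_size
        pages = range(first, last + 1)
        # One bit per page: 1 if the page is in memory, 0 if it is paged out.
        bitmap = [1 if page in resident_pages else 0 for page in pages]
        written = [page for page, bit in zip(pages, bitmap) if bit]
        records.append({"name": name, "bitmap": bitmap, "pages_written": written})
    routine(2)  # second call: the dump of this component is complete
    return records


calls = []


def example_routine(phase):
    """Hypothetical component dump routine that records how it was called."""
    calls.append(phase)
    if phase == 1:
        return [("example_area", 0x2000, 0x3000)]  # spans pages 2, 3 and 4
    return None


records = dump_component(example_routine, resident_pages={2, 4})
print(records)
print(calls)
```

In the example, page 3 is treated as paged out, so only pages 2 and 4 would be written to the dump device after the header and the bitmap.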

Component dump routines

Description Component dump routines:

• When called with a 1:
  • Make any necessary preparations for dumping. For example, they may read device-specific information from an adapter. The FDDI device driver does this.
  • Fill in the component dump table. Most device drivers do this during their initialization.
  • Return the address of the component dump table.
• When called with a 2:
  • Clean up after themselves. In reality, most routines either return immediately, do some debug printfs and then return, or else they ignore the parameter entirely and return the same thing every time.

Note A component dump routine may or may not do a lot of work when called with a 1. Many simply return the address of some previously-initialized CDT, but some (for example, the thread table and process table dump routines) actually build the CDT from scratch.

The original rationale for the second call to each dump routine was to provide notification that the dump process had finished with that component's dump data. In practice, however, no one really cares. The routines that just return an address don't even bother to look at the parameter they were passed. The routines that build the data on the fly look for a 2 and return immediately. The most that any routine today does with this second call is to issue some debug printf call. This is generally used to debug the component dump routine itself, by verifying that the system dump facility was able to successfully process its CDT.
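The two styles of routine described in the note can be contrasted in a few lines. The following Python sketch is illustrative only. The names static_routine and make_table_routine and the tuple layout (name, address, size) are hypothetical stand-ins for real component dump routines and CDT entries.

```python
# Style 1: the table is filled in once, at initialization time.
STATIC_CDT = [("driver_state", 0x1000, 256)]


def static_routine(phase):
    """Ignores the parameter and returns the same table every time."""
    return STATIC_CDT


def make_table_routine(live_table):
    """Style 2: build the CDT from scratch each time a dump starts."""
    def routine(phase):
        if phase == 2:
            return None  # nothing to clean up, so return immediately
        return [("slot%d" % index, address, size)
                for index, (address, size) in enumerate(live_table)]
    return routine


table_routine = make_table_routine([(0x5000, 64), (0x6000, 64)])
print(static_routine(1) is static_routine(2))
print(table_routine(1))
print(table_routine(2))
```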

bosdebug command

Introduction The bosdebug command can be used to enable or disable the MODS feature as well as other kernel debugging parameters.

Any changes made with the bosdebug command will not take effect until the system is rebooted.

bosdebug parameters

The bosdebug command accepts the following parameters:

• -I: Causes the kernel debug program to be loaded and invoked on each subsequent reboot.

• -D: Causes the kernel debug program to be loaded on each subsequent reboot.

• -M: Causes the memory overlay detection system to be enabled. Memory overlays in kernel extensions and device drivers will cause a system crash.

• -s sizelist: Causes the memory overlay detection system to promote each of the specified allocation sizes to a full page, and allocate and hide the next subsequent page after each allocation. This causes references beyond the end of the allocated memory to cause a system crash. sizelist is a list of memory sizes separated by commas. Each size must be in the range from 16 to 2048, and must be a power of 2.

• -S: Causes the memory overlay detection system to promote all allocation sizes to the next higher multiple of page size (4096), but does not hide subsequent pages. This improves the chances that references to freed memory will result in a crash, but it does not detect reads or writes beyond the end of allocated memory until that memory is freed.

• -n sizelist: Has the same effect as the -s option, but works instead for network memory. Each size must be in the range from 32 to 2048, and must be a power of 2. This causes the net_malloc_frag_mask variable of the 'no' command to be turned on during boot.

• -o: Turns off all debugging features of the system.

• -L: Displays the current settings for the kernel debug program and the memory overlay detection system.

• -R on | off: Sets the real-time extensions for multiprocessor systems only.
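The constraints on sizelist are easy to get wrong, so it can help to check a list before passing it to bosdebug. The following Python sketch validates a comma-separated sizelist against the rules stated above and shows the page rounding described for the -S flag. The function names validate_sizelist and promoted_size are illustrative and are not part of bosdebug.

```python
PAGE_SIZE = 4096


def validate_sizelist(sizelist, minimum=16, maximum=2048):
    """Check a sizelist string; use minimum=32 for the -n flag."""
    sizes = [int(item) for item in sizelist.split(",")]
    for size in sizes:
        in_range = minimum <= size <= maximum
        power_of_two = size > 0 and size & (size - 1) == 0
        if not (in_range and power_of_two):
            raise ValueError("invalid allocation size: %d" % size)
    return sizes


def promoted_size(size):
    """Round a size up to a multiple of the page size, as described for -S."""
    return -(-size // PAGE_SIZE) * PAGE_SIZE


print(validate_sizelist("16,256,2048"))
print(promoted_size(100))
print(promoted_size(5000))
```

A size such as 48 is rejected because it is not a power of 2, and 16 is rejected when minimum=32 is used for the -n case.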

Memory Overlay Detection System

Introduction

The Memory Overlay Detection System (MODS) helps detect memory overlay problems in the kernel, kernel extensions, and device drivers. MODS can be enabled using the bosdebug command.

Problems detected

Some of the most difficult types of problems to debug are what are generally called "memory overlays." Memory overlays include the following:

• Writing to memory that is owned by another program or routine
• Writing past the end (or before the beginning) of declared variables or arrays
• Writing past the end (or before the beginning) of dynamically-allocated memory
• Writing to or reading from freed memory
• Freeing memory twice
• Calling memory allocation routines with incorrect parameters or under incorrect conditions.

In the kernel environment (including the kernel, kernel extensions, and device drivers), memory overlay problems have been especially difficult to debug because tools for finding them have not been available. Starting with Version 4.2.1, however, the Memory Overlay Detection System (MODS) helps detect memory overlay problems in the kernel, kernel extensions, and device drivers.

Note: This feature does not detect problems in application code; it only watches kernel and kernel extension code.

When to use MODS

This feature is useful in the following circumstances:

• When you are developing your own kernel extensions or device drivers and want to test them thoroughly.

• When IBM technical support asks you to turn this feature on to help diagnose a problem that you are experiencing.

How MODS works

The primary goal of the MODS feature is to produce a dump file that accurately identifies the problem.

MODS works by turning on additional checking to help detect the conditions listed above. When any of these conditions is detected, your system crashes immediately and produces a dump file that points directly at the offending code. (Previously, a system dump might point to unrelated code that happened to be running later when the invalid situation was finally detected.)

If your system crashes while the MODS is turned on, then MODS has most likely done its job.

To make it easier to detect that this situation has occurred, the IADB/iadb and KDB/kdb commands have been extensively modified. The stat subcommand now displays both:

• Whether the MODS (also called "xmalloc debug") has been turned on
• Whether this crash was the result of the MODS detecting an incorrect situation.

The xmalloc subcommand provides details on exactly what memory address (if any) was involved in the situation, and displays mini-tracebacks for the allocation and/or free of this memory.

Similarly, the netm command displays allocation and free records for memory allocated using the net_malloc kernel service (for example, mbufs, mclusters, etc.).

You can use these commands, as well as standard crash techniques, to determine exactly what went wrong.

MODS limitations

There are limitations to the Memory Overlay Detection System. Although it significantly improves your chances, MODS cannot detect all memory overlays. Also, turning MODS on has a small negative impact on overall system performance and causes somewhat more memory to be used in the kernel and the network memory heaps. If your system is running at full CPU utilization, or if you are already near the maximums for kernel memory usage, turning on the MODS may cause performance degradation and/or system hangs.

Our practical experience with the MODS, however, is that the great majority of customers will be able to use it with minimal impact to their systems.

MODS and kdb

If the system crashes because MODS detected a problem, the kdb xm sub command can display status and traces for the memory overlay problem.

System Hang Detection

Introduction

System hang management allows users to run mission critical applications continually while improving application availability. System hang detection alerts the system administrator of possible problems and then allows the administrator to log in as root or to reboot the system to resolve the problem.

System Hang Detection

All processes (also known as threads) run at a priority. This priority is numerically inverted in the range 40-126: 40 is the highest priority and 126 is the lowest priority. The default priority for all threads is 60. The priority of a process can be lowered by any user with the nice command. Anyone with root authority can also raise a process’s priority.

The kernel scheduler always picks the highest priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high priority threads to completely tie up the machine such that low priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung.

The System Hang Detection (SHD) feature provides a mechanism to detect this situation and allow the system administrator a means to recover. This feature is implemented as a daemon (shdaemon) that runs at the highest process priority. This daemon queries the kernel for the lowest priority thread run over a specified interval. If the priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval. The actions and their defaults are:

Action                     Default Enabled  Default Priority  Default Timeout (Seconds)  Default Device
Log an error in errlog     disabled         60                120
Display a warning message  disabled         60                120                        /dev/console
Give a recovery getty      enabled          60                120                        /dev/tty0
Launch a command           disabled         60                120
Reboot the system          disabled         39                300
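
The detection rule can be illustrated with a small simulation. This is a sketch under stated assumptions, not the shdaemon implementation: the Action class, the should_trigger function, and the sampled priority values are hypothetical. A hang is assumed to mean that, during the whole time-out, no thread ran at a priority lower than or equal to the threshold (that is, numerically greater than or equal to it).

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    enabled: bool
    priority: int  # threshold; a numerically larger value is a lower priority
    timeout: int   # seconds


# Defaults copied from the table of actions.
DEFAULT_ACTIONS = [
    Action("Log an error in errlog", False, 60, 120),
    Action("Display a warning message", False, 60, 120),
    Action("Give a recovery getty", True, 60, 120),
    Action("Launch a command", False, 60, 120),
    Action("Reboot the system", False, 39, 300),
]


def should_trigger(action, samples):
    """Return True if the action would fire.

    samples is a list of (seconds_ago, priority) pairs, one per thread
    dispatch seen by a hypothetical monitor. Samples older than the
    action's time-out are ignored.
    """
    if not action.enabled:
        return False
    recent = [prio for age, prio in samples if age <= action.timeout]
    # Hang: nothing at or below the threshold priority got CPU time.
    # An empty window also counts, because no such thread ran.
    return all(prio < action.priority for prio in recent)


if __name__ == "__main__":
    hogged = [(10, 50), (60, 45), (110, 50)]  # only high-priority threads ran
    for action in DEFAULT_ACTIONS:
        print(f"{action.name}: {should_trigger(action, hogged)}")
```

With the defaults, only the recovery getty fires for the sample above, because it is the only action that is enabled.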

shconf Script

The shconf command is invoked when System Hang Detection is enabled. shconf configures which events are surveyed and what actions are to be taken if such events occur.

The user can specify the five actions described below, the priority level to check, the time-out during which no process or thread executes at a lower or equal priority, and the terminal device for the warning action and the getty action:

• Log an error in the error log file
• Display a warning message on the system console (alphanumeric console) or on a specified TTY
• Reboot the system
• Give a special getty to allow the user to log in as root and launch commands
• Launch a command

For the Launch a command and Give a special getty options, SHD will launch the special getty or the specified command at the highest priority. The special getty will print a warning message specifying that it is a recovering getty running at priority 0. The following table lists the default values when the SHD is enabled. Only one action is enabled per type of detection.

Note: When Launch a recovering getty on a console is enabled, the shconf script adds the -u flag to the getty line in the inittab that is associated with the console login.

Process

The shdaemon is in charge of detecting system hangs. It retrieves configuration information, initializes working structures, and starts detection with the time-outs set by the user.

The shdaemon is started by init with priority zero.

The shdaemon entry in the inittab is set to off or respawn each time the shconf command disables or enables the sh_pp option, respectively.

SMIT Interface

You can manage the SHD configuration from the SMIT System Environments menu. From the System Environments menu, select Manage System Hang Detection. The options in this menu allow system administrators to enable or disable the detection mechanism.

Configuration of the SHD

The shconf command can be used to configure the System Hang Detection. The following parameters may be used with shconf:

• -d: Displays the System Hang Detection status.
• -R -l prio: Resets the effective values to the defaults.
• -D[O] -l prio: Displays the default values (the optional O outputs the values separated by colons).
• -E[O] -l prio: Displays the effective values (the optional O outputs the values separated by colons).
• -l prio [-a Attribute=Value]: Changes the Attribute to the new Value.

Options

The following options can be used to customize the System Hang Detection:

Name        Default       Description
sh_pp       enable        Enable Process Priority Problem
pp_errlog   disable       Log Error in the Error Logging
pp_eto      2             Detection Time-out
pp_eprio    60            Process Priority
pp_warning  disable       Display a warning message on a console
pp_wto      2             Detection Time-out
pp_wprio    60            Process Priority
pp_wterm    /dev/console  Terminal Device
pp_login    enable        Launch a recovering login on a console
pp_lto      2             Detection Time-out
pp_lprio    56            Process Priority
pp_lterm    /dev/tty0     Terminal Device
pp_cmd      disable       Launch a command
pp_cto      2             Detection Time-out
pp_cprio    60            Process Priority
pp_cpath    /             Script
pp_reboot   disable       Automatically REBOOT system
pp_rto      5             Detection Time-out
pp_rprio    39            Process Priority

Example

The following output shows various uses of the shconf command:

# shconf -R -l prio <== restore default values
shconf: Default Problem Conf is restored.
shconf: Priority Problem Conf has changed.
# shconf -D -l prio <== display default values
sh_pp       disable       Enable Process Priority Problem
pp_errlog   disable       Log Error in the Error Logging
pp_eto      2             Detection Time-out
pp_eprio    60            Process Priority
pp_warning  disable       Display a warning message on a console
pp_wto      2             Detection Time-out
pp_wprio    60            Process Priority
pp_wterm    /dev/console  Terminal Device
pp_login    enable        Launch a recovering login on a console
pp_lto      2             Detection Time-out
pp_lprio    56            Process Priority
pp_lterm    /dev/tty0     Terminal Device
pp_cmd      disable       Launch a command
pp_cto      2             Detection Time-out
pp_cprio    60            Process Priority
pp_cpath    /             Script
pp_reboot   disable       Automatically REBOOT system
pp_rto      5             Detection Time-out
pp_rprio    39            Process Priority
# shconf -l prio -a pp_lterm=/dev/console <== change terminal device to /dev/console
shconf: Priority Problem Conf has changed.
# shconf -l prio -a sh_pp=enable <== enable priority problem detection
shconf: Priority Problem Conf has changed.
# ps -ef|grep shd <== verify the shdaemon has been started
    root  4982     1   0 17:08:17      -  0:00 /usr/sbin/shdaemon
    root  9558  9812   1 17:08:22      0  0:00 grep shd
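
Output in the name, value, description layout shown by shconf -D is easy to post-process. The following Python sketch parses such lines into a dictionary. The function name parse_shconf is hypothetical, the sample lines are copied from the example, and the exact column spacing of real shconf output is an assumption (any run of blanks is accepted as a separator).

```python
def parse_shconf(text):
    """Return {name: (value, description)} from shconf -D/-E style lines."""
    settings = {}
    for line in text.splitlines():
        # Skip shell prompts and shconf status messages.
        if line.startswith(("#", "shconf:")):
            continue
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue
        description = fields[2] if len(fields) > 2 else ""
        settings[fields[0]] = (fields[1], description)
    return settings


SAMPLE = """\
sh_pp disable Enable Process Priority Problem
pp_login enable Launch a recovering login on a console
pp_lto 2 Detection Time-out
pp_lprio 56 Process Priority
pp_lterm /dev/tty0 Terminal Device
"""

if __name__ == "__main__":
    for name, (value, description) in parse_shconf(SAMPLE).items():
        print(f"{name:10} {value:14} {description}")
```

Values are kept as strings so that device paths and numeric settings are handled the same way; a caller can convert the numeric ones as needed.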

truss command

Description

The truss command executes a specified command, or attaches to listed process IDs, and produces a trace of the system calls, received signals, and machine faults a process incurs. Each line of the trace output reports either the Fault or Signal name, or the Syscall name with parameters and return values. The subroutines defined in system libraries are not necessarily the exact system calls made to the kernel. The truss command does not report these subroutines, but rather, the underlying system calls they make. When possible, system call parameters are displayed symbolically using definitions from relevant system header files. For path name pointer parameters, truss displays the string being pointed to. By default, undefined system calls are displayed with their name, all eight possible arguments and the return value in hexadecimal format.

Options

The following options can be used on the truss command line:

Option                 Description
-a                     Displays the parameter strings passed in each system call.
-c                     Counts traced system calls, faults, and signals rather than displaying trace results line by line. A summary report is produced.
-e                     Displays the environment strings which are passed in each executed system call.
-f                     Follows all children created by the fork system call.
-i                     Keeps interruptible sleeping system calls from being displayed. Causes system calls to be reported only once, upon completion.
-m [!] Fault           Machine faults to trace/exclude. Faults may be specified by name or number (see the sys/fault.h header file). The default is -mall.
-o Outfile             Designates the file to be used for the trace output.
-p                     Interprets the parameters to truss as a list of process IDs of existing processes rather than as a command to be executed. truss takes control of each process and begins tracing it.
-r [!] FileDescriptor  Displays the full contents of the I/O buffer for each read on any of the specified file descriptors. The output is formatted 32 bytes per line and shows each byte either as an ASCII character (preceded by one blank) or as a two-character C language escape sequence for control characters. If ASCII interpretation is not possible, the byte is shown in two-character hexadecimal. The default is -r!all.
-s [!] Signal          Signals to trace/exclude. The trace output reports the receipt of each specified signal even if the signal is being ignored, but not blocked, by the process. Blocked signals are not received until the process releases them. Signals may be specified by name or number (see sys/signal.h). The default is -sall.
-t [!] Syscall         Includes/excludes system calls from the trace. The default is -tall.
-w [!] FileDescriptor  Displays the contents of the I/O buffer for each write on any of the listed file descriptors (see -r). The default is -w!all.
-x [!] Syscall         Displays data from the specified parameters of traced system calls in raw format, usually hexadecimal, rather than symbolically. The default is -x!all.

Each option that takes a list expects the items to be separated by commas. Use "all" or "!all" to include or exclude all possible values of the list.

truss output example

The following output shows an example of the use of the truss command:

# truss -aeims -rall -wall -o ls.out ls
ls.out
# more ls.out
execve("/usr/bin/ls", ...) argc: ...
 argv: ls
 envp: _=/usr/bin/truss LANG=C LOGIN=root
  NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
  PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
  LC__FASTMSG=true LOGNAME=root MAIL=/usr/spool/mail/root
  LOCPATH=/usr/lib/nls/loc USER=root AUTHSTATE=compat
  SHELL=/usr/bin/ksh ODMDIR=/etc/objrepos HOME=/ TERM=aixterm
  MAILMSG=[YOU HAVE NEW MAIL] PWD=/home/alex TZ=PST8PDT A__z=! LOGNAME
__get_kernel_tod_ptr(...) = ...
getuidx(...) = ...
kioctl(...) = ...
kioctl(...) = ...
sbrk(...) = ...
brk(...) = ...
sbrk(...) = ...
brk(...) = ...
statx(...) = ...
statx(...) = ...
open(..., O_RDONLY) = ...
getdirent(...) = ...
lseek(...) = ...
kfcntl(..., F_GETFD, ...) = ...
kfcntl(..., F_SETFD, ...) = ...
getdirent(...) = ...
getdirent(...) = ...
close(...) = ...
kioctl(...) = ...
kwrite(...) = ...
   l s . o u t \n
kfcntl(..., F_GETFL, ...) = ...
close(...) = ...
kfcntl(..., F_GETFL, ...) = ...
_exit(...)

[Numeric arguments and return values are not legible in this copy and are shown as "...".]
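
A trace saved with -o can be summarized after the fact, similar in spirit to the -c option. The Python sketch below counts system call names in trace text. The function name count_syscalls is hypothetical, the sample lines are made up for illustration rather than captured from a real system, and each call line is assumed to start with the call name followed by an opening parenthesis.

```python
import re
from collections import Counter

# Assumed shape of a call line: name(arg, ...) = value
CALL_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\(")


def count_syscalls(trace_text):
    """Count how often each system call name starts a line of the trace."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = CALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the general style of truss output.
SAMPLE = """\
statx("/tmp", 0x2FF22A10, 76, 0) = 0
open(".", O_RDONLY) = 3
getdirent(3, 0x20001D08, 4096) = 96
getdirent(3, 0x20001D08, 4096) = 0
close(3) = 0
 argv: ls
"""

if __name__ == "__main__":
    for name, count in count_syscalls(SAMPLE).most_common():
        print(f"{name:12} {count}")
```

Indented continuation lines, such as the argv and envp lines printed by -a and -e, do not match the pattern and are ignored.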

KDB kernel debugger

Introduction

The KDB is the kernel debugger used on AIX 5L running on Power systems.

Availability

The kernel debugger must be enabled in order to be used on AIX 5L. The following command should return 00000001 if the kernel debugger was enabled:

# kdb
(0)> dw kdb_avail
kdb_avail+000000: 00000001 00000000 00000000 00000000

Overview

The major functions of the KDB are:

• Setting breakpoints within the kernel or kernel extensions
• Execution control through various forms of step commands
• Formatted display of selected kernel data structures
• Display and modification of kernel data
• Display and modification of kernel instructions
• Modification of the state of the machine through alteration of system registers

Loading KDB

In AIX 5L, the KDB is included in all unix kernels found in /usr/lib/boot. In order to use it, the KDB must be loaded at boot time. To control whether KDB is loaded, use one of the following commands:

• bosboot -a -D -d /dev/ipldevice, or bosdebug -D: loads the KDB at boot time.
• bosboot -a -I -d /dev/ipldevice, or bosdebug -I: loads and invokes the KDB at boot time.
• bosboot -ad /dev/ipldevice, or bosdebug -o: does not load or invoke the KDB at boot time.

You must reboot the system for these changes to take effect.

Starting KDB

The KDB interface may be started, if loaded, under the following circumstances:

• If bosboot or bosdebug was run with -I, the KDB appears on the tty attached to a native serial port just after the kernel is loaded.
• You may invoke the KDB manually from a tty attached to a native serial port using Ctrl-4 or Ctrl-\, or from a native keyboard using Ctrl-Alt-Numpad4.
• An application makes a call to the breakpoint() kernel service or to the breakpoint system call.
• A breakpoint previously set using the KDB has been reached.
• A fatal system error occurs. A dump might be generated on exit from the KDB.

KDB Concept

When the KDB Kernel Debugger is invoked, it is the only running program until you exit the KDB or you use the start sub command to start another CPU. All processes are stopped and interrupts are disabled. The KDB Kernel Debugger runs with its own Machine State Save Area (mst) and a special stack. In addition, the KDB Kernel Debugger does not run operating system routines. Though this requires that kernel code be duplicated within KDB, it is possible to break anywhere within the kernel code. When exiting the KDB Kernel Debugger, all processes continue to run unless the debugger was entered via a system halt.

kdb command

Introduction

The kdb command, unlike the KDB kernel debugger, allows examination of an operating system image produced on Power systems.

The kdb command may be used on a running system but will not provide all functions available with the KDB kernel debugger.

Parameters

The kdb command may be used with the following parameters:

• no parameter: kdb uses /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case root permissions are required.
• -m system_image_file: kdb uses the image file provided.
• -u kernel_file: kdb uses the kernel file provided. This is required to analyze a system dump on a system with a different level of unix.
• -k kernel_modules: a comma-separated list of kernext symbols to add.
• -w: views an XCOFF object.
• -v: prints CDT entries.
• -h: prints help.
• -l: disables the inline more; useful for running a non-interactive session.

Loading errors

If the system image file provided doesn’t contain a valid dump or the kernel file doesn’t match the system image file, the following message may be issued by the kdb command:

# kdb -m dump_file -u /usr/lib/boot/unix
The specified kernel file is a 64-bit kernel
core mapped from @ 700000000000000 to @ 7000000000120a7
Preserving 884137 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (8192 bytes) allocated
Component Dump Table not found.
Kernel not included in this dump.
dump core corrupted
make sure /usr/lib/boot/unix refers to the running kernel

KDB miscellaneous sub commands

Introduction

The following table lists the miscellaneous sub commands and their matching crash/lldb sub commands, when available.

machdep function                            crash/lldb sub commands  KDB sub commands  kdb sub commands
reboot the machine                          reboot                   reboot            N/A
display help                                ?/help                   ?                 ?
run an AIX command                          !                        !                 !
exit                                        q/go                     q                 go
set debugger parameters                                              set
display elapsed time                                                 time              N/A
enable/disable debug                                                 debug
calculate/convert a hexadecimal expression  calc/conv                hcal/cal          hcal/cal
calculate/convert a decimal expression      calc/conv                dcal              dcal

reboot sub command

The reboot subcommand can be used to reboot the machine. This subcommand issues a prompt for confirmation that a reboot is desired before executing the reboot. If the reboot request is confirmed, the soft reboot interface is called (sr_slih(1)).

! sub command

The ! sub command allows the user to run an AIX command without leaving the kdb or KDB kernel debugger.

? sub command

The help or ? sub command can be used to display a long sub command listing or to display help by subject.

Help for a particular command can be displayed by entering the sub command followed by ?.

q sub command

For the KDB Kernel Debugger, this subcommand exits the debugger with all breakpoints installed in memory. To exit the KDB Kernel Debugger without breakpoints, the ca subcommand should be invoked to clear all breakpoints prior to leaving the debugger.

The optional dump argument can be specified to force an operating system dump. The method used to force a dump depends on how the debugger was invoked.

set sub command

The set sub command can be used to toggle the kdb parameters. set accepts the following parameters:

• none: displays the current parameters
• 1: no_symbol
• 2: mst_wanted
• 3: screen_size
• 4: power_pc_syntax
• 5: origin
• 6: Unix symbols start from
• 7: hexadecimal_wanted
• 8: screen_previous
• 9: display_stack_frames
• 10: display_stacked_regs
• 11: 64_bit
• 12: ldr_segs_wanted
• 13: emacs_window
• 14: Thread attached local breakpoint
• 15: KDB stops all processors
• 17: kext_IF_active
• 18: trace_back_lookup
• 19: IPI_enable

time sub command

The time command can be used to determine the elapsed time from the last time the KDB Kernel Debugger was left to the time it was entered.

debug sub command

The debug subcommand may be used to print additional information during KDB execution. The primary use of this subcommand is to aid in ensuring that the debugger is functioning properly. The debug sub command can be used with the following arguments:

• no argument: displays the current debug flags
• dbg1++/dbg1--: set/unset FM HW lookup debug
• dbg2++/dbg2--: set/unset vmm tr/tv cmd debug
• dbg3++/dbg3--: set/unset vmm SW lookup debug
• dbg4++/dbg4--: set/unset symbol lookup debug
• dbg5++/dbg5--: set/unset stack trace debug
• dbg61++/dbg61--: set/unset BRKPT debug (list)
• dbg62++/dbg62--: set/unset BRKPT debug (instr)
• dbg63++/dbg63--: set/unset BRKPT debug (suspend)
• dbg64++/dbg64--: set/unset BRKPT debug (phantom)
• dbg65++/dbg65--: set/unset BRKPT debug (context)
• dbg71++/dbg71--: set/unset DABR debug (address)
• dbg72++/dbg72--: set/unset DABR debug (register)
• dbg73++/dbg73--: set/unset DABR debug (status)
• dbg81++/dbg81--: set/unset BRAT debug (address)
• dbg82++/dbg82--: set/unset BRAT debug (register)
• dbg83++/dbg83--: set/unset BRAT debug (status)

hcal/dcal sub commands

The hcal subcommand evaluates hexadecimal expressions and displays the result in both hex and decimal.

The dcal subcommand evaluates decimal expressions and displays the result in both hex and decimal.
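
The effect of these two sub commands can be mimicked outside the debugger. The following Python sketch is not the KDB implementation: the names hcal and dcal are reused only for readability, just the operators +, - and * are handled (an assumption), hexadecimal input is expected without a 0x prefix, and the returned pair (hex string, decimal value) is an invented layout.

```python
import ast
import operator
import re

# Operators supported by this sketch.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _evaluate(node):
    """Evaluate a parsed expression made of integers and + - *."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def dcal(expression):
    """Evaluate a decimal expression; return (hex string, decimal value)."""
    value = _evaluate(ast.parse(expression.strip(), mode="eval"))
    return format(value, "X"), value


def hcal(expression):
    """Evaluate an expression whose numbers are all hexadecimal."""
    # Prefix every run of hex digits with 0x so Python reads it as hex.
    as_python = re.sub(r"[0-9A-Fa-f]+", lambda m: "0x" + m.group(0), expression)
    return dcal(as_python)


if __name__ == "__main__":
    print(hcal("10+20"))
    print(dcal("4096"))
```

Parsing with ast instead of calling eval keeps the sketch limited to arithmetic on integer literals.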

KDB dump/display/decode sub commands

Introduction The following table represents the dump/display/decode sub commands and their matching crash/lldb sub commands when available

d/dw/dd/dp/dpw/dpd sub commands

The d/dw/dd/dp/dpw/dpd sub commands are used to display memory in the following sizes:
• d, dp: display bytes
• dw, dpw: display words
• dd, dpd: display double words

Addresses are specified as:
• virtual addresses for d, dw and dd
• physical addresses for dp, dpw and dpd

These sub commands accept the following arguments:
• Address - starting address of the area to be dumped. Hexadecimal values or hexadecimal expressions can be used to specify the address.
• count - number of bytes (d, dp), words (dw, dpw), or double words (dd, dpd) to be displayed. The count argument is a hexadecimal value.
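The relation between the unit size and the count argument can be sketched in Python as follows. The function dump_lines and its layout (16 bytes per line, hexadecimal groups followed by an ASCII column) are assumptions modelled on the sample sessions in this guide, not KDB code.

```python
def dump_lines(memory, unit, count, label="mem"):
    """Format `count` units of `unit` bytes each, 16 bytes per output line.

    unit=1 imitates d, unit=4 imitates dw, unit=8 imitates dd (hypothetical).
    """
    nbytes = unit * count
    lines = []
    for offset in range(0, nbytes, 16):
        row = memory[offset:min(offset + 16, nbytes)]
        groups = [row[i:i + unit].hex().upper() for i in range(0, len(row), unit)]
        # Non-printable bytes are shown as "." in the ASCII column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{label}+{offset:06X}: {' '.join(groups)}  {text}")
    return lines

sample = bytes.fromhex("4149580000000000") + bytes(8)
for line in dump_lines(sample, unit=8, count=2, label="utsname"):
    print(line)
```

Because count is given in units rather than bytes, the same count covers eight times as much memory with dd as with d.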

dump/display/decode function   crash/lldb sub commands      KDB sub commands   kdb sub commands
display byte data              display                      d                  d
display word data              od (2 units)/display         dw                 dw
display double word data       od (4 units)/display         dd                 dd
display code                   id/decode/od (format I)/un   dc/dpc/dis         dc/dpc/dis
display registers              float/sregs                  dr                 dr
display device byte            -                            ddvb/ddpb          N/A
display device half word       -                            ddvh/ddph          N/A
display device word            -                            ddvw/ddpw          N/A
display device double word     -                            ddvd/ddpd          N/A
display physical memory        display                      dp/dpw/dpd         dp/dpw/dpd
find pattern                   find                         find/findp         find/findp
extract pattern                link                         ext/extp           ext/extp

dc/dpc/dis sub commands

The display code subcommands dc, dis and dpc can be used to decode instructions. The address argument for the dc subcommand is an effective address. The address argument for the dpc subcommand is a physical address. They accept the following arguments:

• Address - address of the code to disassemble. This can either be a virtual (effective) or physical address, depending on the subcommand used. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

• count - indicates the number of instructions to be disassembled. The value specified must be a decimal value or decimal expression.

ddvb/ddvh/ddvw/ddvd/ddpb/ddph/ddpw/ddpd sub commands

IO space memory (Direct Store Segment (T=1)) cannot be accessed when translation is disabled. BAT-mapped areas must also be accessed with translation enabled; otherwise cache controls are ignored.

The subcommands ddvb, ddvh, ddvw and ddvd can be used to access these areas in translated mode, using an effective address that is already mapped.

The subcommands ddpb, ddph, ddpw and ddpd can be used to access these areas in translated mode, using a physical address that will be mapped.

On a 64-bit machine, correctly aligned double words are accessed (ddpd and ddvd) with a single load (ld) instruction.

The DBAT interface is used to translate this address in cache-inhibited mode (PowerPC only).

The ddvb/ddvh/ddvw/ddvd/ddpb/ddph/ddpw/ddpd sub commands use the following parameters:

• Address - address of the starting memory area to display. This can be either an effective or a real address, depending on the subcommand used. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

• count - number of bytes (ddvb, ddpb), half words (ddvh, ddph), words (ddvw, ddpw), or double words (ddvd, ddpd) to display. The count argument is a hexadecimal value.

find findp sub commands

The find and findp subcommands can be used to search for a specific pattern in memory. The find subcommand requires an effective address for the address argument, whereas the findp subcommand requires a real address. find and findp accept the following parameters:

• -s: flag indicating that the pattern to be searched for is an ASCII string.
• Address: address where the search is to begin. This can be either a virtual (effective) or physical address, depending on the subcommand used. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.
• string: ASCII string to search for if the -s option is specified.
• pattern: hexadecimal value specifying the pattern to search for. The pattern is limited to one word in length.
• mask: if a pattern is specified, a mask can be specified to eliminate bits from consideration for matching purposes. This argument is a one word hexadecimal value.
• delta: increment to move forward after an unsuccessful match. This argument is a one word hexadecimal value.
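How pattern, mask and delta interact can be sketched in Python as follows. This is an illustration only: the function find_pattern is hypothetical, words are assumed to be 4 bytes and big-endian, and a bit set in mask is assumed to mean that the bit takes part in the comparison.

```python
def find_pattern(memory, start, pattern, mask=0xFFFFFFFF, delta=4):
    """Return the offset of the first word matching `pattern` under `mask`.

    Returns None when no word matches before the end of `memory`.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    offset = start
    while offset + 4 <= len(memory):
        word = int.from_bytes(memory[offset:offset + 4], "big")
        if (word & mask) == (pattern & mask):
            return offset
        offset += delta  # unsuccessful match: move forward by delta
    return None

# 0x20 bytes of zeros followed by the bytes of "oc3b42\0" and "3"
memory = bytes(0x20) + bytes.fromhex("6F63336234320033")
print(hex(find_pattern(memory, 0, 0x6F633362)))
print(hex(find_pattern(memory, 0, 0x34320000, mask=0xFFFF0000)))
```

With the full mask the first call matches the word at offset 0x20; with only the upper half word compared, the second call matches at offset 0x24 even though the lower half word differs.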

ext/extp sub commands

The ext and extp subcommands can be used to display a specific area from a structure. If an array exists, it can be traversed displaying the specified area for each entry of the array. These subcommands can also be used to traverse a linked list displaying the specified area for each entry.

For the ext subcommand the address argument specifies an effective address. For the extp subcommand the address argument specifies a physical address.

ext and extp accept the following arguments:

• -p: flag to indicate that the delta argument is the offset to a pointer to the next area.

• Address: address at which to begin display of values. This can either be a virtual (effective) or physical address depending on the subcommand used. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

• delta: offset to the next area to be displayed or offset from the beginning of the current area to a pointer to the next area. This argument is a hexadecimal value.

• size: hexadecimal value specifying the number of words to display.
• count: hexadecimal value specifying the number of entries to traverse.
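The two traversal modes can be sketched in Python as follows. The function extract is hypothetical; it assumes 4-byte big-endian words and pointers, treats addresses as offsets into a byte buffer, and assumes that a null pointer ends a linked list.

```python
def extract(memory, address, delta, size, count, follow_pointer=False):
    """Collect `size` words from each of up to `count` entries.

    Without follow_pointer, entries are `delta` bytes apart (array traversal).
    With follow_pointer (the -p flag), the word at address + delta holds the
    address of the next entry (linked-list traversal).
    """
    def word(at):
        return int.from_bytes(memory[at:at + 4], "big")

    entries = []
    for _ in range(count):
        entries.append((address, [word(address + 4 * i) for i in range(size)]))
        if follow_pointer:
            address = word(address + delta)
            if address == 0:      # assumed end-of-list marker
                break
        else:
            address += delta
    return entries

# Three list nodes of the form (value, next): 0x10 -> 0x20 -> 0x08 -> end
w = lambda v: v.to_bytes(4, "big")
memory = bytes(8) + w(0xC) + w(0) + w(0xA) + w(0x20) + bytes(8) + w(0xB) + w(0x08) + bytes(8)
print(extract(memory, 0x10, 4, 1, 10, follow_pointer=True))
print(extract(memory, 0x08, 8, 1, 2))
```

The first call follows the pointer stored 4 bytes into each node; the second treats the same memory as an array of 8-byte entries.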

dr sub command The display registers sub command can be used to display:

• gp: general purpose registers
• sr: segment registers
• sp: special registers
• fp: floating point registers
• register name: an individual register

The current context is used to locate the values to display. The switch sub command can be used to change context to other threads.

examples The following examples show the use of the display sub commands:

# hostname  <== get the hostname
oc3b42
# ctrl-\  <== enter the kdb
KDB(0)> find -s 0 oc3b42
utsname+000020: 6F63 3362 3432 0033 3443 3030 0000 0000 oc3b42.34C00....
KDB(0)> dd utsname 3  <== display 3 double word of utsname
utsname+000000: 4149580000000000 0000000000000000 AIX.............
utsname+000010: 0000000000000000 0000000000000000 ................
utsname+000020: 6F63336234320033 3443303000000000 oc3b42.34C00....
(0)> dr sp  <== display the special purposes registers in current context
iar : 000000000000B65C msr : A0000000000090B2 cr : 44284448
lr : 000000000001C950 ctr : 0000000000000020 xer : 0EB8C400
mq : DEADBEEF asr : 000000000EB8E001
dsisr: 00000000 dar : 0000000000000000 dec : 00000000
sdr1: 0000000000000000 srr0: 0000000000000000 srr1: 0000000000000000
dabr: 0000000000000000 tbu : 00000000 tbl : 00000000
sprg0: 0000000000000000 sprg1: 0000000000000000
sprg2: 0000000000000000 sprg3: 0000000000000000
pir : 00000000 pvr : 00000000 ear : 00000000
hid0: 00000000 iabr: 0000000000000000
buscsr: 0000000000000000 l2cr: 0000000000000000 l2sr: 0000000000000000
via : 0000000000000000 sda : 0000000000000000
mmcr0: 00000000 mmcr1: 00000000
pmc1: 00000000 pmc2: 00000000 pmc3: 00000000 pmc4: 00000000
pmc5: 00000000 pmc6: 00000000 pmc7: 00000000 pmc8: 00000000

KDB modify memory sub commands

Introduction The table below lists the modify memory sub commands and their matching crash/lldb sub commands, when available.

m/mp/mw/mpw/md/mpd sub commands

The m/mp/mw/mpw/md/mpd sub commands are used to modify memory in the following sizes:
• m, mp: modify bytes
• mw, mpw: modify words
• md, mpd: modify double words

Addresses are specified as:
• virtual addresses for m, mw and md
• physical addresses for mp, mpw and mpd

These sub commands accept the following argument:
• Address - starting address of the area to be modified. Hexadecimal values or hexadecimal expressions can be used to specify the address.

The sub commands will prompt for new values until a “.” value is entered.

modify memory function          crash/lldb sub commands   KDB sub commands   kdb sub commands
modify sequential bytes         alter -c/stc              m                  N/A
modify sequential word          alter -w/st               mw                 N/A
modify sequential double word   alter -l                  md                 N/A
modify sequential half word     sth                       sth                N/A
modify registers                set                       mr                 N/A
modify device byte              -                         mdvb/mdpb          N/A
modify device half word         -                         mdvh/mdph          N/A
modify device double word       -                         mdvd/mdpd          N/A
modify physical memory          -                         mp/mpw/mpd         N/A

mr sub commands

The mr sub command can be used to modify general purpose, segment, special, or floating point registers. Individual registers can also be selected for modification by register name. The current thread context is used to locate the register values to be modified. The switch sub command can be used to change context to other threads. When the register being modified is in the mst context, KDB alters the mst. When the register being modified is a special one, the register is altered immediately. Symbolic expressions are allowed as input.

The following arguments can be used:
• gp - modify general purpose registers.
• sr - modify segment registers.
• sp - modify special purpose registers.
• fp - modify floating point registers.
• reg_name - modify a specific register, by name.

mr will prompt for input if a register name was specified, or will prompt for input until a “.” is entered.

mdvb/mdpb/mdvh/mdph/mdvw/mdpw/mdvd/mdpd sub commands

These subcommands are available to write to IO space memory. To avoid side effects, memory is not read first; only the specified write is performed, with translation enabled.

Access can be in bytes, half words, words or double words. The address can be an effective address or a real address.

The subcommands mdvb, mdvh, mdvw and mdvd can be used to access these areas in translated mode, using an effective address that is already mapped. The subcommands mdpb, mdph, mdpw and mdpd can be used to access these areas in translated mode, using a physical address that will be mapped. On a 64-bit machine, correctly aligned double words are accessed (mdpd and mdvd) with a single store instruction. The DBAT interface is used to translate this address in cache-inhibited mode (PowerPC only).

These subcommands accept the following parameter:
• Address - address of the memory to modify. This can be either a virtual (effective) or physical address, depending on the subcommand used. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

These sub commands will prompt for input until a “.” is entered.

examples

# uname -a  <== get utsname structure
oc3b42
# ctrl-\  <== enter the kdb
KDB(0)> dd utsname 6  <== display 6 double words of utsname
utsname+000000: 4149580000000000 0000000000000000 AIX.............
utsname+000010: 0000000000000000 0000000000000000 ................
utsname+000020: 6F63336234320033 3443303000000000 oc3b42.34C00....
KDB(0)> mw utsname+000020
utsname+000020: 6F633362 = 616c6578
utsname+000024: 34320033 = .
KDB(0)> dw utsname 12  <== display 12 words of utsname
utsname+000000: 41495800 00000000 00000000 00000000 AIX.............
utsname+000010: 00000000 00000000 00000000 00000000 ................
utsname+000020: 616C6578 34320033 34433030 00000000 alex42.34C00....
utsname+000030: 00000000 00000000 00000000 00000000 ................
utsname+000040: 30000000 00000000 0.......
KDB(0)> q
# uname -a  <== now let's see what we did
AIX alex42 0 5 000714834C00
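Two details of this session are easy to miss: the count given to dw is hexadecimal, and the value typed at the mw prompt is the ASCII encoding of the new name. The short Python check below, which is not part of KDB and uses only values taken from the session above, makes both explicit.

```python
# The count argument of dw is hexadecimal: "dw utsname 12" asks for 0x12 words.
word_count = int("12", 16)
print(word_count)                    # 18 words, i.e. 72 bytes of utsname

# The value typed at the mw prompt is the ASCII encoding of the new name prefix.
new_word = bytes.fromhex("616c6578")
print(new_word.decode("ascii"))      # alex

# Overlaying it on the original double word at utsname+000020 gives the new nodename.
original = bytes.fromhex("6F63336234320033")
patched = new_word + original[4:]
print(patched.split(b"\x00")[0].decode("ascii"))   # alex42
```

This is why the dw output has four full lines plus two words (18 words in total) and why uname reports alex42 afterwards.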

KDB trace sub commands

Introduction The table below lists the trace sub commands and their matching crash/lldb sub commands, when available.

bt sub command The trace point subcommand bt can be used to trace each execution of a specified address. Each time a trace point is encountered during execution, a message is displayed indicating that the trace point has been encountered. The displayed message indicates the first entry from the stack.

The bt sub command accepts the following parameters:

• -p - flag to indicate that the trace address is a real address.
• -v - flag to indicate that the trace address is a virtual address.
• Address - address of the trace point. This may be either a virtual (effective) or physical address. Symbols, hexadecimal values, or hexadecimal expressions may be used in specifying an address.
• script - a list of subcommands to be executed each time the indicated trace point is executed. The script is delimited by quote (") characters and commands within the script are delimited by semicolons (;).

The bt sub command can also take a test parameter, so that the trace point is acted on at the specified address only if the test condition is true.

The conditional test requires two operands and a single operator. Values that can be used as operands in a test subcommand include symbols, hexadecimal values, and hexadecimal expressions. Comparison operators that are supported include: ==, !=, >=, <=, >, and <.

Additionally, the bitwise operators ^ (exclusive OR), & (AND), and | (OR) are supported.

When bitwise operators are used, any non-zero result is considered to be true.
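The evaluation rule for the conditional test can be sketched in Python as follows. The function condition_holds is a hypothetical helper written for this guide; it only mirrors the rule stated above (two operands, one operator, non-zero bitwise results count as true) and says nothing about how KDB parses the test.

```python
import operator

COMPARISONS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
               "<=": operator.le, ">": operator.gt, "<": operator.lt}
BITWISE = {"^": operator.xor, "&": operator.and_, "|": operator.or_}

def condition_holds(left, op, right):
    """Evaluate a two-operand test on integer operands."""
    if op in COMPARISONS:
        return COMPARISONS[op](left, right)
    if op in BITWISE:
        # Any non-zero result of a bitwise operator is considered true.
        return BITWISE[op](left, right) != 0
    raise ValueError(f"unsupported operator: {op}")

print(condition_holds(0x10, "==", 0x10))   # comparison
print(condition_holds(0xF0, "&", 0x0F))    # no common bits, result is zero
print(condition_holds(0xF0, "|", 0x00))    # non-zero result
```

A test such as value & mask is therefore a convenient way to trigger only when at least one of the masked bits is set.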

trace function           crash/lldb sub commands   KDB sub commands   kdb sub commands
set/list trace point     loop                      bt                 N/A
clear trace point        -                         ct                 N/A
clear all trace points   -                         cat                N/A

ct/cat sub command

The cat and ct sub commands erase all and individual trace points, respectively. The trace point cleared by the ct subcommand may be specified either by a slot number or an address. These sub commands accept the following arguments:

• -p: flag to indicate that the trace address is a real address.
• -v: flag to indicate that the trace address is a virtual address.
• slot: slot number for a trace point. This argument must be a decimal value.
• Address: address of the trace point. This may be either a virtual (effective) or physical address. Symbols, hexadecimal values, or hexadecimal expressions may be used in specifying an address.

examples The following examples show the use of the trace sub commands:

#  <== ctrl-\ to enter the KDB from a native serial port
Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3 <00000000> r29=F1000097140A011C,r31=0
KDB(0)> bt open  <== add a trace point at open() address
.open+000000 (sid:00000000) trace {hit: 0}
KDB(0)> q  <== exit the debugger
# ls  <== run some command to call open
[0][00387D04]open+000000 (0000000020008B88, 0000000000000000, 00000000000001B6 [??])
[0][00387D04]open+000000 (0000000020000CA4, 0000000000000000, 00000000F00A0810 [??])
.bash_history dev lpp sbin u
.bashrc etc lpp_name scratch unix
.sh_history home mnt smit.log usr
.xerrors j2 opt smit.script var
audit lib proc tftpboot
bin lost+found qd0 tmp
#  <== ctrl-\ to enter the KDB from a native serial port
KDB(0)> bt open "dr"  <== will run dr when open is entered
.open+000000 (sid:00000000) trace {hit: 0}
KDB(0)> q  <== exit the debugger
# ls  <== run some command to call open
r0 : 00000000000090B2 r1 : F00000002FF3B390 r2 : 000000000046AC80
r3 : 0000000020008B88 r4 : 0000000000000000 r5 : 00000000000001B6
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 000000001E821C00
r9 : 0000000000000000 r10 : 0000000011D3E8F0 r11 : F00000002FF3B400
r12 : F10000971E821C00 r13 : F10000971F1FF200 r14 : 0000000000000001
r15 : 000000002000D2A8 r16 : 000000002FF22D6C r17 : 00000000FFFFFFCB
r18 : 0000000000000001 r19 : 0000000000000000 r20 : 0000000020007680
r21 : 0000000000000000 r22 : 0000000000002CB6 r23 : 0000000000000000
r24 : 000000002FF229F0 r25 : 0000000000000014 r26 : 000000002000D2DC
r27 : 0000000000000000 r28 : 00000000F0061768 r29 : 00000000FFFFFFFF
r30 : 00000000D0054FAC r31 : 0000000000000000

KDB break point and step sub commands

Introduction The table below lists the breakpoint and step sub commands and their matching crash/lldb sub commands, when available.

b/lb sub command

The b subcommand sets a permanent global breakpoint in the code. KDB checks that a valid instruction will be trapped. If an invalid instruction is detected, a warning message is displayed. In that case the breakpoint has already been installed and should be removed; otherwise, memory can be corrupted.

The lb sub command acts the same way as the b sub command, except that the break point is local to the thread or cpu, depending on option 14 of the set sub command.

The following arguments may be used with the b/lb sub commands:

• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• Address - address of the breakpoint. This may be either a virtual (effective) or physical address. Symbols, hexadecimal values, or hexadecimal expressions may be used to specify the address.

breakpoint and step function   crash/lldb sub commands   KDB sub commands   kdb sub commands
set/list break point           break/breaks              b                  N/A
set/list local break point     break/breaks              lb                 N/A
clear local break point        clear                     lc                 N/A
clear break points             clear                     c                  N/A
clear all breakpoint           clear                     ca                 N/A
go to end of function          -                         r                  N/A
go until address               -                         gt                 N/A
next instruction               step                      n/nexti, s/stepi   N/A
step on bl/blr                 -                         S                  N/A
step on branch                 -                         B                  N/A

c/lc/ca sub commands

c, lc and ca can be used to clear break points. The differences are:

• c clears general break points
• lc clears local break points
• ca clears all break points

The c and lc sub commands use the following parameters:

• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• slot - slot number of the breakpoint. This argument must be a decimal value.
• Address - address of the breakpoint. This may be either a virtual (effective) or physical address. Symbols, hexadecimal values, or hexadecimal expressions may be used to specify the address.

The lc sub command may use this additional parameter:

• ctx - context to be cleared for a local break. The context may be either a CPU or thread specification.

r/gt sub command

A non-permanent breakpoint can be set using the subcommands r and gt. These subcommands set local breakpoints which are cleared after they have been hit. The r subcommand sets a breakpoint on the address found in the lr register. In an SMP environment the same address could be reached on another processor, so it is important that this break point is local to the thread or process.

The gt subcommand performs the same as the r subcommand except that the breakpoint address must be specified.

The r and gt sub commands accept the following parameters:

• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• Address - address of the breakpoint. This may be either a virtual (effective) or physical address. Symbols, hexadecimal values, or hexadecimal expressions may be used to specify the address.

n/s sub command

The two subcommands n and s provide step functions. The s subcommand allows the processor to single step to the next instruction. The n subcommand also single steps, but it steps over subroutine calls as though they were a single instruction.

The n/s sub commands accept the following parameter:

• count: specify how many steps are executed before returning to the KDB prompt.

S/B sub commands

The S subcommand single steps but stops only on bl and br instructions. This lets you see every routine call and return. A count can also be used to specify how many times KDB continues before stopping.

The B subcommand steps stopping at each branch instruction.

The S/B sub commands accept the following parameter:

• count: specify how many steps are executed before returning to the KDB prompt.
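The difference between S and B can be illustrated with a toy Python model. Everything here is a simplification made for this example: instructions are assumed to execute strictly in listing order, any mnemonic starting with "b" is treated as a branch, and next_stop is a hypothetical helper, not a KDB interface.

```python
CALL_RETURN = {"bl", "blr", "br"}   # mnemonics S is assumed to stop on

def is_stop(mnemonic, subcommand):
    if subcommand == "S":
        return mnemonic in CALL_RETURN
    if subcommand == "B":
        return mnemonic.startswith("b")
    raise ValueError(f"unsupported subcommand: {subcommand}")

def next_stop(mnemonics, current, subcommand, count=1):
    """Index of the instruction at which stepping would stop, or None."""
    index = current
    for _ in range(count):
        index += 1
        while index < len(mnemonics) and not is_stop(mnemonics[index], subcommand):
            index += 1
        if index >= len(mnemonics):
            return None          # ran off the end of the listing
    return index

# Mnemonics of the instructions from open+000008 to open+000024
listing = ["mflr", "extsw", "addi", "std", "clrlwi", "bl", "ori", "clrlwi"]
print(next_stop(listing, 0, "B"))
```

Starting at the first instruction of this listing, the model stops at index 5, the bl instruction.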

Example The following example shows the use of break points:

#
Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3 <00000000> r29=0,r31=0
KDB(0)> br open  <== we set a break point on open
.open+000000 (sid:00000000) permanent & global
KDB(0)> q  <== we exit the kdb
# ls  <== do some command that will certainly call open
Breakpoint  <== open was called so we enter the KDB
.open+000000 li r6,0 <0000000000000000> r6=0
KDB(0)> s  <== do one step
.open+000004 stdu stkp,FFFFFF80(stkp) stkp=F00000002FF3B390,FFFFFF80(stkp)=F00000002FF3B310
KDB(0)> n  <== another one
.open+000008 mflr r0 <.sys_call_ret+000000>
KDB(0)> dis .open+000008 32  <== now let's find the following branch
.open+000008 mflr r0
.open+00000C extsw r4,r4
.open+000010 addi r7,stkp,70
.open+000014 std r0,90(stkp)
.open+000018 clrlwi r5,r5,0
.open+00001C bl <.copen>  <== here it is
.open+000020 ori r0,r3,0
.open+000024 clrlwi r4,r3,0
KDB(0)> B  <== this will break at the next branch, which should be open+1c
.open+00001C bl <.copen> r3=0000000020008B88
KDB(0)> s  <== we step that branch
.copen+000000 std r31,FFFFFFF8(stkp) r31=0,FFFFFFF8(stkp)=F00000002FF3B308
KDB(0)> dr lr  <== let's see what is in the link register
lr : 0000000000387D24
.open+000020 ori r0,r3,0 <0000000020008B88> r0=000000000000377C,r3=0000000020008B88
KDB(0)> r  <== break on the lr (we will return to the calling function)
.open+000020 ori r0,r3,0 <0000000000000000> r0=0000000000000030,r3=0
KDB(0)> ca  <== clear all break points before leaving
KDB(0)> q  <== exit the KDB

KDB name list/symbol sub commands

Introduction The table below lists the name list/symbol sub commands and their matching crash/lldb sub commands, when available.

nm sub command

The nm subcommand translates symbols to addresses. nm uses the following argument:

• symbol - symbol name.

ns sub command The ns subcommand toggles symbolic name translation on and off. This is equivalent to the set sub command option 1.

ts sub command The ts subcommand translates addresses to symbolic representations. ts uses the following argument:

• Address - effective address to be translated. This argument may be a hexadecimal value or expression.
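The two directions of translation can be sketched in Python as follows. The functions nm and ts below are hypothetical stand-ins named after the subcommands; the symbol addresses are taken from the sample sessions in this guide, and the rule "use the nearest symbol at or below the address" is an assumption about how the symbol+offset form is produced.

```python
import bisect

# Addresses taken from the examples in this guide.
SYMBOLS = {"utsname": 0x001CB9C8, "open": 0x00387D04, "kdb_avail": 0x0046AE70}

def nm(symbol):
    """Symbol name to address."""
    return SYMBOLS[symbol]

def ts(address, translation_on=True):
    """Address to symbol+offset, or a plain address when translation is off."""
    if translation_on:
        table = sorted((addr, name) for name, addr in SYMBOLS.items())
        starts = [addr for addr, _ in table]
        i = bisect.bisect_right(starts, address) - 1
        if i >= 0:
            base, name = table[i]
            return f"{name}+{address - base:06X}"
    return f"{address:08X}"

print(ts(0x0046AE70))
print(ts(0x0046AE70, translation_on=False))
print(ts(0x00387D24))
```

With translation on, the first call gives kdb_avail+000000 and the third gives open+000020; with translation off, only the numeric address is shown.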

examples

(0)> nm kdb_avail  <== display addresses for the kdb_avail symbol
Symbol Address: 0046AE70
TOC Address: 0046AC80
(0)> set 1  <== turn address translation off
Symbolic name translation off
(0)> ts 046AE70  <== get symbol for 046AE70
0046AE70  <== didn’t get it because address translation is turned off
(0)> ns  <== turn address translation back on
Symbolic name translation on
(0)> ts 046AE70  <== now we should get the symbol
kdb_avail+000000

name list/symbol function   crash/lldb sub commands   KDB sub commands   kdb sub commands
translate symbol to eaddr   nm                        nm                 nm
no symbol mode (toggle)     hide                      ns                 ns
translate eaddr to symbol   ts/ds                     ts                 ts

KDB watch break point sub commands

Introduction The table below lists the watch break point sub commands and their matching crash/lldb sub commands, when available.

wr, ww, wrw, lwr, lww, lwrw, cw and lcw sub commands

On the PowerPC architecture, a watch register (the DABR, Data Address Breakpoint Register, or HID5 on Power 601) can be used to enter KDB when a specified effective address is accessed. The register holds a double-word effective address and bits that specify whether load operations, store operations, or both are watched.

The watch break points therefore follow these rules:

• wr and lwr break on read
• ww and lww break on write
• wrw and lwrw break on read or write
• wr, ww and wrw break in any context
• lwr, lww and lwrw break on a specific cpu
• cw and lcw clear general or local watch break points

wr, ww, wrw, lwr, lww, lwrw, cw and lcw accept the following arguments:

• -p: flag indicating that the address argument is a physical address.
• -v: flag indicating that the address argument is a virtual address.
• -e: flag indicating that the address argument is an effective address.
• Address: address to be watched. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.
• size: indicates the number of bytes that are to be watched. This argument is a decimal value.
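How a double-word address and the load/store bits can share one register is sketched below in Python. The bit positions used here (the three low-order bits for translation, store and load) follow the usual description of the 64-bit PowerPC DABR, but they should be treated as an assumption for this illustration; dabr_value is a hypothetical helper and does not show how KDB programs the register.

```python
DABR_READ = 0x1        # assumed DR bit: trap on load
DABR_WRITE = 0x2       # assumed DW bit: trap on store
DABR_TRANSLATE = 0x4   # assumed BT bit: address is translated

def dabr_value(eaddr, read, write, translate=True):
    """Combine a double-word aligned address with the control bits."""
    value = eaddr & ~0x7          # drop the 3 low bits: double-word alignment
    if translate:
        value |= DABR_TRANSLATE
    if write:
        value |= DABR_WRITE
    if read:
        value |= DABR_READ
    return value

# A read watch on utsname (eaddr=001CB9C8 in the sample session)
print(f"{dabr_value(0x001CB9C8, read=True, write=False):016X}")
```

Because the low three bits of the address are reused for control, the hardware can only watch whole double words.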

watch break point function   crash/lldb sub commands   KDB sub commands   kdb sub commands
stop on read data            watch                     wr                 N/A
stop on write data           watch                     ww                 N/A
stop on r/w data             watch                     wrw                N/A
local stop on read data      watch                     lwr                N/A
local stop on write data     watch                     lww                N/A
local stop on r/w data       watch                     lwrw               N/A
clear watch                  cw                        cw                 N/A
local clear watch            lcw                       lcw                N/A

examples

KDB(0)> wr utsname 3  <== set a break on read of utsname for 3 bytes
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
KDB(0)> q  <== exit the debugger
# uname -a  <== run some command that will read the utsname
Watch trap: 001CB9C8 <utsname+000000>
.umem_move+000030 lbzx r7,r6,r3 r7=000000000000B6B4, r6=0, r3=00000000001CB9C8
KDB(0)> wr  <== verify the number of hits (hit=1 below)
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
KDB(0)> cw  <== clear watch break points
KDB(0)> lwr utsname  <== now set a local watch break point (only cpu 0)
CPU 0: utsname+000000 eaddr=001CB9C8 size=8 hit=0 mode=R Xlate ON
KDB(0)> lcw  <== clear local watch break points
KDB(0)> q  <== exit kdb, will resume the current thread
AIX oc3b42 0 5 000714834C00

KDB machine status sub commands

Introduction The table below lists the status sub commands and their matching crash/lldb sub commands, when available.

stat sub command

The stat subcommand displays system statistics, including the last kernel printf() messages, still in memory. The following information is displayed for a processor that has crashed:

• Processor logical number • Current Save Area (CSA) address • LED valueFor the KDB Kernel Debugger this subcommand also displays the reason why the debugger was entered. There is one reason per processor.

Continued on next page

machine statusfunction

crash/lldb sub commands

KDB sub commands

kdb sub commands

system status message stat/reason/sysinfo stat stat


example
KDB(0)> stat
SYSTEM_CONFIGURATION:
POWER_PC POWER_630 machine with 2 cpu(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. oc3b42
release... 0
version... 5
machine... 000714834C00
nid....... 0714834C
Debugger entered via keyboard.
age of system: 18 hr., 8 min., 13 sec.
xmalloc debug: enabled
Debug kernel error message: No debug cause was specified.
SYSTEM MESSAGES:
AIX Version 5.0
Starting NODE#000 physical CPU#002 as logical CPU#001... done.
kmod_load failed for psekdb
All Rights Reserved (C) Copyright Platypus Technology International Holdings Limited
qik_alert: Unit is not ready!
init 0.?.?.?.?.?.?.?.?.?.?
ERROR LOG: for mtn_get_adpt_info, location= 1 0, 1105A, DEAFDEAF.?.?.?.?.?.?.?.?.?.?.
qik_alert: Unit is not ready!
init 2.?!.?.?.?.?.?.?.?.?.?.!
 J2 Bring Up: gfs:0x00000001
Number of CPUs: 2
L1 Data Cache Line Size: 128
System Memory Size: 512 MByte
VMM minPageReadAhead:2 maxPageReadAhead:8
nCacheClass:5
iCache: inodeSize:888(vode:88,inode:800(gnode:104,dinode:512))
iCache: nInode:52225 nCacheClass:5 nHashClass:8192
nCache: nName:65536 nHashClass:8192
jCache: nBuffer:5120 bufferHeaderSize(176:208)
jCache: nCacheClass:5 nBufferPerCacheClass:1024
vmPager: nBufferPerPagerDevice:512
txCache: nTxBlock:1024 txBlockSize:88
txCache: nTxLock:57400 txLockSize:72 lockShortage:53813
j2_debug: Error Log Table j2Error:0x003F5580
j2_debug: Event Trace Table j2Trace:0x003F9588
 J2 Bring UP Complete.
j2_mount: Mount Failure: File System Dirty.
lockd: cannot contact statd(), continuing
<- end_of_buffer
KDB(0)> q
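The SYSTEM STATUS fields in the stat output carry the same utsname values that uname -a prints. As a rough illustration, the following Python sketch parses those "name... value" lines and rebuilds a uname-style string. The helper names (parse_system_status, uname_like) are hypothetical and not part of KDB, and the line format is assumed from this one sample only.

```python
import re

# Lines copied from the SYSTEM STATUS part of the stat example.
STAT_LINES = """\
sysname... AIX
nodename.. oc3b42
release... 0
version... 5
machine... 000714834C00
nid....... 0714834C
"""

# Assumed format: a field name, a run of dots, whitespace, then the value.
FIELD_RE = re.compile(r"^(\w+)\.+\s+(\S+)$")


def parse_system_status(text):
    """Collect 'name... value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def uname_like(fields):
    """Join the fields in the order uname -a prints them."""
    order = ("sysname", "nodename", "release", "version", "machine")
    return " ".join(fields[name] for name in order)


status = parse_system_status(STAT_LINES)
print(uname_like(status))
```

For the sample above this prints AIX oc3b42 0 5 000714834C00, the same string that uname -a returned in the watch break point example.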


KDB kernel extension loader sub commands

Introduction

The following table lists the kernel extension loader subcommands and their matching crash/lldb subcommands, when available.

kernel extension loader function    crash/lldb subcommands    KDB subcommands    kdb subcommands
list loaded extension               le                        lke                lke
list loaded symbol tables                                     stbl               stbl
remove symbol table                                           rmst               rmst
list export tables                  map                       exp                exp

lke and stbl subcommands

The lke and stbl subcommands display the current state of loaded kernel extensions. They accept the following parameters:

• -l: list the current entries in the name list cache.
• Address: effective address of the text or data area of a loader entry. The specified entry is displayed and the name list cache is loaded with data for that entry. The address can be specified as a hexadecimal value, a symbol, or a hexadecimal expression.
• -a addr: display and load the name list cache with the loader entry at the specified address. The address can be a hexadecimal value, a symbol, or a hexadecimal expression.
• -p pslot: display the shared library loader entries for the indicated process slot. The value for pslot must be a decimal process slot number.
• -l32: display loader entries for 32-bit shared libraries.
• -l64: display loader entries for 64-bit shared libraries.
• slot: slot number. The specified value must be a decimal number.

rmst subcommand

A symbol table can be removed from KDB using the rmst subcommand. This subcommand requires either a slot number or the effective address of the loader entry for the symbol table.


exp subcommand

The exp subcommand looks for an exported symbol or displays the entire export list. If no argument is specified, the entire export list is printed. If a symbol name is specified as an argument, all symbols that begin with the input string are displayed.
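The matching rule described above (everything when no argument is given, otherwise a prefix match) can be sketched in a few lines of Python. This is only an illustration of the rule: match_exports is a hypothetical helper, and the symbol names in the list are made up for the example rather than taken from a real export table.

```python
def match_exports(export_list, prefix=None):
    """Mimic the selection rule described for exp.

    No argument: return the whole export list.
    With an argument: return the symbols that begin with that string.
    """
    if prefix is None:
        return list(export_list)
    return [name for name in export_list if name.startswith(prefix)]


# Hypothetical export names, for illustration only.
exports = ["kbdconfig", "kbdopen", "mseconfig"]

print(match_exports(exports, "kbd"))   # prefix match
print(match_exports(exports))          # entire list
```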

examples
(0)> nm kbdconfig <== get address for symbol kbdconfig
Symbol Not Found <== not found because it is in a kernext not in cache
(0)> lke -l <== the cache is empty
KERNEXT FUNCTION NAME CACHE empty
(0)> lke <== list kernel extensions
..
21 01978B00 01AE9000 000063D0 00080262 msedd_chrp64/usr/lib/drivers/isa/msedd_chrp
22 01978900 01ACA000 00008F68 00080262 kbddd_chrp64/usr/lib/drivers/isa/kbddd_chrp
(0)> lke 22 <== load kernext into the cache
 ADDRESS FILE FILESIZE FLAGS MODULE NAME
22 01978900 01ACA000 00008F68 00080262 kbddd_chrp64/usr/lib/drivers/isa/kbddd_chrp
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 01978A00 le_svc_sequence 66666666
le_fp.......... 00000000
le_filename.... 01978988 le_file........ 01ACA000
le_filesize.... 00008F68 le_data........ 01AD2100
le_tid......... 01AD2100 le_datasize.... 00000E68
le_usecount.... 00000002 le_loadcount... 00000002
le_ndepend..... 00000001 le_maxdepend... 00000001
le_ule......... 00000000 le_deferred.... 00000000
le_exports..... 00000000 le_de.......... 6666666666666666
le_searchlist.. 00000000 le_dlusecount.. 00000000
le_dlindex..... FFFFFFFF le_lex......... 00000000
le_fh.......... 00000000 le_depend.... @ 01978980
TOC@........... 01AD2C50
 <PROCESS TRACE BACKS>
 .ureg_pm 01ACA1C0 .reg_pm 01ACA25C
 .qvpd 01ACA3B4 .initadpt 01ACA520
 .cleanup 01ACA754 .kbdconfig 01ACA924
..
(0)> lke -l <== see if it was loaded correctly
KERNEXT FUNCTION NAME CACHE
 .ureg_pm 01ACA1C0 .reg_pm 01ACA25C
 .qvpd 01ACA3B4 .initadpt 01ACA520
 .cleanup 01ACA754 .kbdconfig 01ACA924
..
(0)> nm kbdconfig <== now see if we find the address for the symbol
Symbol Address : 01ACA924
TOC Address : 01AD2C50
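The sequence in this example (nm fails, lke 22 loads the extension's symbols, nm then succeeds) can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not KDB's implementation: the NameListCache class is hypothetical, it models only the kernel extension name cache (not the kernel's own symbol table), and the leading-dot lookup simply mirrors how .kbdconfig appears in the trace back list.

```python
class NameListCache:
    """Toy model of the kernel extension function name cache."""

    def __init__(self):
        self._symbols = {}

    def load(self, traceback_symbols):
        """Model 'lke <slot>': add the entry's trace back symbols."""
        self._symbols.update(traceback_symbols)

    def entries(self):
        """Model 'lke -l': return the cached names and addresses."""
        return dict(self._symbols)

    def lookup(self, name):
        """Model 'nm <name>' for cached names; None means 'Symbol Not Found'."""
        for candidate in (name, "." + name):
            if candidate in self._symbols:
                return self._symbols[candidate]
        return None


# Trace back symbols for slot 22 (kbddd_chrp), copied from the example.
KBDDD_SYMBOLS = {
    ".ureg_pm": 0x01ACA1C0,
    ".reg_pm": 0x01ACA25C,
    ".qvpd": 0x01ACA3B4,
    ".initadpt": 0x01ACA520,
    ".cleanup": 0x01ACA754,
    ".kbdconfig": 0x01ACA924,
}

cache = NameListCache()
before = cache.lookup("kbdconfig")   # cache is empty, so nothing is found
cache.load(KBDDD_SYMBOLS)            # comparable to 'lke 22'
after = cache.lookup("kbdconfig")    # now resolves
print(before, format(after, "08X"))
```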


KDB address translation sub commands

Introduction

The following table lists the address translation subcommands and their matching crash/lldb subcommands, when available.

address translation function    crash/lldb subcommands    KDB subcommands    kdb subcommands
translate to real address       xlate                     tr                 tr
display MMU translation         xlate                     tv                 tv

tr and tv subcommands

The tr and tv subcommands display address translation information. The tr subcommand provides a short format; the tv subcommand provides a detailed format.

The tv subcommand dumps all double-hashed entries. When an entry matches the specified effective address, the corresponding physical address and protections are displayed. Page protection (K and PP bits) is displayed according to the current segment register and machine state register values.

The tr and tv subcommands take the following argument:

• Address: effective address for which translation details are to be displayed. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

examples
(0)> tr @iar <== display the physical address of the current instruction
Physical Address = 000000000002CB58
(0)> tv @iar <== display the physical mapping of the current instruction
eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash1 0000002C
p64pte_cur_addr 0000000001001600 sid 0000000000000000 avpi 00 hsel 0 valid 1
rpn 000000000000002C refbit 1 modbit 0 wimg 2 key 3
____ 000000000002CB58 ____ K = 0 PP = 11 ==> read only

eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash2 0000FFD3
Physical Address = 000000000002CB58
(0)>
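The vpage, hash1, and hash2 values in the tv example are consistent with the usual PowerPC hashed page table scheme: the primary hash is the segment ID XORed with the page index, and the secondary hash is its one's complement. The Python sketch below reproduces the sample numbers under stated assumptions: 4 KB pages, 256 MB segments, and a 16-bit hash mask inferred only from the hash2 value shown (the real mask depends on the hash table size). The K/PP decoding follows the classic PowerPC protection table. Function names are hypothetical.

```python
PAGE_SHIFT = 12       # assumption: 4 KB pages
SEGMENT_BITS = 28     # assumption: 256 MB segments
HASH_MASK = 0xFFFF    # assumption: width inferred from hash2 in the sample


def page_index(eaddr):
    """Page number of the effective address within its segment."""
    return (eaddr & ((1 << SEGMENT_BITS) - 1)) >> PAGE_SHIFT


def hashes(sid, vpage):
    """Primary hash and its one's complement (the secondary hash)."""
    hash1 = (sid ^ vpage) & HASH_MASK
    hash2 = ~hash1 & HASH_MASK
    return hash1, hash2


def protection(k, pp):
    """Decode the key bit K and the two PP bits into an access right."""
    if pp == 0b11:
        return "read only"
    if pp == 0b10:
        return "read/write"
    if pp == 0b01:
        return "read only" if k else "read/write"
    return "no access" if k else "read/write"


eaddr, sid = 0x2CB58, 0x0          # values from the tv example
vpage = page_index(eaddr)
hash1, hash2 = hashes(sid, vpage)
print(format(vpage, "X"), format(hash1, "04X"), format(hash2, "04X"))
print(protection(0, 0b11))
```

With the example's inputs this yields vpage 2C, hash1 002C, hash2 FFD3, and "read only" for K = 0, PP = 11, matching the tv output.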


KDB process/thread sub commands

Introduction

The following table lists the process/thread subcommands and their matching crash/lldb subcommands, when available.

process function              crash/lldb subcommands    KDB subcommands    kdb subcommands
display per processor data    ppd                       ppda               ppda
display interrupt handler                               intr               intr
display mst area              mst/tcb                   mst                mst
display process table         proc                      proc               proc
display thread table          th                        th                 th
display thread tid            th                        ttid               ttid
display thread pid                                      tpid               tpid
display user area             user                      user               user
switch thread                 cm                        sw/switch          sw/switch

ppda subcommand

The ppda subcommand displays per processor data areas. It accepts the following arguments:

• no argument: display the data area for the current processor.
• *: display a summary for all CPUs.
• cpu: display the ppda data for the specified CPU. This argument must be a decimal value.
• Address: effective address of a ppda structure to display. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

intr subcommand

The intr subcommand prints entries in the interrupt handler table. It accepts the following arguments:

• no argument: display a summary of all entries in the interrupt handler table.
• slot: slot number in the interrupt handler table. This must be a decimal value.
• Address: effective address of an interrupt handler. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.


mst subcommand

The mst subcommand prints the Machine State Save Area. It accepts the following arguments:

• no argument: display the current context.
• slot: thread slot number. This must be a decimal value.
• Address: effective address of an mst to display. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

proc subcommand

The proc subcommand displays process table entries. It accepts the following arguments:

• *: display a summary for all processes.
• -s flag: display only processes with a process state matching that specified by flag. The allowable values for flag are: SNONE, SIDLE, SZOMB, SSTOP, SACTIVE, and SSWAP.
• slot: process slot number. This must be a decimal value.
• Address: effective address of a process table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.
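The effect of the -s flag can be pictured as a filter over the process summary rows. The Python sketch below is a hedged illustration, not how KDB does it: the row layout is taken from the "p * -s SACTIVE" sample later in this lesson, the mapping from flag to displayed state (dropping the leading S, so SACTIVE matches ACTIVE) is an assumption based on that single sample, and the third row is invented purely to give the filter something to exclude.

```python
SAMPLE_ROWS = """\
pvproc+000000 0 swapper ACTIVE 0000000 0000000 0000000000000B00 0 0001
pvproc+001180 7 gil ACTIVE 000070E 0000000 000000000000DB1A 65 0005
pvproc+001400 8 example ZOMB 0000810 0000001 0000000000000000 0 0000
"""
# The first two rows come from the lesson's sample output.
# The third row is hypothetical, added only for this illustration.


def parse_row(line):
    """Split one summary row into the columns used below."""
    fields = line.split()
    return {
        "entry": fields[0],
        "slot": int(fields[1]),        # slot numbers are decimal
        "name": fields[2],
        "state": fields[3],
        "pid": int(fields[4], 16),     # PIDs are shown in hexadecimal
    }


def filter_by_state(rows, flag):
    """Keep rows whose STATE matches the flag (assumed: flag minus leading 'S')."""
    wanted = flag[1:] if flag.startswith("S") else flag
    return [row for row in rows if row["state"] == wanted]


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines()]
active = filter_by_state(rows, "SACTIVE")
print([row["name"] for row in active])
```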

th subcommand

The th subcommand displays thread table entries. It accepts the following arguments:

• no argument: display the current thread.
• *: display a summary for all thread table entries.
• -w flag: display a summary of all thread table entries with a wtype matching the one specified by the flag argument. Valid values for the flag argument include: NOWAIT, WEVENT, WLOCK, WTIMER, WCPU, WPGIN, WPGOUT, WPLOCK, WFREEF, WMEM, WLOCKREAD, WUEXCEPT, and WZOMB.
• slot: thread slot number. This must be a decimal value.
• Address: effective address of a thread table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

ttid and tpid subcommands

The ttid and tpid subcommands display, respectively:

• the thread table entry for a given thread ID
• the thread table entries for a given process ID


user subcommand

The user subcommand displays u-block information. It accepts the following arguments:

• no argument: display information for the current process.
• slot: slot number of a thread table entry. This argument must be a decimal value.
• Address: effective address of a thread table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.

The following parameters can be used to reduce the output from user:

• -ad: display adspace information only.
• -cr: display credential information only.
• -f: display file information only.
• -s: display signal information only.
• -ru: display profiling/resource/limit information only.
• -t: display timer information only.
• -ut: display thread information only.
• -64: display 64-bit user information only.
• -mc: display miscellaneous user information only.

sw subcommand

By default, KDB shows the virtual space for the current thread. The sw subcommand selects the thread to be considered the current thread. Threads can be specified by slot number or address. The current thread can be reset to its initial context by entering the sw subcommand with no arguments. For the KDB Kernel Debugger, the initial context is also restored whenever the debugger is exited.

The sw subcommand accepts the following arguments:

• u: flag to switch to user address space for the current thread.
• k: flag to switch to kernel address space for the current thread.
• th_slot: thread slot number. This argument must be a decimal value.
• th_Address: address of a thread slot. Symbols, hexadecimal values, or hexadecimal expressions can be used to specify the address.


examples
(0)> ppda * <== display all ppda summary
 SLT CSA CURTHREAD SRR0
ppda+000000 0 F00000002FF3B400 KERN_heap+40ABC00 D0059E18
ppda+001000 1 F00000002FF3B400 KERN_heap+E8FBC00 000000010000120C
(0)> ppda <== display ppda for current cpu (0)
Per Processor Data Area [0014ED80]
csa..............F00000002FF3B400 mstack...........0000000000838DF8
fpowner..........0000000000000000 curthread........F1000097140ABC00
syscall..........000000000008202E intr.............0000000000000000
i_softis.....................0000 i_softpri....................0000
prilvl...........F1000097140C1600 worst_run_pri................00FF
run_pri........................FF v_pnda...........00000000001FC570
cpunidx......................0000
ppda_pal[0]..............00000000 ppda_pal[1]..............00000000
ppda_pal[2]..............00000000 ppda_pal[3]..............00000000
phy_cpuid....................0000 sradid.......................0000
slb_reload_index.............0000 ppda_fp_cr...............00000000
flih save[0].....0000000020000000 flih save[1].....000000000001E10C
flih save[2].....A000000000009032 flih save[3].....0000000000000000
flih save[4].....0FFFFFFFF3FFFE80 flih save[5].....000000000046AC80
flih save[6].....0000000000000000 flih save[7].....0000000000000000
flih save[8].....0000000000000000 flih save[9].....0000000000000000
flih save[10].....0000000000000000
usegp............0000000000000000 srflag...........7000000000000000
srsave[0]........000000000000736F srsave[1]........000000000000736F
srsave[2]........0000000000000000 srsave[3]........0000000000000000
srsave[4]........0000000000000000
gsegs[0].eaddr...0000000000000000 gsegs[0].vsid....0000000000000000
gsegs[1].eaddr...0000000000000000 gsegs[1].vsid....0000000000000000
gsegs[2].eaddr...0000000000000000 gsegs[2].vsid....0000000000000000
gsegs[3].eaddr...0000000000000000 gsegs[3].vsid....0000000000000000
Useracc addr.........0000000000000000
Useracc size.........0000000000000000
dsisr....................42000000 dsi_flag.................00000003
dar..............0000000020010920
dssave[0]........0000000000000020 dssave[1]........000000002FF226F0
dssave[2]........00000000F009E9BC dssave[3]........000000002000F8E0
dssave[4]........00000000F0046E28 dssave[5]........0000000000000000
dssave[6]........0000000000000000 dssave[7]........00000000200454E0
dssrr0...........00000000D0052904 dssrr1...........200000000000D0B2
dssprg1..........000000002FF22D54 dsctr............0000000002155980
dslr.............000000000038F248 dsxer....................20000008
dsmq.....................00000000 pmapstk..........00000000001CF8D0
pmapsave64.......0000000000000000 pmapcsa..........0000000000000000
schedtail[0].....0000000000000000 schedtail[1].....0000000000000000
schedtail[2].....0000000000000000 schedtail[3].....0000000000000000
cpuid........................0000 stackfix.......................00
lru............................00 vmflags..................00000000
sio............................00 reservation....................00
hint...........................00 no_vwait.......................00
lock.....................00000000
scoreboard[0]....0000000000000000 scoreboard[1]....0000000000000000
scoreboard[2]....0000000000000000 scoreboard[3]....0000000000000000
scoreboard[4]....0000000000000000 scoreboard[5]....0000000000000000
scoreboard[6]....0000000000000000 scoreboard[7]....0000000000000000


intr_res1................00000000 intr_res2................00000000
mpc_pend.................00000000 iodonelist.......0000000000000000
run_queue........F1000097140A1000 global_run_queue.F1000097140A0118
ppda_timer.... @ 000000000014F0B0 decompress.......0000000000000000
TB_ref_u.................01580CBC TB_ref_l.................40000000
sec_ref..................39B7F005 nsec_ref.................0C3B4A07
_ficd....................00000000 icndx........................07F7
ppda_qio.................00000000 cs_sync..................00000000
perfmon_sv[0]....0000000000000000 perfmon_sv[1]....0000000000000000
thread_private...........00000000 cpu_priv_seg.....0000000000000000
ri_flih_paddr....0000000000F28F00 ri_save6.........0000000000000000
util_start_time..........00000000 util_accumulator.........00000000
ppda_ha_event....0000000000000000 ppda_ha_fun......0000000000000000
ppda_ha_arg......0000000000000000
wp_available.............00000001
frs_id.......................0000 memp_id........................00
newprivseg...............00000000
trace vectors. @ 000000000014F1F0 ppda_trcbufp0....0000000000000000
wlm_cpulocal_dataF100009716320000
WLM (Only non-null slots are shown)........
Slot time npages
ppda_dseg_count..0000000000000000 ppda_iseg_count..0000000000000000
ppda_emul_tptr...0000000000000000 ppda_align_iar...000000000000B658
ppda_align_tptr..F1000097165A2A00 ppda_align_ea....F1000082C01BC926
ppda_emul_iar....0000000000000000 ppda_emul_count..........00000000
ppda_align_count.........00451303 radindex...... @ 000000000014EE84
TIMER....................
t_free...........F10000971E87D200 t_active.........F100009713FF3100
t_freecnt................00000001 trb_called.......0000000000000000
trb_lock...... @ 000000000014F0D0 trb_lock.........0000000000000000
systimer.........F100009713FF3100 ticks_its................00000042
ref_time.tv_sec..0000000039B7F006 ref_time.tv_nsec.........0EA6319F
time_delta.......0000000000000000 time_adjusted....F100009713FF3100
wtimer.next......F100009716458180 wtimer.prev......F10000971ECD42D0
wtimer.func......0000000000203F80 wtimer.count.....0000000000000000
wtimer.restart...0000000000000000 w_called.........0000000000000000
watchdog_lock. @ 000000000014F138 watchdog_lock....0000000000000000
KDB......................
kdb_ppda_r0......0000000000000001 kdb_ppda_r1......000000002FF228B0
kdb_ppda_r2......00000000F01951F4 kdb_ppda_r15.....000000002FF22D54
kdb_ppda_srr0....00000000D043AB18 kdb_ppda_srr1....200000000004D0B2
flih_save................22282229 proc_state...............0000000B
csa..............0000000000CD8A88
ri_flih_paddr....0000000000F28F00 ri_r6............0000000000000000
(0)> intr <== display the interrupt handler table
 SLT INTRADDR HANDLER TYPE LEVEL PRIO BID FLAGS
i_data+0000E8 5 F1000097140B0FC0 00000000 0004 00000004 0003 900000C0 0050
i_data+0000E8 5 F10000971ECD4000 019EA5C0 0004 0000000D 0003 900000C0 0050
..


(0)> mst <== display the current mst
Machine State Save Area
iar : 000000000002CB58 msr : A0000000000010B2 cr : 28442224
lr : 0000000000000000 ctr : 00000000003E2150 xer : 20000000
mq : FFFFFFFF asr : 0000000005622001
r0 : 0000000044484244 r1 : F00000002FF3B200 r2 : 000000000046AC80
r3 : 00000000003356E4 r4 : A0000000000090B2 r5 : F1000097163BF301
r6 : F00000002FF3AF40 r7 : 0000000000000105 r8 : 000000000014FD80
r9 : 0000000000000001 r10 : 00000000000021B6 r11 : 0000000000000105
r12 : 000000000020CDD0 r13 : F1000097140AB600 r14 : 0000000000000004
r15 : 0000000011000081 r16 : 0000000070000080 r17 : 0000000000000001
r18 : 0000000000000003 r19 : 0000000000000000 r20 : 00000000FFFEFBFF
r21 : F1000097140AB778 r22 : 0000000048242224 r23 : 0000000000000000
r24 : 0000000000000000 r25 : 000000000000000B r26 : 0000000000000000
r27 : F100008080000280 r28 : F100008090000080 r29 : F1000097140C1A00
r30 : F1000097140AB600 r31 : 0000000000000004
s0 : 0000000000000000 s1 : 000000000FFFFFFF s2 : 000000000FFFFFFF
s3 : 000000000FFFFFFF s4 : 000000000FFFFFFF s5 : 000000000FFFFFFF
s6 : 000000000FFFFFFF s7 : 000000000FFFFFFF s8 : 000000000FFFFFFF
s9 : 000000000FFFFFFF s10 : 000000000FFFFFFF s11 : 000000000FFFFFFF
s12 : 000000000FFFFFFF s13 : 000000000FFFFFFF s14 : 000000000FFFFFFF
s15 : 000000000FFFFFFF
prev 0000000000000000 stackfix F00000002FF3B200
kjmpbuf 0000000000000000 excbranch 0000000000000000
intpri 00 backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000 ptaseg F100000050000000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
Except : csr 0000000000000000 dsisr 0000000042000000 bit set: DSISR_PFT DSISR_ST
 esid 000000002000796E dar F10000971F15700C dsirr 0000000000000106
(0)> p * -s SACTIVE <== display all active process
 SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+000000 0 swapper ACTIVE 0000000 0000000 0000000000000B00 0 0001
pvproc+000280 1 init ACTIVE 0000001 0000000 000000000000E2FD 0 0001
pvproc+000500 2 wait ACTIVE 0000204 0000000 0000000000001B02 0 0001
pvproc+000780 3 wait ACTIVE 0000306 0000000 0000000000002B04 0 0001
pvproc+000A00 4 lrud ACTIVE 0000408 0000000 0000000000003B06 65 0001
pvproc+000C80 5 xmgc ACTIVE 000050A 0000000 000000000000BB16 65 0001
pvproc+000F00 6 netm ACTIVE 000060C 0000000 000000000000CB18 65 0001
pvproc+001180 7 gil ACTIVE 000070E 0000000 000000000000DB1A 65 0005
..
(0)> th -w NOWAIT <== display all thread that wait for nothing
 SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000180 3!wait RUN 000307 0FF 1 00001 0


(0)> th 3 <== now display details on thread 3
 SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000180 3>wait RUN 000307 0FF 1 00001 0
NAME................ wait
FLAGS............... KTHREAD
WTYPE............... WCPU
.................tid :0000000000000307 ......tsleep :FFFFFFFFFFFFFFFF
...............flags :00001000 ..............flags2 :00000000
DATA.........pvprocp :F100008080000780 <pvproc+000780>
LINKS.....prevthread :F100008090000180 <pvthread+000180>
..........nextthread :F100008090000180 <pvthread+000180>
DISPATCH.......synch :FFFFFFFFFFFFFFFF
SCHEDULER...affinity :00000001 .................pri :000000FF
.............boosted :00000000 ...............wchan :0000000000000000
...............state :00000002 ...............wtype :00000004
CHECKPOINT......vtid :00000000
LOCK........ lock_d @ F100008090000190 0000000000000000
PROCFS......procfsvn :0000000000000000
THREAD.......threadp :F1000097140AB000 ........size :00000080
FLAGS............... SIGAVAIL KTHREAD FUNNELLED SIGSLIH SIGINTR
.................tid :0000000000000307 ......stackp :0000000000000000
.................scp :0000000000000000 .......ulock :0000000000000000
...............uchan :0000000000000000 ....userdata :0000000000000000
..................cv :0000000000000000 .......flags :0000000000003004
..............atomic :0000000000000000 ......flags2 :0000000000000000
DATA...........procp :F1000097140ABE00 <KERN_heap+40ABE00>
...........pvthreadp :F100008090000180 <pvthread+000180>
...............userp :F00000002FF3B898 <__ublock+000498>
............uthreadp :F00000002FF3B400 <__ublock+000000>
SLEEP/LOCK......usid :0000000000000000 ......wchan1 :0000000000000000
..............wchan2 :0000000000000000 ......swchan :0000000000000000
...........eventlist :0000000000000000 ......result :00000000
.............polevel :00000000 ..............pevent :0000000000000000
..............wevent :0000000000000000 .......slist :0000000000000000
...........wchan1sid :0000000000000000 wchan1offset :00000000
...........lockcount :00000000 ..........adsp_flags :0000
DISPATCH.......ticks :0000BC2C ...............prior :F1000097140AB000
................next :F1000097140AB000 ......dispct :00000000008B4EF3
...............fpuct :0000000000000000
MISC........graphics :0000000000000000 ...pmcontext :0000000000000000
...........lockowner :0000000000000000 ..kthreadseg :0000000107FFFFFF
..........time_start :0000000000000000 ..........wlm_charge :0
SIGNAL........sigproc:00000000 ..............cursig :00000000
......(pending) sig :[3] 0000000000000000 .................[2] 0000000000000000
......................[1] 0000000000000000 .................[0] 0000000000000000
............sigmask :[3] 0000000000000000 .................[2] 0000000000000000
......................[1] 0000000000000000 .................[0] 0000000000000000
SCHEDULER......cpuid :00000001 ..............scpuid :00000001
.........affinity_ts :0006A57F ..............policy :00000001
.................cpu :00000078 .............lockpri :00000000
.............wakepri :000000FF ................time :00000000
.............sav_pri :000000FF ...........run_queue :F1000097140A2000
................cpu2 :00000078
.............suspend :00000001 .............fsflags :00000000
..........norun_secs :00000000
CHECKPOINT..chkerror :0000 ............chkblock :00000000
PROCFS.......whystop :00000000 ............whatstop :00000000
..............weight :00000008 ........allowed_cpus :C0000000
.......prefunnel_cpu :00000000
......threadcontrolp :0000000000000000
...........controlvm :0000000000000000
PVTHREAD...pvthreadp :F100008090000180 ........size :00000080


(0)> tpid 70e <== now display threads for gil (pid 70e), which should have 5 threads
 SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000380 7 gil SLEEP 00070F 025 1 65
pvthread+000580 11 gil SLEEP 000B17 025 1 65 netisr_servers
pvthread+000500 10 gil SLEEP 000A15 025 1 65 netisr_servers
pvthread+000480 9 gil SLEEP 000913 025 1 65 netisr_servers
pvthread+000400 8 gil SLEEP 000811 025 1 65 netisr_servers
(0)> user -ad 5 <== display address space for thread 5
User-mode address space mapping:
 segs32_raddr.0000000000000000
uadspace node allocation......(U_unode) @ F00000002FF3E028
usr adspace 32bit process.(U_adspace32) @ F00000002FF3E048
segment node allocation.......(U_snode) @ F00000002FF3E008
segnode for 32bit process...(U_segnode) @ F00000002FF3E2A8
U_adspace_lock @ F00000002FF3E4E8
 lock_word.....0000000000000000 vmm_lock_wait.0000000000000000
V_USERACC strtaddr:0x0000000000000000 Size:0x0000000000000000
vmmflags......00000000
(0)> sw 5 <== switch to thread 5
Switch to thread: <pvthread+000280>
(0)> tpid <== display the current tpid, which should be slot 5
 SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000280 5*xmgc SLEEP 00050B 03C 1 65 KERN_heap+ECD5730
(0)> sw <== switch back to initial thread
Switch to initial thread: <pvthread+001200>
(0)> tpid <== display the current tpid, which should be the initial pvthread+001200
 SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001200 36*kdb_64 RUN 002467 03C 0 0
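In the summary listings of these examples, the offset printed after pvproc+ or pvthread+ appears to be the slot number multiplied by a fixed entry size: 0x280 for pvproc and 0x80 for pvthread (the latter agrees with the "size :00000080" field in the th 3 output). The Python sketch below converts between the two forms. The entry sizes are inferred from this sample output only and may differ on other AIX levels; the helper names are hypothetical.

```python
# Entry sizes inferred from the sample listings; treat them as assumptions.
ENTRY_SIZE = {
    "pvproc": 0x280,
    "pvthread": 0x80,
}


def slot_of(entry):
    """Return the slot number for a string such as 'pvthread+000280'."""
    table, offset = entry.split("+")
    return int(offset, 16) // ENTRY_SIZE[table]


def entry_of(table, slot):
    """Return the 'table+offset' form for a decimal slot number."""
    return "{}+{:06X}".format(table, slot * ENTRY_SIZE[table])


print(slot_of("pvproc+001180"))     # the gil process
print(slot_of("pvthread+001200"))   # the kdb_64 thread
print(entry_of("pvthread", 5))      # thread slot 5, as used by 'sw 5'
```

For the rows shown in the examples this gives slot 7 for pvproc+001180, slot 36 for pvthread+001200, and pvthread+000280 for thread slot 5, matching the SLOT column.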


KDB Kernel stack sub commands

Introduction

The following table lists the kernel stack sub commands and their matching crash/lldb sub commands, when available.

| kernel stack function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| trace a kernel stack | fs | f | f |

f sub command

The f (stack) sub command displays all the stack frames from the current instruction as deep as possible. Interrupts and system calls are crossed, and the user stack is also displayed. In user space, the trace back allows symbolic names to be displayed. The amount of data displayed can be controlled through the mst_wanted and display_stacked_frames options of the set sub command. You can also request the stacked registers with the display_stacked_regs set option.

The f sub command can be invoked with the following arguments :

• no argument : the stack for the current thread is displayed.
• +x : flag to include hex addresses as well as symbolic names for calls on the stack. This option remains set for future invocations of the stack subcommand, until changed via the -x flag.
• -x : flag to suppress display of hex addresses for functions on the stack. This option remains in effect for future invocations of the stack subcommand, until changed via the +x flag.
• tslot : decimal value indicating the thread slot number.
• Address : hex address, hex expression, or symbol indicating the effective address for a thread slot.


Examples

(0)> f +x <== display the stack frame for the current thread
pvthread+000380 STACK:
[0002CB58]et_wait+00036C (0000000000212A0C, A0000000000010B2, 0000000000122A0C [??])
[000EF170]netthread_start+0000B8 ()
[00060F6C]procentry+000010 (??, ??, ??, ??)
(0)> f -x <== display the stack frame without addresses
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2, .v_prepin+000000 [??])
netthread_start+0000B8 ()
procentry+000010 (??, ??, ??, ??)
(0)> set 10 <== want to see the stacked registers
display_stacked_regs is true
(0)> f <== show the stack frame with stacked registers
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2, .v_prepin+000000 [??])
 r31 : 0000000000000000 r30 : 0FFFFFFFF0100000 r29 : 0000000000205E38
 r28 : 00000000DEADBEEF r27 : 00000000DEADBEEF r26 : 00000000DEADBEEF
 r25 : 00000000DEADBEEF r24 : 00000000DEADBEEF r23 : 00000000DEADBEEF
 r22 : 00000000DEADBEEF r21 : 00000000DEADBEEF r20 : 00000000DEADBEEF
 r19 : 00000000DEADBEEF r18 : 00000000DEADBEEF r17 : 00000000DEADBEEF
 r16 : 00000000DEADBEEF r15 : 00000000DEADBEEF r14 : 00000000DEADBEEF
netthread_start+0000B8 ()
 r31 : 00000000DEADBEEF r30 : 00000000DEADBEEF r29 : 00000000DEADBEEF
procentry+000010 (??, ??, ??, ??)
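When stack traces like these are saved to a file for later comparison, it can help to split each frame line into its parts. The Python sketch below is one possible way to do that. The regular expression is based only on the line shapes shown in this example (an optional [address] prefix, then symbol+offset, then an argument list), and the names FRAME_RE and parse_frame are hypothetical.

```python
import re

# Pattern inferred from the sample output only: optional "[hexaddr]" (shown
# with f +x), then "symbol+hexoffset", then a parenthesised argument list.
FRAME_RE = re.compile(
    r"^(?:\[(?P<addr>[0-9A-Fa-f]+)\])?"
    r"(?P<symbol>[.\w]+)\+(?P<offset>[0-9A-Fa-f]+)"
    r"\s*\((?P<args>.*)\)\s*$"
)


def parse_frame(line):
    """Return a dict for a stack frame line, or None for any other line."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    addr = m.group("addr")
    return {
        "address": int(addr, 16) if addr else None,
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "args": [a.strip() for a in m.group("args").split(",") if a.strip()],
    }


SAMPLE = """\
pvthread+000380 STACK:
[0002CB58]et_wait+00036C (0000000000212A0C, A0000000000010B2, 0000000000122A0C [??])
[000EF170]netthread_start+0000B8 ()
[00060F6C]procentry+000010 (??, ??, ??, ??)
"""

# Header and register lines do not match the pattern and are skipped.
frames = [f for f in map(parse_frame, SAMPLE.splitlines()) if f is not None]
for frame in frames:
    print(frame["symbol"], hex(frame["offset"]), frame["args"])
```

The same function accepts the f -x form, in which case the address field is None.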


KDB LVM sub commands

Introduction

The following table lists the LVM sub commands and their matching crash/lldb sub commands, when available.

| LVM function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| display physical buffer | | pbuf | pbuf |
| display volume group | | volgrp | volgrp |
| display physical volume | | pvol | pvol |
| display logical volume | | lvol | lvol |

volgrp, pvol, lvol and pbuf sub commands

The volgrp, pvol, lvol and pbuf sub commands respectively display :

• volume group information (including lvol structures). volgrp addresses are registered in the devsw table, in the DSDPTR field.
• physical volume information. pvol addresses are registered within the volgrp structure.
• logical volume information. lvol addresses are registered within the volgrp and lvol structures.
• physical buffer information. pbuf addresses are registered within the volgrp and pvol structures.

All LVM sub commands take addresses as parameters.


Examples

(0)> dev 0xa <== get the device switch table entry for a volume group
Slot address F1000097140C3500
MAJOR: 00A
.
.
dump: 010E3D00
mpx: .nodev (0009E378)
revoke: .nodev (0009E378)
dsdptr: F10000971660D000 <== the pointer to the volgrp structure
selptr: 00000000
opts: 0000002A DEV_DEFINED DEV_MPSAFE
(0)> volgrp F10000971660D000
VOLGRP............. F10000971660D000
vg_lock............... FFFFFFFFFFFFFFFF partshift............. 0000000E
open_count............ 0000000A flags................. 00000000
lvols............... @ F10000971660D010 <== pointer to the lvol struct
pvols............... @ F10000971660E010 <== pointer to the pvol struct
major_num............. 0000000A
vg_id................. 0007148300004C00000000E12335DF7D
nextvg................ 00000000 opn_pin............. @ F10000971660E428
von_pid............... 00000A32 nxtactvg.............. 00000000
ca_freepvw............ 00000000 ca_pvwmem............. 00000000
ca_hld.............. @ F10000971660E488 ca_pv_wrt........... @ F10000971660E4A0
.
.
(0)> lvol F10000971E624E00 <== display one of the lvol structures
LVOL............ F10000971E624E00
work_Q.......... 00000000 lv_status....... 00000000
lv_options...... 00001000 nparts.......... 00000001
i_sched......... 00000000 nblocks......... 00034000
parts[0]........ F10000971E621A00 pvol@ F1000097163DF200 <== pointer to pvol structure
.............dev 8000000E00000001 start 002C9100
parts[1]........ 00000000
parts[2]........ 00000000
maxsize......... 00000000 tot_rds......... 00000000
complcnt........ 00000000 waitlist........ FFFFFFFF
stripe_exp...... 00000000 striping_width.. 00000000
lvol_intlock. @ F10000971E624E60 lvol_intlock.... 00000000
(0)> pvol F1000097163DF200 <== now display the pvol
PVOL............... F1000097163DF200
dev................ 8000000E00000001 xfcnt.............. 00000000
armpos............. 00000000 pvstate............ 00000000
pvnum.............. 00000000 vg_num............. 0000000A
fp................. F1000096000022F0 flags.............. 00000000
num_bbdir_ent...... 00000000 fst_usr_blk........ 00001100
beg_relblk......... 00867C2D next_relblk........ 00867C2D
lmax_relblk......... 00867D2C defect_tbl......... F1000097165F4C00
ca_pv............ @ F1000097163DF250 sa_area[0]....... @ F1000097163DF260
sa_area[1]....... @ F1000097163DF270 pv_pbuf.......... @ F1000097163DF280 <== pointer to pbuf
oclvm............ @ F1000097163DF3C8
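Following the chain devsw entry, volgrp, lvol, pvol means copying an address out of one structure display and passing it to the next sub command. When working from a saved session log, the "name....... value" pairs can be pulled out mechanically. The Python sketch below is an illustration only. The pattern is derived from the volgrp output shown in this example, it ignores fields written with a single dot or a "..(x)" marker, and the names FIELD_RE and parse_fields are hypothetical.

```python
import re

# "name.... value" or "name.... @ address", as seen in the volgrp display.
# Requires at least two dots between the field name and its value.
FIELD_RE = re.compile(r"([A-Za-z_][\w\[\]]*)\.{2,}\s*(@\s*)?([0-9A-Fa-f]+)\b")


def parse_fields(text):
    """Map each field name to its numeric value and whether it was
    printed with '@' (the address of an embedded member)."""
    fields = {}
    for name, at_sign, value in FIELD_RE.findall(text):
        fields[name] = {"value": int(value, 16), "is_address": bool(at_sign)}
    return fields


SAMPLE = """\
vg_lock............... FFFFFFFFFFFFFFFF partshift............. 0000000E
open_count............ 0000000A flags................. 00000000
lvols............... @ F10000971660D010
pvols............... @ F10000971660E010
major_num............. 0000000A
"""

volgrp = parse_fields(SAMPLE)
# The address printed after '@' is what would be passed to the next sub command.
print("lvols at %X" % volgrp["lvols"]["value"])
print("pvols at %X" % volgrp["pvols"]["value"])
```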


KDB SCSI sub commands

Introduction

The following table lists the SCSI sub commands and their matching crash/lldb sub commands, when available.

| SCSI function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| display ascsi | N/A | asc | asc |
| display vscsi | N/A | vsc | vsc |
| display scdisk | N/A | scd | scd |

asc, vsc and scd sub commands

The asc, vsc and scd sub commands respectively print :

• ascsi adapter information : the ascsiddpin kernext is used to locate the adp_ctrl structure.
• vscsi adapter information : the vscsiddpin kernext is used to locate the vscsi_ptrs structure.
• scdisk disk information : the scdiskpin kernext is used to locate the scdisk_list structure.

If no argument is specified, the asc subcommand loads the slot numbers with addresses from the adp_ctrl structure. The asc, vsc and scd sub commands can use the following arguments :

• no argument : prompt for the structure address.
• slot : slot number of the adp_ctrl, vscsi_ptrs or scdisk_list entry to be displayed. This value must be a decimal number.
• Address : effective address of the structure to display. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.


Examples

(0)> lke 57
     ADDRESS     FILE FILESIZE    FLAGS MODULE NAME
57  04E39480 01237AC0 00008958 00000262 /etc/drivers/ascsiddpin
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS
le_next........ 04E39400 le_fp.......... 00000000
le_filename.... 04E394D8 le_file........ 01237AC0
le_filesize.... 00008958 le_data........ 0123FE60
(0)> d 0123FE60 80
0123FE60: 0123 EE3C 0123 EE38 0123 EE34 0123 EE30 .#.<.#.8.#.4.#.0
0123FE70: 0123 EE2C 0123 EE28 0123 EE24 0123 EE20 .#.,.#.(.#.$.#.
0123FE80: 0123 EE80 0123 EEC0 0123 EF00 0123 EF40 .#...#...#...#.@
0123FE90: 0123 EF80 0123 EFC0 0123 F000 0123 F040 .#...#...#...#.@
0123FEA0: 0123 F080 0123 F0C0 0123 F100 0123 F140 .#...#...#...#.@
0123FEB0: 0123 F180 0123 F1C0 0123 F200 0123 F240 .#...#...#...#.@
0123FEC0: 0000 0000 0000 0002 0000 0002 5002 D000 ............P...
0123FED0: 5002 E000 0000 0000 0000 0000 0000 0000 P...............
(0)> asc <== run asc and enter the address we found previously
Unable to find <adp_ctrl>
Enter the adp_ctrl address (in hex): 0123FEC0
Adapter control [0123FEC0]
semaphore............00000000
num_of_opens.........00000002
num_of_cfgs..........00000002
ap_ptr[ 0]...........5002D000
ap_ptr[ 1]...........5002E000
.
.
(0)> asc 1 <== now that asc was run once, we can use slot numbers
Adapter info [5002E000]
ddi.resource_name..... ascsi1
intr............... @ 5002E000 ndd...................506FC020
seq_number............00000001 next..................00000000
local.............. @ 5002E1A4 ddi................ @ 5002E1D0
active_head...........00000000 active_tail...........00000000
wait_head.............00000000 wait_tail.............00000000
num_cmds_queued.......00000000 num_cmds_active.......00000000
adp_pool..............506C3128 surr_ctl........... @ 5002E22C
sta................ @ 5002E27C time_s.tv_sec.........00000000
time_s.tv_nsec........00000000 tcw_table.............506C3F9C
opened................00000001 adapter_mode..........00000001
adp_uid...............00000002 peer_uid..............00000000
sysmem................506C0000 sysmem_end............506C3FAD
busmem................00654000 busmem_end............00658000
tm_tcw_table..........00000000 eq_raddr..............00654000
dq_raddr..............00655000 eq_vaddr..............506C0000
dq_vaddr..............506C1000 sta_raddr.............00656000
sta_vaddr.............506C2000 bufs..................00658000
tm_sysmem.............00000000 wdog............... @ 5002E344
tm................. @ 5002E360 delay_trb.......... @ 5002E37C
xmem............... @ 5002E3B8 dma_channel...........04001000


mtu...................00141000 num_tcw_words.........00000011
shift.................00000000 tcw_word..............00000000
resvd1................00000000 cfg_close.............00000000
vpd_close.............00000000 locate_state..........00000004
locate_event..........FFFFFFFF rir_event.............FFFFFFFF
vpd_event.............FFFFFFFF eid_event.............FFFFFFFF
ebp_event.............FFFFFFFF eid_lock..............FFFFFFFF
recv_fn...............0124024C tm_recv_fn............00000000
tm_buf_info...........00000000 tm_head...............00000000
tm_tail...............00000000 tm_recv_buf...........00000000
tm_bufs_tot...........00000000 tm_bufs_at_adp........00000000
tm_bufs_to_enable.....00000000 tm_buf................00000000
tm_raddr..............00000000 proto_tag_e...........00000000
proto_tag_i...........00000000 adapter_check.........00000000
eid................ @ 5002E42C limbo_start_time......00000000
dev_eid............ @ 5002E4B0 tm_dev_eid......... @ 5002E8B0
pipe_full_cnt.........00000000 dump_state............00000000
pad...................00000000 adp_cmd_pending.......00000000
reset_pending.........00000000 epow_state............00000000
mm_reset_in_prog......00000000 sleep_pending.........00000000
bus_reset_in_prog.....00000000 first_try.............00000001
devs_in_use_I.........00000000 devs_in_use_E.........00000000
num_buf_cmds..........00000000 next_id...............00000045
next_id_tm............00000000 resvd4................00000000
ebp_flag..............00000000 tm_bufs_blocked.......00000000
tm_enable_threshold...00000000 limbo.................00000000
critical_path.........00000000 epow_reset_needed.....00000000
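The address given to asc in this example (0123FEC0) lies 0x60 bytes past le_data (0123FE60), and the words dumped there match the first fields that asc then prints. The Python sketch below decodes those words by hand to make the correspondence explicit. The field order (semaphore, num_of_opens, num_of_cfgs, ap_ptr[0], ap_ptr[1]) and the 32-bit big-endian layout are assumptions taken from the order of the asc display, not from the adp_ctrl declaration, and the function names are hypothetical.

```python
import struct


def parse_dump_line(line):
    """Split one 'ADDR: xxxx xxxx ...' line of d output into
    (address, 16 raw bytes). The trailing ASCII column is ignored."""
    addr_text, rest = line.split(":", 1)
    groups = rest.split()[:8]            # eight 16-bit groups = 16 bytes
    return int(addr_text, 16), bytes.fromhex("".join(groups))


def decode_adp_ctrl(dump_lines, struct_addr):
    """Decode the first five words at struct_addr.
    Assumes the dump lines are contiguous and the layout is five
    big-endian 32-bit words in the order asc prints them."""
    parsed = [parse_dump_line(line) for line in dump_lines]
    base = parsed[0][0]
    data = b"".join(chunk for _, chunk in parsed)
    words = struct.unpack_from(">5I", data, struct_addr - base)
    names = ["semaphore", "num_of_opens", "num_of_cfgs", "ap_ptr[0]", "ap_ptr[1]"]
    return dict(zip(names, words))


# The last two lines of the d output shown in the example.
DUMP = [
    "0123FEC0: 0000 0000 0000 0002 0000 0002 5002 D000 ............P...",
    "0123FED0: 5002 E000 0000 0000 0000 0000 0000 0000 P...............",
]

info = decode_adp_ctrl(DUMP, 0x0123FEC0)
for name, value in info.items():
    print("%-14s %08X" % (name, value))
```

The decoded ap_ptr[1] value, 5002E000, is the address shown in the "Adapter info" header when asc 1 is run.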


KDB memory allocator sub commands

Introduction

The following table lists the memory allocator sub commands and their matching crash/lldb sub commands, when available.

| memory allocator function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| display kernel heap | | heap | heap |
| display heap debug | xmalloc | xm | xm |
| display kmem buckets | | kmbucket | kmbucket |
| display kmem statistics | mblk | kmstats | kmstats |

kmstats sub command

The kmstats sub command prints kernel allocator memory statistics. If no address is specified, all kernel allocator memory statistics are displayed. If an address is entered, only the specified statistics entry is displayed.

kmbucket sub command

The kmbucket sub command prints kernel memory allocator buckets. If no arguments are specified, information is displayed for all allocator buckets for all CPUs. kmbucket accepts the following parameters :

• -l : display the bucket free list.
• -c cpu : display only buckets for the specified CPU. The cpu is specified as a decimal value.
• -i index : display only the bucket for the specified index. The index is specified as a decimal value.
• Address : display the allocator bucket at the specified effective address. Symbols, hexadecimal values, or hexadecimal expressions may be used in specification of the address.


xm sub command

The xm (xmalloc) sub command displays memory allocation information. All options other than -u require that the Memory Overlay Detection System (MODS) is active. MODS can be activated using the bosdebug command.

• -s : display allocation records matching Address. If Address is not specified, the value of the symbol Debug_addr is used.
• -h : display free list records matching Address. If Address is not specified, the value of the symbol Debug_addr is used.
• -l : enable verbose output. Applicable only with flags -f, -a, and -p.
• -f : display records on the free list, from the first freed to the last freed.
• -a : display allocation records.
• -p page : display page information for the specified page. The page number is specified as a hexadecimal value.
• -d : display the allocation record hash chain associated with the record hash value for Address. If Address is not specified, the value of the symbol Debug_addr is used.
• -v : verify allocation trailers for allocated records and free fill patterns for free records.
• -u : display heap statistics.
• -S : display heap locks and per-cpu lists. Note that the per-cpu lists are only used for the kernel heaps.
• Address : effective address for which information is to be displayed. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
• heap_addr : effective address of the heap for which information is displayed. If heap_addr is not specified, information is displayed for the kernel heap. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

heap sub command

The heap subcommand displays information about heaps. If no argument is specified, information is displayed for the kernel heap. Information can be displayed for other heaps by specifying the address of a heap_t structure.


Examples

(0)> heap <== display kernel heaps
Pinned heap 00730290
sanity......... 4E554D41 alt............ 00000001
heapaddr[00]... F100009710000000 [01].. 0
heapaddr[02]... 0 [03].. 0
baseaddr[00]... F100009713FF3000 [01].. 0
baseaddr[02]... 0 [03].. 0
numpages[00]... 1C00D [01].. 0
numpages[02]... 0 [03].. 0
Kernel heap 007302F8
sanity......... 4E554D41 alt............ 00000000
heapaddr[00]... F1000097100000D8 [01].. 0
heapaddr[02]... 0 [03].. 0
baseaddr[00]... F100009713FF3000 [01].. 0
baseaddr[02]... 0 [03].. 0
numpages[00]... 1C00D [01].. 0
numpages[02]... 0 [03].. 0
(0)> xm -S F1000097100000D8 <== display heap lock/cpu for kernel heap 007302F8
Locks:
Lock for allocation size 16: F100009710000248 Available
Lock for allocation size 32: F1000097100002C8 Available
Lock for allocation size 64: F100009710000348 Available
Lock for allocation size 128: F1000097100003C8 Available
Lock for allocation size 256: F100009710000448 Available
Lock for allocation size 512: F1000097100004C8 Available
Lock for allocation size 1024: F100009710000548 Available
Lock for allocation size 2048: F1000097100005C8 Available
Heap lists:
CPU List # Unpinned Pinned
  0      0        0      0
  0      1        0      0
.
.
  0      9        0      0
  0     10        0      0
  0     11 2322A000      0
  1      0        0      0
.
.
(0)> kmstats <== display all the kernel allocator memory stats
mh_freelater ............0000000000E3E830
displaying kmemstats for offset 0 free
address...............F100009715FB46E0 inuse..(x)............0000000000000000
calls..(x)............0000000000000000 memuse..(x)...........0000000000000000
limit blocks..(x).....0000000000000000 map blocks..(x).......0000000000000000
maxused..(x)..........0000000000000000 limit..(x)............0000000000000000
failed..(x)...........0000000000000000 lock............... @ F100009715FB4728
lock..(x).............0000000000000000
.
.
.


(0)> kmbucket <== display all kernel memory allocator buckets
displaying kmembucket for cpu 0 offset 5 size 0x00000020
address...............F100009715FA4C48 b_next..(x)...........F1000082C007BB80
b_calls..(x)..........0000000000000026 b_total..(x)..........0000000000000080
b_totalfree..(x)......000000000000005D b_elmpercl..(x).......0000000000000080
b_highwat..(x)........00000000000003F5 b_couldfree (sic).(x).0000000000000000
b_failed..(x).........0000000000000000 lock............... @ F100009715FA4C90
lock..(x).............0000000000000000

displaying kmembucket for cpu 0 offset 6 size 0x00000040
.
.
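In the kmbucket output, offset 5 is reported with size 0x20 and offset 6 with size 0x40, which is consistent with power-of-two buckets where the size is 1 shifted left by the offset. The Python sketch below assumes that relationship holds for every bucket. That is an inference from the two entries shown, not a documented rule, and the function names are hypothetical. Under that assumption it gives the index to pass to kmbucket -i for a given allocation size.

```python
def bucket_size(index):
    """Size in bytes of the bucket at the given index, assuming
    power-of-two buckets (offset 5 -> 0x20, offset 6 -> 0x40)."""
    if index < 0:
        raise ValueError("index must not be negative")
    return 1 << index


def bucket_index(nbytes):
    """Smallest index whose bucket is large enough for nbytes."""
    if nbytes < 1:
        raise ValueError("nbytes must be positive")
    return (nbytes - 1).bit_length()


for request in (0x20, 0x21, 0x40, 100):
    index = bucket_index(request)
    print("request 0x%X -> index %d, bucket size 0x%X"
          % (request, index, bucket_size(index)))
```

The allocator may enforce a minimum or maximum bucket size that this sketch does not model, so the result should be checked against the size that kmbucket reports.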


KDB file system sub commands

Introduction

The following table lists the file system sub commands and their matching crash/lldb sub commands, when available.

| file system function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| display buffer | buffer | buffer | buffer |
| display buffer hash table | | hbuffer | hbuffer |
| display freelist | | fbuffer | fbuffer |
| display gnode | | gnode | gnode |
| display gfs | | gfs | gfs |
| display file | file | file | file |
| display inode | inode | inode | inode |
| display inode hash table | | hinode | hinode |
| display inode cache list | | icache | icache |
| display rnode | | rnode | N/A |
| display vnode | vnode | vnode | vnode |
| display vfs | vfs | vfs | vfs |
| display specnode | | specnode | specnode |
| display devnode | | devnode | devnode |
| display fifo node | | fifonode | fifonode |
| display hnode hash table | | hnode | hnode |

buffer, hbuffer and fbuffer sub commands

The buffer, hbuffer and fbuffer sub commands respectively print :

• buffer cache headers.
• buffer cache hash list headers.
• buffer cache freelist headers.

If no argument is specified, a summary is printed. Details can be displayed by selecting a slot number or an address using :

• slot : a buffer pool slot number. This argument must be a decimal value.
• Address : effective address of a buffer pool entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.


inode, hinode and icache sub commands

The inode, hinode and icache sub commands respectively display :

• inode table entries. If no argument is entered, a summary for used (hashed) inode table entries is displayed.
• inode hash list entries.
• inode cache list entries.

These sub commands use the following arguments :

• slot : slot number of an entry. This argument must be a decimal value.
• Address : effective address of an entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands

The gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands respectively display :

• generic node structure at the specified address.
• virtual node (vnode) table entries.
• special device node structure at the specified address.
• device node (devnode) table entries.
• fifo node table entries.
• remote node structure at the specified address.
• hash node table entries.

These sub commands accept the following arguments :

• slot : slot number of a table entry. This argument must be a decimal value.
• Address : effective address of a table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

vfs sub command

The vfs subcommand displays entries of the virtual file system table. If no argument is entered, a summary is displayed with one line for each entry. Detailed information can be obtained for an individual entry using :

• slot : slot number of a virtual file system table entry. This argument must be a decimal value.
• Address : address of a virtual file system table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.


gfs sub command

The gfs subcommand displays the generic file system structure at the specified address.

file sub command

The file subcommand displays file table entries. If no argument is entered, all file table entries are displayed in a summary. Used files (count > 0) are displayed first, then the others. Detailed information can be displayed using :

• slot : slot number of a file table entry. This argument must be a decimal value.
• Address : effective address of a file table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

Examples

(0)> vfs <== display mounted vfs
                  GFS      DATA             TYPE   FLAGS
1 KERN_heap+5F7C470 00394EC8 F100009715F7D990 JFS DEVMOUNT
... /dev/hd4 mounted over /
2 KERN_heap+5F7C4D0 00394EC8 F100009715F7DE60 JFS DEVMOUNT
... /dev/hd2 mounted over /usr
3 KERN_heap+5F7C530 00394EC8 F100009715F7DD00 JFS DEVMOUNT
... /dev/hd9var mounted over /var
4 KERN_heap+5F7C410 00394EC8 F100009715F7D8E0 JFS DEVMOUNT
... /dev/hd3 mounted over /tmp
5 KERN_heap+5F7C590 00394EC8 F100009715F7DAF0 JFS DEVMOUNT
... /dev/hd1 mounted over /home
6 KERN_heap+5F7C5F0 00395008 0000000000000000 PROCFS
... /proc mounted over /proc
7 KERN_heap+5F7C650 00394F68 F10000971EB5A3D0 AIX DEVMOUNT
... /dev/lv01 mounted over /j2
(0)> gfs 0039500 <== display gfs for jfs entry
gfs_data. 706F7374FBE1FFF8 gfs_flag. SYS5DIR FUMNT VERSION42 NOUMASK
gfs_ops.. E981008038210070
gn_ops... 7D8803A64E800020
gfs_name. N
gfs_init. 00000054000E776C
gfs_rinit 607F00007C0802A6
gfs_type.
gfs_hold. E8625080
(0)> file <== display the file table
ADDR             COUNT OFFSET           DATA             TYPE   FLAGS
F100009600001080     1 0000000000000000 F1000097160CC2B0 VNODE  WRITE NOCTTY
F1000096000010D0     1 0000000000000000 F1000082C0078800 SOCKET READ WRITE
F100009600001120    29 0000000000000000 F1000097159BB290 VNODE  READ RSHARE
F100009600001170     2 0000000000000000 F100009714C89830 VNODE  READ RSHARE
F1000096000011C0    34 0000000000026282 F100009714A01C60 VNODE  READ RSHARE
F100009600001210     1 0000000000000100 F100009715696290 VNODE  EXEC
F100009600001260     3 00000000000230E2 F100009714AA6620 VNODE  READ RSHARE


KDB system table sub commands

Introduction

The following table lists the system table sub commands and their matching crash/lldb sub commands, when available.

| system table function | crash/lldb sub commands | KDB sub commands | kdb sub commands |
|---|---|---|---|
| display var | var | var | var |
| display devsw table | devsw | devsw | devsw |
| display system timer request blocks | callout | trb | trb |
| display simple lock | lock -s | slk | slk |
| display complex lock | lock -c | clk | clk |
| search for deadlock | dlock | N/A | dla |
| display ipl proc information | | iplcb | iplcb |
| display trace buffer | | trace | trace |
| display the stream queue | queue | streams | streams |

var sub command

The var subcommand prints the var structure and the system configuration of the machine, including :

• Base kernel parameters
• Calculated high-water marks
• VMM tunable variables
• System configuration information


devsw sub command

The devsw (dev) subcommand displays device switch table entries. If no argument is specified, all entries are displayed. To display a specific entry use :

• major : indicates the specific device switch table entry to be displayed by the major number. This is a hexadecimal value.
• Address : effective address of a driver. The device switch table entry with the driver closest to the indicated address is displayed, and the specific driver is indicated. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
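Because the major argument is hexadecimal while ls -l prints major numbers in decimal, a conversion is needed before calling the sub command. For example, a device that ls -l lists with major number 14 is entry 0xe. The Python sketch below shows the conversion. The helper names are hypothetical, and major_of relies on os.major, which is only available on Unix-like systems.

```python
import os


def devsw_argument(major):
    """Format a decimal major number as hexadecimal text (for example
    10 -> '0xa'), the form used with the dev sub command in this guide."""
    if major < 0:
        raise ValueError("major number must not be negative")
    return format(major, "#x")


def major_of(path):
    """Return the decimal major number of a device special file.
    Not called below, because it needs a real device node to inspect."""
    return os.major(os.stat(path).st_rdev)


# Major numbers as ls -l would show them (decimal).
for major in (10, 14):
    print("major %d -> dev %s" % (major, devsw_argument(major)))
```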

trb sub command

The trb subcommand displays Timer Request Block (TRB) information. If this subcommand is entered without arguments, a menu is displayed allowing selection of the data to be displayed. Otherwise, you can use the following arguments :

• * : selects display of TRB information for TRBs on all CPUs. The information displayed will be summary information for some options.
• cpu x : selects display of TRB information for the specified CPU. Note that the characters "cpu" must be included in the input. The value x is a hexadecimal number.
• option : the option number indicating the data to be displayed. The available option numbers are :
  1. TRB Maintenance Structure - Routine Addresses
  2. System TRB
  3. Thread Specified TRB
  4. Current Thread TRB's
  5. Address Specified TRB
  6. Active TRB Chain
  7. Free TRB Chain
  8. Clock Interrupt Handler Information
  9. Current System Time - System Timer Constants

slk, clk and dla sub commands

slk and clk respectively display simple and complex locks. If no argument is specified, a list of major locks is displayed. You can then use the address of a lock to display its lock structure.

dla searches for deadlocks.


iplcb sub command

The iplcb sub command displays the IPL Control Block structure. It accepts the following parameters :

• [cpu] : print the IPL control block (all information is displayed, including the processor information for [cpu]).
• * : print summary of all processors
• -dir : print directory information
• -proc [cpu] : print processor information
• -mem : print memory region information
• -sys : print system information
• -user : print user information
• -numa : print NUMA information

trace sub command

The trace sub command displays data in the kernel trace buffers. Data is entered into these buffers via the shell command trace. The trace sub command accepts the following parameters :

• -h : display trace headers.
• -c chan : select the trace channel for which the contents are to be monitored. The value for chan must be a decimal constant in the range 0 to 7.
• hook : a hexadecimal value specifying the hook IDs to report on.
• :subhook : allows specification of subhooks, if needed. The subhooks are specified as hexadecimal values.
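The argument rules above (decimal channel from 0 to 7, hexadecimal hook, optional hexadecimal subhook) can be checked before typing a command. This is only a sketch: the function name parse_trace_args is hypothetical, and it assumes that a subhook is written directly after its hook as hook:subhook.

```python
def parse_trace_args(chan, hook_spec):
    """Validate a trace channel and split a 'hook[:subhook]' specification.

    Assumption: hook and subhook are joined by a colon, as suggested by the
    ':subhook' notation in the parameter list.
    """
    channel = int(chan, 10)  # chan is a decimal constant
    if not 0 <= channel <= 7:
        raise ValueError("trace channel must be in the range 0 to 7")
    hook_text, _, subhook_text = hook_spec.partition(":")
    hook = int(hook_text, 16)  # hook IDs are hexadecimal
    subhook = int(subhook_text, 16) if subhook_text else None
    return channel, hook, subhook


# 134 is the SYSC_EXECVE hook ID shown in the trace example of this section.
print(parse_trace_args("0", "134"))
```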

Examples

(0)> !ls -al /dev/cd0 <== find the cd0 major number
br--r--r-- 1 root system 14, 0 Sep 08 11:18 /dev/cd0
(0)> lke 57 <== load the kernext for scsidd
     ADDRESS  FILE     FILESIZE FLAGS    MODULE NAME
57   049D6B00 00DB9740 000070D8 00080262 s_scsidd64/usr/lib/drivers/pci/s_scsidd
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 049D6900 le_svc_sequence 00000000
.....


(0)> dev 0xd <== display the cd0 device
Slot address F10000971406F680
MAJOR: 00D
   open:     .ssc_open (00DBC0B0)
   close:    .ssc_close (00DBEAD8)
   read:     .nodev (00059694)
   write:    .nodev (00059694)
   ioctl:    .ssc_ioctl (00DBD1DC)
   strategy: .ssc_strategy (00DC3C2C)
   ttys:     00000000
   select:   .nodev (00059694)
   config:   .ssc_config (00DBE180)
   print:    .nodev (00059694)
   dump:     .ssc_dump (00DCDEF4)
   mpx:      .nodev (00059694)
   revoke:   .nodev (00059694)
   dsdptr:   00000000
   selptr:   00000000
   opts:     0000002A DEV_DEFINED DEV_MPSAFE
(0)> trb cpu 1 7 <== display the trb free list for cpu 1
CPU #1 TRB #1 of 13 on Free List
 Timer address..............F100009715F8B780
 trb->to_next...............0000000000000000
 trb->knext.................F10000971E27AD00
 trb->kprev.................0000000000000000
 Owner id (-1 for dev drv)..00000000000042A1
 Owning processor...................00000001
 Timer flags........................00000010 INCINTERVAL
 trb->timerid...............0000000000000000
 trb->eventlist.............FFFFFFFFFFFFFFFF
 trb->timeout.it_interval...0000000000000000 sec. 00000000 nsec.
 Next scheduled timeout ....0000000039BE55A6 sec. 19B39935 nsec.
 Completion handler.........00000000001DA910 .rtsleep_end+000000
 Completion handler data....F100009715F8B7B0
 Int. priority .....................FFFFFFFF
 Timeout function...........0000000000000000
CPU #1 TRB #2 of 13 on Free List
.
(0)> iplcb -mem <== display the iplcb memory region information
Memory information [10008AAC]
SLOT ADDR             SIZE             NODE ATTR     LABEL
0    0000000000000000 0000000000FF1000 0    VirtAddr FreeMem
1    0000000000FF1000 000000000000F000 0    VirtAddr RMALLOC
2    0000000001000000 0000000006FCC000 0    VirtAddr FreeMem
3    0000000007FCC000 0000000000029000 0    None     RTAS_HEAP
4    0000000007FF5000 000000000000B000 0    VirtAddr IPLCB
5    0000000008000000 0000000018000000 0    VirtAddr FreeMem
6    0000000020000000 FFFFFFFFE0000000 0    None     IO_SPACE
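The memory region table printed by iplcb -mem is easy to post-process once it has been captured. As a minimal sketch, the code below parses the seven rows shown above and totals the bytes labelled FreeMem. The helper names parse_regions and total_size are hypothetical, and the column order is taken from the header line of the example.

```python
# Rows copied from the iplcb -mem example (SLOT ADDR SIZE NODE ATTR LABEL).
IPLCB_MEM_ROWS = """\
0 0000000000000000 0000000000FF1000 0 VirtAddr FreeMem
1 0000000000FF1000 000000000000F000 0 VirtAddr RMALLOC
2 0000000001000000 0000000006FCC000 0 VirtAddr FreeMem
3 0000000007FCC000 0000000000029000 0 None RTAS_HEAP
4 0000000007FF5000 000000000000B000 0 VirtAddr IPLCB
5 0000000008000000 0000000018000000 0 VirtAddr FreeMem
6 0000000020000000 FFFFFFFFE0000000 0 None IO_SPACE
"""


def parse_regions(text):
    """Turn each table row into a dict with integer address and size."""
    regions = []
    for line in text.splitlines():
        slot, addr, size, node, attr, label = line.split()
        regions.append({
            "slot": int(slot),
            "addr": int(addr, 16),   # ADDR and SIZE are hexadecimal
            "size": int(size, 16),
            "attr": attr,
            "label": label,
        })
    return regions


def total_size(regions, label):
    """Sum the sizes of all regions carrying the given label."""
    return sum(r["size"] for r in regions if r["label"] == label)


regions = parse_regions(IPLCB_MEM_ROWS)
free_bytes = total_size(regions, "FreeMem")
print(f"FreeMem: {free_bytes:#x} bytes")
```

In this sample each region starts where the previous one ends, which is a quick consistency check on a captured table.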


(0)> trace <== show the trace buffers (trace was started for proc events)
Trace channel[0 - 7]: 0
Trace Channel 0 (7 entries)
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #7 of 7 at F1000097231F2130
 Hook ID: SYSC_EXECVE (00000134) Hook Type: Timestamped|Generic C000
 ThreadIdent: 00003F0B Timestamp: 26E264B2F6
 Subhook ID/HookData: 0000 Data Length: 0007 bytes
 D0: 00000001
 *Variable Length Buffer: F1000097231F2140
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #6 of 7 at F1000097231F2108
..


KDB network sub commands

Introduction

The following table lists the network sub commands and their matching crash/lldb sub commands, when available :

network function      crash/lldb sub commands   KDB sub commands   kdb sub commands
display interface     netstat                   ifnet              ifnet
display TCBs          ndb                       tcb                tcb
display UDBs          ndb                       udb                udb
display sockets       sock                      sock               sock
display TCP CB        ndb                       tcpcb              tcpcb
display mbuf          mbuf                      mbuf               mbuf

ifnet sub command

The ifnet sub command prints interface information. If no argument is specified, information is displayed for each entry in the ifnet table. Data for individual entries can be displayed by specifying :

• slot : specifies the slot number within the ifnet table for which data is to be displayed. This value must be a decimal number.

• Address : effective address of an ifnet entry to display. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

tcpcb and sock sub commands

The tcpcb and sock sub commands print, respectively :

• tcpcb information for TCP/UDP blocks.
• socket information for TCP/UDP blocks.

If no argument is specified, tcpcb information is displayed for all TCP and UDP blocks. tcpcb and sock accept the following arguments :

• tcp : display tcpcb information for TCP blocks only.
• udp : display tcpcb information for UDP blocks only.
• Address : effective address of a tcpcb structure to be displayed. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.


tcb and udb sub commands

The tcb and udb sub commands display, respectively :

• tcb block information and socket information
• udb block information and socket information

tcb and udb accept the following parameters :

• slot : specifies the slot number within the tcb or udb table for which data is to be displayed. This value must be a decimal number.
• Address : effective address of a tcb or udb entry to display. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.

Examples

(0)> ifnet
SLOT 1 ---- IFNET INFO ----(@ 007545E0)----
 name........ lo0 unit........ 00000000 mtu......... 00004200
 flags....... 0E08084B (UP|BROADCAST|LOOPBACK|RUNNING|SIMPLEX|NOECHO|BPF|GROUP_ROUTING......|64BIT|CANTCHANGE|MULTICAST)
 timer....... 00000000 metric...... 00000000
 address: 127.0.0.1
 init()...... 00000000 output().... 001DBF38 start()..... 00000000
 done()...... 00000000 ioctl()..... 001DBF20 reset()..... 00000000
 watchdog().. 00000000 ipackets.... 000000B5 ierrors..... 00000000
 opackets.... 000000B5 oerrors..... 00000000 collisions.. 00000000
 next........ F10000971614F000 type........ 00000018 addrlen..... 00000000
 hdrlen...... 00000000 index....... 00000001
 ibytes...... 00003448 obytes...... 00003448 imcasts..... 00000000
 omcasts..... 00000000 iqdrops..... 00000000 noproto..... 00000000
 baudrate.... 00000000 arpdrops.... 00000000 ifbufminsize 00000000
 devno....... 00000000 chan........ 00000000 multiaddrs.. F1000082C0157468
 tap()....... 00000000 tapctl...... 00000000 arpres().... 00000000
 arprev().... 00000000 arpinput().. 00000000 ifq_head.... 00000000
 ifq_tail.... 00000000 ifq_len..... 00000000 ifq_maxlen.. 00000032
 ifq_drops... 00000000 ifq_slock... 00000000 slock....... 00000000
 multi_lock.. 00000000 6_multi_lock 00000000 addrlist_lck 00000000
 gidlist..... 00000000 ip6tomcast() 00000000 ndp_bcopy(). 00000000
 ndp_bcmp().. 00000000 ndtype...... 01000000 multiaddrs6. F1000082C0158F00
SLOT 2 ---- IFNET INFO ----(@ F10000971614F000)----
 name........ tr0 unit........ 00000000 mtu......... 000005D4
..
(0)> tcpcb @ F1000082C0031C34 <== display the first tcpcb
---- TCPCB ---(@ F1000082C0031C34)----
 seg_next... F1000082C0031C34 seg_prev...... F1000082C0031C34
 t_softerror 00000000 t_state....... 00000004 (ESTABLISHED)
 t_timer.... 00000000 (TCPT_REXMT)
 t_timer.... 00000000 (TCPT_PERSIST)
 t_timer.... 00000CFB (TCPT_KEEP)
 t_timer.... 00000000 (TCPT_2MSL)
 t_rxtshift. 00000000 t_rxtcur...... 00000004 t_dupacks..... 00000000
 t_maxseg... 000005AC t_force....... 00000000


 t_flags.... 00000000 ()
 t_oobflags. 00000000 ()
 t_iobc..... 00000000 t_template.. F1000082C0031C64
 t_inpcb..F1000082C0031B54 <== pointer to tcb or udb structure
 t_timestamp... 2DF79401 snd_una....... 8D452AB5 snd_nxt....... 8D452AB5
 snd_up........ 8D452920 snd_wl1....... 42612E19 snd_wl2....... 8D452AB5
 iss........... 8D4514FA snd_wnd....... 00003E64 rcv_wnd....... 00004410
 rcv_nxt....... 42612E1B rcv_up........ 42612E18 irs........... 42612D92
 snd_wnd_scale. 00000000 rcv_wnd_scale. 00000000 req_scale_sent 00000000
 req_scale_rcvd 00000000 last_ack_sent. 42612E1B timestamp_rec. 00000000
 timestamp_age. 00002BE3 rcv_adv....... 4261722B snd_max....... 8D452AB5
 snd_cwnd...... 0000DD34 snd_ssthresh.. 3FFFC000 t_idle........ 00002B45
 t_rtt......... 00000000 t_rtseq....... 8D452920 t_srtt........ 00000007
 t_rttvar...... 00000004 t_rttmin...... 00000002 max_rcvd...... 00000000
 max_sndwnd.... 00003E64 t_peermaxseg.. 000005AC
(0)> tcb f1000082C0031B54 <== display the tcb for the pointer found before
-------- TCB --------- INPCB INFO ----(@ F1000082C0031B54)----
 next........ F1000082C0032354 prev........ 04BB8F80 head........ 04BB8F80
 iflowinfo... 00000000 faddr_6... @ F1000082C0031B74 fport....... 00008036 fatype...... 00000001
 oflowinfo... 00000000 laddr_6... @ F1000082C0031B8C lport....... 00000017 latype...... 00000001
 socket...... F1000082C0031800 ppcb........ F1000082C0031C34 route_6... @ F1000082C0031BAC
 ifa.....00000000 flags....... 00000400
 proto....... 00000000 tos......... 00000000 ttl......... 0000003C rcvttl...... 00000000
 rcvif....... F10000971614F000 options..... 00000000 refcnt...... 00000002
 lock........ 00000000 rc_lock..... 00000000 moptions.... 00000000
 hash.next... 04BEB040 hash.prev... 04BEB040
 timewait.nxt 00000000 timewait.prv 00000000
---- SOCKET INFO ----(@ F1000082C0031800)---- <== we also get socket information
 type........ 0001 (STREAM)
 opts........ 010C (REUSEADDR|KEEPALIVE|OOBINLINE)
 linger...... 0000 state....... 0102 (ISCONNECTED|NBIO)
 pcb.. F1000082C0031B54 proto.. 04BAC870
 lock.. F1000082C007B740 head.00000000
 q0...... 00000000 q....... 00000000 dq...... 00000000
 q0len....... 0000 qlen........ 0000 qlimit...... 0000
 dqlen....... 0000 timeo....... 0000 error....... 0000
 special..... 0A8C pgid.... 00000000 oobmark. 00000000
snd:cc...... 00000000 hiwat... 00002000 mbcnt... 00000000 mbmax... 00008000
 lowat... 00001000 mb...... 00000000 sel..... 00000000 events...... 0000
 iodone. 00000000 ioargs. 00000000 lastpkt. F1000082C01BE800 wakeone. FFFFFFFF
 timer... 00000000 timeo... 00000000 flags....... 0048 (SEL|NOINTR)
 wakeup.. 00F66E78 wakearg. C01FF918 lock.... FFFFFFFFF1000082
rcv:cc...... 00000000 hiwat... 00004410 mbcnt... 00000000 mbmax... 00011040
 lowat... 00000001 mb...... 00000000 sel..... 00000000 events...... 0004
 iodone.. 00000000 ioargs.. 00000000 lastpkt. F1000082C01A9800 wakeone. FFFFFFFF
 timer... 00000000 timeo... 00000000 flags....... 0048 (SEL|NOINTR)
 wakeup.. 00F66E78 wakearg. C01FF800 lock.... FFFFFFFFF1000082
 tpcb.... 00000000 fdev_ch. F10000971E186DC0 sec_info 00000000
 qos..... 00000000 gidlist. 00000000 private. 00000000
 uid..... 00000000 bufsize. 00000000 threadcnt00000000
 nextfree 00000000 siguid.. 00000000 sigeuid. 00000000
 sigpriv. 00000000
 sndtime. 00000000 sec 00000000 usec
 rcvtime. 00000000 sec 00000000 usec
proc/fd: 44/0 44/1 44/2
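Almost every field in these displays is hexadecimal, so reading them usually means converting a few values by hand. The sketch below converts some of the fields shown above and decodes the low-order bits of the ifnet flags word. The bit values are the classic BSD IFF_* definitions; that AIX uses the same values for these low bits is an assumption, although the decoded names match the first five names kdb prints for lo0. The AIX-specific high bits are deliberately not decoded, and the helper names are hypothetical.

```python
# Low-order interface flag bits (classic BSD IFF_* values, assumed to apply).
IFF_LOW_BITS = {
    0x001: "UP",
    0x002: "BROADCAST",
    0x004: "DEBUG",
    0x008: "LOOPBACK",
    0x010: "POINTOPOINT",
    0x020: "NOTRAILERS",
    0x040: "RUNNING",
    0x080: "NOARP",
    0x100: "PROMISC",
    0x200: "ALLMULTI",
    0x400: "OACTIVE",
    0x800: "SIMPLEX",
}


def decode_low_flags(flags_hex):
    """Return the names of the low-order flag bits set in flags_hex."""
    flags = int(flags_hex, 16)
    return [name for bit, name in sorted(IFF_LOW_BITS.items()) if flags & bit]


def hex_field(value_hex):
    """Convert one hexadecimal field of a kdb display to an integer."""
    return int(value_hex, 16)


print(decode_low_flags("0E08084B"))            # flags of lo0
print("tr0 mtu   :", hex_field("000005D4"))    # bytes
print("t_maxseg  :", hex_field("000005AC"))    # bytes
print("local port:", hex_field("00000017"))    # lport of the tcb
print("peer port :", hex_field("00008036"))    # fport of the tcb
print("ttl       :", hex_field("0000003C"))
```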


KDB VMM sub commands

Introduction

The following table lists the VMM sub commands and their matching crash/lldb sub commands, when available :

VMM function                 crash/lldb sub commands   KDB sub commands   kdb sub commands
VMM kernel segment data      /vmm-1                    vmker              vmker
VMM RMAP                     vmm-rmap                  rmap               rmap
VMM control variables        /vmm-2                    pfhdata            pfhdata
VMM statistics               /vmm-3                    vmstat             vmstat
VMM Addresses                /vmm-a                    vmaddr             vmaddr
VMM paging device table      vmm-pdt                   pdt                pdt
VMM segment control blocks   vmm-scb                   scb                scb
VMM PFT entries              vmm-pft                   pft                pft
VMM PTE entries              vmm-pte                   pte                pte
VMM PTA segment              vmm-pta                   pta                pta
VMM STAB                                               ste                ste
VMM segment register         sr64                      sr64               sr64
VMM segment status           segst64                   segst64            segst64
VMM APT entries              vmm-apt                   apt                apt
VMM wait status              /vmm-9                    vmwait             vmwait
VMM address map entries      vmm-ame                   ames               ames
VMM zeroing kproc            /vmm-f                    zproc              zproc
VMM error log                /vmm-e                    vmlog              vmlog
VMM reload xlate table                                 vrld               vrld
IPC information              vmm-sem/shm               ipc                ipc
VMM lock anchor/tblock                                 lockanch           lockanch
VMM lock hash table                                    lockhash           lockhash
VMM lock word                                          lockword           lockword
VMM disk map                                           vmdmap             vmdmap
VMM spin locks                                         vmlocks            vmlocks


vmker, pfhdata, vmstat, vmaddr, vmwait, zproc, vmlog, vrld and vmlocks sub commands

These sub commands display the following VMM information :

• vmker : virtual memory kernel data.
• pfhdata : virtual memory control variables.
• vmstat : virtual memory statistics.
• vmaddr : addresses of VMM structures.
• vmwait : VMM wait status, using the address of a wait channel.
• zproc : information about the VMM zeroing kproc.
• vmlog : the current VMM error log entry.
• vrld : the VMM reload xlate table. This information is only used on SMP PowerPC machines, to prevent VMM reload deadlock.
• vmlocks : VMM spin lock data.

scb sub command

The scb sub command provides options for display of information about VMM segment control blocks. It displays a menu from which the segment control blocks to display are selected using the following options :

• 1 : index
• 2 : sid
• 3 : srval
• 4 : search on sibits
• 5 : search on npsblks
• 6 : search on nvpages
• 7 : search on npages
• 8 : search on npseablks
• 9 : search on lock
• a : search on segment type
• b : add total scb_vpages
• c : search on segment class
• d : search on segment pvproc

ames sub command

The ames subcommand provides options for display of the process address map for either the current process or a specified process. It displays a menu from which the address map to display is selected using the following options :

• 1 : current process
• 2 : specified process
• 3 : specified address map


pft sub command

The pft sub command provides options for display of information about the VMM page frame table. It displays a menu from which the page frame information to display is selected using the following options :

• 2 : h/w hash (sid,pno)
• 3 : s/w hash (sid,pno)
• 4 : search on swbits
• 5 : search on pincount
• 6 : search on xmemcnt
• 7 : scb list
• 8 : io list
• 9 : deferred pgsp service frames

pte sub command

The pte sub command provides options for display of information about VMM page table entries. It displays a menu from which the page table entries to display are selected using the following options :

• 1 : index
• 2 : sid,pno
• 3 : page frame
• 4 : PTE group

pta sub command

The pta subcommand displays data from the VMM PTA segment. The following optional arguments may be used to determine the data to be displayed :

• -r : display XPT root data.
• -d : display XPT direct block data.
• -a : display the Area Page Map.
• -v : display map blocks.
• -x : display XPT fields.
• -f : prompt for the sid/pno for which the XPT fields are to be displayed.
• sid : segment ID. Symbols, hexadecimal values, or hexadecimal expressions may be used for this argument.
• idx : index for the specified area. Symbols, hexadecimal values, or hexadecimal expressions may be used for this argument.

pdt sub command

The pdt subcommand displays entries of the paging device table. An argument of * results in all entries being displayed in a summary. Details for a specific entry can be displayed using a slot number.


rmap sub command

The rmap subcommand displays the real address range mapping table. If an argument of * is specified, a summary of all entries is displayed. If a slot number is specified, only that entry is displayed. If no argument is specified, the user is prompted for a slot number, and data for that and all higher slots is displayed, as well as the page intervals utilized by VMM.

ste sub command

The ste subcommand provides options for display of information about segment table entries for 64-bit processes. It displays a menu from which the segments to display are selected using the following options :

• 1 : esid
• 2 : sid
• 3 : dump hash class (input=esid)
• 4 : dump entire stab

sr64 sub command

The sr64 sub command displays segment registers for a 64-bit process. It accepts the following parameters :

• none : the segment registers are displayed for the current process.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal value depending on the setting of the hexadecimal_wanted switch.
• esid : first segment register to display (lower register numbers are ignored). This argument must be a hexadecimal value.
• size : value to be added to esid to determine the last segment register to display. This argument must be a hexadecimal value.
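The relation between esid and size is simple hexadecimal arithmetic: the last register displayed is esid plus size. The following sketch works through one case. The function name is hypothetical and the sample arguments (10 and 5) are illustrative values, not taken from a real session.

```python
def sr64_register_range(esid_hex, size_hex):
    """Return (first, last) segment register numbers for sr64 arguments.

    Both arguments are hexadecimal strings, as the subcommand requires.
    The last register is esid + size, following the parameter description.
    """
    first = int(esid_hex, 16)
    last = first + int(size_hex, 16)
    return first, last


first, last = sr64_register_range("10", "5")
print(f"registers {first:X} through {last:X}")
```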

apt sub command

The apt subcommand provides options for display of information from the alias page table. It displays a menu from which the aliases to display are selected using the following options :

• 1 : index
• 2 : sid,pno
• 3 : page frame


segst64 sub command

The segst64 subcommand displays segment state information for a 64-bit process. The information displayed can be filtered using :

• no argument : the information for the current process is displayed.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal value depending on the setting of the hexadecimal_wanted switch.
• -e esid : first segment register to display (lower register numbers are ignored).
• -s seg : limit display to only segment registers with a segment state that matches seg. Possible values for seg are : SEG_AVAIL, SEG_SHARED, SEG_MAPPED, SEG_MRDWR, SEG_DEFER, SEG_MMAP, SEG_WORKING, SEG_RMMAP, SEG_OTHER, SEG_EXTSHM, and SEG_TEXT.
• value : limit display to only segments with the specified value for the segfileno field.

ipc sub command

The ipc subcommand reports interprocess communication facility information. It displays a menu from which the IPC information to display is selected using the following options :

• ***TBD***

lockanch, lockhash and lockword sub commands

These sub commands display the following VMM lock information :

• lockanch : anchor data and data for the transaction blocks in the transaction block table.
• lockhash : lock hash list.
• lockword : lock words.

lockanch, lockhash and lockword accept the following parameters :

• slot : slot number of an entry in the VMM lock table. This argument must be a decimal value.
• Address : effective address of an entry in the VMM lock table. Symbols, hexadecimal values, or hexadecimal expressions may be used in specification of the address.


vmdmap sub command

The vmdmap subcommand displays VMM disk maps. To look at other disk maps, it is necessary to initialize segment register 13 with the corresponding srval. vmdmap accepts the following arguments :

• no arguments : all paging and file system disk maps are displayed.
• slot : Page Device Table (pdt) slot number. This argument must be a decimal value.

examples ***TBD***


KDB SMP sub commands

Introduction

The following table lists the SMP sub commands and their matching crash/lldb sub commands, when available :

SMP function    crash/lldb sub commands   KDB sub commands   kdb sub commands
Start cpu                                 start              N/A
Stop cpu                                  stop               N/A
Switch to cpu   cpu                       cpu                cpu

start, stop and cpu sub commands

The start, stop and cpu sub commands allow you to :

• start a cpu
• stop a cpu
• display status or switch to another cpu

These sub commands accept a cpu number as parameter.

Examples ***TBD***


KDB data and instruction block address translation sub commands

Introduction

The following table lists the block address translation sub commands and their matching crash/lldb sub commands, when available :

block address translation function   crash/lldb sub commands   KDB sub commands   kdb sub commands
display dbats                                                  dbat               dbat
display ibats                                                  ibat               ibat
modify dbats                                                   mdbat              mdbat
modify ibats                                                   mibat              mibat

dbat and ibat sub commands

On PowerPC machines, the dbat and ibat sub commands may be used to display dbat and ibat registers. dbat and ibat accept the following arguments :

• no argument : all dbat or ibat registers are displayed.
• index : just the specified dbat or ibat register is displayed.

mdbat and mibat sub commands

On PowerPC machines, the mdbat and mibat sub commands may be used to modify dbat and ibat registers. The processor bat register is altered immediately. KDB takes care of the valid bit : the word containing the valid bit is set last. mdbat and mibat accept the following arguments :

• no argument : all dbat or ibat registers are prompted for modification.
• index : just the specified dbat or ibat register is prompted for modification.


Examples

KDB(0)> dbat 2 <== display bat register 2
BAT2: 00000000 00000000
 bepi 0000 brpn 0000 bl 0000 v 0 wimg 0 ks 0 kp 0 pp 0
KDB(0)> mdbat 2 <== alter bat register 2
BAT register, enter <RC> twice to select BAT field, enter <.> to quit
BAT2 upper 00000000 = <CR/LF>
BAT2 lower 00000000 = <CR/LF>
BAT field, enter <RC> to select field, enter <.> to quit
BAT2.bepi: 00000000 = 00007FE0
BAT2.brpn: 00000000 = 00007FE0
BAT2.bl : 00000000 = 0000001F
BAT2.v : 00000000 = 00000001
BAT2.ks : 00000000 = 00000001
BAT2.kp : 00000000 = <CR/LF>
BAT2.wimg: 00000000 = 00000003
BAT2.pp : 00000000 = 00000002
BAT2: FFC0003A FFC0005F
 bepi 7FE0 brpn 7FE0 bl 001F v 1 wimg 3 ks 1 kp 0 pp 2
 eaddr = FFC00000, paddr = FFC00000 size = 4096 KBytes
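The final BAT2 line shows how the individual fields are packed into the upper and lower register words. The sketch below unpacks the two words FFC0003A and FFC0005F and reproduces the field values printed above. The bit positions are inferred from this one example (they correspond to a PowerPC 601 style BAT layout) and should be treated as an assumption: other PowerPC processors place the fields differently. The function name decode_bat is hypothetical.

```python
def decode_bat(upper, lower):
    """Split a BAT register pair into the fields printed by dbat.

    Assumption: the layout is inferred from the example output; it is not
    valid for every PowerPC implementation.
    """
    fields = {
        "bepi": upper >> 17,         # block effective page index (top 15 bits)
        "wimg": (upper >> 4) & 0x7,  # storage attribute bits
        "ks": (upper >> 3) & 0x1,    # supervisor key
        "kp": (upper >> 2) & 0x1,    # problem-state (user) key
        "pp": upper & 0x3,           # protection bits
        "brpn": lower >> 17,         # block real page number (top 15 bits)
        "v": (lower >> 6) & 0x1,     # valid bit
        "bl": lower & 0x3F,          # block length mask
    }
    fields["eaddr"] = fields["bepi"] << 17
    fields["paddr"] = fields["brpn"] << 17
    # Each increment of the length mask adds 128 KBytes to the block size.
    fields["size_kb"] = (fields["bl"] + 1) * 128
    return fields


bat2 = decode_bat(0xFFC0003A, 0xFFC0005F)
print({name: hex(value) for name, value in bat2.items()})
```

Because the valid bit sits in the lower word in this layout, writing that word last keeps the register from becoming active with a half-updated translation.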


KDB bat/brat sub commands

Introduction

The following table lists the bat/brat sub commands and their matching crash/lldb sub commands, when available :

bat/brat function           crash/lldb sub commands   KDB sub commands   kdb sub commands
branch target                                         btac               N/A
clear branch target                                   cbtac              N/A
local branch target                                   lbtac              N/A
clear local branch target                             lcbtac             N/A

btac, lbtac, cbtac and lcbtac sub commands

The btac and lbtac sub commands can be used to stop when Branch Target Address Compare is true, using hardware registers HID1 and HID2 on PowerPC systems, under the following conditions :

• btac : set a general branch target.
• lbtac : set a local branch target on a per-cpu basis.

cbtac and lcbtac clear general and local branch targets, respectively.

Examples

KDB(0)> btac open <== set BRAT on open function
KDB(7)> btac <== display current BRAT status
CPU 0: .open+000000 eaddr=001B5354 vsid=00000000 hit=0
CPU 1: .open+000000 eaddr=001B5354 vsid=00000000 hit=0
KDB(0)> q <== exit the debugger
...
Branch trap: 001B5354 <.open+000000>
.sys_call+000000 bcctrl <.open>
KDB(0)> btac <== display current BRAT status (we have one hit)
CPU 0: .open+000000 eaddr=001B5354 vsid=00000000 hit=1
CPU 1: .open+000000 eaddr=001B5354 vsid=00000000 hit=0


IADB kernel debugger

Introduction

The IADB is the kernel debugger used on AIX5L running on the IA-64 platform.

Availability

The kernel debugger must be enabled in order to be used on AIX5L. The following command should return 0000000000000001 if the kernel debugger was enabled :

# iadb
(0)> d dbg_avail
E000000004755BD8: 0000000000000001
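When this check is scripted against captured debugger output, the displayed line only needs to be split at the colon and the value compared with 1. This is a minimal sketch; the function name debugger_enabled is hypothetical and the input line is the one shown above.

```python
def debugger_enabled(display_line):
    """Return True when the value displayed for dbg_avail equals 1.

    The line has the form 'ADDRESS: VALUE', both fields in hexadecimal.
    """
    _address, _, value = display_line.partition(":")
    return int(value.strip(), 16) == 1


print(debugger_enabled("E000000004755BD8: 0000000000000001"))
```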

Overview

The major functions of the IADB are :

• Setting breakpoints within the kernel or kernel extensions
• Execution control through various forms of step commands
• Formatted display of selected kernel data structures
• Display and modification of kernel data
• Display and modification of kernel instructions
• Modification of the state of the machine through alteration of system registers

loading IADB

In AIX5L, the IADB is included in the unix_ia64 kernel located in /usr/lib/boot. In order to use it, the IADB must be loaded at boot time. To control whether the IADB is loaded, use one of the following commands :

• bosboot -a -D -d /dev/ipldevice, or bosdebug -D : will load the IADB at boot time.

• bosboot -a -I -d /dev/ipldevice, or bosdebug -I : will load and invoke the IADB at boot time.

• bosboot -ad /dev/ipldevice, or bosdebug -o : will neither load nor invoke the IADB at boot time.

You must reboot the system for these changes to take effect.
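The three boot-time choices can be summarised as a small lookup table. In the sketch below the command strings are copied from the list above; the mode names load, invoke and off, as well as the function name, are hypothetical labels chosen for illustration. The code only returns the command text and does not run anything.

```python
# Mode name -> (bosboot form, bosdebug form), copied from the list above.
IADB_BOOT_MODES = {
    "load": ("bosboot -a -D -d /dev/ipldevice", "bosdebug -D"),
    "invoke": ("bosboot -a -I -d /dev/ipldevice", "bosdebug -I"),
    "off": ("bosboot -ad /dev/ipldevice", "bosdebug -o"),
}


def iadb_boot_commands(mode):
    """Return the two equivalent commands for the requested boot mode."""
    try:
        return IADB_BOOT_MODES[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; use load, invoke or off")


for mode in IADB_BOOT_MODES:
    bosboot_cmd, bosdebug_cmd = iadb_boot_commands(mode)
    print(f"{mode:7s} {bosboot_cmd}   |   {bosdebug_cmd}")
```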


starting IADB The IADB may be started, if loaded, under the following circumstances :

• If bosboot or bosdebug was run with -I, the tty attached to a native serial port will show the IADB prompt just after the kernel is loaded.

• You may manually invoke the IADB on a tty attached to a native serial port by pressing Ctrl-Alt-Numpad4 on a native keyboard. For example:

Debugger entered by hitting ctrl-alt-numpad4

AIX/IA64 KERNEL DEBUGGER ENTERED Due to...

Debugger entered via keyboard with key in SERVICE position using numpad 4

IP->E00000000008C910 waitproc_find_run_queue()+210: { .mib

==>0: adds sp = 0x40, sp

1: mov.i ar.lc = r33

2: br.ret.sptk.few rp

;; }

>CPU0>

• An application makes a call to the breakpoint() kernel service or to the breakpoint system call.

• A breakpoint previously set using the IADB has been reached.

• A fatal system error occurs. A dump might be generated on exit from the IADB.

IADB concept When the IADB Kernel Debugger is invoked, it is the only running program until you exit IADB or you use the start sub command to start another cpu. All processes are stopped and interrupts are disabled. The IADB Kernel Debugger runs with its own Machine State Save Area (mst) and a special stack. In addition, the IADB Kernel Debugger does not run operating system routines. Though this requires the kernel code be duplicated within IADB, it is possible to break anywhere within the kernel code. When exiting the IADB Kernel Debugger, all processes continue to run unless the debugger was entered via a system halt.


iadb command

Introduction The iadb command, unlike the IADB kernel debugger, allows examination of an operating system image (system dump) produced on IA-64 systems.

The iadb command may be used on a running system but will not provide all functions available with the IADB kernel debugger.

Parameters The iadb command may be used with the following parameters :

• no parameter : iadb will use /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case root permissions are required.

• -d system_image_file : iadb will use the image file provided.
• -u kernel_file : iadb will use the kernel file provided. This is required to analyze a system dump on a system that has a different unix level.
• -i include file list (may be comma separated)
• -u user modules list for any symbol retrieval (comma separated list)

Loading errors If the system image file provided doesn’t contain a valid dump or the kernel file doesn’t match the system image file, the following message may be issued by the iadb command:

# iadb -u /usr/lib/boot/unix -d dump_file
**TBD**


IADB break point and step sub commands

Introduction The following table represents the breakpoint and step sub commands and their matching crash/lldb sub commands when available

br sub command The br sub command can be used to set and display software break points. The br sub command accepts the following options :

• None : will display the currently set break points.
• -a ‘N’ : will break after ‘N’ occurrences
• -c {expr} : will break if the condition {expr} is true
• -d : deferred, will set the break point when the module is loaded
• -e ‘N’ : will break every ‘N’ occurrences
• -t ‘tid’ : will break only if the current thread id is ‘tid’
• -u ‘N’ : will break up to ‘N’ occurrences
• address : the break point address

c sub command The c sub command can be used to clear some or all break points. The c sub command accepts the following parameters :

• index : index of the break point as listed in the br output
• address : address of the break point
• all : clear all break points.


breakpoint and step function    crash/lldb sub commands    IADB sub commands    iadb sub commands
set/list break point                                       br                   N/A
set/list local break point                                                      N/A
clear local break point                                                         N/A
clear break points                                         c                    N/A
clear all breakpoint                                                            N/A
go to end of function                                      sr                   N/A
go until address                                                                N/A
single step                                                s/so                 N/A
step a bundle                                              sb                   N/A
step to next branch                                        stb
step on bl/blr                                                                  N/A
step on branch                                                                  N/A


Examples The following example shows the use of the br, c and s sub commands :

# ps -mF "THREAD" <== search for our thread id
USER PID PPID TID S CP PRI SC WCHAN F TT BND COMMAND
root 8008 1 - A 0 60 1 - 240001 0 - -ksh
- - - 10865 S 0 60 1 - 400 - - -
# <== hit ctrl-alt-numpad4 to enter the IADB
AIX/IA64 KERNEL DEBUGGER ENTERED Due to....
Debugger entered via keyboard.
IP->E0000000000884B1 waitproc()+131: { .mii
   0: ld4.acq r40 = [r36]
==>1: adds r8 = 0x1, r41 ;;
   2: cmp.eq p6, p0 = 0, r40
}
> dis kread+90 <== in bundle 90 of kread we have a branch to rdwr()
E000000000333B90 kread()+90: { .mib
   0: st8 [r11] = r9
   1: nop.i 0
   2: br.call.sptk.few rp = <rdwr()+0> ;;
}
> br -a 5 -t 2A71 kread <== set a break point after 5 kread for our tid
> br <== list break points
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
> go <== exit IADB
See Ya!
# <== hit enter, this will call 3 kread
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
# <== hit enter, this will call 3 kread
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
AIX/IA64 KERNEL DEBUGGER ENTERED Due to... <== after 5 kread we enter IADB
Break instruction interrupt.
IP->E000000000333B00 kread()+0: { .mii
==>0: alloc r35 = ar.pfs, 5, 0, 5, 0
   1: adds sp = -0xA0, sp
   2: mov r36 = rp ;;
}
> s <== we step one instruction at a time in bundle 1
IP->E0000000002E220 kread()+1: { .mii
==>0: alloc r35 = ar.pfs, 5, 0, 5, 0
   1: adds sp = -0xA0, sp
   2: mov r36 = rp ;;
}


> sb <== we step to next bundle (bundle 10)
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Break instruction interrupt.
IP->E0000000002E2230 kread()+10: { .mii
==>0: adds r8 = 0x18, sp
   1: adds r40 = 0x20, sp
   2: adds r9 = 0x28, sp
}
> stb <== we step to next branch that points to rdwr()
Another thread is currently stepping. To avoid
confusion, only one thread can be actively
stepped.
Would you rather step this thread? (y/n) y
IP->E0000000002E2620 rdwr()+0: { .mii
==>0: alloc r41 = ar.pfs, 11, 0, 6, 0
   1: adds sp = -0x50, sp
   2: mov r42 = rp ;;
}
> sr <== we return from rdwr() so we come back in kread in bundle A0
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Break instruction interrupt.
IP->E0000000002E22A0 kread()+80: { .mii
==>0: adds r9 = 0, r8
   1: nop.i 0 ;;
   2: cmp4.eq p6, p7 = 0, r9
> c all <== we clear all break points when the job is done
> br <== list break points
No Active Breakpoints
> go <== exit IADB
See Ya!
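Two details of the transcript are easy to miss. First, ps reports the thread and process ids in decimal (10865 and 8008), while br -t and the breakpoint listing use hexadecimal (2A71 and 1F48). Second, the displayed IP appears to carry the instruction slot in its low four bits: E0000000000884B1 is shown with the arrow on slot 1. The sketch below illustrates both conversions. The slot interpretation is an assumption inferred from the transcript and from the 16-byte size of an IA-64 bundle, and the function names are hypothetical.

```python
def to_debugger_hex(decimal_id):
    """Convert a decimal id printed by ps into the hex form used by the debugger."""
    return format(decimal_id, "X")


def split_ip(ip_text):
    """Split a displayed IP into (bundle address, slot number).

    Assumption: bundles are 16-byte aligned and the low 4 bits of the
    displayed IP hold the slot number (0, 1 or 2).
    """
    ip = int(ip_text, 16)
    return ip & ~0xF, ip & 0xF


if __name__ == "__main__":
    print(to_debugger_hex(10865), to_debugger_hex(8008))
    bundle, slot = split_ip("E0000000000884B1")
    print(format(bundle, "X"), slot)
```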


IADB dump/display/decode sub commands

Introduction The following table represents the dump/display/decode sub commands and their matching crash/lldb sub commands when available

d sub command The d sub command can be used to display virtual memory using the following parameters :

• address : address or symbol to dump
• ordinal : number of bytes per access (1, 2, 4, or 8)
• number : number of elements to dump (of size 'ordinal')
• none : continue dumping from the previous d sub command

dp sub command

The dp sub command can be used to display physical memory using :

• address : physical address to dump
• ordinal : number of bytes per access (1, 2, 4, or 8)
• count : number of elements to dump (of size 'ordinal')


dump/display/decode function    crash/lldb sub commands    IADB sub commands    iadb sub commands
display byte data    N/A    d (ordinal 1)    d (ordinal 1)
display word data    od (2 units)    d (ordinal 4)    d (ordinal 4)
display double word data    od (4 units)    d (ordinal 8)    d (ordinal 8)
display code    decode/od (format I)    dis    dis
display registers    b/cfm/fpr/iip/iipa/ifa/intr/ipsr/isr/itc/kr/p/perfr/r/rr/rse    b/cfm/fpr/iip/iipa/ifa/intr/ipsr/isr/itc/kr/p/perfr/r/rr/rse
display device byte    dio (ordinal 1)    dio (ordinal 1)
display device half word    dio (ordinal 2)    dio (ordinal 2)
display device word    dio (ordinal 4)    dio (ordinal 4)
display device double word    dio (ordinal 8)    dio (ordinal 8)
display physical memory    dp    ***TBD
display pci config space    dpci    ***TBD
find pattern    find
extract pattern


dio sub command

The dio sub command can be used to display the I/O space using the following parameters :

• port : I/O port address to dump
• ordinal : number of bytes per access (1, 2, 4, or 8)
• count : number of elements to dump (of size 'ordinal')

dis sub command The dis sub command can be used to list instructions at a given address using :

• address : address or symbol to disassemble
• count : number of bundles to disassemble

registers sub commands

The following sub commands can be used to display register information :

• b : Display Branch Register(s)
• cfm : Display Current Stacked Register
• fpr : Display FPR(s) (f0 - f127)
• iip : Display or Modify Instruction Pointer
• iipa : Display Instruction Previous Address
• ifa : Display Fault Address
• intr : Display Interrupt Registers
• ipsr : Display/Decode IPSR
• isr : Display/Decode ISR
• itc : Display Time Registers ITC ITM & ITV
• kr : Display Kernel Register(s)
• p : Display Predicate Register(s)
• perfr : Display Performance Register(s)
• r : Display General Register(s)
• rr : Display Region Register(s)
• rse : Display Register Stack Registers

dpci sub command

The dpci sub command can be used to display the configuration space of PCI devices using the following parameters :

• bus : Hardware bus number of target PCI bus
• dev : PCI Device Number of target PCI device
• function : PCI Function Number of target PCI device
• register : Configuration register offset to read
• ordinal : Size of access to make (1, 2, 4, 8)


Examples >CPU0> d dbg_avail <== Display Virtual memory address at dbg_avail
E00000000407D6E0: 0000000000000001

>CPU0> dp 0x1000 2 5 <== Display 5 half-words from physical address 0x1000

0000000000001000: 0000 0000 0000 0000 0000

>CPU0> dio 0x3f6 1 8 <== Display 8 bytes from port 0x3F6

00000FFFFC0FDBF6: 50FF000000006F60 P.....o‘

>CPU0> dis kread <== Disassemble from kread
E0000000002E2220 kread()+0: { .mii
   0: alloc r35 = ar.pfs, 5, 0, 5, 0
   1: adds sp = -0xA0, sp
   2: mov r36 = rp ;;
}

>CPU0> dpci 0 0x58 0 0x20 4 <== Display 4-byte word from PCI config register 0x20 for device 0x58, function 0, on bus 0

PCI Config Space Bus 0, Dev 0x58, Fnc 0:
reg 20: FFFFFFFF

>CPU0> d enter_dbg <== Display Virtual memory address at enter_dbg

E0000000040CF150: 0000000000000000

>CPU0> m enter_dbg 4 0x43 <== Modify enter_dbg with a 4-byte store of data 0x43

E0000000040CF150: 00000043


>CPU0> d enter_dbg <== Display Virtual memory address at enter_dbg

E0000000040CF150: 0000000000000043

>CPU0> dp 0x5000 <== Display physical memory at location 0x5000

0000000000005000: FFFFFFFFFFFFFFFF

>CPU0> mp 0x5000 8 0x1122334455667788 <== Modify Physical memory at location 0x5000 with 8-byte store of data 0x1122334455667788

0000000000005000: 1122334455667788

>CPU0> dp 0x5000 <== Display physical memory at location 0x5000

0000000000005000: 1122334455667788
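The relationship between the ordinal, the element count and the displayed line can be sketched as follows. This is a hypothetical formatter that mimics the layout of the dp output shown in the examples (a 16-digit address followed by elements of 2 x ordinal hex digits). It takes already-decoded element values, so it makes no assumption about byte order, and it is not the debugger's own code.

```python
VALID_ORDINALS = (1, 2, 4, 8)


def format_dump(address, values, ordinal):
    """Format element values the way the d/dp examples display them."""
    if ordinal not in VALID_ORDINALS:
        raise ValueError("ordinal must be 1, 2, 4 or 8")
    limit = 1 << (8 * ordinal)
    for value in values:
        if not 0 <= value < limit:
            raise ValueError("value does not fit in %d byte(s)" % ordinal)
    width = 2 * ordinal  # two hex digits per byte
    cells = " ".join(format(value, "0%dX" % width) for value in values)
    return "%016X: %s" % (address, cells)


if __name__ == "__main__":
    # Mirrors 'dp 0x1000 2 5': five half-words, 10 bytes in total.
    print(format_dump(0x1000, [0] * 5, 2))
    # Mirrors 'dp 0x5000' after the mp example: one 8-byte element.
    print(format_dump(0x5000, [0x1122334455667788], 8))
```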


IADB modify memory sub commands

Introduction The following table represents the modify memory sub commands and their matching crash/lldb sub commands when available

m sub command The m sub command can be used to modify the virtual memory contents using :

• addr : symbol or virtual address to modify
• ordinal : size of each data element (1, 2, 4, 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored

mp sub command

The mp sub command can be used to modify the physical memory contents with the following parameters :

• addr : physical address to modify
• ordinal : size of each data element (1, 2, 4, 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored
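The parameter order (address, ordinal, then one or more data elements) can be illustrated with a small helper that assembles a command line and rejects data that cannot fit in the chosen ordinal. This is a sketch for preparing input by hand or in a script; the helper name is hypothetical, and the range check is an assumption about what a sensible store looks like, not documented debugger behaviour.

```python
def build_modify(subcommand, target, ordinal, data):
    """Build an 'm', 'mp' or 'mio' command line: <cmd> <addr> <ordinal> <data...>."""
    if subcommand not in ("m", "mp", "mio"):
        raise ValueError("unsupported sub command: " + subcommand)
    if ordinal not in (1, 2, 4, 8):
        raise ValueError("ordinal must be 1, 2, 4 or 8")
    if not data:
        raise ValueError("at least one data element is required")
    limit = 1 << (8 * ordinal)
    for value in data:
        if not 0 <= value < limit:
            raise ValueError("0x%X does not fit in %d byte(s)" % (value, ordinal))
    # A target may be a symbol name (string) or a numeric address.
    target_text = target if isinstance(target, str) else hex(target)
    return " ".join([subcommand, target_text, str(ordinal)] + [hex(v) for v in data])


if __name__ == "__main__":
    print(build_modify("m", "enter_dbg", 4, [0x43]))
    print(build_modify("mp", 0x5000, 8, [0x1122334455667788]))
```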


modify memory function           crash/lldb subcommands    IADB subcommands    iadb subcommands
modify sequential bytes          alter -c                  m                   N/A
modify sequential word           alter -w                                      N/A
modify sequential double word    alter -l                                      N/A
modify registers                                           b/iip/kr/p/r/rr     N/A
modify device byte                                         mio                 N/A
modify device half word                                                        N/A
modify device double word                                                      N/A
modify physical memory                                     mp                  N/A


registers sub commands

The following sub commands can be used to modify register contents :

• b : Set Branch Register(s)
• iip : Modify Instruction Pointer
• kr : Set Kernel Register(s)
• p : Set Predicate Register(s)
• r : Set General Register(s)
• rr : Set Region Register(s)

mio sub command

The mio sub command can be used to modify I/O space using :

• addr : I/O port address to modify
• ordinal : size of each data element (1, 2, 4, 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored

Examples >CPU0> b <== Display branch registers
b00:E00000000008E050 waitproc()+1B0
b01:BADC0FFEE0DDF00D
b02:BADC0FFEE0DDF00D
b03:BADC0FFEE0DDF00D
b04:BADC0FFEE0DDF00D
b05:BADC0FFEE0DDF00D
b06:E00000000008DEA0 waitproc()+0
b07:BADC0FFEE0DDF00D

>CPU0> iip <== Display instruction pointer
IIP : E00000000008E000:waitproc()+160


>CPU0> kr <== Display all kernel registers
kr0:00000FFFFC000000
kr1:0000000000000000
kr2:0000000000000000
kr3:0000000000000000
kr4:C000006013220000
kr5:0200000000000000
kr6:C00000601324CCC0
kr7:C000006013200000

>CPU0> p <== Display all predicate registers
p00:1 p16:0 p32:0 p48:0
p01:0 p17:0 p33:0 p49:0
p02:0 p18:0 p34:0 p50:0
p03:0 p19:0 p35:0 p51:0
p04:0 p20:0 p36:0 p52:0
p05:0 p21:0 p37:0 p53:0
p06:0 p22:0 p38:0 p54:0
p07:1 p23:0 p39:0 p55:0
p08:0 p24:0 p40:0 p56:0
p09:0 p25:0 p41:0 p57:0
p10:0 p26:0 p42:0 p58:0
p11:0 p27:0 p43:0 p59:0
p12:0 p28:0 p44:0 p60:0
p13:0 p29:0 p45:0 p61:0
p14:0 p30:0 p46:0 p62:0
p15:0 p31:0 p47:0 p63:0


>CPU0> r <== Display all general registers
r00:BADC0FFEE0DDF00D [0] r16:E00000971404B008 [0]
r01:E000000004002818 [0] r17:00000000C0000000 [0]
r02:BADC0FFEE0DDF00D [0] r18:0000000000000014 [0]
r03:BADC0FFEE0DDF00D [0] r19:0000000040000000 [0]
r04:BADC0FFEE0DDF00D [0] r20:E00000971404D000 [0]
r05:BADC0FFEE0DDF00D [0] r21:0000000000000000 [0]
r06:BADC0FFEE0DDF00D [0] r22:0000000000000000 [0]
r07:BADC0FFEE0DDF00D [0] r23:E00000971404D008 [0]
r08:0000000000000000 [0] r24:0000000000000000 [0]
r09:0000000000000000 [0] r25:BADC0FFEE0DDF00D [0]
r10:0000000000000002 [0] r26:BADC0FFEE0DDF00D [0]
r11:0000000080000000 [0] r27:BADC0FFEE0DDF00D [0]
r12:0003FEFFF3FFF7C0 [0] r28:BADC0FFEE0DDF00D [0]
r13:E00000971405C600 [0] r29:BADC0FFEE0DDF00D [0]
r14:E00000971404C02C [0] r30:BADC0FFEE0DDF00D [0]
r15:E00000971404B028 [0] r31:BADC0FFEE0DDF00D [0]
r32:C000006013200000 [0]
r33:C000006013200290 [0]
r34:E00000971404B11C [0]
r35:E00000971404B120 [0]
r36:E0000000040C6060 [0]
r37:E0000000040C6068 [0]
r38:0000000000000186 [0]
r39:0000000000000009 [0]
r40:0000000000000001 [0]
r41:0000000000000001 [0]


>CPU0> rr <== Display all region registers

rr0:0000000000480931

rr1:0000000000200431

rr2:0000000000280531

rr3:0000000000000030

rr4:0000000000000030

rr5:0000000000180331

rr6:0000000000100269

rr7:0000000000080131
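The raw region register values above can be decoded by hand. The sketch below assumes the field layout defined by the IA-64 architecture (bit 0 = ve, bits 2 to 7 = ps, the preferred page size as a power of two, and bits 8 to 31 = rid, the region identifier). That layout comes from the architecture definition rather than from this guide, and the function names are hypothetical.

```python
def decode_region_register(value):
    """Split a region register value into its ve, ps and rid fields."""
    return {
        "ve": value & 0x1,              # VHPT walker enable
        "ps": (value >> 2) & 0x3F,      # preferred page size, log2(bytes)
        "rid": (value >> 8) & 0xFFFFFF, # region identifier
    }


def page_size_bytes(ps):
    """Convert the ps field into a page size in bytes."""
    return 1 << ps


if __name__ == "__main__":
    # Values copied from the rr output above.
    for name, raw in (("rr0", 0x480931), ("rr3", 0x30), ("rr6", 0x100269)):
        fields = decode_region_register(raw)
        print(name, fields, page_size_bytes(fields["ps"]))
```

Under this layout rr0 decodes to region id 0x4809 with a preferred page size of 2^12 bytes, while rr6 decodes to region id 0x1002 with a preferred page size of 2^26 bytes.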

>CPU0> mio 0x408 8 0 <== Modify I/O port 0x408 with 8-byte store of data 0


IADB name list/symbol sub commands

The following table represents the name list/symbol sub commands and their matching crash/lldb sub commands when available

map sub command

The map sub command can be used to translate a symbol into an address and the reverse, and so accepts either of the following as a parameter :

• symbol : symbol to show address for
• address : address to show symbol for

Examples >CPU0> map (r34) <== Lookup symbol for address in r34

>CPU0> map 0xe000000000000000 <== Lookup symbol for address 0xe000000000000000

>CPU0> map foo+0x100 <== Lookup symbol for symbol ‘foo’+0x100

name list symbol function    crash/lldb sub commands    IADB sub commands    iadb sub commands
translate symbol to eaddr    nm                         map                  map
no symbol mode (toggle)
translate eaddr to symbol    ts/ds                      map                  map
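The two directions of the map sub command can be sketched with a sorted symbol table. The code below is an illustration only: it assumes that an address belongs to the nearest symbol that starts at or below it, which ignores symbol sizes. The two table entries are taken from addresses that appear elsewhere in this unit (waitproc()+0 and kread()+0); everything else is hypothetical.

```python
import bisect

# (start address, name) pairs, sorted by address. Both values appear in
# earlier examples of this unit; a real table would hold many more entries.
SYMBOLS = [
    (0xE00000000008DEA0, "waitproc"),
    (0xE000000000333B00, "kread"),
]


def address_to_symbol(address, symbols=SYMBOLS):
    """Return 'name()+OFFSET' for the nearest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below every known symbol
    start, name = symbols[index]
    return "%s()+%X" % (name, address - start)


def symbol_to_address(expression, symbols=SYMBOLS):
    """Resolve 'name' or 'name+0xOFFSET' into an address."""
    name, _, offset = expression.partition("+")
    table = {sym: start for start, sym in symbols}
    return table[name.strip()] + (int(offset, 16) if offset else 0)


if __name__ == "__main__":
    print(address_to_symbol(0xE00000000008E000))
    print(format(symbol_to_address("kread+0x90"), "X"))
```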


IADB watch break point sub commands

Introduction The following table represents the watch break point sub commands and their matching crash/lldb sub commands when available

dbr sub command

The dbr sub command can be used to set a break point on data access using :

• action : the action to watch for :
  • r = Break on Read
  • w = Break on Write
  • rw = Break on Read or Write

• mask : bit mask of which address bits to match
• plvl_mask : bit mask of which privilege levels to match
  • 0x1 = CPL 0 (Kernel)
  • 0x2 = CPL 1 (unused)
  • 0x4 = CPL 2 (unused)
  • 0x8 = CPL 3 (User)

• addr : the address to trigger on
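Each privilege level in plvl_mask corresponds to one bit, so the mask for a set of levels is the bitwise OR of 1 shifted left by each level. A minimal sketch follows; the function name is hypothetical.

```python
def plvl_mask(levels):
    """Build a plvl_mask value from an iterable of privilege levels (0 to 3)."""
    mask = 0
    for level in levels:
        if level not in (0, 1, 2, 3):
            raise ValueError("privilege level must be 0..3")
        mask |= 1 << level  # CPL n maps to bit n
    return mask


if __name__ == "__main__":
    print(hex(plvl_mask([0])))     # kernel only
    print(hex(plvl_mask([3])))     # user only
    print(hex(plvl_mask([0, 3])))  # kernel and user
```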

cdbr sub command

The cdbr sub command can be used to clear previously set data break points using :

• index : index of DBR breakpoint (from dbr cmd)
• all : clear all DBRs


watch break point function    crash/lldb sub commands    IADB sub commands    iadb sub commands
stop on read data                                        dbr r                N/A
stop on write data                                       dbr w                N/A
stop on r/w data                                         dbr rw               N/A
local stop on read data                                                       N/A
local stop on write data                                                      N/A
local stop on r/w data                                                        N/A
clear watch                                              cdbr                 N/A
local clear watch                                                             N/A


Examples >CPU0> dbr <== Display all current breakpoints

>CPU0> dbr foo <== Break on access to ‘foo’

>CPU0> dbr -t foo <== Break on any access to ‘foo’ for current thread

>CPU0> cdbr 3 <== Clear DBR in slot 3

>CPU0> cdbr 0xe000000000011cc0 <== Clear DBR at address 0xe000000000011cc0


IADB machine status sub commands

Introduction The following table represents the machine status sub commands and their matching crash/lldb sub commands when available

sys sub command

The sys sub command will display the following information :

• Build level and build date
• Number and type of processors
• Memory size
• Processor Speed
• Bus Speed

reason sub command

The reason sub command will display the reason why the debugger was entered, along with the IP and the assembly code of the bundle at that IP.


machine status function    crash/lldb sub commands    IADB sub commands    iadb sub commands
system status message    stat    sys+reason
switch thread


Examples >CPU1> sys <== Display system information
Kernel : AIX 0036E_500IA, Built on Sep 27 2000 at 14:52:02
Memory : 1023MB
Processors : 2 Itanium, Stepping 0
Proc Speed : 665374960 HZ
Bus Speed : 133074992 HZ
>CPU1> reason <== Display reason debugger was entered
Debugger entered via keyboard with key in SERVICE position using numpad 4
IP->E00000000008E000 waitproc()+160: { .mii
==>0: alloc r35 = ar.pfs, 5, 0, 5, 0
   1: adds sp = -0xA0, sp
   2: mov r36 = rp ;;
}
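The speed lines of the sys output are plain "label : value HZ" pairs, so they are easy to post-process. The sketch below parses the two speed lines from the example and derives the processor-to-bus clock ratio; the function names are hypothetical and the line format is assumed to match the sample exactly.

```python
def parse_speed(line):
    """Parse a line such as 'Proc Speed : 665374960 HZ' into (label, hertz)."""
    label, _, rest = line.partition(":")
    return label.strip(), int(rest.split()[0])


def clock_ratio(proc_hz, bus_hz):
    """Return how many processor cycles elapse per bus cycle."""
    return proc_hz / bus_hz


if __name__ == "__main__":
    _, proc = parse_speed("Proc Speed : 665374960 HZ")
    _, bus = parse_speed("Bus Speed : 133074992 HZ")
    print(proc / 1000000.0, "MHz processor,", bus / 1000000.0, "MHz bus")
    print("ratio:", clock_ratio(proc, bus))
```

For the sample values the ratio is exactly 5, since 5 x 133074992 = 665374960.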


IADB kernel extension loader sub commands

Introduction The following table represents the kernel extension loader sub commands and their matching crash/lldb sub commands when available

kext The kext sub command will display all loaded kernel extensions and their text and data load addresses

ldsyms and unldsyms sub commands

The ldsyms and unldsyms sub commands will load or unload the symbols of a kernel extension using :

• -p [path] : where path is the absolute file path of the kernel extension
• module : the module name

examples (0) kext <== list loaded kernel extensions
..
Name : /usr/lib/drivers/isa/kbddd
TextMapped: 0xE000009729630000 to 0xE000009729645FFF, Size: 0x00016000
DataMapped: 0xE000009729660000 to 0xE000009729665FFF, Size: 0x00006000
UnwindTBL: 0xE000009729644BA8 to 0xE0000097296453E7, Size: 0x00000840
TextStart: 0xE000009729630120 Load count: 2 Use count: 0
..
(0) nm kbdconfig <== try to get address for kbdconfig symbol
Symbol not found
(0)>ldsyms kbddd <== load kbddd symbols
(0)>nm kbdconfig <== now nm should work
kbdconfig : e000009729639560
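The ranges printed by kext are inclusive, so each Size value equals end - start + 1. The same arithmetic confirms that the address returned for kbdconfig falls inside the text mapping of kbddd. A minimal sketch with hypothetical function names:

```python
def mapped_size(start, end):
    """Size in bytes of an inclusive address range, as reported by kext."""
    return end - start + 1


def in_range(address, start, end):
    """True when address lies inside the inclusive range [start, end]."""
    return start <= address <= end


if __name__ == "__main__":
    text = (0xE000009729630000, 0xE000009729645FFF)
    print(hex(mapped_size(*text)))
    # Address printed by 'nm kbdconfig' after ldsyms in the example above.
    print(in_range(0xE000009729639560, *text))
```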

kernel extension loader function    crash/lldb sub commands    IADB sub commands    iadb sub commands
list loaded extension    le    kext/ldsyms/unldsyms
list loaded symbol tables
remove symbol table
list export tables


IADB address translation sub commands

Introduction The following table represents the address translation sub commands and their matching crash/lldb sub commands when available

parameters x addr, where addr = symbol or virtual address to translate

Examples >CPU0> x foo+0x4000 <== Display the physical translation for foo+0x4000

>CPU0> x 0x20000000 <== Display the physical translation for virtual address 0x20000000

>CPU0> x (r1) <== Display the physical translation for the address in r1

address translation function    crash/lldb sub commands    IADB sub commands    iadb sub commands
translate to real address    x
display MMU translation


IADB process/thread sub commands

Introduction The following table represents the process/thread sub commands and their matching crash/lldb sub commands when available

ppda The ppda sub command will display the Per Processor Descriptor Area and accepts the following parameter :

• cpu : which CPU's ppda to display (logical numbering)

mst The mst sub command will display a Machine State Save Area (mst) using :

• addr : address of an MST to display

pr The pr sub command will display process information using :

• -p {value} : for the process where PID = {value}
• -s {value} : for the process in slot {value}
• -v {value} : for proc struct pointer = {value}
• -a : detailed display for all processes
• * : process table display


process function    crash/lldb sub commands    IADB sub commands    iadb sub commands
display per processor data area    ppd    ppda    ppda
display interrupt handler
display mst area    mst    mst    mst
display process table    proc    pr    pr
display thread table    th    th    th
display thread tid    th    th    th
display thread pid
display user area    user/du    us    us
display run queue    rq
display sleep queue    sq


th The th sub command will display thread information related to :

• -s {slot} : detailed thread info for thread in 'slot'
• -t {tid} : detailed thread info for thread 'tid'
• -v {thrdptr} : detailed thread info for thread pointer 'thrdptr'
• -a : detailed thread info for all threads
• * : display thread table

us The us sub command will display user structure information for :

• -p : process id (PID)
• -t : Thread id (TID)
• * : All processes

rq The rq sub command will display the run queue information related to :

• -b {bucket} : detailed info for threads in bucket of all run queue slots
• -g : global info for run queues
• -q [ number ] : detailed info for all queues
• -v {address} : detailed info for threads at run queue address

sq The sq sub command will display the sleep queue information related to :

• -b {bucket} : detailed info for threads in 'bucket'
• -v {address} : detailed info for threads at sleep queue 'address'

Examples ***TBD


IADB LVM sub commands

Introduction The following table represents the LVM sub commands and their matching crash/lldb sub commands when available

parameters

Examples

LVMfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display physical buffer

display volume group

display physical volume

display logical volume


IADB SCSI sub commands

Introduction The following table represents the scsi sub commands and their matching crash/lldb sub commands when available

parameters

Examples

SCSIfunction

crash/lldb sub

commands

IADB sub commands

iadb sub commands

display ascsi

display vscsi

display scdisk


IADB memory allocator sub commands

Introduction The following table represents the memory allocator sub commands and their matching crash/lldb sub commands when available

parameters

Examples

memory allocatorfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display kernel heap

display kernel xmalloc xmalloc xmalloc xmalloc

display heap debug

display kmem buckets

display kmem statistics


IADB file system sub commands

Introduction The following table represents the file system sub commands and their matching crash/lldb sub commands when available

parameters

Examples

file systemfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display buffer

display buffer hash table

display freelist

display gnode

display gfs

display file file

display inode inode

display inode hash table

display inode cache list

display rnode

display vnode vnode vnode vnode

display vfs vfs vfs vfs

display specnode

display devnode

display fifo node

display hnode hash table


IADB system table sub commands

Introduction The following table represents the system table sub commands and their matching crash/lldb sub commands when available

dev The dev sub command will display the device switch table using :

• major : major number slot to display

iplcb The iplcb sub command will display the IPL control block

Examples

system tablefunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display var var

display devsw table devsw dev dev

display system timer request blocks

display simple lock lock -s

display complex lock lock -c

display ipl proc information iplcb iplcb

display trace buffer


IADB network sub commands

Introduction The following table represents the network sub commands and their matching crash/lldb sub commands when available

parameters

Examples

networkfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display interface netstat

display TCBs

display UDBs

display sockets sock

display TCP CB

display mbuf mbuf


IADB VMM sub commands

Introduction The following table represents the VMM sub commands and their matching crash/lldb sub commands when available


VMMfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

VMM kernel segment data

VMM RMAP vmm-rmap

VMM control variables

VMM statistics

VMM Addresses

VMM paging device table

vmm-pdt

VMM segment control blocks

vmm-scb

VMM PFT entries vmm-pft

VMM PTE entries vmm-pte

VMM PTA segment vmm-pta

VMM STAB

VMM segment register sr64

VMM segment status segst64 u -64 u -64

VMM APT entries vmm-apt

VMM wait status

VMM address map entries

vmm-ame

VMM zeroing kproc

VMM error log

VMM reload xlate table

IPC information vmm-sem/shm

VMM lock anchor/tblock

VMM lock hash table

VMM lock word

VMM disk map

VMM spin locks


parameters

Examples


IADB SMP sub commands

Introduction The following table represents the SMP sub commands and their matching crash/lldb sub commands when available

cpu The cpu sub command displays or changes the current CPU you are working on, using:

• num : logical CPU number to switch to

Examples
>CPU0> cpu 1 <== Switch the debug process to processor 1
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Debugger entered via MPC stop
IP->E00000000008C7F2 waitproc_find_run_queue()+F2: { .mii
 0: adds r20 = 0x1, r10
 1: shr.u r19 = r11, r10 ;;
==>2: and r21 = r17, r19
}
>CPU1>

SMPfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

Start cpu

Stop cpu

Switch to cpu cpu cpu cpu


IADB block address translation sub commands

Introduction The following table represents the block address translation sub commands and their matching crash/lldb sub commands when available

parameters

Examples

block address translationfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

display dbats

display ibats

modify dbats

modify ibats


IADB bat/brat sub commands

Introduction The following table represents the bat/brat sub commands and their matching crash/lldb sub commands when available

parameters

Examples

bat/brat function

crash/lldb sub commands

IADB sub commands

iadb sub commands

branch target

clear branch target

local branch target

clear local branch target


IADB miscellaneous sub commands

Introduction The following table represents the miscellaneous sub commands and their matching crash/lldb sub commands when available

help sub command

The help sub command can be used without a parameter to display the command listing, or with a command as a parameter to display help for that command.

kdbx sub command

The kdbx sub command is used to set the symbol needed to use kdb with the kdbx interface. The following variables are set by kdbx and modify the output of certain sub commands:

• kdbx_addrd : Display breakpoint address instead of symbol name
• kdbx_bindisp : Display output in binary format instead of ASCII format

go sub command The go sub command is used to leave KDB. This starts the dump process if KDB was entered while the system was crashing.


miscellaneousfunction

crash/lldb sub commands

IADB sub commands

iadb sub commands

reboot the machine

display help help/? help

run an aix command !

set kdbx compatibility kdbx

exit go

set debugger parameters set set set

display elapsed time

enable/disable debug

calculate/convert an hexadecimal expression

calc

calculate/convert a decimal expression


set sub command

The set sub command can be used to set or display the following kdb parameters:

• rows=number : set number of rows on current display
• mltrace={on|off} : mltrace on/off; only on DEBUG kernel
• sctrace={on|off} : verbose syscall prints on/off; only on DEBUG kernel
• itrace={on|off} : enable/disable tracing on/off; only on DEBUG kernel
• umon={on|off} : enable/disable umon performance tool
• exectrace={on|off} : verbose exec prints on/off; only on DEBUG kernel
• excpenter={on|off} : debugger entry on exception on/off
• ldrprint={on|off} : verbose loader prints on/off; only on DEBUG kernel
• kprintvga={on|off} : kernel prints to VGA on/off
• dbgtty={on|off} : use debugger TTY as console on/off
• dbgmsg={on|off} : Tee Console and LED output to TTY
• hotkey={on|off} : enter debugger on key press on/off; only on DEBUG kernel

Examples


Exercise

Introduction In this exercise you will configure the system to enable the live debugger and invoke both the live and image debugger for your system.

Complete the following steps:


Step Action Reference

1. Enable the Memory Overlay Detection System (MODS) using the bosdebug command.

2. Enable the live debugger with the bosboot command.

3. Reboot the system, and login as root.

4. Verify MODS is enabled with the debugger.

> stat

xmalloc debug: ________________

5. Verify the debugger is available:

PowerPC:
kdb> dw kdb_avail
> q

IA-64:
iadb> d dbg_avail
> go

6. Execute the following truss command:

# truss -t kread -i ksh

Hit the enter key. How many kread functions were executed? __________

Enter the exit command to exit truss:

# exit


7. Change directory to /var/adm/ras.

8. Start the image debugger against the crash dump captured in the previous lesson.

9. Execute the following commands:

• iadb: reason

Why was the debugger entered?

___________________________

• kdb: p * or iadb: pr *

What is the process id for the errdemon?

____________________________

• Execute the ls command:

kdb: !ls or iadb: ! ls

• iadb: sys

What build of AIX5L was the crash dump taken on?

__________________________

10. Exit the debugger: q

11. Enter the live debugger:

Ctrl-Alt-NUMPAD4

12. Enter the cpu command. What is the status of CPU0?________________________________

13. Exit the live debugger.


Unit 7. Process Management

Platform This lesson is independent of platform.

Lesson Objectives At the end of the lesson you will be able to:

• List and describe the states of a process.

• List the steps taken by the kernel to create a new process as the result of a fork() system call, and the steps taken to create a new thread of execution.

• Describe what happens when a process terminates.

• List the three thread models available in AIX 5.

• Identify the relationship between the internal structures proc, thread, user and u_thread.

• Use the kernel debugging tool to locate and examine processes, proc, thread, user and u_thread data structures.

• Manage process scheduling using available commands, manage processes and threads on a SMP system (to best employ cache affinity scheduling), and manage processes on a ccNUMA system (to best employ quad affinity scheduling).

• List the factors determining what action the threads of a process will take when a signal is received.

• Write simple C programs that use the fork() system call to spawn new processes, the wait() system call to retrieve the exit status of a child process, the pthread_create() subroutine to create a simple multi-threaded program, and the exec() system call to load a new program into memory.


Process Management Fundamentals

Process definition

A process can be defined by the list of items which builds it. A process consists of:

• A process table entry

• A process ID (PID)

• Virtual address space

- User-area (U-area)

- Program “text”

- Data

- User and kernel stacks

• Statistical information

Definition of process management

Process management consists of the tools and ability to have many processes and threads existing simultaneously in a system, and to share usage of the CPU or, in a SMP system, CPUs. Process management also includes the ability to start, stop, and force a stop of a process.

The tools and information used to manage the processes

• A process is a self-contained entity that consists of the information required to run a single program, such as a user application.

• The kernel contains a table entry for each process called the proc entry.

• The proc entry contains information necessary to keep track of the current state and location of page tables for the process.

• The proc entry resides in a slot in an array of proc entries.

• The kernel is configured with a fixed number of slots.

• All processes have a process ID or PID.

• The PID is assigned when the process is created and provides a convenient way for users to refer to the other processes.

• The process contains a list of virtual memory addresses that the process is allowed to access.

• The user-area (u_area) of a process contains additional information about the process when it is running.

• The kernel tracks statistical information for the process, such as the amount of time the process uses the CPU, the amount of memory the process is using, etc. The statistical information is used by the kernel for managing its resources and for accounting purposes.


Process operations fork() system call

Process operations

Four basic operations define the lifetime of a process in the system:

• fork - Process creation

• exec - Loading of programs in process

• exit - Death of process

• wait - The parent process notification of the death of the child process.
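The lifetime operations listed above can be seen together in one small function. The following is a minimal sketch, not taken from the course material: the helper name run_child_and_wait is hypothetical, and waitpid() with the WIFEXITED and WEXITSTATUS macros is used as one standard POSIX way to perform the wait step. The exec step is left out because the child terminates immediately.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the course material): create a child
 * that terminates with exit_code and return the exit status collected
 * by the parent. Returns -1 on failure or abnormal termination.
 */
int run_child_and_wait(int exit_code)
{
    int status;
    pid_t reaped;
    pid_t child = fork();          /* fork: process creation */

    if (child < 0)
        return -1;                 /* fork failed, no child exists */

    if (child == 0) {
        /* Child: a real program would typically call exec() here. */
        _exit(exit_code);          /* exit: death of the process */
    }

    /* Parent: wait is notified of the death of the child. */
    reaped = waitpid(child, &status, 0);
    if (reaped < 0)
        return -1;
    assert(reaped == child);       /* we asked for this child only */

    /* Decode a normal termination into the child's exit code. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from a main() function, run_child_and_wait(7) should return 7 on a POSIX system, because the parent reads back the value the child passed to _exit().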

Fork new processes

The fork system call is the way to create a new process

• All processes in the system (except the boot process) are created from other processes through the fork mechanism.

• All processes are descendants of the init process (process 1).

• A process that forks creates a child process that is nearly a duplicate of the original parent process.

• The child has a new proc entry (slot), PID, and registers.

• Statistical information is reset, and the child initially shares most of the virtual memory space with the parent process.

• The child process initially runs the same program as the parent process. The child may use the exec() call to run another program.

The fork() system call

The parent process has an entry in the process and thread table before the fork() system call; after the fork() system call, another independent process is created with entries in the Process and Thread tables.

[Figure: the fork() system call. A parent process issues the fork() system call; in the AIX kernel, the process table and the thread table each hold a parent entry and a new child entry.]


Inherited attributes after a fork() system call

The illustration shows what happens when the fork() system call is issued. The caller creates a child process that is almost an exact copy of itself. The child process inherits many attributes of the parent, but receives a new user block and data region.

The child process inherits the following attributes from the parent process:

• Environment

• Close-on-exec flags and signal handling settings

• Set user ID mode bit and Set group ID mode bit

• Profiling on and off status

• Nice value

• All attached shared libraries

• Process group ID and tty group ID

• Current directory and Root directory

• File-mode creation mask and File size limit

• Attached shared memory segments and Attached mapped file segments

• Debugger process ID and multiprocess flag, if the parent process has multiprocess debugging enabled (described in the ptrace subroutine).
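One of the inherited attributes, the environment, is easy to observe from a program. The sketch below is illustrative only: the helper name child_inherits_env and the variable name used in the usage note are hypothetical, and setenv() is assumed to be available as in POSIX. The parent sets a variable, forks, and the child reports through its exit status whether it sees the same value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a forked child sees the value that
 * the parent set for the environment variable `name`, 0 if it does not,
 * and -1 on error.
 */
int child_inherits_env(const char *name, const char *value)
{
    int status;
    pid_t child;

    assert(name != NULL && value != NULL);   /* caller error otherwise */

    if (setenv(name, value, 1) != 0)         /* set in the parent only */
        return -1;

    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: the environment was copied from the parent at fork(). */
        const char *seen = getenv(name);
        _exit(seen != NULL && strcmp(seen, value) == 0 ? 0 : 1);
    }

    /* Parent: the child's exit status carries the result of the check. */
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

For example, child_inherits_env("FORK_DEMO_VAR", "hello") is expected to return 1. The same pattern could be adapted to check other inherited attributes, such as the current directory.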


Attributes not inherited from the parent process

Not all attributes are inherited from the parent. The child process differs from the parent process in the following ways:

• The child process has only one user thread; it is the one that called the fork subroutine, no matter how many threads the parent process had.

• The child process has a unique process ID.

• The child process ID does not match any active process group ID.

• The child process has a different parent process ID.

• The child process has its own copy of the file descriptors for the parent process. However, each file descriptor of the child process shares a common file pointer with the corresponding file descriptor of the parent process.

• All semadj values are cleared.

• Process locks, text locks, and data locks are not inherited by the child process.

• If multiprocess debugging is turned on, the trace flags are inherited from the parent; otherwise, the trace flags are reset.

• The child process utime, stime, cutime, and cstime are set to 0.

• Any pending alarms are cleared in the child process.

• The set of signals pending for the child process is initialized to the empty set.

• The child process can have its own copy of the message catalogue for the parent process.
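The statement above that parent and child file descriptors share a common file pointer can be checked with a short experiment. This is a sketch under stated assumptions: the helper name offset_after_child_write and the temporary file name are hypothetical, and mkstemp() is assumed to be available as in POSIX. Only the child writes to the file, yet the parent's offset moves as well.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the parent's file offset after the child
 * has written nbytes through its inherited copy of the descriptor, or
 * -1 on error.
 */
long offset_after_child_write(size_t nbytes)
{
    char path[] = "/tmp/forkdemoXXXXXX";   /* illustrative temp file name */
    int fd, status;
    size_t i;
    off_t pos;
    pid_t child;

    assert(nbytes <= 4096);                /* keep the demonstration small */

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* file goes away when closed */

    child = fork();
    if (child < 0) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        /* Child: write through its own copy of the descriptor. */
        for (i = 0; i < nbytes; i++)
            if (write(fd, "x", 1) != 1)
                _exit(1);
        _exit(0);
    }

    if (waitpid(child, &status, 0) != child) {
        close(fd);
        return -1;
    }

    /* The parent never wrote, but the shared file pointer has advanced. */
    pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return (long)pos;
}
```

If the descriptors did not share a file pointer, the parent's offset would still be 0 after the child finished writing.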


The fork() system call code example

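The following code illustrates the use of the fork() system call. After the call, two processes execute the same code, each in its own copy of the address space. A process can determine whether it is the parent or the child from the value returned by fork(): the parent receives the PID of the child, the child receives 0, and a negative value indicates that the fork failed.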

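#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int statuslocation;
    pid_t proc_id, proc_id2;

    proc_id = fork();
    if (proc_id < 0) {
        /* fork failed: no child process was created */
        printf("fork error \n");
        exit(-1);
    }
    if (proc_id > 0) {
        /* Parent process waiting for child to terminate */
        proc_id2 = wait(&statuslocation);
    }
    if (proc_id == 0) {
        /* I'm the child process */
        /* ... child's work goes here ... */
    }
    return 0;
}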


Listing processes with the ps command after fork()

Executing the test program creates two processes, which can be listed with the ps command. The program name in the example is fork and that name is listed as the command for both the parent and the child. Note that the child’s PPID is equal to the PID of the parent.

F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD

240001 A 0 10346 10236 0 60 20 5b8b 496 pts/1 0:00 ksh

200001 A 0 10742 10346 0 68 24 9bb3 44 pts/1 0:00 fork

1 A 0 10990 10742 0 68 24 dbbb 44 pts/1 0:00 fork

Processes without the parent process

The previous example showed how the PID of the calling process becomes the PPID of the child process. This example shows what happens if the parent process terminates before the child does. If we rewrite the program so that the parent process terminates after fork() without waiting for the child, the system replaces the child's PPID with 1, the PID of the init process. The init process then picks up the SIGCHLD signal when the child terminates, so that the system can free the process table entry even though the original parent no longer exists. This situation is shown below:

F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD

240001 A 0 10346 10236 0 60 20 5b8b 496 pts/1 0:00 ksh

40001 A 0 10996 1 0 68 24 8330 44 pts/1 0:00 fork

200001 A 0 11216 10346 3 61 20 dbbb 244 0:00 ps

Zombie processes

If, for some reason, no process receives the SIGCHLD signal from the child, the child's slot remains allocated in the process table, even though its other resources have been released. Such a process is called a zombie, and is listed by ps as <defunct>. The example below shows some of these zombie processes:

F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD

200003 A 0 1 0 0 60 20 500a 704 - 0:03 init

240401 A 0 2502 1 0 60 20 d2da 40 - 0:00 uprintfd

240001 A 0 2622 2874 0 60 20 2965 5208 - 0:46 X

40001 A 0 2874 1 0 60 20 c959 384 - 0:00 dtlogin

50005 Z 0 3776 1 1 68 24 0:00 <defunct>

40401 A 0 3890 1 0 60 20 91d2 480 - 0:00 errdemon

240001 A 0 4152 1 0 60 20 39c7 88 - 0:21 syncd

240001 A 0 4420 4648 0 60 20 4b29 220 - 0:00 writesrv

240001 A 0 4648 1 0 60 20 b1d6 308 - 0:00 srcmstr

50005 Z 0 10072 1 0 68 24 0:00 <defunct>

50005 Z 0 10454 1 0 68 24 0:00 <defunct>


Process operations exec() system call

Exec system call to load a new program

The exec subroutine does not create a new process; it loads a new program into the process.

• To execute a new program, a process uses the exec set of system calls to load the new program into memory and execute the program.

• Each program can successively exec other programs to load and execute in the process.
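Because exec does not create a process, programs that want to run another program and continue afterwards usually combine it with fork. The following is a minimal sketch of that pattern, assuming a POSIX shell is installed at /bin/sh; the helper name run_shell_command is hypothetical and is not part of the course material.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `command` with /bin/sh in a child process
 * and return its exit status, or -1 on error.
 */
int run_shell_command(const char *command)
{
    int status;
    pid_t child;

    assert(command != NULL);            /* caller error otherwise */

    child = fork();                     /* new process, same program */
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: replace the program image; the PID stays the same. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                     /* reached only if execl() failed */
    }

    /* Parent: still runs the original program and collects the status. */
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_shell_command("exit 3") is expected to return 3. The value 127 for a failed exec follows a common shell convention and is a design choice of this sketch.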

Valid program files for the exec() system call

The fork() system call creates a new process with a copy of the environment, and the exec() system call loads a new program into the current process, and overlays the current program with a new one (which is called the new-process image). The new-process image file can be one of three file types:

• An executable binary file in XCOFF file format.

• An executable text file that contains a shell procedure.

• A file that names an executable binary file or shell procedure to be run.

Inherited attributes after the exec() system call

The new-process image inherits the following attributes from the calling process image: session membership, PID, PPID, supplementary group IDs, process signal mask, and pending signals.


The exec() system call

The illustration shows how the process and thread tables remain unchanged after the exec() system call.

[Figure: the exec() system call. Labels: Parent Process, exec() system calls, Process Table, Thread Table.]


The exec() system call code example

The following code illustrates the use of the execv() system call. After a successful call, the current program is overlaid with the new program. To illustrate this, the output from the program is listed after the program source.

The program first defines two variables: Path, a pointer to the name of the program to be executed, and argumentp, an array of pointers to the arguments (by convention, the first argument passed is the program name itself). The program source for sleeping.c is not supplied, as any program can be used for this example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int returncode;
char *argumentp[4], arg1[50], arg2[50], arg3[50];
const char *Path = "/home/olc/prog/thread/sleeping";

int main(int argc, char **argv)
{
    strcpy(arg1, "/home/olc/prog/thread/sleeping");
    strcpy(arg2, "test param 1");
    strcpy(arg3, "test param 2");
    argumentp[0] = arg1;
    argumentp[1] = arg2;
    argumentp[2] = arg3;
    argumentp[3] = NULL;    /* the argument list must end with a null pointer */

    printf("before execv \n");
    returncode = execv(Path, argumentp);

    /* Reached only if execv() failed; on success the program is replaced */
    printf("after execv \n");
    exit(0);
}

and the program output:

before execv

I’m the sleeping process


The exec() system call

While the program in the example is being executed, we can examine the process status with the ps command. Notice that the program name for the example is “exec”, and the program name for the called program is “sleeping”. As the listing from the ps command shows, the current program is replaced with the new one, and we never reach the print statement "after execv \n". The program prints “I’m the sleeping process” because the main program has been replaced with the program named in the Path variable. If we look closer at the output from the ps -l command before and after the system call, we can tell that the program name has been replaced, but the PID and PPID remain the same.

Before the exec() system call takes place:

#> ps -l
     F S UID   PID  PPID C PRI NI ADDR  SZ TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b 492 pts/1 0:00 ksh
200001 A   0 10696 10346 2  61 20 6bad 240 pts/1 0:00 ps
200001 A   0 10964 10346 0  68 24 4388  40 pts/1 0:00 exec

And after the exec() system call, the exec program is replaced with sleeping:

#> ps -l
     F S UID   PID  PPID C PRI NI ADDR TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b pts/1 0:00 ksh
200001 A   0 10698 10346 2  61 20 a354 pts/1 0:00 ps
200001 A   0 10964 10346 0  68 24 4388 pts/1 0:00 sleeping
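The unchanged PID seen in the ps listings can also be checked from a program. The following is a minimal sketch and not part of the course example: a child process redirects its standard output into a pipe and calls execv() on /bin/sh (an assumed path) to run echo $$, which prints the PID of the shell. The parent compares that value with the PID that fork() returned. The function name exec_keeps_pid() is hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the program started by execv() reports the same PID that
 * fork() returned for the child, 0 if it differs, -1 on error. */
int exec_keeps_pid(void)
{
    int fd[2];
    char buf[32] = {0};
    ssize_t n;
    pid_t child;

    if (pipe(fd) == -1)
        return -1;

    child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: send stdout into the pipe, then replace the program. */
        char *args[] = { "sh", "-c", "echo $$", NULL };
        close(fd[0]);
        dup2(fd[1], STDOUT_FILENO);
        close(fd[1]);
        execv("/bin/sh", args);
        _exit(127);                 /* reached only if execv() fails */
    }

    /* Parent: read the PID printed by the new program image. */
    close(fd[1]);
    n = read(fd[0], buf, sizeof buf - 1);
    close(fd[0]);
    waitpid(child, NULL, 0);
    if (n <= 0)
        return -1;
    return atol(buf) == (long)child;
}
```

Calling exec_keeps_pid() from main() should return 1 on a system where /bin/sh exists, because the shell runs in the same process slot as the child that called execv().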


Process operations exit system call

Exit: what happens when a process terminates

The exit system call is executed at the end of every process. The system call cleans up and releases memory, text and data, but leaves an entry in the process table so that a return value and other status information can be passed to the parent process if needed.

• exit - termination of a process

• When a program no longer needs to run or execute other programs, it can exit.

• A program that exits causes the process to enter the zombie state.

Exiting from a program

There are basically three ways that a process can terminate: the program reaches an explicit exit(exit_value) statement; the program flow ends without an exit() statement, in which case the C startup code calls exit() when main() returns; or the running program receives a signal from an external source, such as a keyboard interrupt (<Ctrl-c>) from the user. If the program receives a signal, control switches to the signal handling routine, either in the program or the system default routine, which terminates the program with an exit.

When the exit() system call executes, all memory and other resources are freed, and the parameter supplied to exit() is placed in the process table as the exit value for the process. After the completion of the exit() system call, a SIGCHLD signal is sent to the parent process. At this stage the process is nothing but its process table entry; this state is called the zombie state. When the parent process reacts to the SIGCHLD signal and reads the return code from the process table, the system can clean up and free the process table entry.

On rare occasions the parent process cannot respond to the signal immediately, and we can then see the zombie in the process table with the ps command. A zombie is listed as <defunct>.
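How a parent reads the exit value left behind by a terminated child can be sketched as follows. This is an illustrative example only; the function name run_child() and its return convention are hypothetical. The child either calls _exit() with a value or is terminated by a signal sent by the parent, and the parent uses waitpid() to collect the status from the zombie entry.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child. If sig is 0 the child calls _exit(value); otherwise the
 * child waits and the parent sends it signal sig.
 * Returns the exit value, the negated signal number if the child was
 * terminated by a signal, or -1000 on error. */
int run_child(int value, int sig)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1000;
    if (child == 0) {
        if (sig == 0)
            _exit(value);           /* explicit exit with a value */
        for (;;)
            pause();                /* wait to be terminated by the signal */
    }

    if (sig != 0)
        kill(child, sig);

    /* Until waitpid() returns, the terminated child is a zombie: only its
     * process table entry, holding the status, remains. */
    if (waitpid(child, &status, 0) != child)
        return -1000;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return -1000;
}
```

For example, run_child(7, 0) returns 7, and run_child(0, SIGKILL) returns -SIGKILL. Sending SIGINT instead corresponds to the <Ctrl-c> case, provided the signal is not ignored in the child.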


Process operations, wait() system call

Waiting for the death of a child process

A parent process calls wait() to learn that a child process has terminated and to release the child’s process slot. Normally the programmer codes the wait() call explicitly. If the parent process terminates without waiting, the init process takes over and frees the remaining resources of its children when they exit.

• The parent process can be notified of the death of the child by waiting with a system call or catching the proper signal.

• Once the parent process acknowledges the death of a child process, the child process' slot is freed.


Process states

Process states

In AIX, processes can be in one of five states:

• Idle

• Active

• Stopped

• Swapped

• Zombie

Idle state

When processes are being created, they are first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the process table for a new process.

Active state

Once the creation of the new child process is completed, it is placed in the active state. The active state is the normal process state, and threads in the process can be running or ready to run.

Stopped processes

Processes can also be in a stopped state. A process can be stopped by the SIGSTOP signal. Stopped processes can be restarted by the SIGCONT signal. If a process is stopped, all of its threads are in the stopped state.

Swapped processes

If a process is swapped, it means that another process is running, and the process, or any of its threads, cannot run until the scheduler makes it active again.


Zombie process

When a process terminates with an exit system call, it first goes into the zombie state. Such a process has most of its resources freed; however, a small part of the process remains, such as the exit value that the parent process uses to determine why the child process died. If the parent process issues a wait system call, the exit status is returned to the parent, the remaining resources of the child process are freed, and the process ceases to exist. The slot can then be used by another newly created process.

If the parent process no longer exists when a child process exits, the init process frees the remaining resources held by the child. Sometimes we can see a zombie staying in the process list for a longer time; one example of this situation is a process that has exited while its parent process is busy or waiting in the kernel and unable to read the return code.

State transitions for AIX processes

The illustration shows how a process is started with a fork() system call and turns into an active process, and how an active process can change between the swapped, active and stopped states. A terminating process becomes a zombie until the entire process is removed.

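[Figure: process state transition diagram with the states Non existing, Idle, Active, Swapped, Stopped, and Zombie; the creation transition is labeled fork().]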


Kernel Processes

Kernel processes:

Kernel processes - Kproc

• Are created by the kernel.

• Have a private u-area/kernel stack.

• Share "text" and data with the rest of the kernel.

• Are not affected by signals.

• Cannot use shared library object code or other user protection domain code.

• Run in the Kernel Protection Domain.

Some processes in the system are kernel processes. Kernel processes are created by the kernel itself to execute independently of the actions of user threads. Even though a kernel process shows up in the process table (it can be listed with the "Berkeley" ps), it is part of the kernel. The scheduler is one example of a kernel process. Kernel processes are scheduled like user processes, but tend to have higher priorities.

Kernel processes can have multiple threads, as can user processes.


Thread Fundamentals

Thread definition

Like a process, a thread can be defined by separate components. A thread consists of:

• A thread table entry

• A thread ID (TID)

Processes and threads

• Process holds address space

• Thread holds execution context

• Multiple threads can run within one process

- One CPU can run only one thread at a time; on SMP systems, threads can run truly concurrently.

Threads

• Threads allow multiple execution units to share the same address space.

• The thread is the fundamental unit of execution.

• Threads have IDs (TIDs), just as processes have IDs (PIDs).

• A thread is an independent flow of control within a process.

• In a multi-threaded process, each thread can execute different code concurrently.

• Managing threads needs fewer resources than managing processes.

• Inter-thread communication is more efficient than inter-process communication, especially because variables can be shared.

Threads share data and address space

Threads reduce the need for IPC operations, because they allow multiple execution units to share the same address space and thereby easily share data. On the other hand, this adds complexity and risk to the programming; for example, synchronization and locking have to be controlled by the threads.
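The trade-off between easy data sharing and the need for locking can be sketched with POSIX threads. In this illustrative example (the names counter_worker(), run_counter_demo(), NTHREADS and NINCR are hypothetical), several threads update one global variable directly, with no IPC, and a mutex supplies the synchronization that the threads themselves must provide.

```c
#include <pthread.h>

#define NTHREADS 4
#define NINCR    10000

/* Shared by all threads because they live in one address space. */
static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *counter_worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < NINCR; i++) {
        /* Without the lock, concurrent updates could be lost. */
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long run_counter_demo(void)
{
    pthread_t tid[NTHREADS];
    int i, created = 0;

    counter = 0;
    for (i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, counter_worker, NULL) != 0)
            break;
        created++;
    }
    for (i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

With the mutex in place, the result is NTHREADS * NINCR when all threads are created successfully.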

Threads are the unit of execution

The thread is the fundamental unit of execution and the scheduler and dispatcher only work with threads. Therefore, every process has at least one thread.


Thread IDs (TID) and Process IDs (PID)

TIDs are listed for all threads in the thread table; TIDs are always odd. PIDs are listed for all processes in the process table; PIDs are always even, except for the init process, where PID = 1. Threads represent independent flows of control within a process; the system does not provide synchronization, so the control must be in the threads themselves.

In a multi-threaded process, each thread can execute different code concurrently, as determined by the program paths.

One of the main reasons for using threads is that managing threads requires fewer resources than managing processes. Inter-thread communication is more efficient than inter-process communication.


AIX Thread

AIX Threads

• A thread is an independent flow of control that operates within the same address space as other independent flows of control within a process. In other operating systems, threads are sometimes called "lightweight processes," or the meaning of the word "thread" is sometimes slightly different.

• Multiple threads of control allow an application to overlap operations such as reading from a terminal or writing to a disk file. This also allows an application to service requests from multiple users at the same time.

• Multiple threads of control within a single process are required by application developers to be able to provide these capabilities without the overhead of multiple processes.

• Multiple threads of control within a single process allow application developers to exploit the throughput of multiprocessor (MP) hardware.

TID format

Thread IDs have the following format for 32-bit kernels:

[Figure: 32-bit TID layout. Bit positions marked: 31, 24, 8, 7, 1, 0. Fields: INDEX, COUNT, a run of zero bits, and a low-order bit that is always 1.]

For 64-bit kernels, the TID is 64 bits wide:

[Figure: 64-bit TID layout. Bit positions marked: 63, 56, 8, 7, 1, 0. Fields: INDEX, COUNT, a run of zero bits, and a low-order bit that is always 1.]

• INDEX identifies the entry in the thread table corresponding to the designated TID (thread[INDEX]).

• COUNT is a generation count that is intended to avoid the rapid reallocation of TIDs. When a new TID is to be allocated, its value is calculated from the first available thread table entry. Slots are recycled.


TID format listed with kdb

The following is a 64-bit slot in the thread table, listed with kdb. The TID is 002143 hex, so INDEX = 21 hex and COUNT = 43 hex. Because 21 hex = 33 decimal, the figure tells us that this thread occupies slot 33 in the thread table; the slot number is listed in the kdb output below.

(0)> thread 33

SLOT NAME STATE TID PRI RQ CPUID CL WCHAN

pvthread+001080 33 sendmail SLEEP 002143 03C 0 0

If we look in the memory at address pvthread+001080, we can see the 64-bit TID structure.

(0)> d pvthread+001080

pvthread+001080: 0000 0000 0000 2143 0000 0000 0000 0000

(0)>
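The decoding done by hand in the kdb example can be written as a few small helpers. This is a sketch that follows the reading used in the example above (INDEX in the bits above the low byte, COUNT in the low byte); the function names are hypothetical, and the exact field boundaries should be checked against the kernel headers.

```c
#include <stdint.h>

/* Thread table slot: everything above the low 8 bits of the TID. */
uint64_t tid_index(uint64_t tid)
{
    return tid >> 8;
}

/* Generation count as read in the kdb example: the low 8 bits. */
unsigned tid_count(uint64_t tid)
{
    return (unsigned)(tid & 0xFF);
}

/* TIDs are always odd, so the lowest bit must be set. */
int tid_is_odd(uint64_t tid)
{
    return (int)(tid & 1);
}
```

For the TID 0x002143 from the listing, tid_index() gives 0x21 (33 decimal) and tid_count() gives 0x43.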


Thread Concepts

Threads concepts

• An application is said to be thread safe when multiple threads in a process can run the application successfully without data corruption.

• A library is thread safe when multiple threads can be running a routine in that library without data corruption (a closely related term is reentrant).

• A kernel thread is a thread of control managed by the kernel.

• A user thread is a thread of control managed by the application.

• User threads are attached to kernel threads to gain access to system services.

• In a multi-threaded system such as AIX:

- The process is the swappable entity.

- The thread is the schedulable entity.

Thread mapping models

• User threads are mapped to kernel threads by the threads library. The way this mapping is done is called the thread model. There are three possible thread models, corresponding to three different ways to map user threads to kernel threads:

- M:1 model

- 1:1 model

- M:N model

• The AIX Version 4.1 and later threads support is based on the OSF/1 libpthreads implementation. It supports what is referred to as the 1:1 model. This means that for every thread visible in an application, there is a corresponding kernel thread. Architecturally, it is possible to have an M:N libpthreads model, where "M" user threads are multiplexed on "N" kernel threads. This is supported in AIX 4.3.1 and AIX 5L.

• The mapping of user threads to kernel threads is done using virtual processors. A virtual processor (VP) is a library entity that is usually implicit. For a user thread, the virtual processor behaves as a CPU behaves for a kernel thread. In the library, the virtual processor is a kernel thread or a structure bound to a kernel thread.

• The libpthreads implementation is provided for application developers to develop portable multi-threaded applications. The libpthreads.a library has been written as per the POSIX 1003.4a Draft 10 specification in AIX 4.3. Previous versions of AIX support the POSIX 1003.4a Draft 7 specification. libpthreads is a linkable user library that provides user space threads services to an application. The libpthreads_compat.a library provides the POSIX 1003.4a Draft 7 specification pthreads model on AIX 4.3.


Threads Models

M:1 threads model

In the M:1 model, all user threads are mapped to one kernel thread and all user threads run on one VP. The mapping is handled by a library scheduler. All user threads programming facilities are completely handled by the library. This model can be used on any system, especially on traditional single-threaded systems.

[Figure: M:1 Threads Model. User threads in the threads library are handled by the library scheduler and run on a single VP, which maps to one kernel thread.]


1:1 threads model

In the 1:1 model, each user thread is mapped to one kernel thread and each user thread runs on one VP. Most of the user threads programming facilities are directly handled by the kernel threads.

[Figure: 1:1 Threads Model. Each user thread in the threads library has its own VP, and each VP maps to its own kernel thread.]


M:N threads model

In the M:N model, all user threads are mapped to a pool of kernel threads, and all user threads run on a pool of virtual processors. A user thread may be bound to a specific VP, as in the 1:1 model. All unbound user threads share the remaining VPs. This is the most efficient and most complex thread model; the user threads programming facilities are shared between the threads library and the kernel threads.

[Figure: M:N Threads Model. User threads in the threads library are handled by the library scheduler and run on a pool of VPs, which map to a pool of kernel threads.]
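In POSIX threads, an application can request how a thread is mapped through the contention scope attribute. The sketch below is an illustration, with the hypothetical function name create_with_scope(); it assumes that system contention scope (PTHREAD_SCOPE_SYSTEM) corresponds to a user thread bound to its own kernel thread, as in the 1:1 model, and that process contention scope (PTHREAD_SCOPE_PROCESS) corresponds to unbound threads in the M:N model. Not every system supports both scopes.

```c
#include <pthread.h>

static void *scope_worker(void *arg)
{
    return arg;                     /* nothing to do; the thread just ends */
}

/* Creates one thread with the requested contention scope
 * (PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS).
 * Returns 0 on success or the error number reported by pthreads. */
int create_with_scope(int scope)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setscope(&attr, scope);
    if (rc == 0) {
        rc = pthread_create(&tid, &attr, scope_worker, NULL);
        if (rc == 0)
            rc = pthread_join(tid, NULL);
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

A nonzero return from create_with_scope(PTHREAD_SCOPE_PROCESS) indicates that the threads library on that system does not offer that scope.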


Thread states

Thread states

In AIX, the kernel allows many threads to run at the same time, but only one thread can be executing on each CPU at a time. The thread state is kept in t_state in the thread table (for detailed information, look in the /usr/include/sys/thread.h file).

Each thread can be in one of the following seven states:

• Idle

• Ready to run

• Running

• Sleeping

• Stopped

• Swapped

• Zombie

Idle state

When processes and threads are being created, they are first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the thread table for a new thread.

Ready to run

Once the new thread creation is completed, it is placed in the ready to run state. The thread waits in this state until it is dispatched. When the thread is running, it continues to run until it has used its time slice, gives up the CPU, or is preempted by a higher priority thread.

Running thread

A thread in the running state is the thread executing on the CPU. The thread state changes between running and ready to run until the thread finishes execution; the thread then goes to the zombie state.

Sleeping

Whenever the thread is waiting for an event, the thread is said to be sleeping.


Stopped

A stopped thread is a thread stopped by the SIGSTOP signal. Stopped threads can be restarted by the SIGCONT signal.

Swapped

Though swapping takes place at the process level and all threads of a process are swapped at the same time, the thread table is updated whenever the thread is swapped.

Zombie

The zombie state is an intermediate state for the thread, lasting only until all resources owned by the thread are given up.

State transitions for AIX threads

The illustration shows the states for AIX threads. Threads typically change between running, ready to run, sleeping and stopped during the lifetime of the thread.

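[Figure: thread state transition diagram with the states Non existing, Being Created, Ready to run, Running, Sleeping, Stopped, Swapped, and Zombie; the creation transition is labeled fork().]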


Thread Management

Thread / process relationship

• The diagram below shows how a process shares most of its data among its threads. Each thread has its own copy of the registers and a private stack, and the kernel keeps some thread-specific data for each thread. Because the rest of the data is shared, data can be passed between threads via global variables.

• A conventional single-threaded UNIX process can only harm itself (if incorrectly coded).

• All threads in a process share the same address space, so in an incorrectly coded program, one thread can damage the stack and data areas associated with other threads in that process.

• Except for such areas as explicitly shared memory segments, a process cannot directly affect other processes.

• There is some kernel data that is shared between the threads, but the kernel also maintains thread specific data.

• Per-process data that is needed even when the process is swapped out is kept in the pvproc structure. The pvproc structure is pinned.

• Per-process data that is needed only when the process is swapped in is kept in the user structure.

• Per-thread data that is needed even when the process is swapped out is kept in the pvthread structure. The pvthread structure is pinned.

• Per-thread data that is needed only when the process is swapped in is kept in the uthread structure.

Data placement overview

[Figure: data placement overview. The process-wide areas are Code, Program Data, BSS, and Kernel Process Data; each of the three threads shown has its own Registers, Stack, and Kernel Thread Data.]


Process swapping

Process swapping


Thread Scheduling

Thread scheduling

Scheduling and dispatching is the ability to assign CPU time to threads in the system in an efficient and fair way. The problem is to design the system to handle many simultaneous threads and at the same time still be responsive to events.

Clock ticks and time slices

The division of time among the threads on the AIX system relies on clock ticks. Every 1/100 of a second (100 times a second), the dispatcher is called and does the following:

• Increases the tick counter for the running thread.

• Scans the run queues for the thread with the highest priority.

• Dispatches the most favored thread.

Once every second, the scheduler wakes up and recalculates the priority of all threads.

Thread priority

• AIX priority has 128 levels (0-127), which are called run queue levels.

• The higher the run queue level, the lower the priority.

• Priority 127 can only be used by the wait process.

• User processes can have their priority changed by -20 to +20 levels (renice).

• User processes are in the range 40 - 80.

• A clock tick interrupt decreases thread priority.

• The scheduler (swapper) increases thread priority.

The priority is based on the basic priority level, the initial nice value, the renice value and a penalty.

[Figure: components of the priority value: base priority (default value = 40), nice value (default = 20), renice value (-20 to +20), and a penalty based on runtime. Higher value = lower priority.]


Thread dispatching

• The dispatcher chooses the highest priority thread to execute.

• Threads are the dispatchable unit for the AIX scheduler.

• Each thread has its own priority (0-127) and scheduling algorithm.

• There are three scheduling algorithms:

- SCHED_RR Round Robin

- SCHED_FIFO

- SCHED_OTHER

SCHED_RR threads scheduling algorithms

• SCHED_RR

- This is a Round Robin scheduling mechanism in which the thread is time-sliced at fixed priority.

- This scheme is similar to creating a fixed priority, real time process.

- The thread must have root authority to be able to use this scheduling mechanism.

SCHED_FIFO threads scheduling algorithms

• SCHED_FIFO

- A non-preemptive scheduling scheme.

- The thread runs at fixed priority and is not time-sliced.

- It will be allowed to run on a processor until it voluntarily relinquishes the processor by blocking or yielding.

- A thread using SCHED_FIFO must also have root authority to use it.

- It is possible to create a thread with SCHED_FIFO that has a high enough priority that it could monopolize the processor.

SCHED_OTHER threads scheduling algorithms

• SCHED_OTHER

- The default AIX scheduling.

- Priority degrades with CPU usage.
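An application selects one of these policies through the POSIX thread attributes. The following is a minimal sketch with a hypothetical helper name, request_policy(); it only prepares the attribute object. Creating a thread with a SCHED_RR or SCHED_FIFO attribute is the step where the authority check described above is expected to apply, so pthread_create() may fail for a non-root user.

```c
#include <pthread.h>
#include <sched.h>

/* Stores an explicit scheduling policy and priority in attr.
 * Returns 0 on success or the error number reported by pthreads. */
int request_policy(pthread_attr_t *attr, int policy, int priority)
{
    struct sched_param sp;
    int rc;

    sp.sched_priority = priority;

    /* Use the attributes given here instead of inheriting the creator's. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(attr, policy);
    if (rc == 0)
        rc = pthread_attr_setschedparam(attr, &sp);
    return rc;
}
```

A caller would initialize the attribute object with pthread_attr_init(), call request_policy() with, for example, SCHED_RR and a priority obtained from sched_get_priority_min(SCHED_RR), and then pass the attribute object to pthread_create().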


Thread scheduling

Like most UNIX systems, AIX uses a multilevel round-robin model for process and thread scheduling. Processes and threads at the same priority level are linked together and placed on a run queue. AIX has 128 run queues, 0-127, each representing one of the 128 possible priorities. When a process starts running, it is given a priority based on the nice value, and the process is linked with other processes at the same level. As the process runs and consumes CPU resources, its priority decreases until it finishes, or until the priority is so low that other processes get CPU time. If a process does not run, its priority increases until it can get CPU time again. The drawing below illustrates the 128 run queue levels and six processes: three at priority 60 and three at 70.

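[Figure: the 128 run queue levels, with the scale marked at 0, 20, 40, 60, 80, 100, 120 and 127, showing three processes at priority 60, three at priority 70, and the idle process.]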


Thread scheduling algorithm

The scheduler uses the following algorithm to calculate priorities for the running processes:

For every clock tick (1/100 sec.):

• The running thread is charged for one tick.

• The dispatcher is called, scans the run queues, and dispatches the thread with the highest priority.

The scheduler runs every second:

• It calculates a new priority for all threads.

• For each thread, it sets the number of used ticks to (used ticks) * d/32, where 0 <= d <= 32.

The algorithm for calculating the priority is:

• new_nice = 60 + 2 * nice if nice > 0

• new_nice = 60 + nice if nice <= 0

• Priority = used ticks * (new_nice + 4) / 64 * r/32 + new_nice, where 0 <= r <= 32

Invariants:

-20 <= nice <= 20

0 <= r <= 32

0 <= d <= 32

0 <= ticks <= 120

0 <= p <= 126

The values r and d control how a process is affected by its run time: r determines how severely a process is penalized for used CPU time, while d controls how fast the system “forgives” previous CPU consumption.

Both values can be set with the schedtune [-r r_val] [-d d_val] command.
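The formulas above can be traced with a short C sketch. It assumes integer arithmetic evaluated left to right and clamps the result to the stated invariant 0 <= p <= 126; the guide specifies neither, and the function names are illustrative only.

```c
#include <assert.h>

/* Base priority derived from the nice value (-20 <= nice <= 20). */
int new_nice(int nice)
{
    assert(nice >= -20 && nice <= 20);
    return nice > 0 ? 60 + 2 * nice : 60 + nice;
}

/* Once per second the scheduler scales the accumulated ticks by d/32. */
int decay_ticks(int ticks, int d)
{
    assert(d >= 0 && d <= 32);
    return ticks * d / 32;
}

/* Priority = used ticks * (new_nice + 4) / 64 * r/32 + new_nice. */
int priority(int ticks, int nice, int r)
{
    assert(ticks >= 0 && ticks <= 120);
    assert(r >= 0 && r <= 32);
    int base = new_nice(nice);
    int p = ticks * (base + 4) / 64 * r / 32 + base;
    return p > 126 ? 126 : p;   /* assumption: clamp to the invariant */
}
```

With r = 16 and d = 16 (values chosen for illustration), a thread with nice 0 that has accumulated 100 ticks gets priority 110, and its tick count drops to 50 at the next scheduler pass.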


The Dispatcher

The dispatcher runs under the following circumstances:

• A time interval (1/100 sec.) has passed.

• A thread voluntarily gives up the CPU.

• A thread is returning to user mode from kernel mode.

• Another thread has been made runnable (awakened).

Context switch procedure

The context switch procedure consists of:

• Saving the machine state of the departing thread.

• Recalling the machine state of the selected thread.

• Mapping the process private data and other virtual space of the selected thread.

• Switching the CPU to execute with the selected thread's registers.

Context switch

The procedure switches context to make a different thread execute:

• As a thread executes in the CPU, its priority becomes less favored.

• The scheduler re-calculates the priority of the executing thread and measures the new priority against the priorities of the threads that are runnable.

• In AIX, the run queues are divided into 128 separate priority queues with priority 0 being the most-favored priority and priority 127 the least-favored.

• Threads at the same priority level are on the same run queue for quick determination of the next runnable thread.

• All of the threads on a more-favored priority queue run before threads on a less-favored priority queue.

• Queue 127 contains the wait threads. There is one wait thread per CPU, and these run only when there are no other runnable threads.
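The queue rules above can be sketched as an array of 128 FIFO lists that is scanned from priority 0 upward. This is a simplified model, not the AIX dispatcher: struct thread_node, struct run_queue, and both functions are hypothetical names, and the wait thread is treated like any other queue entry.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRIORITIES 128   /* 0 is most favored, 127 least favored */

struct thread_node {         /* hypothetical stand-in for a thread entry */
    int tid;
    struct thread_node *next;
};

struct run_queue {
    struct thread_node *head[NUM_PRIORITIES];
    struct thread_node *tail[NUM_PRIORITIES];
};

/* Append a thread to the queue for its priority level. */
void rq_enqueue(struct run_queue *rq, struct thread_node *t, int pri)
{
    assert(pri >= 0 && pri < NUM_PRIORITIES);
    t->next = NULL;
    if (rq->tail[pri] != NULL)
        rq->tail[pri]->next = t;
    else
        rq->head[pri] = t;
    rq->tail[pri] = t;
}

/* Remove and return the first thread on the most-favored non-empty queue. */
struct thread_node *rq_dispatch(struct run_queue *rq)
{
    for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
        struct thread_node *t = rq->head[pri];
        if (t != NULL) {
            rq->head[pri] = t->next;
            if (rq->head[pri] == NULL)
                rq->tail[pri] = NULL;
            t->next = NULL;
            return t;
        }
    }
    return NULL;   /* nothing runnable */
}
```

Threads queued at priority 60 are dispatched before a thread at 70, and an entry at 127 is reached only when every more-favored queue is empty.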


Thread preemption

In AIX, the kernel allows preemption of both user and kernel threads.

• Preemption allows the kernel to respond to real-time processes much faster.

• On most UNIX systems, when a thread is in kernel mode, no other thread can execute until the thread in kernel mode returns to user mode or voluntarily gives up the CPU.

• In AIX, other higher priority threads may preempt threads running in kernel mode.

• This feature supports real-time processing where a real-time process must respond to an action immediately.

• Some sections of code have been determined to be critical sections where preemption is not possible because preemption may cause inconsistent kernel data structures. These sections are protected either by preventing preemption (by disabling interrupts) or by holding a lock.

• The kernel can use locks to serialize access to global kernel data that could be corrupted by preemption.

• The thread holding the lock for a piece of data is guaranteed to run at a higher priority than the set of threads waiting for the lock. This is called priority promotion.

• However, other threads running at higher priority and not asking for the lock on the same piece of data can preempt the locking thread.

The MP scheduler/dispatcher

• Hard Cache Affinity - The ability to bind a thread or process to a processor.

• Soft Cache Affinity - An attempt to run a thread or process on the same processor.

• Support for funneling threads - Funneling is a method of running non-MP-safe threads on MP hardware.


The MP dispatcher/scheduler

AIX scheduling uses a time-sharing, priority-based scheduling algorithm. This algorithm favors threads that do not consume large amounts of processor time, because the amount of processor time used by a thread is included in the priority recalculation equation. Fixed priority threads are allowed. The priority of these threads does not change, regardless of the amount of processor time used.

There is one global priority-based, multilevel run queue (runq). All threads that are runnable are linked into one of these runq entries. There are currently 128 priorities (0-127). The scheduler periodically scans the list of all active threads and recalculates thread priorities based on the amount of processor time used.


AIX run queues

Multiple run queues (MRQ)

• AIX 4.3.3 uses multiple specialized run queues instead of just one global queue.

• Each processor has its own local run queue, and each node has a global run queue.

• Processors dispatch threads from the local and the global run queue.

Global run queue

• Fixed priority POSIX-compliant threads

• Load balancing of newly created threads and very low priority threads

• Processes executed with RT_GRQ=ON exported

• Threads that are bound to a processor are never placed in the global run queue

Fixed priority threads are guaranteed strict priority-order execution. Load balancing is achieved with new and low priority threads. New threads can be picked up by any CPU because they have not run yet, and the cache penalty is therefore small. Low priority threads can also be moved easily because they do not have data in cache.

If a process has the variable RT_GRQ=ON set, it sacrifices cache optimization for the best possible real-time behavior. That is, the process is placed on the global run queue and runs on the first available CPU. Threads can be bound to one CPU and are then never placed on the global run queue.

[Figure: three nodes with four CPUs each (Node 0: CPU 0-3, Node 1: CPU 4-7, Node 2: CPU 8-11); each CPU has its own run queue, RQ 0 to RQ 11.]


Local run queues

• Local run queues reduce lock contention

• Shorter and simpler run queue scans

• Stronger implied affinity

• Reduced cache contention in the kernel

Each local run queue has its own lock. This reduces lock contention and makes lock handling faster. The local queue makes the scan faster because there is no special handling of bound threads, and soft affinity is simple to handle with one CPU per run queue. Kernel cache contention is reduced because each CPU updates its own dispatcher state, and the structures for threads in the local run queue are more likely to be in the local cache.

Initial load balancing

When new unbound threads are created, they should initially be placed so that the system load remains balanced. This has to be handled differently for new processes and for additional threads in an existing process.

If the thread is the initial thread in a new process:

• Choose a run queue; round-robin first among all nodes, and secondly within the run queues of the chosen node.

• Look for an idle CPU in the chosen run queue.

• Look for an idle CPU in the chosen node.

• Look for an idle CPU anywhere on the system.

• Otherwise, add to round_robin global run queue.

If the new thread is an additional thread for an already existing process:

• Choose a run queue; round-robin among the run queues in the process’ node.

• Look for an idle CPU in the chosen run queue.

• Look for an idle CPU in the chosen node.

• Otherwise, add to global run queue for this node.
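The two placement procedures above can be sketched as follows. The sketch assumes three nodes with four CPUs each and one run queue per CPU, and it reads "look for an idle CPU in the chosen run queue" as "check whether the CPU that owns the chosen run queue is idle". All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { NODES = 3, CPUS_PER_NODE = 4, NCPUS = NODES * CPUS_PER_NODE };

/* Where a new thread ends up (hypothetical type, for illustration). */
struct placement {
    bool global;   /* true: global run queue of node; false: local queue of cpu */
    int  node;
    int  cpu;      /* -1 when global is true */
};

/* Round-robin cursors for nodes and for the run queues inside each node. */
struct rr_state {
    int next_node;
    int next_rq[NODES];
};

struct placement on_local(int cpu)
{
    struct placement p = { false, cpu / CPUS_PER_NODE, cpu };
    return p;
}

struct placement on_global(int node)
{
    struct placement p = { true, node, -1 };
    return p;
}

/* First idle CPU in a node, or -1 if every CPU in the node is busy. */
int idle_cpu_in_node(const bool idle[NCPUS], int node)
{
    for (int i = 0; i < CPUS_PER_NODE; i++)
        if (idle[node * CPUS_PER_NODE + i])
            return node * CPUS_PER_NODE + i;
    return -1;
}

/* Additional thread of an existing process: stay within the process' node. */
struct placement place_additional_thread(struct rr_state *rr,
                                         const bool idle[NCPUS], int node)
{
    assert(node >= 0 && node < NODES);
    int cpu = node * CPUS_PER_NODE + rr->next_rq[node];   /* round-robin choice */
    rr->next_rq[node] = (rr->next_rq[node] + 1) % CPUS_PER_NODE;
    if (idle[cpu])
        return on_local(cpu);
    cpu = idle_cpu_in_node(idle, node);
    return cpu >= 0 ? on_local(cpu) : on_global(node);
}

/* Initial thread of a new process: round-robin over the nodes first. */
struct placement place_initial_thread(struct rr_state *rr, const bool idle[NCPUS])
{
    int node = rr->next_node;
    rr->next_node = (node + 1) % NODES;

    struct placement p = place_additional_thread(rr, idle, node);
    if (!p.global)
        return p;
    for (int n = 0; n < NODES; n++) {      /* idle CPU anywhere on the system */
        int cpu = idle_cpu_in_node(idle, n);
        if (cpu >= 0)
            return on_local(cpu);
    }
    return on_global(node);
}
```

The initial-thread path reuses the per-node steps and only adds the system-wide search, which mirrors the one extra bullet in the first list.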


Idle load balancing

Idle load balancing occurs when a CPU goes idle and starts looking for work in other run queues. The criteria for permitting a thread steal are:

• The number of threads on the foreign run queue is greater than 1/4 of the load factor of the node.

• There is at least one stealable (unbound) thread available.

• There is at least one unstealable (bound) thread available.

• The number of threads stolen from this run queue during the current clock interval is less than 20.

• Should multiple run queues meet these criteria, the one with the most threads will be used.

• If this run queue’s lock is available, its best priority unbound thread will be stolen, assuming its p_lock_d is available.

• Note that failure to lock the run queue or the thread will cause the dispatcher to loop through waitproc, thereby opening up a periodic enablement window.


Process and Threads data structures

Process and thread management data structure overview

Four main data structures are used for process management:

• proc

• thread

• user

• uthread

The figure below shows how the tables are linked together.

[Figure: the process table (pvproc entries) and the thread table (pvthread entries) on the /dev/kmem side, with the pvthread pointers (user, uthread) leading into the user memory side: process data (user area, ublock, uthreads, kernel stacks, thread stacks), global data, and the process text segment.]


Thread management data structure overview

The diagram above shows that the thread structures contain pointers to all the other structures required to run that particular thread. This reflects the fact that the thread is the schedulable entity, and the system must be able to access all structures from the pointers in the thread table. Each thread table entry is doubly and circularly linked to all other threads of its process. Note that the ublock structure contains the user structure plus the uthread structure for the initial thread. The uthread structures for all other threads are in the uthread (and kernel thread stacks) segment. The first uthread structure is kept separate within the ublock so that the fields it contains can be addressed directly, and so that fork and exec can operate with only the process private segment to deal with.

The proc and thread structures are maintained in the kernel extension segment as a part of the process and thread tables of the kernel. Every in-use entry in these tables is pinned, such that the information there is always available to the kernel. The user and uthread structures are maintained in the process private segment of the corresponding process. These structures are only pinned when the process is not swapped out. When the process is swapped out, they are unpinned.

Process and thread links

The previous diagram shows how the tables are linked together. Each process in the system has an entry in the process table. Each process entry has a pointer to the list of threads for the process, and the thread list has a pointer back to the process table. The thread list is a circular doubly linked list of all the threads owned by the process, and the pvthread entries point to the user area and uthread field in the process data area.


Proc structure fields and pointers (C structures)

The following is an extract of the fields in the proc table to show the pointers. Note that each entry in the proc table starts with a pointer to a pvproc structure (we will later discuss the pvproc structure). The proc table holds the number of threads, and the pvproc table has a pointer pv_threadlist that points to the first thread for the process in the thread table. A complete listing of the structures can be found in the file /usr/include/sys/proc.h.

struct proc {
    struct pvproc *p_pvprocp;       /* my global process data */
    pid_t p_pid;                    /* unique process identifier */
    uint p_flag;                    /* process flags */

    /* thread fields */
    ushort p_threadcount;           /* number of threads */
    ushort p_active;                /* number of active threads */
    .......
};

struct pvproc {
    /* identifier fields */
    pid_t pv_pid;                   /* unique process identifier */
    pid_t pv_ppid;                  /* parent process identifier */

    /* thread fields */
    struct pvthread *pv_threadlist; /* head of list of threads */
    .......
};


Thread table fields and links to the process table and the ublock

Like the process table, the thread table is divided into two tables: a pvthread table and a thread table. The complete structures can be found in the file /usr/include/sys/thread.h. The structures listed contain only selected fields.

The pvthread structure has a pointer, *tv_pvprocp, back to the owner process and links the threads of a process in a circular doubly linked list (with the *prevthread and *nextthread fields). The thread structure has the pointers *uthreadp and *userp to the user thread and user area.

struct thread {
    struct pvthread *t_pvthreadp;       /* my pvthread struct */
    struct t_uaddress {
        struct uthread *uthreadp;       /* local data */
        struct user *userp;             /* owner process' ublock (const) */
    } t_uaddress;
    ......
};

struct pvthread {
    /* identifier fields */
    tid_t tv_tid;                       /* unique thread identifier */

    /* related data structures */
    struct thread *tv_threadp;          /* my thread struct */
    struct pvproc *tv_pvprocp;          /* owner process (global data) */
    struct {
        struct pvthread *prevthread;    /* previous thread */
        struct pvthread *nextthread;    /* next thread */
    } tv_threadlist;                    /* circular doubly linked list */
    ...
};
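The circular doubly linked thread list can be exercised with reduced stand-ins for the structures listed above. Only the link fields are kept, plain int replaces pid_t and tid_t, and the helper functions thread_link and thread_count are hypothetical; this is a sketch of the list discipline, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct pvthread;

struct pvproc {
    int pv_pid;                          /* unique process identifier */
    struct pvthread *pv_threadlist;      /* head of list of threads */
};

struct pvthread {
    int tv_tid;                          /* unique thread identifier */
    struct pvproc *tv_pvprocp;           /* owner process */
    struct {
        struct pvthread *prevthread;
        struct pvthread *nextthread;
    } tv_threadlist;                     /* circular doubly linked list */
};

/* Link t into p's thread list, just before the head (that is, at the tail). */
void thread_link(struct pvproc *p, struct pvthread *t)
{
    t->tv_pvprocp = p;
    if (p->pv_threadlist == NULL) {      /* first thread points at itself */
        t->tv_threadlist.prevthread = t;
        t->tv_threadlist.nextthread = t;
        p->pv_threadlist = t;
        return;
    }
    struct pvthread *head = p->pv_threadlist;
    struct pvthread *tail = head->tv_threadlist.prevthread;
    t->tv_threadlist.prevthread = tail;
    t->tv_threadlist.nextthread = head;
    tail->tv_threadlist.nextthread = t;
    head->tv_threadlist.prevthread = t;
}

/* Walk the list once, starting and stopping at the head. */
int thread_count(const struct pvproc *p)
{
    const struct pvthread *t = p->pv_threadlist;
    int n = 0;
    if (t == NULL)
        return 0;
    do {
        assert(t->tv_pvprocp == p);      /* every thread points back to its owner */
        n++;
        t = t->tv_threadlist.nextthread;
    } while (t != p->pv_threadlist);
    return n;
}
```

Because the list is circular, the walk needs no NULL terminator: it ends when nextthread leads back to the entry that pv_threadlist points to.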


Process and Threads data structures addresses

Process and thread tables’ addresses in the kernel

AIX 5L has a 64-bit kernel, and its addresses are 64 bits long. Both process and thread tables are kept in the kernel extension segment at fixed addresses.

• The proc table starts at 0xF100008080000000.

• The thread table starts at 0xF100008090000000.

Both tables are maintained as arrays.

• Entries are called “slots.”

• Slot number can be derived from PID or TID (see the example).

AIX 4.3.3 has a 32-bit kernel, and its addresses are only 32 bits long. The values for an AIX 4.3.3 32-bit kernel are:

• The proc table starts at 0xe2000000.

• The thread table starts at 0xe6000000.

In both kernels:

• The slot number is held in bits 8-23 of the PID or TID. See the example listing from the process table of an AIX 5L POWER system.

• The generation count for each slot is incremented every time a PID or TID is created in that slot.


Looking at AIX 4 process structures with kdb

Looking at the process table with kdb, we can tell that there is a difference between AIX 4 and AIX 5. List the process table with the p subcommand in kdb. The process table starts at address proc and the process slot used by kdb is 7936, which is offset by 326000 (hex) from the start of the process table. The size of proc is 326000 (hex) / 7936 (dec) = 416 (dec) = 1A0 (hex).

SLOT              NAME     STATE  PID    PPID  PGRP   UID   EUID  ADSPACE  CL
proc+326000  7936*kdb_up   ACTIVE 1F001A 0123A 1F001A 00000 00000 00001302 00

The size of each process slot can be verified with the p * subcommand. In the following list, each slot is offset by 1A0 (hex) bytes from the previous one.

SLOT              NAME     STATE  PID    PPID  PGRP   UID   EUID  ADSPACE  CL
proc+000000     0 swapper  ACTIVE 00000  00000 00000  00000 00000 0000780F 00
proc+0001A0     1 init     ACTIVE 00001  00000 00000  00000 00000 0000500A 00
proc+000340     2 wait     ACTIVE 00204  00000 00000  00000 00000 00008010 00
proc+0004E0     3 netm     ACTIVE 00306  00000 00000  00000 00000 0000B817 00

And the location of proc in memory can be retrieved with the nm subcommand.

(0)> nm proc
Symbol Address : E2000000
TOC Address : 001F9EF8


Looking at AIX 5 process structures with kdb

The same lists will look different on an AIX 5 system. First, on a list of the proc table, we can tell that the structure used is no longer proc but pvproc, and each pvproc slot is 6680 (hex) / 41 (dec) = 280 (hex) long.

(0)> p
SLOT              NAME      STATE  PID     PPID    ADSPACE          CL #THS
pvproc+006680  41*kdb_64    ACTIVE 0002996 00037D8 00000000200040AA 0  0001

Listing the first three slots shows that the offset is 280(hex) between the slots.

(0)> p *
SLOT              NAME      STATE  PID     PPID    ADSPACE          CL
pvproc+000000   0 swapper   ACTIVE 0000000 0000000 0000000000000B00 0
pvproc+000280   1 init      ACTIVE 0000001 0000000 000000000000E2FD 00
pvproc+000500   2 wait      ACTIVE 0000204 0000000 0000000000001B02 0

The pvproc address in memory is found using the nm command.

(0)> nm pvproc
Symbol Address : F100008080000000
TOC Address : 0046AC80
(0)>
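The slot arithmetic used in the two kdb sessions can be checked with a few lines of C. The base addresses and offsets are the ones quoted in the listings; the function and macro names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Table base addresses reported by the nm subcommand in the listings. */
#define AIX4_PROC_BASE    UINT64_C(0xE2000000)
#define AIX5_PVPROC_BASE  UINT64_C(0xF100008080000000)

/* Size of one table entry, derived from the offset of a known slot. */
uint64_t slot_size(uint64_t offset, uint64_t slot)
{
    assert(slot != 0 && offset % slot == 0);
    return offset / slot;
}

/* Address of a slot: table base + slot number * entry size. */
uint64_t slot_address(uint64_t base, uint64_t slot, uint64_t size)
{
    return base + slot * size;
}
```

For example, slot_size(0x326000, 7936) gives 0x1A0 for the AIX 4 proc table and slot_size(0x6680, 41) gives 0x280 for the AIX 5 pvproc table, so slot 41 of the pvproc table sits at 0xF100008080006680.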


AIX 5 process and thread data structures

Process data structure changes in AIX 5

The changes in the process table are made to support the NUMA (Non-Uniform Memory Access) architecture in AIX 5L.

A NUMA system consists of one or more separate nodes connected by a very fast connection. The nodes operate as one computer, running one copy of AIX. The name NUMA refers to the fact that the memory access time is not constant. A CPU accessing memory on its own node gets the memory fast (accessed via the local bus). A CPU accessing remote memory has to get the data from a remote node, and the access is slower.

In order to make the system efficient, we want to keep all parts of a process close together so that memory access is fast; therefore, the proc structure has been rearranged and divided into two parts. Struct pvproc holds the global process data, and the rest is still in struct proc. This change allows the NUMA system to move processes around between CPUs or “QUADS” and still have most of the process table local to the process. However, some of the process table must be kept at the main node in a NUMA system.

Because of things like shared memory, processes can form migration groups. These are groups of processes, shared memory, files, and so on, that are logically attached to each other. The most common form of logical attachment involves one being intrinsically tied in with another process. For example, a process that creates a shared memory segment is logically attached to it. If another process uses the shared memory segment, it is logically attached to it, and as a result is in a migration group with the first process. Additionally, the user is allowed to create logical attachments between items through the NUMA APIs.

The proc structure in an AIX 5 system starts with a pointer to a pvproc structure and continues with the process identifier and the process flags. The start of the structure is listed here; for a full listing, see the file /usr/include/sys/proc.h.

struct proc {
    struct pvproc *p_pvprocp;   /* my global process data */
    pid_t p_pid;                /* unique process identifier */
    uint p_flag;                /* process flags */


Process ID (PID) and process table slot number

A process ID or thread ID is composed of a slot number and a generation count. Bit 0 tells us whether the ID is a PID or a TID (all PIDs are even). Bits 1 to 7 are the generation count, which prevents the rapid reuse of process IDs. Bits 8 to 23 are the slot number in the process table. This can be verified from the pvproc list, where bits 8-23 of the PID field match the process slot number in the pvproc table.

Bits 63-24: 0000000
Bits 23-8:  slot number
Bits 7-1:   generation count
Bit 0:      0 if PID, 1 if TID

Process table example from an AIX 5L system.

SLOT              NAME      STATE  PID
pvproc+001180   7 gil       ACTIVE 000070E
pvproc+001400   8 wlmsched  ACTIVE 0000810
pvproc+001900  10 shlap64   ACTIVE 0000AD2
pvproc+001B80  11 syncd     ACTIVE 0000B4E
pvproc+002080  13 lvmbb     ACTIVE 0000D22
pvproc+002580  15 errdemon  ACTIVE 0000F50
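The bit layout can be decoded with shifts and masks. The sketch below follows the layout described above (bit 0 is the least significant bit); struct id_fields and decode_id are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct id_fields {
    bool     is_tid;       /* bit 0: 0 for a PID, 1 for a TID */
    unsigned generation;   /* bits 1-7 */
    unsigned slot;         /* bits 8-23 */
};

/* Split a PID or TID into its three fields. */
struct id_fields decode_id(uint64_t id)
{
    struct id_fields f;
    f.is_tid     = (id & 1) != 0;
    f.generation = (unsigned)((id >> 1) & 0x7F);
    f.slot       = (unsigned)((id >> 8) & 0xFFFF);
    return f;
}
```

Applied to the listing, PID 0x70E decodes to slot 7 with generation count 7, and PID 0xAD2 decodes to slot 10. The same decoding gives slot 7936 for the AIX 4 PID 0x1F001A shown earlier.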


What is new in AIX 5

Priority boost

Priority boost is a facility that ensures that higher priority processes get CPU time, and that the time such processes have to wait for lower priority processes is minimized. Priority boost was implemented in AIX 4.3 and is further enhanced in AIX 5.

The background for priority boost is demonstrated in the following scenario. Assume that we have three resources, locks A, B, and C. Two processes, process 1 and process 2, both want access to resource B, but process 1, a low priority process, has the lock, and process 2 has to wait. However, another process (process 3) has higher priority than process 1 and gets most of the CPU time. In this scenario, the high priority process 2 is waiting on the lower priority process 1 because it holds a lock. To resolve this situation, priority boost was added to AIX 4.3.

Priority boost increases the priority of processes holding locks:

• When a process has to wait for a lock, it increases the priority of the process that has the lock to its own priority.

• Other processes waiting for the same lock also get increased priority.

• Only the priority of the kernel thread is increased; as soon as the boosted process leaves the kernel, its priority is set back to the original value.
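The scenario can be modeled in a few lines. This is a simplified sketch, not the AIX lock implementation: struct kthread, struct klock, and the three functions are hypothetical, a lower priority value is treated as more favored, and the boosting of other waiters on the same lock is not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct kthread {
    int base_pri;   /* priority before any boost */
    int pri;        /* effective priority; a lower value is more favored */
};

struct klock {
    struct kthread *owner;   /* NULL when the lock is free */
};

/* Returns 1 if t acquired the lock, 0 if t must wait. A waiter that is more
 * favored than the holder lends its priority to the holder. */
int lock_try(struct klock *l, struct kthread *t)
{
    if (l->owner == NULL) {
        l->owner = t;
        return 1;
    }
    if (t->pri < l->owner->pri)
        l->owner->pri = t->pri;
    return 0;
}

void lock_release(struct klock *l)
{
    assert(l->owner != NULL);
    l->owner = NULL;
}

/* On leaving the kernel, the thread falls back to its original priority. */
void return_to_user(struct kthread *t)
{
    t->pri = t->base_pri;
}
```

With process 1 at priority 90 holding lock B, a lock attempt by process 2 at priority 40 raises process 1 to 40, so a process 3 at priority 60 can no longer keep process 1 off the CPU.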

[Figure: locks A, B, and C with Process 1 (low priority), Process 2 (high priority), and Process 3 (medium priority).]


User area in User64

The user structure is much larger in the 64-bit kernel than in the 32-bit kernel. To improve efficiency and performance in the 32-bit kernel, two structures are maintained: a 32-bit and a 64-bit. This ensures that the kernel does not copy data areas which are not used.

What is system hang detection and why do we need it?

Runaway processes and hung systems are hard to diagnose on a locked system, so a method to detect runaway processes is needed.

• Misbehaving high priority applications are a recurring problem.

• When one or more processes or threads are stuck in the running state, they can prevent any other lower priority threads from running.

• If the priority is above the default user priority, the machine can appear to be hung.

• The hung situation is very difficult to debug since the administrator cannot tell what is happening on the system.

The solution to the hang problem is system hang detection. It is implemented by the shdaemon, which runs at the highest user priority. The shdaemon monitors the lowest priority at which processes run on the system in a given period of time, and if the system fails to run processes below a given threshold priority, an action is taken. System hang detection can be configured with the shconf command, but the easiest way is to use the SMIT panel. There are five distinct actions that can be taken, and for each of them a timeout value and a threshold priority value can be set.

Action                                       Status     Detection Time-out  Process Priority  Other
Log an Error in the Error Log                [disable]  [120]               [60]
Display a warning message on a console       [disable]  [120]               [60]              Terminal Device [console]
Launch a recovering getty on a console       [enable]   [120]               [60]              Terminal Device [console]
Launch a command                             [disable]  [120]               [60]              Script [ ]
Automatically REBOOT system after Detection  [disable]  [300]               [39]


Signals

What are signals?

• Signals are a way of notifying a process or thread of a system event.

• Signals also provide a means of interprocess communication.

• A signal is represented as a single bit within a bit field in the kernel.

• The bit field used for signals is 64 bits wide, but only about 40 signals are defined.

• AIX 4.3.3 defines only 37 signals for the user.

• AIX 5 has 44 defined signals, but three of them are not used.

Types of signals

• There are two types of signals in AIX: synchronous and asynchronous.

• Synchronous signals are only delivered to a thread, usually as a result of an error condition or exception caused by the thread. For example, SIGILL is delivered to a thread that tries to execute an illegal instruction.

• Asynchronous signals are generated externally to the current thread or process.

• Asynchronous signals may be delivered to a process (for example, with kill()) or to another thread within the same process (for example, with thread_kill() or tidsig()).

Signal types

Signals may be generated for a number of reasons:

• An exception, such as a segment violation

• An interrupt, such as a clock tick

• An alarm, such as when a timer expires

• Process management, such as when a child process dies

• Device I/O, such as data becoming ready

• Signals from another process

Signal mechanism

When an event triggers a signal, the kernel sets the corresponding bit in the pending signal bit field for the process (p_sig) or thread (t_sig).

• All signals are enabled by default, and a thread checks for pending signals when it returns from the kernel.

• If the signal is being ignored (masked), nothing happens.


Signal handling

Signal delivery

When a signal has been generated but not yet handled, it is said to be pending.

• Pending signals are detected when returning from a system call.

• Pending signals are detected when resuming in user mode.

• Pending signals are detected when entering, or during, an interruptible sleep.

• Signals may be caught, blocked or ignored by a process.

Signal handling

Signal handling is done at the process level and signal masking is done at the thread level. That is, each thread in a process must use the signal handler set up by the process, but each has its own signal mask.

• If a pending signal is not caught by the process, its action applies to all threads in the process.

• If the signal is handled by the process, the signal is delivered to a thread that is not blocking the signal.

• If all threads are blocking a signal, it is left pending for the process until one thread unmasks the signal or the signal is removed from the pending list.

• If more than one signal is pending, only one is chosen for delivery at a time.

• When a signal is being handled, it is moved to the p_cursig or t_cursig field in the pvproc or pvthread structure.

Signal handler routines

Every signal has a default system handler, but for most signals a process can install its own local handler routine, or can ignore or block the signal.

• SIGKILL and SIGSTOP cannot be handled by a local routine; these signals are always handled by the system default routine.

• SIGKILL and SIGSTOP cannot be blocked; the process will always receive the signal.


Signal actions

The default action for a signal depends on the signal, but may be one of the following:

• Abort: This will generate a core dump and terminate the process.

• Exit: This will terminate the process without generating a core dump.

• Ignore: The signal is ignored.

• Stop: This action will suspend the process or thread.

• Continue: This action will resume a suspended process or thread.


Signals

SIGHUP 1 /* hangup, generated when terminal disconnects */

SIGINT 2 /* interrupt, generated from terminal special char */

SIGQUIT 3 /* (*) quit, generated from terminal special char */

SIGILL 4 /* (*) illegal instruction (not reset when caught)*/

SIGTRAP 5 /* (*) trace trap (not reset when caught) */

SIGABRT 6 /* (*) abort process */

SIGEMT 7 /* EMT instruction */

SIGFPE 8 /* (*) floating point exception */

SIGKILL 9 /* kill (cannot be caught or ignored) */

SIGBUS 10 /* (*) bus error (specification exception) */

SIGSEGV 11 /* (*) segmentation violation */

SIGSYS 12 /* (*) bad argument to system call */

SIGPIPE 13 /* write on a pipe with no one to read it */

SIGALRM 14 /* alarm clock timeout */

SIGTERM 15 /* software termination signal */

SIGURG 16 /* (+) urgent condition on I/O channel */

SIGSTOP 17 /* (@) stop (cannot be caught or ignored) */

SIGTSTP 18 /* (@) interactive stop */

SIGCONT 19 /* (!) continue (cannot be caught or ignored) */

SIGCHLD 20 /* (+) sent to parent on child stop or exit */

SIGTTIN 21 /* (@) background read attempted from ctl terminal*/

SIGTTOU 22 /* (@) background write attempted to control terminal */

SIGIO 23 /* (+) I/O possible, or completed */

SIGXCPU 24 /* cpu time limit exceeded (see setrlimit()) */

SIGXFSZ 25 /* file size limit exceeded (see setrlimit()) */

SIGMSG 27 /* input data is in the ring buffer */

SIGWINCH 28 /* (+) window size changed */

SIGPWR 29 /* (+) power-fail restart */

SIGUSR1 30 /* user defined signal 1 */

SIGUSR2 31 /* user defined signal 2 */

SIGPROF 32 /* profiling time alarm (see setitimer) */

SIGDANGER 33 /* system crash imminent; free up some pg space */


SIGVTALRM 34 /* virtual time alarm (see setitimer) */

SIGMIGRATE 35 /* migrate process */

SIGPRE 36 /* programming exception */

SIGVIRT 37 /* AIX virtual time alarm */

SIGKAP 60 /* keep alive poll from native keyboard */

SIGGRANT SIGKAP /* monitor mode granted */

SIGRETRACT 61 /* monitor mode should be relinquished */

SIGSOUND 62 /* sound control has completed */

SIGSAK 63 /* secure attention key */


Signal data structures

The file /usr/include/sys/proc.h defines the proc structure, which holds the following information about signals:

/* Signal information */

sigset_t p_sig; /* pending signals */

sigset_t p_sigignore; /* signals being ignored */

sigset_t p_sigcatch; /* signals being caught */

sigset_t p_siginfo; /* keep siginfo_t for these */

Signals

• A signal is represented by a bit in a bit array that sets aside one bit for each signal number.

• The bits are turned on by kernel code while the process is executing in kernel mode, or by the processing of interrupts that are attributed to the process.

• Signals can also be sent from one process to another process through the use of system calls.

• Signals are delivered to the process when:

- The process returns to the User Protection Domain.

- There is a transition from ready-to-run state to running state.

• To deliver a signal, the kernel checks whether the process is receiving the signal.

• If the signal is being received, the kernel sets the receiving process to perform the appropriate action.

• The appropriate action may be to invoke the signal handler for that particular signal, kill the process, or ignore the signal.

• If the signal is blocked by the process, it is left pending until the process is no longer blocking the signal.

• Signals can be delivered to a group of processes.

• Signals can be sent to a process or to a thread.

• A thread receives the signal if:

- The signal is synchronous and attributable to a particular thread, for example SIGSEGV.

- The signal is sent by a thread in the same process via the thread_kill system call.

• Otherwise, the signal goes to the process.


Signals to a process

• If a signal is not being caught, the signal action applies to the entire process.

- Every thread is terminated, stopped, or continued, depending on the action.

• If a signal is being caught:

- One thread that is not blocking the signal is picked to receive it.

- If all threads are blocking the signal, it is left pending on the process.


Exercises

Exercises after this module

In this exercise, the student will be supplied with programs that create processes and threads using the available thread models. The programs are very simple, and their source will be supplied to the student. Kernel debugging tools (running on a live kernel) are then used to interrogate the kernel structures associated with the process and threads of the program. The first code example explores the fork() system call and how variables are private to each process. The second example shows how threads are created and how global variables are shared because all threads share user space, while local variables in functions are not shared because that data is kept on the stack to make the procedure reentrant. The third example is a signal handler example.


C code example to explore fork() and wait() system calls

Use C code to create a child process with the fork() system call. Notice that the variable is private to each process.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int status;              /* exit status of the child, filled in by wait() */
pid_t proc_id;
pid_t proc_id2;

int main(int argc, char **argv)
{
    int this = 7;

    proc_id = fork();
    /* error routine */
    if (proc_id < 0) {
        printf("fork error \n");
        exit(-1);
    }
    if (proc_id > 0) {
        /* parent: change its private copy, then wait for the child */
        this = this + 4;
        printf("waiting for child \n");
        proc_id2 = wait(&status);
        printf("I'm Parent variable= %d \n", this);
        exit(0);
    }
    if (proc_id == 0) {
        /* child: still sees the value from before the fork */
        printf(" I'm the child process \n");
        sleep(1);
        printf("I'm the child the variable is %d\n", this);
        printf("I'm the child terminating\n");
        exit(0);
    }
    return 0;
}


C code example to explore thread creation

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *mythread(void *data);

int x = 0;   /* global: shared by all threads (unsynchronized, for demonstration only) */

int main(void)
{
    /* This will be an array holding the thread ids for each thread */
    pthread_t tids[5];
    int i;

    /* We will now create the 5 threads. */
    for (i = 0; i < 5; i++) {
        pthread_create(&tids[i], NULL, mythread, NULL);
    }

    /* We will now wait for each thread to terminate */
    for (i = 0; i < 5; i++) {
        /* this will block until the specified thread finishes execution.
         * second argument to pthread_join can be a pointer that will have
         * the return value of thread stored in it */
        pthread_join(tids[i], NULL);
    }
    return 0;
}

/* This is our actual thread function */
void *mythread(void *data)
{
    int v = 0;   /* local: each thread has its own copy on its own stack */

    printf(" x was %d v was %d , now change it ", x, v);
    if (x < 20) x = 444;
    if (v < 20) v = 444;
    printf(" x is %d v is %d \n", x, v);
    pthread_exit(NULL);
}


C sample code to explore process priority and the nice value

The program explores process priority and the nice value. It runs for a long time so that there is time to look at the process table and the nice value with the ps command.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

long ll1(long l1);

int main(int argc, char *argv[])
{
    int i;
    long ll = 1;

    if (argc < 2) {
        fprintf(stderr, "usage: %s increment\n", argv[0]);
        exit(1);
    }
    nice(atoi(argv[1]));     /* add the increment to the nice value */
    for (i = 1; i < 5000; i++) {
        ll = ll1(ll);
        ll++;
    }
    return 0;
}

/* Burn CPU time so that the process stays in the process table. */
long ll1(long l1)
{
    int e;
    long l2, l3 = l1;

    for (e = 1; e < 50000; e++) {
        l2 = (long)sin((double)l3);
        l3 = l2 + l3;
    }
    return l3;
}


C code example to explore signal handling

The signal code sample catches the signals and prints a message whenever a signal is caught. What happens if the same signal is sent twice? And how can this behaviour be changed?

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

int i;
void sig1(int signo), sig2(int signo), sig3(int signo);

int main(void)
{
    signal(SIGHUP, sig1);
    signal(SIGINT, sig2);
    signal(SIGQUIT, sig3);

    for (i = 1; i < 100; i++) {
        sleep(15);
        printf("been sleeping for 15 sec. \n");
    }
    return 0;
}

/* printf() is used in the handlers for simplicity only;
 * it is not async-signal-safe. */
void sig1(int signo) { printf("interrupt 1 received \n"); }

void sig2(int signo) { printf("interrupt 2 received \n"); }

void sig3(int signo) { printf("interrupt 3 received \n"); }


Unit 8. Memory Management

Objectives

After completing this unit, you should be able to describe the common features of the VMM on POWER and IA-64:

• virtual memory

• page mapping

• memory objects

• VMM tuning parameters

• object types

• shared memory objects

References


Overview of Virtual Memory Management

Introduction

Traditionally, the memory management component of the operating system (VMM) is responsible for managing the system’s real memory resources. Virtual memory systems provide the capability to run programs whose memory requirements exceed the system’s real memory resources by allowing programs to execute when they are only partially resident in memory and by utilizing disk to extend memory.

Memory management

The virtual memory system divides real memory into fixed-length pages and allocates pages to a program as it requires them. Such a system allows multiple programs to reside in memory and execute simultaneously.

The virtual memory system is responsible for keeping track of which pages of a program are resident in memory and which are on secondary storage (disk).

It handles interrupts from the address translation hardware in the system to determine when pages must be retrieved from secondary storage and placed in real memory.

When all of real memory is in use, it decides which program’s pages are to be replaced and paged out to secondary storage.

Each time a process accesses a virtual address, the virtual address is mapped (if not already mapped) by the VMM to a physical address where the data is located.

Access Protection

The VMM also provides access protection to prevent illegal access to data. This protects programs from incorrectly accessing kernel memory or memory belonging to other programs. Access protection also allows programs to set up memory that may be shared between processes.

VMM on POWER compared with IA-64

In this lesson the common features of the VMM on POWER and IA-64 are described. For the most part, the IA-64 VMM design inherits the design used on the POWER architecture. The majority of data structures, the serialization model, and the majority of code are common between the two. Separate lessons will describe the POWER and IA-64 VMM context.


Memory Management Definitions

Introduction

The following terms relating to virtual memory concepts will be defined in this section:

• Page

• Frame

• Address space

• Effective address

• Virtual memory

• Physical address

• Paging Space

Illustration

Follow this diagram as you read about the virtual memory concepts.

Page

A page is a fixed-size chunk of contiguous storage that is treated as the basic entity transferred between memory and disk. Pages are distinct from each other; they do not overlap in the virtual address space. AIX 5L uses a fixed page size of 4096 bytes on both POWER and IA-64. The smallest unit of memory managed by hardware and software is one page.

Frame

The place in real memory used to hold a page is called a frame. You can think of the page as the collection of information and the frame as the place in memory that holds that information.

[Figure: effective address, virtual address space, physical memory, and paging space for Process 1 and Process 2]


Address Space

An address space is the set of addresses available to a program that it can use to access memory. This lesson describes three types of address space:

• Effective address space.

• Virtual address space.

• Physical address space.

Effective Address

Effective addresses are the addresses referenced by the machine instructions of a program or the kernel. The effective address space is the range of addresses defined by the instruction set, 64 bits on AIX 5L. The effective address space is mapped to different physical address space or disk files for each process. Each program or process sees one contiguous address space.

Virtual Address

The virtual address space is the set of all memory objects that could be made addressable by the hardware. A virtual address is larger (has more address bits) than an effective address. Processes have access to a limited range of virtual addresses given to them by the kernel.

Physical Address

The physical address space depends on how much memory (memory chips) is installed in the machine. The physical address space maps one-to-one onto the machine’s hardware memory.

Paging space

Paging space is the disk area used by the memory manager to hold inactive memory pages with no other home. In AIX the paging space is mainly used to hold the pages from working storage (process data pages). If a memory page is not in physical memory, it may be loaded from disk; this is called a page-in. Writing a modified page to disk is called a page-out.


Demand Paging

Introduction

AIX is a demand paging system. Physical pages (frames) are not allocated for virtual pages until they are needed (referenced).

• Data is copied to a physical page only when referenced.

• Paging is done on the fly and is invisible to the user.

• Data comes from:

• A page from the page space.

• A page from a file on disk.

When a virtual address is referenced on a page that has no mapping to a frame, the mapping is done on the fly and the page frame is loaded from where it is mapped. The loading is invisible to the user process. Demand paging saves much of the overhead of creating new processes because the pages for execution do not have to be loaded unless they are needed. If a process never uses parts of its virtual space, no valuable physical memory is ever used for those parts.

Page Faults

A page fault occurs when a program tries to access a page that is not currently in real memory. Memory that has been recently used is kept in real memory, while memory that has not been recently used is kept aside in paging space.

For speed, most systems have the mapping of virtual addresses to real addresses done in the hardware. This mapping is done on a page-by-page basis. When the hardware finds that there is no mapping to real memory, it raises a page fault condition. The operating system software must handle these faults in such a way that the page fault is transparent to the user program.

Virtual Memory manager

The job of a virtual memory management system is to handle page faults so that they are transparent to the thread using virtual memory addresses.

Pool of Physical Free Pages

A pager daemon attempts to keep a pool of physical pages free. If the number of free pages falls below a low-water mark threshold, the pager frees the oldest (least recently referenced) pages until a high-water mark threshold is reached.


Pageable Kernel

The AIX kernel is pageable. Only some of the kernel is in physical memory at one time. Kernel pages that are not currently being used can be paged out.

Pinned Pages

Some parts of the kernel are required to stay in memory because it is not possible to perform a page-in when those pieces of code execute. These pages are said to be pinned. The bottom halves of device drivers (interrupt processing) are pinned. Only a small part of the kernel is required to be pinned.


Memory Objects

Introduction

A fundamental feature of AIX 5L’s Virtual Memory Manager is the use of addressable memory objects.

Objects

AIX 5L provides access to 256 MB objects called segments. The predominant features of these objects are:

• All objects are broken into pages.

• Objects can be shared among processes.

• Objects can grow by adding additional pages.

• Objects can be attached or detached from processes.

• New objects can be created or destroyed by threads in a process.

The benefit of this object-level addressing is the high degree of sharing that can be accomplished.

Object specifier

VMM code and interfaces operate on objects specified as:

<object ID,object Offset>

POWER VMM design compared with IA-64 design

The POWER architecture provides for efficient access to 256 MB objects (segments in POWER terminology) in the global virtual address space.

The 256 MB objects are also used in the IA-64 VMM implementation; however, segments are implemented in software instead of hardware. The terms “segment” and “object” have the same meaning, but keep in mind that on IA-64 the term “segment” should be understood in a software context.


Memory Object types

Introduction

There are five types of objects defined by the VMM:

• working

• persistent

• client

• log

• mapping

Working Objects

Working objects (also called working storage and working segments) are temporary segments used during the execution of a program for its stack and data areas. Process data is created by the loader at run time and is paged in from and paged out to paging space. Each working storage segment has associated with it the amount of paging space allocated to the pages in the segment. Part of the AIX kernel is also pageable and is part of working storage.

Persistent Objects

The VMM is used for performing I/O operations for file systems. Persistent objects are used to hold file data for the local file systems. When a process opens a file, the data pages are paged in. When the contents of the file change, the page is marked as modified and is eventually paged out directly to its original disk location. File system reads and writes occur by attaching the appropriate file system object and performing loads and stores between the mapped object and the user buffer. File data pages and program text are both part of persistent storage; however, program text pages are read-only pages that are paged in but never paged out to disk. Persistent pages do not use paging space.

Client Objects

Client objects are used for pages of client file systems (all file system types other than JFS). When remote pages are modified, they are marked and eventually paged out to their original disk location across the network. Remote program text pages (read-only pages) are paged out to paging space, from where they can be paged in later if needed.

Log Objects

Log objects are used for writing or reading JFS file system logs during journalling operations.

Mapping Objects

Mapping objects are used to support the mmap() interfaces, which allow an application to map multiple objects to the same memory segment.


Page Mapping

Introduction

This section describes the page mapping functions in the VMM.

VMM Function

The main function of the virtual memory manager is to translate effective addresses to real addresses.

Hardware differences

The exact procedure used by the VMM depends heavily on the hardware processor used by the system. As AIX 5L runs on both POWER and IA-64 processors, this lesson describes the process in general terms. More exact descriptions of address translation can be found in the hardware-specific lessons.

Diagram

This diagram shows the overall relationship among the major AIX data structures involved in mapping a virtual page to a real page or to paging space.

[Figure: effective address space, hardware-specific table, software page frame table, SID table, external page tables (XPT), real memory, paging space, filesystem, file inode]


Hardware Page Mapping

Hardware page mapping is determined by the processor architecture. The processor computes a hash function that is used to look up the appropriate hardware tables for the proper translation. The hardware-specific table used on POWER is a hardware Page Frame Table (PFT); on IA-64, a Virtual Hash Page Table (VHPT) is used.

Software Page Frame Table

Software Page Frame Tables (SWPFT) are extensions of the hardware page frame table and are used and managed by the VMM software. The SWPFT contains information associated with a page, such as the page-in and page-out flags, the free list flag, and the block number. It also contains the device information (PDT) used to obtain the proper page from disk.

Page Faults

Page faults occur when the hardware has looked through its page frame tables but cannot find a real page mapping for a virtual page.

A page fault causes the AIX Virtual Memory Manager (VMM) to do the bulk of its work. It handles the fault by first verifying that the requested page is valid. If the page is valid, the VMM determines the location of the page, recovers the page if necessary, and updates the hardware’s page frame table with the location of the page. A faulted page will be recovered from one of the following locations:

• In physical memory (but not in the hardware PFT).

• On a paging disk (working object)

• On a filesystem object (persistent object)

Protection Fault

A protection fault occurs when the page is in memory but the process has no right to access it.


Page Not In Hardware Frame Table

Introduction

The size of the hardware page tables is limited; therefore, the hardware cannot satisfy all address translation requests. The VMM software must supplement the hardware tables with software-managed page tables.

Procedure

When a page is not found in the hardware specific tables but is in physical memory, page fault handling consists of the steps detailed in the illustration and table that follow.

[Figure: page not in the hardware frame table. Labels: effective address space, virtual page number, hardware specific table, software page frame table, real page number, real memory, SID table, external page tables (XPT), paging space, filesystem, file inode]


Step Action

1 A page fault is generated by the address translation hardware. The page might be in real memory but not in the hardware specific table because of the table’s size limit.

2 The AIX Virtual Memory Manager first verifies that the requested page is valid. If the page is not valid, a kernel exception is generated.

3 If the page is valid, the VMM looks through the software PFT for the page. This processing almost duplicates the hardware processing, but uses the software page tables. The software PFTs are pinned.

4 If the page is found:

• The hardware specific table is updated with the real page number for this page, and the process resumes execution.

• No page-in of the page occurs.

Note: These steps assume that the page is in memory, just not in the hardware page tables.

It is important to remember that the dispatcher is not run. The faulted thread simply continues execution at the instruction that caused the fault.

PTEGs

PowerPC processors hash the PFT into Page Table Entry Groups (PTEGs), and these groups may only be able to hold 16 page entries each. Since more than 16 pages may hash into one PTEG, the VMM has to decide which ones are not in the PTEG. Then, when a page fault occurs for one of these pages, the VMM only has to reload the PTEG with the page in question, replacing some other page.
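The reload path in the steps above can be pictured with a small Python model. This is only a sketch under stated assumptions: the names HardwareTable, PTEG_SLOTS, NUM_GROUPS and handle_fault are hypothetical, the modulo operation stands in for the real hardware hash, and evicting the oldest entry is an arbitrary choice, not the policy the VMM actually uses.

```python
# Toy model: a capacity-limited hardware table backed by a complete software table.
# All names here are hypothetical and chosen for illustration only.

PTEG_SLOTS = 16      # entries per group, as stated in the text
NUM_GROUPS = 4       # deliberately tiny so that a group overflows


class HardwareTable:
    def __init__(self):
        # Each group is a list of (virtual page, real page) pairs.
        self.groups = [[] for _ in range(NUM_GROUPS)]

    def _group(self, vpn):
        return self.groups[vpn % NUM_GROUPS]   # stand-in for the hardware hash

    def lookup(self, vpn):
        for entry_vpn, rpn in self._group(vpn):
            if entry_vpn == vpn:
                return rpn
        return None                            # miss: the hardware raises a page fault

    def reload(self, vpn, rpn):
        group = self._group(vpn)
        if len(group) >= PTEG_SLOTS:
            group.pop(0)                       # replace some other entry in the group
        group.append((vpn, rpn))


def handle_fault(vpn, hw, software_pft):
    """Return 'reloaded' if the page was in real memory, else 'page-in needed'."""
    rpn = software_pft.get(vpn)
    if rpn is None:
        return "page-in needed"
    hw.reload(vpn, rpn)                        # no disk I/O is involved
    return "reloaded"


hw = HardwareTable()
# 17 resident pages that all hash to group 0, one more than the group can hold.
software_pft = {vpn: 1000 + vpn for vpn in range(0, 17 * NUM_GROUPS, NUM_GROUPS)}
for vpn, rpn in software_pft.items():
    hw.reload(vpn, rpn)

missed = hw.lookup(0) is None                  # page 0 was pushed out of its group
outcome = handle_fault(0, hw, software_pft)    # found in the software table
```

After the fault is handled, hw.lookup(0) succeeds again, which mirrors the thread continuing at the faulting instruction without any page-in.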


Page on Paging Space

Introduction

If the page is not found in real memory, the VMM determines whether it is on paging space or elsewhere on disk. If the page is in paging space, the disk block containing the page is located and the page is loaded into a free memory page.

Waiting for I/O

Copying a page from paging space to an available frame is not a synchronous process. Any process or thread waiting for a page fault to be handled is put to sleep until the page is available.

Procedure

The procedure for loading a page from paging space is shown in the illustration and table that follow.

[Figure: loading a page from paging space. Labels: effective address space, virtual page number, hardware specific table, software page frame table, real page number, real memory, segment ID table, XPT address and page number, external page tables (XPT), disk block number, paging space, filesystem, file inode]


Step Action

1 The VMM looks up the object ID for this address in the Segment ID table and gets the External Page Table (XPT) root pointer.

2 The VMM finds the correct XPT direct block from the XPT root.

3 The VMM gets the paging space disk block number from the XPT direct block.

4 The VMM takes the first available frame from the free frame list. (The free list contains one entry for each free frame of real memory.)

5 If the free frame list is empty, the VMM uses an algorithm to select several active pages to steal.

• If the page to be stolen is modified, an I/O request is issued to write the contents of the selected page to disk.

• Once written, the frames containing the stolen pages are added to the free list, and one is selected to hold the page from paging space.

6 The VMM indicates the device and logical block for the page. An I/O request loads the frame with the data for the faulting page.

7 When the I/O completes, the VMM is notified and the thread waiting on the frame is awakened.

8 The disk block is loaded from paging space or the file system.

9 The hardware PFT is updated, and the process or thread resumes at the faulting instruction.

The net effect is that the process or thread has no knowledge that a page fault occurred, except for a delay in its processing.
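The page-in procedure can be followed in a small Python model. It is a sketch only: ToyVMM, STEAL_BATCH and the field names are hypothetical, the XPT is reduced to a dictionary, the victim selection simply takes the first resident pages rather than using a real replacement algorithm, and the I/O is treated as completing immediately instead of putting the thread to sleep.

```python
from collections import deque

STEAL_BATCH = 3   # hypothetical batch size, chosen small for the example


class ToyVMM:
    def __init__(self, num_frames, xpt):
        self.xpt = xpt                       # virtual page -> paging space disk block
        self.free_list = deque(range(num_frames))
        self.resident = {}                   # virtual page -> frame
        self.modified = set()                # resident pages changed since page-in
        self.pageouts = []                   # disk blocks written while stealing

    def steal(self):
        # Placeholder victim selection: the first few resident pages.
        for victim in list(self.resident)[:STEAL_BATCH]:
            frame = self.resident.pop(victim)
            if victim in self.modified:      # modified pages are written out first
                self.pageouts.append(self.xpt[victim])
                self.modified.discard(victim)
            self.free_list.append(frame)

    def page_in(self, vpn):
        block = self.xpt[vpn]                # steps 1-3: XPT lookup yields the disk block
        if not self.free_list:               # step 5: free list empty, steal pages
            self.steal()
        frame = self.free_list.popleft()     # step 4: first available frame
        # Steps 6-8: an I/O request would read `block` into `frame` here while
        # the faulting thread sleeps; the toy model completes it immediately.
        self.resident[vpn] = frame           # step 9: tables updated, thread resumes
        return frame, block


vmm = ToyVMM(num_frames=2, xpt={vpn: 100 + vpn for vpn in range(6)})
first = [vmm.page_in(vpn) for vpn in (0, 1)]   # uses both free frames
vmm.modified.add(0)                            # pretend page 0 was stored to
third = vmm.page_in(2)                         # free list is empty, so pages are stolen
```

In this run the third fault steals both resident pages, writes out only the modified one (block 100), and leaves one frame on the free list for the next fault.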


External Page Table (XPT)

The XPT maps a page within a working storage segment to a disk block on external storage. The XPT is a two-level tree structure.

The first level of the tree is the XPT root block. The second level consists of 256 direct blocks. Each word in the root block is a pointer to one of the direct blocks. Each word of a direct block contains the page state and disk block information for a single page in the segment.

Each XPT direct block covers 1MB of the 256MB segment.
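The two-level layout implies a simple index calculation, worked through below in Python. The function name xpt_indices is hypothetical, and a 4KB page size is assumed (it follows from 256 entries covering 1MB).

```python
PAGE_SIZE = 4096                  # assumed 4 KB pages
ENTRIES_PER_DIRECT_BLOCK = 256    # 256 entries * 4 KB = 1 MB per direct block
DIRECT_BLOCKS = 256               # 256 direct blocks * 1 MB = 256 MB segment


def xpt_indices(segment_offset):
    """Map a byte offset within a segment to (root index, direct block index)."""
    page = segment_offset // PAGE_SIZE
    if not 0 <= page < DIRECT_BLOCKS * ENTRIES_PER_DIRECT_BLOCK:
        raise ValueError("offset lies outside the 256 MB segment")
    return divmod(page, ENTRIES_PER_DIRECT_BLOCK)


first = xpt_indices(0)                                   # page 0
last = xpt_indices(256 * 1024 * 1024 - 1)                # page 65535
middle = xpt_indices(3 * 1024 * 1024 + 5 * PAGE_SIZE)    # page 773
```

Page 0 lands in entry 0 of direct block 0 and page 65535 in entry 255 of direct block 255, while the offset 3MB plus five pages resolves to root index 3, entry 5.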

[Figure: the XPT root (entries 0 to 255) points to XPT direct blocks 0 to 255. Each direct block holds XPT entries 0 to 255 and covers 1MB of the 256MB segment: pages 0 to 255 map through direct block 0, and pages 65280 to 65535 through direct block 255. The entries refer to disk blocks in paging space.]


Paging Space Allocation Policy

AIX offers two policies for allocating paging space. If the environment variable PSALLOC is set to early (PSALLOC=early), the early allocation policy is used, which causes a disk block to be allocated whenever a memory request is made. This guarantees that the paging space will be available if it is needed.

If the environment variable is not set, the default late allocation policy is used, and a disk block is not allocated until it becomes necessary to page out the page. This policy decreases paging space requirements on large-memory systems that do little paging.
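The difference between the two policies can be made concrete with a deliberately simplified Python model. The function blocks_allocated and its parameters are hypothetical; it counts one disk block per page and ignores everything else the VMM does. The last line shows one way a launcher script might pass PSALLOC=early to a child program through its environment.

```python
import os


def blocks_allocated(policy, pages_requested, pages_paged_out):
    """Paging space blocks in use under a simplified model of each policy."""
    if policy == "early":
        return pages_requested        # a block is allocated for every page requested
    if policy == "late":
        return pages_paged_out        # a block is allocated only at page-out time
    raise ValueError(f"unknown policy: {policy}")


# A process requests 10,000 pages, but only 50 ever have to be paged out.
early = blocks_allocated("early", 10_000, 50)
late = blocks_allocated("late", 10_000, 50)

# Environment for a child process that should use early allocation;
# it could be passed as env= to subprocess.run().
child_env = dict(os.environ, PSALLOC="early")
```

On a system that rarely pages, the late policy in this model uses 50 blocks where the early policy reserves 10,000, which is the saving the text describes.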

Free memory list

The VMM maintains a linked list containing all the currently free real memory pages in the system. When a page fault occurs, the VMM just takes the first page from this list to assign to the faulting page. When the free frame list is empty and a page fault occurs, the VMM selects several active pages to be stolen (usually around 20 or so), and all these pages are then added to the free list. This reduces the amount of time spent starting and running the steal routines.

Paging Device Table (PDT)

The Paging Device Table (PDT) contains an entry for every device referenced by the VMM.

It is used for filesystem, paging, log and remote pages.

There is a pending I/O list associated with PDT.

The pending I/O list contains all page frames awaiting I/O for the device.

Page frames are removed from the list as soon as the I/O has been dispatched to the device.


Loading Pages From The Filesystem

Introduction

Persistent pages do not use the XPT (eXternal Page Table). The VMM uses the information contained in the file’s inode structure to locate the pages for the file.

Procedure

Persistent pages are paged from local files located on a filesystem. A local file has a segment allocated and has an entry (SID) in the segment information table. The SID entry points to the inode, which allows the VMM to find and page in the faulting block.

[Figure: loading a page from the filesystem. Labels: effective address space, virtual page number, hardware specific table, software page frame table, real page number, real memory, segment ID table, inode address, file inode, disk block number, filesystem, external page tables (XPT), paging space]


Filesystem I/O

Introduction

The paging functions of the VMM are also used to perform reads and writes to files by processes.

File system objects

File system reads and writes occur by attaching the appropriate file system object and performing loads and stores between the mapped object and the user buffer. This means that file objects are not directly addressable in the current address space but are instead temporarily attached.

A local file has a segment allocated and has an entry (SID) in the segment information table. The file’s gnode records which segment belongs to the file.

Persistent pages

AIX uses a large portion of memory as the filesystem buffer cache. The pages for files compete for storage the same way as other pages. The VMM schedules the modified persistent pages to be written to their original location on disk when:

• The VMM needs the frame for another page.

• The file is closed.

• A sync operation is performed.

The sync operation can be performed by the syncd daemon running on the system (by default the syncd daemon runs every 60 seconds), by calling the sync() function, or by running the sync command. Scheduling does not mean that the data is written to disk at once.


Free Memory and Page Replacement

Introduction

To maintain system performance, the VMM always wants some physical memory to be available for page-ins. This section describes the free memory list and the algorithms used to keep pages on the list.

Free memory list

The VMM maintains a linked list containing all the currently free real memory pages in the system. When a page fault occurs, VMM just takes the first page from this list to assign to the faulting page. When the free frame list is empty and a page fault occurs, VMM selects several active pages to be stolen (usually around 20 or so), and all these pages are then added to the free list. This reduces the amount of time spent starting and running the steal routines.

Page Replacement Algorithm

The method used to select a page to be replaced is called the page replacement algorithm. The mechanism used to determine which pages to steal is a pseudo-LRU (Least Recently Used) algorithm called the clock-hand algorithm. This algorithm is commonly used in operating systems when the hardware provides only a reference bit for each page in physical memory. The hardware automatically sets the reference bit for a page whenever the page is accessed, by a load or a store. The clock-hand algorithm checks frames by frame number, looking for pages that have not been referenced since the last time the algorithm looked at them. If a page has been referenced since the last time the algorithm looked at the frame, the algorithm clears the reference bit and goes on to the next frame. If the page has not been referenced since the last time the algorithm looked at the frame, the page is stolen.


Clock Hand

The algorithm is called the clock-hand algorithm because it acts like a clock hand that is constantly pointing at frames in order. The clock hand advances whenever the algorithm advances to the next frame. If a modified page is stolen, the clock-hand algorithm writes the page to disk (to paging space or a file system) before stealing the page.

[Figure: the clock hand rotates over the physical pages. For a page with Reference = 1, the reference bit is changed to zero when the clock hand passes. A page with Reference = 0 is eligible to be stolen.]
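The clock-hand algorithm is short enough to sketch in Python. The class and method names below (Clock, touch, steal) are hypothetical, and the model tracks only the reference bit; writing out modified pages and reusing the stolen frame are left out.

```python
class Clock:
    def __init__(self, reference_bits):
        self.referenced = list(reference_bits)   # one reference bit per frame
        self.hand = 0                            # frame the hand points at

    def touch(self, frame):
        self.referenced[frame] = True            # done by the hardware on access

    def steal(self):
        """Advance the hand until an unreferenced frame is found; return it."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.referenced)
            if self.referenced[frame]:
                self.referenced[frame] = False   # clear the bit and move on
            else:
                return frame                     # not referenced since the last pass


clock = Clock([True, False, True, False])        # same bit pattern as the figure
first = clock.steal()
second = clock.steal()
clock.touch(3)                                   # frame 3 is accessed again
third = clock.steal()
```

Starting from the pattern in the illustration, the hand clears the bit of frame 0 and steals frame 1, then clears frame 2 and steals frame 3. On the next pass frame 0 is stolen, because it has not been referenced since the hand last cleared its bit.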


vmtune

Introduction

A certain number of pages of each type must remain in memory to maintain system performance. The VMM keeps statistics for each page type and enforces thresholds in the page replacement algorithm. When the number of pages of one type approaches its threshold, the page replacement algorithm selects pages of the appropriate type for replacement and favors the others. In this way the VMM brings the state of memory back within bounds.

VMM Tunable Parameters

The vmtune command changes operational parameters of the Virtual Memory Manager controlling the thresholds.

Parameter Description

minfree Page replacement is invoked whenever the number of free page frames falls below this threshold.

maxfree The page replacement algorithm replaces enough pages so that this number of frames are free when it completes.

LruBucket Specifies the size (in 4K pages) of the least recently used (LRU) page-replacement bucket. This is the number of page frames that are examined at one time for possible page-outs when a free frame is needed. A lower number results in lower latency when looking for a free frame, but also in behavior that is less like a true LRU algorithm.

MaxPin Specifies the maximum percentage of real memory that can be pinned. The default value is 80. If this value is changed, the new value should ensure that at least 4MB of real memory will be left unpinned for use by the kernel.

minperm Specifies the point below which file pages are protected from the re-page algorithm. This value is a percentage of the total real-memory page frames in the system. The specified value must be greater than or equal to 5.

MaxPerm Specifies the point above which the page stealing algorithm steals only file pages. This value is expressed as a percentage of the total real-memory page frames in the system. The specified value must be greater than or equal to 5.

MinPgAhead Specifies the number of pages with which sequential read-ahead starts. This value can range from 0 through 4096. It should be a power of 2.

MaxPgAhead Specifies the maximum number of pages to be read ahead. This value can range from 0 through 4096. It should be a power of 2 and should be greater than or equal to MinPgAhead.

NpsWarn Specifies the number of free paging-space pages at which the operating system begins sending the SIGDANGER signal to processes. The default value is 512.
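How minfree and maxfree work together can be shown with a few lines of Python. This is a simplified reading of the two table entries above, not the VMM implementation; the function name frames_to_steal and the sample values are chosen only for illustration.

```python
def frames_to_steal(free_frames, minfree, maxfree):
    """Number of frames page replacement should free, under a simplified model."""
    if maxfree < minfree:
        raise ValueError("maxfree must not be smaller than minfree")
    if free_frames >= minfree:
        return 0                      # enough free frames: replacement is not invoked
    return maxfree - free_frames      # replace pages until maxfree frames are free


idle = frames_to_steal(free_frames=200, minfree=120, maxfree=128)
busy = frames_to_steal(free_frames=100, minfree=120, maxfree=128)
```

With 200 free frames nothing happens; with 100 free frames the free count has fallen below minfree, so 28 frames are replaced to bring it up to maxfree.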


Fatal Memory Exceptions

Introduction

Not all page and protection faults can be handled by the operating system. When a fault occurs that cannot be handled, the system panics and immediately halts.

Fatal memory exceptions

In all of the following cases, the VMM bypasses all kernel exception handlers and immediately halts the system:

• A page fault occurs in the interrupt environment.

• A page fault occurs with interrupts partially disabled.

• A protection fault occurs while in kernel mode on kernel data.

• The system is out of paging space, or an I/O error occurs on kernel data.

• An instruction storage exception occurs while in kernel mode.

• A memory exception occurs while in kernel mode without an exception handler set up.


Memory Objects (Segments)

Introduction

Each segment has a unique segment ID in the segment table. There are a number of important segment types in AIX:

• kernel

• user text

• shared library text

• shared data

• process private

• shared library data

Kernel segment

This segment is described separately for Power and IA-64 in their lessons.

User text

The user text segment contains the code of the program. Threads in user mode have read-only access to the text segment, which prevents modification while the program is running. This protection allows a single copy of a text segment to be shared by all processes associated with the same program. For example, if two processes in the system are running the ls command, the instructions of ls are shared between them.

Running a debugger

When a debugger is running on a program, a private read/write copy of the text segment is used. This allows debuggers to set breakpoints directly in the code. In that case the status of the text segment is changed from shared to private.


Shared Library Text

The shared library text segment contains mappings whose addresses are common across all processes. A shared library text segment has the following properties:

• It contains a copy of the program text (instructions) for the shared libraries currently in use in the system.

• It is added to the user address space by the loader when the first shared library is loaded.

• Each process using text from this segment has a copy of the corresponding data in the per-process shared library data segment.

Executable modules list the shared libraries they need at exec() time. The shared library text is loaded into this segment when a module is loaded via the exec() system call. A program may also issue load() calls to get additional shared modules.

Per-Process Shared Library Data Segment

Functions in a shared library may have data that cannot be shared between processes. This data is loaded as process-private data.

• This segment holds items required by modules in the shared text segment(s).

• There is one of these segments for each process.

• Addresses of data items are generally the same across processes.

• The data itself is not shared.

The shared library data segment acts like an extension of the process private segment.

Shared data

Mapped memory regions, also called shared memory areas, can serve as a large pool for exchanging data among processes.

Process private

The process private segment is not shared with other processes. It contains:

• user data (for 32-bit programs that aren’t maxdata programs)

• the user stack (for 32-bit programs)

• text and data from explicitly loaded modules (for 32-bit programs)

• kernel per-process data (accessible only in kernel mode)

• primary kernel thread stack (accessible only in kernel mode)

• per-process loader data (accessible only in kernel mode)


Shared Memory segments

Introduction

Mapped memory regions, also called shared memory areas, can serve as a large pool for exchanging data among processes.

• A process can create and/or attach a shared data segment that is accessible by other processes.

• A shared data segment can represent a single memory object or a collection of memory objects.

• Shared memory can be attached read-only or read-write.

Benefit

Shared memory areas can be most beneficial when the amount of data to be exchanged between processes is too large to transfer with messages, or when many processes maintain a common large database.

Methods of Sharing

The system provides two methods of sharing memory:

• Mapping file data into the process address space (mmap() services).

• Mapping processes to anonymous memory regions that may be shared (shmat services).

Shared memory address

Shared memory is process based and can be attached at different effective addresses in different processes.
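The idea that one region of real memory is reachable through more than one attachment can be tried out from Python. This is a hedged sketch: Python's multiprocessing.shared_memory module is built on POSIX shared memory rather than on the shmget() and shmat() services discussed in this lesson, and the second attachment is made from the same process purely to keep the example self-contained.

```python
from multiprocessing import shared_memory

# Create a named shared memory segment of one page.
segment = shared_memory.SharedMemory(create=True, size=4096)
try:
    # A second attachment by name, standing in for another process.
    other = shared_memory.SharedMemory(name=segment.name)
    try:
        segment.buf[0:5] = b"hello"            # store through the first attachment
        seen = bytes(other.buf[0:5])           # load through the second attachment
    finally:
        other.close()                          # detach the second mapping
finally:
    segment.close()                            # detach the first mapping
    segment.unlink()                           # remove the segment from the system
```

The store made through one attachment is visible through the other because both refer to the same underlying memory. Nothing in the example serializes access, so real programs need their own locking.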

Serialization

There is no implicit serialization support when two or more processes access the same shared data segment. The available subroutines do not provide locks or access control among the processes. Therefore, processes using shared memory areas must set up a signal or semaphore control method to prevent access conflicts and to keep one process from changing data that another is using.

[Figure: the VMM maps regions of the effective address spaces of process A and process B onto the same real memory]


shmat Memory Services

Introduction

The shmat services are typically used to create and use shared memory objects from a program.

shmat functions

Your program can use the following functions to create and manage shared memory segments.

• shmctl() - Controls shared memory operations

• shmget() - Gets or creates a shared memory segment

• shmat() - Attaches a shared memory segment to a process

• shmdt() - Detaches a shared memory segment from a process

• disclaim() - Removes a mapping from a specified address range within a shared memory segment

Using shmat

The shmget() system call is used to create a shared memory region. When a shared memory region larger than 256MB is requested, shmget() creates multiple segments.

The shmat() system call is used to gain addressability to a shared memory region.

Limitations

At present, shmget() on the 64-bit kernel is limited to 8 segments, even for 64-bit applications. Thus, the largest shared memory region that can be created is 2GB (8 segments of 256MB). This limitation will be removed when a 64-bit application performs the shmget(); there will then be no explicit limit other than what system resources will bear. 32-bit applications will retain the 2GB limitation.

EXTSHM

The environment variable setting EXTSHM=ON allows shared memory regions to be created with page granularity instead of the default segment granularity. This allows more shared memory regions within the same-sized address space, but does not increase the total amount of shared memory region space.


When to use

Use the shmat() services under the following circumstances:

• When mapping files larger than 256MB.

• For a 32-bit application, when eleven or fewer files are mapped simultaneously and each is smaller than 256MB.

• When mapping shared memory regions that need to be shared among unrelated processes (no parent-child relationship).

• When mapping entire files.


Memory Mapped Files

Introduction

Shared segments can be used to map any ordinary file directly into memory.

• Instead of reading and writing to the file, the program just loads or stores in the segment.

• This avoids buffering of the I/O data in the kernel.

• This provides easy random access, as the file data is always available.

• This avoids the system call overhead of read() and write().

• Either the shmat() or the mmap() system call can be used.

File mapping

The system allows file mapping at the user level. This allows a program to access file data through loads and stores to its virtual address space. This single-level store approach can also greatly improve performance by creating a form of Direct Memory Access (DMA) file access. Instead of buffering the data in the kernel and copying the data from kernel to user, the file data is mapped directly into the user’s address space.

Shared files

A file can be shared between multiple processes even if some are using mapping and others are using the read/write system call interface. Of course, this may require some sort of synchronization scheme between the processes.
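The following Python sketch shows a file being accessed through a mapping and through the read() interface at the same time. It uses Python's standard mmap module as a stand-in for the mmap() service described here; the temporary file and the variable names are hypothetical, and both kinds of access happen in one process for brevity.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world\n")
    with mmap.mmap(fd, 0) as mapping:          # shared read-write mapping of the whole file
        from_memory = mapping[0:5]             # a load instead of read()
        mapping[0:5] = b"HELLO"                # a store instead of write()
        mapping.flush()                        # push the change to the file
    os.lseek(fd, 0, os.SEEK_SET)
    from_read = os.read(fd, 5)                 # the read() interface sees the store
finally:
    os.close(fd)
    os.remove(path)
```

The bytes stored through the mapping are what the later read() returns, which is the behavior the paragraph above relies on when mapped and unmapped accessors share a file.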

shmat to map files

When shmat is used to map a file, an open file descriptor is used in place of the shared memory ID. Once the file segment is mapped, it is treated like any other shared segment and can be shared with other processes.


mmap services

The mmap services are typically used for mapping files, although they may be used for creating shared memory segments as well.

• madvise() - Advises the system of a process' expected paging behavior

• mincore() - Determines residency of memory pages

• mmap() - Maps an object file into virtual memory

• mprotect() - Modifies the access protections of memory mapping

• msync() - Synchronizes a mapped file with its underlying storage device

• munmap() - Un-maps a mapped memory region

Both the mmap and shmat services provide the capability for multiple processes to map the same region of an object such that they share addressability to that object. However, the mmap subroutine extends this capability beyond that provided by the shmat subroutine by allowing a relatively unlimited number of such mappings to be established.

When to use mmap

Use mmap under the following circumstances:

• Portability of the application is a concern.

• Many files are mapped simultaneously.

• Only a portion of a file needs to be mapped.

• Page-level protection needs to be set on the mapping.

• Private mapping is required.
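Two of the circumstances above, mapping only a portion of a file and using a private mapping, can be demonstrated with Python's standard mmap module. This is an illustrative sketch, not AIX-specific code: ACCESS_COPY is Python's name for a private copy-on-write mapping, and the temporary file and variable names are hypothetical.

```python
import mmap
import os
import tempfile

GRANULARITY = mmap.ALLOCATIONGRANULARITY       # offsets must be a multiple of this

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"A" * GRANULARITY + b"B" * GRANULARITY)
    # Map only the second half of the file, as a private (copy-on-write) mapping.
    with mmap.mmap(fd, GRANULARITY, access=mmap.ACCESS_COPY,
                   offset=GRANULARITY) as view:
        first_byte = view[0:1]                 # the mapping starts mid-file
        view[0:1] = b"X"                       # changes only this private copy
        private_byte = view[0:1]
    os.lseek(fd, GRANULARITY, os.SEEK_SET)
    file_byte = os.read(fd, 1)                 # the file itself is unchanged
finally:
    os.close(fd)
    os.remove(path)
```

The mapping begins at the first "B" of the file, and the store of "X" is visible only through the private mapping; reading the same position from the file still returns "B".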


Mapping Types

There are three mapping types:

• read-write mapping

• read-only mapping

• deferred-update mapping

Read-Write Mapping

Read-write mapping allows loads and stores in the segment to behave like reads and writes to the corresponding file. If a thread loads beyond the end of the file, the load returns zero values.

Read-only Mapping

Read-only mapping allows only loads from the segment. The operating system generates a SIGSEGV signal if a program attempts an access that exceeds the access permission given to a memory region. Just as with read-write access, a thread that loads beyond the end of the file loads zero values.
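A read-only mapping can be tried with Python's standard mmap module, as sketched below. One difference from the behavior described above should be kept in mind: Python checks the access mode itself and raises a TypeError on an attempted store, so the program never reaches the point where the operating system would deliver SIGSEGV. The temporary file and variable names are hypothetical.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"read-only data")
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
        loaded = view[0:9]                     # loads are allowed
        try:
            view[0:1] = b"X"                   # a store is rejected
            store_rejected = False
        except TypeError:
            store_rejected = True
finally:
    os.close(fd)
    os.remove(path)
```

The load succeeds and the store is refused, which is the essential property of a read-only mapping.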

Deferred Update Mapping

Deferred update mapping also allows loads and stores to the segment to behave like reads and writes to the corresponding file. The difference between this mapping and read-write mapping is that the modifications are delayed: a store into the segment modifies the segment but does not modify the corresponding file.

With deferred update, the application can begin modifying the file data (by memory-mapped loads and stores) and then either commit the modifications to the file system (via fsync()) or discard them completely. This can greatly simplify error recovery and allows the application to avoid a costly temporary file that might otherwise be required.

Data written to a file that a process has opened for deferred update (with the O_DEFER flag) is not written to permanent storage until another process issues an fsync() subroutine against this file or runs a synchronous write subroutine (with the O_SYNC flag) on this file.


Unit 9. IA-64 Virtual Memory Manager

Objectives

After completing this unit, you should be able to:

• List the size of the effective and virtual address spaces on the IA-64 platform.

• Show how regions, region registers, and region IDs are used in AIX 5L.

• Name the region register that is used to identify a process's private region.

• Given an address, identify the region to which it belongs.

References

• Intel IA-64 Architecture Software Developer’s Manual


IA-64 Addressing Introduction

Introduction

AIX 5L on the IA-64 platform is designed as a 64-bit kernel. Unlike the POWER version of AIX 5L, no 32-bit kernel is available. This lesson describes the address translation mechanism used by AIX 5L on the IA-64 platform.

Overview

The IA-64 platform provides an effective address space that is 64 bits wide.

• The effective address space is divided into eight regions.

• Each region has a region register associated with it (rr0 - rr7).

• The region registers, under control of the OS, supply an additional 24 bits of addressing, creating an 85-bit virtual address space.

ILP32

In addition to a 64-bit programming model, AIX 5L provides a 32-bit address environment (ILP32). The ILP32 address space is 4 GB. A zero-extension model is used to convert 32-bit addresses to 64 bits for address translation, so the ILP32 effective address space is completely contained in the first 4 GB of the 64-bit model.
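The effect of zero extension can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and sign extension is shown purely as a contrast to make clear why zero extension keeps every ILP32 address in the low 4 GB.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def zero_extend(ea32):
    """Convert a 32-bit address to 64 bits by filling the upper half with zeros."""
    return ea32 & MASK32


def sign_extend(ea32):
    """Contrast case only: replicate bit 31 into the upper 32 bits."""
    ea32 &= MASK32
    if ea32 & 0x8000_0000:
        return (ea32 | ~MASK32) & MASK64
    return ea32


def region_of(ea64):
    """The upper 3 bits of a 64-bit effective address select the region."""
    return (ea64 >> 61) & 0x7
```

With zero extension, even the highest 32-bit address, 0xFFFFFFFF, stays below 4 GB and therefore in region 0. Under sign extension the same address would have all upper bits set and would fall in region 7.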

Page 353: Version 20001015 - Freebabelleetsteph.free.fr/doc/cheat_sheets/AIX/AIX5L_StudentGuide.pdf · AIX 5L Internals Student Guide Version 20001015 IBM Web Server Knowledge Channel

© Copyright IBM Corp. 2000 Version 20001015 -3 of 18Course materials may not be reporduced in whole or in part

without the prior writen permission of IBM.

Draft Version for review, Sunday, 15. October 2000, IA-64vmm.fm Guide

Regions

Introduction

The 64-bit effective address space is divided into 8 regions. This section describes how the regions are addressed.

Region selector

The 64-bit effective address space consists of 8 regions, each addressed by 61 bits. A region is selected by the upper 3 bits of the effective address. Each region has a region register associated with it (rr0 - rr7) that contains a 24-bit region identifier (RID) for the region. When an effective address is translated to a virtual address, the 24-bit region identifier is combined with the lower 61 bits of the effective address to form an 85-bit virtual address.
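The bit manipulation described above can be sketched as follows. This is a model for study, not kernel code: the list region_registers stands in for rr0 - rr7 and is assumed to hold only the RID of each register, and the function names are hypothetical.

```python
OFFSET_BITS = 61   # bits 60-0 of the effective address
RID_BITS = 24      # width of the region identifier


def split_effective_address(ea):
    """Return (virtual region number, 61-bit offset within the region)."""
    ea &= (1 << 64) - 1
    vrn = ea >> OFFSET_BITS
    offset = ea & ((1 << OFFSET_BITS) - 1)
    return vrn, offset


def virtual_address(ea, region_registers):
    """Combine the selected region's RID with the 61-bit offset (85 bits total)."""
    vrn, offset = split_effective_address(ea)
    rid = region_registers[vrn] & ((1 << RID_BITS) - 1)
    return (rid << OFFSET_BITS) | offset
```

For example, with a made-up RID of 0xABCDEF loaded for region 7, the effective address 0xE000000000001000 selects region 7 with offset 0x1000, and the resulting virtual address is (0xABCDEF << 61) | 0x1000.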

Managing region registers

The AIX 5L operating system manages the contents of the region registers. An address space is made accessible to a process by loading the proper RID into one of the eight region registers.

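[Figure: 64-bit effective address. Bits 63-61 (3 bits) select a region register; bits 60-0 (61 bits) address a byte within the region. The selected region register supplies a 24-bit region ID, giving 2^61 * 2^24 = 2^85 bytes of virtual address space.]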


Region Registers

Introduction

Each region register contains a Region IDentifier (RID) and region attributes.

Region Registers

The fields making up a region register are detailed below.

Region Register

bits 63-32   rv
bits 31-8    rid
bits 7-2     ps
bit 1        rv
bit 0        ve

field   description
rv      Reserved.
ve      VHPT walker enable. 1: the VHPT walker is enabled for the region. 0: the VHPT walker is disabled for the region.
ps      Preferred page size. Selects the virtual address bits used by the hash function for the TLB or VHPT.
rid     24-bit region identifier.
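As a study aid, the register layout above can be expressed as shift-and-mask operations. The bit positions are taken from the figure (rid in bits 31-8, ps in bits 7-2, ve in bit 0); the names RegionRegister, decode_rr, and encode_rr are hypothetical helpers.

```python
from typing import NamedTuple


class RegionRegister(NamedTuple):
    rid: int  # 24-bit region identifier, bits 31-8
    ps: int   # preferred page size field, bits 7-2
    ve: int   # VHPT walker enable, bit 0


def decode_rr(value):
    """Extract the defined fields from a 64-bit region register value."""
    return RegionRegister(
        rid=(value >> 8) & 0xFFFFFF,
        ps=(value >> 2) & 0x3F,
        ve=value & 0x1,
    )


def encode_rr(rid, ps, ve):
    """Pack the fields back into a register value; reserved bits stay zero."""
    return ((rid & 0xFFFFFF) << 8) | ((ps & 0x3F) << 2) | (ve & 0x1)
```

Encoding a made-up RID of 0xABCDEF with ps = 12 and ve = 1 gives the register value 0xABCDEF31, and decoding that value returns the same three fields.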


Address Translation

Introduction

The VMM software in AIX 5L works closely with the hardware to translate an effective address to an address in physical memory.

VMM hardware

This diagram and the table that follows describe the hardware components and the process used to perform address translation.

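[Figure: IA-64 address translation. Bits 63-61 of the effective address select one of the region registers (rr0 - rr7), which supplies the 24-bit region ID. The region ID, the virtual page number, and the offset form the virtual address. The region ID and virtual page number are hashed and used to search the Translation Lookaside Buffer, whose entries hold region ID, key, VPN, rights, and physical page number. The key of the matching entry is searched for in the protection key registers (key, rights). The physical page number and the offset form the physical address.]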


Address Translation Details

This table describes the process of address translation.

Step  Action
1     The effective address contains three parts: the Virtual Region Number (VRN), the Virtual Page Number (VPN), and the page offset.
2     The 3 VRN bits are used to select a region register.
3     The region register provides a 24-bit region ID.
4     The region ID and the virtual page number are used to search for an address translation in the TLB or in the hardware-maintained page tables.
5     If no match is found, a page fault is generated, transferring control to the OS. The OS must resolve the fault by making a page available and updating the translation tables.
6     A successful translation produces a physical page number. This page number is combined with the page offset to produce a physical address.

32-bit Address Translation

32-bit address translation is done the same way as 64-bit translation. Unlike on POWER, there is no bit in the processor hardware that indicates whether the hardware is working in 32-bit or 64-bit mode.

Translation Lookaside Buffer

The Translation Lookaside Buffer (TLB) is a cache of recently used address translations. It holds recently used Page Table Entries (PTEs), each pairing a virtual address with its corresponding physical address.
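The translation steps in this section can be walked through with a small software model. Several things here are assumptions made for the sketch: a fixed 4 KB page size, a Python dictionary standing in for the TLB and page tables, a list of eight RIDs standing in for the region registers, and the names translate and PageFault.

```python
PAGE_BITS = 12  # assumption for this sketch: 4 KB pages


class PageFault(Exception):
    """Raised when no translation exists; the OS would resolve the fault."""


def translate(ea, region_registers, tlb):
    """Model steps 1-6: effective address in, physical address out."""
    # Step 1: split the effective address into VRN, VPN, and page offset.
    vrn = (ea >> 61) & 0x7
    vpn = (ea & ((1 << 61) - 1)) >> PAGE_BITS
    offset = ea & ((1 << PAGE_BITS) - 1)
    # Steps 2-3: the VRN selects a region register, which supplies the RID.
    rid = region_registers[vrn]
    # Step 4: search for a translation keyed by (region ID, VPN).
    try:
        ppn = tlb[(rid, vpn)]
    except KeyError:
        # Step 5: no match, so control would transfer to the OS.
        raise PageFault(f"no translation for rid={rid:#x} vpn={vpn:#x}") from None
    # Step 6: combine the physical page number with the page offset.
    return (ppn << PAGE_BITS) | offset
```

With RID 0x42 in rr0 and a single entry mapping (0x42, 0x5) to physical page 0x99, the effective address 0x5123 translates to 0x99123, while any address on an unmapped page raises PageFault.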


Single vs. Multiple Address Space

Introduction

The IA-64 architecture supports either a single or a multiple address space model. Both models are described in this section.

Single Address Space (SAS)

In a single address space model, all processes on the system share a single address space. Such a model is possible because of the enormous size of a 64-bit address space compared with a 32-bit one. The term single address space refers to the use of shared regions containing objects mapped at a unique global address. For such a mapping, a common region ID and page number are provided.

Multiple Address Space (MAS)

In this model each process has a private address space. Not all of the 8 regions can be used by a process, because the operating system must be mapped into one or more of the regions. Each process-private region has a unique RID associated with it.

Address Space on IA-64

The address space model used by AIX on IA-64 combines attributes of both MAS (multiple address space) and SAS (single address space). Region 0 is defined by the operating system to be a process-private region. Each process is assigned a unique RID for that region, which is loaded into the region register each time the process is dispatched. Region 0 therefore provides what is effectively a MAS model.

All other regions are treated as shared address space (SAS). The region IDs for those regions are constant and do not need to be changed at context switch. SAS usage is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address.

The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes use the same RID and, in the case of “shared-by-some”, by whether they have access to specific protection keys.
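The combined model can be pictured as a register file in which only rr0 changes at dispatch. The sketch below is a deliberate simplification (for instance, it ignores the case where region 1 is treated as MAS during debugging); the class names and RID values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    private_rid: int  # hypothetical unique RID for this process's region 0


class RegionRegisters:
    """Model of rr0 - rr7: rr0 is per-process, rr1 - rr7 hold constant RIDs."""

    def __init__(self, shared_rids):
        if len(shared_rids) != 7:
            raise ValueError("expected RIDs for regions 1-7")
        self.rr = [0] + list(shared_rids)

    def dispatch(self, process):
        # Only the private region's RID is reloaded at context switch.
        self.rr[0] = process.private_rid
```

Dispatching process A and then process B changes rr[0] from A's RID to B's RID, while rr[1] through rr[7] keep the same values, so both processes reach shared objects through the same translations.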


AIX 5L Region Usage

Introduction

The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation: to share the same address translation, the same RID must be used. For a process to share a memory region with another process (or the kernel), the same RID must be loaded in the region register in the context of both processes.

Region Usage Table

The following table shows the kernel usage model for the 8 virtual regions:

VRN  Style    Name     Usage
0    MAS      Private  process data, stack, heap, mmap, ILP32 shared library text, ILP32 main text, u-block, kernel thread stacks/msts
1    SAS/MAS  Text     LP64 shared library text, LP64 main text
2    SAS               LP64 shmat
3    SAS               LP64 shmat
4    n/a               reserved
5    SAS      Temp     kernel temporary attach, global buffer pool
6    SAS      Kernel2  kernel global w/large page size
7    SAS      Kernel   kernel global


Region Usage Details

Region usage is detailed here:

• Region 0 is the process private region. Only the running process has access to its own private region.

• Region 1 is dedicated to mappings of LP64 executable text. This includes globally shared text such as shared libraries and shared-by-some text such as the main text of a program. This region is SAS under normal circumstances and is MAS when the process is being debugged.

• Regions 2-3 are the primary residence of shared non-text mappings, which include user mappings via shmat.

• Region 4 is reserved for future use.

• Region 5 is dedicated to support of kernel temporary attach. In AIX 5L the temporary attach mechanism has been adapted to promote the SAS model.

• Regions 6-7 contain kernel global mappings.
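When examining addresses in a debugger, it can help to have the region list above in a form that can be indexed by the top three address bits. The following lookup is a convenience sketch; the dictionary text is a paraphrase of the list above, and the function name describe_region is hypothetical.

```python
REGION_USAGE = {
    0: ("MAS", "process private region"),
    1: ("SAS/MAS", "LP64 executable text"),
    2: ("SAS", "shared non-text mappings (shmat)"),
    3: ("SAS", "shared non-text mappings (shmat)"),
    4: ("n/a", "reserved for future use"),
    5: ("SAS", "kernel temporary attach"),
    6: ("SAS", "kernel global mappings"),
    7: ("SAS", "kernel global mappings"),
}


def describe_region(ea):
    """Return (region number, style, usage) for a 64-bit effective address."""
    vrn = (ea >> 61) & 0x7
    style, usage = REGION_USAGE[vrn]
    return vrn, style, usage
```

For instance, an address beginning with hex digit E (binary 111) belongs to region 7, and an address whose top hex digit is 0 or 1 belongs to region 0.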

ILP32

The address space of a 32-bit program (using the ILP32 programming model) extends from 0 to 4 GB and is contained entirely in region 0.

Private segment

Providing process data, heap, and stack, as well as per-process kernel information such as the u-block, in a single private segment means that only that segment needs to be copied across a fork (for example, with copy-on-write semantics).


Memory Protection

Introduction

The IA-64 architecture provides two methods for applying protection to a page:

• Access rights for each translation.

• Protection keys

Protection Keys

Protection keys are used to control which processes have access to individual objects in the single address space, achieving a “shared-by-some” semantic such as exists for shmat objects.

A bit in the hardware enables the mechanism: when this bit is set (1), memory references go through protection key access checks during address translation.

The hardware also provides protection key registers (at least 16). The VMM manages their contents and keeps track of each entry.

Protection key register fields

Protection key register fields:

field   usage
v       Valid bit. When 1, the register contains a valid key.
wd      Write disable. When 1, write permission is denied.
rd      Read disable. When 1, read permission is denied.
xd      Execute disable. When 1, execute permission is denied.
key     Protection key (18-24 bits).


Process

The process of memory access using protection keys is described in this table.

Step  Action
1     During an address translation by the hardware, a protection key is identified for the page being translated.
2     The protection key of the translation is checked against the protection keys found in the protection key registers (stored by the OS).
3     If a match is found, the protection rights of the matching register are applied to the translation. The access is allowed or denied according to those rights.
4     If the access is not allowed, a protection key permission fault is raised and control goes to the VMM.
5     If no match is found in step 2, a protection key miss fault is raised and the VMM inserts the correct protection key into the protection key registers.


Protection Key Example

An example of protection key usage is described in this illustration and table.

[Figure: A shared object in the virtual address space, mapped into the address spaces of process A and process B.]

Step  Action
1     A shared object is assigned the protection key 0x1.
2     Processes A and B share the object with the following permissions: process A has read/write access to the object; process B has read-only access to the object.
3     When A is running, the VMM loads a protection key register with key 0x1 and with the ‘wd’ and ‘rd’ bits cleared. The process can read and write all pages in the object.
4     When B is running, the VMM loads a protection key register with key 0x1, with the ‘rd’ bit cleared and the ‘wd’ bit set. The process can only read pages in the object.
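The protection key check and the example above can be modeled in a few lines. This is a sketch of the logic only, with hypothetical names (Pkr, check_access, and the two fault classes); it does not model the register bit layout or the VMM's fault handlers.

```python
from dataclasses import dataclass


class KeyMissFault(Exception):
    """No valid protection key register holds the page's key."""


class KeyPermissionFault(Exception):
    """A matching register was found but it disables the requested access."""


@dataclass
class Pkr:
    key: int
    v: int = 1   # valid bit
    wd: int = 0  # write disable
    rd: int = 0  # read disable
    xd: int = 0  # execute disable


def check_access(page_key, access, pkrs):
    """Check 'read', 'write', or 'execute' against the loaded key registers."""
    for pkr in pkrs:
        if pkr.v and pkr.key == page_key:
            denied = {"read": pkr.rd, "write": pkr.wd, "execute": pkr.xd}[access]
            if denied:
                raise KeyPermissionFault(f"{access} denied for key {page_key:#x}")
            return True
    raise KeyMissFault(f"key {page_key:#x} not loaded")


# Register contents while each process runs, following the example table.
PKRS_WHILE_A_RUNS = [Pkr(key=0x1, wd=0, rd=0)]
PKRS_WHILE_B_RUNS = [Pkr(key=0x1, wd=1, rd=0)]
```

With these register contents, a write to the shared object succeeds while A is running and raises KeyPermissionFault while B is running; a reference to a page whose key is not loaded raises KeyMissFault in either case.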


Access Rights

In addition to the protection key mechanism, the IA-64 architecture provides page protection by associating access and privilege level information with each translation. However, the majority of the page access rights support in AIX 5L is in the common code base shared with POWER. The software mechanisms for dealing with page protection were therefore left as is, so the upper layers conform to the POWER access rights mechanisms. These consist of:

• Per-segment K bits.

• POWER-style per-page protection bits.

At the low, platform-dependent layer, these POWER-style protections are translated into the corresponding IA-64 hardware information.


LP64 Address Space

Introduction

Segments and segment services are used for management of objects on both POWER and IA-64.

Segments on IA-64

The segment model was originally developed with the POWER hardware architecture in mind. On POWER, a segment can be thought of as a hardware object: the segment is selected directly by the hardware’s translation of a virtual address. As we have seen, the IA-64 hardware addresses memory by regions. A region is a much larger area of the virtual address space than a segment. On IA-64 the software manages segments on top of the region model; a segment on IA-64 is therefore a software object, not a hardware one.

The user space segment model on IA64 is shown in this table:

ESID (hex) Name

0000_0000_0-0000_0000_F Low 4GB Reserved

0000_0001_0-0000_0001_F Aliased Main Text

0000_0002_0-0000_0002_F Private Dynamically Loaded Text

0000_0003_0-0000_0003_F Private Data, BSS

0000_0100_0-0001_FFFF_F Private Heap

0002_0000_0-0002_FFFF_F Default Mmap, Aliased Shmat

0003_0000_0-0003_FEFF_F User Stack

0003_FF00_0-0003_FF00_2 Kernel reserved

0003_FF00_2-0003_FF00_3 Process Private Segment

0003_FF00_3-0003_FF0F_F Kernel Thread Segments

0003_FF10_0-0003_FFFF_F Kernel reserved

2000_0001_0-2000_0001_F LP64 Shared Library Text

2000_0100_0-2003_FFFF_F Main Text

4000_0001_0-4003_FFFF_F Global Shmat (normal page size)

6000_0001_0-6003_FFFF_F Global Shmat (superpage)
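The ESID values in this table have nine hex digits (36 bits), which is what remains of a 64-bit address once a 28-bit offset is removed. The sketch below therefore assumes 256 MB segments on IA-64, as on POWER; that assumption, and the helper names, are not stated in the table itself.

```python
SEGMENT_SHIFT = 28  # assumption: 256 MB segments, leaving a 36-bit ESID
MASK64 = (1 << 64) - 1


def esid_of(ea):
    """Effective segment ID: the address with the 28-bit offset removed."""
    return (ea & MASK64) >> SEGMENT_SHIFT


def format_esid(esid):
    """Render an ESID in the table's notation, for example 2000_0001_0."""
    digits = f"{esid:09X}"
    return f"{digits[:4]}_{digits[4:8]}_{digits[8:]}"


def region_of(ea):
    """The region is still selected by the upper 3 bits of the address."""
    return (ea >> 61) & 0x7
```

Under this assumption the address 0x2000000100000000 has ESID 2000_0001_0 (the start of LP64 Shared Library Text) and lies in region 1, and the ESID ranges beginning with 4000 and 6000 lie in regions 2 and 3, which agrees with the region usage described earlier in this unit.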


ILP32 Address Space

Introduction

The layout of the 4 GB ILP32 address space is principally the same as that for POWER 32-bit applications. The motivations for preserving this layout on IA-64 are compatibility and performance.

This table details the segment usage for the ILP32 model:

ESID   Name        Example Uses
0      n/a         Not used
1      Text        Main text
2      Private     main+libc data, stack, heap, u-block, kernel stack
3-12   n/a         shmat, mmap
13     Shlib Text  Shared library text
14     n/a         shmat, mmap
15     Shlib Data  Post-exec data, private text

Big Data Model

A big data model is supported for 32-bit applications on POWER. It allows an application to specify maximum requirements for heap, data, and stack. Such a model is required for programs that exceed the limits imposed by the normal 32-bit address space layout (i.e. a single shared 256 MB segment for heap, data, and stack). This model will also be supported for 32-bit applications on IA-64 in future releases.
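The ILP32 segment table in this section divides the 4 GB address space into sixteen 256 MB segments, so the ESID is simply the top four bits of a 32-bit address. The lookup below sketches that relationship; the function name ilp32_segment and the sample addresses are illustrative.

```python
ILP32_SEGMENTS = (
    [("n/a", "Not used"),
     ("Text", "Main text"),
     ("Private", "main+libc data, stack, heap, u-block, kernel stack")]
    + [("n/a", "shmat, mmap")] * 10          # ESIDs 3-12
    + [("Shlib Text", "Shared library text"),
       ("n/a", "shmat, mmap"),
       ("Shlib Data", "Post-exec data, private text")]
)


def ilp32_segment(ea32):
    """Return (ESID, name, example uses) for a 32-bit effective address."""
    esid = (ea32 & 0xFFFF_FFFF) >> 28   # 256 MB segments: top 4 bits
    name, uses = ILP32_SEGMENTS[esid]
    return esid, name, uses
```

For example, a 32-bit address of the form 0x2xxxxxxx falls in ESID 2 (Private), and 0xDxxxxxxx falls in ESID 13 (Shlib Text).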


Exercise

Introduction

Complete the following written exercise and the lab exercise that follows.

Test yourself

Complete the following questions.

1. The effective address size for a 64-bit process is?

A. 32 bits
B. 64 bits
C. 85 bits

2. The virtual address size on the IA-64 platform is?

A. 32 bits
B. 64 bits
C. 85 bits

3. One of eight region registers is used for each address translation. How is the region register selected?

4. A 64-bit process running on AIX 5L on IA-64 hardware has a private region of memory that is located in what region?


Lab

Follow the instructions in this table to complete this lab.

Step  Action
1     Log on to your IA-64 lab system.
2     su to root and start the iadb utility:

      $ su
      # iadb

3     Display the thread structure for the current context using the command:

      0> th

      The thread structure displayed will be the thread for the running iadb process.
4     Look for the field labeled t_procp; it contains a pointer to the proc structure. Examine this address. What region is this address in?
5     Look for the field labeled userp; it contains a pointer to the thread's user area. Examine this address. What region is this address in?
6     Of the two addresses you examined, which one is in the process’s private region?


Unit 10. IA-64 Linkage Convention


Unit 11. LVM

Lesson Objectives

At the end of this module the student should be able to:

• Give an overview of the LVM and identify LVM components such as:

  • Logical volume

  • Physical volume

  • Mirroring, and the parameters for mirroring

  • Striping, and the parameters for striping

• Describe the physical disk layout on POWER

• Describe the physical disk layout on IA-64

• Describe the LVM physical layout, including the VGDA and VGSA

• Explain the function of LVM Passive Mirror Write Consistency

• Explain the function of the LVM hot spare disk

• Explain the function of LVM hot spot management

• Explain the function of LVM online backup (4.3.3)

• Explain the function of the LVM variable logical track group (LTG)

• Explain the function of each of the high-level LVM commands

• Trace LVM commands with the trace command

• Explain the function of the LVM library calls

• Describe briefly the disk device calls

• Describe briefly the low-level disk device calls, such as SCSI and SSA calls

A further objective is for the student to gain experience through exercises on the content of this section. The exercises will:

• Examine the physical disk layout of a logical volume and a physical volume.

• Examine the impact of LVM Passive Mirror Write Consistency.

• Examine the function of the LVM LTG.

• Trace some LVM system activity.

Platform

This lesson is independent of platform.


References

• Technology Transfer Home Page: http://w3.austin.ibm.com/:/projects/tteduc/


Logical Volume Manager overview

Introduction

The Logical Volume Manager (LVM) is the layer between the operating system (AIX) and the physical hard drives. The LVM provides reliable data storage (logical volumes) to the OS. It makes use of the underlying physical storage but hides the actual physical drives and drive layout. This section explains how this is done, how the data can be traced, and which parameters affect performance in different scenarios.

Physical volume

A hierarchy of structures is used to manage fixed-disk storage. Each individual fixed-disk drive, called a physical volume (PV), has a name, such as /dev/hdisk0. Every physical volume in use belongs to a volume group (VG). All of the physical volumes in a volume group are divided into physical partitions (PPs) of the same size (by default 2 MB in volume groups that include physical volumes smaller than 300 MB, 4 MB otherwise). For space-allocation purposes, each physical volume is divided into five regions (outer_edge, inner_edge, outer_middle, inner_middle, and center). The number of physical partitions in each region varies, depending on the total capacity of the disk drive.

Within each volume group, one or more logical volumes (LVs) are defined.

Logical volume

Logical volumes are groups of information located on physical volumes. Data on logical volumes appears to be contiguous to the user but can be discontiguous on the physical volume. This allows file systems, paging space, and other logical volumes to be resized or relocated, span multiple physical volumes, and have their contents replicated for greater flexibility and availability in the storage of data.

Each logical volume consists of one or more logical partitions (LPs). Each logical partition corresponds to at least one physical partition. If mirroring is specified for the logical volume, additional physical partitions are allocated to store the additional copies of each logical partition.


Physical disks

A disk must be designated as a physical volume and be put into an available state before AIX can assign it to a volume group. A physical volume has certain configuration and identification information written on it. This information includes a physical volume identifier and, for IA-64, partition information for the disk. When a disk becomes a physical volume, it is divided into 512-byte physical blocks.

The first time you start up the system after connecting a new disk, AIX detects the disk and examines it to see if it already has a unique physical volume identifier in its boot record. If it does, the disk is designated as a physical volume, and a physical volume name (typically hdiskx, where x is a unique number on the system) is permanently associated with that disk until you undefine it.

Volume groups

A physical volume must become part of a volume group before it can be used by LVM. A volume group is a collection of 1 to 32 physical volumes of varying sizes and types. A physical volume may belong to only one volume group. By default, the system allows you to define up to 256 logical volumes per volume group, but the actual number you can define depends on the total amount of physical storage defined for that volume group and the size of the logical volumes you define.

There can be up to 255 volume groups per system.

A VG that is created with standard physical and logical volume limits can be converted to big format, which can hold up to 128 PVs and up to 512 LVs. This operation requires that there be enough free partitions on every PV in the VG for the expansion of the Volume Group Descriptor Area (VGDA).

Logical Storage Management

Volume groups        255 per system
Physical volumes     (MAXPVS / volume group factor) per volume group
Physical partitions  (1016 x volume group factor) per physical volume; partition size = 1, 2, 4, 8, 16, 32, 64, 128, or 256 MB
Logical volumes      MAXLVS per volume group
Logical partitions   (MAXPVS * 1016) per logical volume

MAXPVS: 32 (128 for a big VG)
MAXLVS: 255 (512 for a big VG)


Physical partitions (PP)

In the design of LVM, each logical partition maps to one physical partition, and each physical partition maps to a number of disk sectors. The design of LVM limits the number of physical partitions that LVM can track per disk to 1016. In most cases, a disk does not use all 1016 possible tracking partitions. The default size of each physical partition during a "mkvg" command is 4 MB, which implies that individual disks of up to 4 GB (1016 x 4 MB) can be included in a volume group.

If a disk larger than 4 GB is added to a volume group (based on the 4 MB physical partition size), the disk addition will fail with a warning message that the physical partition size needs to be increased. There are two instances where this limitation is enforced. The first case is when the user tries to use "mkvg" to create a volume group where the number of physical partitions on one of the disks in the volume group would exceed 1016. In this case, the user must pick from the available physical partition sizes of 1, 2, (4), 8, 16, 32, 64, 128, and 256 megabytes and use the "-s" option to "mkvg". The second case is where the disk which violates the 1016 limitation is attempting to join a pre-existing volume group with the "extendvg" command. The user can either recreate the volume group with a larger physical partition size (which will allow the new disk to work with the 1016 limitation) or create a stand-alone volume group (with a larger physical partition size) for the new disks.
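The arithmetic behind the 1016 limit can be sketched in a few lines of Python. This is only an illustration of the rule described above, not LVM code; the function names are made up for the example, and disk sizes are taken in whole megabytes.

```python
# Physical partition sizes accepted by "mkvg -s", in MB (from the text above).
PP_SIZES_MB = [1, 2, 4, 8, 16, 32, 64, 128, 256]
MAX_PPS_PER_PV = 1016  # LVM tracking limit per disk


def max_disk_mb(pp_size_mb):
    """Largest disk (in MB) that stays within the 1016-partition limit."""
    return MAX_PPS_PER_PV * pp_size_mb


def smallest_pp_size_mb(disk_mb):
    """Smallest partition size that keeps the disk at or under 1016 PPs.

    Returns None if even the largest partition size is too small.
    """
    for size in PP_SIZES_MB:
        if disk_mb <= max_disk_mb(size):
            return size
    return None


if __name__ == "__main__":
    print(max_disk_mb(4))             # 4064 MB, roughly 4 GB
    print(smallest_pp_size_mb(9100))  # a 9100 MB disk needs 16 MB partitions
```

With the default 4 MB partition the limit works out to 4064 MB, which is the "up to 4 GB" figure quoted above.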


Device drivers, hierarchy and interface to LVM devices

The figure shows the interfaces to the LVM at the different layers. Starting from the top, the file system (JFS or J2) uses the LVM device driver (LVMDD) API to access logical volumes. The LVMDD uses the disk device driver to access the physical disk, which is handled by the SCSI device driver or the SSA device driver, depending on the type of disk. There are also interfaces and commands for manipulating the LVM system. The high-level commands, such as mklv, are complex commands written as shell scripts. These scripts use basic LVM commands, such as lcreatelv, which are AIX binaries, to perform the operations. The basic commands are written in C and use the LVM API library, liblvm.a, to access the LVM.

[Figure: LVM interfaces by layer. Labels: JFS, LVM DD, Disk DD, SCSI DD, SSA DD, high-level commands, commands, liblvm.a]


VGDA description

The VGDA is an area at the front of each disk which contains information about the volume group, the logical volumes that reside on the volume group and disks that make up the volume group. For each disk in a volume group, there exists a VGDA concerning that volume group. This VGDA area is also used in quorum voting.

The VGDA contains information about what other disks make up the volume group. This information is what allows the user to specify just one of the disks in the volume group when using the "importvg" command to import a volume group into an AIX system. The importvg command goes to that disk, reads the VGDA, finds out what other disks (by PVID) make up the volume group, and automatically imports those disks into the system. The information about neighboring disks can sometimes be useful in data recovery. The VGDA also holds information about the logical volumes that exist on that disk, so any time the status of a logical volume changes (creation, extension, or deletion), the VGDA on that disk and on the others in the volume group must be updated.

The VGDA space, which allows for 32 disks, is a fixed size that is part of the LVM design. Large disks require more management mapping space in the VGDA, which reduces the number and size of the disks that can be added to the existing volume group. When a disk is added to a volume group, not only does the new disk get a copy of the updated VGDA but, as mentioned before, all existing drives in the volume group must be able to accept the new, updated VGDA.

VGSA description

The Volume Group Status Area (VGSA) records information on stale partitions for mirroring.

The VGSA consists of 127 bytes. Each of its 1016 bits (127 x 8) represents one of the up to 1016 physical partitions that reside on the disk. The bits of the VGSA are used as a quick bit-mask to determine which physical partitions, if any, have become stale. This is only important in the case of mirroring, where more than one copy of the physical partition exists. Stale partitions are flagged by the VGSA. Unlike the VGDAs, the VGSAs are specific to the drives on which they exist. They do not contain information about the status of partitions on other drives in the same volume group. The VGSA is also used to determine which physical partitions must undergo data resyncing when mirror copy resolution is performed.
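A bit-mask of this kind can be sketched in Python as follows. This is a minimal illustration of one bit per physical partition, not the on-disk VGSA layout: the class name, the zero-based partition numbering and the bit order within each byte are all assumptions made for the example.

```python
VGSA_BYTES = 127
MAX_PPS = VGSA_BYTES * 8  # 1016 physical partitions per disk


class StaleMap:
    """One bit per physical partition; a set bit means the partition is stale."""

    def __init__(self):
        self.bits = bytearray(VGSA_BYTES)

    def _locate(self, pp):
        # Assumed numbering: partitions 0..1015, lowest bit of each byte first.
        if not 0 <= pp < MAX_PPS:
            raise IndexError("physical partition number out of range")
        return divmod(pp, 8)  # (byte index, bit index within that byte)

    def mark_stale(self, pp):
        byte, bit = self._locate(pp)
        self.bits[byte] |= 1 << bit

    def mark_fresh(self, pp):
        byte, bit = self._locate(pp)
        self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_stale(self, pp):
        byte, bit = self._locate(pp)
        return bool(self.bits[byte] & (1 << bit))

    def stale_partitions(self):
        """Partitions that would need resynchronizing."""
        return [pp for pp in range(MAX_PPS) if self.is_stale(pp)]
```

Because the map lists individual partitions, a resync only has to touch the partitions returned by stale_partitions() rather than the whole disk.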


BIG VGDA Volume Group Design (BigVG) implemented in AIX 4.3.2

The original design of the VGDA and VGSA limits the number of disks that can be added to a volume group to 32, and the total number of logical volumes to 256 (including one reserved for LVM internal use). With the proliferation of disk arrays, the need for increased capacity in a single volume group is growing.

This section describes the requirements for a new big Volume Group Descriptor Area and Volume Group Status Area, hereafter referred to as VGDA and VGSA.

Objectives

• Increase maximum number of disks per VG from 32 to 128

• Increase maximum number of logical volumes per VG to 512

• Provide migration path for small VG to big VG

Changes in commands:

• mkvg

• -B option is added to create big VGs.

• -t If the t flag (factor value) is not used, the default limit of 1016 physical partitions per physical volume is set. Using the factor value changes the limit on physical partitions per disk to 1016 * factor and the total number of disks per VG to MAXPVS / factor (32 / factor for a standard VG, 128 / factor for a big VG). A big VG cannot be imported into or activated on systems running versions earlier than AIX 4.3.2.

• chvg

• -B option added to convert a small VG to the big VG format. This operation expands the VGDA/VGSA to change the total number of disks that can be added to the volume group from 32 to 128. Once converted, these volume groups cannot be imported into or activated on systems running versions earlier than AIX 4.3.2. If both the t and B flags are specified, the factor is updated first and then the VG is converted to the big VG format (sequential operation).
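The trade-off made by the factor value can be sketched as below. This is a small illustration of the two formulas (1016 * factor partitions per disk, MAXPVS / factor disks per volume group), not the mkvg or chvg implementation; the function name is hypothetical and the use of integer division is an assumption.

```python
BASE_PPS_PER_PV = 1016


def limits_with_factor(max_pvs, factor):
    """Return (partitions per disk, disks per volume group) for a -t factor.

    max_pvs is the MAXPVS value of the volume group format
    (32 for a standard VG, 128 for a big VG).
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # Integer division is an assumption made for this sketch.
    return BASE_PPS_PER_PV * factor, max_pvs // factor


if __name__ == "__main__":
    print(limits_with_factor(32, 1))   # (1016, 32)
    print(limits_with_factor(32, 2))   # (2032, 16)
    print(limits_with_factor(128, 4))  # (4064, 32)
```

Each doubling of the factor doubles the number of partitions that can be tracked on one disk and halves the number of disks the volume group can hold.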


LVM Flexibility

LVM offers great flexibility for the system administrator and users, such as:

• Real-time Volume Group and Logical Volume expansion/deletion

• Ability to customize data integrity check

• Use of Logical Volume under file system

• Use of Logical Volume as raw data storage

• User customized logical volumes

Real-time Volume Group and Logical Volume expansion / deletion

Typical UNIX operating systems have static file systems that require the archiving, deletion, and recreation of larger file systems in order for an existing file system to expand. LVM allows the user to add disks to the system without bringing the system down and allows the real-time expansion of the file system through the use of the logical volume. All file systems exist on top of logical volumes. However, logical volumes can exist without the presence of a file system. When a file system is created, the system first creates a logical volume, then places the journaled file system (jfs) "layer" on top of that logical volume. When a file system is expanded, the logical volume associated with that file system is first "grown", then the jfs is "stretched" to match the grown logical volume.

Ability to customize data integrity checks

The user has the ability to control which levels of data integrity checks are placed in the LVM code in order to tune the system performance. The user can change the mirror write consistency check, create mirroring, and change the requirement for quorum in a volume group.

Use of Logical Volume under a file system

The logical volume is a logical to physical entity which allows the mapping of data. The jfs maps files defined in its file system in its own logical way and then translates file actions to a logical request. This logical request is sent to the LVM device driver which converts this logical request into a physical request. When the LVM device driver sends this physical request to the disk device driver, it is further translated into another physical mapping. At this level, LVM does not care about where the data is truly located on the disk platter. But with this logical to physical abstraction, LVM provides for the easy expansion of a file system, ease in mirroring data for a file system, and the performance improvement of file access in certain LVM configurations.


Use of Logical Volumes as raw data storage

As stated before, a logical volume can hold data without a jfs file system on top of it. Typically, database programs use the "raw" logical volume as a data "device" or "disk". They use LVM logical volumes (rather than the raw disk itself) because LVM allows them to control on which disks the data resides, gives them the flexibility to add disks and "grow" the logical volume, and provides data integrity through the logical volume mirroring capability.

User customized logical volumes

The user can create logical volumes, using a map file, that will allow them to specify the exact disk(s) the logical volume will inhabit and the exact order on the disk(s) that the logical volume will be created in. This ability allows the user to tune the creation of their logical volumes for performance cases.

Write Verify LVM setting

LVM lets you specify that an extra level of data integrity be assured every time data is written to the disk. This capability is known as write verify, and it can be set for each logical volume in a volume group. When write verify is enabled, every write to a physical portion of a disk that is part of the logical volume causes the disk device driver to issue the Write and Verify SCSI command to the disk. This means that after each write, the disk rereads the data and does an IOCC parity check on it to see whether what the platter wrote exactly matches what the write request buffer contained. This extra check understandably adds to the time it takes to complete a write request, but it also adds to the integrity of the system.


Quorum checking for LVM volume groups

Quorum checking is the voting that goes on between disks in a volume group to see if a majority of disks exist to form a quorum that will allow the disks in a volume group to become and stay activated. LVM runs many of its commands and strategies based on having the most current copy of some data. Thus, it needs a method to compare data on two or more disks and figure out which one contains the most current information. This is what gives rise to the need for a quorum. If a quorum cannot be found during a varyonvg command, the volume group will not vary on. Additionally, if a disk dies during normal operation and the loss of the disk causes volume group quorum to be lost, the volume group notifies the user that it is ceasing to allow any more disk i/o to the remaining disks and enforces this by performing a self varyoffvg. However, the user can turn off this quorum check and its actions by telling LVM that the volume group should always vary on or stay up regardless of the dependability of the system. Or, the user can force the varyon of a volume group that doesn't have quorum. At this point, the user is responsible for any strange behavior from that volume group.
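The majority rule can be sketched in Python as below. This is a deliberately simplified model that counts whole disks, as the description above does; the function and parameter names are hypothetical, and the rule that at least one disk must respond when the check is bypassed is an assumption of the sketch, not documented LVM behavior.

```python
def has_quorum(available_disks, total_disks):
    """True when more than half of the disks in the volume group respond."""
    return available_disks * 2 > total_disks


def may_vary_on(available_disks, total_disks, quorum_checking=True, force=False):
    """Decide whether the volume group may be activated.

    quorum_checking=False models turning the quorum check off;
    force=True models a forced varyon by the user.
    """
    if force or not quorum_checking:
        # Assumption for this sketch: at least one disk must still respond.
        return available_disks > 0
    return has_quorum(available_disks, total_disks)


if __name__ == "__main__":
    print(may_vary_on(2, 3))              # True: two of three disks is a majority
    print(may_vary_on(1, 3))              # False: quorum lost
    print(may_vary_on(1, 3, force=True))  # True: the user takes responsibility
```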


Data Integrity and LVM Mirroring

Mirroring, and parameters for mirroring

When discussing mirrors in LVM, it is easier to refer to each copy, regardless of when it was created, as a copy. The exception to this is Sequential mirroring. In Sequential mirroring, there is a distinct PRIMARY copy and SECONDARY copies. However, the majority of mirrors created on AIX systems are of the Parallel type. In Parallel mode, there is no PRIMARY or SECONDARY mirror. All copies in a mirrored set are just referred to as copies, regardless of which one was created first. Since the user can remove any copy from any disk, at any time, there can be no ordering of copies.

AIX allows up to three copies of a logical volume, and the copies may be in sequential or parallel arrangements. Mirrors improve the data integrity of a system by providing more than one source of identical data. With multiple copies of a logical volume, if one copy cannot provide the data, one or two secondary copies may be accessed to provide the desired data.

Staleness of Mirrors

The idea of a mirror is to provide an alternate, physical copy of information. If one of the copies has become unavailable, usually due to disk failure, then we refer to that copy of the mirror as going "stale". Staleness is determined by the LVM device driver when a request to the disk device driver returns with a certain type of error. When this occurs, the LVM device driver notifies the VGSA of a disk that a particular physical partition on that disk is stale. This information will prevent further reads or writes from being issued to physical partitions defined as stale by the VGSA of that disk. Additionally, when the disk once again becomes available (suppose it had been turned off accidentally), the synchronization code knows exactly which physical partitions must be updated, instead of defaulting to the update of the entire disk. Certain high-level commands will display the physical partitions and their stale condition so that the user can see which disks may be experiencing a physical failure.


Sequential Mirroring

Sequential vs. Parallel mirror, and What good is Sequential Mirroring?

Sequential mirroring is based on the concept of an order within mirrors. All read and write requests first go through a PRIMARY copy which services the request. If the request is a write, then the write request is propagated sequentially to the SECONDARY drives. Once the secondary drives have serviced the same write request, then the LVM device driver will consider the write request complete.

Parallel Mirroring

In Parallel mirroring, all copies are of equal ordering. Thus, when a read request arrives to the LVM, there is no first or favorite copy that is accessed for the read. A search is done on the request queues for the drives which contain the mirror physical partition that is required. The drive that has the fewest requests is picked as the disk drive which will service the read request. On write requests, the LVM driver will broadcast to all drives which have a copy of the physical partition that needs updating. Only when all write requests return will the write be considered complete and the write-complete message will be returned to the calling program.
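The difference in write completion between the two modes can be illustrated with a very simplified timing model. This is a sketch under stated assumptions, not a measurement: it assumes each copy has a fixed write latency, that sequential mirroring writes one copy after another, and that parallel mirroring starts all writes at the same moment. The function names are made up for the example.

```python
def sequential_write_time(latencies):
    """Primary first, then each secondary in turn: the delays add up."""
    return sum(latencies)


def parallel_write_time(latencies):
    """All copies are written at once: done when the slowest copy returns."""
    return max(latencies)


if __name__ == "__main__":
    copies_ms = [8, 10, 9]  # hypothetical per-disk write latencies
    print(sequential_write_time(copies_ms))  # 27
    print(parallel_write_time(copies_ms))    # 10
```

In both modes the write is only reported complete once every copy has been written; the model differs only in whether the waits overlap.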

[Figure: write requests (Write req) sent to Disk 1, Disk 2 and Disk 3, and the write acknowledgements (Write ack) returned]


Mirror Write Consistency Check

Mirror Write Consistency Check (MWCC) is a method of tracking the last 62 writes to a mirrored logical volume. If the AIX system crashes, upon reboot the last 62 writes to mirrors are examined and one of the mirrors is used as a "source" to synchronize the mirrors (based on the last 62 disk locations that were written). This "source" is of importance to parallel mirrored systems. In sequentially mirrored systems, the "source" is always picked to be the Primary disk. If that disk fails to respond, the next disk in the sequential ordering will be picked as the "source" copy. There is a chance that the mirror picked as "source" to correct the other mirrors was not the one that received the latest write before the system crashed. Thus, a write that had completed on one copy but not on another could be lost.

AIX does not guarantee that the absolute, latest write request completed before a crash will be there after the system reboots. But, AIX will guarantee that the parallel mirrors will be consistent with each other. If the mirrors are consistent with each other, then the user will be able to realize which writes were considered successful before the system crashed and which writes will be retried. The point here is not data accuracy, but data consistency. The use of the Primary mirror copy as the source disk is the basic reason that sequential mirroring is offered. Not only is data consistency guaranteed with MWCC, but the use of the Primary mirror as the source disk increases the chance that all the copies have the latest write that occurred before the mirrored system crashed.
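The idea of remembering the most recent write locations and replaying them from one "source" copy can be sketched as follows. This is an illustrative model only, not the MWCC implementation: mirrors are represented as plain dictionaries mapping a location to its data, and the class and function names are hypothetical.

```python
from collections import OrderedDict

MWCC_ENTRIES = 62


class WriteTracker:
    """Remembers the most recent distinct write locations, oldest first."""

    def __init__(self, capacity=MWCC_ENTRIES):
        self.capacity = capacity
        self.recent = OrderedDict()

    def record_write(self, location):
        # Re-writing a tracked location just moves it to the newest slot.
        self.recent.pop(location, None)
        self.recent[location] = True
        if len(self.recent) > self.capacity:
            self.recent.popitem(last=False)  # forget the oldest location

    def locations_to_resync(self):
        return list(self.recent)


def resync_after_crash(tracker, source, others):
    """Make every tracked location on the other mirrors match the source."""
    for location in tracker.locations_to_resync():
        for mirror in others:
            if location in source:
                mirror[location] = source[location]
            else:
                mirror.pop(location, None)
```

Note that the result is consistency, not recency: if the source copy missed the last write, the other copies are rolled back to match it, which is the behavior described above.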

Ability to detect stale mirror copies and correct

The Volume Group Status Area (VGSA) tracks the status of 1016 physical partitions per disk per volume group. During a read or write, if the LVM device driver detects that there was a failure in fulfilling a request, the VGSA will note the physical partition(s) that failed and mark that partition(s) "stale". When a partition is marked stale, this is logged by AIX error logging and the LVM device driver will know not to send further partition data requests to that stale partition. This saves wasted time in sending i/o requests to a partition that most likely will not respond. And when this physical problem is corrected, the VGSA will tell the mirror synchronization code which partitions need to be updated to have the mirrors contain the same data.


LVM Striping

Striping and parameters for striping

Disk striping is the concept of spreading sequential data across more than one disk to improve disk i/o. The theory is that if you have data that is stored close together, and you can divide the request into more than one disk i/o, you will reduce the time it takes to get the entire piece of data. This must be done so that it is transparent to the user. The user doesn't know which pieces of the data reside on which disk and does not see the data until all the disk i/o has completed (in the case of a read) and the data has been reassembled for the user. Since LVM has the concept of a logical to physical mapping already built into its design, the concept of disk striping is an easy evolution. Striping is described by the "width" of a stripe and the "stripe length". The width is how many disks the sequential data should lie across. The stripe length is how many sequential bytes reside on one disk before the data jumps to another disk to continue the sequential information path.
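The width and stripe length together determine where each byte lands. The following Python sketch maps a logical byte offset to a disk and an offset on that disk, and splits one request into per-disk pieces. It assumes a plain round-robin layout across the disks, which is an assumption of this example rather than a description of the LVM code; the function names and the values used are illustrative.

```python
def locate(offset, width, stripe_length):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe_number = offset // stripe_length   # which stripe unit overall
    disk = stripe_number % width              # stripe units rotate over disks
    row = stripe_number // width              # full rotations completed so far
    return disk, row * stripe_length + offset % stripe_length


def split_request(start, length, width, stripe_length):
    """Break one logical request into per-disk pieces: (disk, disk offset, size)."""
    pieces = []
    end = start + length
    while start < end:
        disk, disk_offset = locate(start, width, stripe_length)
        size = min(stripe_length - start % stripe_length, end - start)
        pieces.append((disk, disk_offset, size))
        start += size
    return pieces


if __name__ == "__main__":
    # 100 bytes over 4 disks, 25 bytes per disk before moving on
    for piece in split_request(0, 100, width=4, stripe_length=25):
        print(piece)
```

Each piece can be issued to its own disk at the same time, which is where the reduction in elapsed time comes from.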


Striping Example

We present an example to show the benefit of striping. A piece of data stored on the disk is 100 bytes long. The physical cache of the system is only 25 bytes. Thus, it takes four read requests to the same disk to read all 100 bytes:

hdisk0: First read - bytes 0-24
hdisk0: Second read - bytes 25-49
hdisk0: Third read - bytes 50-74
hdisk0: Fourth read - bytes 75-99

As you can see, since the data is on the same disk, four sequential reads are required.

If this logical volume were created with a stripe width of 4 (how many disks) and a stripe size of 25 (how many consecutive bytes before going to the next disk), then you would see:

hdisk0: First read - bytes 0-24
hdisk1: Second read - bytes 25-49
hdisk2: Third read - bytes 50-74
hdisk3: Fourth read - bytes 75-99

As you can see, each disk requires only one read request and the time to gather all 100 bytes has been reduced 4-fold. However, there is still a bottleneck in having the four independent data disks channel through one adapter card. This can be remedied with the expensive option of having each disk on an independent adapter card. Note the effect of using striping: the user has now lost the usage of 3 disks that could have been used for other volume groups.


LVM Performance

Performance with disk mirroring

Disk mirroring can improve the read performance of a system, but at a cost to the write performance. Of the two mirroring strategies, parallel and sequential, parallel is the better of the two in terms of disk i/o. In parallel mirroring, when a read request is received, the LVM device driver looks at the queued requests (read and write) and finds the disk with the least number of requests waiting to execute. This is a change from AIX 3.2, where a complex algorithm tried to approximate the disk that would be "closest" to the required data (regardless of how many jobs it had queued up). In AIX 4.1, it was decided that this complex algorithm did not significantly improve the i/o behavior of mirroring and so the complex logic was scrapped. The user can see how this new strategy of finding the shortest wait line would improve the read time. And with mirroring, two independent requests to two different locations can be issued at the same time without causing disk contention, because the requests will be issued to two independent disks. However, the price of the improved read performance that comes from having multiple identical sources is that the LVM device driver must now perform more writes in order to complete a write request. With mirroring, all disks that make up a mirror are issued write commands, which each disk must complete before the LVM device driver considers the write request complete.
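The "shortest wait line" strategy for reads can be sketched in a few lines of Python. This is an illustration of the selection rule only; the function names are hypothetical, the queue lengths are made-up values, and the tie-breaking rule (first disk listed wins) is an assumption of the sketch.

```python
def pick_read_disk(queue_lengths):
    """Return the disk holding a copy that has the fewest queued requests.

    queue_lengths maps a disk name to its number of waiting requests
    (reads and writes together). Ties go to the first disk listed.
    """
    return min(queue_lengths, key=queue_lengths.get)


def disks_for_write(queue_lengths):
    """A write must go to every disk that holds a copy."""
    return list(queue_lengths)


if __name__ == "__main__":
    queues = {"hdisk1": 5, "hdisk2": 2, "hdisk3": 7}  # hypothetical load
    print(pick_read_disk(queues))   # hdisk2
    print(disks_for_write(queues))  # ['hdisk1', 'hdisk2', 'hdisk3']
```

The asymmetry is visible in the two functions: a read touches one disk, chosen by load, while a write always touches all of them.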


Changeable parameters that affect LVM performance

There are a few parameters that the user can change per logical volume which will affect the performance of the logical volume in terms of data access efficiency.

From experience however, many people have different views of how to achieve that efficiently, so there can’t be a specific "right" recommendation given in these notes.

Inter-policy - This comes in two variations, min and max. The two choices tell LVM how the user wishes the logical volume to be spread over the disks in the volume group. With min, the logical volume should be spread over as few disks as possible. The max policy directs LVM to spread the logical volume over as many disks as are defined in the volume group, limited by the "Upper Bound" variable. Some users try to use this variation to form a cheap version of disk striping on systems below AIX 4.1. However, it must be stated that the Inter-policy is a "recommendation" to the allocp binary (partition allocation routine), not a strict requirement. In certain cases, depending on what is free on a disk, these allocation policies may not be achievable.
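The contrast between min and max can be illustrated with a toy allocator. This is not the allocp algorithm, whose details are not given here; it is a hypothetical sketch in which min fills one disk before moving to the next and max deals partitions out one at a time across the disks, capped by an upper bound. Unlike the real allocator, which treats the policy as a recommendation, the sketch simply fails when the request cannot be met.

```python
def allocate(partitions_needed, free_per_disk, policy="min", upper_bound=None):
    """Return {disk: partitions placed} for a hypothetical allocation.

    free_per_disk maps disk name -> free physical partitions.
    policy "min" fills one disk before touching the next;
    policy "max" deals partitions round-robin over up to upper_bound disks.
    """
    disks = list(free_per_disk)
    if policy == "max" and upper_bound is not None:
        disks = disks[:upper_bound]
    free = {d: free_per_disk[d] for d in disks}
    placed = {d: 0 for d in disks}
    remaining = partitions_needed

    if policy == "min":
        for d in disks:
            take = min(free[d], remaining)
            placed[d] = take
            remaining -= take
    else:
        # Keep dealing one partition per disk while any disk has room.
        while remaining > 0 and any(free[d] > placed[d] for d in disks):
            for d in disks:
                if remaining > 0 and free[d] > placed[d]:
                    placed[d] += 1
                    remaining -= 1

    if remaining > 0:
        raise ValueError("not enough free partitions for this policy")
    return {d: n for d, n in placed.items() if n > 0}


if __name__ == "__main__":
    free = {"hdisk1": 8, "hdisk2": 8, "hdisk3": 8}
    print(allocate(10, free, "min"))                 # {'hdisk1': 8, 'hdisk2': 2}
    print(allocate(10, free, "max"))                 # {'hdisk1': 4, 'hdisk2': 3, 'hdisk3': 3}
    print(allocate(10, free, "max", upper_bound=2))  # {'hdisk1': 5, 'hdisk2': 5}
```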

Intra-policy - There are five regions on a disk platter defined by the intra-policy: edge, inner-edge, middle, inner-middle, and center. This policy tells LVM the preferred location of the logical volume on the disk platter. Depending on the value also provided for inter-policy, this preference may or may not be satisfied by LVM. Many users have different ideas as to which portion of the disk is considered the "best", so no recommendation is given in these notes.

Mirror write consistency check - As mentioned before, the mirror write consistency check tracks the last 62 distinct writes to physical partitions. If the user turns this off, they will shorten (although only slightly) the path length involved in a disk write. However, the trade-off may be inconsistent mirrors if the system crashes during a write call.

Write verify - This by default is turned off by LVM when a logical volume is created. If this value is turned on for a logical volume, additional time during writes will be accumulated as the IOCC check is performed for each write to the disk platter.


Physical Connections

Mirroring on different disks - By default, the copies of a mirrored logical volume are placed on different disks. This is for performance as well as data integrity. With copies residing on different disks, if one disk is extremely busy, a read request can be completed by the copy residing on a less busy disk. Although it might seem that the cost of writes would be the same either way, the section "Command tag queuing" shows that writing two copies to the same disk is worse than writing two copies to separate disks.

Mirroring across different adapters - Another method to improve disk throughput is to mirror the copies across adapters. This gives you a better chance of finding not only a copy on a less busy disk, but also an adapter that is less busy. LVM neither knows nor cares whether the two disks reside on the same adapter. If the copies are on the same adapter, the data still has to get through the flow of data from the other devices sharing that adapter card, and that remains the bottleneck. With multiple adapters, the throughput through the adapter channel should improve.

Command tag queuing - This feature is found only on SCSI-2 devices. In SCSI-1, an adapter may receive many requests but sends out only one command at a time. Thus, if the SCSI device driver receives three requests for I/O, it buffers the last two until the first one has completed. It then picks the next one in line and issues that command, so the target device receives only one command at a time. With command tag queuing on SCSI-2 devices, multiple commands may be sent to the same device at once. The two device drivers (disk and SCSI adapter) are capable of determining which command has returned and what to do with it. Thus, disk I/O throughput can be improved.

Physical Placement of Logical Partitions

One important ability of LVM is that it lets the user dictate where on the disk platter a logical volume is placed. This is done with a map file, which can be passed to the "mklv" and "mklvcopy" commands. The map file lets the user assign a distinct physical partition number to each logical partition number. Thus, people with different theories about the optimal layout of data partitions can customize their systems according to their personal preferences.

Performance considerations with Disk Striping

Disk striping was introduced in AIX 4.1. It is a software implementation of RAID 0. The functionality is based on the assumption that large amounts of data can be retrieved more efficiently if a request is broken up into smaller requests given to multiple disks. If the multiple disks are on multiple adapters, the theory works even better, as described in the previous sections on mirroring across different disks and adapters. Those sections describe the efficiency gained for mirrors. With striping, the same efficiency is gained by spreading data across disks and adapters, but without mirroring. Thus there is a saving in the write case compared to mirrors. However, there is a slight loss in the read case compared to mirrors, because there is no second copy to read from if one disk is busier than another.

Performance summary

To sum up the previously mentioned ideas about mirroring: if a system is used mainly for reads, mirroring gives an advantage because there is more than one copy of the same data available to satisfy a read request. The drawback is that if the system performs as many writes as reads, each write must wait for the writes to all copies to complete before the single write command is considered complete.

Additionally, there are two types of mirroring, parallel and sequential. Parallel is the more efficient of the two and is the default unless the user specifies otherwise. In parallel mirroring, the "best" disk is chosen for each read request, and write requests are issued independently to each disk that holds a copy of the data. In sequential mirroring, the same disk is always read first. Thus, all reads are guaranteed to be issued to the "primary" disk (there is no "primary" in parallel mirroring), and the writes to the copies must complete in sequential order before the write is considered complete.
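The difference between the two write policies can be illustrated with a small model. The following is a minimal sketch, not LVM code: the function names and the idea of representing each mirror copy by a single latency figure are assumptions made for illustration. It treats a parallel write as complete when the slowest copy finishes, and a sequential write as complete when the copies have been written one after another.

```c
#include <stddef.h>

/* Hypothetical model: latency[i] is the time in milliseconds that the disk
 * holding mirror copy i needs to finish its write. */

/* Parallel policy: all copies are written at once, so the logical write
 * is complete when the slowest copy has finished. */
double parallel_write_ms(const double *latency, size_t copies)
{
    double worst = 0.0;
    for (size_t i = 0; i < copies; i++) {
        if (latency[i] > worst)
            worst = latency[i];
    }
    return worst;
}

/* Sequential policy: each copy is written only after the previous one
 * has completed, so the latencies add up. */
double sequential_write_ms(const double *latency, size_t copies)
{
    double total = 0.0;
    for (size_t i = 0; i < copies; i++)
        total += latency[i];
    return total;
}
```

With two copies that take 8 ms and 12 ms, this model gives 12 ms for the parallel policy and 20 ms for the sequential policy.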

Physical disk layout Power

AIX 4.3.3 and AIX 5 IDs

This section explores the physical disk layout on the Power platform.

There are three identifiers commonly used within LVM: the Physical Volume Identifier (PVID), the Volume Group Identifier (VGID), and the Logical Volume Identifier (LVID). The last two, VGID and LVID, are closely tied. The LVID is simply the VGID with a dot "." and a minor number appended. The VGID is a combination of the machine's unique processor serial number (uname -a) and the date on which the volume group was created.

The LVM implementation has always assumed that the VGID is made up of two 32-bit words. Throughout the code, however, the VGID/LVID is represented by the system data type struct unique_id, which is made up of four 32-bit words. The LVM library, driver, and commands have always assumed or enforced the notion that the last two words of this structure, word 3 and word 4, are zero.

AIX 5 changes this so that all four 32-bit words are used, for a total of 128 bits or 32 hex digits. The most significant 32 bits are copied from the processor ID, and the remaining 96 bits are the millisecond time stamp at creation time.
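Because an LVID is just the VGID with "." and a minor number appended, the two parts can be separated at the last dot. The sketch below shows one way to do this in C. split_lvid is a hypothetical helper, not an LVM library call, and it assumes the minor number is written in decimal.

```c
#include <stdlib.h>
#include <string.h>

/* Split "VGID.minor" into its two parts (hypothetical helper).
 * vgid receives the VGID text and must have room for vgid_size bytes,
 * including the terminating NUL.
 * Returns the minor number, or -1 if the string is not of that form. */
int split_lvid(const char *lvid, char *vgid, size_t vgid_size)
{
    const char *dot = strrchr(lvid, '.');
    if (dot == NULL || dot == lvid || dot[1] == '\0')
        return -1;

    size_t len = (size_t)(dot - lvid);
    if (len + 1 > vgid_size)
        return -1;
    memcpy(vgid, lvid, len);
    vgid[len] = '\0';

    char *end;
    long minor = strtol(dot + 1, &end, 10);   /* assumption: decimal minor */
    if (*end != '\0' || minor < 0)
        return -1;
    return (int)minor;
}
```

The same function works for the 16-digit VGIDs of AIX 4.3.3 and the 32-digit VGIDs of AIX 5, because it only looks for the dot.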

[Figure: byte layout of the IDs. AIX 4.3.3: the PVID and the VGID are 8 bytes each (Byte8 to Byte1), and the LVID is the VGID followed by "." and the minor number X. AIX 5: the PVID and the VGID are 16 bytes each (Byte16 to Byte1), and the LVID is the VGID followed by "." and the minor number.]

Example IDs from AIX 4 and AIX 5L systems that show how IDs are constructed from the processor ID

In AIX 5 the processor ID is 64 bits wide. The uname function cuts out bits 33 to 47, so that its result is the first word followed by the last 16 bits of the last word.

The LVID and the VGID each combine the 64-bit processor ID with a 64-bit time stamp. PVIDs are made of the 32-bit processor ID and bits from the time stamp.

Example from AIX 5 Power system

PVID hdisk0: 00071483229d06620000000000000000

PVID hdisk1: 00071483b50bbaee0000000000000000

LVID hd1: 0007148300004c00000000e19f7c5aa3.8

LVID hd2: 0007148300004c00000000e19f7c5aa3.5

LVID hd3: 0007148300004c00000000e19f7c5aa3.7

LVID hd4: 0007148300004c00000000e19f7c5aa3.4

VGID rootvg: 0007148300004c00000000e19f7c5aa3

VGID testvg: 0007148300004c00000000e1b50bc8ec

uname -a: 000714834C00

On an AIX 4 system, all the IDs are formed from the most significant 32 bits of the processor ID and a 32-bit time stamp.

Example from AIX 4.3.3 Power system

PVID hdisk0: 0009027724fdbd9f

PVID hdisk1: 0009027779fe61c6

LVID hd1: 0009027724fdc36d.8

LVID hd2: 0009027724fdc36d.5

LVID hd3: 0009027724fdc36d.7

LVID hd4: 0009027724fdc36d.4

VGID rootvg: 0009027724fdc36d

VGID datavg: 000902771db64c28

uname -a: 000902774C00
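The relationship between the processor ID and the IDs in the listings above can be checked with a few shifts and masks. This sketch is illustrative only: the function names are hypothetical, and it assumes that the 64-bit processor ID of the AIX 5 example is 0x0007148300004C00, the leading 64 bits of its VGID.

```c
#include <stdint.h>
#include <stdio.h>

/* Keep the first 32-bit word and the last 16 bits of a 64-bit processor ID,
 * which is how the text describes the uname result. */
uint64_t uname_id_value(uint64_t processor_id)
{
    uint64_t first_word = processor_id >> 32;
    uint64_t last_16    = processor_id & 0xFFFFu;
    return (first_word << 16) | last_16;
}

/* Format that value as 12 uppercase hex digits; out needs 13 bytes. */
void uname_id_string(uint64_t processor_id, char out[13])
{
    snprintf(out, 13, "%012llX",
             (unsigned long long)uname_id_value(processor_id));
}

/* AIX 4 style 64-bit ID: the upper word comes from the processor ID,
 * the lower word from the time stamp. */
uint32_t id_processor_word(uint64_t id) { return (uint32_t)(id >> 32); }
uint32_t id_timestamp_word(uint64_t id) { return (uint32_t)(id & 0xFFFFFFFFu); }
```

For the assumed processor ID 0x0007148300004C00 the string is 000714834C00, the uname value shown for the AIX 5 system. Splitting the AIX 4.3.3 rootvg VGID 0009027724fdc36d gives the processor word 00090277 and the time stamp word 24fdc36d.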

Physical volume, with a logical volume testlv defined

The following example shows a disk dump starting at sector 0 of a disk on a Power system. "Uninitialized" marks data not written by the LVM; runs of rows holding only 00s or uninitialized data are cut out for clarity. The IDs are those listed in the previous section.

000000 ¦ C9 C2 D4 C1 00 00 00 00 00 00 00 00 00 00 00 00

000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

000070 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

000080 ¦ 00 07 14 83 B5 0B BA EE 00 00 00 00 00 00 00 00 - PVID hdisk1

000090 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0001F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

000200 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

000400 ¦ 39 C7 F2 9F 14 87 93 46 00 00 00 00 00 00 00 00

000410 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0005E0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

(From here on, ?? marks a byte that is not legible in this copy of the dump, and rows with no legible content are left out.)

000DF0 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

000E00 ¦ 5F 4C 56 4D 00 07 14 83 00 00 4C 00 00 00 00 E1 - "_LVM", struct lvm_rec defined in lvmrec.h

000E10 ¦ B5 0B C8 EC 00 00 10 74 00 00 08 32 00 00 00 88 - VGID testvg

000E20 ¦ 00 00 08 C2 00 86 7C 2D 00 00 01 00 00 01 00 18

000E30 ¦ 00 00 00 08 00 00 00 80 00 00 08 BA ?? ?? ?? ??

000E40 ¦ ?? ?? ?? ?? -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

001000 ¦ 44 45 46 45 43 54 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? - DEFECT

008BF0 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

008C00 ¦ 5F 4C 56 4D 00 07 14 83 00 00 4C 00 00 00 00 E1 - "_LVM", struct lvm_rec

008C10 ¦ B5 0B C8 EC 00 00 10 74 00 00 08 32 00 00 00 88 - VGID testvg

008C20 ¦ 00 00 08 C2 00 86 7C 2D 00 00 01 00 00 01 00 18

008C30 ¦ 00 00 00 08 00 00 00 80 00 00 08 BA ?? ?? ?? ??

008C40 ¦ ?? ?? ?? ?? -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

008DF0 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

008E00 ¦ 44 45 46 45 43 54 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? - DEFECT

00FFF0 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

010000 ¦ ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? - The VGSA

011000 ¦ ?? ?? ?? ?? ?? ?? ?? ?? 00 07 14 83 00 00 4C 00 - The VGDA

011010 ¦ 00 00 00 E1 B5 0B C8 EC ?? ?? ?? ?? ?? ?? ?? ??

(The dump rows that follow the start of the VGDA are not legible in this copy. They include the PVID of hdisk1 and the second copies of the VGSA and the VGDA.)

21A5F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......

21A600 ¦ 74 65 73 74 6C 76 00 00 00 00 00 00 00 00 00 00 ¦testlv

21A610 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......

21E800 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized

(The rows showing the logical volume control block, LVCB, of testlv are not legible in this copy. The text that can still be read in the ASCII column is "AIX LVCB", "jfs", "testlv", two time stamps beginning "Tue Sep", and "None". The block ends with the bytes DE AD BE EF and is followed by uninitialized data.)
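The first rows of the dump show two things that are easy to check in a program: the EBCDIC string "IBMA" (C9 C2 D4 C1) at offset 0 and the PVID of hdisk1 at offset 0x80. The following sketch does that for a 512-byte buffer holding sector 0. It is illustrative only: the function names are hypothetical, and the offsets are simply the ones seen in this dump.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IPL_MAGIC_OFFSET 0x00   /* "IBMA" in EBCDIC, as seen in the dump */
#define PVID_OFFSET      0x80   /* position of the PVID in the dump */
#define PVID_LEN         16

static const uint8_t IBMA_EBCDIC[4] = { 0xC9, 0xC2, 0xD4, 0xC1 };

/* Returns 1 if the sector starts with the EBCDIC "IBMA" signature. */
int has_ipl_signature(const uint8_t *sector0)
{
    return memcmp(sector0 + IPL_MAGIC_OFFSET, IBMA_EBCDIC, 4) == 0;
}

/* Writes the 16 PVID bytes as 32 lowercase hex digits plus NUL. */
void format_pvid(const uint8_t *sector0, char out[33])
{
    for (int i = 0; i < PVID_LEN; i++)
        snprintf(out + 2 * i, 3, "%02x",
                 (unsigned)sector0[PVID_OFFSET + i]);
}
```

Applied to the sector shown at the start of the dump, format_pvid produces 00071483b50bbaee0000000000000000, the PVID listed for hdisk1 in the previous section.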

lvm_rec structure from file /usr/include/lvmrec.h

The structure lvm_rec is used by the LVM routines to define the disk layout.

struct lvm_rec
/* structure which describes the physical volume LVM record */
{
    __long32_t lvm_id;
    /* LVM id field which identifies whether the PV is a member of a volume group */

#define LVM_LVMID 0x5F4C564D /* LVM id field of ASCII "_LVM" */

    struct unique_id vg_id;
    /* the id of the volume group to which this physical volume belongs */

    __long32_t lvmarea_len;
    /* the length of the LVM reserved area */

    __long32_t vgda_len;
    /* length of the volume group descriptor area */

    daddr32_t vgda_psn [2];
    /* the physical sector numbers of the beginning of the volume
       group descriptor area copies on this disk */

    daddr32_t reloc_psn;
    /* the physical sector number of the beginning of a pool of
       blocks (located at the end of the PV) which are reserved for
       the relocation of bad blocks */

    __long32_t reloc_len;
    /* the length in number of sectors of the pool of bad block relocation blocks */

    short int pv_num;
    /* the physical volume number within the volume group of this physical volume */

    short int pp_size;
    /* the size in bytes for the partition, expressed as a power of
       2 (i.e., the partition size is 2 to the power pp_size) */

    __long32_t vgsa_len;
    /* length of the volume group status area */

    daddr32_t vgsa_psn [2];
    /* the physical sector numbers of the beginning of the volume
       group status area copies on this disk */

    short int version;
    /* the version number of this volume group descriptor and status area */

    short int vg_type;

    int ltg_shift;

    char res1 [444]; /* reserved area */
};

Using the string "_LVM", we can locate the above structure in the previous disk dump and assign values to the members of struct lvm_rec.

Variable                     Value
#define LVM_LVMID            0x5F4C564D
struct unique_id vg_id;      0007148300004C00000000E1B50BC8EC
__long32_t lvmarea_len;      00001074
__long32_t vgda_len;         00000832
daddr32_t vgda_psn [2];      00000088 000008C2
daddr32_t reloc_psn;         00867C2D
__long32_t reloc_len;        00000100
short int pv_num;            0001
short int pp_size;           0018
__long32_t vgsa_len;         00000008
daddr32_t vgsa_psn [2];      00000080 000008BA
int ltg_shift;               0001
char res1 [444];             Uninitialized
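Some of these values are easier to interpret after a little arithmetic. The sketch below is illustrative and not part of the LVM sources: the helper names are hypothetical, and it assumes 512-byte sectors and the big-endian byte order that the Power dump shows for the "_LVM" id. It converts physical sector numbers to byte offsets and the pp_size exponent to a partition size.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u          /* assumption: 512-byte physical sectors */
#define LVM_LVMID   0x5F4C564Du   /* ASCII "_LVM", from lvmrec.h */

/* Read a 32-bit big-endian value from a byte buffer. */
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 if the buffer starts with the "_LVM" id. */
int is_lvm_rec(const uint8_t *rec)
{
    return be32(rec) == LVM_LVMID;
}

/* Byte offset on the disk of a physical sector number. */
uint64_t psn_to_byte_offset(uint32_t psn)
{
    return (uint64_t)psn * SECTOR_SIZE;
}

/* Partition size in bytes: 2 to the power pp_size. */
uint64_t pp_size_bytes(unsigned pp_size)
{
    return (uint64_t)1 << pp_size;
}
```

For the values in the table, vgsa_psn[0] = 0x80 corresponds to byte offset 0x10000, vgda_psn[0] = 0x88 to byte offset 0x11000, and pp_size = 0x18 to a partition size of 2 to the power 24 bytes, or 16 MB.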

VGSA structure

struct vgsa_area {
#ifdef _KERNEL
    struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
    struct timestruc_t b_tmstamp;
#endif
    /* Bit per PV */
    uint pv_missing[(MAXPVS + (NBPI - 1)) / NBPI];
    /* Stale PP bits */
    uchar stalepp[MAXPVS][VGSA_BT_PV];
    short factor;                     /* for pvs with > 1016 pps */
    char pad2[10];                    /* Padding */
#ifdef _KERNEL
    struct timestruc32_t e_tmstamp;   /* Ending time stamp */
#else
    struct timestruc_t e_tmstamp;
#endif
};

struct big_vgsa_area {
#ifdef _KERNEL
    struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
    struct timestruc_t b_tmstamp;
#endif
    char b_tmbuf64bit[24];
    /* Bit per PV */
    uint pv_missing[(MAX_EVER_PV + (NBPI - 1)) / NBPI];
    /* Stale PP bits */
    uchar stalepp[MAX_EVER_PV][VGSA_BT_PV];
    short factor;                     /* for pvs with > 1016 pps */
    short version;                    /* vgsa version */
    char valid[4];                    /* Validity string "LVM" */
    char pad2[824];                   /* Padding */
    char e_tmbuf64bit[24];
#ifdef _KERNEL
    struct timestruc32_t e_tmstamp;   /* Ending time stamp */
#else
    struct timestruc_t e_tmstamp;
#endif
};
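The pv_missing array holds one bit per physical volume, so its length is the number of PVs divided by the number of bits per int, rounded up. The following sketch shows that rounding and one way to set and test a bit. It is not taken from the LVM sources: the number of bits per int is assumed to be 32, the helper names are hypothetical, and the bit order within a word (least significant bit first) is an assumption as well.

```c
#define NBPI_ASSUMED 32u   /* assumption: number of bits per int */

/* Number of unsigned ints needed to hold one bit per PV, rounded up.
 * This mirrors the expression (MAXPVS + (NBPI - 1)) / NBPI. */
unsigned words_for_pvs(unsigned max_pvs)
{
    return (max_pvs + (NBPI_ASSUMED - 1)) / NBPI_ASSUMED;
}

/* Set the bit for physical volume pv (assumed bit order: LSB first). */
void mark_pv_missing(unsigned *pv_missing, unsigned pv)
{
    pv_missing[pv / NBPI_ASSUMED] |= 1u << (pv % NBPI_ASSUMED);
}

/* Returns 1 if the bit for physical volume pv is set, 0 otherwise. */
int pv_is_missing(const unsigned *pv_missing, unsigned pv)
{
    return (int)((pv_missing[pv / NBPI_ASSUMED] >> (pv % NBPI_ASSUMED)) & 1u);
}
```

Under these assumptions, 32 PVs fit in one word, 33 PVs need two words, and 128 PVs need four.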

Physical disk layout IA-64

Introduction to AIX 5L on IA-64 and EFI partitioned disks

IA-64 systems have a different design from Power systems. Some, if not all, IA-64 systems will use the Extensible Firmware Interface (EFI). EFI defines a new disk partitioning scheme to replace the legacy DOS partitioning support.

When booting from a disk device, the EFI firmware uses one or more system partitions containing an EFI file system (FAT32) to locate EFI applications and drivers, including the OS boot loader. These applications and drivers provide ways to extend the firmware or to assist the operating system during boot time or run time. In addition, operating systems are expected to define partitions unique to the operating system. EFI applications will also be able to display, and potentially create, additional partitions before the OS is booted.

AIX traditionally has not supported partitioned disks, because AIX was the only OS running on RS/6000 systems. The entire disk is therefore defined by an hdisk ODM object and a /dev/hdiskn special file, with a single major and minor number assigned to the physical disk. In AIX 4.3.3, when a disk becomes a physical volume (that is, gets a PVID), an old-style MBR (master boot record), renamed the IPL control block and containing the PVID, is written into the first sector of the disk.

The overall design for disk partitioning in AIX 5L on IA-64 is to introduce partitioning at the disk driver level. An hdisk ODM object still refers to the physical disk, but multiple special files are created and associated with the partitions on the disk. Besides the EFI system partitions, the AIX 5L on IA-64 disk configure method recognizes IA-64 physical volume partitions.

AIX 5L on IA-64 supports a maximum of four partitions. One of these can be a physical volume partition; the others are EFI system partitions. Therefore only one AIX PV, and hence one volume group, can be defined per physical disk.

A new command, efdisk, acts as the partition manager.

Special files will be created for the following partition types:

• Entire physical disk n (used by efdisk): /dev/hdiskn_all

• System partition index y on physical disk n: /dev/hdiskn_sy

• Physical volume partition on physical disk n: /dev/hdiskn

• Unknown partition index x on physical disk n: /dev/hdiskn_px
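The naming scheme in the list above can be written down as a small function. This is only a sketch of the naming convention: the enum and the function name are hypothetical and do not correspond to an AIX interface.

```c
#include <stdio.h>

/* Partition kinds from the list above (hypothetical enum). */
enum part_kind {
    PART_WHOLE_DISK,       /* entire physical disk, used by efdisk */
    PART_SYSTEM,           /* EFI system partition                 */
    PART_PHYSICAL_VOLUME,  /* AIX physical volume partition        */
    PART_UNKNOWN           /* unknown partition type               */
};

/* Build the special file name for a partition on physical disk `disk`.
 * `index` is the partition index; it is ignored for the whole disk and
 * for the physical volume partition.
 * Returns the snprintf result, or -1 for an unrecognized kind. */
int partition_special_file(char *buf, size_t size, enum part_kind kind,
                           int disk, int index)
{
    switch (kind) {
    case PART_WHOLE_DISK:
        return snprintf(buf, size, "/dev/hdisk%d_all", disk);
    case PART_SYSTEM:
        return snprintf(buf, size, "/dev/hdisk%d_s%d", disk, index);
    case PART_PHYSICAL_VOLUME:
        return snprintf(buf, size, "/dev/hdisk%d", disk);
    case PART_UNKNOWN:
        return snprintf(buf, size, "/dev/hdisk%d_p%d", disk, index);
    }
    return -1;
}
```

For disk 0 with system partition index 0, this yields /dev/hdisk0_all, /dev/hdisk0_s0 and /dev/hdisk0.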

Creating new partitions on an IA-64 system

AIX 5L on IA-64 will partition disks under the following circumstances:

• Under the direction of the user/administrator via the efdisk command.

• During bos install after the designation of a "boot" disk (install targets)

• When adding a disk that is not yet a physical volume to a VG

• Under the direction of the "chdev -l hdiskx -a pv=yes" command

The disk system after a default installation

After installing AIX 5L on a system with one disk, the physical drive and the /dev special files can be listed.

lsdev -Cc disk

hdisk0 Available 00-19-10 Other IDE Disk Drive

/dev/hdisk0 - hdisk0, AIX 5L PV

/dev/hdisk0_all - The entire disk starting at block 0

/dev/hdisk0_s0 - EFI System partition 0 at disk 0

The EFI system partition holds hardware information and EFI firmware data. The partition is DOS formatted and can be accessed through the DOS utilities, as in the following example.

5L-IA64:/tmp> dosdir -D/dev/hdisk0_s0

A.OUT

BOOT.EFI

Free space: 33155072 bytes

Creating partitions with efdisk

After creating four partitions, we can list the starting block number and length of each with the efdisk command.

------------------------------------------------------

Partition Index: 0

Partition Type: Physical Volume

StartingLBA: 1 (0x1)

Number of blocks: 819200 blocks (0xc8000)

Partition Index: 1

Partition Type: System Partition

StartingLBA: 819201 (0xc8001)

Number of blocks: 409600 blocks (0x64000)

Partition Index: 2

Partition Type: System Partition

StartingLBA: 1228801 (0x12c001)

Number of blocks: 614400 blocks (0x96000)

Partition Index: 3

Partition Type: System Partition

StartingLBA: 1843201 (0x1c2001)

Number of blocks: 614400 blocks (0x96000)

Disk layout at IA-64 systems

The following disk dump lists the data in hex format. The six leftmost digits are the byte offset from the physical start of the disk, and each line lists 16 bytes. The data was read on an IBM Power system with the same utility as in the previous examples; where byte swapping is mentioned, it is relative to what the data would have been on a disk connected to an AIX Power system.

000000 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped

000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0001C0 ¦ FF FF 09 FF FF FF 01 00 00 00 00 80 0C 00 00 FF -start LBA = 0x1

length = 0xc8000

0001D0 ¦ FF FF EF FF FF FF 01 80 0C 00 00 40 06 00 00 FF -start LBA = 0x0c8001

length = 0x064000

0001E0 ¦ FF FF EF FF FF FF 01 C0 12 00 00 60 09 00 00 FF -start LBA = 0x12c001

length = 0x096000

0001F0 ¦ FF FF EF FF FF FF 01 20 1C 00 00 60 09 00 55 AA -start LBA = 0x1c2001

length = 0x096000

000200 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped

000210 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 - 0x200 = start LBA 1 = first partition

(The remaining rows of this dump are not legible in this copy apart from their annotations, which read as follows.)

• The lvm_rec row begins 4D 56 4C 5F: "MVL_" = "_LVM" byte swapped. The lvm_rec struct is offset by 0x200 from the lvm struct on Power; data in the partition is placed as on PVs.

• DEFECT: the defect list, at an offset compared to Power (the offset value is not legible).

• VGSA: time stamp, followed later by the end VGSA time stamp.

• VGDA: start time stamp.

• VGID for ia64vg.

For reference information the PVID, LVID and VGID are listed below.

$,;�LD�������LD������������������������������������

���������������������������������������������������

39,'�KGLVN���������������F�����FDHH����EHD��G������

/9,'�OY��������������������F��������H�EI�HF��������

/9,'�OY��������������������F��������H�EI�HF��������

9*,'�LD��YJ����������������F��������H�EI�HF��������

LVM Passive Mirror Write Consistency

AIX 5L Passive Mirror Write Consistency

The previous Mirror Write Consistency Check (MWCC) algorithm has been in place since AIX 3.1. This original design has served the Logical Volume Manager (LVM) well, but it has always slowed the performance of mirrored logical volumes that perform massive and varied writes. AIX 5 implements a new design to supplement the original MWCC design.

AIX 4 MWCC algorithm

The AIX 4 MWCC method uses a table called the mwc table. This table is kept in memory as well as on the disk platter. The table has 62 entries, which together track the last 62 distinct Logical Track Group (LTG) writes. An LTG is 128 kilobytes. The mwc table is only concerned with writes, not reads. The algorithm can be expressed in pseudo-code:

if (action is a write)
{
    if (LTG to be written is already in the mwc table array in memory)
    {
        proceed and issue the write to the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
    else
    {
        update the mwc table (in memory) with this latest LTG number,
            overwriting the oldest LTG entry in the mwc table
        write the memory mwc table to the edge of the platter of all
            disks in the volume group
        wait for the mwc table writes to complete - when the mwc table
            write of the disk that holds the LTG in question returns,
            this is considered write complete of the mwc table
        issue the parallel mirror writes to all the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
}
else
    process the read
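The cost of the algorithm is the number of mwc table writes it triggers. The following Python sketch models only that bookkeeping: it is not the LVM driver code, the class and attribute names are invented for illustration, and "oldest entry" is assumed to mean the entry that was inserted first.

```python
from collections import OrderedDict

LTG_SIZE = 128 * 1024   # bytes per Logical Track Group
MWC_ENTRIES = 62        # number of entries in the mwc table


class MwcTable:
    """Toy model of the in-memory mwc table; counts table writes to disk."""

    def __init__(self, entries=MWC_ENTRIES):
        self.entries = entries
        self.table = OrderedDict()   # LTG number -> True, oldest first
        self.table_writes = 0        # how often the table went to the platter

    def write(self, byte_offset):
        """Record a write; return True if it forced an mwc table write."""
        ltg = byte_offset // LTG_SIZE
        if ltg in self.table:
            return False                      # hit: mirror writes only
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)    # overwrite the oldest entry
        self.table[ltg] = True
        self.table_writes += 1                # extra write before the mirror writes
        return True


table = MwcTable()
table.write(0)        # first write to LTG 0: table write
table.write(4096)     # same LTG: no table write
print(table.table_writes)
```

Writes that stay within the 62 most recently recorded LTGs cost nothing extra, while a workload that touches more than 62 distinct LTGs in rotation pays one table write per data write.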

MWCC usage for recovery

The reason for having mwcc is recovery from a crash while I/O is proceeding on a mirrored logical volume. By implication, mwcc is ignored for non-mirrored logical volumes. A key phrase is data "in flight": a write has been issued to a disk, but the disk has not yet returned confirmation that the write is complete. Thus, there is no certainty that the data did in fact get written to the disk. mwcc tracks the last 62 distinct LTGs written so that, upon reboot, the table can be used to rewrite them. It is more than likely that all the writes finished before the system crash; however, LVM goes to each of the 62 distinct LTGs, reads one copy of the mirror, and writes it to the other mirror(s) that exist. Note that mwcc does not guarantee that the absolute latest write is made available to the user. mwcc only guarantees that the images on the mirrors are consistent (identical).
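The recovery pass can be pictured with a small Python sketch. Mirrors are modelled as dictionaries from LTG number to contents, and the first mirror is arbitrarily taken as the copy that is read; both choices, and the function name, are assumptions made for illustration.

```python
def resync_from_mwc_table(mwc_table, mirrors):
    """For each LTG in the table, copy one mirror's contents to the others."""
    source, *others = mirrors
    for ltg in mwc_table:
        data = source[ltg]          # read one copy of the mirror
        for mirror in others:
            mirror[ltg] = data      # write it to the other mirror(s)


# LTG 7 was in flight at the crash: only copy_b received the new data.
copy_a = {7: b"old", 9: b"same"}
copy_b = {7: b"new", 9: b"same"}
resync_from_mwc_table([7, 9], [copy_a, copy_b])
print(copy_a == copy_b)
```

In this run the mirrors end up identical, but LTG 7 holds the old data on both copies, which is the sense in which mwcc guarantees consistency rather than the latest write.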

AIX 4 MWCC performance implications

The current mwcc algorithm carries a penalty for heavily random writes: performance sags because an extra write is done for each write you perform. A good example, taken from a customer, is a mail server that had mirrored accounts. Thousands of users were constantly writing or deleting files in their mail accounts, so the LTG entries in the mwc table were constantly being changed and written to disk. In addition to that overhead, while an mwcc table write is in progress, new requests that come into the LVM work queue are held until that write returns; only then can the table be updated and sent down to the disk platters once more.

Current AIX 4 MWCC workaround.

Currently, the only way customers can work around the performance penalty associated with mwcc is to turn the functionality off. But in order to ensure data consistency, they must run syncvg -f <vgname> immediately after a system crash and reboot to synchronize the data.

Since there is no mwcc table on the platter, there is no way to determine which LTGs need resyncing, thus a forced resync of ALL partitions is required. Omitting this synchronization may cause inconsistent data.

AIX 5 LVM Passive Mirror Write Consistency Check

The MWCC implementation in AIX 5 provides a new passive algorithm, but only for big VGs. The reason is that a dirty flag is needed for each logical volume, and only the VGSA for big VGs provides space for it.

AIX 5 Passive MWCC algorithm

The new MWCC algorithm sets a flag when the mirrored LV is opened in RW mode, and the flag is not cleared until the last close on the device. The flag is then examined during subsequent boots. The algorithm implemented is:

1. The user opens a mirrored logical volume.

2. The lvm driver marks a bit in the VGSA which states that, for purposes of passive mwcc, the lv is "dirty".

3. Reads and writes occur to the mirrored lv with no (traditional) mwcc table writes.

4. The machine crashes.

5. Upon reboot, the volume group automatically varies on. As part of this varyonvg, checks are made to see if dirty bits exist for each lv.

6. For each logical volume that is dirty, a "syncvg -f -l <lvname>" is performed, regardless of whether or not the user wants to do this.
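The six steps amount to a small state machine around one persistent bit. The Python sketch below models it under simplifying assumptions: the class, its attributes and the crash method are invented for illustration, and the on-disk bit is represented by an ordinary attribute that survives the simulated crash.

```python
class PassiveMwccLv:
    """Toy model of a mirrored LV that uses the passive MWCC policy."""

    def __init__(self, name):
        self.name = name
        self.open_count = 0      # in memory only, lost at a crash
        self.dirty = False       # stands in for the dirty bit kept on disk
        self.recovering = False  # set at varyon if the LV was left dirty

    def open(self):
        self.open_count += 1
        self.dirty = True        # step 2: mark the LV dirty

    def close(self):
        self.open_count -= 1
        if self.open_count == 0 and not self.recovering:
            self.dirty = False   # last close: mirrors are consistent

    def crash(self):
        self.open_count = 0      # step 4: the dirty bit stays set

    def varyon(self):
        """Steps 5 and 6: return True if a forced resync is required."""
        self.recovering = self.dirty
        return self.recovering


lv = PassiveMwccLv("lv00")
lv.open()
lv.crash()
print(lv.varyon())
```

A clean open and close sequence leaves the bit clear, so the next varyon needs no resync; only an LV that was open at the time of the crash is resynchronized.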

Advantage:

The behavior of a mirrored write will be the same as that of a mirrored logical volume with no mwcc. Since crashes are very rare, the need for mwcc resync is negligible. Thus, a mostly unnecessary write (the mwc table update) is avoided.

Disadvantage:

After a crash, the entire logical volume is considered dirty, although only a few blocks may have changed. While the logical volume is open, it remains dirty until all its partitions have been resynced. Additionally, reads will be a bit slower during that time, as a read-then-sync operation must be performed.

Commands affected by the Passive MWCC algorithm

The varyonvg command informs the user that a background forced sync may be occurring as part of passive MWCC recovery.

The syncvg command informs the user that a non-forced sync on a logical volume with passive MWCC will result in a forced background sync.

The lslv command has been altered so that its output shows whether Passive MWCC is set and active.

To set passive sync

• mklv -w p = Use Passive MWCC algorithm

• chlv -w p = Use Passive MWCC algorithm

Changes in Kernel extensions due to Passive MWCC

Three functions are changed: hd_open, hd_close, and hd_ioctl.

hd_open: if the logical volume being opened is part of a big VG, is being opened for write, is mirrored, and has the mwcc policy set to passive, the lv_dirty_bit representing the logical volume minor number is marked as dirty. The bit may be set multiple times, as multiple opens result in multiple visits to hd_open.

hd_close: this function is called only when a logical volume is being closed for the last time. When this occurs, the function checks that the logical volume is part of a big VG, has more than one copy, has the mwcc policy set to passive, and does not have its passive_mwcc_recover flag set. If all these conditions are true, the lv_dirty_bit of the logical volume is cleared and the logical volume mirrors are considered 100% consistent with each other.

hd_ioctl: this returns additional status that tells the user whether the logical volume is currently marked as needing to undergo, or is actually undergoing, passive mwcc recovery (all reads result in a resync of the mirrors).

The function hd_mirread is called upon the completion, successful or otherwise, of a read of a mirrored logical volume. On entry, if the passive_mwcc_recover flag is set, the function searches the other viable mirrors that were not read and copies the contents of the just-read mirror into them. It does so by first marking the mirrors to avoid in the pb_mirbad variable and then calling the function hd_fixup.

The function hd_kdeflvs, which is called at varyonvg time, checks whether a logical volume is mirrored, has the mwcc policy set to passive, and belongs to a big volume group. If so, it checks the lv_dirty_bit of that logical volume in the VGSA. If the bit is set, the driver records that the logical volume is in passive mwcc recovery state by setting the passive_mwcc_recover flag to true.

Changes to allow hd_kextend to work properly with the new LV_ACTIVE_MWC definition.

Changes in hdpin.exp

Export the call hd_sa_update so that hd_top can also update the VGSA with the modified lv_dirty_bit as a result of hd_open or hd_close.

AIX 5 LVM Hot Spare Disk in a Volume group.

AIX 5 Hot Spare Disk function

• Automatic migration of failed disks for mirrored LVs

• Ability to create spare disk pool for a VG

The hot spare function applies to mirrored LVs only. Non-mirrored LVs on a failing disk cannot be recovered, and therefore no attempt is made.

AIX 5 Hot Spare disk chpv command

chpv [-h Hotspare] ... existing flags ... PhysicalVolume

-h Hotspare

• Sets the sparing characteristics of the physical volume specified by the PhysicalVolume parameter, so that the physical volume can be used as a hot spare, and sets the allocation permission for physical partitions on it. This flag has no meaning for non-mirrored logical volumes. The Hotspare variable can be either:

• y: Marks the disk as a hot spare disk within the VG it belongs to and prohibits the allocation of physical partitions on the physical volume. The disk must not have any partitions allocated to logical volumes to be successfully marked as a hot spare disk.

• n: Removes the disk from the hot spare pool for the volume group in which it resides and allows allocation of physical partitions on the physical volume.

AIX 5 Hot Spare disk chvg command

chvg [-s Sync] [-h Hotspare] ... existing flags ... VolumeGroup

-h Hotspare

• Sets the sparing characteristics for the volume group specified by the VolumeGroup parameter. Either allows (yes) the automatic migration of failed disks, or prohibits (no) the automatic migration of failed disks. This flag has no meaning for non-mirrored logical volumes.

• y: Allows the automatic migration of failed disks. Uses one-for-one migration of partitions from one failed disk to one spare disk. The smallest disk in the volume group spare pool that is big enough for one-for-one migration will be used.

• Y: Allows the automatic migration of failed disks. Potentially uses the entire pool of spare disks to migrate to, as opposed to a one-for-one migration of partitions to a spare.

• n: Prohibits the automatic migration of failed disks. This is the default value for a volume group.

• r: Removes all disks from the hot spare pool for the volume group.
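The selection rule for the y setting (the smallest spare that is big enough) can be illustrated with a few lines of Python. This is only a sketch of the stated rule: the function name is invented, and "big enough" is assumed to mean that the spare has at least as many physical partitions as must be migrated.

```python
def pick_spare(needed_pps, spares):
    """Return the name of the smallest spare with at least needed_pps
    physical partitions, or None if no single spare is big enough."""
    fits = {name: pps for name, pps in spares.items() if pps >= needed_pps}
    if not fits:
        return None
    return min(fits, key=fits.get)


# Hypothetical spare pool: disk name -> number of physical partitions.
pool = {"hdisk4": 1000, "hdisk5": 542, "hdisk6": 400}
print(pick_spare(500, pool))
```

When no single spare is large enough, one-for-one migration is impossible; that is the situation the Y setting addresses by drawing on the whole pool.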

-s sync

Sets the synchronization characteristics for the volume group specified by the VolumeGroup parameter. Either allows (yes) the automatic synchronization of stale partitions or prohibits (no) the automatic synchronization of stale partitions. This flag has no meaning for non-mirrored logical volumes.

• y: Attempts to automatically synchronize stale partitions.

• n: Prohibits automatic synchronization of stale partitions. This is the default for a volume group.

Related display commands:

• lsvg -p shows the status of all physical volumes in the VG.

• lsvg shows the current state of sparing and synchronization.

• lspv shows whether a disk is a spare.

LVM Hot spot management

AIX 5 LVM Hot Spot Management

AIX 5 provides tools to determine which logical partitions have high I/O traffic and to migrate those logical partitions to other disks. The benefits are:

• Improve performance by eliminating hot spots.

• The system can also be used to migrate certain logical partitions for maintenance.

LVM Hot spot data collection

lvmstat { -l | -v } Name [ -e | -d ] [ -F ] [ -C ] [ -c Count ] [ -s ] [ Interval [ Iterations ] ]

The lvmstat command generates reports that can be used to change the logical volume configuration to better balance the input/output load between physical disks. By default, statistics collection is not enabled in the system. You must use the -e flag to enable this feature for the logical volume or volume group in question. Enabling statistics collection for a volume group enables it for all the logical volumes in that volume group.

The first report generated by lvmstat provides statistics concerning the time since the system was booted. Each subsequent report covers the time since the previous report. All statistics are reported each time lvmstat runs. The report consists of a header row followed by a line of statistics for each logical partition or logical volume depending on the flags specified.

Flags

• -c Count Prints only the specified number of lines of statistics.

• -C Causes the counters that keep track of iocnt, Kb_read and Kb_wrtn to be cleared for the specified logical volume/volume group.

• -d Specifies that statistics collection should be disabled for the logical volume/volume group in question.

• -e Specifies that statistics collection should be enabled for the logical volume/volume group in question.

• -F Causes the statistics to be printed colon-separated.

• -l Specifies that the Name specified is the name of the logical volume.

• -s Suppresses the header from the subsequent reports when Interval is used.

• -v Specifies that the Name specified is the name of the volume group.

LVM Hot Spot lists

The lvmstat command is useful in determining whether a physical volume is becoming a hindrance to performance, by identifying the busiest physical partitions for a logical volume.

The lvmstat command generates two types of reports: per-logical-partition statistics in a logical volume, and per-logical-volume statistics in a volume group. The reports have the following format:

# lvmstat -l hd3

Log_part  mirror#  iocnt  Kb_read  Kb_wrtn  Kbps
       1        1      0        0        0  0.00
       2        1      0        0        0  0.00
       3        1      0        0        0  0.00

# lvmstat -v rootvg

Logical Volume  iocnt  Kb_read  Kb_wrtn  Kbps
hd2              1592     5620      880  0.05
hd9var             71       32       28  0.00
hd8                71        0      284  0.00
hd4                13        8       60  0.00
hd1                11        1       21  0.00
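A report like the per-volume-group one can be post-processed to rank logical volumes by activity. The Python sketch below parses the sample report text as printed here; it assumes whitespace-separated columns in exactly this order, and the function name and dictionary keys are illustrative rather than part of lvmstat.

```python
SAMPLE = """\
Logical Volume iocnt Kb_read Kb_wrtn Kbps
hd2 1592 5620 880 0.05
hd9var 71 32 28 0.00
hd8 71 0 284 0.00
hd4 13 8 60 0.00
hd1 11 1 21 0.00
"""


def parse_lvmstat_v(text):
    """Turn the data rows of a per-volume-group report into dictionaries."""
    rows = []
    for line in text.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) != 5:
            continue                        # ignore blank or malformed lines
        name, iocnt, kb_read, kb_wrtn, kbps = fields
        rows.append({
            "lv": name,
            "iocnt": int(iocnt),
            "kb_read": int(kb_read),
            "kb_wrtn": int(kb_wrtn),
            "kbps": float(kbps),
        })
    return rows


rows = parse_lvmstat_v(SAMPLE)
busiest = max(rows, key=lambda row: row["iocnt"])
print(busiest["lv"], busiest["iocnt"])
```

The same approach applies to the per-logical-partition report, where the busiest rows point at the partitions worth migrating.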

Migrating Hot Spots

migratelp LVname/LPartnumber[ /Copynumber ] DestPV[/PPartNumber]

The migratelp command moves the specified logical partition LPartnumber of the logical volume LVname to the DestPV physical volume. If the destination physical partition PPartNumber is specified, it is used; otherwise a destination partition is selected using the intra-region policy of the logical volume. By default, the first mirror copy of the logical partition in question is migrated. A value of 1, 2 or 3 can be specified for Copynumber to migrate a particular mirror copy.

The migratelp command fails to migrate partitions of striped logical volumes.

Examples

To move the first logical partition of logical volume lv00 to hdisk1, type:

migratelp lv00/1 hdisk1

To move the second mirror copy of the third logical partition of logical volume hd2 to hdisk5, type:

migratelp hd2/3/2 hdisk5
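When migrations are scripted, the two slash-separated arguments have to be assembled from their parts. The following Python helper is a hypothetical convenience wrapper, not part of AIX; it only builds the argument list according to the syntax shown for migratelp and rejects copy numbers outside 1 to 3.

```python
def migratelp_args(lv, lp, dest_pv, copy=None, dest_pp=None):
    """Build the argument list for: migratelp LV/LP[/Copy] DestPV[/PP]."""
    if copy is not None and copy not in (1, 2, 3):
        raise ValueError("Copynumber must be 1, 2 or 3")
    source = f"{lv}/{lp}" + (f"/{copy}" if copy is not None else "")
    dest = dest_pv + (f"/{dest_pp}" if dest_pp is not None else "")
    return ["migratelp", source, dest]


# The two examples from the text.
print(" ".join(migratelp_args("lv00", 1, "hdisk1")))
print(" ".join(migratelp_args("hd2", 3, "hdisk5", copy=2)))
```

Returning a list rather than a single string keeps the result suitable for passing to a process launcher without shell quoting.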

LVM split mirror AIX 4.3.3.

Splitting and reintegrating a mirror

Making online backups has long been a requirement. In installations with mirrored volumes in particular, a frequently requested feature has been the ability to split the mirror and use one side of it for online backups. A manual split and later reintegration has been possible, but the procedure was rather complicated and therefore unsafe. In AIX 4.3.3 this feature is available through an easy command interface.

A mirrored LV can be split with the chfs command. The following example splits the LV mounted on /testfs; copy number 3 will be mounted at /backup.

chfs -a splitcopy=/backup -a copy=3 /testfs

The LV is reintegrated in two steps:

# umount /backup

# rmfs /backup

LVM Variable logical track group (LTG)

AIX 5 introduces variable LTG size to improve disk performance

Today the Logical Volume Manager (LVM) shipped with all versions of AIX has a constant maximum transfer size of 128K, known within LVM as the Logical Track Group (LTG). All I/O within LVM must be on a Logical Track Group boundary. When AIX was first released, all disks supported 128K. Today many disks go beyond 128K, and the efficiency of many devices, such as RAID arrays, is impacted if the I/O is not a multiple of the stripe size, which is normally larger than 128K.

The enhancements in AIX 5 allow a VG LTG size to be specified at VG creation time. They also allow the VG LTG size to be changed while the volume group is active but no logical volumes are open. The default LTG size is still 128K; other sizes must be requested by the user. The mkvg and chvg commands will fail if the specified LTG is larger than the max_transfer size of the target disk(s). The extendvg command will fail if the specified LTG is larger than the max_transfer size of the target disk(s). The LTG size cannot be changed for disks active in concurrent mode.

Variable LTG size and commands

LTG now supports the following sizes

• 128K - Default value

• 256K

• 512K

• 1024K

Variable LTG commands:

• mkvg -L <size> - create a new volume group with LTG size = <size>

• chvg -L <size> - change a volume group to LTG size = <size>

• lsvg <volume group> will display the LTG size
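The two conditions on a requested LTG size (it must be one of the supported values, and it must not exceed the max_transfer size of any disk involved) can be written as a short check. The Python sketch below is only an illustration of those rules: the function name is invented, and the per-disk max_transfer values are hypothetical inputs expressed in kilobytes.

```python
SUPPORTED_LTG_KB = (128, 256, 512, 1024)


def ltg_is_acceptable(ltg_kb, max_transfer_kb_by_disk):
    """Return True if ltg_kb is a supported LTG size and no disk's
    max_transfer size is smaller than it."""
    if ltg_kb not in SUPPORTED_LTG_KB:
        return False
    return all(ltg_kb <= limit for limit in max_transfer_kb_by_disk.values())


# Hypothetical volume group with two disks.
disks = {"hdisk0": 256, "hdisk1": 1024}
print(ltg_is_acceptable(256, disks))
print(ltg_is_acceptable(512, disks))
```

The disk with the smallest max_transfer size sets the ceiling for the whole volume group, which is also why adding a disk with a smaller limit is refused.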

LVM command overview

High level commands

• varyonvg executable

• extendvg shell script

• extendlv shell script

• mkvg shell script

• mklv shell script

• lsvg executable

• lspv executable

• lslv executable

Internal commands

• getlvodm executable

• getvgname executable

• putlvodm executable

• synclvodm executable

• allocp executable

• mapread

• map_alloc

• migfix executable

Low level commands

• lcreatevg executable

• lmigratelv executable

• lquerypv executable

• lqueryvg executable

• lextendlv executable

• lreducelv executable

• lquerylv executable

• lqueryvgs executable

LVM Problem Determination

LVM Problem Determination

The purpose of this section is to help answer the following questions:

• What is the root cause of the error?

• Can this problem be distilled to the simplest case?

• What has been lost and what is left behind?

• Is this situation repairable?

Because each LVM problem is, in most cases, specific to a user and their environment, this section isn't a "how-to" section. Instead, it is mostly a checklist that helps the user gather the information needed to rationally determine the root cause of the problem, and whether the problem can be fixed in the field rather than being sent to Level 3 software support. If the problem must be sent to Level 3, this section suggests information that would speed Level 3's problem determination and solution.

What is the root cause of the error?

The first question to ask is whether the problem is really in the LVM layer. The sections that detail how an I/O request is handed down from layer to layer might help clarify all the layers that must be considered. The most important initial determination is whether the problem is above the LVM layer, in the LVM layer, or below the LVM layer.

For instance, an application program such as Oracle or HACMP/6000 that accesses the LVM directly might have a problem. If you can determine what actions these failing programs are attempting on the LVM, try to recreate those actions by hand, using a method that is not based on those application programs. If your attempt by hand works, the focus of the problem shifts "up" to the application program. If it fails, you have isolated the problem to the LVM layer (or below). Alternatively, the problem could simply be corruption of the data needed by LVM: the programs are behaving correctly, but corrupted data is causing LVM to behave strangely.

An additional bonus to the field investigator is the fact that most high-level commands are shell scripts. An investigator who is familiar with shell programming can turn on the shell output and watch the execution of the shell commands to observe the failure point. This might produce additional helpful information for the problem record.

Finally, if there is corruption or loss of data required by LVM (such as a disk accidentally erased from a volume group), it helps to find the exact steps performed (or even not performed) by the user, so that the investigator can deduce the state of the system and what useful LVM information is left behind.

Can this problem be distilled to the simplest case?

Many times, problem reports from the field to Level 3 concerning LVM are difficult to investigate because clarification is required to determine the root cause of the problem, or because the problem is described in terms of a complex user configuration. If possible, the most basic action of the LVM is the one that should be investigated. This is not always possible, as some problems may only be exposed in a complex environment. However, whenever possible one should try to distill the case down to how an action on a logical volume is causing misbehavior by the system. In that clarification, a non-LVM root cause may be discovered instead.

What has been lost and what has been left behind?

This type of question is typically asked of the system when some sort of accident has resulted in data corruption or loss of LVM required information. Given the state of the system before the corruption, the steps that most likely caused the corruption, and the current state of the machine, one can deduce what is left to work with. Sometimes one will receive conflicting information. This is because part of the ODM disagrees with part of the VGDA. The ODM is the one that is easily alterable (compared to the VGDA).

Is this situation repairable?

Sometimes you have enough information to know what is missing and what should be done to repair the system. However, the design of ODM, the system configurator, and LVM prevents the repair. By fixing one problem, another is spawned. And, one is caught in a deadlock situation that cannot be fixed unless one wrote very specific kernel code to repair the internal aspects of the LVM (most likely the VGDA). This is not a trivial solution, but it is possible. It is only through experience that a judgement can be made if recovery can be attempted.

Problem Recovery

• Warn the user of the consequences

• Gather all possible data

• Save off what can be saved

• Each case is different, and so is each solution

Warn the user of the consequences

Although this might seem a trivial step, it matters: when you attempt problem recovery, most of the time you must alter or destroy an important internal structure within the LVM (such as the VGDA). Once this is done, if the recovery attempt doesn't work, the user's system is usually in worse shape than before the attempt. Many users will decline the recovery attempt once this warning is given. However, it is better to warn them ahead of time!

Gather all possible data

While the volume group is still partially accessible, gather all possible data about the current volume group. The VGDA will provide information about missing logical volumes, which will be important. Once the recovery procedure starts, important reference information such as that gathered from the VGDA will be lost for good. And if your information is incomplete, you may be stuck with nowhere to go.

Save off what can be saved

Before starting the recovery, make a copy of files that can be restored in case something goes wrong. A good example is the ODM database files that reside in /etc/objrepos. Sometimes the recovery steps involve deleting information from those databases. Once it is deleted, anyone unsure of its form cannot recreate the lost structures or values.
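One way to take such a copy is a timestamped directory copy, sketched below in Python. The function name and the naming scheme for the backup directory are illustrative choices, and any equivalent copy method serves the same purpose.

```python
import shutil
import time
from pathlib import Path


def backup_directory(src, dest_root):
    """Copy the directory src to dest_root/<name>.<timestamp>; return the new path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"{Path(src).name}.{stamp}"
    shutil.copytree(src, dest)   # fails if dest already exists, so nothing is overwritten
    return dest
```

For the case described here, a call such as backup_directory("/etc/objrepos", "/tmp/lvm-recovery") would be made before any ODM entries are removed; the destination path is only an example.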

Each case is different, and so is each solution

Since each LVM problem is most likely going to be unique for that system, these notes cannot provide a list of steps one would take in a repair. Once again, the recovery steps must be based on individual experiences with LVM. The LVM lab exercise on recovery provides a glimpse of the complexity and information required to repair a system. However, this lab is just an example, not a template of how all fixes should be attempted.

Trace LVM commands with the trace command

Trace hook 105: HKWD KERN LVM

This event is recorded by the Logical Volume Manager for selected events.

LVM relocingblk bp=value pblock=value relblock=value

• Encountered relocated block

• bp=value, Buffer pointer

• pblock=value, Physical block number

• relblock=value, Relocated block number.

LVM oldbadblk bp=value pblock=value state=value bflags

• Bad block waiting to be relocated

• bp=value, Buffer pointer

• pblock=value, Physical block number

• state=value, State of the physical volume

• bflags, Buffer flags are defined in the sys/buf.h file.

LVM badblkdone bp=value

• Block relocation complete

• bp=value, Buffer pointer.

LVM newbadblk bp=value badblock=value error=value bflags

• New bad block found

• bp=value, Buffer pointer

• badblock=value, Block number of bad block

• error=value, System error number (the errno global variable)

• bflags, Buffer flags are defined in the sys/buf.h file.

LVM swreloc bp=value status=value error=value retry=value

• Software relocating bad block

• bp=value, Buffer pointer

• status=value, Bad block directory entry status

• error=value, System error number (the errno global variable)

• retry=value, Relocation entry count.

LVM resyncpp bp=value bflags

• Resyncing Logical Partition mirrors

• bp=value, Buffer pointer

• bflags, Buffer flags are defined in the sys/buf.h file.

LVM open device name flags=value

• Open

• device name, Name of the device

• flags=value, Open file mode.

LVM close device name

• Close

• device name, Name of the device.

LVM read device name ext=value

• Read

• device name, Name of the device

• ext=value, Extension parameters.

LVM write device name ext=value

• Write

• device name, Name of the device

• ext=value, Extension parameters.

LVM ioctl device name cmd=value arg=value

• ioctl

• device name, Name of the device

• cmd=value, ioctl command

• arg=value, ioctl arguments.

Example of a trace -a -j105:

ID ELAPSED_SEC DELTA_MSEC APPL SYSCALL KERNEL INTERRUPT

001 0.000000000 0.000000 TRACE ON channel 0 Mon Sep 18 21:52:50 2000

105 20.598330739 6.109275 LVM close: rloglv00

105 20.598415445 0.084706 LVM close: rlv00
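When a report like the one above has to be post-processed, each line can be split into its hook ID, the two timing columns, and the event text. The following is a minimal sketch, assuming the column layout shown in the example (ID, ELAPSED_SEC, DELTA_MSEC, then free text); the structure and function names are hypothetical and are not part of any AIX interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of trcrpt output (field names are this sketch's own). */
struct trc_line {
    char   id[8];        /* hook ID, e.g. "105" or "10B" */
    double elapsed_sec;  /* ELAPSED_SEC column */
    double delta_msec;   /* DELTA_MSEC column */
    char   text[128];    /* rest of the line, e.g. "LVM close: rlv00" */
};

/* Parse one report line. Returns 1 on success, 0 if the line does not
   start with "<id> <number> <number>" (for example the column header). */
int parse_trc_line(const char *line, struct trc_line *out)
{
    int consumed = 0;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%7s %lf %lf %n", out->id, &out->elapsed_sec,
               &out->delta_msec, &consumed) < 3)
        return 0;

    /* Copy whatever follows the three columns as the event text. */
    strncpy(out->text, line + consumed, sizeof(out->text) - 1);
    out->text[sizeof(out->text) - 1] = '\0';
    return 1;
}
```

Filtering on the id field ("105" or "10B") then isolates the LVM events in a mixed report.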

Trace hook 10B

10B : HKWD KERN LVMSIMP

This event is recorded by Logical Volume Manager for selected events.

Recorded Data

Event:

LVM rblocked: bp=value

• Request blocked by conflict resolution

• bp=value, Buffer pointer.

LVM pend: bp=value resid=value error=value bflags

• End of physical operation

• bp=value, Buffer pointer

• resid=value, Residual byte count

• error=value, System error number (the errno global variable)

• bflags, Buffer flags are defined in the sys/buf.h file.

LVM lstart: device name bp=value lblock=value bcount=value bflags opts: value

• Start of logical operation

• device name, Device name

• bp=value, Buffer pointer

• lblock=value, Logical block number

• bcount=value, Byte count

• bflags, Buffer flags are defined in the sys/buf.h file

• opts: value, Possible values: WRITEV, HWRELOC, UNSAFEREL, RORELOC, NO_MNC, MWC_RCV_OP, RESYNC_OP, AVOID_C1, AVOID_C2, AVOID_C3.

Example of a trace -a -j10b:

ID ELAPSED_SEC DELTA_MSEC APPL SYSCALL KERNEL INTERRUPT

001 0.000000000 0.000000 TRACE ON channel 0 Mon Sep 18 21:52:50 2000

10B 0.007512611 7.512611 LVM pend: pbp=F10000971615E580 resid=0000 error=0000 B_WRITE

10B 0.007523970 0.011359 LVM lend: rhd9var lbp=F10000971E17E1A0 resid=0000 error=0000 B_WRITE

10B 8.968758818 8961.234848 LVM lstart: rhd4 lbp=F100009

LVM Library calls

List of Logical Volume Subroutines

The library of LVM subroutines is a main component of the Logical Volume Manager.

LVM subroutines define and maintain the logical and physical volumes of a volume group. They are used by the system management commands to perform system management for the logical and physical volumes of a system. The programming interface for the library of LVM subroutines is available to anyone who wishes to provide alternatives to or expand the function of the system management commands for logical volumes.

Note: The LVM subroutines use the sysconfig system call, which requires root user authority, to query and update kernel data structures describing a volume group. You must have root user authority to use the services of the LVM subroutine library.

The following services are available:

• lvm_querylv Queries a logical volume and returns all pertinent information.

• lvm_querypv Queries a physical volume and returns all pertinent information.

• lvm_queryvg Queries a volume group and returns pertinent information.

• lvm_queryvgs Queries the volume groups of the system and returns information for groups that are varied on-line.

Logical volume device driver (LVDD)

LVM logical volume device driver

The Logical Volume Device Driver (LVDD) is a pseudo-device driver that operates on logical volumes through the /dev/lvn special file. Like the physical disk device driver, this pseudo-device driver provides character and block entry points with compatible arguments. Each volume group has an entry in the kernel device switch table. Each entry contains entry points for the device driver and a pointer to the volume group data structure. The logical volumes of a volume group are distinguished by their minor device numbers.

• Attention: Each logical volume has a control block located in the first 512 bytes. Data begins in the second 512-byte block. Care must be taken when reading and writing directly to the logical volume, because the control block is not protected from writes. If the control block is overwritten, commands that use it can no longer be used.
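The warning above can be turned into a small guard for programs that read or write a logical volume directly. This is a minimal sketch, assuming only what the text states (a control block in the first 512 bytes, data starting in the second 512-byte block); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>

#define LV_BLOCK_SIZE 512   /* block size used in the text above */

/* Byte offset of the n-th 512-byte data block of a logical volume,
   skipping the control block held in block 0. */
off_t lv_data_offset(off_t data_block)
{
    assert(data_block >= 0);
    return (data_block + 1) * LV_BLOCK_SIZE;
}

/* Nonzero if a write of len bytes starting at offset would touch the
   control block and should therefore be refused. */
int touches_control_block(off_t offset, size_t len)
{
    return len > 0 && offset < LV_BLOCK_SIZE;
}
```

A tool that writes to a /dev/rlvn device could call touches_control_block before every write and stop when it returns nonzero.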

Character I/O requests are performed by issuing a read or write request on a /dev/rlvn character special file for a logical volume. The read or write is processed by the file system SVC handler, which calls the LVDD ddread or ddwrite entry point. The ddread or ddwrite entry point transforms the character request into a block request. This is done by building a buffer for the request and calling the LVDD ddstrategy entry point.

Block I/O requests are performed by issuing a read or write on a block special file /dev/lvn for a logical volume. These requests go through the SVC handler to the bread or bwrite block I/O kernel services. These services build buffers for the request and call the LVDD ddstrategy entry point. The LVDD ddstrategy entry point then translates the logical address to a physical address (handling bad block relocation and mirroring) and calls the appropriate physical disk device driver.

On completion of the I/O, the physical disk device driver calls the iodone kernel service on the device interrupt level. This service then calls the LVDD I/O completion-handling routine. Once this is completed, the LVDD calls the iodone service again to notify the requester that the I/O is completed.

The LVDD is logically split into top and bottom halves. The top half contains the ddopen, ddclose, ddread, ddwrite, ddioctl, and ddconfig entry points. The bottom half contains the ddstrategy entry point, which contains block read and write code. This is done to isolate the code that must run fully pinned and has no access to user process context. The bottom half of the device driver runs on interrupt levels and is not permitted to page fault. The top half runs in the context of a process address space and can page fault.

Disk Device Calls

scsidisk, SCSI Disk Device Driver

This driver supports the small computer system interface (SCSI) and the Fibre Channel Protocol for SCSI (FCP) fixed disk, CD-ROM (compact disk read only memory), and read/write optical (optical memory) devices.

Syntax

#include <sys/devinfo.h>

#include <sys/scsi.h>

#include <sys/scdisk.h>

Device-Dependent Subroutines

Typical fixed disk, CD-ROM, and read/write optical drive operations are implemented using the open, close, read, write, and ioctl subroutines.

open and close Subroutines:

The openx subroutine is intended primarily for use by the diagnostic commands and utilities. Appropriate authority is required for execution.

The ext parameter passed to the openx subroutine selects the operation to be used for the target device. The /usr/include/sys/scsi.h file defines possible values for the ext parameter.

rhdisk Special File

Provides raw I/O access to the physical volumes (fixed-disk) device driver.

The rhdisk special file provides raw I/O access and control functions to physical-disk device drivers for physical disks. Raw I/O access is provided through the /dev/rhdisk0, /dev/rhdisk1, ..., character special files.

Direct access to physical disks through block special files should be avoided. Such access can impair performance and also cause data consistency problems between data in the block I/O buffer cache and data in system pages. The /dev/hdisk block special files are reserved for system use in managing file systems, paging devices and logical volumes.

The r prefix on the special file name indicates that the drive is to be accessed as a raw device rather than a block device.
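Raw access through a character special file can be illustrated with ordinary POSIX calls. The sketch below reads one 512-byte sector from a given path; the helper name and the 512-byte sector size are assumptions made for illustration, and the path (for example /dev/rhdisk0) is supplied by the caller. Raw devices generally expect transfers that are whole multiples of the sector size, which is why the read length is fixed.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512   /* assumed physical block size */

/* Read sector number `sector` from `path` into buf, which must hold
   SECTOR_SIZE bytes. Returns 0 on success, -1 on error or short read. */
int read_sector(const char *path, off_t sector, unsigned char *buf)
{
    int fd;
    ssize_t n;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* pread positions and reads in one call, leaving no file offset state. */
    n = pread(fd, buf, SECTOR_SIZE, sector * SECTOR_SIZE);
    close(fd);
    return n == SECTOR_SIZE ? 0 : -1;
}
```

Reading a physical volume this way requires root authority; the same function works on a file produced with dd, which is a safer way to experiment.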

Disk low level Device Calls such as SCSI calls

SCSI Adapter Device Driver

The SCSI device driver has access to the physical disk (if it is a SCSI disk). The driver supports data transfers via read and write, and control commands via ioctl calls. The disk device driver uses the adapter device driver to access and control the physical storage device.

Supports the SCSI adapter.

Syntax

#include <sys/scsi.h>

#include <sys/devinfo.h>

Description

The /dev/scsin and /dev/vscsin special files provide interfaces to allow SCSI device drivers to access SCSI devices. These files manage the adapter resources so that multiple SCSI device drivers can access devices on the same SCSI adapter simultaneously. The /dev/vscsin special file provides the interface for the SCSI-2 Fast/Wide Adapter/A and SCSI-2 Differential Fast/Wide Adapter/A, while the /dev/scsin special file provides the interface for the other SCSI adapters. SCSI adapters are accessed through the special files /dev/scsi0, /dev/scsi1, .... and /dev/vscsi0, /dev/vscsi1, ....

The /dev/scsin and /dev/vscsin special files provide interfaces for access for both initiator and target mode device instances. The host adapter is an initiator for access to devices such as disks, tapes, and CD-ROMs. The adapter is a target when accessed from devices such as computer systems, or other devices that can act as SCSI initiators.

For further information, see Kernel and Subsystems Technical Reference, Volume 2, and the Files Reference manual.

Exercises

Examine the physical disk layout of a logical volume and a physical volume.

Use a tool such as edhx, hexit, or dd to look at a physical volume. Identify the PVID, the VGID, and the LVM structures.

Hint: Which device should you use to access this data? It may be easier to copy data from the drive to a file with the dd command:

dd if=/dev/xxx of=/tmp/Myfile bs=1024k count=<number of MB>

Use another device to look at the logical volume. Does the data match the data from the physical device?

Examine the impact of LVM Passive Mirror Write Consistency

This exercise looks at the performance impact of enabling and disabling MWC. To do this we need a reproducible write load. One way to get this is to write a C program that creates the load. Remember that the file has to be really big to exceed the cache size; alternatively, force a sync to occur before terminating.

Sample C code to write a big file:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* BLOCKS was not defined in the original listing. This value (about 1 GB
   of 512-byte blocks) is an assumption; size it to exceed the cache. */
#define BLOCKS 2000000

void writetstfile(void)
{
    char buffer[512];
    char *filename = "/test/a_large_file";
    int i;
    int fildes;

    /* give every block defined contents */
    memset(buffer, 'x', sizeof(buffer));

    /* create (or truncate) the file and open it for writing */
    if ((fildes = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0640)) < 0) {
        printf("cannot create file\n");
        exit(1);
    }

    for (i = 0; i < BLOCKS; i++) {
        if (write(fildes, buffer, sizeof(buffer)) < 0) {
            printf("error writing block %d\n", i);
            exit(1);
        }
    }

    /* force the data to disk before terminating */
    if (fsync(fildes) < 0) {
        printf("cannot sync file\n");
        exit(1);
    }
    close(fildes);
}

int main(void)
{
    writetstfile();
    return 0;
}

Examine the function of LVM LTG

The LTG is the LVM Logical Track Group, the amount of data read or written to the disk in each operation. Try to monitor the amount of data transferred and the number of disk transactions per second during I/O. Both can be monitored with the iostat command.
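To relate the two iostat figures to the LTG, divide the throughput by the transaction rate; the result is the average amount of data moved per disk transaction. The helper below is a sketch of that arithmetic only. Its name is hypothetical, and it assumes the throughput is taken from the Kbps column and the transaction rate from the tps column of the iostat disk report.

```c
#include <assert.h>

/* Average KB moved per disk transaction, computed from a throughput
   in KB per second and a transaction rate per second. */
double avg_kb_per_transfer(double kbps, double tps)
{
    assert(kbps >= 0.0);
    return tps > 0.0 ? kbps / tps : 0.0;   /* idle disk: report 0 */
}
```

For example, 12800 KB per second at 100 transactions per second gives 128 KB per transaction. During large sequential I/O this average can be compared with the LTG size of the volume group.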

Test the split mirror facility

Test the “splitting and reintegrating” facility of a mirror. First create a mirrored LV, and write data to it. Then split the mirror and access data from both sides. Change data on the “primary side”, and then reintegrate the mirror. What happens?

How fast are the mirrors reintegrated?

Are they really synchronized?

Exercise: Trace LVM system activity

In this exercise we will use the trace command to monitor LVM activity.

Start, stop, and list the results from an LVM trace with the commands:

trace -a -j105 -j10b

trcstop

trcrpt > <filename>

Unmount a filesystem, mount it again, create a file, and write data into the file to create some activity in the LVM trace file.

Unit 12. Enhanced Journaled File System

Objectives After completing this unit, you should be able to:

• List the difference between the terms aggregate and fileset.

• Identify the various data structures that make up the JFS-2 filesystem.

• Use the fsdb command to trace the various data structures that make up the logical and virtual file system.

J2 - Enhanced Journaled File System

Introduction The Enhanced Journaled File System (JFS2) is an extent-based journaled file system. It is the default filesystem on IA-64 systems and is available on Power-based systems. Currently the default on Power systems is the Journaled File System (JFS).

Numbers The following table lists some general information about JFS2:

Function                        Value
Block size                      512 - 4096 bytes, configurable
Architectural max. file size    4 petabytes
Max. file size tested           1 terabyte
Max. file system size           1 terabyte
Number of inodes                Dynamic, limited by disk space
Directory organization          B-tree

Aggregate

Introduction The term aggregate is defined in this section. The layout of a JFS2 aggregate is described.

Definitions JFS2 separates the notion of a disk space allocation pool, called an aggregate, from the notion of a mountable file system sub-tree, called a fileset. The rules that define aggregates and filesets in JFS2 are:

• There is exactly one aggregate per logical volume.

• There may be multiple filesets per aggregate.

• In the first release of AIX 5L, only one fileset per aggregate is supported.

• The meta-data has been designed to support multiple filesets, and this feature may be introduced in a future release of AIX 5.

The terms aggregate and fileset in this document correspond to their DCE/DFS (Distributed Computing Environment Distributed File System) usage.

Aggregate block size

An aggregate has a fixed block size (number of bytes per block) that is defined at configuration time. The aggregate block size defines the smallest unit of space allocation supported on the aggregate. The block size cannot be altered, and must be no smaller than the physical block size (currently 512 bytes). Legal aggregate block sizes are:

• 512 bytes

• 1024 bytes

• 2048 bytes

• 4096 bytes.

Do not confuse the aggregate block size with the logical volume block size, which defines the smallest unit of I/O.
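The block size rules above amount to "a power of two between 512 and 4096 bytes". A minimal sketch of that check follows, together with the rounding that results from the block size being the smallest unit of allocation. The function and macro names are hypothetical and are not taken from the JFS2 source.

```c
#include <assert.h>

#define PHYS_BLOCK_SIZE     512    /* physical block size from the text */
#define MAX_AGGR_BLOCK_SIZE 4096   /* largest legal aggregate block size */

/* Nonzero if size is one of the legal sizes: 512, 1024, 2048 or 4096. */
int is_legal_aggr_block_size(unsigned int size)
{
    return size >= PHYS_BLOCK_SIZE && size <= MAX_AGGR_BLOCK_SIZE &&
           (size & (size - 1)) == 0;   /* power-of-two test */
}

/* Number of aggregate blocks needed to hold nbytes. Space is allocated
   in whole blocks, so the result is rounded up. */
unsigned long long blocks_needed(unsigned long long nbytes, unsigned int bsize)
{
    assert(is_legal_aggr_block_size(bsize));
    return (nbytes + bsize - 1) / bsize;
}
```

With a 1 KB block size, a 16 KB inode extent occupies blocks_needed(16384, 1024) = 16 aggregate blocks, while a one-byte object on a 4 KB aggregate still consumes a full block.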

Aggregate layout

The following diagram and table detail the layout of the aggregate.

[Figure: Aggregate layout, drawn for an aggregate block size of 1 KB (one aggregate block = 1 KB). The figure shows the reserved area (aggregate blocks 0 - 31), the primary and secondary aggregate superblocks, the control page and first extent (IAG, iagnum: 0) of the Aggregate Inode Allocation Map, and the Aggregate Inode Table (32 inodes, 16 KB, inode numbers shown). Example values in the figure: aggregate inode #1 ("self") has size 8192 and an xad entry with offset 0, addr 36, length 8; aggregate inode #2 (block map) has size 16384 and an xad entry with offset 0, addr 64, length 16; the IAG working map and persistent map both begin 0xf8008000 0x00000000; the ixd section holds length[0]: 16, addr[0]: 44. Fileset inodes start at aggregate inode #16 (fileset 0).]

Part Function

Reserved area A 32K area at the front not used by JFS2. The first block is used by the LVM.

Primary aggregate superblock

The primary aggregate superblock (defined as a struct superblock) contains aggregate-wide information such as the:

• size of the aggregate

• size of allocation groups

• aggregate block size

The superblocks are at fixed locations, so they can always be found without depending on any other information.

Secondary aggregate superblock

The secondary aggregate superblock is a direct copy of the primary aggregate superblock. The secondary aggregate superblock is used if the primary aggregate superblock is corrupted.

Aggregate inode table

Contains inodes that describe the aggregate-wide control structures. These inodes are described below.

Secondary aggregate inode table

Contains replicated inodes from the Aggregate Inode Table. Since the inodes in the Aggregate Inode Table are critical for finding any file system information they will each be replicated in the Secondary Aggregate Inode Table. The actual data for the inodes will not be repeated, just the addressing structures used to find the data and the inode itself.

Aggregate inode allocation map

Describes the Aggregate Inode Table. It contains allocation state information on the aggregate inodes as well as their on-disk location.

Secondary aggregate inode allocation map

Describes the Secondary Aggregate Inode Table.

Block allocation map

Describes the control structures for allocating and freeing aggregate disk blocks within the aggregate. The Block Allocation Map maps one-to-one with the aggregate disk blocks.

fsck working space

Provides space for fsck to track the aggregate block allocations. This space is necessary because, for a very large aggregate, there might not be enough memory to track this information when fsck is run. The space is described by the superblock. One bit is needed for every aggregate block. The fsck working space always exists at the end of the aggregate.

In-line log

Provides space for logging the meta-data changes of the aggregate. The space is described by the superblock. The in-line log always exists following the fsck working space.
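The "one bit per aggregate block" rule for the fsck working space gives a simple lower bound on its size. The sketch below computes that bound in bytes and in aggregate blocks. It is an illustration of the arithmetic only: the function names are hypothetical, and the real working space may be larger because of alignment or additional bookkeeping that the text does not describe.

```c
#include <assert.h>

/* Bytes needed for a bitmap with one bit per aggregate block. */
unsigned long long fsck_map_bytes(unsigned long long aggr_blocks)
{
    return (aggr_blocks + 7) / 8;
}

/* The same bitmap expressed in whole aggregate blocks of bsize bytes. */
unsigned long long fsck_map_blocks(unsigned long long aggr_blocks,
                                   unsigned int bsize)
{
    assert(bsize > 0);
    return (fsck_map_bytes(aggr_blocks) + bsize - 1) / bsize;
}
```

For a 1-terabyte aggregate with 4096-byte blocks (2^28 blocks) the bitmap alone needs 32 MB, which shows why this space is kept on disk rather than assumed to fit in memory.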

Aggregate Inodes

When the aggregate is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. Each of these aggregate inodes describes certain aspects of the aggregate itself, as follows:

Inode # Description

0 Reserved

1 Called the “self” inode, this inode describes the aggregate disk blocks comprising the aggregate inode map. This is a circular representation, in that aggregate inode one is itself in the file that it describes. The obvious circular representation problem is handled by forcing at least the first aggregate inode extent to appear at a well-known location, namely, 4K after the Primary Aggregate Superblock. Therefore, JFS2 can easily find aggregate inode one, and from there it can find the rest of the Aggregate Inode Table by following the B+–tree in inode one.

2 Describes the Block Allocation Map.

3 Describes the In-line Log when mounted. This inode is allocated but no data is saved to disk.

4 - 15 Reserved for future extensions.

16 - Starting at aggregate inode 16 there is one inode per fileset, the Fileset Allocation Map Inode. These inodes describe the control structures that represent each fileset. As additional filesets are added to the aggregate, the aggregate inode table itself may have to grow to accommodate additional fileset inodes.
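The inode assignments in the table above can be written down as a small lookup. This is a sketch for illustration only: the enum and function names are hypothetical and are not the identifiers used in the JFS2 source, and the mapping of aggregate inode 16 to fileset 0 is taken from the layout figure.

```c
#include <assert.h>

/* Roles of the aggregate inodes, following the table above. */
enum aggr_inode_role {
    AGGR_RESERVED,     /* inode 0 */
    AGGR_SELF,         /* inode 1: describes the aggregate inode map */
    AGGR_BLOCK_MAP,    /* inode 2: Block Allocation Map */
    AGGR_INLINE_LOG,   /* inode 3: in-line log when mounted */
    AGGR_FUTURE,       /* inodes 4 - 15: reserved for future extensions */
    AGGR_FILESET       /* inodes 16 and up: one per fileset */
};

enum aggr_inode_role classify_aggr_inode(unsigned int inum)
{
    switch (inum) {
    case 0:  return AGGR_RESERVED;
    case 1:  return AGGR_SELF;
    case 2:  return AGGR_BLOCK_MAP;
    case 3:  return AGGR_INLINE_LOG;
    default: return inum < 16 ? AGGR_FUTURE : AGGR_FILESET;
    }
}

/* Fileset number described by a fileset inode (inode 16 is fileset 0). */
unsigned int fileset_index(unsigned int inum)
{
    assert(inum >= 16);
    return inum - 16;
}
```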

Allocation Groups

Introduction Allocation Groups (AG) divide the space on an aggregate into chunks, and allow JFS2 resource allocation policies to use well known methods for achieving good JFS2 I/O performance.

Allocation policies

When locating data on the disk, JFS2 will attempt to:

• Group disk blocks for related data and inodes close together.

• Distribute unrelated data throughout the aggregate.

Allocation Group Sizes

Allocation group sizes must be selected to yield allocation groups that are sufficiently large to provide for contiguous resource allocation over time. The allocation group size is stored in the aggregate superblock. The rules for setting the allocation group size are:

• maximum number of allocation groups per aggregate is 128

• minimum size of an allocation group is 8192 aggregate blocks

• The allocation group size must always be a power of 2 multiple of the number of blocks described by one dmap page. (i.e. 1, 2, 4, 8,... dmap pages)
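One way to satisfy all three rules is to start at the minimum size and double it until no more than 128 allocation groups are needed. The sketch below does exactly that. It is not presented as the algorithm JFS2 itself uses, and it assumes that one dmap page describes 8192 aggregate blocks (the same number as the minimum allocation group size), which the text above does not state explicitly. The names are hypothetical.

```c
#include <assert.h>

#define MAX_AGS       128       /* maximum allocation groups per aggregate */
#define MIN_AG_BLOCKS 8192ULL   /* minimum AG size; assumed to be the
                                   number of blocks per dmap page as well */

/* Smallest allocation group size, in aggregate blocks, that covers an
   aggregate of aggr_blocks blocks with at most MAX_AGS groups. Doubling
   keeps the size a power-of-2 multiple of MIN_AG_BLOCKS. */
unsigned long long choose_ag_size(unsigned long long aggr_blocks)
{
    unsigned long long ag = MIN_AG_BLOCKS;

    assert(aggr_blocks > 0);
    while ((aggr_blocks + ag - 1) / ag > MAX_AGS)
        ag *= 2;
    return ag;
}
```

An aggregate of exactly 128 x 8192 = 1048576 blocks still fits with the minimum size; one block more forces the size up to 16384 blocks.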

Partial Allocation Group

An aggregate whose size is not a multiple of the allocation group size contains a partial allocation group, that is, one that is not fully covered by disk blocks. This partial allocation group is treated as a complete allocation group, except that the non-existent disk blocks are marked as allocated in the Block Allocation Map.
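The number of non-existent blocks that must be marked as allocated follows directly from the aggregate size and the allocation group size. A minimal sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>

/* Number of allocation groups, counting a trailing partial group. */
unsigned long long ag_count(unsigned long long aggr_blocks,
                            unsigned long long ag_size)
{
    assert(ag_size > 0);
    return (aggr_blocks + ag_size - 1) / ag_size;
}

/* Blocks of the last allocation group that lie beyond the end of the
   aggregate and are therefore marked allocated in the block map. */
unsigned long long nonexistent_blocks(unsigned long long aggr_blocks,
                                      unsigned long long ag_size)
{
    return ag_count(aggr_blocks, ag_size) * ag_size - aggr_blocks;
}
```

For example, an aggregate of 20000 blocks with an allocation group size of 8192 has three allocation groups, and the last one contains 4576 non-existent blocks.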

Filesets

Introduction A fileset is a set of files and directories that form an independently mountable sub-tree, equivalent to a Unix file system file hierarchy. A fileset is completely contained within a single aggregate.

Layout The following illustration and table detail the layout of a fileset.

[Figure: Fileset layout for fileset #0. The figure shows the Fileset Inode Table (first extent, inode numbers 0 - 31 shown), fileset inode #2 (the root directory, size 4096, idotdot: 2), and the Fileset Inode Allocation Map, which consists of a control page (holding the AG Free Inode List, with entries for AG 0 through AG 128, and the IAG Free List) followed by IAGs. Example values in the figure: the first AG Free Inode List entry shows inofree: 1, extfree: 1, numinos: 32, numfree: 28, and the remaining entries show inofree: -1, extfree: -1, numinos: 0, numfree: 0; the first IAG (iagnum: 0) has working and persistent maps beginning 0xf0000000 0xffffffff and an ixd section with length[0]: 16, addr[0]: 248; the second IAG (iagnum: 1) has maps of 0xffffffff, an empty ixd section, and iagfree: -1.]

Part Function

Fileset Inode table

Contains inodes describing the fileset-wide control structures. The Fileset Inode Table logically contains an array of inodes.

Fileset Inode allocation map

Describes the Fileset Inode Table. The Fileset Inode Allocation Map contains allocation state information on the fileset inodes as well as their on-disk location.

Inodes Objects. Every JFS2 object is represented by an inode, which contains the expected object-specific information such as time stamps, file type (regular vs. directory, etc.). They also “contain” a B+–tree to record the allocation of extents. Note specifically that all JFS2 meta data structures (except for the superblock) are represented as “files.” By reusing the inode structure for this data, the data format (on-disk layout) becomes inherently extensible.


Super Inode The super inode of a fileset resides in the Aggregate Inode Table (aggregate inodes #16 and greater). It describes the Fileset Inode Allocation Map and holds other fileset information. Since the Aggregate Inode Table is replicated, there is also a secondary version of this inode, which points to the same data.

Inodes When the fileset is initially created, the first inode extent is allocated. Additional inode extents are allocated and de-allocated dynamically as needed. The inodes in a fileset are allocated as follows:

Fileset Inode #   Description
0                 Reserved.
1                 Additional fileset information that would not fit in the Fileset Allocation Map Inode in the Aggregate Inode Table.
2                 The root directory inode for the fileset.
3                 The ACL file for the fileset.
4 and up          Used by ordinary fileset objects: user files, directories, and symbolic links.


Extents

Introduction Disk space in a JFS2 filesystem is allocated in a sequence of contiguous aggregate blocks called an extent.

Extent rules An extent is:

• made up of a series of contiguous aggregate blocks.
• variable in size, ranging from 1 to 2^24 - 1 aggregate blocks.
• wholly contained within a single aggregate.
• able to span multiple allocation groups, if it is large.
• indexed in a B+-tree.

Extent Allocation Descriptor

Extents are described in an xad structure. The two main values describing an extent are its length and its address. In an xad both the length and address are expressed in units of the aggregate block size. Details of the xad data structure are shown below.

struct xad {
    uint8  xad_flag;
    uint16 xad_reserved;
    uint40 xad_offset;
    uint24 xad_length;
    uint40 xad_address;
};

Member         Description
xad_flag       Flags set on this extent. See /usr/include/j2/j2_xtree.h for a list of flags.
xad_reserved   Reserved for future use.
xad_offset     Extents are generally grouped together to form a larger group of disk blocks. The xad_offset describes the logical byte address this extent represents in the larger group.
xad_length     A 24-bit field containing the length of the extent in aggregate blocks. An extent can range in size from 1 to 2^24 - 1 aggregate blocks.
xad_address    A 40-bit field containing the address of the first block of the extent. The address is in units of aggregate blocks and is the block offset from the beginning of the aggregate.


Allocation Policy

In general, the allocation policy for JFS2 tries to maximize contiguous allocation by allocating a minimum number of extents, with each extent as large and contiguous as possible. This allows for larger I/O transfers, resulting in improved performance. However, in special cases this is not always possible. For example, a copy-on-write clone of a segment will cause a contiguous extent to be partitioned into a sequence of smaller contiguous extents. Another case is restriction of the extent size. For example, the extent size is restricted for compressed files, since the entire extent must be read into memory and decompressed. Only a limited amount of memory is available, so there must be enough room for the decompressed extent.

Fragmentation An extent-based file system combined with a user-specified aggregate block size means that JFS2 needs no separate support for internal fragmentation. The user can configure the aggregate with a small aggregate block size (e.g., 512 bytes) to minimize internal fragmentation for aggregates with large numbers of small files.

A defragmentation utility will be provided to reduce external fragmentation which occurs from dynamic allocation/de-allocation of variable size extents. This allocation and de-allocation can result in disconnected variable size free extents all over the aggregate. The defragmentation utility will coalesce multiple small free extents into single larger extents.


Binary Trees of Extents

Introduction Objects in JFS2 are stored in groups of extents arranged in binary trees (more precisely B+-trees, as described later in this section). The concepts of binary trees are introduced in this section.

Trees Binary trees consist of nodes arranged in a tree structure. Each node contains a header describing the node. A flag in the node header identifies the role of the node in the tree.

Header flags This table describes the binary tree header flags.

[Illustration: a tree made up of a root node (header flags=BT_ROOT), an internal node (header flags=BT_INTERNAL) and three leaf nodes (header flags=BT_LEAF), with arrays of extent descriptors (xad entries).]

Flag          Description
BT_ROOT       The root or top of the tree.
BT_LEAF       The bottom of a branch of a tree. Leaf nodes point to the extents containing the object's data.
BT_INTERNAL   An internal node points to two or more leaf nodes or other internal nodes.


Why B+-tree B+-trees are used in JFS2 to help performance by providing:

• fast reading and writing of extents, the most common operations.
• fast search for reading a particular extent of a file.
• efficient append or insert of an extent in a file.
• efficient traversal of an entire B+-tree.

B+-tree index There is one generic B+–tree index structure for all index objects in JFS2 except for directories. The data being indexed depends upon the object. The B+–tree is keyed by offset of the xad structure of the data being described by the tree. The entries are sorted by the offsets of the xad structures, each of which is an entry in a node of a B+–tree.

Root node header

The file j2_xtree.h describes the header for the root of the B+–tree in struct xtpage_t.

#define XTPAGEMAXSLOT 256

typedef union {
    struct xtheader {
        int64 next;       /* 8: */
        int64 prev;       /* 8: */
        uint8 flag;       /* 1: */
        uint8 rsrvd1;     /* 1: */
        int16 nextindex;  /* 2: next index = # of entries */
        int16 maxentry;   /* 2: max number of entries */
        int16 rsrvd2;     /* 2: */
        pxd_t self;       /* 8: self */
    } header;             /* (32) */
    xad_t xad[XTPAGEMAXSLOT];  /* 16 * maxentry: xad array */
} xtpage_t;


Leaf node header

The file j2_btree.h describes the header for an internal node or a leaf node in struct btpage_t.

typedef struct {
    int64 next;         /* 8: right sibling bn */
    int64 prev;         /* 8: left sibling bn */
    uint8 flag;         /* 1: */
    uint8 rsrvd[7];     /* 7: type specific */
    int64 self;         /* 8: self address */
    uint8 entry[4064];  /* 4064: */
} btpage_t;


inodes

Overview Every file on a JFS2 filesystem is described by an on-disk inode. The inode holds the root header for the extent binary tree. File attribute data and block allocation maps are also kept in the inode.

Inode Layout The inode is a 512 byte structure, split into four 128 byte sections described here.

[Illustration: Inode Layout. Section 1: POSIX attributes. Section 2: extended attributes, block allocation maps, inode allocation maps, headers describing the inode data. Section 3: in-line data or xad's. Section 4: extended attributes, or more in-line data, or additional xad's.]

Section  Description
1        This section describes the POSIX attributes of the JFS2 object, including the inode and fileset number, object type, object size, user id, group id, access time, modified time, created time and more.
2        This section contains several parts:
         • descriptors for extended attributes
         • block allocation maps
         • inode allocation maps
         • header pointing to the data (B+-tree root, directory, in-line data)
3        This section can contain one of the following:
         • in-line file data, for very small files (up to 128 bytes)
         • the first 8 xad structures describing the extents for this file
4        This section extends section 3 by providing additional storage for more attributes, xad structures or in-line data.


Structure The current definition of the on-disk inode structure is:

struct dinode
{
    /* I. base area (128 bytes)
     * define generic/POSIX attributes
     */
    ino64_t  di_number;    /* 8: inode number, aka file serial number */
    uint32   di_gen;       /* 4: inode generation number */
    uint32   di_fileset;   /* 4: fileset #, inode # of inode map file */
    uint32   di_inostamp;  /* 4: stamp to show inode belongs to fileset */
    uint32   di_rsv1;      /* 4: */
    pxd_t    di_ixpxd;     /* 8: inode extent descriptor */
    int64    di_size;      /* 8: size */
    int64    di_nblocks;   /* 8: number of blocks allocated */
    uint32   di_uid;       /* 4: uid_t user id of owner */
    uint32   di_gid;       /* 4: gid_t group id of owner */
    int32    di_nlink;     /* 4: number of links to the object */
    uint32   di_mode;      /* 4: mode_t attribute, format and permission */
    j2time_t di_atime;     /* 16: time last data accessed */
    j2time_t di_ctime;     /* 16: time last status changed */
    j2time_t di_mtime;     /* 16: time last data modified */
    j2time_t di_otime;     /* 16: time created */

    /* II. extension area (128 bytes)
     * extended attributes for file system (96);
     */
    ead_t di_ea;           /* 16: ea descriptor */

    union {
        uint8 _data[80];

        /* block allocation map */
        struct {
            struct bmap *__bmap;      /* incore bmap descriptor */
        } _bmap;

        /* inode allocation map (fileset inode 1st half) */
        struct {
            uint32 _gengen;           /* di_gen generator */
            struct inode *__ipimap2;  /* replica */
            struct inomap *__imap;    /* incore imap control */
        } _imap;
    } _data2;

    /* B+-tree root header (32)
     * B+-tree root node header, or dtroot_t for directory,
     * or data extent descriptor for inline data;
     */
    union {
        struct {
            int32 _di_rsrvd[4];       /* 16: */
            dxd_t _di_dxd;            /* 16: data extent descriptor */
        } _xd;
        int32   _di_btroot[8];        /* 32: xtpage_t or dtroot_t */
        ino64_t _di_parent;           /* 8: idotdot in dtroot_t */
    } _data2r;

    /* III. type-dependent area (128 bytes)
     * B+-tree root node xad array or inline data
     */
    union {
        uint8 _data[128];             /* B+-tree root node/inline data area */

        struct {
            uint8 _xad[128];
        } _file;

        /* device special file */
        struct {
            dev64_t _rdev;            /* 8: dev_t device major and minor */
        } _specfile;

        /* symbolic link.
         * link is stored in inode if its length is less than
         * IDATASIZE. Otherwise stored like a regular file.
         */
        struct {
            uint8 _fastsymlink[128];
        } _symlink;
    } _data3;

    /* IV. type-dependent extension area (128 bytes)
     * user-defined attribute, or
     * inline data continuation, or
     * B+-tree root node continuation
     */
    union {
        uint8 _data[128];
    } _data4;
};


Allocation Policy

JFS2 allocates inodes dynamically, which provides the following advantages:

• Allows placement of inode disk blocks at any disk address, which decouples the inode number from the location. This decoupling simplifies supporting aggregate and fileset reorganization to enable shrinking the aggregate. The inodes can be moved and still retain the same number, which allows us to not need to search the directory structure to update the inode numbers.

• There is no need to allocate “ten times as many inodes as you will ever need”, as with filesystems that contain a fixed number of inodes, and thus filesystem space utilization is optimized. This is especially important with the larger inode size of 512 bytes in JFS2.

• File allocation for large files can consume multiple allocation groups and still be contiguous. Static allocation forces a gap containing the initially allocated inodes in each allocation group. With dynamic allocation, all the blocks contained in an allocation group can be used for data.

Dynamic inode allocation causes a number of problems, including:

• With static allocation the geometry of the file system implicitly describes the layout of inodes on disk. With dynamic allocation separate mapping structures are required.

• The inode mapping structures are critical to JFS2 integrity. Due to the overhead involved in replicating these structures we accept the risk of losing these maps. However, replicating the B+–tree structures allows us to find the maps.

Inode extents Inodes are allocated dynamically by allocating inode extents, each of which is simply a contiguous chunk of inodes on the disk. By definition, a JFS2 inode extent contains 32 inodes. With a 512 byte inode size, an inode extent therefore occupies 16KB on the disk.


Inode initialization

When a new inode extent is allocated the extent is not initialized, but in order for fsck to be able to check if an inode is in-use, JFS2 will need some information in it. Once an inode in an extent is marked in-use its fileset number, inode number, inode stamp, and the inode allocation group block address are initialized. Thereafter, the link field will be sufficient to determine if the inode is currently in-use.

Inode Allocation Map

Dynamic inode allocation implies that there is no direct relationship between an inode number and the disk address of the inode. Therefore we must have a means of finding the inodes on disk. The Inode Allocation Map provides this function.

Inode Generation Numbers

Inode generation numbers are simply counters that will increment each time an inode is reused. Network file system protocols such as NFS (implicitly) require them; they form part of the file identifier manipulated by VNOP_FID() and VFS_VGET().

The static-inode-allocation practice of storing a per-inode generation counter doesn’t work with dynamic inode allocation, because when an inode becomes free its disk space may literally be reused for something other than an inode (e.g., the space may be reclaimed for ordinary file data storage). Therefore, in JFS2 there is simply one inode generation counter that is incremented on every inode allocation (rather than one counter per inode that would be incremented when that inode is reused).

Although a fileset-wide generation counter will recycle faster than a per-inode generation counter, a simple calculation shows that the 32-bit value is still sufficient to meet NFS or DFS requirements.


File Data Storage

Overview This section introduces the data structures used to describe where a file’s data is stored.

In-line data If a file contains small amounts of data, the data may be stored in the inode itself. This is called in-line storage. The header found in the second section of the inode points to the data that is stored in the third and fourth sections of the inode.

Binary trees When more storage is needed than can be provided in-line, the data must be placed in extents. The header in the inode now becomes the binary tree root header. If there are 8 or fewer extents for the file, then the xad structures describing the extents are contained in the inode. An inode containing fewer than 8 xad structures would look like:

[Illustration: two inodes. The first holds inode info, a header for in-line data, and the in-line data. The second holds inode info, a B+-tree header and 8 xad entries, three of which are in use (offset: 0, addr: 68, length: 4; offset: 4096, addr: 84, length: 12; offset: 26624, addr: 256, length: 2) and point to data extents of 16KB, 48KB and 8KB.]

INLINEEA bit Once the 8 xad structures in the inode are filled, an attempt is made to use the last quadrant of the inode for more xad structures. If the INLINEEA bit is set in the di_mode field of the inode, then the last quadrant of the inode is available for 8 more xad structures.

More extents Once all of the available xad structures in the inode are used, the B+–tree must be split. 4K of disk space is allocated for a leaf node of the B+–tree, which is logically an array of xad entries with a header. The 8 xad entries are moved from the inode to the leaf node, and the header is initialized to point to the 9th entry as the first free entry. The first xad structure in the inode is updated to point to the newly allocated leaf node, and the inode header is updated to indicate that only one xad structure is now being used, and that it contains the pure root of a B+-tree. The offset for this new xad structure contains the offset of the first entry in the leaf node.

The organization of the inode now looks like:

[Illustration: the inode after the split. The first xad entry in the inode (offset: 0, addr: 412, length: 4) points to a leaf node at block 412, which holds a header and 254 xad leaf node entries. The other xad entries in the inode are unused (offset: 0, addr: 0, length: 0). The leaf node entries point to the 16KB, 48KB and 8KB data extents.]

Continuing to add extents

As new extents are added to the file, they continue to be added to the leaf node in the necessary order, until the node fills. Once the node fills, a new 4K of disk space is allocated for another leaf node of the B+-tree, and the second xad structure from the inode is set to point to this newly allocated node. The inode now looks like:

[Illustration: the inode with two leaf nodes. The first xad entry in the inode (offset: 0, addr: 412, length: 4) points to the leaf node at block 412, and the second (offset: 750, addr: 560, length: 4) points to the leaf node at block 560. Each leaf node holds a header and 254 xad leaf node entries.]

Another split As extents are added to the inode, this behavior continues until all 8 xad structures in the inode contain leaf node xad structures, at which time another split of the B+-tree will occur. This split creates an internal node of the B+-tree, which is used purely to route the searches of the tree. An internal node looks exactly like a leaf node. 4K of disk space is allocated for the internal node of the B+-tree, the 8 xad entries of the leaf nodes are moved from the inode to the newly created internal node, and the internal node header is initialized to point to the 9th entry as the first free entry. The root of the B+-tree is then updated by making the inode's first xad structure point to the newly allocated internal node, and the header in the inode is updated to indicate that now only 1 xad structure is being used for the B+-tree.

As extents continue to be added, additional leaf nodes are created to contain the xad structures for the extents, and these leaf nodes are added to the internal node.

Once the first internal node is filled, a second internal node is allocated, and the inode's second xad structure is updated to point to the new internal node.

This behavior continues until all 8 of the inode’s xad structures contain internal nodes.

[Illustration: the inode with two internal nodes. The first xad entry in the inode (offset: 0, addr: 380, length: 4) points to the internal node at block 380, and the second (offset: 8340, addr: 212, length: 4) points to the internal node at block 212. Each internal node holds a header and 254 xad internal node entries, which point to leaf nodes (such as those at blocks 412 and 560) holding a header and 254 xad leaf node entries each.]


fsdb Utility

Introduction The fsdb command enables you to examine, alter, and debug a file system.

Starting fsdb It is best to run fsdb against an unmounted filesystem. Use the following syntax to start fsdb:

fsdb <path to logical volume>

For example:

# fsdb /dev/lv00

Aggregate Block Size: 512

>

Supported filesystems

fsdb supports both the JFS and JFS2 file systems. The commands available in fsdb are different depending on what filesystem type it is running against. The following explains how to use fsdb with a JFS2 file system.

Commands The commands available in fsdb can be viewed with the help command as shown here.

> help
Xpeek Commands

a[lter] <block> <offset> <hex string>
b[tree] <block> [<offset>]
dir[ectory] <inode number> [<fileset>]
d[isplay] [<block> [<offset> [<format> [<count>]]]]
dm[ap] [<block number>]
dt[ree] <inode number> [<fileset>]
h[elp] [<command>]
ia[g] [<IAG number>] [a | <fileset>]
i[node] [<inode number>] [a | <fileset>]
q[uit]
su[perblock] [p | s]


Exercise 1 - fsdb

Introduction In this lab you will run the fsdb utility against a JFS2 filesystem that was created for you. The filesystem should not be mounted when running fsdb. The filesystem may be mounted to examine the files, just be sure to un-mount it before running fsdb.

Lab steps Follow the steps in this table:

Step  Action
1     Start fsdb on the logical volume /dev/lv00:
      # fsdb /dev/lv00
      What is the aggregate block size used in this filesystem?
2     Type help to view the fsdb sub-commands. The commands you will be using in this lab are: inode, directory and display.
3     What inode number represents the fileset root directory inode?
      Display the root inode for the fileset. What command did you use?
      Note: If you want to display the aggregate inodes instead of the fileset inodes, append an "a" to the command, i.e.: inode 2 a.
4     Find the inode number of each file in the fileset using the directory command followed by the inode number of the root directory inode of the fileset. For example:
      > dir 2
      idotdot = 2
      4 fileA
      5 fileB
      6 fileC
      3 lost+found

Using fsdb - continued In the next few steps you will locate and display fileA's data.

Step  Action
5     Display the inode of fileA. What command did you use?
      Use the inode you displayed to answer the following questions:
      What is the file size of fileA?
      How many disk blocks is fileA's data using?
6     After the inode is displayed, a sub-menu of commands is shown. Type t to display the root binary tree header. Examine the flags in the header. What flags are set?
7     Type <enter> to walk down the xad structures in this node. How many xad structures are used for this file?
8     The address field in the xad shows the aggregate block number of the first data block of fileA. Use the display command to display this block, for example:
      > d 12345
      Did you find fileA's data?


FileB and fileC Use the commands and techniques you learned in the last section to examine fileB, fileC and fileD. Answer the following questions about these files:

1. What number inodes are used for fileB, fileC and fileD?

2. How many xad structures are used to describe fileB’s data blocks?

3. How many xad structures are used to describe fileC’s data blocks?

4. Examine the inode for fileD. How big is this file (as shown in di_size)?

How many aggregate blocks are being used by fileD?

Are enough aggregate blocks allocated to store the entire file? Explain your answer.


Directory

Introduction In addition to files an inode can represent a directory. A Directory is a journaled meta-data file in JFS2, and is composed of directory entries which indicate the files and sub-directories contained in the directory.

Directory entry Stored in an array, the directory entries link the names of the objects in the directory to inode numbers. A directory entry has the following members.

Member     Description
inumber    Inode number.
namelen    Length of the name.
name[30]   File name, up to 30 characters.
next       If more than 30 characters are needed, additional entries are linked using the next pointer.


Root Header In order to improve performance of locating a specific directory entry a binary tree sorted by name is used. As with files, the header section of a directory inode contains the binary tree root header. Each header describes an 8 element array of directory entries. The root header is defined by a dtroot_t structure contained in /usr/include/j2/j2_dtree.h:

typedef union {
    struct {
        ino64_t idotdot;    /* 8: parent inode number */
        int64   rsrvd1;     /* 8: */
        uint8   flag;       /* 1: */
        int8    nextindex;  /* 1: next free entry in stbl */
        int8    freecnt;    /* 1: free count */
        int8    freelist;   /* 1: freelist header */
        int32   rsrvd2;     /* 4: */
        int8    stbl[8];    /* 8: sorted entry index table */
    } header;               /* (32) */
    dtslot_t slot[9];
} dtroot_t;

Leaf and internal node header

When more than 8 directory entries are needed, a leaf or internal node is added. The directory internal and leaf node headers are similar to the root node header, except that they can describe up to 128 directory entries. The page header is defined by a dpage_t structure contained in /usr/include/j2/j2_dtree.h.

Member Description

idotdot Inode number of parent directory.

flag Indicates whether the node is an internal or leaf node, and whether it is the root of the B+-tree.

nextindex last used slot in the directory entry slot array.

freecnt number of free slots in the directory entry array.

freelist slot number of the head of the free list

stbl[8] indices to the directory entry slots that are currently in use. The entries are sorted alphabetically by name.

slot[9] Array of 9 slots. The header is stored in the first slot, leaving 8 slots for directory entries.

Directory slot array

The Directory Slot Array (stbl[]) is a sorted array of indices to the directory slots that are currently in use. The entries are sorted alphabetically by name. This limits the amount of shifting necessary when directory entries are added or deleted, since the array is much smaller than the entries themselves. A binary search can be used on this array to search for particular directory entries.

In this example the directory entry table contains four files. The stbl table contains the slot numbers of the entries ordering the entries alphabetically.
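A binary search through stbl[] might look like the following sketch. struct entry and stbl_lookup() are hypothetical names, the slots are simplified to a name and an inode number, and plain strcmp() ordering is assumed for "alphabetical"; the real JFS2 comparison rules may differ.

```c
#include <string.h>

/* Hypothetical, simplified directory slot. */
struct entry {
    const char *name;
    long        inumber;
};

/*
 * Search for `name` using the sorted index table. stbl[0..nused-1]
 * holds slot numbers ordered by the names stored in those slots.
 * Returns the slot number of the match, or -1 if the name is absent.
 */
int stbl_lookup(const struct entry *slot, const signed char *stbl,
                int nused, const char *name)
{
    int lo = 0;
    int hi = nused - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, slot[stbl[mid]].name);

        if (cmp == 0)
            return stbl[mid];   /* found: report the slot number */
        if (cmp < 0)
            hi = mid - 1;       /* name sorts before the middle entry */
        else
            lo = mid + 1;       /* name sorts after the middle entry */
    }
    return -1;
}
```

Only the small index array is consulted to decide where to look; the slots themselves never have to be kept in sorted order.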

. and .. A directory does not contain specific entries for self (“.”) and parent (“..”). Instead these will be represented in the inode itself. Self is the directory’s own inode number, and the parent inode number is held in the “idotdot” field in the header.

[Figure: Directory Entry table (slots 1 to 8) holding the names hij, xyz, abc and def, and the STBL[8] array listing the slot numbers of these entries in alphabetical order by name.]

Growing directory size

As the number of files in the directory grows, the directory tables must increase in size. This table describes the steps used.

Step Action

1 Initial directory entries are stored in directory inode in-line data area.

2 When the in-line data area of the directory inode becomes full JFS2 allocates a leaf node the same size as the aggregate block size.

3 When that initial leaf node becomes full and the leaf node is not yet 4K, double the current size. First attempt to double the extent in place; if there is no room to do this, a new extent must be allocated and the data from the old extent copied to the new extent. The directory slot array will only have been big enough to reference the slots of the smaller page, so a new slot array has to be created. Use the slots from the beginning of the newly allocated space for the larger array and copy the old array data to the new location. Update the header to point to this array and add the slots for the old array to the free list.

4 If the leaf node again becomes full and is still not 4K repeat step 3. Once the leaf node reaches 4K allocate a new leaf node. Every leaf node after the initial one will be allocated as 4K to start.

5 When all entries are free in a leaf page, the page will be removed from the B+–tree. When all the entries in the last leaf page are deleted, the directory will shrink back into the directory inode in-line data area.
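The sizing rule in steps 2 to 4 can be sketched as a small helper. LEAF_MAX_BYTES and first_leaf_grow() are hypothetical names, and the sketch covers only the size decision, not extent allocation or the slot array move.

```c
#define LEAF_MAX_BYTES 4096u    /* leaf pages stop growing at 4K */

/*
 * Given the current size in bytes of the first leaf node, return the
 * size it should grow to when it becomes full. A return value of 0
 * means the leaf is already 4K, so a new 4K leaf node must be
 * allocated instead of growing this one.
 */
unsigned first_leaf_grow(unsigned cur_bytes)
{
    unsigned doubled;

    if (cur_bytes >= LEAF_MAX_BYTES)
        return 0;                       /* step 4: allocate a new leaf */
    doubled = cur_bytes * 2;            /* step 3: double the size */
    return doubled > LEAF_MAX_BYTES ? LEAF_MAX_BYTES : doubled;
}
```

Starting from a 512-byte aggregate block size, for example, the first leaf would pass through 1024 and 2048 bytes before reaching 4K.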

Directory Examples

Introduction This section demonstrates how the directory structures change over time.

Small Directories

Initial directory entries are stored in directory inode in-line data area. Examine this example of a small directory. In this example all the inode information fits into the in-line data area:

    # ls -ai
    69651 .
        2 ..
    69652 foobar1
    69653 foobar12
    69654 foobar3
    69655 longnamedfilewithover22charsinitsname

Note: the file with a long name has its name split across two slots.

    Header   flag: BT_ROOT BT_LEAF  nextindex: 4  freecnt: 3  freelist: 6  idotdot: 2  stbl: {1,2,3,4,0,0,0}
    Slot 1   inumber: 69652  next: -1  namelen: 7   name: foobar1
    Slot 2   inumber: 69653  next: -1  namelen: 8   name: foobar12
    Slot 3   inumber: 69654  next: -1  namelen: 7   name: foobar2
    Slot 4   inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
    Slot 5   next: -1  cnt: 0  name: 2charsinitsname

Adding a file An additional file called “afile” is created. The details for this file are added at the next free slot (slot 6). As this is now, alphabetically, the first file in the directory, the search table array (stbl[]) is re-organized so that slot 6 becomes its first entry.

    # ls -ai
    69651 .
        2 ..
    69656 afile
    69652 foobar1
    69653 foobar2
    69654 foobar3
    69655 longnamedfilewithover22charsinitsname
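The re-organization of stbl[] when an entry is added can be sketched as an insertion into a small sorted array. struct entry, STBL_MAX and stbl_insert() are hypothetical names, and strcmp() ordering is an assumption standing in for the real comparison.

```c
#include <string.h>

#define STBL_MAX 8              /* entries in the root header's stbl[] */

/* Hypothetical, simplified directory slot. */
struct entry {
    const char *name;
};

/*
 * Insert slot number `slotno` into stbl[], which currently holds
 * `nused` slot numbers ordered by name. Only the small index array is
 * shifted; the slots themselves stay where they are.
 * Returns the new number of used entries, or -1 if stbl[] is full.
 */
int stbl_insert(const struct entry *slot, signed char *stbl,
                int nused, int slotno)
{
    int pos = 0;

    if (nused >= STBL_MAX)
        return -1;

    /* Find the first index whose name does not sort before the new one. */
    while (pos < nused &&
           strcmp(slot[stbl[pos]].name, slot[slotno].name) < 0)
        pos++;

    /* Open a gap at pos and store the new slot number there. */
    memmove(&stbl[pos + 1], &stbl[pos],
            (size_t)(nused - pos) * sizeof stbl[0]);
    stbl[pos] = (signed char)slotno;
    return nused + 1;
}
```

Applied to this example, inserting slot 6 (afile) into {1,2,3,4,0,0,0,0} gives {6,1,2,3,4,0,0,0}.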

    Header   flag: BT_ROOT BT_LEAF  nextindex: 5  freecnt: 2  freelist: 7  idotdot: 2  stbl: {6,1,2,3,4,0,0,0}
    Slot 1   inumber: 69652  next: -1  namelen: 7   name: foobar1
    Slot 2   inumber: 69653  next: -1  namelen: 8   name: foobar12
    Slot 3   inumber: 69654  next: -1  namelen: 7   name: foobar2
    Slot 4   inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
    Slot 5   next: -1  cnt: 0  name: 2charsinitsname
    Slot 6   inumber: 69656  next: -1  namelen: 5   name: afile

Adding a leaf node

When the directory grows to the point where there are more entries than can be stored in the in-line data area of the inode, JFS2 allocates a leaf node the same size as the aggregate block size. The in-line entries are moved to a leaf node as illustrated.

Once the leaf is full, an internal node is added at the next free in-line data slot in the inode, which will contain the address of the next leaf node.

Note: the internal node entry contains the name of the first file (in alphabetical order) for that leaf node.

    Inode (root)
    Header   flag: BT_ROOT BT_INTERNAL  nextindex: 1  freecnt: 7  freelist: 2  idotdot: 2  stbl: {1,2,3,4,5,6,7,8}
    Slot 1   xd.len: 1  xd.addr1: 0  xd.addr2: 52  next: -1  namelen: 0  name: file0

    Block 52 (leaf)
    Header   flag: BT_LEAF  nextindex: 20  freecnt: 103  freelist: 25  maxslot: 128  stbl: {1,2,15, ... 8,13,14}
    Slot 1   inumber: 5   next: -1  namelen: 5  name: file0
    Slot 2   inumber: 6   next: -1  namelen: 5  name: file1
    Slot 3   inumber: 15  next: -1  namelen: 6  name: file10
    ...
    Slot 19  inumber: 23  next: -1  namelen: 6  name: file18
    Slot 20  inumber: 24  next: -1  namelen: 6  name: file19

Adding an internal node

Once all the in-line slots have been filled by internal nodes, a separate node block is allocated, the entries from the in-line data slots are moved to this new node, and the first in-line data slot updated with the address of the new internal node.

After many extra files have been added to the directory, two layers of internal nodes are required to reference all the files.

Note: the internal node entries in the inode now contain the name of the alphabetically first entry referenced by each of the second level internal nodes, and each entry in these in turn holds the name of the alphabetically first entry in a leaf node.
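Because each internal entry carries the alphabetically first name of the node it points to, one step of a lookup can be sketched as "pick the last entry whose name is not greater than the target". struct ientry and pick_child() are hypothetical names, the entries are assumed to be presented in sorted order (as stbl[] would provide), and strcmp() ordering is an assumption.

```c
#include <string.h>

/* Hypothetical, simplified internal node entry. */
struct ientry {
    const char *first_name;     /* alphabetically first name in the child */
    long        child_block;    /* block address of the child node */
};

/*
 * Choose the child to descend into for `name`. e[0..n-1] must be
 * sorted by first_name and n must be at least 1. Returns the block
 * address of the last entry whose first_name is <= name; if name
 * sorts before every entry, the first child is returned.
 */
long pick_child(const struct ientry *e, int n, const char *name)
{
    int lo = 0;
    int hi = n - 1;
    int best = 0;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (strcmp(e[mid].first_name, name) <= 0) {
            best = mid;         /* candidate; look for a later one */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return e[best].child_block;
}
```

Repeating this step at each internal level, and finishing with a search of the leaf node, locates a single name without scanning every entry.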

    Inode (root)
    Header    flag: BT_ROOT BT_INTERNAL  nextindex: 4  freecnt: 4  freelist: 5  idotdot: 2  stbl: {1,3,4,2,6,7,2,8}
    Slot 1    xd.len: 1  xd.addr1: 0  xd.addr2: 118  next: -1  namelen: 0  name: file0
    Slot 2    (values not shown in the figure)

    Block 118 (internal)
    Header    flag: BT_INTERNAL  nextindex: 64  freecnt: 59  freelist: 76  maxslot: 128  stbl: {1,19,18, ... 7,8}
    Slot 1    xd.len: 1  xd.addr1: 0   xd.addr2: 52    next: -1  namelen: 0  name: file0
    Slot 2    xd.len: 1  xd.addr1: 0   xd.addr2: 1204  next: -1  namelen: 8  name: file4845
    Slot 3    xd.len: 1  xd.addr1: 0   xd.addr2: 1991  next: -1  namelen: 9  name: file13833
    Slot 4    xd.len: 1  xd.addr1: 0   xd.addr2: 2609  next: -1  namelen: 8  name: file17723
    ...
    Slot 126  xd.len: 0  xd.addr1: -1  xd.addr2: 1473  next: -1  namelen: 8  name: file1472
    Slot 127  xd.len: 1  xd.addr1: 0   xd.addr2: 1472  next: -1  namelen: 8  name: file1017

    Block 52 (leaf)
    Header    flag: BT_LEAF  nextindex: 64  freecnt: 59  freelist: 21  maxslot: 128  stbl: {1,2,15 ... 113,112}
    Slot 1    inumber: 5      next: -1  namelen: 5  name: file0
    Slot 2    inumber: 6      next: -1  namelen: 5  name: file1
    Slot 3    inumber: 15     next: -1  namelen: 6  name: file10
    ...
    Slot 126  inumber: 10057  next: -1  namelen: 9  name: file10052
    Slot 127  inumber: 10041  next: -1  namelen: 9  name: file10036

Exercise 2 - Directories

Introduction In this exercise you will use the fsdb utility to examine directory inodes in a jfs2 filesystem.

Small directories

Run fsdb on the sample filesystem. Use the following steps to examine the directory node for /mnt/small.

Step Action

1 Find the inode for directory small:
  > dir 2

2 Display the inode found in the last step:
  > i <inode number>

3 Using the t sub-command display the directory node root header.

Is this header a root, internal or leaf header?

4 Type <enter> to display the directory entries. Repeat <enter> until all the entries are displayed.

How many files are in the directory?

5 Examine the directory slot array stbl[] (displayed in the header).

What file name is associated with the first slot entry?

6 Exit fsdb and mount the filesystem:
  # mount /mnt

7 Create the file /mnt/small/a:
  # touch /mnt/small/a

Predict what the stbl[] table for directory small will look like now.

8 Unmount the filesystem, run fsdb, and check your prediction.

Larger directories

In this section you will examine the directory node structures for some larger directories.

Step Action

1 What is the inode for the directory called medium?

2 Display the inode and look at the root tree header. The flags should indicate that this is an internal header. One entry should be found for each leaf node. Display the entries with the <enter> key. How many leaf nodes are there?

3 Use the down sub-command to display the first leaf node header. How many entries is this header currently describing?

What is the maximum number of entries (files) that can be described by a single leaf node?

4 Examine the big directory and answer the following questions.

How many internal leaf nodes in big?

How many files in big?

Unit 13. Logical and Virtual File Systems

Objectives After completing this unit, you should be able to:

• Identify the various components that make up the logical and virtual file systems.

• Use the debugger (kdb/iadb) to display these components.

References

General File System Interface

Introduction This lesson covers the interface and services that AIX 5L provides to physical file systems. The Logical File System (LFS), the Virtual File System (VFS) and the interface between these components and physical file systems are discussed in this lesson.

Supported file systems

Using the structure of the logical file system and the virtual filesystem AIX 5L can support a number of different file system types transparently to application programs. These file systems reside below the LFS/VFS and operate relatively independently of each other. Currently AIX 5L supports the following physical filesystem implementations:

• Enhanced Journaled Filesystem (JFS2)

• Journaled filesystem (JFS)

• Network File System (NFS)

• A CD-ROM File system which supports ISO-9660, High Sierra and Rock Ridge formats.

Extensible The LFS/VFS interface also provides a relatively easy means by which third party filesystem types can be added without any changes to the LFS.

Hierarchy Access to files and directories by a process is controlled by the various layers in the AIX 5L kernel as illustrated here.

[Figure: kernel layers, top to bottom: System Call, Logical File System, Virtual File System, File System Implementation, Fault Handler, Device Driver, Device.]

• System calls

• Logical File System (LFS)

• Virtual File System (VFS)

• File System Implementation - Support of individual file system layout.

• Fault Handler - Device page fault handler support in the VMM.

• Device Driver - Actual device driver code to interface with the device. It is invoked by the page fault handler when the file system implementation code maps the opened file to kernel memory and reads the mapped memory. LVM is the device driver for J2 and Journaled filesystems.

Internal data structures

This illustration shows the major data structures that will be discussed in this lesson. This illustration is repeated throughout the lesson highlighting the areas being discussed.

[Figure: major data structures across the Logical File System, Virtual File System (Vnode-VFS Interface) and File System layers: u-block, User File Descriptor Table, System File Table, vnode, vfs, vmount, gfs, vnodeops, vfsops, gnode, inode.]

Logical File System

Overview The Logical File System (LFS) provides a consistent programming interface to applications via the system call interface, with calls such as open(), close(), read() and write(). The LFS breaks down each system call into requests for the underlying file system implementations.

LFS Data Structures

The data structures discussed in this section are the System Open File Table and the User File Descriptor Table. The system open file table has one entry for each open file on the system. The user file descriptor table (one per process) contains entries for each of the process open file...

Operations The LFS provides a standard set of operations to support the system call interface; its routines manage the open file table entries and the per-process file descriptors. It provides:

• the User File Descriptor Table.

• the System File table. An open file table entry records the authorization of a process’s access to a file system object.

The LFS abstraction specifies the set of file system operations that an implementation must include in order to carry out logical file system requests. Physical file systems can differ in how they implement these predefined operations, but they must present a uniform interface to the LFS. It supports UNIX-like file system access semantics, but other non-UNIX file systems can also be supported.

User interface A user can refer to an open file table entry through a file descriptor held in the thread’s ublock, or by accessing the virtual memory to which the file was mapped. The file descriptor table entry is created when the file is initially opened via the open() system call, and remains until either the user closes the file via the close() system call or the process terminates.

The LFS is the level of the file system at which users can request file operations by using system calls such as open(), close(), read() and write(). For all these calls (except open()), the file descriptor number is passed as an argument to the call. The system calls implement services that are exported to users, and provide a consistent user mode programming interface to the LFS that is independent of the underlying file system type.

System calls that carry out file system requests:

• Map the user’s parameters to a file system object. This requires that the system call component use the vnode (virtual node) component to follow the object’s path name. In addition, the system call must resolve a file descriptor or establish implicit (mapped) references using the open file component.

• Verify that a requested operation is applicable to the type of the specified object.

• Dispatch a request to the file system implementation to perform operations.

User File Descriptor Table

Description The user file descriptor table is contained in the user area, and is a per-process resource. Each entry references an open file, device, or socket from the process’ perspective. The index into the table for a specific file is the value returned by the open() system call when the file is opened: the file descriptor.

Table Management

One or more slots of the file descriptor table are used for each open file. The file descriptor table can extend beyond the first page of the ublock, and is pageable. There is a fixed upper limit of 32768 open file descriptors per process (defined as OPEN_MAX in /usr/include/sys/limits.h). This value is fixed, and may not be changed.

User File Descriptor Table structure

The user file descriptor table consists of an array of user file descriptor table structures, defined in /usr/include/sys/user.h as the structure ufd:

    struct ufd {
        struct file *   fp;
        unsigned short  flags;
        unsigned short  count;
    #ifdef __64BIT_KERNEL
        unsigned int    reserved;
    #endif /* __64BIT_KERNEL */
    };
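The use of the file descriptor as an index into this table can be sketched as below. struct ufd_sketch, SKETCH_OPEN_MAX and fd_to_file() are hypothetical names for illustration only; the kernel's own lookup routine, its locking and its handling of the flags and count members are not shown.

```c
#include <stddef.h>

#define SKETCH_OPEN_MAX 32768   /* per-process limit quoted in the text */

struct file;                    /* system file table entry, opaque here */

/* Simplified stand-in for struct ufd. */
struct ufd_sketch {
    struct file   *fp;          /* system file table entry, NULL if unused */
    unsigned short flags;
    unsigned short count;
};

/*
 * Map a file descriptor to its system file table entry.
 * Returns NULL for an out-of-range or unused descriptor.
 */
struct file *fd_to_file(const struct ufd_sketch *tbl, int tbl_len, int fd)
{
    if (fd < 0 || fd >= tbl_len || fd >= SKETCH_OPEN_MAX)
        return NULL;
    return tbl[fd].fp;
}
```

A system call such as read() would perform a lookup of this kind first, and then work only with the file table entry it obtained.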

System File Table

Description The system file table is a global resource, and is shared by all processes on the system. One unique entry is allocated for each unique open of a file, device, or socket in the system.

Table Management

The table is a large array, and is only partly initialized. It grows on demand, and is never shrunk. Once entries are freed, they are added back onto the free list (ffreelist). The table can contain a maximum of 1,000,000 entries; this limit is not configurable.

Table entries The file table array consists of struct file data elements. Several of the key members of this data structure are described in this table.

Member Description

f_count A reference count field detailing the current number of opens on the file. This value is increased each time the file is opened, and decremented on each close(). Once the reference count is zero, the slot is considered free, and may be re-used.

f_flag various flags described in fcntl.h

f_type a type field describing the type of file:

    /* f_type values */
    #define DTYPE_VNODE   1   /* file */
    #define DTYPE_SOCKET  2   /* communications endpoint */
    #define DTYPE_GNODE   3   /* device */
    #define DTYPE_OTHER  -1   /* unknown */

f_offset a read/write pointer.

f_data Defined as f_up.f_uvnode, it is a pointer to another data structure representing the object, typically the vnode structure.

f_ops a structure containing pointers to functions for the following file operations: rw (read/write), ioctl, select, close, fstat.

file structure The file table structure is described in /usr/include/sys/file.h:

    struct file {
        long    f_flag;         /* see fcntl.h */
        int     f_count;        /* reference count */
        short   f_options;      /* file flags not passed through vnode layer */
        short   f_type;         /* descriptor type */
        union {
            struct vnode *f_uvnode;     /* pointer to vnode structure */
            struct file  *f_unext;      /* next entry in freelist */
        } f_up;
        offset_t f_offset;      /* read/write character pointer */
        off_t    f_dir_off;     /* BSD style directory offsets */
        union {
            struct ucred *f_cpcred;     /* process credentials at open() */
            struct file  *f_cpqmnext;   /* next quick move chunk on free list */
        } f_cp;
        Simple_lock f_lock;         /* file structure fields lock */
        Simple_lock f_offset_lock;  /* file structure offset field lock */
        caddr_t     f_vinfo;        /* any info vfs needs */
        struct fileops {
            int (*fo_rw)();
            int (*fo_ioctl)();
            int (*fo_select)();
            int (*fo_close)();
            int (*fo_fstat)();
        } *f_ops;
    };
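How f_count and the f_ops function pointers work together can be sketched as follows. All names ending in _sketch, and the functions file_hold(), file_release() and sketch_close(), are hypothetical; calling fo_close when the count reaches zero is an assumption about the usual pattern, and the real kernel code also takes locks and returns the entry to the free list.

```c
#include <stddef.h>

struct file_sketch;

/* Reduced operations vector: only the close routine is modelled. */
struct fileops_sketch {
    int (*fo_close)(struct file_sketch *fp);
};

/* Reduced file table entry: reference count plus operations vector. */
struct file_sketch {
    int f_count;
    const struct fileops_sketch *f_ops;
};

/* Example close routine that only counts how often it is called. */
int sketch_close_calls = 0;

int sketch_close(struct file_sketch *fp)
{
    (void)fp;
    sketch_close_calls++;
    return 0;
}

/* Take one more reference on the entry. */
void file_hold(struct file_sketch *fp)
{
    fp->f_count++;
}

/*
 * Drop one reference. When the count reaches zero the type-specific
 * close routine is dispatched through f_ops and 1 is returned to show
 * that the slot may be re-used; otherwise 0 is returned.
 */
int file_release(struct file_sketch *fp)
{
    if (--fp->f_count > 0)
        return 0;
    if (fp->f_ops != NULL && fp->f_ops->fo_close != NULL)
        fp->f_ops->fo_close(fp);
    return 1;
}
```

The caller never needs to know whether the entry describes a file, a socket or a device; the operations vector installed at open time selects the right routine.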

Virtual File System

Overview The Virtual File System (VFS) defines a standard set of operations on an entire file system. Operations performed by a process on a file or file system are mapped through the VFS to the file system below. In this way, the process need not know the specifics of different file systems (such as JFS, J2, NFS or CDROM).

Data Structures

The data structures within a virtual file system are:

• vnode - one per file

• gfs - one per filesystem type kernel extension.

• vnodeops - one per filesystem type kernel extension.

• vfsops - one per filesystem type kernel extension.

• vfs - one per mounted filesystem.

• vmount - one per mounted filesystem.

Functional sections

For the purpose of this lesson the VFS will be broken into three sections and described separately. These sections are:

• Vnode-VFS interface

• File and File System Operations

• The gnode

Vnode/vfs interface

Overview The interface between the logical file system and the underlying file system implementations is referred to as the vnode/vfs interface. This interface provides a logical boundary between generic objects understood at the LFS layer and the file system specific objects that the underlying file system implementation must manage such as inodes and super blocks. The LFS is relatively unaware of the underlying file system data structures since they can be radically different for the various file system types.

Data Structures

Vnodes and vfs structures are the primary data structures used to communicate through the interface (with help from the vmount).

• vnode - represents a file

• vfs - represents a mounted file system

• vmount - contains specifics of the mount request.

History The vnode and vfs structures were created by Sun Microsystems and have evolved into a de facto industry standard, thanks in part to NFS.

Vnodes

Overview The vnode provides a standard set of operations within the file system, and provides system calls with a mechanism for local name resolution. This allows the logical file system to access multiple file system implementations through a uniform name space.

Detail Vnodes are the primary handles by which the operating system references files, and represent access to an object within a virtual file system. Each time an object (file) within a file system is located (even if it is not opened), a vnode for that object is located (if already in existence), or created, as are the vnodes for any directory that has to be searched to resolve the path to the object.

As a file is created, a vnode is also created, and will be re-used for every subsequent reference made to the file by a path name. Every path name known to the logical file system can be associated with, at most, one file system object, and each file system object can have several names because it can be mounted in different locations. Symbolic links and hard links to an object always get the same vnode if accessed through the same mount point.

vnode Management

Vnodes are created by the vfs-specific code when needed, using the vn_get kernel service. Vnodes are deleted with the vn_free kernel service. Vnodes are created as the result of a path resolution.

structure The vnode structure is defined in /usr/include/sys/vnode.h:

    struct vnode {
        ushort       v_flag;
        ulong32int64 v_count;       /* the use count of this vnode */
        int          v_vfsgen;      /* generation number for the vfs */
        Simple_lock  v_lock;        /* lock on the structure */
        struct vfs   *v_vfsp;       /* pointer to the vfs of this vnode */
        struct vfs   *v_mvfsp;      /* pointer to vfs which was mounted over */
                                    /* this vnode; NULL if not mounted */
        struct gnode *v_gnode;      /* ptr to implementation gnode */
        struct vnode *v_next;       /* ptr to other vnodes that share same gnode */
        struct vnode *v_vfsnext;    /* ptr to next vnode on list off of vfs */
        struct vnode *v_vfsprev;    /* ptr to prev vnode on list off of vfs */
        union v_data {
            void *         _v_socket;    /* vnode associated data */
            struct vnode * _v_pfsvnode;  /* vnode in pfs for spec */
        } _v_data;
        char *       v_audit;       /* ptr to audit object */
    };

vfs and vmount

Description When a new file system is mounted, a vfs structure and a vmount structure are created. The vmount structure contains specifics of the mount request, such as the object being mounted, and the stub over which it is being mounted. The vfs structure is the connecting structure which links the vnodes (representing files) with the vmount information, and the gfs structure that helps define the operations that can be performed on the filesystem and its files.

vfs The vfs structure is the connecting structure which links the vnodes (representing files) with the vmount information, and the gfs structure which provides a path to the operations that can be performed on the filesystem and its files.


Element Description

*vfs_next The vfs structures form a linked list. The first vfs entry is addressed by the rootvfs variable, which is private to the kernel.

*vfs_gfs Pointer that provides the path back to the gfs structure and its file system specific subroutines.

vfs_mntd The vfs_mntd pointer points to the vnode within the file system which generally represents the root directory of the file system.

vfs_mntdover The vfs_mntdover pointer points to a vnode within another file system, also usually representing a directory, which indicates where the file system is mounted. In this sense, the vfs_mntd pointer corresponds to the object within the vmount structure referenced by the vfs_mdata pointer, and the vfs_mntdover pointer corresponds to the stub within the vmount structure referenced by the vfs_mdata pointer.

vfs_vnodes Pointer to all vnodes for this file system.

vfs_mdata Pointer to the vmount structure providing mount information for this filesystem.


vfs structure The vfs structure is defined in /usr/include/sys/vfs.h:

struct vfs {
        struct vfs *vfs_next;           /* vfs's are a linked list */
        struct gfs *vfs_gfs;            /* ptr to gfs of vfs */
        struct vnode *vfs_mntd;         /* pointer to mounted vnode */
        struct vnode *vfs_mntdover;     /* pointer to mounted-over vnode */
        struct vnode *vfs_vnodes;       /* all vnodes in this vfs */
        int vfs_count;                  /* number of users of this vfs */
        caddr_t vfs_data;               /* private data area pointer */
        unsigned int vfs_number;        /* serial number to help distinguish between */
                                        /* different mounts of the same object */
        int vfs_bsize;                  /* native block size */
        short vfs_rsvd1;                /* Reserved */
        unsigned short vfs_rsvd2;       /* Reserved */
        struct vmount *vfs_mdata;       /* record of mount arguments */
        Simple_lock vfs_lock;           /* lock to serialize vnode list */
};

vfs Management

The mount helper creates the vmount structure, and calls the vmount subroutine. The vmount subroutine then creates the vfs structure, partially populates it, and invokes the file system dependent vfs_mount subroutine which completes the vfs structure, and performs any operations required internally by the particular file system implementation.

There is one vfs structure for each file system currently mounted. New vfs structures are created with the vmount subroutine. This subroutine calls the vfs_mount subroutine found within the vfsops structure for the particular virtual file system type. The vfs entries are removed with the uvmount subroutine. This subroutine calls the vfs_unmount subroutine from the vfsops structure for the virtual file system type.

vmount The vmount structure contains specifics of the mount request. The vmount structure is defined in /usr/include/sys/vmount.h

struct vmount {
        uint vmt_revision;      /* I revision level, currently 1 */
        uint vmt_length;        /* I total length of structure & data */
        fsid_t vmt_fsid;        /* O id of file system */
        int vmt_vfsnumber;      /* O unique mount id of file system */
        uint vmt_time;          /* O time of mount */
        uint vmt_timepad;       /* O (in future, time is 2 longs) */
        int vmt_flags;          /* I general mount flags */
                                /* O MNT_REMOTE is output only */
        int vmt_gfstype;        /* I type of gfs, see MNT_XXX above */
        struct vmt_data {
                short vmt_off;  /* I offset of data, word aligned */
                short vmt_size; /* I actual size of data in bytes */
        } vmt_data[VMT_LASTINDEX + 1];
};


File and Filesystem Operations

Overview Each file system type extension provides functions to perform operations on the filesystem and its files. Pointers to these functions are stored in the vfsops (filesystem operations) and vnodeops (file operations) structures.

Data Structures

The data structures discussed in this section are:

• gfs - Holds pointers to the vnodeops and the vfsops structures

• vnodeops - contains pointers to filesystem dependent operations on files (open, close, read, write...).

• vfsops - contains pointers to filesystem dependent operations on the filesystem (mount, umount...)

[Figure: the Logical File System, Virtual File System (Vnode-VFS Interface) and File System layers, showing the u-block with its User File Descriptor Table, the System File Table, and the vnode, vfs, vmount, gfs, vnodeops, vfsops, gnode and inode structures.]


gfs

Description There is one gfs structure for each type of virtual file system currently installed on the machine. For each gfs entry, there may be any number of vfs entries.

Purpose The operating system uses the gfs entries as an access point to the virtual file system functions on a type-by-type basis. There is no direct link from a gfs entry to all of the vfs entries of a particular gfs type. The file system code generally uses the gfs structure as a pointer to the vnodeops structure and the vfsops structure for a particular type of file system.

gfs management

The gfs structures are stored within a global array accessible only by the kernel. The gfs entries are inserted with the gfsadd() kernel service, and only one gfs entry of a given gfs_type can be inserted into the array. Generally, gfs entries are added by the CFG_INIT section of the configuration code of the file system kernel extension. The gfs entries are removed with the gfsdel() kernel service. This is usually done within the CFG_TERM section of the configuration code of the file system kernel extension.

gfs structure The gfs structure is defined in /usr/include/sys/gfs.h

struct gfs {
        struct vfsops *gfs_ops;
        struct vnodeops *gn_ops;
        int gfs_type;           /* type of gfs (from vmount.h) */
        char gfs_name[16];      /* name of vfs (eg. "jfs","nfs") */
        int (*gfs_init)();      /* ( gfsp ) - if ! NULL, */
                                /* called once to init gfs */
        int gfs_flags;          /* flags for gfs capabilities */
        caddr_t gfs_data;       /* gfs private config data */
        int (*gfs_rinit)();
        int gfs_hold;           /* count of mounts */
};


vnodeops

Description The vnodeops structure contains pointers to the filesystem dependent operations that can be performed on the vnode, such as link, mkdir, mknod, open, close, remove.

vnodeops management

There is one vnodeops structure per filesystem kernel extension loaded (i.e. one per unique filesystem type). It is initialized when the extension is loaded.

vnodeops structure

This structure is defined in /usr/include/sys/vnode.h. Due to the size of this structure, only a few lines are detailed below:

struct vnodeops {
        /* creation/naming/deletion */
        int (*vn_link)(struct vnode *, struct vnode *, char *,
                        struct ucred *);
        int (*vn_mkdir)(struct vnode *, char *, int32long64_t,
                        struct ucred *);
        int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t,
                        dev_t, struct ucred *);
        int (*vn_remove)(struct vnode *, struct vnode *, char *,
                        struct ucred *);
        int (*vn_rename)(struct vnode *, struct vnode *, caddr_t,
                        struct vnode *, struct vnode *, caddr_t, struct ucred *);
        int (*vn_rmdir)(struct vnode *, struct vnode *, char *,
                        struct ucred *);
        /* lookup, file handle stuff */
        int (*vn_lookup)(struct vnode *, struct vnode **, char *,
                        int32long64_t, struct vattr *, struct ucred *);
        int (*vn_fid)(struct vnode *, struct fileid *, struct ucred *);
        /* access to files */
        int (*vn_open)(struct vnode *, int32long64_t, ext_t, caddr_t *,
                        struct ucred *);
        int (*vn_create)(struct vnode *, struct vnode **, int32long64_t,
                        caddr_t, int32long64_t, caddr_t *, struct ucred *);
        /* ... remaining operations not shown ... */
};


vfsops

Description The vfsops structure contains pointers to the filesystem dependent operations that can be performed on the vfs, such as mount, unmount or sync.

vfsops management

There is one vfsops structure per filesystem kernel extension loaded (i.e. one per unique filesystem type). It is initialized when the extension is loaded.

vfsops structure

This structure is defined in /usr/include/sys/vfs.h.

struct vfsops {
        /* mount a file system */
        int (*vfs_mount)(struct vfs *, struct ucred *);
        /* unmount a file system */
        int (*vfs_unmount)(struct vfs *, int, struct ucred *);
        /* get the root vnode of a file system */
        int (*vfs_root)(struct vfs *, struct vnode **, struct ucred *);
        /* get file system information */
        int (*vfs_statfs)(struct vfs *, struct statfs *, struct ucred *);
        /* sync all file systems of this type */
        int (*vfs_sync)();
        /* get a vnode matching a file id */
        int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *,
                        struct ucred *);
        /* do specified command to file system */
        int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t, struct ucred *);
        /* manage file system quotas */
        int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t, struct ucred *);
};


The Gnode

Introduction A gnode represents an object in a file system implementation, and serves as the interface between the logical file system and the file system implementation. There is a one-to-one correspondence between a gnode and an object in a file system implementation.

Overview Each filesystem implementation is responsible for allocating and destroying gnodes. Calls to the file system implementation serve as requests to perform an operation on a specific gnode. A gnode is needed, in addition to the file system inode, because some file system implementations may not include the concept of an inode. Thus the gnode structure substitutes for whatever structure the file system implementation may have used to uniquely identify a file system object. The logical file system relies on the file system implementation to provide valid data for the following fields in the gnode:

• gn_type Identifies the type of object represented by the gnode.

• gn_ops Identifies the set of operations that can be performed on the object.

Creation A gnode refers directly to a file (regular, directory, special, and so on), and is usually embedded within a file system implementation specific structure (such as an inode). Gnodes are created as needed by file system specific code at the same time as creating implementation specific structures. This is normally immediately followed by a call to the vn_get kernel service to create a matching vnode. The gnode structure is usually deleted either when the file it refers to is deleted, or when the implementation specific structure is being reused for another file.


gnode and inode

The gnode is typically embedded in an in-core inode. The member gnode->gn_data points to the start of the inode.

Structure The gnode structure is defined in /usr/include/sys/vnode.h:

struct gnode {
        enum vtype gn_type;             /* type of object: VDIR,VREG etc */
        short gn_flags;                 /* attributes of object */
        ulong gn_seg;                   /* segment into which file is mapped */
        long32int64 gn_mwrcnt;          /* count of map for write */
        long32int64 gn_mrdcnt;          /* count of map for read */
        long32int64 gn_rdcnt;           /* total opens for read */
        long32int64 gn_wrcnt;           /* total opens for write */
        long32int64 gn_excnt;           /* total opens for exec */
        long32int64 gn_rshcnt;          /* total opens for read share */
        struct vnodeops *gn_ops;
        struct vnode *gn_vnode;         /* ptr to list of vnodes per this gnode */
        dev_t gn_rdev;                  /* for devices, their "dev_t" */
        chan_t gn_chan;                 /* for devices, their "chan", minor's minor */
        Simple_lock gn_reclk_lock;      /* lock for filocks list */
        int gn_reclk_event;             /* event list for file locking */
        struct filock *gn_filocks;      /* locked region list */
        caddr_t gn_data;                /* ptr to private data (usually contiguous) */
};

[Figure: the gnode embedded within the in-core inode, with gnode->gn_data pointing to the start of the inode.]


Exercise 1

Overview This exercise will test your knowledge of the data structures of the LFS and VFS and the relationships between them.

lab Use the following list of terms to best complete the statements below.

File vfs

File system vnodeops

System File Table vmount

1. A vnode represents a ______________.

2. A vfs represents a _____________.

3. The gfs contains pointers to the vfsops and the _____________.

4. The ___________ structure contains specifics about a mount request.

5. The ____________ has one entry for each open file on the system.

Answer the following two questions by completing this diagram as directed.

6. Label the blocks representing the vnode, vmount and gfs structures

7. Draw a line representing the file pointer in the ufd to an entry in the system file table.

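[Diagram to complete: the Logical File System, Virtual File System (Vnode-VFS Interface) and File System layers, with the u-block User File Descriptor Table, System File Table, vfs, vnodeops, vfsops, gnode and inode blocks labeled, and the vnode, vmount and gfs blocks left blank.]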


Lab Exercise 1

Overview In the following exercise you will run a small C program that opens a file, initializes it by writing a few bytes to it, then pauses. The pause allows us to investigate the various LFS structures that are created by opening the file, using the appropriate system debugger.

The program The C code for the example is:

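#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd;

        /* O_CREAT requires a mode argument; 0644 is used here */
        fd = open("foo", O_RDWR | O_CREAT, 0644);
        write(fd, "abcd", 4);
        close(fd);
        fd = open("foo", O_RDONLY);
        printf("fd = %d\n", fd);
        pause();        /* wait so that the structures can be examined */
        return 0;
}

The close() followed by open() is required to ensure that the write is committed to disk, and hence that the inode is updated. Save this code to a file called t.c, and compile it using “make t”.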

Lab Follow the steps in the table below.


Stage Description

1 Enter the C program from above, save it to a file called t.c and compile with the command:
$ make t

2 Execute the program created in the last step. It will print the file descriptor number of the file it creates, then pause.
$ ./t
fd = 3

3 From another shell on the same system, enter the system debugger (kdb or iadb).


4 Initially, we need to find the address of the file structure for the open file. We know that the file descriptor for our program is number 3, so we have to find the mapping between the file descriptor number and the file structure. This mapping is done from the file descriptor table in the uarea structure for the process. To find the uarea, find the slot number in the thread table that our “t” process occupies. The uarea slot number will be the same.

For kdb use the “th *” command to display all the threads. Page down through the entries until you find the correct entry:
(0)> th *
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000000 0 swapper SLEEP 000003 010 1 0
...
pvthread+001D00 55 t SLEEP 003A39 03C 1 0
...

5 Now use the command “uarea” on this thread slot number, to view the user area (which contains the file descriptor table), and page down through the output until you find the “File descriptor table”:
(0)> u 55
File descriptor table at..F00000002FF3CEC0:
fd 0 fp..F100009600007430 count..00000000 flags. ALLOCATED
fd 1 fp..F100009600007430 count..00000000 flags. ALLOCATED
fd 2 fp..F100009600007430 count..00000000 flags. ALLOCATED
fd 3 fp..F100009600007700 count..00000000 flags. ALLOCATED
Rest of File Descriptor Table empty or paged out.
...


6 The file structure for file descriptor 3 is at address F100009600007700. Use the “file” command along with this address to display the contents of the structure:
(0)> file F100009600007700
ADDR COUNT OFFSET DATA TYPE FLAGS
F100009600007700 (count and offset not recoverable) F10000971528A380 VNODE READ

[The rest of this listing was damaged in extraction and most numeric values cannot be recovered. The file structure fields shown are: node, slot, f_flag, f_count, f_options, f_type, f_data (F10000971528A380), f_offset, f_dir_off, f_cred, f_lock@, f_lock, f_offset_lock@, f_offset_lock, f_vinfo and f_ops (vnodefops). They are followed by the VNODE at F10000971528A380 with the fields v_flag, v_count, v_vfsgen, v_vfsp, v_lock@, v_lock, v_mvfsp, v_gnode (F10000971528A3F8), v_next, v_vfsnext, v_vfsprev, v_pfsvnode and v_audit.]

Note that half way down the output, the address of the vnode structure that corresponds to this file is printed, followed by the contents of this vnode structure. (We could also display the vnode structure separately by running the kdb command “vnode” with the address F10000971528A380.)

8 There are two items that we are interested in from the vnode structure displayed in the last step: the v_vfsp address, which points to the filesystem that contains the vnode, and the v_gnode address, which points to the gnode structure for the file. From the gnode we can display the inode structure for the file.

Initially, display the gnode address, using the kdb command “gnode” with the address F10000971528A3F8.
(0)> gnode F10000971528A3F8
GNODE............ F10000971528A3F8 KERN_heap+528A3F8
gn_type....... 00000001 gn_flags...... 00000000
gn_seg........ 00000000000078AD
gn_mwrcnt..... 00000000 gn_mrdcnt..... 00000000
gn_rdcnt...... 00000001 gn_wrcnt...... 00000000
gn_excnt...... 00000000 gn_rshcnt..... 00000000
gn_ops........ 00000000003D7DC8 jfs_vops
gn_vnode...... F10000971528A380
gn_rdev....... 8000000A00000008 gn_chan....... 00000000
gn_reclk_event 00000000FFFFFFFF
gn_reclk_lock@ F10000971528A440 gn_reclk_lock. 0000000000000000
gn_filocks.... 0000000000000000
gn_data....... F10000971528A3D8
gn_type....... REG


9 The inode address is contained in the gn_data field, in this case F10000971528A3D8. Use the kdb command “inode” to display this structure:

(0)> inode F10000971528A3D8
DEV NUMBER CNT TYPE FLAGS
KERN_heap+528A3D8 (device, number and count not recoverable) REG

[The rest of this listing was damaged in extraction and most numeric values cannot be recovered. The inode fields shown are: forw, back, next, prev, gnode@ (F10000971528A3F8), number, dev, ipmnt, flag, locks, bigexp, compress, cflag, count, syncsn, id, movedfrag, openevent (FFFFFFFFFFFFFFFF), hip, nodelock, nodelock@, dquot[USR], dquot[GRP], dinode@, cluster, rcluster, diocnt, nondio, size and gets. They are followed by the GNODE at F10000971528A3F8 (the same fields as in the gnode output of the previous step, ending with gn_data F10000971528A3D8 and gn_type REG), the disk inode fields di_gen, di_mode, di_nlink, di_acct, di_uid, di_gid, di_nblocks, di_acl, di_mtime, di_atime, di_ctime, di_size_hi, di_size_lo, di_sec, di_rdaddr, di_vindirect, di_rindirect, di_privoffset, di_privflags and di_priv, and finally the VNODE at F10000971528A380 with the fields v_flag, v_count, v_vfsgen, v_vfsp, v_lock@, v_lock, v_mvfsp, v_gnode (F10000971528A3F8), v_next, v_vfsnext, v_vfsprev, v_pfsvnode and v_audit.]


10 The inode command displays the inode, gnode and vnode structures.

The “number” member of the inode structure should contain the inode number, in hex, of the file foo. Verify that this inode number matches the inode number displayed by the command:

$ ls -lia foo

Don’t forget to convert the inode number from hex to decimal.

11 The dev field displays the major and minor number of the logical volume for the filesystem. For example:
64 bit systems: 8000000A00000007 -> major=10 minor=7
32 bit systems: 000A0007 -> major=10 minor=7
Verify this number with the command:

$ ls -lia /dev/<logical volume>


Lab Exercise 2

Overview The instructor will create a simple shell script that simply prints its process id, then pauses.

Both the “ps” command, and the process and thread tables entries for this script will simply list the name of the program as the name of the shell that it is being executed by (E.g. “ksh”).

Objective To determine the name of the script that the instructor is running.

Tips • Remember that the shell will have to open() the script prior to executing it.

• The command find . -inum xxx can be used to find the name of a file given the filesystem name and an inode number.


Unit 14. AIX 5L boot

Objectives After completing this unit, you should be able to:

• List and locate boot components and their usage

• Understand the 3 Phases of rc.boot

• Understand the contents and usage of a RAMFS

• Understand the ODM structure and the usage of ODM classes

• Create new boot images

• Debug boot problems


What is boot

Definition Boot is the process that begins when the computer is powered up and continues until the entries in the init table have been processed.

ROS process System ROS (Read Only Storage) contains firmware that is independent of the operating system. This firmware initializes the hardware and loads AIX.

All platforms except RS6K will use an intermediate boot process called :

• Softros : (/usr/lib/boot/aixmon_chrp) for CHRP systems

• Softros : (/usr/lib/boot/aixmon_rspc) for RSPC systems

• Boot loader : (/usr/lib/boot/boot_elf) for IA-64 systems

AIX process AIX begins execution after system ROS firmware or the intermediate boot process finishes its execution :

• sets up firmware information

• kernel initialization

• RAM filesystem based configuration

• control is passed to files based in the permanent filesystem (this may be a disk or network filesystem)

• /etc/inittab entries are processed. This usually includes enabling the user login process.


Various Types of boot

Devices AIX can boot from the following types of devices :

• hard disk boot

• CD-ROM boot

• tape boot (Not supported on IA-64 platform)

• network boot

Configuration The boot process can use one of the following boot configurations :

• standalone

• diskless/dataless (Not supported on IA64 platform)

• operating system installation/software maintenance

• diagnostics

Hard disk boot The hard disk boot has the following characteristics :

• the boot image resides on the hard disk

• the RAM filesystem contains the files necessary for configuring the hard disk(s), and then accessing the filesystems that reside in the root volume group (rootvg)

• this is the most common system configuration

• these types of systems are also known as “standalone” systems

• these types of systems may also be booted into the diagnostics functions

CDROM boot The CDROM boot may be used in the following situations:

• operating system installation

• diagnostics

• hard disk boot failure recovery/maintenance


Tape boot The Tape boot device can be used for :

• operating system installation

• hard disk boot failure recovery/maintenance

The tape boot device is usually used for creating bootable system backups

The tape boot device is not supported on IA-64 platform.

Network boot The network boot can be used for the following purposes :

• boot and install the operating system

• the operating system is installed on a hard disk with NIM

• subsequent boots are from the hard disk

• supported diskless/dataless configurations

• diagnostics

• hard disk boot failure recovery/maintenance

The centralized boot/filesystem servers offer convenient administration


Systems types and Kernel images

System Types There are four basic hardware architecture types:

• RS6K - the “classic” IBM workstation

• RSPC - the PowerPC Reference Platform workstation

• CHRP - Common Hardware Reference Platform

• IA-64 - Intel IA-64 Platform

boot images types

There are three corresponding types of boot images:

• The RS6K uses a hardware ROS to build the IPL Control Block

• The RSPC and CHRP use a SOFTROS to build the IPL Control Block

• The IA-64 uses an EFI boot loader to build the IPL Control Block

kernel types There are four types of Kernels loaded:

• 32 bits Power UP (/unix->/usr/lib/boot/unix_up)

• 32 bits Power MP (/unix->/usr/lib/boot/unix_mp)

• 64 bits Power (/unix->/usr/lib/boot/unix_64)

• 64 bits IA-64 (/unix->/usr/lib/boot/unix_ia64)
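As a small illustration of the list above, the following Python sketch maps the target of the /unix symbolic link to the kernel type. The names `KERNEL_TYPES`, `kernel_type` and `running_kernel_type` are illustrative helpers, not part of AIX; only the four link targets come from the list.

```python
import os

# Link targets taken from the list above; the descriptions are labels only.
KERNEL_TYPES = {
    "/usr/lib/boot/unix_up": "32-bit Power UP",
    "/usr/lib/boot/unix_mp": "32-bit Power MP",
    "/usr/lib/boot/unix_64": "64-bit Power",
    "/usr/lib/boot/unix_ia64": "64-bit IA-64",
}


def kernel_type(target):
    """Return the kernel type for a /unix link target, or 'unknown'."""
    return KERNEL_TYPES.get(os.path.normpath(target), "unknown")


def running_kernel_type(link="/unix"):
    """Read the symbolic link (hypothetical helper) and classify its target."""
    return kernel_type(os.readlink(link))
```

On a running system, `running_kernel_type()` would report which of the four kernels /unix currently points to.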


RAMFS and prototype files

Introduction In order to successfully boot a system, the AIX kernel will need basic commands, configuration files, kernel extensions and device drivers to be able to configure a minimum environment.

All the files needed are included in the RAMFS using the following command

mkfs -V jfs -p <proto> <temp_filesystem_file>

prototypes files description

A prototype file is a list of files and file descriptions that are needed to create a RAMFS.

The format of a prototype file entry is as follows :

<dest_file_name> <type> <mode> 0 0 <full_path_name>

Where :

• <dest_file_name> : is the name of the file, directory, link or device as it will be written to the RAMFS

• <type> : defines the type of the entry and can be :

• d--- : a directory entry (this will change the relative path of the following entries).

• l--- : a link (the target will be listed in the <full_path_name> parameter)

• b--- : a block device (the <full_path_name> parameter will represent the major and minor numbers)

• c--- : a character device (the <full_path_name> parameter will represent the major and minor numbers)

• ---- : a file

• <mode> : represents the file permissions in octal format

• <full_path_name> : value will depend on the <type> as described before.
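The entry format described above can be sketched as a small Python parser. This is only an illustration: the sample entries in the usage note are hypothetical, and naming the two constant `0 0` fields `owner` and `group` is an assumption borrowed from the traditional mkfs prototype format, not something stated here.

```python
from collections import namedtuple

ProtoEntry = namedtuple("ProtoEntry", "name kind mode owner group value")

# First character of the <type> field, as listed above.
KINDS = {
    "d": "directory",
    "l": "link",
    "b": "block device",
    "c": "character device",
    "-": "file",
}


def parse_proto_entry(line):
    """Split one prototype entry into its fields."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields: %r" % line)
    name, type_field, mode, owner, group = fields[:5]
    kind = KINDS.get(type_field[:1])
    if kind is None:
        raise ValueError("unknown entry type: %r" % type_field)
    # Everything after the fifth field is <full_path_name>: a path, a link
    # target, or the major and minor numbers of a device.
    value = " ".join(fields[5:])
    return ProtoEntry(name, kind, int(mode, 8), int(owner), int(group), value)
```

For example, a hypothetical entry `ssh ---- 0555 0 0 /usr/lib/boot/ssh` parses as a file with mode 0555, and `null c--- 0666 0 0 2 2` as a character device whose value holds the two device numbers.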


prototypes files types

Prototype files are divided into several groups according to their specific use :

• Prototype files located in /usr/lib/boot are the base prototypes used for a platform according to the boot device type. They come with the platform base system device fileset.

• Prototype files located in /usr/lib/boot/network are specific to any general kind of network boot device. They come with the platform base system device fileset.

• Prototype files located in /usr/lib/boot/protoext are used for any specific type of boot device. They come with the device specific fileset.


Boot Image Creation

Introduction In order to successfully boot from a device, the administrator will need to run commands that will create the boot structure.

bosboot command

The bosboot command is the most commonly used on AIX because it manages all verification tasks and environment setup for the administrator. The administrator can also use the mkboot command, but must then perform all of these preliminary checks manually.

The bosboot command is also used by other commands, such as mksysb, or by the installp post-installation process when installing packages that need to build a new boot image.

bosboot process overview

The bosboot command will do the following :

• set execution environment

• parse command line arguments

• verify syntax and arguments

• point to platform specific files (like mkboot_chrp or aixmon_rspc)

• check for space needed in /tmp and destination filesystem if needed

• create a RAMFS if requested using mkfs and proto files

• create a bootimage and a boot record if requested using the appropriate mkboot command

• copy the boot image and savebase to the boot device if requested.

• cleanup execution environment


bosboot parameters

The most commonly used bosboot command is :

# bosboot -a -d /dev/hdisk0

For example, if you need to load and invoke the kernel debugger, you can use :

# bosboot -a -I -d /dev/hdisk0

The following table lists the bosboot parameters that can be used :

argument description

-a Create complete boot image and device.

-w file Copy given boot image file to device.

-r file Create ROS Emulation boot image.

-d device Device for which to create the boot image.

-U Create uncompressed boot image.

-p proto Use given proto file for RAM disk file system.

-k kernel Use given kernel file for boot image.

-l lvdev Target boot logical volume for boot image.

-b file Use given file name for boot image name.

-D Load Low Level Debugger.

-I Load and Invoke Low Level Debugger.

-L Enable MP locks instrumentation (MP kernels)

-M norm|serv|both Boot mode - normal or service

-O offset boot image offset for CDROM file system.

-q Query size required for boot image.


AIX 5L Distributions

Introduction AIX 5L will be delivered in two separate distributions :

• One for Power systems

• One for Intel IA-64 systems

Power CDROM Distributions

The distribution CDROM that IBM provides to our customers has three boot images. There is a boot image for the RS6K computers, a second for the RSPC computers, and a third for CHRP (/ppc/chrp/bootfile.exe). The RS6K, RSPC, and CHRP UP computers can use the MP Kernel, which is the method implemented for distribution media that goes to our customers. In other words, when a customer receives boot/install media from IBM, there is no need to determine whether the system is UP or MP. This boot image is created using the MP kernel. The UP kernel is more efficient for uniprocessor systems, but the strategy of a single boot image for both hardware platform types lowers distribution cost, and is more convenient for our customers.

IA-64 CDROM Distributions


Checkpoint

Introduction Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz 1. What is the name of the file used as a SOFTROS on CHRP systems?

2. Does IA-64 support a 32-bit kernel?

3. What are the common functions of the ROS, SOFTROS and EFI boot loader?

4. List the 4 platforms supported by AIX 5L.

5. What is the purpose of the RAMFS?

6. How is a RAMFS created?


Instructor Notes

Purpose Notes on Quiz and transition to the next section

Quiz responses

The responses for the Quiz are :

1. What is the name of the file used as a SOFTROS on CHRP systems?

• /usr/lib/boot/aixmon_chrp

2. Does IA-64 support a 32-bit kernel? NO

3. What are the common functions of the ROS, SOFTROS and EFI boot loader?

• create the IPLCB

• load the kernel

4. List the 4 platforms supported by AIX 5L.

• RS6K

• RSPC

• CHRP

• IA-64

5. What is the purpose of the RAMFS?

• Provide the basic commands, configuration files, kernel extensions and device drivers needed to bring up a minimum environment.

6. How is a RAMFS created?

• Using mkfs and prototype files.

Transition Statement

Now we will describe:

• the Power specific boot process if this is a Power course

• the IA-64 specific boot process if this is an IA-64 course


The Power Boot Mechanism

Introduction This section explains the boot mechanism used by Power family systems.

Boot overview When the system is powered on, the ROS or the firmware looks for the boot record on the device pointed to by the bootlist to find the boot entry point.

On RSPC and CHRP, the softros executes and, if needed, uncompresses the boot image using the bootexpand program.

It then loads the kernel, which initializes itself.

The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).

ssh then calls rc.boot for PHASE I and PHASE II, which are specific to each boot device type.

Then init executes rc.boot PHASE III and the remaining common code in rc.boot for disk and network boot devices.

Boot diagram The following diagram represents the high level boot process overview.

1. Execution of the system ROS or firmware.

2. Boot record read from boot device.

3. rspc or chrp boot? If yes: execution of softros. If no: continue.

4. Compressed bootimg? If yes: execution of bootexpand. If no: continue.

5. Kernel initialization.

6. Kernel calls init (/usr/lib/boot/ssh).

7. init (ssh) calls rc.boot PHASE I & II.

8. init exits to newroot; init calls rc.boot PHASE III from inittab and processes the rest of the inittab entries.


Power boot disk layout

Boot image overview

The following chart describes a Power boot disk :

bootrecord 512 byte block containing size and location of the boot image. The boot record is the first block on a disk or cdrom and is therefore separated from the boot image. The boot image on a disk is placed in the boot logical volume which is a reserved contiguous area.

softros RSPC and CHRP platforms use a SOFTROS program (/usr/lib/boot/aixmon_rspc or /usr/lib/boot/aixmon_chrp) that performs system initialization for AIX that the hardware firmware in ROS does not provide, such as appending device information to the IPL control block.

[Figure: Power boot disk layout. Labels: bootrecord; VGDA; boot disk hd5, containing softros (chrp and rspc), bootexpand, compressed kernel, compressed RAM Filesystem and base customized data; rest of the boot disk.]

bootexpand Program to expand compressed boot image which is executed before control is passed to kernel. The compression of a boot image is optional but it is the default since the image size is less than half of an uncompressed image and requires less time to load from the media.

kernel AIX 32 bits UP, 32 bits MP or 64 bits MP kernel, to which control passes after expansion by bootexpand. The kernel initializes itself and then passes control to the simple shell init (ssh) in the RAM filesystem.

RAM filesystem

Filesystem used during boot process, that contains programs and data for initializing devices and subsystems in order to install AIX, execute diagnostics, or to access and bring up the rest of AIX.

base customized data

Area of the hard disk boot logical volume containing user configured ODM device configuration information that is used by the system configuration process.


AIX 5L Power boot record

Introduction On Power systems, the boot record is located at the beginning of the boot device and contains the following information :

• The IPL record

• The boot partition table used by chrp and rspc systems.

IPL record description

The following table describes the content of the boot record.

size offset name description

4 0 IPL_record_id This physical volume contains a valid IPL record if and only if this field contains IPLRECID in EBCDIC ’IBMA’

20 4 reserved1

4 24 formatted_cap Formatted capacity. The number of sectors available after formatting.

1 28 last_head THIS IS DISKETTE INFORMATION The number of heads minus 1.

1 29 last_sector THIS IS DISKETTE INFORMATION The number of sectors per track.

6 30 reserved2

4 36 boot_code_length Boot code length in sectors. A 0 value implies no boot code present.

4 40 boot_code_offset Boot code offset. Must be 0 if no boot code present, else contains byte offset from start of boot code to first instruction.

4 44 boot_lv_start Contains the PSN of the start of the BLV.

4 48 boot_prg_start Boot code start. Must be 0 if no boot code present, else contains the PSN of the start of boot code.

4 52 boot_lv_length BLV length in sectors.

4 56 boot_load_add 512 byte boundary load address for boot code.

1 60 boot_frag 0x1 => fragmentation allowed


1 61 boot_emulation 0x1 => ROS network emulation code

2 62 reserved3

2 64 basecn_length Number of sectors for base customization. Normal mode.

2 66 basecs_length Number of sectors for base customization. Service mode.

4 68 basecn_start Starting PSN value for base customization. Normal mode.

4 72 basecs_start Starting PSN value for base customization. Service mode.

24 76 reserved4

4 100 ser_code_length Service code length in sectors. A 0 value implies no service code present.

4 104 ser_code_offset Service code offset. 0 if no service code is present, else contains byte offset from start of service code to first instruction.

4 108 ser_lv_start Contains the PSN of the start of the SLV.

4 112 ser_prg_start Service code start. Must be 0 if service code is not present, else contains the PSN of the start of service code.

4 116 ser_lv_length SLV length in sectors.

4 120 ser_load_add 512 byte boundary load address for service code.

1 124 ser_frag Service code fragmentation flag. Must be 0 if no fragmentation allowed, else must be 0x01.

1 125 ser_emulation ROS network emulation flag

2 126 reserved5

8 128 pv_id The unique identifier for this PV.

376 136 dummy Include the partition table.
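The table above translates directly into a fixed 512-byte layout. The following Python sketch unpacks it with the `struct` module; it assumes big-endian fields, which is consistent with Power and with the chrp example in this section, and the names `IPL_FORMAT`, `IPL_FIELDS` and `parse_ipl_record` are illustrative only.

```python
import struct

# One format code per table row, in offset order (big-endian, no padding).
IPL_FORMAT = ">4s20sIBB6sIIIIIIBB2sHHII24sIIIIIIBB2s8s376s"
IPL_FIELDS = (
    "IPL_record_id reserved1 formatted_cap last_head last_sector reserved2 "
    "boot_code_length boot_code_offset boot_lv_start boot_prg_start "
    "boot_lv_length boot_load_add boot_frag boot_emulation reserved3 "
    "basecn_length basecs_length basecn_start basecs_start reserved4 "
    "ser_code_length ser_code_offset ser_lv_start ser_prg_start "
    "ser_lv_length ser_load_add ser_frag ser_emulation reserved5 "
    "pv_id dummy"
).split()
IPLRECID = b"\xc9\xc2\xd4\xc1"  # 'IBMA' in EBCDIC


def parse_ipl_record(sector):
    """Unpack the first 512-byte sector of a boot device into a dict."""
    if len(sector) != struct.calcsize(IPL_FORMAT):
        raise ValueError("an IPL record is exactly 512 bytes")
    record = dict(zip(IPL_FIELDS, struct.unpack(IPL_FORMAT, sector)))
    if record["IPL_record_id"] != IPLRECID:
        raise ValueError("no valid IPL record id")
    return record
```

Given the first sector of a boot disk, `parse_ipl_record(sector)["boot_lv_start"]` returns the PSN of the start of the BLV as an integer.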


boot partition table

The boot record contains 4 partition table entries starting at offset 0x1be. Each entry contains the following information :

size in byte name description

1 boot_ind Boot indicator

1 begin_h Begin head

1 begin_s Begin sector

1 begin_c Begin cylinder

1 syst_ind System indicator

1 end_h End head

1 end_s End sector

1 end_c End cylinder

4 RBA Relative block address in little endian format

4 sectors Number of sectors in little endian format

boot partition tables entries

RS6K platform doesn’t use a boot partition table. The four boot partition table entries are used for :

• CHRP boot images

• CHRP and First RSPC boot image

• CHRP and Second RSPC boot image

• CHRP Third RSPC boot image
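To make the entry layout concrete, the following Python sketch reads the four 16-byte entries at offset 0x1be. It is a minimal sketch: the eight one-byte fields are read as-is, and RBA and sectors are decoded as little-endian 32-bit values as stated in the table. The function and constant names are illustrative.

```python
import struct

PART_OFFSET = 0x1be          # start of the first entry in the boot record
PART_FORMAT = "<8B2I"        # eight single bytes, then two little-endian words
PART_FIELDS = (
    "boot_ind", "begin_h", "begin_s", "begin_c",
    "syst_ind", "end_h", "end_s", "end_c",
    "RBA", "sectors",
)


def parse_partition_table(sector):
    """Return the four boot partition table entries as a list of dicts."""
    size = struct.calcsize(PART_FORMAT)  # 16 bytes per entry
    entries = []
    for index in range(4):
        values = struct.unpack_from(PART_FORMAT, sector, PART_OFFSET + index * size)
        entries.append(dict(zip(PART_FIELDS, values)))
    return entries
```

The RBA and sectors values of the relevant entry are the `skip` and `count` arguments used with `dd bs=512` when extracting a boot image.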


Example The following chart represents an AIX 5L boot record from a chrp system. It was obtained using :

od -Ax -x /dev/hdisk0|pg

0000000 c9c2 d4c1 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 2cc1 0000 0000 0000 1100
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 0100 0100 0000 3cdc 0000 3cdc 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 2cc1 0000 0000 0000 1100
0000070 0000 0000 0000 0000 0000 0000 0000 0000
0000080 0007 1483 229d 0662 0000 0000 0000 0000
0000090 0000 0000 0000 0000 0000 0000 0000 0000

00001b0 0000 0000 0000 0000 0000 0000 0000 0000
00001c0 0000 0000 0000 0000 0000 0000 0000 80ff
00001d0 ffff 41ff ffff 1b11 0000 c12c 0000 00ff
00001e0 ffff 41ff ffff 0211 0000 1900 0000 80ff
00001f0 ffff 41ff ffff 1b11 0000 c12c 0000 55aa
0000200 4182 000c 3880 0000 4800 000c 7c83 2378
0000210 7ca4 2b78 83c3 0098 7fde 1814 83de 0034
0000220 57de 063e 2c1e 0057 4182 0024 2c1e 0058
0000230 4182 001c 2c1e 0059 4182 0014 2c1e 0072
0000240 4182 000c 2c1e 0082 4082 0030 83c3 0288
0000250 7fde 1814 83de 006c 2c1e 0000 4182 001c
0000260 3fc0 8000 7fcf 01a4 3fc0 f000 83fe 10c0
0000270 67ff 0080 93fe 10c0 31ad ffd8 30c3 0080

The fields marked in the chart are :

• offset 0x00 : IPL record id, IBMA in EBCDIC (c9c2 d4c1)

• offset 0x24 : boot_code_length (0000 2cc1)

• offset 0x2c : boot_lv_start (0000 1100)

• offsets 0x40 and 0x42 : base_cn_length and base_cs_length (0100 each)

• offsets 0x44 and 0x48 : base_cn_start and base_cs_start (0000 3cdc each)

• offset 0x64 : serv_code_length (0000 2cc1)

• offset 0x6c : ser_lv_start (0000 1100)

• offset 0x80 : PVID (0007 1483 229d 0662)

• offset 0x1be : boot_partition_table; each entry ends with its RBA and sectors fields (for example 1b11 0000 and c12c 0000)

• offset 0x1fe : BOOT_SIGNATURE (55aa)


Instructor Notes

Purpose Notes on Power boot record

Little endian format

The RBA and sectors fields of the boot partition table are stored in little endian format.

To obtain the actual value, reverse the order of the four bytes as they are displayed by the od command. For example, an RBA displayed as 1b11 0000 is 0x0000111b.
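The conversion can be checked with a few lines of Python. This sketch assumes the dump was taken with od -x on the (big-endian) Power system itself, so the displayed words show the bytes in disk order; the helper name `od_words_to_le32` is illustrative.

```python
def od_words_to_le32(words):
    """Convert two od -x words (for example '1b11 0000') to their value.

    The four bytes are taken in the order displayed and interpreted as a
    little-endian 32-bit integer.
    """
    raw = bytes.fromhex("".join(words.split()))
    if len(raw) != 4:
        raise ValueError("expected exactly four bytes: %r" % words)
    return int.from_bytes(raw, "little")
```

Applied to the chrp example, the RBA `1b11 0000` gives 0x111b and the sectors field `c12c 0000` gives 0x2cc1, which matches the boot code length found at offset 0x24 of the same record.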


Power boot images structures

Introduction Depending on the architecture, the boot image will not always contain the same elements, because of the requirements of the ROS and firmware specifications.

RS6K boot image

The rs6k platform doesn’t need a softros emulation, so the boot image starts with the bootexpand program. The bootexpand program is loaded first to uncompress the kernel and the RAMFS.

RSPC boot image

On rspc, the aixmon_rspc softros is located at the beginning of the boot image, but the xcoff header is replaced by a hints structure as defined in /usr/include/sys/boot.h. So an RSPC boot image will contain the following sections :

• The hints structure

• The aixmon_rspc file without its xcoff header, starting at its entry point

• The bootexpand program

• The compressed kernel

• The compressed RAMFS

• The saved base customization.

CHRP boot image

On chrp, the aixmon_chrp softros is located at the beginning of the boot image, but the xcoff header is replaced by an ELF header. So a CHRP boot image will contain :

• The ELF structure

• The aixmon_chrp file without its xcoff header, starting at its entry point

• The bootexpand program

• The compressed kernel

• The compressed RAMFS

• The saved base customization.


RSPC boot image hints header

introduction On rspc systems, the aixmon xcoff header is replaced by a hints structure. The aixmon_rspc file is copied to the boot image after the hints structure, starting at its entry point.

hints boot structure description

The following table represents the hints structure :


size name description

4 signature Signature for boot program ‘0x4149584d’

4 resid_data_address address of residual data as determined by firmware

4 bss_offset Address of bss section

4 bss_length Length of bss section

4 jump_offset Jump offset in boot image

4 load_exec_address address of boot loader as determined by firmware

4 header_size Size of header

4 header_block_size Offset to AIX boot image

4 image_length Size of boot program

4 Spare

4 res_mem_size reserved memory size

4 mode_control Boot mode control ‘0xDEAD0000 | mode_control’
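The hints structure above is twelve consecutive 4-byte fields. A minimal Python sketch for decoding it follows, assuming big-endian words as on Power; `HINTS_FIELDS` and `parse_hints` are illustrative names, and the authoritative definition remains the one in /usr/include/sys/boot.h.

```python
import struct

HINTS_FIELDS = (
    "signature", "resid_data_address", "bss_offset", "bss_length",
    "jump_offset", "load_exec_address", "header_size", "header_block_size",
    "image_length", "spare", "res_mem_size", "mode_control",
)
HINTS_SIGNATURE = 0x4149584d  # the bytes 'AIXM' in ASCII


def parse_hints(data, offset=0):
    """Decode the 48-byte hints structure found at the given offset."""
    values = struct.unpack_from(">12I", data, offset)
    hints = dict(zip(HINTS_FIELDS, values))
    if hints["signature"] != HINTS_SIGNATURE:
        raise ValueError("hints signature not found")
    return hints
```

The upper half of `mode_control` is expected to be 0xDEAD; the lower half carries the boot mode bits.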


RSPC boot image example

The following output represents the hints header output from the following command :

# dd if=<boot_disk> bs=512 skip=<RBA> count=1 |od -Xa -x

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 3004 0000 00fe 3200 0002 4149 5820 2034
0000210 2033 2030 3130 3130 3035 3437 3000 0000
0000220 0000 0000 0000 0000 0000 0000 0000 0000
*
0000400 4149 584d 0000 0000 0000 ffd4 0000 022c
0000410 0000 038c 0000 0000 0000 0400 0000 0097
0000420 0001 2810 0000 0000 0000 0000 dead 00c0
0000430 4800 0005 7e80 00a6 7e94 a278 3a94 1000

The hints structure starts at offset 0x400. The aixmon entry point follows it at offset 0x430.


CHRP Boot image ELF structure

introduction On chrp systems, the aixmon xcoff header is replaced by an ELF header. The aixmon_chrp file is copied to the boot image after the ELF header, starting at its entry point.

ELF boot header description

The ELF boot header is made of :

• ELF header structure

• Note section description

• loader section 1 description

• loader section 2 description

• Note data description

• The boot loader parameters data

ELF header structure description

The following table describes the ELF header structure :


size name description

16 e_ident ELF identification

2 e_type object file type

2 e_machine architecture

4 e_version object file version

4 e_entry entry point

4 e_phoff prog hdr byte offset

4 e_shoff section hdr byte offset

4 e_flags processor specific flags

2 e_ehsize ELF header size

2 e_phentsize prog hdr table entry size

2 e_phnum prog hdr table entry count

2 e_shentsize section header size

2 e_shnum section header entry count

2 e_shstrndx sect name string tbl idx
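The field sizes in the table above add up to the 52-byte header of the standard 32-bit ELF format. The following Python sketch decodes it; the byte order is taken from the EI_DATA byte of e_ident (1 for little endian, 2 for big endian), and the names `ELF_FIELDS` and `parse_elf32_header` are illustrative.

```python
import struct

ELF_FIELDS = (
    "e_ident", "e_type", "e_machine", "e_version", "e_entry",
    "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
    "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx",
)


def parse_elf32_header(data):
    """Decode a 32-bit ELF header from the start of a boot image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] (EI_DATA) selects the byte order of the remaining fields.
    order = "<" if data[5] == 1 else ">"
    values = struct.unpack_from(order + "16sHHIIIIIHHHHHH", data, 0)
    return dict(zip(ELF_FIELDS, values))
```

In the returned dictionary, `e_phoff` and `e_phnum` locate the program header table that follows the ELF header.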


Note, load 1 and load 2 segments descriptions

The following table describes the structure used to format note, loader 1 and loader 2 segments :

size name description

4 p_type segment type

4 p_offset offset to this segment

4 p_vaddr virt addr of seg in memory

4 p_paddr phy addr of seg in memory

4 p_filesz file image segment size

4 p_memsz mem image segment size

4 p_flags segment flags

4 p_align segment alignment

Note data description

The following table represents the note data description structure :

size name description

4 namesz size of name

4 descsz size of descriptor

4 type descriptor interpretation

8 name the owner of this entry

4 real_mode ISA env variable

4 real_base ISA env variable

4 real_size ISA env variable

4 virt_base ISA env variable

4 virt_size ISA env variable

4 load_base ISA env variable


Boot loader parameters description

The following table describes the boot loader structure :

size name description

4 timestamp date when the boot image was created

4 bootimage_size equivalent to the number of sectors for the blv found in the bootrecord

4 boot_loader_size size of the aixmon in bytes

4 inst_offset jump offset in boot image

4 rmalloc_size Percent of memory for kernel heap

4 reserved1

4 reserved2

4 reserved3

example Use the following command to display the ELF structure:

# dd if=<boot_disk> bs=512 skip=<RBA> count=1 |od -Xa -x

[Figure: hex dump of the ELF boot header, with the regions elf_hdr, note_phdr, load_phdr1, load_phdr2, note_data, BL_parms_data and the aixmon entry point marked.]
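The boot loader parameters are eight 4-byte fields, so they can be decoded in the same way as the other headers. The Python sketch below assumes big-endian words and, as a further assumption not confirmed by the table, that timestamp is a count of seconds since 1 January 1970 UTC. The names `BL_PARMS_FIELDS` and `parse_bl_parms` are illustrative.

```python
import struct
from datetime import datetime, timezone

BL_PARMS_FIELDS = (
    "timestamp", "bootimage_size", "boot_loader_size", "inst_offset",
    "rmalloc_size", "reserved1", "reserved2", "reserved3",
)


def parse_bl_parms(data, offset=0):
    """Decode the 32-byte boot loader parameters block at the given offset."""
    parms = dict(zip(BL_PARMS_FIELDS, struct.unpack_from(">8I", data, offset)))
    # Assumption: the timestamp is in seconds since the Unix epoch.
    parms["created"] = datetime.fromtimestamp(parms["timestamp"], tz=timezone.utc)
    return parms
```

The `boot_loader_size` value returned here is the size that the exercise adds to the start of the boot image to skip over the softros.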


Exercise

Introduction This exercise shows how to locate the different parts of the boot image using the boot record.

Procedure Use the following procedure to locate the main parts of the boot image.

Step Action

1 Locate the boot disk using :
# bootinfo -b

2 Determine the architecture of your system using :
# bootinfo -p

3 Find the boot record located at the beginning of the disk found in step 1 using :
# dd if=<boot_disk> bs=512 count=1 |od -Ax -x

4 • On RSPC or CHRP, locate the RBA and sectors in the boot partition table from the output of step 3.

• On RS6K, locate the boot_prg_start and boot_code_length in the record.

5 Create a file using the offset and sectors length found in step 4 using :
# dd if=<boot_disk> bs=512 skip=<offset> count=<sectors> of=/tmp/myfile

6 Using the what command, try to find what is included in this file. What is missing from the what output? Why?

7 Create a file using the offset and sectors length found in step 4 plus the size of the boot_loader :
# dd if=<boot_disk> bs=512 skip=<(offset*512)+boot_loader_size> count=512 of=/tmp/myfile2

8 What is myfile2?

9 Using the results from step 3, locate the base customization sector start and length, and use these values to create a new file :
# dd if=<boot_disk> bs=512 skip=<base_cn_start> count=<base_cn_length> of=/tmp/myfile3

10 Create a directory <dir1>, copy /etc/objrepos/* to dir1, then run :
# /usr/lib/boot/restbase -o myfile3 -d dir1 -v
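The dd invocations in steps 5 and 9 copy a run of 512-byte sectors out of the boot disk. For readers who want to repeat the extraction on a saved disk image, the following Python sketch does the same thing; `extract_sectors` is an illustrative helper, not an AIX tool, and the file names in the usage note are placeholders.

```python
SECTOR_SIZE = 512


def extract_sectors(source, start_sector, sector_count, dest):
    """Copy sector_count sectors starting at start_sector from source to dest.

    Equivalent in effect to:
        dd if=source bs=512 skip=start_sector count=sector_count of=dest
    Returns the number of bytes actually written.
    """
    remaining = sector_count * SECTOR_SIZE
    written = 0
    with open(source, "rb") as src, open(dest, "wb") as out:
        src.seek(start_sector * SECTOR_SIZE)
        while remaining > 0:
            chunk = src.read(min(remaining, 1 << 20))
            if not chunk:  # source ended before sector_count sectors
                break
            out.write(chunk)
            written += len(chunk)
            remaining -= len(chunk)
    return written
```

With the values from the chrp example, `extract_sectors("disk.img", 0x3cdc, 0x100, "myfile3")` would copy the normal mode base customization area.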


Instructor Notes

Purpose Notes on boot record and image exercise

Details Step 6 should output something like :

07 1.3 src/rspc/usr/lib/boot/aixmon_chrp/cl_in_services.c, chrp_softros, rspc500, 0025A_500 10/22/98 14:25:39
04 1.32 src/rspc/usr/lib/boot/aixmon_chrp/aixmon_chrp.c, chrp_softros, rspc500, 0026A_500 6/16/00 12:43:25
09 1.2 src/rspc/usr/lib/boot/aixmon_chrp/printf.c, chrp_softros, rspc500, 0025A_500 1/13/99 10:38:02
08 1.40 src/rspc/usr/lib/boot/aixmon_chrp/iplcb_init.c, chrp_softros, rspc500, 0029A_500 7/17/00 14:07:11
39 1.5 src/rspc/usr/lib/boot/aixmon_chrp/numa_topo.c, chrp_softros, rspc500, 0028A_500 6/7/00 08:11:21
48 1.1 src/rspc/usr/lib/boot/aixmon_chrp/rtas_func.c, chrp_softros, rspc500, 0026A_500 6/16/00 13:04:32
65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0025A_500 4/14/00 14:26:38

This output reflects the presence of the softros (aixmon_chrp) and of the bootexpand code.

The kernel and the ramfs are missing because they are stripped and therefore unreadable by the what command.
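To see why some components show up in the what output and others do not, it helps to recall how what works: it scans the file for SCCS identification strings, which begin with the marker @(#). The following Python sketch imitates that scan. It assumes the POSIX rule that an identification string ends at the first double quote, >, newline, backslash or NUL character. The function name what_strings and the sample bytes are illustrative only and are not taken from a real boot image.

```python
import re

# An identification string starts with "@(#)" and runs up to the first
# '"', '>', newline, backslash or NUL byte (the POSIX rule for what).
SCCS_ID = re.compile(rb'@\(#\)([^"\x00>\n\\]*)')


def what_strings(data):
    """Return the SCCS identification strings found in a bytes object."""
    return [m.group(1).decode("ascii", "replace") for m in SCCS_ID.finditer(data)]


# Made-up sample: one readable ID string embedded in binary data,
# followed by bytes that carry no marker at all.
sample = (b"\x7fELF\x00\x01"
          b"@(#)65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot\x00"
          b"\x8f\x02\x11 opaque bytes with no marker \x00")

for line in what_strings(sample):
    print(line)
```

A component whose identification strings are not present as plain text in the image produces no match, which is consistent with the kernel and the ramfs being absent from the what output.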

Step 8 should output something like :

# what /tmp/myfile2

/tmp/myfile2:

65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0

After completing step 10, students should observe that the following files were updated by the restbase command. This confirms that myfile3 is the base customization area.

-rw-r--r-- 1 root system 32768 Aug 23 15:51 CuDvDr
-rw-r--r-- 1 root system 4096 Aug 23 15:51 CuPath
-rw-r--r-- 1 root system 4096 Aug 23 15:51 CuPath.vc
-rw-r--r-- 1 root system 4096 Aug 23 15:51 CuPathAt
-rw-r--r-- 1 root system 4096 Aug 23 15:51 CuPathAt.vc
-rw-r--r-- 1 root system 16384 Aug 23 15:51 CuAt
-rw-r--r-- 1 root system 8192 Aug 23 15:51 CuAt.vc
-rw-r--r-- 1 root system 4096 Aug 23 15:51 CuDep
-rw-r--r-- 1 root system 16384 Aug 23 15:51 CuVPD
-rw-r--r-- 1 root system 12288 Aug 23 15:51 CuDv


Power ROS and Softros

ROS On RS6K platforms, the hardware ROS performs basic hardware configuration and tests, and creates the IPL Control Block before transferring control to the kernel’s entry point.

Softros The RSPC and CHRP families of computers require a boot image containing special software known as SOFTROS, which provides functions that AIX requires but that the hardware firmware does not provide. The SOFTROS performs basic hardware configuration and tests, and sets up data structures so that the environment presented to AIX more closely resembles the one provided by the RS6K system ROS. On CHRP systems, the firmware device tree is also appended to the IPL Control Block. The Softros then transfers control to the kernel’s entry point.


IPLCB on Power

Definition The IPLCB (Initial Program Load Control Block) defines the RAM-resident interface between the IPL boot process and the operating system.

The ROS (on RS6K platforms) or the Softros initializes the IPLCB structure using interfaces to the ROS or to the firmware.

Once loaded, the kernel uses the IPLCB structure to initialize its runtime structures.

IPLCB Description

The IPLCB contains the following structures (described in /usr/include/sys/iplcb.h) :

• IPLCB Directory : contains the IPLCB ID and pointers (offset and size) to the IPLCB Data

• IPLCB Data, such as :

  • processor information ('ipl -proc [cpu]')

  • memory region ('ipl -mem')

  • system information ('ipl -sys')

  • user information ('ipl -user')

  • NUMA information ('ipl -numa')


IPLCB directory example on a CHRP system

The following screen output shows the IPLCB directory on a CHRP system, captured using the kdb iplcb -dir sub command :

IPL directory [10000080]
ipl_control_block_id.........ROSIPL
ipl_cb_and_bit_map_offset...00000000 ipl_cb_and_bit_map_size....00008898
bit_map_offset..............000087A8 bit_map_size...............00000007
ipl_info_offset.............000002E8 ipl_info_size..............00000598
iocc_post_results_offset....00000000 iocc_post_results_size.....00000000
nio_dskt_post_results_offset00000000 nio_dskt_post_results_size.00000000
sjl_disk_post_results_offset00000000 sjl_disk_post_results_size.00000000
scsi_post_results_offset....00000000 scsi_post_results_size.....00000000
eth_post_results_offset.....00000000 eth_post_results_size......00000000
tok_post_results_offset.....00000000 tok_post_results_size......00000000
ser_post_results_offset.....00000000 ser_post_results_size......00000000
par_post_results_offset.....00000000 par_post_results_size......00000000
rsc_post_results_offset.....00000000 rsc_post_results_size......00000000
lega_post_results_offset....00000000 lega_post_results_size.....00000000
keybd_post_results_offset...00000000 keybd_post_results_size....00000000
ram_post_results_offset.....00000000 ram_post_results_size......00000000
sga_post_results_offset.....00000000 sga_post_results_size......00000000
fm2_post_results_offset.....00000000 fm2_post_results_size......00000000
net_boot_results_offset.....00000000 net_boot_results_size......00000000
csc_results_offset..........00000000 csc_results_size...........00000000
menu_results_offset.........00000000 menu_results_size..........00000000
console_results_offset......00000000 console_results_size.......00000000
diag_results_offset.........00000000 diag_results_size..........00000000
rom_scan_offset.............00000000 rom_scan_size..............00000000
sky_post_results_offset.....00000000 sky_post_results_size......00000000
global_offset...............00000000 global_size................00000000
mouse_offset................00000000 mouse_size.................00000000
vrs_offset..................00000000 vrs_size...................00000000
taur_post_results_offset....00000000 taur_post_results_size.....00000000
ent_post_results_offset.....00000000 ent_post_results_size......00000000
vrs40_offset................00000000 vrs40_size.................00000000
gpr_save_area1............@ 10000178
system_info_offset..........00000880 system_info_size...........0000009C
buc_info_offset.............0000091C buc_info_size..............00000150
processor_info_offset.......00000A6C processor_info_size........00000310
fm2_io_info_offset..........00000000 fm2_io_info_size...........00000000
processor_post_results_off..00000000 processor_post_results_size00000000
system_vpd_offset...........00000000 system_vpd_size............00000000
mem_data_offset.............00000000 mem_data_size..............00000000
l2_data_offset..............00000D7C l2_data_size...............000000C0
fddi_post_results_offset....00000000 fddi_post_results_size.....00000000
golden_vpd_offset...........00000000 golden_vpd_size............00000000
nvram_cache_offset..........00000000 nvram_cache_size...........00000000
user_struct_offset..........00000000 user_struct_size...........00000000
residual_offset.............00000E3C residual_size..............0000776C
numatopo_offset.............00000E3C numatopo_size..............00000000
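Most entries of this directory are empty on a CHRP system, so it can be useful to reduce the dump to the sections that are actually populated. The following Python sketch does this for a few lines copied from the output above. It is not part of kdb: the regular expression, the function name parse_directory and the choice of excerpt are assumptions made for illustration, based only on the "name_offset...HEX name_size...HEX" layout visible in the dump.

```python
import re

# A few lines copied from the dump above; the other lines parse the same way.
DUMP = """\
ipl_cb_and_bit_map_offset...00000000 ipl_cb_and_bit_map_size....00008898
bit_map_offset..............000087A8 bit_map_size...............00000007
ipl_info_offset.............000002E8 ipl_info_size..............00000598
iocc_post_results_offset....00000000 iocc_post_results_size.....00000000
system_info_offset..........00000880 system_info_size...........0000009C
buc_info_offset.............0000091C buc_info_size..............00000150
processor_info_offset.......00000A6C processor_info_size........00000310
l2_data_offset..............00000D7C l2_data_size...............000000C0
residual_offset.............00000E3C residual_size..............0000776C
"""

# <name>_offset<dots><8 hex digits>  <name>_size<dots><8 hex digits>
# "_off" without "set" is accepted because one label in the dump is truncated.
ROW = re.compile(r"^(\w+?)_off(?:set)?\.*([0-9A-F]{8})\s+\w+_size\.*([0-9A-F]{8})$")


def parse_directory(text):
    """Map each section name to its (offset, size) pair, both as integers."""
    sections = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            sections[m.group(1)] = (int(m.group(2), 16), int(m.group(3), 16))
    return sections


sections = parse_directory(DUMP)
total = sections.pop("ipl_cb_and_bit_map")[1]   # size of the whole IPLCB

# Keep only the sections with a non-zero size, ordered by their position.
present = sorted((off, size, name) for name, (off, size) in sections.items() if size)
for off, size, name in present:
    assert off + size <= total                  # each section fits in the block
    print(f"{name:<16} offset=0x{off:05X} size=0x{size:05X} end=0x{off + size:05X}")
```

In this dump each populated section starts where the previous one ends. For example, ipl_info ends at 0x880, which is exactly where system_info begins.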


Checkpoint

Introduction Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz 1. Where is the softros located ?

2. What are the four parts of the boot image that are common across Power platforms ?

3. What is the difference between RSPC and CHRP at the very beginning of the boot image ?

4. In which logical volume is the boot record located ?

5. Who builds the IPLCB on the three Power platforms ?

6. What is the difference between RS6K and the other Power architectures in the boot record ?


Instructor Notes

Purpose Notes on Quiz and transition to the next section

Quiz responses

The responses for the Quiz are :

1. Where is the softros located ?

• After the header in the boot logical volume

2. What are the four parts of the boot image that are common across Power platforms ?

• bootexpand

• kernel

• ramfs

• saved base

3. What is the difference between RSPC and CHRP at the very beginning of the boot image ?

• RSPC uses a hints structure

• CHRP uses an ELF header

4. In which logical volume is the boot record located ?

• None, the boot record is located at the very beginning of the disk

5. Who builds the IPLCB ?

• ROS on RS6K

• Softros on CHRP and RSPC

6. What is the difference between RS6K and the other Power platforms in the boot record ?

• The RS6K doesn’t use the boot partition table

Transition Statement

Now we will describe:

• the IA-64 specific boot process, unless this is a Power-only course


The IA-64 Boot Mechanism

Introduction This section explains the boot mechanism used by the IA-64 platform.

Definitions EFI stands for Extensible Firmware Interface. EFI provides a standard interface between the hardware and the operating system on IA-64 platforms.

Boot overview When the system is powered on, EFI loads first.

EFI loads the BIOS for the devices that need one.

EFI then prompts for entry into setup for a timeout period.

EFI then displays the EFI boot menu for another timeout period, after which it scans the bootlist to find a boot device.

The EFI boot loader prompts for the boot loader menu and, after the timeout or on exit from the menu, initializes the IPL Control Block.

It then locates and loads the kernel, which initializes itself.

The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).

ssh then calls rc.boot for Phase I and Phase II, which are specific to each boot device type.

init then executes rc.boot Phase III and the remaining code in rc.boot, which is common to disk and network boot devices.

If no boot device is found, EFI starts the EFI Shell on IA-64 platforms that support it.


Boot diagram The following diagram represents the high-level boot process overview.

[Figure: flowchart of the IA-64 boot process. Recoverable box and decision labels: execution of EFI firmware, load needed BIOS; prompt for Setup; key entered during timeout (yes/no); setup menu; EFI boot manager menu; boot maintenance manager request (yes/no); boot maintenance manager menu; scan the boot list; valid boot device found (yes/no); EFI Shell; AIX boot loader; AIX boot loader menu request (yes/no); timeout or OS boot prompt; kernel initialization; kernel calls init (/usr/lib/boot/ssh); init ssh calls rc.boot Phase I and II; init exits to newroot; init calls rc.boot Phase III from inittab, then the rest of the inittab entries]


IA-64 boot disk layout

Boot image overview

The following figure gives an overview of an AIX 5L boot image on IA-64 :

[Figure: layout of the boot disk hdisk0_all. Recoverable labels: PMBR, EFI Partition Header and entries; hdisk0; VGDA; hd5; kernel; RAM Filesystem; base customized data; rest of the hdisk0; hdisk0_s0; EFI boot loader]

PMBR, EFI Partition Header and entries

On the IA-64 platform, AIX 5L must be aware of EFI disk partitioning.

During installation, two partitions are created on the target disk (hdisk0_all) :

• A Physical Volume partition (hdisk0 in the AIX environment), known as a block device in the EFI environment (blkXX).

• An IA-64 System partition (hdisk0_s0 in the AIX environment), known as an IA-64 System partition in the EFI environment (fsXX).

kernel On the IA-64 platform, the 64-bit kernel (unix_ia64) can be used on either UP or MP systems. The kernel initializes itself and then passes control to the simple shell init (ssh) in the RAM filesystem.


RAM filesystem

Filesystem used during the boot process. It contains the programs and data needed to initialize devices and subsystems in order to install AIX, execute diagnostics, or access and bring up the rest of AIX.

base customized data

Area of the hard disk boot logical volume containing user configured ODM device configuration information that is used by the system configuration process.

EFI boot loader

The EFI boot loader resides in an IA-64 System Partition, which the installation process places physically after the Physical Volume Partition.


EFI boot manager and boot maintenance manager overview

Introduction At boot time, EFI prompts for entry into the EFI boot manager menu for a timeout period.

The timeout period can be customized from the boot maintenance menu.

boot manager At boot time, the boot manager displays the bootlist and waits for a timeout period.

If the timeout is reached, the boot manager scans the bootlist in boot order to find a valid boot device.

If a key is entered before the timeout expires, the user can :

• select a boot device from the list to boot from for this session

• start the EFI Shell, on platforms that support it

• enter the boot maintenance manager
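The decision logic described above can be summarized in a short Python sketch. It is a model of the text only, not of any EFI interface: the function name boot_manager, its parameters and the strings it returns are hypothetical, and the device names in the usage notes are sample values.

```python
def boot_manager(bootlist, is_valid, key_pressed=False, choice=None,
                 shell_supported=True):
    """Return the action taken by the boot manager, following the text above.

    bootlist        -- boot options, in boot order
    is_valid        -- callable telling whether an option is a valid boot device
    key_pressed     -- True if a key was entered before the timeout expired
    choice          -- the menu selection made by the user (only if key_pressed)
    shell_supported -- True on platforms that provide the EFI Shell
    """
    if key_pressed:
        # The user interrupted the countdown and selects from the menu.
        if choice == "shell" and shell_supported:
            return "EFI Shell"
        if choice == "maintenance":
            return "boot maintenance manager"
        return f"boot {choice}"

    # Timeout reached: scan the bootlist in boot order.
    for option in bootlist:
        if is_valid(option):
            return f"boot {option}"

    # No valid boot device: fall back to the EFI Shell where it exists.
    return "EFI Shell" if shell_supported else "no boot device"
```

For example, boot_manager(["cd0", "hdisk0"], lambda d: d == "hdisk0") skips the first entry and boots the second, while an empty or entirely invalid bootlist ends in the EFI Shell on platforms that provide one.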

boot maintenance manager menu

The boot maintenance manager menu will allow the administrator to :

• boot from a file

• add/delete boot options

• change boot order

• manage boot next setting

• set autoboot timeout

• select the active console devices (output, input and error)

• do a cold reset.


EFI Shell Overview

Introduction The EFI Shell allows you to configure the boot process used by the IA-64 platform. Its main functions are to :

• Locate and identify the different boot devices

• Set environment variables

• Use debugging sub commands

• Boot from the selected boot device

EFI Shell startup example

At startup, the EFI Shell displays information about the current EFI level and the device mapping, as follows :

EFI version x.xx [xx.xx] Build flags : EFI64 Running on Merced EFI_DEBUG
EFI IA-64 SDV/FDK (BIOS CallBacks) [Fri Mar 31 13:21:32 2000] - INTEL
Cache Enabled. This image Main entry is at address 000000003F2BA000
Stack = 000000003F2B6FF0 BSP = 000000003F293000
INT Stack = 000000003F292FF0 INT BSP = 000000003F26F000
EFI Shell version x.xx [xx.xx]
Device mapping table
fs0 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)

EFI Shell sub commands

The following sub commands are available in the EFI Shell :

sub command                       Description
help [internal command]           Display this help
guid [sname]                      Dump known guid ids
set [-d] [sname] [value]          Set/get environment variable
alias [-d] [sname] [value]        Set/get alias settings
dh [-p prot_id] | [handle]        Dump handle info
map [-dvr] [sname[:]] [handle]    Map shortname to device path
mount BlkDevice [sname[:]]        Mount a filesystem on a block device
cd [path]                         Updates the current directory
echo [[-on | -off] | [text]]      Echo text to stdout or toggle script echo
endfor                            Script-only: Delimiter for loop construct
pause                             Script-only: Prompt to quit or continue


ls [dir] [dir]...                 Obtain directory listing
mkdir [dir] [dir]...              Make directory
if [not] condition then           Script-only: IF THEN construct
endif                             Script-only: Delimiter for IF THEN construct
goto label                        Script-only: Jump to label location in script
for var in <set>                  Script-only: Loop construct
mode [row col]                    Set/get current text mode
cp file [file] ... dest           Copy files/directories
comp file1 file2                  Compare two files
rm file/dir [file/dir]            Remove file/directories
memmap                            Dumps memory map
type [-a] [-u] file               Type file
dmpstore                          Dumps variable store
load driver_name                  Loads a driver
ver                               Displays version info
err [level]                       Set or display error level
time [hh:mm:ss]                   Set or display time
date [mm/dd/yyyy]                 Set or display date
stall microseconds                Delay for x microseconds
reset [/warm] [reset string]      Cold or Warm reset
vol fs [Volume Label]             Set or display volume label
attrib [+/- rhs] [filename]       View/sets file attributes
cls [background color]            Clear screen
dblk device [Lba] [Blocks]        Hex dump of BlkIo Devices
pci [bus dev] [func]              Display pci device(s) info
mm Address [Width] [;Type]        Memory modify: Mem, MMIO, IO, PCI
mem [Address] [size] [;MMIO]      Dump Memory or Memory Mapped IO


bcfg -?                           Configures boot driver & load options
edit [file name]
Edd30 [On|Off]                    Enable or Disable EDD 3.0 Device paths
unload [-nv]
EddDebug [blockdevicename]        Debug of EDD info from adapter card

EFI Shell examples

The following is an example of the EFI Shell use :

Shell> map                        <== show the current device mapping
fs0 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
Shell> pci                        <== list the pci devices
Bus Dev Func Description
  0   0    0 ==> Generic System Peripheral - Interrupt Controller
             Vendor 0x8086 Device 0x123D Program Interface 20
  0   2    0 ==> Mass Storage Controller - SCSI Bus
             Vendor 0x1077 Device 0x1280 Program Interface 0
  0   3    0 ==> PCI Bridge Device - ISA
             Vendor 0x8086 Device 0x7600 Program Interface 0
  0   3    1 ==> Mass Storage Controller - IDE
             Vendor 0x8086 Device 0x7601 Program Interface 80
  0   3    2 ==> Serial Bus Controller - USB
             Vendor 0x8086 Device 0x7602 Program Interface 0
  0   3    3 ==> Serial BUS Controller - SMBUS
             Vendor 0x8086 Device 0x7603 Program Interface 0
...
Shell> fs0:                       <== change to fs0
fs0:>dir                          <== list the content of fs0
XX/XX/XX 01:05p <DIR>         512 aix
XX/XX/XX 01:10p           279,792 a.out
XX/XX/XX 01:11p            23,636 boot.efi
fs0:>boot                         <== boot from fs0
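The device mapping printed by map shows the same partition under two names when a filesystem has been recognized on it. The following Python sketch, which is an illustration and not an EFI tool, parses the mapping from the example and pairs each fsN name with the blkN name that has an identical device path. The function name parse_map and the regular expression are assumptions based only on the "name : device path" layout of the output.

```python
import re

# Device mapping copied from the EFI Shell example.
MAP_OUTPUT = """\
fs0 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
"""

# "<short name> : <device path>"
ENTRY = re.compile(r"^(\w+)\s*:\s*(\S.*)$")


def parse_map(text):
    """Return a dict mapping each short name to its device path."""
    mapping = {}
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            mapping[m.group(1)] = m.group(2)
    return mapping


mapping = parse_map(MAP_OUTPUT)

# A filesystem name (fsN) and a block name (blkN) that share one device
# path refer to the same partition.
for name, path in mapping.items():
    if name.startswith("fs"):
        twins = [b for b, p in mapping.items()
                 if b.startswith("blk") and p == path]
        print(f"{name} is also visible as {', '.join(twins)}")
```

With the mapping from the example, fs0 pairs with blk4 (partition 1), which leaves blk5 (partition 2) as the only other partition on that disk.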


IA-64 Boot Loader

Introduction The AIX 5L EFI boot loader provides the interface between EFI and the kernel.

On disk drives, the AIX boot loader is located in the system partition.

Before loading the kernel, the boot loader prompts the user to enter the boot loader menu.

The boot loader then uses the EFI interfaces to initialize the IPL Control Block.

Next, the boot loader locates the kernel, which resides in hd5, itself contained in the AIX PV partition.

Finally, the boot loader passes control to the kernel entry point.

boot loader and EFI interactions

The boot loader uses the EFI boot services to load file images such as the kernel, the RAM filesystem and the base customized data, and to locate system tables such as the System Abstraction Layer (SAL) System Table (SST) and the Advanced Configuration and Power Interface (ACPI) Specification Tables. The boot loader then creates the Initial Program Load Control Block (IPLCB) and sets up the Translation Registers (TR) before transferring control to the kernel’s entry point.

EFI boot loader menu

The boot loader menu can be used to set parameters that affect kernel loading and the operating environment, such as :

• enable the kernel debugger

• invoke the kernel debugger

• override RMALLOC memory reservation

• set boot loader debug flag

• set service/diagnostics flag

• select the amount of memory to enable

• select the number of CPUs to use

• toggle Single/Multi dispersal mode


IA-64 Initial Program Load Control Block

Introduction The IPLCB (Initial Program Load Control Block) defines the RAM-resident interface between the IPL boot process and the operating system. The boot loader initializes the IPLCB structure using interfaces to EFI.

Once loaded, the kernel uses the IPLCB structure to initialize its runtime structures.

IPLCB Description

The IPLCB contains the following structures (described in /usr/include/sys/iplcb.h) :

• IPLCB Directory : contains the IPLCB ID and pointers (offset and size) to the IPLCB Data

• IPLCB Data, such as :

  • IPLCB hand-off information

  • IPLCB IPL information

  • IPLCB system information

  • IPLCB processor information

  • I/O XAPIC information

  • Memory information and memory regions

IPLCB directory example on an IA-64 system

The following screen shows the IPLCB Directory on an IA-64 system, captured using the IADB iplcb -dir sub command :

> iplcb -dir
Directory Information
ipl_control_block_id......................= IA64_IPL
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0
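Because the directory only stores offsets and sizes, a quick consistency check is to sort the sections by offset and verify that none overlaps its neighbor or runs past the end of the control block. The Python sketch below does this for the IA-64 directory shown above. It is an illustration, not an IADB feature: the names parse_sections and FIELD are hypothetical, and pairing each offset line with the size line that follows it is an assumption drawn from the layout of the output (needed because platform_int_info_offset is followed by platform_int_size).

```python
import re

# Offset and size lines copied from the IA-64 "iplcb -dir" output.
DUMP = """\
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0
"""

# "<field name><dots>= 0x<hex value>"
FIELD = re.compile(r"^(\w+?)\.*=\s*(0x[0-9A-Fa-f]+)$")


def parse_sections(text):
    """Map each section name to (offset, size), pairing consecutive lines."""
    matches = [FIELD.match(line.strip()) for line in text.splitlines()]
    fields = [(m.group(1), int(m.group(2), 16)) for m in matches if m]
    sections = {}
    for (oname, off), (sname, size) in zip(fields[0::2], fields[1::2]):
        assert oname.endswith("_offset") and sname.endswith("_size")
        sections[oname[:-len("_offset")]] = (off, size)
    return sections


sections = parse_sections(DUMP)
total = sections.pop("ipl_cb_and_bit_map")[1]   # size of the whole IPLCB

# Populated sections as (start, end, name), ordered by start offset.
layout = sorted((off, off + size, name)
                for name, (off, size) in sections.items() if size)

previous_end = 0
for start, end, name in layout:
    assert start >= previous_end, f"{name} overlaps the previous section"
    assert end <= total, f"{name} runs past the end of the IPLCB"
    print(f"0x{start:03X}-0x{end:03X}  {name}")
    previous_end = end
```

On this dump the sections appear in the order ipl_info, handoff_info, processor_info, system_info, io_xapic_info, platform_int_info, bit_map, with no overlap, and all of them end well before the 0x7F0 total size.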


Checkpoint

Introduction Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz 7. In which partition is the AIX boot loader located ?

8. What is the equivalent of the fs0 partition in the AIX environment ?

9. In which partition is the IA-64 boot record located ?

10. In which partition is the IA-64 boot image located ?

11. Where is bootexpand located on IA-64 ?


Instructor Notes

Purpose Checkpoint for rc.boot results

Answers 7. In which partition is the AIX boot loader located ?

• The boot loader is located in fs0:

8. What is the equivalent of the fs0 partition in the AIX environment ?

• The equivalent is hdiskxx_s0

9. In which partition is the IA-64 boot record located ?

• There is no boot record on IA-64

10. In which partition is the IA-64 boot image located ?

• The boot image is located in hd5, which in fact resides in the rootvg PV partition of the disk (blk5 in our example)

11. Where is bootexpand located on IA-64 ?

• There is no bootexpand on IA-64


Hard Disk Boot process (rc.boot Phase I)

Introduction The main goal of Phase I is to get the devices configured and the ODM initialized.

Hard disk Phase I diagram

The following chart represents the hard disk boot Phase I process.

[Figure: flowchart of hard disk boot Phase I. Recoverable labels: restore base configuration from boot disk; restbase return code (<>0: led 548; 0: continue); led 510; configuration manager Phase I; led 511; run bootinfo -b to get boot device; link boot device to /dev/ipldevice; exit 0]


Hard Disk Boot process (rc.boot Phase II)

Introduction The main objective of hard disk boot Phase II is to vary on rootvg and mount the standard filesystems.

Hard disk Phase II diagram

The following chart represents the hard disk boot Phase II process.

[Figure: flowchart of hard disk boot Phase II. Recoverable labels: led 511; ipl_varyon -v; ipl varyon return code (<>0: led 552, 554 or 556; 0: continue); led 517; fsck and mount aix filesystems on /mnt; check for dump in hd6, swapon hd6 if no dump present, run savebase recovery procedure; service key or dump in hd6 (yes: execute the service procedure; no: continue); copy /etc/vg and objrepos to disk, merge devices; led 553; unmount filesystems, remount filesystems; exit 0]


Hard Disk Boot process (rc.boot Phase III)

Introduction

The main objective in hard disk boot phase III is to mount the runtime /tmp, synchronize rootvg and then fall through to the common phase III process.

Hard disk Phase III diagram

The following chart represents the hard disk boot phase III process:

• fsck and mount /tmp

• syncvg rootvg

• continue phase III common code


CDROM Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the CDROM boot process is to configure the devices needed for installation and maintenance procedures and to start the bi_main process.

CDROM boot phases I, II and III diagram

The following chart shows the CDROM boot phases I, II and III. The chart's boxes, listed without their connecting arrows:

• test on the phase number (1, 2 or 3)

• configuration manager Phase I

• led 517

• Mount the cdrom spot

• led 512

• recreate the ramfs from the SPOT

• led 510

• configure remaining devices needed for install

• led 511

• exec bi_main

• exit 0 (appears three times)


Tape Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the Tape boot process is to configure the devices needed for installation and maintenance procedures and to start the bi_main process.

Tape boot phases I, II and III diagram

The following chart shows the Tape boot phases I, II and III. The chart's boxes, listed without their connecting arrows:

• test on the phase number (1, 2 or 3)

• configuration manager Phase I

• led 510

• configuration manager Phase II

• led 512

• Change all tape devices block_sizes to 512

• Cleanup links, Cleanup ODM and rebuild

• exec bi_main

• exit 0 (appears three times)


Network Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the Network boot process is to configure devices, configure additional network options (network address, mask and default route) and run the $RC_CONFIG script.

Network boot phases I, II and III diagram

The following chart shows the Network boot phases I, II and III. The chart's boxes, listed without their connecting arrows:

• test on the phase number (1, 2 or 3)

• set nim debug if needed (appears twice)

• led 600

• test: boot from atm0? (yes / no, appears twice)

• configuration manager phase I

• configure ATM pvc, svc and muxatmd

• configure the native network boot device (ifconfig)

• test: rc = 0? (yes / no, appears twice), with led 607

• tftp miniroot, set nim environment, create /etc/hosts and routes, nfs mount the SPOT, run $RC_CONFIG from SPOT

• restbase, save ATM data, Clear ODM

• set nim environment, run $RC_CONFIG

• continue phase III common code

• exit 0 (appears three times)


Common Boot process (rc.boot Phase III)

Introduction

The common Phase III boot code is run for disk and network boots only.

Common boot Phase III diagram

The following chart shows the common boot phase III process. The chart's boxes, listed without their connecting arrows:

• ensure 1024K free space in /tmp

• load streams modules

• fix the secondary dump device

• swapon hd6 if no dump present

• run savebase recovery procedure

• test: key is in service? (yes / no)

• config manager phase III, disable controlling tty

• clean odm for alt disk install, config manager phase II

• setup System Hang Detection

• run graphical boot if needed

• run savebase

• clean unavailable tty from inittab

• sync the files to hard disk

• run /etc/rc.B1 if exists

• start the syncd daemon

• start the errdaemon daemon

• clean /etc/locks and /etc/nologin

• start mirrord daemon

• start cfgchk daemon

• run diagsrv if supported by platform

• System initialization completed

• exit 0
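The first element of the chart, making sure that /tmp has at least 1024 KB free, could be checked along the following lines. This is a minimal sketch and not the rc.boot code itself: the function name has_free_kb is hypothetical, and the sketch assumes a df command that supports the POSIX -P and -k flags.

```shell
#!/bin/sh
# Hypothetical helper: succeed when directory $1 has at least $2 KB available.
# Assumes a POSIX df (-P: portable one-line-per-filesystem format,
# -k: sizes reported in 1024-byte units).
has_free_kb() {
    dir=$1
    min_kb=$2
    # Field 4 of the second output line is the available space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

if has_free_kb /tmp 1024; then
    echo "/tmp has at least 1024 KB free"
else
    echo "/tmp has less than 1024 KB free"
fi
```

What the boot code does when the space is missing is not described in the chart, so the sketch only reports the result.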


Network boot $RC_CONFIG files

Introduction

As seen in the Network Boot Process (Phases I, II and III), these scripts are run by rc.boot in phases I and II when booting from a network device.

These scripts are located in the /usr/lib/boot/network directory.

They are loaded from the SPOT on the NIM server during the network boot process.

rc.config types

There are 3 types of rc.config files:

• rc.bos_inst : Used to configure a system for AIX installation

• rc.dd_boot : Used for network boot of diskless or dataless systems

• rc.diag : Used for booting to diagnostics

rc.bos_inst

This script will:

• Phase I:

    • Mount the resources listed in niminfo as ${NIM_MOUNTS}

    • Enable NIM debug if needed

    • Link the necessary methods from the SPOT

    • Run the configuration manager

• Phase II:

    • Set some TCP/IP parameters

    • Enable diagnostics for pre-install diagnostics on disks

    • Execute bi_main


rc.dd_boot

This script will:

• Phase I:

    • Remove the link from /lib to /usr/lib and populate /lib with hard links to /usr to ensure the use of RAM libraries

    • Mount the root directory

    • Get the niminfo file

    • Unconfigure network services (ifconfig and routes)

    • Run configuration manager phase I

    • Reconfigure the network using the NIM information

    • Mount /usr

    • Activate the local or remote paging spaces

    • Issue mergedev

    • Unmount all remote filesystems

• Phase II:

    • Mount the filesystems of type dd_boot

    • Clean up unused shared libraries

    • Set the hostname

rc.diag

This script will:

• Phase I:

    • Mount the resources listed in niminfo as ${NIM_MOUNTS}

    • Enable NIM debug if needed

    • Link the necessary methods from the SPOT

    • Run the configuration manager

• Phase II:

    • Configure the console

    • If the console is graphical, configure gxme0 and rcm0

    • For RSPC and CHRP, start the errdaemon, sleep 2 and stop it, to get the errors logged since the last boot

    • Execute the diag pretest before running diag


The init process

Introduction

The init process initializes and controls AIX processes.

While the boot process is running from the RAM filesystem (Phases I and II), it does not use the real init command but /usr/lib/boot/ssh.

This strategy allows for more efficient use of the system resources during boot.

The real init is found in /usr/sbin/init. The real init begins during the kernel newroot, which occurs at the end of Phase II of rc.boot.

The real init uses the /etc/inittab file to start AIX processes and run the system environment initialization scripts.

/etc/inittab

Here is an example of the inittab file:

init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 0</dev/console >/dev/console 2>&1
powerfail::powerfail:/etc/rc.powerfail 0</dev/console >/dev/console 2>&1 # Power Failure Detection
rc:2:wait:/etc/rc 0</dev/console >/dev/console 2>&1
fbcheck:2:wait:/usr/sbin/fbcheck 0</dev/console >/dev/console 2>&1 # Run /etc/firstboot
srcmstr:2:respawn:/usr/sbin/srcmstr # System Resource Controller
rctcpip:2:wait:/etc/rc.tcpip > /dev/console 2>&1 # Start TCP/IP daemons
rcnfs:2:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
cron:2:respawn:/usr/sbin/cron
cons:0123456789:respawn:/usr/sbin/getty /dev/console
writesrv:2:wait:/usr/bin/startsrc -swritesrv
uprintfd:2:respawn:/usr/sbin/uprintfd
shdaemon:2:off:/usr/sbin/shdaemon >/dev/console 2>&1 # High availability daemon
logsymp:2:once:/usr/lib/ras/logsymptom # for system dumps
lft:2:respawn:/usr/sbin/getty /dev/lft0
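Each inittab entry consists of four colon-separated fields: an identifier, the run levels, an action and the command to run. The following sketch splits one entry into those fields using plain shell parameter expansion. The function name parse_inittab_entry is hypothetical, and any trailing # comment is simply left in the command field.

```shell
#!/bin/sh
# Hypothetical helper: split one inittab entry into its four fields.
# Only the first three colons separate fields; the command may contain colons.
parse_inittab_entry() {
    entry=$1
    id=${entry%%:*};        rest=${entry#*:}
    runlevels=${rest%%:*};  rest=${rest#*:}
    action=${rest%%:*};     cmd=${rest#*:}
    printf 'id=%s runlevels=%s action=%s command=%s\n' \
        "$id" "$runlevels" "$action" "$cmd"
}

# Two entries taken from the example inittab file.
parse_inittab_entry 'brc::sysinit:/sbin/rc.boot 3 0</dev/console >/dev/console 2>&1'
parse_inittab_entry 'cron:2:respawn:/usr/sbin/cron'
```

The brc entry shows an empty run level field together with the sysinit action, which is how rc.boot phase 3 is started by the real init.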


ODM Structure and usage

Introduction

The Object Data Manager (ODM) is widely used in AIX to store and retrieve various kinds of system information.

For this purpose, AIX defines a number of standard ODM classes.

Any application can create and use its own ODM classes to manage its own information.

AIX information managed by ODM

AIX system data managed by ODM includes:

• Device configuration information

• Display information for SMIT (menus, selectors, and dialogs)

• Vital product data for installation and update procedures

• Diagnostics information

• System resource information

• RAS information

Devices ODM Classes

The Devices classes are used by the configuration manager, device drivers and AIX device related commands (lsdev, lsattr, lspv, lsvg, ...).

The following table lists the Devices ODM classes and their definitions:

Class          Definition
PdDv           Predefined Devices
PdCn           Predefined Connection
PdAt           Predefined Attribute
PdAtXtd        Extended Predefined Attribute
Config_Rules   Configuration Rules
CuDv           Customized Devices
CuDep          Customized Dependency
CuAt           Customized Attribute
CuDvDr         Customized Device Driver
CuVPD          Customized Vital Product Data
CuPart         EFI partitions
CuPath
CuPathAt


SWVPD ODM Classes

The SWVPD classes are used by fileset related commands such as installp, instfix, lslpp and oslevel.

SWVPD is divided into 3 parts:

• root : classes are in /etc/objrepos

• usr : classes are in /usr/lib/objrepos

• share : classes are in /usr/share/lib/objrepos

The following table lists the Software Vital Product Data ODM classes and their definitions:

Class Definition

lpp The lpp object class contains information about the installed software products, including the current software product state.

inventory The inventory object class contains information about the files associated with a software product.

history The history object class contains historical information about the installation and updates of software products.

product The product object class contains product information about the installation and updates of software products and their prerequisites.

SRC ODM Classes

The SRC classes are used by srcmstr and the related commands lssrc, startsrc, stopsrc and chssys.

The following table lists the System Resource Controller ODM classes and their definitions:

Class Definition

SRCsubsys The subsystem object class contains the descriptors for all SRC subsystems. A subsystem must be configured in this class before it can be recognized by the SRC.

SRCsubsvr An object must be configured in this class if a subsystem has subservers and the subsystem expects to receive subserver-related commands from the srcmstr daemon.

SRCnotify This class provides a mechanism for the srcmstr daemon to invoke subsystem-provided routines when the failure of a subsystem is detected.

SRCextmeth


SMIT ODM Classes

The SMIT ODM classes are used by the smit and smitty commands.

The following table lists the SMIT ODM classes and their definitions:

Use             Class         Definition
smit menu       sm_menu_opt   1 for title of screen, 1 for first item, 1 for second item, 1 for last item
smit selector   sm_name_hdr   1 for title of screen and other attributes
smit selector   sm_cmd_opt    1 for entry field or pop-up list
smit dialog     sm_cmd_hdr    1 for title of screen and command string
smit dialog     sm_cmd_opt    1 for first entry field, 1 for second entry field, ..., 1 for last entry field

RAS ODM Classes

The RAS classes are used by the errdaemon, shdaemon, shconf and alog commands.

The following table lists the RAS ODM classes and their definitions:

Class       Definition
errnotify   Used by the error log notification process
SWservAt    Used by the error log, system dumps, System Hang Detection and alog


Diagnostics ODM Classes

The diagnostics classes are used by the diag command.

The following table lists the Diagnostics ODM classes and their definitions:

Class        Definition
PDiagRes     Predefined Diagnostic Resource Object Class
PDiagAtt     Predefined Diagnostic Attribute Device Object Class
PDiagTask    Predefined Diagnostic Task Object Class
CDiagAtt     Customized Diagnostic Attribute Object Class
TMInput      Test Mode Input Object Class
MenuGoal     Menu Goal Object Class
FRUB         Fru Bucket Object Class
FRUs         Fru Reporting Object Class
DAVars       Diagnostic Application Variables Object Class
PDiagDev     Predefined Diagnostic Devices Object Class
DSMOptions   Diagnostic Supervisor Menu Options Object Class

ODM commands

The following table lists the ODM commands and their usage:

Command Definition

odmadd Adds objects to an object class. The odmadd command takes an ASCII stanza file as input and populates object classes with objects found in the stanza file.

odmchange Changes specific objects in a specified object class.

odmcreate Creates empty object classes. The odmcreate command takes an ASCII file describing object classes as input and produces C language .h and .c files to be used by the application accessing objects in those object classes.

odmdelete Removes objects from an object class.

odmdrop Removes an entire object class.

odmget Retrieves objects from object classes and puts the object information into odmadd command format.

odmshow Displays the description of an object class. The odmshow command takes an object class name as input and puts the object class information into odmcreate command format.
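Because odmget writes objects in the stanza format that odmadd reads, its output is easy to post-process with standard text tools. The sketch below extracts the value of one descriptor from a stanza with awk. The helper name get_stanza_value is hypothetical, and the sample stanza is illustrative only: its values are made up and not taken from a real system.

```shell
#!/bin/sh
# Hypothetical helper: print the value of descriptor $1 from stanza text on stdin.
# Handles lines of the form:   name = "value"   or   name = value
get_stanza_value() {
    awk -v key="$1" '
        $1 == key && $2 == "=" {
            # Drop everything up to the first "=" and the blanks after it.
            sub(/^[^=]*=[ \t]*/, "")
            # Strip the surrounding double quotes of string descriptors.
            gsub(/^"|"$/, "")
            print
            exit
        }'
}

# Illustrative stanza in odmadd format (made-up values).
sample_stanza='CuAt:
        name = "sys0"
        attribute = "maxuproc"
        value = "128"
        type = "R"'

printf '%s\n' "$sample_stanza" | get_stanza_value value
```

On an AIX system, the same helper could read the output of an odmget call instead of the sample text.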


ODM subroutines

The following table lists the ODM subroutines and their use:

Subroutine Definition

odm_add_obj Adds a new object to the object class.

odm_change_obj Changes the contents of an object.

odm_close_class Closes an object class.

odm_create_class Creates an empty object class.

odm_err_msg Retrieves a message string.

odm_free_list Frees memory allocated for the odm_get_list subroutine.

odm_get_by_id Retrieves an object by specifying its ID.

odm_get_first Retrieves the first object that matches the specified criteria in an object class.

odm_get_list Retrieves a list of objects that match the specified criteria in an object class.

odm_get_next Retrieves the next object that matches the specified criteria in an object class.

odm_get_obj Retrieves an object that matches the specified criteria from an object class.

odm_initialize Initializes an ODM session.

odm_lock Locks an object class or group of classes.

odm_mount_class Retrieves the class symbol structure for the specified object class.

odm_open_class Opens an object class.

odm_rm_by_id Removes an object by specifying its ID.

odm_rm_obj Removes all objects that match the specified criteria from the object class.

odm_run_method Invokes a method for the specified object.

odm_rm_class Removes an object class.

odm_set_path Sets the default path for locating object classes.

odm_unlock Unlocks an object class or group of classes.

odm_terminate Ends an ODM session.


ODM paths

Because the ODM classes can be found in 3 paths (root, usr and share), the user must decide which path to use before running ODM commands or ODM subroutines.

For ODM commands, the path is set with the ODMDIR environment variable:

# export ODMDIR=/usr/share/lib/objrepos

In a C program, the path is set with odm_set_path:

odm_set_path("/usr/lib/objrepos");
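A small wrapper can make the choice of repository explicit. The sketch below maps the three parts (root, usr and share) to the object repository directories given earlier in this unit and exports ODMDIR accordingly. The function name odm_dir_for is hypothetical, and the odmget call is only attempted when that command is present on the system.

```shell
#!/bin/sh
# Hypothetical helper: print the object repository directory for an ODM part.
odm_dir_for() {
    case $1 in
        root)  echo /etc/objrepos ;;
        usr)   echo /usr/lib/objrepos ;;
        share) echo /usr/share/lib/objrepos ;;
        *)     echo "unknown ODM part: $1" >&2; return 1 ;;
    esac
}

# Select the usr part for the ODM commands run from this shell.
ODMDIR=$(odm_dir_for usr) && export ODMDIR
echo "ODMDIR=$ODMDIR"

# On an AIX system, ODM commands now use that path, for example:
if command -v odmget >/dev/null 2>&1; then
    odmget lpp | head -20
fi
```

Rejecting unknown part names keeps a typing error from silently leaving ODMDIR pointing at the wrong repository.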


Boot and installation logging facilities

Introduction

It can be useful to quickly retrieve the log files used for boot or installation to help solve problems.

The alog command can be used to retrieve these system logs.

Log types

The alog command is used by the installation and boot processes to log information or errors for the following topics:

• boot : log for the boot process

• bosinst : log used for the AIX installation process

• console : log used to store console messages

• nim : log used to store NIM messages

• dumpsymp : used to store dump symptom messages

alog command usage

The following alog commands may be used :

• alog -L : will list alog log types defined in the ODM

• alog -t <log_type> -o : will display the log file related to the log_type

• echo "Message xxx" | alog -t <log_type> : will log the message to the log file

• alog -L -t <log_type> : will display detailed information related to the log_type definition (log file path, size and verbosity)

• alog -Cw <new_verbosity> -t <log_type> : will change the verbosity (0-9) for the log_type

• alog -C -t <log_type> -s <new_size> -f <new_file> : will change the file and file size of the log_type

• alog -V -t <log_type> : will display the current verbosity
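When scanning a log for problems, the numbered messages are a useful starting point, such as the 0517-075 swapon message that appears in the boot log example. The sketch below filters text read from standard input for lines that start with an identifier of that form (four digits, a dash, three digits). The function name list_numbered_msgs is hypothetical, and the alog call is only attempted when the command is available.

```shell
#!/bin/sh
# Hypothetical helper: print the lines on stdin that start with a message
# identifier of the form NNNN-NNN (for example 0517-075).
list_numbered_msgs() {
    grep -E '^[0-9]{4}-[0-9]{3} '
}

# Sample input copied from the boot log example in this section.
sample_log='Activating all paging spaces
0517-075 swapon: Paging device /dev/hd6 is already active.
Performing all automatic mounts'

printf '%s\n' "$sample_log" | list_numbered_msgs

# On an AIX system the boot log itself can be filtered the same way.
if command -v alog >/dev/null 2>&1; then
    alog -t boot -o | list_numbered_msgs || true
fi
```

The same filter can be applied to the other log types by changing the argument of the -t flag.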

Example

The following example outputs the last 15 lines of the boot log:

# alog -t boot -o | tail -15

Saving Base Customize Data to boot disk

Starting the sync daemon

Starting the error daemon

A device that was previously detected could not be found.

Run "diag -a" to update the system configuration.

System initialization completed.

Starting Multi-user Initialization


Performing auto-varyon of Volume Groups

Activating all paging spaces

0517-075 swapon: Paging device /dev/hd6 is already active.

/dev/rhd1 (/home): ** Unmounted cleanly - Check suppressed

Performing all automatic mounts

Replaying log for /dev/lv01.

Multi-user initialization completed


Debugging boot problems using KDB

Introduction

When debugging boot problems, it can be useful to get detailed output from the boot process, including the rc.boot output.

Entering boot debug

To enter boot debugging, the administrator should first make sure that the KDB kernel debugger will be loaded and invoked at boot time, using:

# bosboot -I -ad /dev/ipldevice

The next reboot will launch KDB on the native serial connection.

At the KDB prompt, toggle the rc.boot debug flag, and optionally the exec debug flag, in order to have the rc.boot output sent to the native serial connection.

Note that the exec tracing will continue after rc.boot ends.

Example

The following is an example of a boot debug session:

.......... kdb_tty_init done
.......... kdb_init_flihs done
 region address   region length    nodeid att label
0000000000000000 0000000000FF1000 0000 01 01
0000000000FF1000 000000000000F000 0000 01 03
0000000001000000 0000000006FCC000 0000 01 01
0000000007FCC000 0000000000029000 0000 00 05
0000000007FF5000 000000000000B000 0000 01 02
0000000008000000 0000000018000000 0000 01 01
0000000020000000 FFFFFFFFE0000000 0000 00 07
Real memory size = 512 M Bytes
Model = 0800004C
Data cache size = 64 K Bytes
Inst cache size = 32 K Bytes
.......... kdb_mem_size done
.......... kdb_code_init done
Preserving 911247 bytes of symbol table
First symbol __mulh
 START            END              <name>
0000000000003500 0000000000DB55A8 _system_configuration+000020
F00000002FF3B400 F00000002FFC0818 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC _errno+000000
F100008080000000 F10000808A000000 pvproc+000000
F100008090000000 F100008094000000 pvthread+000000
F100000040000000 F100000040266C80 vmmdseg+000000
F1000013B0000000 F1000073B4800000 vmmswpft+000000
F100000BB0000000 F1000013B0000000 vmmswhat+000000
F100000050000000 F100000060000000 ptaseg+000000
F100000070000000 F1000000B0000000 ameseg+000000
F100009710000000 F100009720000000 KERN_heap+000000
F100009500000000 F100009510000000 lkwseg+000000


************* Welcome to KDB *************
Call gimmeabreak...
Static breakpoint:
.gimmeabreak+000000 tweq r8,r8 r8=00000000F80003F8
.gimmeabreak+000004 blr <.kdb_init+00021C> r3=0
KDB(0)> dbgopt <== Enter debug options
Debug options:
--------------
1. Toggle rc.boot tracing - currently DISABLED
2. Toggle tracing of exec calls - currently DISABLED
q. Exit
Enter option: 1 <== Enable rc.boot tracing
Debug options:
--------------
1. Toggle rc.boot tracing - currently ENABLED
2. Toggle tracing of exec calls - currently DISABLED
q. Exit
Enter option: 2 <== Enable exec calls tracing
Debug options:
--------------
1. Toggle rc.boot tracing - currently ENABLED
2. Toggle tracing of exec calls - currently ENABLED
q. Exit
Enter option: q <== here, we quit the debug option menu
KDB(0)> q <== here, we quit the KDB so the boot process can pursue.
PFT:
id....................0007
raddr.....0000000001000000 eaddr.....0000000000000000
size..............00800000 align.............00800000
valid..1 ros....0 holes..0 io.....0 seg....1 wimg...2
PVT:
id....................0008
raddr.....0000000000692000 eaddr.....0000000000000000
size..............00100000 align.............00001000
valid..1 ros....0 holes..0 io.....0 seg....1 wimg...2
Exiting vmsi()
LED{814}
AIX Version 5.0
Starting NODE#000 physical CPU#002 as logical CPU#001... done.
exec(/etc/init)
exec(/usr/bin/sh,-c,/sbin/rc.boot 1)
exec(/sbin/rc.boot,/sbin/rc.boot,1)
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
exec(/usr/sbin/bootinfo,-p)
PLATFORM=chrp


Debugging boot problems using IADB

Introduction

When debugging boot problems, it can be useful to get detailed output from the boot process, including the rc.boot output.

Prerequisites

To get the boot debug output, you will need a device (a TTY, a Thinkpad or the serial port of another system) connected to the native serial port and configured at 115200-8-N-1.

Process

The following process is used to debug boot problems:

Step Action

1 If you want the IADB to be invoked at boot time, use:
  # bosboot -I -ad /dev/ipldevice
  You can also choose not to do this and set the debugger flags manually in the boot loader menu.

2 Boot or reboot the system.

3 If you are using another system as the TTY, you may want to set some tracing/capture options to capture the debugging output.

4 If the autoboot flag is not set in EFI, set the file system and boot using:
  Shell> fs0:
  fs0> boot

5 The boot loader menu should come up with the debugger flags set "ON" if you ran bosboot in step 1. Otherwise, hit a key to enter the boot loader menu and set the debugger flags. Then exit the boot loader menu.

6 The boot loader will load the IADB, which will prompt on the native serial port. At the IADB prompt type:
  CPU0> set dbgmsg=on
  CPU0> set exectrace=on
  CPU0> go


boot debugging output example

The following example shows the beginning of what you can see on the native serial port when debugging the boot process:

MEDIEVAL DEBUGGER ENTERE interrupt.
IP->E00000000001D2F2 brkpoint()+2: { .mfi 0: nop.m 0x100001;; }
>CPU0> set dbgmsg=on
>CPU0> set exectrace=on <== here we ask for debugging
>CPU0> go <== here we go
See Ya! Performing Hostile Takeover of the System Console...
AIX Version 5.0
Starting CPU#001... done.
+ ODMSTRNG=attribute=keylock and value=service
+ HOME=/
+ LIBPATH=/usr/lib:/lib:/usr/sbin:/etc:/usr/bin
+ SHOWLED=showled
+ SYSCFG_PHASE=BOOT
+ export HOME LIBPATH ODMDIR PATH SHOWLED SYSCFG_PHASE
+ umask 077
+ set -x
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
PLATFORM=ia64
+ [ ! -x /usr/lib/boot/bin/bootinfo_ia64 ]
+ [ 1 -eq 1 ]
+ 1> /)
+ + bootinfo -t
BOOTYPE=3
+ [ 0 -ne 0 ]
+ [ -z 3 ]
+ unset pdev_to_ldev undolt


Packaging Changes

Introduction

The lpp packaging has been revised to reflect the need for platform-dependent packages.

Package names

Package names have the following structure: <pkg_name>.V.R.M.F.<platform_type>.<install_type>.bff where:

• <pkg_name> is the name of the package to be installed

• V.R.M.F are the Version, Release, Modification and Fix levels of the package

• <platform_type> is the platform type for which the package was designed. The platform type can be one of:

    • I : For the Intel IA-64 platform

    • N : For Neutral packages that can be installed on all platforms

    • Nothing : For Power specific packages
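The platform type can be recovered from a package file name by peeling fields off its right-hand side. The sketch below follows the name structure described above. The file names, the function name pkg_platform and the install type letter X are hypothetical, since this section does not list the possible install type values.

```shell
#!/bin/sh
# Hypothetical helper: report the platform of a package file named
# <pkg_name>.V.R.M.F[.<platform_type>].<install_type>.bff
pkg_platform() {
    base=${1%.bff}          # drop the .bff suffix
    base=${base%.*}         # drop <install_type>
    field=${base##*.}       # last remaining field: platform type or fix level
    case $field in
        I)      echo "IA-64" ;;
        N)      echo "Neutral" ;;
        [0-9]*) echo "Power" ;;     # no platform field: this is the F level
        *)      echo "unknown" ;;
    esac
}

# Hypothetical file names, for illustration only.
pkg_platform sample.pkg.5.0.0.0.I.X.bff
pkg_platform sample.pkg.5.0.0.0.N.X.bff
pkg_platform sample.pkg.5.0.0.0.X.bff
```

Working from the right avoids having to know how many dots the package name itself contains.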

Packaging commands

The installp, bffcreate, inutoc and instfix commands are updated to reflect these changes.

By default, packaging commands process only the packages related to the platform on which the command is run.

A -M flag has been added to these commands. It accepts the following sub-options:

• I : To process Intel related packages

• R : To process Power related packages

• N : To process Neutral packages

• A : To process all kinds of packages

installp options

The installp command will only accept the -M flag with the -l or -L options.

The output of installp -L will include platform information.

bffcreate options

The bffcreate command will accept all -M sub-options to allow the transfer of packages regardless of the current platform. This is needed for NIM operations.

instfix options

The instfix command, like the installp command, will only accept the -M flag when used in conjunction with the -T flag (list).


inutoc command

The inutoc command will accept the -M flag.

Checkpoint

Introduction

Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz

1. Who calls rc.boot?

2. What do the tape, CD-ROM and network boots have in common in phase II?

3. What is specific to rc.boot phase III?

4. What do you need to do if you want to modify something in rc.boot phase I or II?

5. Which phase and/or device in rc.boot is not supported on IA-64?

6. What is the ODM used for?

7. What is init in the first two phases of the boot?

Instructor Notes

Purpose <which objective does this map address >

Answers

1. Who calls rc.boot?

• init

2. What do the tape, CD-ROM and network boots have in common in phase II?

• They exec bi_main (rc.bos_inst for network) to run installation tasks.

3. What is specific to rc.boot phase III?

• rc.boot phase III is called by the actual init process after newroot.

4. What do you need to do if you want to modify something in rc.boot phase I or II?

• You need to run bosboot to copy your changed rc.boot into the RAMFS.

5. Which phase and/or device in rc.boot is not supported on IA-64?

• The tape boot device (this was covered in the map on the various types of boot).

6. What is the ODM used for?

• To store and retrieve system information.

7. What is init in the first two phases of the boot?

• ssh

Unit 15. /proc Filesystem Support

This unit describes: The /proc filesystem in the AIX 5L kernel.

What You Should Be Able to Do

After completing this unit, you should be able to:

• List the directories and files that are found in the /proc filesystem

• Describe the basic functionality of each file in the sub-directory tree for a specific process

• Create a simple C program to access the files belonging to another process

/proc Filesystem Support

Introduction

/proc is a file system that provides access to the state of each active process and Light Weight Process (LWP) in the system.

Platform

This lesson is platform independent.

/proc filesystem

The contents of the /proc filesystem look like any other files and directories in a Unix filesystem. Each top-level entry in the /proc directory is a subdirectory named by the decimal number corresponding to a process ID, and the owner of each is determined by the user ID of the process.

Access to process state is provided by additional files contained within each sub-directory; this hierarchy is described more completely below. Except where otherwise specified, ‘‘/proc file’’ is meant to refer to a non-directory file within the hierarchy rooted at /proc.

Filesystem hierarchy

The directory structure for the proc directory is described below. The pid represents the process ID number and the lwp# represents the light-weight process number.

File/Directory Name            Description

/proc                          directory - list of processes
/proc/pid                      directory for process pid
/proc/pid/status               status of process pid
/proc/pid/ctl                  control file for process pid
/proc/pid/psinfo               ps info for process pid
/proc/pid/as                   address space of process pid
/proc/pid/map                  as map info for process pid
/proc/pid/object               directory for objects for process pid
/proc/pid/sigact               signal actions for process pid
/proc/pid/lwp/lwp#             directory for LWP lwp#
/proc/pid/lwp/lwp#/lwpstatus   status of LWP lwp#
/proc/pid/lwp/lwp#/lwpctl      control file for LWP lwp#
/proc/pid/lwp/lwp#/lwpsinfo    ps info for LWP lwp#
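The hierarchy can be turned into path names mechanically. The following is a minimal sketch, assuming only the layout listed here; proc_path and lwp_path are hypothetical helper names, not part of any AIX interface.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/proc/<pid>/<name>" into buf.
 * Returns 0 on success, -1 if the buffer is too small. */
int proc_path(char *buf, size_t cap, long pid, const char *name)
{
    assert(buf != NULL && name != NULL);
    int n = snprintf(buf, cap, "/proc/%ld/%s", pid, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Build "/proc/<pid>/lwp/<lwpid>/<name>" for a per-LWP file. */
int lwp_path(char *buf, size_t cap, long pid, long lwpid, const char *name)
{
    assert(buf != NULL && name != NULL);
    int n = snprintf(buf, cap, "/proc/%ld/lwp/%ld/%s", pid, lwpid, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For example, proc_path(buf, sizeof buf, 1234, "status") yields the path of the status file of process 1234.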

Accessing /proc files

Standard system call interfaces are used to access /proc files: open(2), close(2), read(2), and write(2). Most files describe process state and can only be opened for reading. An open for writing allows process control; a read-only open allows inspection, but not control.

Types of Files

Introduction

Listed below are descriptions of the files that are contained in the /proc filesystem hierarchy. These files are described in more detail on the following pages.

Filename         Mode    Function

as               rd/wr   Contains the address-space image of the process
ctl              wr      Allows changes to the process state or behavior
status           rd      Contains state information about the process
psinfo           rd      Information about the process needed by the ps(1) command
map              rd      Information about the virtual address map of the process
cred             rd      Describes the credentials associated with the process
sigact           rd      Describes the disposition of all signals associated with the process
object           N/A     A directory containing read-only files with names as they appear in the map file
lwp              N/A     A directory containing a subdirectory for each LWP
lwp#/lwpstatus   rd      State information for LWP lwp#
lwp#/lwpctl      wr      Allows changes to the state or behavior of LWP lwp#
lwp#/lwpsinfo    ??      Process info for LWP lwp#
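The Mode column maps directly onto the access mode passed to open(2). The sketch below illustrates that mapping; proc_open_mode is a hypothetical helper, and treating every file other than as, ctl and lwpctl as read-only (including lwpsinfo, whose mode the table leaves open) is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>

/* Return the open(2) access mode suggested by the Mode column for a
 * /proc file name such as "as", "ctl" or "lwpstatus". */
int proc_open_mode(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "as") == 0)
        return O_RDWR;      /* rd/wr: address-space image */
    if (strcmp(name, "ctl") == 0 || strcmp(name, "lwpctl") == 0)
        return O_WRONLY;    /* wr: control files */
    return O_RDONLY;        /* rd: status and information files (assumed default) */
}
```

A program that only inspects memory could still open the as file with O_RDONLY; O_RDWR is needed only when the address space is to be changed.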

The as File

Introduction

The as file contains the address-space image of the process and can be opened for both reading and writing.

Accessing the file

lseek is used to position the file at the virtual address of interest and then the address space can be examined or changed through a read or write.
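The seek-then-transfer pattern can be sketched as follows. The helper names as_open, as_peek and as_poke are hypothetical, and the sketch assumes that off_t is wide enough to hold the virtual addresses of the target process.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open /proc/<pid>/as; pass O_RDONLY to inspect or O_RDWR to also modify. */
int as_open(long pid, int mode)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/as", pid);
    return open(path, mode);
}

/* Read len bytes starting at virtual address vaddr.
 * Returns the number of bytes read, or -1 on error. */
ssize_t as_peek(int fd, off_t vaddr, void *buf, size_t len)
{
    assert(buf != NULL);
    if (lseek(fd, vaddr, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len);
}

/* Write len bytes starting at virtual address vaddr.
 * Returns the number of bytes written, or -1 on error. */
ssize_t as_poke(int fd, off_t vaddr, const void *buf, size_t len)
{
    assert(buf != NULL);
    if (lseek(fd, vaddr, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, buf, len);
}
```

A short count or -1 from as_peek means that part of the requested range could not be read, for example because it is not mapped.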

The ctl File

Introduction

The ctl file is a write-only file to which structured messages are written directing the system to change some aspect of the process’s state or control its behavior in some way. The seek offset is not relevant when writing to this file.

Control messages

Individual LWPs also have associated lwpctl files. Process state changes are effected through control messages written either to the ctl file of the process or to a specific lwpctl file. All control messages consist of an int naming the specific operation followed by additional data containing operands (if any). The effect of a control message is immediately reflected in the state of the process visible through appropriate status and information files.

Multiple control messages can be combined in a single write(2) to a control file, but no partial writes are permitted; that is, each control message (operation code plus operands) must be presented in its entirety to the write and not in pieces over several system calls.
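The message layout and the single-write rule can be sketched as below. ctl_append and ctl_send are hypothetical helpers; the sketch assumes operands follow the operation code with no extra padding, which may not match the alignment the real interface expects, and it takes operation codes as plain int values supplied by the caller.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Append one control message (operation code followed by its operand
 * bytes, if any) to buf at offset *used.
 * Returns 0 on success, -1 if the message does not fit. */
int ctl_append(char *buf, size_t cap, size_t *used,
               int op, const void *operand, size_t oplen)
{
    assert(buf != NULL && used != NULL);
    if (*used > cap || oplen > cap || cap - *used < sizeof op + oplen)
        return -1;
    memcpy(buf + *used, &op, sizeof op);
    if (oplen > 0)
        memcpy(buf + *used + sizeof op, operand, oplen);
    *used += sizeof op + oplen;
    return 0;
}

/* Present the whole batch to the control file in a single write(2).
 * Returns 0 if every byte was accepted, -1 otherwise. */
int ctl_send(int fd, const char *buf, size_t used)
{
    ssize_t n = write(fd, buf, used);
    return (n == (ssize_t)used) ? 0 : -1;
}
```

Building the batch in memory first keeps each message whole: the write either carries complete messages or is not issued at all.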

Descriptions of control messages

Descriptions of the allowable control messages are given in the Control Messages topic later in this unit.

The status File

Introduction

The status file contains state information about the process and one of its LWPs (chosen according to the rules described below).

File format

The file is formatted as a struct pstatus containing the following members:

    long        pr_flags;     /* Flags */
    ushort_t    pr_nlwp;      /* Total number of lwps in the process */
    sigset_t    pr_sigpend;   /* Set of process pending signals */
    vaddr_t     pr_brkbase;   /* Address of the process heap */
    ulong_t     pr_brksize;   /* Size of the process heap, in bytes */
    vaddr_t     pr_stkbase;   /* Address of the process stack */
    ulong_t     pr_stksize;   /* Size of the process stack, in bytes */
    pid_t       pr_pid;       /* Process id */
    pid_t       pr_ppid;      /* Parent process id */
    pid_t       pr_pgid;      /* Process group id */
    pid_t       pr_sid;       /* Session id */
    timestruc_t pr_utime;     /* Process user cpu time */
    timestruc_t pr_stime;     /* Process system cpu time */
    timestruc_t pr_cutime;    /* Sum of children's user times */
    timestruc_t pr_cstime;    /* Sum of children's system times */
    sigset_t    pr_sigtrace;  /* Mask of traced signals */
    fltset_t    pr_flttrace;  /* Mask of traced faults */
    sysset_t    pr_sysentry;  /* Mask of system calls traced on entry */
    sysset_t    pr_sysexit;   /* Mask of system calls traced on exit */
    lwpstatus_t pr_lwp;       /* "representative" LWP */

Member description

Here is a description of members of the status file:

Member Description

pr_flags A bit mask holding flags (flags are described below)

pr_nlwp Total number of LWPs in the process

pr_brkbase Virtual address of the process heap

pr_brksize Size of process heap in bytes. The sum of pr_brkbase and pr_brksize is the address of the process break (see brk(2)).

pr_stkbase Virtual address of the process stack

pr_stksize Size of the process stack in bytes. Each LWP runs on a separate stack; the process stack is distinguished in that the operating system will grow it as necessary.

pr_pid Process ID

pr_ppid Parent process ID

pr_pgid Process group ID

pr_sid Session ID of the process

pr_utime User CPU time consumed by the process in seconds and nanoseconds

pr_stime System CPU time consumed by the process in seconds and nanoseconds

pr_cutime Cumulative user CPU time consumed by the children of the process in seconds and nanoseconds

pr_cstime Cumulative system CPU time consumed by the children of the process in seconds and nanoseconds

pr_sigtrace Set of signals that are being traced (see PCSTRACE)

pr_flttrace Set of hardware faults that are being traced (see PCSFAULT)

pr_sysentry Set of system calls being traced on entry (see PCSENTRY)

pr_sysexit Set of system calls being traced on exit (see PCSEXIT)

pr_lwp If the process is not a zombie, pr_lwp contains an lwpstatus_t structure describing a representative LWP. The contents of this structure have the same meaning as if it were read from an lwpstatus file.

pr_flags

pr_flags is a bit mask holding these flags:

Flag Description

PR_ISSYS System process (see PCSTOP)

PR_FORK Has its inherit-on-fork flag set (see PCSET)

PR_RLC Has its run-on-last-close flag set (see PCSET)

PR_KLC Has its kill-on-last-close flag set (see PCSET)

PR_ASYNC Has its asynchronous-stop flag set (see PCSET)

Multi-threaded applications

When the process has more than one LWP, its representative LWP is chosen by the /proc implementation. The chosen LWP is a stopped LWP only if all the process’s LWPs are stopped, is stopped on an event of interest only if all the LWPs are so stopped, or is in a PR_REQUESTED stop only if there are no other events of interest to be found. The chosen LWP remains fixed as long as all the LWPs are stopped on events of interest and PCRUN is not applied to any of them.

When applied to the process control file, every /proc control operation that must act on an LWP uses the same algorithm to choose which LWP to act on. Together with synchronous stopping (see PCSET), this enables an application to control a multiple-LWP process using only the process-level status and control files if it so chooses. More fine-grained control can be achieved using the LWP-specific files.

The psinfo file

Introduction

The psinfo file contains information about the process needed by the ps(1) command. If the process contains more than one LWP, a representative LWP (chosen according to the rules described for the status file) is used to derive the status information.

File format

The file is formatted as a struct psinfo containing the following members:

    ulong_t     pr_flag;            /* process flags */
    ulong_t     pr_nlwp;            /* number of LWPs in process */
    uid_t       pr_uid;             /* real user id */
    gid_t       pr_gid;             /* real group id */
    pid_t       pr_pid;             /* unique process id */
    pid_t       pr_ppid;            /* process id of parent */
    pid_t       pr_pgid;            /* pid of process group leader */
    pid_t       pr_sid;             /* session id */
    caddr_t     pr_addr;            /* internal address of process */
    long        pr_size;            /* size of process image in pages */
    long        pr_rssize;          /* resident set size in pages */
    timestruc_t pr_start;           /* process start time, time since epoch */
    timestruc_t pr_time;            /* usr+sys cpu time for this process */
    dev_t       pr_ttydev;          /* controlling tty device (or PRNODEV) */
    char        pr_fname[PRFNSZ];   /* last component of exec()ed pathname */
    char        pr_psargs[PRARGSZ]; /* initial characters of arg list */
    struct lwpsinfo pr_lwp;         /* "representative" LWP */

Platform specific data

Some of the entries in psinfo, such as pr_flag and pr_addr, refer to internal kernel data structures and should not be expected to retain their meanings across different versions of the operating system. They have no meaning to a program and are only useful for manual interpretation by a user aware of the implementation details.

Zombies

psinfo is still accessible even after a process becomes a zombie.

Representative LWP

pr_lwp describes the representative LWP chosen as described under the status file above. If the process is a zombie, pr_nlwp and pr_lwp.pr_lwpid are zero and the other fields of pr_lwp are undefined.

The map File

Introduction

The map file contains information about the virtual address map of the process. The file contains an array of prmap structures, each of which describes a contiguous virtual address region in the address space of the traced process.

File format

The prmap structure contains the following members:

    caddr_t pr_vaddr;        /* Virtual address */
    ulong_t pr_size;         /* Size of mapping in bytes */
    char    pr_mapname[32];  /* Name in /proc/pid/object */
    off_t   pr_off;          /* Offset into mapped object, if any */
    long    pr_mflags;       /* Protection and attribute flags */
    long    pr_filler[9];    /* For future use */

Member description

Members of the map file are described below:

Member Description

pr_vaddr Virtual address of the mapping within the traced process

pr_size Size of mapping in bytes

pr_mapname If not an empty string, contains the name of a file in the object directory that can be opened for reading to yield a file descriptor for the object to which the virtual address is mapped.

pr_off Offset within the mapped object (if any) to which the virtual address is mapped

pr_mflags Protection and attribute flags (see below)

pr_filler For future use

pr_mflags

pr_mflags is a bit mask of protection and attribute flags:

Flag Description

MA_READ Mapping is readable by the traced process

MA_WRITE Mapping is writable by the traced process

MA_EXEC Mapping is executable by the traced process

MA_SHARED Mapping changes are shared by mapped object

Contiguous address space

A contiguous area of the address space having the same underlying mapped object may appear as multiple mappings because of varying read, write, execute, and shared attributes. The underlying mapped object does not change over the range of a single mapping. An I/O operation to a mapping marked MA_SHARED fails if applied at a virtual address not corresponding to a valid page in the underlying mapped object. Reads and writes to private mappings always succeed. Reads and writes to unmapped addresses always fail.

The cred File

Introduction

The cred file contains a description of the credentials associated with the process.

File format

The file is formatted as a struct prcred containing the following members:

    uid_t  pr_euid;       /* Effective user id */
    uid_t  pr_ruid;       /* Real user id */
    uid_t  pr_suid;       /* Saved user id (from exec) */
    gid_t  pr_egid;       /* Effective group id */
    gid_t  pr_rgid;       /* Real group id */
    gid_t  pr_sgid;       /* Saved group id (from exec) */
    uint_t pr_ngroups;    /* Number of supplementary groups */
    gid_t  pr_groups[1];  /* Array of supplementary groups */

The list of associated supplementary groups in pr_groups is of variable length; pr_ngroups specifies the number of groups.
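Because pr_groups is declared with one element but may hold more, a reader has to size its buffer from pr_ngroups. The following sketch shows the arithmetic; struct prcred_sketch is a local stand-in that mirrors the member list above (real code would use the system definition), and prcred_size and prcred_group are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Local stand-in mirroring the listed members; an assumption, not the
 * system definition. */
struct prcred_sketch {
    uid_t pr_euid, pr_ruid, pr_suid;
    gid_t pr_egid, pr_rgid, pr_sgid;
    unsigned int pr_ngroups;
    gid_t pr_groups[1];
};

/* Bytes needed to hold a cred record with ngroups supplementary groups.
 * The structure already has room for the first group. */
size_t prcred_size(unsigned int ngroups)
{
    size_t extra = (ngroups > 1) ? (size_t)(ngroups - 1) : 0;
    return sizeof(struct prcred_sketch) + extra * sizeof(gid_t);
}

/* Return supplementary group i; c must point to a buffer of at least
 * prcred_size(c->pr_ngroups) bytes. */
gid_t prcred_group(const struct prcred_sketch *c, unsigned int i)
{
    assert(c != NULL && i < c->pr_ngroups);
    return c->pr_groups[i];
}
```

One possible approach is to read the fixed part first to learn pr_ngroups, then allocate prcred_size(pr_ngroups) bytes and read the file again from the start.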

The sigact File

Introduction

The sigact file contains an array of sigaction structures describing the current dispositions of all signals associated with the traced process. Signal numbers are displaced by 1 from array indexes, so that the action for signal number n appears in position n-1 of the array.
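The displacement by one translates into a simple file offset. This sketch assumes each array element is a struct sigaction as declared in <signal.h>; sigact_offset is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>

/* Byte offset of the entry for signal signo within the sigact file:
 * the action for signal n is stored at array index n-1. */
off_t sigact_offset(int signo)
{
    assert(signo >= 1);
    return (off_t)(signo - 1) * (off_t)sizeof(struct sigaction);
}
```

Seeking to sigact_offset(n) and reading sizeof(struct sigaction) bytes would then fetch the disposition of signal n without reading the whole array.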

The lwp/lwpctl File

Introduction

The lwpctl file is a write-only control file. The messages written to this file affect only the associated LWP rather than the process as a whole (where appropriate).

The lwp/lwpstatus File

Introduction

The lwp/lwpstatus file contains LWP-specific state information. This information is also present in the status file of the process for its representative LWP.

File format

The file is formatted as a struct lwpstatus containing the following members:

    long        pr_flags;              /* Flags */
    short       pr_why;                /* Reason for stop (if stopped) */
    short       pr_what;               /* More detailed reason */
    lwpid_t     pr_lwpid;              /* Specific LWP identifier */
    short       pr_cursig;             /* Current signal */
    siginfo_t   pr_info;               /* Info associated with signal or fault */
    struct sigaction pr_action;        /* Signal action for current signal */
    sigset_t    pr_lwppend;            /* Set of LWP pending signals */
    stack_t     pr_altstack;           /* Alternate signal stack info */
    short       pr_syscall;            /* System call number (if in syscall) */
    short       pr_nsysarg;            /* Number of arguments to this syscall */
    long        pr_sysarg[PRSYSARGS];  /* Arguments to this syscall */
    char        pr_clname[PRCLSZ];     /* Scheduling class name */
    ucontext_t  pr_context;            /* LWP context */
    pfamily_t   pr_family;             /* Processor family-specific information */

Member description

Here is a description of members of the lwpstatus file:

Member Description

pr_flags A bit mask holding flags (described below)

pr_why Reason for LWP stop (if stopped). Possible values are listed below.

pr_what More detailed reason for LWP stop. pr_why and pr_what together describe the reason for a stopped LWP.

pr_lwpid Specific LWP identifier.

pr_cursig Names the current signal; that is, the next signal to be delivered to the LWP.

pr_info When the LWP is in a PR_SIGNALLED or PR_FAULTED stop, pr_info contains additional information pertinent to the particular signal or fault. (See sys/siginfo.h)

pr_action Contains signal action information about the current signal (see sigaction(2)). It is undefined if pr_cursig is zero.

pr_lwppend Identifies any synchronously-generated or LWP-directed signals pending for the LWP. Does not include signals pending at the process level.

pr_altstack Contains the alternate signal stack information for the LWP. (see sigaltstack(2)).

pr_syscall Number of the system call, if any, being executed by the LWP. It is nonzero if and only if the LWP is stopped on PR_SYSENTRY or PR_SYSEXIT or is asleep within a system call (PR_ASLEEP is set).

pr_nsysarg If pr_syscall is non-zero, pr_nsysarg is the number of arguments to the system call

pr_sysarg Array of arguments to the system call.

pr_clname Contains the name of the scheduling class of the LWP.

pr_context Contains the user context of the LWP, as if it had called getcontext(2). If the LWP is not stopped, all context values are undefined.

pr_family Contains the CPU-family specific information about the LWP. Use of this field is not portable across different architectures.

pr_flags

pr_flags is a bit mask holding these flags:

Flag Description

PR_STOPPED LWP is stopped

PR_ISTOP LWP is stopped on an event of interest (see PCSTOP)

PR_DSTOP LWP has a stop directive in effect (see PCSTOP)

PR_STEP LWP has a single-step directive in effect

PR_ASLEEP LWP is in an interruptible sleep within a system call

PR_PCINVAL LWP program counter register does not point to a valid address

pr_why

Possible values of pr_why are:

Value Description

PR_REQUESTED Shows that the stop occurred in response to a stop directive, normally because PCSTOP was applied or because another LWP stopped on an event of interest and the asynchronous-stop flag (see PCSET) was not set for the process. pr_what is unused in this case.

PR_SIGNALLED Shows that the LWP stopped on receipt of a signal (see PCSTRACE); pr_what holds the signal number that caused the stop (for a newly-stopped LWP, the same value is in pr_cursig).

PR_FAULTED Shows that the LWP stopped on incurring a hardware fault (see PCSFAULT); pr_what holds the fault number that caused the stop.

PR_SYSENTRY, PR_SYSEXIT Show a stop on entry to or exit from a system call (see PCSENTRY and PCSEXIT); pr_what holds the system call number.

PR_JOBCONTROL Shows that the LWP stopped because of the default action of a job control stop signal (see sigaction(2)); pr_what holds the stopping signal number.
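Since the meaning of pr_what depends on pr_why, a reader of lwpstatus typically switches on pr_why first. The sketch below shows that pattern; the WHY_ codes are local placeholders whose numeric values are assumptions (real code would use the PR_ constants from the system header), and describe_stop is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Placeholder stop-reason codes; the numeric values are NOT taken from
 * the system header and exist only to make the sketch self-contained. */
enum why_sketch {
    WHY_REQUESTED = 1, WHY_SIGNALLED, WHY_FAULTED,
    WHY_SYSENTRY, WHY_SYSEXIT, WHY_JOBCONTROL
};

/* Format a one-line description of a stop into buf.
 * Returns the length snprintf reports for the formatted text. */
int describe_stop(int why, int what, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    switch (why) {
    case WHY_REQUESTED:  return snprintf(buf, cap, "requested stop"); /* pr_what unused */
    case WHY_SIGNALLED:  return snprintf(buf, cap, "signal %d", what);
    case WHY_FAULTED:    return snprintf(buf, cap, "fault %d", what);
    case WHY_SYSENTRY:   return snprintf(buf, cap, "entry to syscall %d", what);
    case WHY_SYSEXIT:    return snprintf(buf, cap, "exit from syscall %d", what);
    case WHY_JOBCONTROL: return snprintf(buf, cap, "job control signal %d", what);
    default:             return snprintf(buf, cap, "unknown reason %d", why);
    }
}
```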

The lwp/lwpsinfo File

Introduction

The lwp/lwpsinfo file contains information about the LWP needed by ps(1). This information is also present in the psinfo file of the process for its representative LWP, if it has one.

File format

The file is formatted as a struct lwpsinfo containing the following members:

    ulong_t       pr_flag;          /* LWP flags */
    lwpid_t       pr_lwpid;         /* LWP id */
    caddr_t       pr_addr;          /* internal address of LWP */
    caddr_t       pr_wchan;         /* wait addr for sleeping LWP */
    uchar_t       pr_stype;         /* synchronization event type */
    uchar_t       pr_state;         /* numeric scheduling state */
    char          pr_sname;         /* printable character representing pr_state */
    uchar_t       pr_nice;          /* nice for cpu usage */
    int           pr_pri;           /* priority, high value = high priority */
    timestruc_t   pr_time;          /* usr+sys cpu time for this LWP */
    char          pr_clname[8];     /* Scheduling class name */
    char          pr_name[PRFNSZ];  /* name of system LWP */
    processorid_t pr_onpro;         /* processor on which LWP is running */
    processorid_t pr_bindpro;       /* processor to which LWP is bound */
    processorid_t pr_exbindpro;     /* processor to which LWP is exbound */

Platform-specific data

Some of the entries in lwpsinfo, such as pr_flag, pr_addr, pr_state, pr_stype, pr_wchan, and pr_name, refer to internal kernel data structures and should not be expected to retain their meanings across different versions of the operating system. They have no meaning to a program and are only useful for manual interpretation by a user aware of the implementation details.

Control Messages

Introduction

Process state changes are effected through messages written to the ctl file of the process or to the lwpctl file of an individual LWP.

Sending control messages

All control messages consist of an int naming the specific operation followed by additional data containing operands (if any). Multiple control messages can be combined in a single write(2) to a control file, but no partial writes are permitted; that is, each control message (operation code plus operands) must be presented in its entirety to the write and not in pieces over several system calls.

ENOENT

Writing a message to a control file for a process or LWP that has exited elicits the error ENOENT.

List of messages

Here is a list of the allowable control messages:

Control Message Description

PCSTOP Directs LWPs to stop and waits for them to stop

PCDSTOP Directs LWPs to stop without waiting for them to stop

PCWSTOP Waits for LWPs to stop

PCRUN Makes an LWP runnable again after a stop

PCSTRACE Defines a set of signals to be traced in the process

PCSSIG Contains the current signal and its associated signal information????

PCKILL End the process or LWP immediately????

PCUNKILL ????

PCSHOLD Set the held signals for the specific or chosen LWP according to the operand sigset_t structure

PCSFAULT Define a set of hardware faults to be traced in the process

PCSTOP, PCDSTOP, and PCWSTOP

Introduction Three control messages stop LWPs, each behaving differently:

• PCSTOP

• PCDSTOP

• PCWSTOP

PCSTOP When applied to the process control file, directs all LWPs to stop and waits for them to stop. Completes when every LWP has stopped on an event of interest.

When applied to an LWP control file, directs the specific LWP to stop and waits until it has stopped. Completes when the LWP stops on an event of interest, immediately if already so stopped.

PCDSTOP When applied to the process control file, directs all LWPs to stop without waiting for them to stop.

When applied to an LWP control file, directs the specific LWP to stop without waiting for it to stop.

PCWSTOP When applied to the process control file, simply waits for all LWPs to stop. Completes when every LWP has stopped on an event of interest.

When applied to an LWP control file, simply waits for the LWP to stop. Completes when the LWP stops on an event of interest, immediately if already so stopped.

Event of interest

An event of interest is either a PR_REQUESTED stop or a stop that has been specified in the process’s tracing flags (set by PCSTRACE, PCSFAULT, PCSENTRY, and PCSEXIT). A PR_JOBCONTROL stop is specifically not an event of interest. (An LWP may stop twice because of a stop signal; first showing PR_SIGNALLED if the signal is traced and again showing PR_JOBCONTROL if the LWP is set running without clearing the signal.) If PCSTOP or PCDSTOP is applied to an LWP that is stopped, but not on an event of interest, the stop directive takes effect when the LWP is restarted by the competing mechanism; at that time the LWP enters a PR_REQUESTED stop before executing any user-level code.
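The rule above can be restated as a small predicate. This is a sketch for study purposes only: the enum stop_reason values are illustrative stand-ins for the PR_* stop reasons reported by /proc (real code would use the constants from <sys/procfs.h>), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the stop reasons reported by /proc. */
enum stop_reason {
    STOP_NONE,        /* LWP is not stopped */
    STOP_REQUESTED,   /* PR_REQUESTED */
    STOP_SIGNALLED,   /* PR_SIGNALLED: stopped on a traced signal */
    STOP_FAULTED,     /* PR_FAULTED: stopped on a traced fault */
    STOP_SYSENTRY,    /* PR_SYSENTRY: stopped on a traced system call entry */
    STOP_SYSEXIT,     /* stopped on a traced system call exit */
    STOP_JOBCONTROL   /* PR_JOBCONTROL: never an event of interest */
};

/* True if the LWP is stopped on an event of interest. */
bool stopped_on_event_of_interest(enum stop_reason why)
{
    assert(why <= STOP_JOBCONTROL);
    return why != STOP_NONE && why != STOP_JOBCONTROL;
}

/* True if a PCSTOP or PCDSTOP directive is deferred: the LWP is stopped,
 * but not on an event of interest, so the directive only takes effect
 * when the competing mechanism restarts the LWP. */
bool stop_directive_deferred(enum stop_reason why)
{
    assert(why <= STOP_JOBCONTROL);
    return why != STOP_NONE && !stopped_on_event_of_interest(why);
}
```

In this model only a job-control stop defers the directive; every traced stop and every requested stop counts as an event of interest.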


Blocked control messages

A write of a control message that blocks is interruptible by a signal so that, for example, an alarm(2) can be set to avoid waiting forever for a process or LWP that may never stop on an event of interest. If PCSTOP is interrupted, the LWP stop directives remain in effect even though the write returns an error.
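One way to apply the alarm(2) technique mentioned above is sketched here. The helper name ctl_write_timeout is hypothetical, and the sketch assumes that a handler installed without SA_RESTART causes the blocked write(2) to fail with EINTR when SIGALRM arrives, which is the usual POSIX behavior.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked write(2). */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Hypothetical helper: write one control message, giving up after
 * `seconds`. Returns the result of write(2); if the alarm fires first,
 * the result is -1 with errno set to EINTR. */
ssize_t ctl_write_timeout(int fd, const void *msg, size_t len,
                          unsigned seconds)
{
    struct sigaction sa, old;
    ssize_t n;
    int saved;

    assert(msg != NULL && seconds > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                 /* no SA_RESTART: write may return EINTR */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);
    n = write(fd, msg, len);
    saved = errno;
    alarm(0);                        /* cancel the alarm if still pending */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous disposition */
    errno = saved;
    return n;
}
```

As the text notes, an interrupted PCSTOP leaves the stop directives in effect, so a caller that sees EINTR may follow up with PCWSTOP later instead of repeating PCSTOP.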

System process

A system process (indicated by the PR_ISSYS flag) never executes at user level, has no user-level address space visible through /proc, and cannot be stopped. Applying PCSTOP, PCDSTOP, or PCWSTOP to a system process or any of its LWPs elicits the error EBUSY.


PCRUN

Introduction The control message PCRUN makes an LWP runnable again after a stop. The operand is a set of flags, contained in a ulong_t, describing optional additional actions.

Flag descriptions

Here is a description of the flags contained in the operand of PCRUN:

Flag Description

PRCSIG Clears the current signal, if any (see PCSSIG).

PRCFAULT Clears the current fault, if any (see PCCFAULT).

PRSTEP Directs the LWP to execute a single machine instruction. On completion of the instruction, a trace trap occurs. If FLTTRACE is being traced, the LWP stops, otherwise it is sent SIGTRAP; if SIGTRAP is being traced and not held, the LWP stops. When the LWP stops on an event of interest the single-step directive is cancelled, even if the stop occurs before the instruction is executed. This operation requires hardware and operating system support and may not be implemented on all processors.

PRSABORT Is significant only if the LWP is in a PR_SYSENTRY stop or is marked PR_ASLEEP; it instructs the LWP to abort execution of the system call (see PCSENTRY, PCSEXIT).

PRSTOP Directs the LWP to stop again as soon as possible after resuming execution (see PCSTOP). In particular, if the LWP is stopped on PR_SIGNALLED or PR_FAULTED, the next stop will show PR_REQUESTED, no other stop will have intervened, and the LWP will not have executed any user-level code.

Using PCRUN on an LWP

When applied to an LWP control file, PCRUN makes the specific LWP runnable. The operation fails (EBUSY) if the specific LWP is not stopped on an event of interest.


Using PCRUN on a process

When applied to the process control file, an LWP is chosen for the operation as described for /proc/pid/status. The operation fails (EBUSY) if the chosen LWP is not stopped on an event of interest. If PRSTEP or PRSTOP was requested, the chosen LWP is made runnable; otherwise, the chosen LWP is marked PR_REQUESTED. If as a result all LWPs are in the PR_REQUESTED stop state, they are all made runnable.

Once an LWP has been made runnable by PCRUN, it is no longer stopped on an event of interest even if, because of a competing mechanism, it remains stopped.
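The process-level rules can be easier to follow as a toy model. The sketch below is not the kernel's implementation; it only restates the paragraph on applying PCRUN to the process control file. The enum, the MODEL_ flag values, and the function name are all made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative flag bits; the real PRSTEP and PRSTOP values come from
 * <sys/procfs.h>. */
#define MODEL_PRSTEP 0x1UL
#define MODEL_PRSTOP 0x2UL

enum lwp_state {
    LWP_RUNNABLE,
    LWP_STOPPED_EVENT,      /* event of interest other than PR_REQUESTED */
    LWP_STOPPED_REQUESTED,  /* in a PR_REQUESTED stop */
    LWP_STOPPED_OTHER       /* stopped, but not on an event of interest */
};

/* Model PCRUN written to the process control file, with lwps[chosen]
 * as the chosen LWP. Returns 0 on success, or -1 with errno == EBUSY. */
int model_process_pcrun(enum lwp_state *lwps, size_t n, size_t chosen,
                        unsigned long flags)
{
    size_t i;

    assert(lwps != NULL && chosen < n);

    if (lwps[chosen] != LWP_STOPPED_EVENT &&
        lwps[chosen] != LWP_STOPPED_REQUESTED) {
        errno = EBUSY;      /* not stopped on an event of interest */
        return -1;
    }
    if (flags & (MODEL_PRSTEP | MODEL_PRSTOP)) {
        lwps[chosen] = LWP_RUNNABLE;
        return 0;
    }
    lwps[chosen] = LWP_STOPPED_REQUESTED;
    for (i = 0; i < n; i++)
        if (lwps[i] != LWP_STOPPED_REQUESTED)
            return 0;       /* some LWP is not in a requested stop yet */
    for (i = 0; i < n; i++)
        lwps[i] = LWP_RUNNABLE;
    return 0;
}
```

In this model a process with two LWPs stopped on traced events needs two PCRUN messages before either LWP runs, unless PRSTEP or PRSTOP is given.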


PCSTRACE

Introduction PCSTRACE defines the set of signals to be traced in the process: the receipt of one of these signals by an LWP causes the LWP to stop. The set of signals is defined using an operand sigset_t contained in the control message.

SIGKILL Receipt of SIGKILL cannot be traced; if specified, it is silently ignored.

Held signals If a signal that is included in a held signal set of an LWP is sent to the LWP, the signal is not received and does not cause a stop until it is removed from the held signal set, either by the LWP itself or by setting the held signal set with PCSHOLD or the PRSHOLD option of PCRUN.
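A traced signal set could be prepared with the standard POSIX signal-set functions, as in this sketch. The helper name build_trace_set is hypothetical, and the sketch follows the text in using sigset_t for the operand; the actual headers on a given system may define a /proc-specific set type instead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: fill *set with the signals listed in signos.
 * SIGKILL is skipped up front, mirroring the fact that the system
 * silently ignores it in a traced set. Returns 0 on success, or -1 if
 * a signal number is invalid. */
int build_trace_set(sigset_t *set, const int *signos, size_t count)
{
    size_t i;

    assert(set != NULL);
    assert(signos != NULL || count == 0);

    if (sigemptyset(set) < 0)
        return -1;
    for (i = 0; i < count; i++) {
        if (signos[i] == SIGKILL)
            continue;       /* cannot be traced */
        if (sigaddset(set, signos[i]) < 0)
            return -1;
    }
    return 0;
}
```

The resulting set would be written to the control file immediately after the PCSTRACE operation code, in the same write(2).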


PCSSIG

Introduction PCSSIG sets the current signal and its associated signal information for the specific or chosen LWP according to the contents of the operand siginfo structure (see ). If the specified signal number is zero, the current signal is cleared. An error (EBUSY) is returned if the LWP is not stopped on an event of interest. The semantics of this operation are different from those of kill(2), _lwp_kill(2), or PCKILL in that the signal is delivered to the LWP immediately after execution is resumed (even if the signal is being held) and an additional PR_SIGNALLED stop does not intervene even if the signal is being traced. Setting the current signal to SIGKILL ends the process immediately.


PCKILL, PCUNKILL

Introduction

PCKILL If applied to the process control file, a signal is sent to the process with semantics identical to those of kill(2). If applied to an LWP control file, a signal is sent to the LWP with semantics identical to those of _lwp_kill(2). The signal is named in an operand int contained in the message. Sending SIGKILL ends the process or LWP immediately.

PCUNKILL A signal is deleted, that is, it is removed from the set of pending signals. If applied to the process control file, the signal is deleted from the process’s pending signals. If applied to an LWP control file, the signal is deleted from the LWP’s pending signals. The current signal (if any) is unaffected. The signal is named in an operand int in the control message. It is an error (EINVAL) to attempt to delete SIGKILL.


PCSHOLD

Introduction PCSHOLD sets the held signals for the specific or chosen LWP (signals whose delivery will be delayed if sent to the LWP) according to the operand sigset_t structure. SIGKILL and SIGSTOP cannot be held; if specified, they are silently ignored.


PCSFAULT

Introduction PCSFAULT defines a set of hardware faults to be traced in the process: on incurring one of these faults an LWP stops. The set is defined via the operand fltset_t structure.

Fault names Some fault names may not occur on all processors; there may be processor-specific faults in addition to these. Fault names include the following:

Fault Name Description

FLTILL Illegal instruction

FLTPRIV Privileged instruction

FLTBPT Breakpoint trap

FLTTRACE Trace trap

FLTACCESS Memory access fault (bus error)

FLTBOUNDS Memory bounds violation

FLTIOVF Integer overflow

FLTIZDIV Integer zero divide

FLTFPE Floating-point exception

FLTSTACK Unrecoverable stack fault

FLTPAGE Recoverable page fault

When not traced, a fault normally results in the posting of a signal to the LWP that incurred the fault. If an LWP stops on a fault, the signal is posted to the LWP when execution is resumed unless the fault is cleared by PCCFAULT or by the PRCFAULT option of PCRUN. FLTPAGE is an exception; no signal is posted. There may be additional processor-specific faults like this.
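The rule for whether a signal follows a traced fault can be written as a small decision function. This is a study aid, not system code: the enum lists only three illustrative stand-ins for the FLT* names, and signal_posted_on_resume is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for a few of the fault names in the table. */
enum model_fault {
    MODEL_FLTILL,   /* illegal instruction */
    MODEL_FLTBPT,   /* breakpoint trap */
    MODEL_FLTPAGE   /* recoverable page fault */
};

/* For an LWP that stopped on a traced fault: is the associated signal
 * posted when execution resumes? `cleared` is true if the fault was
 * cleared by PCCFAULT or by the PRCFAULT option of PCRUN. */
bool signal_posted_on_resume(enum model_fault fault, bool cleared)
{
    assert(fault <= MODEL_FLTPAGE);

    if (fault == MODEL_FLTPAGE)
        return false;       /* no signal is posted for FLTPAGE */
    return !cleared;
}
```

A debugger that plants breakpoints usually clears the fault before resuming, so the traced program never sees the resulting signal.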


pr_info field The pr_info field in /proc/pid/status or in /proc/pid/lwp/lw#/lwpstatus identifies the signal to be sent and contains machine-specific information about the fault. The related control messages are listed here and described below:

PCCFAULT The current fault (if any) is cleared; the associated signal is not sent to the specific or chosen LWP.


PCSENTRY, PCSEXIT

These control operations instruct the process’s LWPs to stop on entry to or exit from specified system calls.

PCSET Sets one or more modes of operation for the traced process.

PCRESET Resets these modes. The modes to be set or reset are specified by flags in an operand long in the control message:

PCSREG Sets the general registers for the specific or chosen LWP according to the operand gregset_t structure.

PCSFPREG Sets the floating-point registers for the specific or chosen LWP according to the operand fpregset_t structure.

PCNICE Sets the LWP’s nice(2) priority.


PCSENTRY, PCSEXIT

These control operations instruct the process’s LWPs to stop on entry to or exit from specified system calls. The set of system calls to be traced is defined via an operand sysset_t structure.

When entry to a system call is being traced, an LWP stops after having begun the call to the system but before the system call arguments have been fetched from the LWP. When exit from a system call is being traced, an LWP stops on completion of the system call just before checking for signals and returning to user level. At this point all return values have been stored into the LWP’s registers.

If an LWP is stopped on entry to a system call (PR_SYSENTRY) or when sleeping in an interruptible system call (PR_ASLEEP is set), it may be instructed to go directly to system call exit by specifying the PRSABORT flag in a PCRUN control message. Unless exit from the system call is being traced the LWP returns to user level showing error EINTR.

PCSET PCSET sets one or more modes of operation for the traced process.


PCRESET PCRESET resets these modes. The modes to be set or reset are specified by flags in an operand long in the control message. The flags are described below:

Flag Description

PR_FORK (inherit-on-fork) When set, the tracing flags of the process are inherited by the child of a fork(2) or vfork(2). When reset, child processes start with all tracing flags cleared.

PR_RLC (run-on-last-close) When set and the last writable /proc file descriptor referring to the traced process or any of its LWPs is closed, all the tracing flags of the process are cleared, any outstanding stop directives are canceled, and if any LWPs are stopped on events of interest, they are set running as though PCRUN had been applied to them. When reset, the process’s tracing flags are retained and LWPs are not set running on last close.

PR_KLC (kill-on-last-close) When set and the last writable /proc file descriptor referring to the traced process or any of its LWPs is closed, the process is terminated with SIGKILL.

PR_ASYNC (asynchronous-stop) When set, a stop on an event of interest by one LWP does not directly affect any other LWP in the process. When reset and an LWP stops on an event of interest other than PR_REQUESTED, all other LWPs in the process are directed to stop.


EINVAL It is an error (EINVAL) to specify flags other than those described above or to apply these operations to a system process. The current modes are reported in the pr_flags field of /proc/pid/status.
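The set and reset behavior, including the EINVAL cases, can be modeled with ordinary bit operations. In this sketch the MODEL_PR_ bit values are invented for illustration (the real PR_FORK, PR_RLC, PR_KLC, and PR_ASYNC values come from <sys/procfs.h>) and model_change_modes is a hypothetical function standing in for what PCSET and PCRESET do to the mode word.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values, not the real ones. */
#define MODEL_PR_FORK  0x1L
#define MODEL_PR_RLC   0x2L
#define MODEL_PR_KLC   0x4L
#define MODEL_PR_ASYNC 0x8L
#define MODEL_PR_ALL   (MODEL_PR_FORK | MODEL_PR_RLC | \
                        MODEL_PR_KLC | MODEL_PR_ASYNC)

/* Model PCSET (set == true) or PCRESET (set == false) applied to the
 * mode word *modes. Returns 0 on success, or -1 with errno == EINVAL
 * for unknown flags or for a system process; *modes is then unchanged. */
int model_change_modes(long *modes, bool set, long flags,
                       bool is_system_process)
{
    assert(modes != NULL);

    if (is_system_process || (flags & ~MODEL_PR_ALL) != 0) {
        errno = EINVAL;
        return -1;
    }
    if (set)
        *modes |= flags;
    else
        *modes &= ~flags;
    return 0;
}
```

Several flags can be combined in one operand, so a controlling process can, for example, request inherit-on-fork and run-on-last-close with a single PCSET message.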

PCSREG PCSREG sets the general registers for the specific or chosen LWP according to the operand gregset_t structure. There may be machine-specific restrictions on the allowable set of changes. PCSREG fails (EBUSY) if the LWP is not stopped on an event of interest.

PCSFPREG PCSFPREG sets the floating-point registers for the specific or chosen LWP according to the operand fpregset_t structure. An error (EINVAL) is returned if the system does not support floating-point operations (no floating-point hardware and the system does not emulate floating-point machine instructions). PCSFPREG fails (EBUSY) if the LWP is not stopped on an event of interest.

PCNICE The traced (or chosen) LWP’s nice(2) priority is incremented by the amount contained in the operand int. Only the super-user may better an LWP’s priority in this way, but any user may make the priority worse. This operation is significant only when applied to an LWP in the time-sharing scheduling class.


Directories

Introduction

Object directory

The object directory contains read-only files with names as they appear in the entries of the map file, corresponding to objects mapped into the address space of the target process. Opening such a file yields a descriptor for the mapped file associated with a particular address-space region. The name a.out also appears in the directory as a synonym for the executable file associated with the ‘‘text’’ of the running process.

The object directory makes it possible for a controlling process to get access to the object file and any shared libraries (and consequently the symbol tables)--in general, any mapped files--without having to know the specific path names of those files.
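Opening a mapped object through the object directory might look like the sketch below. It assumes the directory is reached as /proc/pid/object, by analogy with the other /proc/pid paths in this unit; the helper names object_path and open_object are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "/proc/<pid>/object/<name>" into buf.
 * Returns 0 on success, or -1 if the path does not fit. */
int object_path(char *buf, size_t cap, long pid, const char *name)
{
    int n;

    assert(buf != NULL && name != NULL && strlen(name) > 0);

    n = snprintf(buf, cap, "/proc/%ld/object/%s", pid, name);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return 0;
}

/* Hypothetical helper: open a mapped object of process `pid` read-only.
 * `name` is an entry of the object directory, such as "a.out".
 * Returns a file descriptor, or -1 on failure. */
int open_object(long pid, const char *name)
{
    char path[256];

    if (object_path(path, sizeof path, pid, name) < 0)
        return -1;
    return open(path, O_RDONLY);
}
```

A debugger could call open_object(pid, "a.out") and read the symbol table from the returned descriptor without knowing where the executable lives in the file system.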

lwp directory The lwp directory contains entries each of which names an LWP within the containing process. These entries are directories containing additional files and are described beginning on page 15.


Code Example

Introduction The following code is a simple example of how one process can use the /proc filesystem to obtain information about another. Provided with a single argument (the id of a currently running process), it prints the name and arguments of the process from the psinfo structure.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/procfs.h>

    int main(int argc, char **argv)
    {
        char fname[512];
        struct psinfo p;
        int fd;

        /* check for an argument */
        if (argc != 2)
            exit(1);

        snprintf(fname, sizeof(fname), "/proc/%s/psinfo", argv[1]);

        /* check that the process id is still running */
        if (access(fname, F_OK) < 0)
            exit(1);

        /* read the psinfo structure; give up if the process has gone away */
        fd = open(fname, O_RDONLY);
        if (fd < 0 || read(fd, &p, sizeof(p)) != (ssize_t)sizeof(p))
            exit(1);
        close(fd);

        printf("process pid %s: exec path/args: %s %s\n",
               argv[1], p.pr_fname, p.pr_psargs);
        return 0;
    }
