  • Copyright Oracle 2011. All rights reserved [i]

    Troubleshooting E1 Kernels

    Including:

    Types of Kernel Problems

    Kernel Error Troubleshooting Procedure

    Getting and Using an OS Core File

    OS Tools for Obtaining a Call Stack from Running Code

    Table of Contents

    Chapter 1 - Introduction
        Intended Audience
        Structure of this Document
        Related Materials

    Chapter 2 - Types of Kernel Problems
        Hung Kernel with Low CPU
        Hung Kernel with High CPU
        Zombie Process / Zombie Kernel
        Out of Memory Kernel / Memory Leak Kernel

    Chapter 3 - Kernel Error Troubleshooting Procedure
        General Troubleshooting Philosophy
        Troubleshooting Procedure - Identify Product Area of Problem
        Interactive Problems
        Enterprise Server Problem / Batch Problem
        Batch Problem

    Chapter 4 - Zombie Kernels
        Call Object Kernels (COBK)
        Metadata Kernel

    Chapter 5 - Hung Kernels with High CPU

    Chapter 6 - Hung Kernels with Low CPU
        Is a Package Deployment Currently Underway?
        Troubleshooting Low-CPU Hung Kernels

    Chapter 7 - Out of Memory / Memory Leak Kernels
        Memory Leaks
        Overly-Aggressive Caching
        Troubleshooting Out-of-Memory Issues

  • Troubleshooting E1 Kernels 5/18/2011

    Appendix A - Validation and Feedback
        Customer Validation
        Field Validation

    Appendix B - Glossary

    Appendix C - Getting and Using an OS Core File
        Windows
        AS400 iSeries
        UNIX
            HP
            Linux
            AIX
            Sun

    Appendix D - OS Tools for Obtaining a Call Stack from Running Code
        Unix
        Windows
        AS400

    Chapter 1 - Introduction

    JD Edwards EnterpriseOne kernels consist of several types of processes. The process definitions can be found in JDE.INI. On the enterprise server, two process names are registered: JDENET_N and JDENET_K. The JDENET_N process services incoming and outgoing requests for the JDENET_K processes.

    The number of JDENET_N processes needed on an EnterpriseOne server can be calculated based on the number of connections and the maximum number of net processes. For a detailed JDENET calculation, please refer to the document JD Edwards EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The calculation is described in the section Understanding the jde.ini File Settings, [JDENET].

    For example, the base guides for 8.98 are located here: http://download.oracle.com/docs/cd/E13780_01/jded/html/docset.html

    The minimum and maximum numbers of each type of JDENET_K process are defined in JDE.INI. For each type of JDENET_K kernel, there is a section titled [JDENET_KERNEL_DEF#], where # stands for 1, 2, etc. As of tools release 8.97, there are 32 JDENET_KERNEL_DEF definitions. (Two new definitions, JDENET_KERNEL_DEF31 and JDENET_KERNEL_DEF32, were introduced in 8.97; they correspond to the XMLPublisher and Management Kernels respectively.) For detailed definitions of the JDENET_K processes, please refer to the document JD Edwards EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The necessary calculations are described in the section Understanding the jde.ini File Settings, [JDENET_KERNEL_DEF#].
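
    As a rough illustration of how the [JDENET_KERNEL_DEF#] sections could be inspected, the following Python sketch lists every such section found in INI-style text. The sample text, including the key names and values, is an illustrative assumption and is not copied from a real JDE.INI; consult the System Administration Guide for the actual settings.

```python
import configparser
import re

# Hypothetical excerpt. The key names and values are assumptions made for
# illustration only; they are not taken from the source document.
SAMPLE_INI = """
[JDENET]
maxNetProcesses=4

[JDENET_KERNEL_DEF1]
krnlName=EXAMPLE KERNEL A
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF2]
krnlName=EXAMPLE KERNEL B
maxNumberOfProcesses=10
numberOfAutoStartProcesses=2
"""


def kernel_definitions(ini_text):
    """Return {definition number: {key: value}} for each [JDENET_KERNEL_DEF#] section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names in their original case
    parser.read_string(ini_text)
    definitions = {}
    for section in parser.sections():
        match = re.fullmatch(r"JDENET_KERNEL_DEF(\d+)", section)
        if match:
            definitions[int(match.group(1))] = dict(parser[section])
    return definitions


if __name__ == "__main__":
    for number, settings in sorted(kernel_definitions(SAMPLE_INI).items()):
        print(number, settings)
```

    To use the sketch against a real file, read the file contents into a string and pass it to kernel_definitions; sections other than the kernel definitions are ignored.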

    INTENDED AUDIENCE

    This document is intended for use by three different groups: Customers, Consultants, and Oracle Global Customer Support (GCS).

    This document is primarily concerned with debugging kernel issues for tools releases prior to 8.98.3.0. Tools release 8.98.3.0 introduces several new utilities to aid in troubleshooting kernel issues. While the information in this document will still be correct when applied to releases beyond 8.98.3.0, it provides only minimal coverage of the improved troubleshooting utilities and methodologies that are available in newer tools releases.

    STRUCTURE OF THIS DOCUMENT

    This document provides guidance for self-diagnosing kernel issues using the pre-KRM methodology (pre-898_2.0).

    The KRM documentation is available here:

    OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384

    Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

    Keep in mind that Oracle updates this document as needed so that it reflects the most current feedback we receive from the field. Therefore, the structure, headings, content, and length of this document are likely to vary with each posted version. To see if the document has been updated since you last downloaded it, compare the date of your version to the date of the version posted on My Oracle Support.

    RELATED MATERIALS

    We assume that our readers are experienced IT professionals with a good understanding of JD Edwards EnterpriseOne. To take full advantage of the information covered in this document, we recommend that you have a basic understanding of system administration, basic Internet architecture, relational database concepts/SQL, and how to use Oracle JD Edwards applications.

    This document is not intended to replace the documentation delivered with the CRM PeopleBooks. We recommend that before you read this document, you read the PIA-related information in the PeopleTools PeopleBooks to ensure that you have a well-rounded understanding of our PIA technology. Note: Much of the information in this document will eventually be incorporated into subsequent versions of the PeopleBooks.

    Many of the fundamental concepts related to PIA are discussed in the following PeopleSoft PeopleBooks:

    PeopleSoft Internet Architecture Administration (PeopleTools|Administration Tools|PeopleSoft Internet Architecture Administration)

    Application Designer (Development Tools|Application Designer)

    Application Messaging (Integration Tools|Application Messaging)

    PeopleCode (Development Tools|PeopleCode Reference)

    Customers using tools release 8.98.3.0 or newer should also read the KRM documentation, which describes additional troubleshooting techniques available in those releases as a supplement to the techniques described in this document.

    KRM documentation:

    OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384

    Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

    Chapter 2 - Types of Kernel Problems

    This document refers to several specific types of kernel issues that a customer may encounter. The most important categories of kernel problems are explained below.

    HUNG KERNEL WITH LOW CPU

    Definition:

    A hung kernel with low CPU is a kernel that has stopped functioning correctly but whose process continues to run with very little CPU activity. Generally, this points to a root cause related to a deadlock.

    HUNG KERNEL WITH HIGH CPU

    Definition:

    A hung kernel with high CPU is a kernel that has stopped functioning correctly but whose process continues to run with significant CPU activity. Generally, this points to a root cause related to an infinite loop.

    ZOMBIE PROCESS / ZOMBIE KERNEL

    Definition:

    When an E1 server process crashes due to a programming error in some piece of code that it is running, the kernel stops running from the perspective of the OS. The process is flagged as a zombie kernel within the E1 Enterprise Server, where some of the process IPC data is saved in shared memory. The process is listed in Server Manager as a zombie process. There are many potential causes of a zombie process, including but not limited to null or invalid pointer dereferences, heap memory corruption, stack memory corruption, and race conditions.

    OUT OF MEMORY KERNEL / MEMORY LEAK KERNEL

    Definition:

    An out-of-memory kernel is a kernel that has crashed because its memory footprint exceeded the maximum amount it is allowed to allocate. Generally, this points to a memory leak or the caching of overly large quantities of data.
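
    The four categories above can be summarized as a small decision function. The following Python sketch is only an illustration: the field names, the 50% CPU threshold, and the order in which the checks are applied are assumptions, not values taken from this document.

```python
from dataclasses import dataclass


@dataclass
class KernelObservation:
    os_process_running: bool   # does the OS still list the process?
    flagged_zombie: bool       # listed as a zombie in Server Manager
    out_of_memory_error: bool  # an out-of-memory error was observed
    responding: bool           # is the kernel still servicing requests?
    cpu_percent: float         # recent CPU usage of the process


# Assumed cut-off between "low" and "high" CPU; tune for your environment.
HIGH_CPU_THRESHOLD = 50.0


def classify(obs):
    """Map an observation onto one of the problem categories of this chapter."""
    if obs.out_of_memory_error:
        return "out of memory / memory leak kernel"
    if obs.flagged_zombie or not obs.os_process_running:
        return "zombie kernel"
    if not obs.responding:
        if obs.cpu_percent >= HIGH_CPU_THRESHOLD:
            return "hung kernel with high CPU"
        return "hung kernel with low CPU"
    return "no kernel problem identified"


if __name__ == "__main__":
    print(classify(KernelObservation(True, False, False, False, 97.0)))
```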

    Chapter 3 - Kernel Error Troubleshooting Procedure

    GENERAL TROUBLESHOOTING PHILOSOPHY

    Oracle JD Edwards EnterpriseOne is a highly complex system with many interacting components. The remainder of this chapter and the chapters that follow group similar problems together into a few broad categories and provide generalized techniques to handle any problem in one of these categories. However, in many cases, a more specific troubleshooting procedure may be necessary for a complex problem.

    Whenever a problem is encountered, the troubleshooter's very first action should be to examine any relevant logfiles. Generally speaking, this means consulting jde_####.log, where #### is the Process ID (PID) of the relevant jdenet_k and/or jdenet_n, and also jas.log. If there is a clear error message at or near the end of any of these logfiles, acting on that message may be more efficient than following the procedure below.

    Similarly, the procedure below is designed to guide a troubleshooter until he or she finds something that reveals the root cause of the problem. If, at any point while following this procedure, the troubleshooter finds a clue to the root cause that is too specific to be discussed below, he or she should go off-script and pursue that clue; if this search results in a dead end, the troubleshooter may resume the scripted procedure where he or she left off.

    TROUBLESHOOTING PROCEDURE - IDENTIFY PRODUCT AREA OF PROBLEM

    There are several types of issues that can cause an E1 user to receive a time-out message or a Web Exception. The following sections provide a question-and-answer decision tree to help identify the root cause of the problem.

    First, the E1 admin needs to determine whether the problem is an Interactive Problem, an Enterprise Server Problem, or a Batch Problem.

    INTERACTIVE PROBLEMS

    General:

    1) Did the user receive a Web Exception with the message "There was a problem with the server while running business function"?

    Yes Continue

    No Go to Transaction Processing

    2) Get the jas.log file.

    a. Search within the jas logfile for the phrase "Associated kernel not found". The logged message includes the process ID of the COBK, referred to as #### in the steps below.

    b. Does the jas logfile contain the above phrase?

    Yes Continue

    No Go to Transaction Processing

    3) Log in to SM and go to the Management Dashboard.

    4) Select the Enterprise Server from the list of Managed Instances.

    5) Select Runtime Metrics->Process Detail.

    6) Does the process ID #### exist in the process detail list for the Enterprise Server?

    Yes Continue

    No Go to COBK Zombies:

    7) From SM, does the process ID #### (COBK) have a status of zombie?

    Yes Continue

    No Go to Transaction Processing

    8) Is the process ID #### (COBK) the only kernel with a status of zombie?

    Yes Go to COBK Zombies:

    No Go to Multiple COBK Zombies:

    Transaction Processing:

    1) Did the user receive a Transaction Rollback message?

    Yes Go to Chapter 6 - Hung Kernels with Low CPU

    No Go to High CPU

    High CPU:

    1) Determine how much CPU the COBK process is using. Platform-specific instructions follow. (Note that, beginning in Tools Release 8.98.2.0, this information is also available from Server Manager in the Runtime Metrics->Process Detail page for the Enterprise Server.)

    a. Windows

    i. Launch Windows Task Manager. On the Performance tab, there is a graph showing overall CPU activity.

    ii. To see CPU activity specific to the COBK process, first select the Processes tab.

    iii. Go to View->Select Columns and check the box for the PID column if it is not already enabled. (The CPU Usage column should already be enabled, but if it is not, check that box as well.)

    iv. Click OK, and when you return to the table of processes, click on the PID column to sort by that value.

    v. Find the PID of the COBK, and check the value of the CPU Usage for that row.

    b. AS/400 iSeries: From the terminal, type the command wrkactjob. This will show a table of processes running on that machine. If you know the name of the specific library/subsystem, you may view relevant processes only via the command wrkactjob sbs(<subsystem>), where <subsystem> is the appropriate library.

    c. Unix: SSH to the machine hosting the Enterprise Server and type the command top -p <PID>, where <PID> is the Process ID (PID) of the COBK. Consult the %CPU column.

    2) Is the COBK to which the user is connected using significant CPU?

    Yes Go to Chapter 5 - Hung Kernels with High CPU

    No Continue to Memory Leaks.
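
    On Unix-like systems, the CPU check can also be scripted. The following Python sketch is one possible approach, not a procedure from this document: it calls the standard ps utility with the POSIX pcpu output field. Note that on Linux, ps reports CPU time divided by the lifetime of the process rather than the instantaneous figure shown by top, so the result is best treated as a rough indicator. The function names are hypothetical.

```python
import subprocess


def parse_cpu_percent(ps_output):
    """Convert the single-value output of `ps -o pcpu= -p <PID>` to a float."""
    return float(ps_output.strip().replace(",", "."))


def cpu_percent(pid):
    """Ask ps for the CPU usage of one process (Unix-like systems only)."""
    result = subprocess.run(
        ["ps", "-o", "pcpu=", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the PID does not exist
    )
    return parse_cpu_percent(result.stdout)


if __name__ == "__main__":
    import os

    # Demonstrate on the current process; substitute the PID of the COBK.
    print(cpu_percent(os.getpid()))
```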

    Memory Leaks:

    1) Answer yes if any of the following are true:

    The process's memory usage keeps increasing. (This can be observed using any OS-supplied tool, such as Perfmon on Windows or Glance on HP-UX.)

    The process's amount of allocated memory is already extremely large.

    An out-of-memory error has been observed.

    Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

    No Continue to Metadata Kernel
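
    The three conditions above can be evaluated from a series of memory readings collected with any OS tool. The Python sketch below shows one way to do so; the minimum sample count, the tolerance, and the meaning of "extremely large" (the large_kb parameter) are assumptions that must be chosen for each environment.

```python
def is_steadily_increasing(samples_kb, min_samples=5, tolerance_kb=0):
    """True if every reading exceeds the previous one by more than tolerance_kb."""
    if len(samples_kb) < min_samples:
        return False  # too few readings to call it a trend
    return all(
        later - earlier > tolerance_kb
        for earlier, later in zip(samples_kb, samples_kb[1:])
    )


def memory_leak_suspected(samples_kb, large_kb, oom_seen=False):
    """Answer 'yes' if any of the three conditions from the checklist holds."""
    already_large = bool(samples_kb) and samples_kb[-1] >= large_kb
    return oom_seen or is_steadily_increasing(samples_kb) or already_large


if __name__ == "__main__":
    readings = [100_000, 120_000, 150_000, 190_000, 240_000]  # made-up values
    print(memory_leak_suspected(readings, large_kb=2_000_000))
```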

    Metadata Kernel:

    1) Are there any Metadata Zombie Kernels listed in Server Manager?

    Yes Go to Chapter 4 - Zombie Kernels: Metadata Kernel

    No Go to Chapter 4 - Zombie Kernels: Call Object Kernels (COBK)

    ENTERPRISE SERVER PROBLEM / BATCH PROBLEM

    1) Are there any outstanding requests for jdenet_k or jdenet_n from SM or NetWM? (If this is a UBE problem, or if this is a multi-threaded kernel, answer no.)

    Yes Go to Outstanding Requests.

    No Continue

    2) Are there one or more COBK / RUNBATCH zombies?

    Yes Go to Chapter 4 - Zombie Kernels: COBK Zombies.

    No Continue

    3) Is the process using a significant amount of CPU?

    Yes Go to Chapter 5 - Hung Kernels with High CPU

    No Continue

    4) Is the process's memory usage continuously and steadily increasing?

    Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

    No Continue

    5) Is the process's memory usage constant but extremely large?

    Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

    No Continue

    6) Is the process otherwise hanging or not responding?

    Yes Go to Chapter 6 - Hung Kernels with Low CPU

    No Continue

    7) It appears you have a very unusual issue. Contact Oracle GCS with as much information as is available. In particular, make sure to include any of the following that are available:

    a) steps to reproduce the issue

    b) jde_####.log for the kernel

    c) jde_####.log for the kernel's jdenet_n parent process

    d) jdedebug_####.log for the kernel

    e) jdedebug_####.log for the kernel's jdenet_n parent process

    f) dumpfile, core file, or callstack

    g) jas log

    h) java logs for enterprise server

    Outstanding Requests

    1) Is the number of processed requests increasing over time?

    Yes The kernel is still processing requests, but it is unable to keep up with the rate at which new requests are coming in, resulting in a backlog of queued operations. There may be a misconfiguration, or your hardware resources may be insufficient to meet the demands of your user base.

    No Continue

    2) Observe the trend in the number of outstanding requests over time. Is the number increasing, decreasing, or constant?

    Return to Step 2 of Enterprise Server Problem above, but include this information if you end up contacting Oracle GCS.

    BATCH PROBLEM

    Refer to the relevant knowledge experts or documentation for the batch area.

    Chapter 4 - Zombie Kernels

    There are a myriad of programming errors that can cause a kernel to crash (resulting in a zombie kernel), including but not limited to null or invalid pointer dereferences, heap memory corruption, stack memory corruption, and race conditions. Furthermore, the crash may not occur until some time after the code containing the logic error executes.

    The main focus of this chapter will be on localizing the crash to a specific business function (BSFN) containing the error. Once the BSFN has been identified, the code can be examined for any programming errors.

    CALL OBJECT KERNELS (COBK)

    Determining the cause of the zombie status:

    COBK Zombies:

    1) Open the log file for the COBK/UBE to which the user is connected.

    Prior to tools release 8.98.3.0, this file will be named jde_####.log, where #### is either the Process ID (Windows and Unix) or the Job ID (iSeries) of the relevant COBK/UBE.

    From tools release 8.98.3.0 onward you will be looking for a file with a name of the form jde_*_dmp.log. (This file is created when a kernel crashes, and * represents the PID of the kernel and the timestamp of the crash.)

    2) Go to the end of the log file. Is there a call stack?

    Yes Continue

    No Go to JDENET_N Parent Process Log

    3) Does the call stack show the BSFN?

    Yes Continue

    No Go to JDENET_N Parent Process Log

    4) Can the issue be reproduced?

    Yes Go to Reproducing the Issue.

    No Continue to JDENET_N Parent Process Log

    JDENET_N Parent Process Log

    1) Obtain the jde_####.log where #### is the PID of the parent jdenet_n that spawned the zombie COBK/UBE. If you need instructions on finding the file, consult Obtaining the logfile for the Parent JDENET_N Process.

    2) Search the logfile for the keywords zombie and died. (If there are no hits on either search term, try searching for the Process ID of the COBK/UBE.)

    3) Is there a callstack associated with any of the search terms?

    No Go to Getting an OS Core File.

    Yes Continue

    4) Does the call stack contain a BSFN?

    No Go to Getting an OS Core File.

    Yes Continue

    5) Can the issue be reproduced?

    No Go to Multiple COBK Zombies.

    Yes Continue to Reproducing the Issue.
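
    The search described in step 2 above (look for the keywords zombie and died, and fall back to the Process ID of the COBK/UBE only if neither is found) can be sketched in Python as follows. The function name and the sample log lines are hypothetical; real jde_####.log entries will look different.

```python
import re


def find_crash_lines(log_lines, kernel_pid):
    """Return (line number, text) pairs that mention 'zombie' or 'died'.

    If neither keyword occurs anywhere, fall back to lines that mention the
    PID (or Job ID) of the crashed kernel as a whole word.
    """
    keyword = re.compile(r"zombie|died", re.IGNORECASE)
    pid = re.compile(rf"\b{re.escape(str(kernel_pid))}\b")
    numbered = list(enumerate(log_lines, start=1))
    hits = [(n, line) for n, line in numbered if keyword.search(line)]
    if not hits:
        hits = [(n, line) for n, line in numbered if pid.search(line)]
    return hits


if __name__ == "__main__":
    # Made-up log content for demonstration purposes only.
    sample = ["listener started", "child 4321 died unexpectedly", "listener idle"]
    for number, text in find_crash_lines(sample, 4321):
        print(number, text)
```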

    Reproducing the Issue

    1) Turn on dynamic debugging before reproducing the issue.

    2) Can the issue be reproduced with debugging turned on?

    No Go to Check Tools Release

    Yes Continue

    3) Go ahead and reproduce the problem with debugging on.

    4) Open the resulting debug logfile (jdedebug_####.log) and scroll to the end of the file.

    5) Search upwards for the string BSFNLevel; this should tell you the last BSFN to run before the kernel crashed. Continue to Trouble with a Specific BSFN.
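
    For large debug logs, the upward search for BSFNLevel can be automated. The Python sketch below scans the lines from the end of the file toward the beginning and returns the first match it meets, which is the last occurrence in the file. The sample lines are invented for illustration and do not reflect the real jdedebug_####.log format.

```python
def last_bsfn_line(debug_log_lines, marker="BSFNLevel"):
    """Return (line number, text) of the last line containing marker, or None."""
    for number in range(len(debug_log_lines), 0, -1):
        text = debug_log_lines[number - 1]
        if marker in text:
            return number, text
    return None


if __name__ == "__main__":
    # Hypothetical log content; only the marker string comes from the document.
    sample = [
        "BSFNLevel 1 FirstFunction",
        "some other entry",
        "BSFNLevel 2 SecondFunction",
        "final entry before the crash",
    ]
    print(last_bsfn_line(sample))
```

    When reading from a real file, the lines can be obtained with open(path, errors="replace").read().splitlines().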

    Trouble with a Specific BSFN

    1) Is this a customized BSFN?

    Yes Go to Trouble Involving a Customized BSFN

    No Continue

    2) Is there an ESU for this BSFN?

    Yes Apply the ESU. Generally, this will resolve the issue. If it persists, go to Contacting Oracle GCS.

    No Go to Contacting Oracle GCS

    Trouble Involving a Customized BSFN

    1) Is it possible to try replacing the BSFN with the original code from the release?

    Yes Continue.

    No Consult with the developers who customized the BSFN for your purposes.

    2) Try replacing the BSFN with the original code from the release. Does the problem disappear?

    Yes Consult with the developers who customized the BSFN for your purposes.

    No Continue

    3) Is there an ESU for this BSFN?

    Yes Continue

    No Go to Contacting Oracle GCS

    4) When the ESU is applied, does the problem go away?

    Yes You will need to merge the changes you made to the original BSFN into the version of the BSFN supplied by the ESU.

    No Go to Contacting Oracle GCS

    Contacting Oracle GCS

    1) Contact Oracle GCS with as much information as is available. Especially make sure to include any of the following that are available:

    a) the name of the BSFN

    b) whether the BSFN is customized

    c) whether there are any ESUs for the BSFN

    d) what tools release is in use

    e) steps to reproduce the issue

    f) jde_####.log for the kernel.

    g) jde_####.log for the kernel's jdenet_n parent process.

    h) jdedebug_####.log for the kernel

    i) jdedebug_####.log for the kernel's jdenet_n parent process.

    j) dumpfile, core file, or callstack

    k) jas log

    l) java logs for enterprise server

    Multiple COBK Zombies:

    1) Open all of the jde_####.log files for all jdenet_n parent processes. There are two ways to do this:

    a) Option 1: If you have easy access to the machine hosting the Enterprise Server.

    i) On the hosting machine, navigate to the log folder for your Enterprise Server.

    ii) Grep (search within the text of these files) for the strings zombie and died.

    iii) Open up any files that contain either of these expressions.

    b) Option 2: If you have easy access to the Server Manager for your Enterprise Server.

    i) Log in to SM and go to the Management Dashboard.

    ii) Select the Enterprise Server from the list of Managed Instances.

    iii) Select Runtime Metrics->Process Detail.

    iv) Sort by Process Name.

    v) For any jdenet_n (Network Listener) processes, click the link in the JDELOG File Size column for that row to view the logfile.

    2) In each jde_####.log for a jdenet_n, locate the Business Functions (BSFN) call stack.

    3) Is there a pattern in which one BSFN appears more often than the others in the call stacks?

    Yes Continue

    No Go to Consult the OS Core File

    4) Can the issue be reproduced?

    Yes Go to Reproducing the Issue

    No Go to Consult the OS Core File

    Check Tools Release

    1) Is the customer on a supported release?

    Yes Continue

    No The customer should upgrade to a supported release or provide a compelling reason why this is not possible.

    2) Is the customer on the current release?

    Yes Skip to step 4.

    No Continue

    3) Can the customer upgrade to the current release?

    Yes The customer should upgrade to the current release and see if the problem is resolved. If the problem persists, then continue.

    No Continue

    4) Is there a Solution Document or announcements document in the My Oracle Support Knowledge Base for the customer's issue?

    Yes Follow the instructions in the document for resolving the issue.

    No Go to Contacting Oracle GCS.

    Obtaining the Logfile for the Parent JDENET_N Process.

    1) If a COBK kernel has crashed, and there is no useful information in its log, there may be helpful information in the logfile for the parent JDENET_N process. This section will provide instructions on obtaining the file.

    2) Log in to Server Manager and go to the Management Dashboard.

    3) Select your Enterprise Server from the list of Managed Instances.

    4) Select Runtime Metrics->Process Detail.

    5) Is the zombie COBK listed?

    Yes Continue

    No The list of zombies has already been cleared. Skip to step #10

    6) Click the name (CALL OBJECT KERNEL) of the COBK that has crashed (the zombie COBK).

    7) Under General Information, find Parent Process ID. Is the Parent PID non-zero?

    Yes Continue

    No Skip to step #10

    8) Return to the Runtime Metrics->Process Detail page, and find the JDENET_N process whose PID matches the Parent PID. Click on the size of its log file (the entry under JDELOG File Size for that row) to view the logfile.

    9) Return to JDENET_N Parent Process Log.

    10) If there is more than one JDENET_N, you will have to find all JDENET_N logfiles and grep (search within the text of these files) for the PID of the zombie COBK to determine the appropriate logfile.

    If you have access to the machine hosting the Enterprise Server, the easiest way to do this is to connect to that machine, navigate to the log folder for the Enterprise Server, and search within jde_*.log.

    Alternatively, the JDENET_N logfiles can be accessed one at a time from the Runtime Metrics->Process Detail page of Server Manager by clicking on the JDELOG File Size for each process that is a Network Listener.

    11) Once you have identified the correct logfile, return to JDENET_N Parent Process Log.
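
    When shell access to the Enterprise Server is available, the search in step 10 can be scripted. The following Python sketch lists the jde_*.log files in a directory that mention the PID of the zombie COBK, skipping the zombie's own logfile. The log directory is an input you must supply; the function name is hypothetical.

```python
import re
from pathlib import Path


def logs_mentioning_pid(log_dir, pid):
    """Return the names of jde_*.log files whose text mentions pid as a whole word."""
    pattern = re.compile(rf"\b{re.escape(str(pid))}\b")
    own_log = f"jde_{pid}.log"  # the crashed kernel's own log is not of interest here
    matches = []
    for path in sorted(Path(log_dir).glob("jde_*.log")):
        if path.name == own_log:
            continue
        with open(path, "r", errors="replace") as handle:
            if any(pattern.search(line) for line in handle):
                matches.append(path.name)
    return matches


if __name__ == "__main__":
    import sys

    # Usage: python find_parent_log.py <log directory> <PID of zombie COBK>
    if len(sys.argv) == 3:
        print(logs_mentioning_pid(sys.argv[1], sys.argv[2]))
```

    Any file returned is a candidate for the parent JDENET_N logfile; confirm by opening it and checking that it belongs to a Network Listener process.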

    Consult the OS Core File

    If it has proven impossible to obtain a (useful) callstack from any of the EnterpriseOne log files, it may still be possible to obtain a callstack from an OS-generated core file. If you are unfamiliar with generating and working with OS core dumps on your platform, information on doing so is available in Appendix C - Getting and Using an OS Core File.

    Once you have examined the callstack, if you can determine which BSFN was running at the time of the crash, go to Trouble with a Specific BSFN above.

    If you cannot isolate a specific BSFN, you should consult Oracle GCS.

    METADATA KERNEL

    Historically, there have been issues with the Metadata Kernel, particularly out-of-memory errors and UBE-not-processing errors. It is believed that these issues were all resolved by Tools Release 8.98.2.0.

    If a customer is experiencing crashes of the Metadata Kernel, the customer should attempt to upgrade to a newer tools release.

    If the customer is already running a recent release, or an upgrade is not practical, the customer should contact Oracle GCS. It will be helpful to Oracle GCS to have:

    Any available logfiles for the kernel,

    Steps to reproduce the issue,

    A copy of the Java heap dump (see Enabling a Java Heap Dump).

    Enabling a Java Heap Dump

    The instructions for enabling a Java heap dump are specific to the JDK and OS. Because better and more recent methods are being developed at a rapid pace, it is best to contact Kernel Support or the Dev SMEs for the latest means of creating a Java heap dump.

    Chapter 5 - Hung Kernels with High CPU

    A non-responsive kernel with high CPU has not crashed per se. While the kernel is no longer performing its required duties, code continues to execute, most likely in some form of infinite loop. The first step in resolving this issue is to identify where in the code execution is taking place.

    One can determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack and cstack. These tools are discussed in Appendix D - OS Tools for Obtaining a Call Stack from Running Code. Note that customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

    It is important to note that, while a high-CPU hung kernel is most likely engaged in some sort of infinite loop, that loop will generally not be contained in the innermost executing function of the callstack you obtain. Rather, the innermost functions are likely to be called from within the infinite loop. Therefore, it is necessary to repeat the process of obtaining a callstack several (five to ten) times. The outermost entries in the callstack will remain the same across all the callstacks collected, while the innermost entries will vary. The infinite loop most likely resides at the level of the innermost function that is common to all of the collected callstacks.
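
    The comparison of collected callstacks can be expressed as a common-prefix computation. In the Python sketch below, each callstack is assumed to have already been reduced to a list of function names ordered from outermost to innermost (many OS tools print the innermost frame first, in which case each list must be reversed beforehand). The function names in the example are invented.

```python
def common_outer_frames(callstacks):
    """Return the frames shared by every callstack, starting from the outermost."""
    shared = []
    # zip stops at the shortest callstack, so no index can run out of range.
    for frames_at_depth in zip(*callstacks):
        if all(frame == frames_at_depth[0] for frame in frames_at_depth):
            shared.append(frames_at_depth[0])
        else:
            break
    return shared


def likely_loop_location(callstacks):
    """The innermost frame common to all samples, or None if nothing is shared."""
    shared = common_outer_frames(callstacks)
    return shared[-1] if shared else None


if __name__ == "__main__":
    samples = [
        ["main", "dispatch", "ProcessOrders", "ReadRow", "fetch"],
        ["main", "dispatch", "ProcessOrders", "FormatRow"],
        ["main", "dispatch", "ProcessOrders", "ReadRow", "parse"],
    ]
    print(likely_loop_location(samples))  # prints ProcessOrders
```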

    Chapter 6 - Hung Kernels with Low CPU

    IS A PACKAGE DEPLOYMENT CURRENTLY UNDERWAY?

    When a package is being deployed to the Enterprise Server, the kernels temporarily suspend normal operation, mimicking the behavior of a hung kernel with low CPU usage. Generally, package deployments are fairly quick to complete, but under certain circumstances, deployments can require extended time. Once the package deployment completes or times out, normal kernel operations will resume.

    If a package deployment is not underway, proceed to the next section.

    TROUBLESHOOTING LOW-CPU HUNG KERNELS

    Similar to a hung kernel with high-CPU, a non-responsive kernel with low-CPU has also not crashed in the traditional sense.

    Although the kernel is no longer performing its required duties, code continues to execute, most likely in some form of

    deadlock.

    A program is said to be in deadlock when two or more operations are each waiting for the other to finish, creating a situation in

    which neither operation ever completes and both wait forever. Though not technically deadlock, a situation with similar

    symptoms can arise when a single operation is waiting to obtain a lock on a resource, but that lock was not properly released

    when a previous operation finished using the resource.

    While UBE kernels are not multi-threaded, it is important to note that they are not immune from deadlock. Two separate UBEs

    executing simultaneously (or, more likely, the same UBE being executed multiple times simultaneously) can compete for locks

    on shared resources and end up in deadlock.

    As in the previous chapter, the first step in resolving this issue is to identify where in the code the execution is. One can

    determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal

    error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack

    and pstack. The tools are discussed in Appendix D OS Tools for Obtaining a Call Stack from Running Code. Note that

    customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

    After obtaining a call stack for all low-CPU hung kernels, the troubleshooter should examine the executing code to identify

    what resource locks are currently held and what locks are pending. The troubleshooter should then study the remainder of the

    code to determine where else these locks are obtained / released, and where the logical flaw resides.


    Chapter 7 - Out of Memory / Memory Leak Kernels

    MEMORY LEAKS

    Generally speaking, a kernel suffering from a memory leak is discovered after it has crashed. The kernel crashes when a

    memory allocation attempt fails because the process has reached its maximum allowed memory.1 Sometimes examining the

    callstack at the time of the crash can indicate where this failed memory allocation occurred, but that may or may not provide

    useful information. Often, the failed memory allocation is merely the unrelated victim of a programming error elsewhere in the

    code that prevents no-longer-needed memory from being recycled.

    OVERLY-AGGRESSIVE CACHING

    An out-of-memory error does not necessarily imply the existence of a memory leak per se. Misuse of the JDB cache is a

    common source of out-of-memory errors. The JDB cache can be used to store the result of a frequent database query in

    memory for improved performance. However, if the cache is used too liberally with large tables, free memory will fill up with

    JDB cache entries.

    Overly-aggressive caching can be an issue with call object kernels, but it more often causes problems in batch jobs, simply due

    to the much higher volume of data batch jobs generally manipulate. If an out of memory error is encountered, the

    troubleshooter should investigate what information is being stored in the JDB cache and verify that no unreasonably large

    queries are being cached.

    There are two ways that a query result may be stored in the JDB cache.

    1. If the table over which the query is made has been registered in the F98613 table, then the query result will be placed

    in the JDB cache. To check which tables' queries are being cached through this method, examine the F98613 table.

    2. A BSFN can use the JDB_AddTableToDBCache API to have a table's query results added to the cache. To check

    whether this has happened, enable debug logging and search the debug log for messages of the form:

    Entering JDB_AddTableToDBCache (Table =)

    Small, unchanging tables such as company constants are prime candidates for caching in the JDB cache. Except in very

    unusual circumstances, tables containing business data should never be cached.
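
    The debug log search described in method 2 above might be scripted as follows. This is a sketch under stated assumptions: the helper name jdb_cache_requests is hypothetical, and the only thing assumed about the log is that the text "Entering JDB_AddTableToDBCache" appears somewhere on the relevant lines. Whatever follows that text on the line is reported as-is.

```shell
#!/bin/sh
# jdb_cache_requests LOGFILE...
# Hypothetical helper: counts the distinct "Entering JDB_AddTableToDBCache"
# messages found in the given debug logs, most frequent first.
jdb_cache_requests() {
    awk '
        match($0, /Entering JDB_AddTableToDBCache.*/) {
            seen[substr($0, RSTART, RLENGTH)]++      # key = message text without the log prefix
        }
        END { for (msg in seen) print seen[msg], msg }
    ' "$@" | sort -rn
}
```

    Each output line is a count followed by the message text, so a table that is added to the cache repeatedly, or a table that is known to be large, stands out immediately.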

    TROUBLESHOOTING OUT-OF-MEMORY ISSUES

    If an out-of-memory error does not appear to be related to overly-aggressive caching, the best way to troubleshoot a kernel that

    is running out of memory is to recreate the issue while using a memory profiling tool such as Purify, Valgrind, or Pex.

    (Customers using tools release 8.98.3.0 and beyond have the additional options of using BMD or Jade.) Memory profiling

    tools such as these will show the user what memory has been allocated and never been freed (reclaimed).

    1 Even when there is plentiful total free memory, an attempt to allocate a large block of memory will still fail if there is no

    adequately large block of contiguous free memory.


    It is important to note that using any of the above profiling tools will incur a heavy performance penalty. If it is at all possible,

    this should be done on a non-production server.


    Appendix A Validation and Feedback

    This section documents the real-world validation that this document has received.

    CUSTOMER VALIDATION

    Oracle is working with PeopleSoft customers to get feedback and validation on this document. Lessons learned from these

    customer experiences will be posted here.

    FIELD VALIDATION

    Oracle Consulting has provided feedback and validation on this document. Additional lessons learned from field experience

    will be posted here.


    Appendix B Glossary

    Term                  Definition
    BSFN                  Business Function
    COBK                  Call Object Kernel
    E1                    Oracle JD Edwards EnterpriseOne
    ESU                   Electronic Software Update
    GCS                   Global Customer Support
    MDK                   Metadata Kernel
    PID                   Process Identifier (Process ID)
    SAR                   Software Action Request
    SM                    Server Manager
    NetWM                 Network Work Management standalone utility shipped with Enterprise Server that shows queues, outstanding requests, etc.
    Callstack             A list of currently executing functions organized hierarchically to show parent (caller) to child (callee) relationships
    UBE                   Universal Batch Engine
    OS                    Operating System
    Infinite Loop         A program is said to be in an infinite loop when it continues to execute the same section of code repeatedly forever.
    Deadlock              A program is said to be in deadlock when two or more operations are each waiting for the other to finish, creating a situation where neither operation ever completes and both wait forever. While not technically deadlock, a situation with similar symptoms can arise when a single operation is waiting to obtain a lock on a resource and that lock was not properly released when a previous operation finished with the resource.
    Management Dashboard  The entry page to Server Manager (SM). The page has the title Managed Homes and Managed Instances and can be reached by clicking a link in the upper left corner of most SM pages.


    Appendix C Getting and Using an OS Core File

    In Tools Release 8.98.3.0, several new features were added to streamline the debugging of kernel issues. This document is

    primarily intended for users of Tools Releases in the 8.98.2 family and earlier. Users of Tools Release 8.98.3 and beyond will

    find a simpler, platform-independent set of instructions in the KRM documentation, which is available here:

    OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384

    Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

    This chapter provides instructions for obtaining a call stack and a dump file on the following platforms:

    Windows Server

    AS400 - iSeries

    UNIX

    WINDOWS

    Prerequisite: This is for the Windows platform only.

    1) The machine should have Debugging Tools for Windows installed. If it is not installed, please download and install it from the

    following URL:

    http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    PS: The above package will install windbg. Please note the path of windbg.exe; it will be used to capture the crash dump.

    2) Have the customer download this version:

    Current Release version 6.11.1.402 - February 6, 2009

    Install 32-bit version 6.6.7.5 [15.2 MB]

    Steps to install UserDump:

    1. Download Site (version 8.1)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF69-28AC08570B7E&displaylang=en&displaylang=en

    a) Click Download


    b) Click Run

    c) After the download completes, a new folder, C:\kktools\userdump8.1, will be created.

    2. Setup

    http://support.microsoft.com/kb/241215

    a) In C:\kktools\userdump8.1\x86, click setup.exe

    b) A folder C:\WINDOWS\system32\kktools will be created after the setup.

    3. Capturing E1 COBK

    a) Go to Control Panel->Process Dumper


    b) Click New


    c) enter: jdenet_k.exe and click OK


    d) Click Rules:


    e) Select Use custom rules

    - Point the Dump file folder to a folder that is easily accessible. Make sure the folder exists.

    - Keep all the settings as seen.

    - Check Kill process after dumping

    - Click OK


    f) Optional: (unless instructed)

    1) Check All Exceptions OR

    2) Select specific exceptions

    i) Access violation

    ii) Array bounds exceeded

    iii) Stack Overflow

    iv) Invalid handle

    v) Overflow

    vi) Stack Check

    g) Click Apply or OK


    Getting Page Heap: (Optional)

    http://support.microsoft.com/kb/267802

    1. From the command line, go to the drive where the Debugging Tools for Windows folder is installed.

    2. From the command line:

    >gflags /p /enable runbatch.exe /full

    /full = full page heap; this will use a lot of memory and resources.

    3. Targeting specific DLLs:

    >gflags /p /enable jdenet_k.exe /dlls callbsfn.dll cruntime.dll

    4. From the GUI interface of GFLAGS:

    a) Go to Start->All Programs

    b) Debugging Tools for Windows->Global Flags

    c) Click on Image File tab page

    d) Enter an executable name and TAB OUT - DO NOT HIT ENTER


    - check the options as seen

    e) To remove the settings, follow instruction 4a thru 4d but uncheck all options

    AS400 ISERIES

    When a C2M1211 or C2M1212 message is generated from a single-level store heap routine, the code checks for a *DTAARA

    named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists, the program stack is dumped. If the data area does not

    exist, no dump is performed.

    Set up a data area to capture the call stack for the C2M1212 heap error message.


    CRTDTAARA DTAARA(QGPL/QC2M1212) TYPE(*CHAR) LEN(1)

    Set up a data area to capture the call stack for the C2M1211 heap error message.

    Setting up the C2M1211 data area requires PTFs SI27412 and SI28640 on V5R4.

    CRTDTAARA DTAARA(QGPL/QC2M1211) TYPE(*CHAR) LEN(1)

    Once the data area is in place, a spool file named QPRINT is created with dump information for every C2M1211 message or

    C2M1212 message. This spool file can be read to figure out which tools, apps, or OS API is causing the memory overwrite

    (it may also be something IBM can read).

    The spool file is created for the user running the job that gets the message. For example, if the job getting the C2M1211

    message or C2M1212 message is a server job or batch job running under userid ABC123, then the spool file is created in the

    output queue for userid ABC123. Once the spool files containing stack tracebacks are obtained, the data area can be removed,

    and the tracebacks analyzed.

    To disable the dumps, delete the data area(s).

    For further information please read Diagnosing and Debugging Memory Problems : C2M1211 and C2M1212 Messages from

    IBM website.

    When a C2M1211 message or C2M1212 message is generated from a teraspace heap routine, the code checks for a *DTAARA

    named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists and contains at least 50 characters of data, a 50

    character string is retrieved from the data area. If the string within the data area matches one of the following strings, special

    behavior is triggered.

    _C_TS_dump_stack

    _C_TS_dump_stack_vfy_heap

    _C_TS_dump_stack_vfy_heap_wabort

    _C_TS_dump_stack_vfy_heap_wsleep

    If the data area does not exist, no dump or heap verification is performed. For further information please read

    Enablement for teraspace heap memory managers from IBM website.

    Here is an example of how to create a data area that causes _C_TS_malloc_debug to be called to verify the heap whenever a

    C2M1211 message or C2M1212 message is generated.

    On IBM i 6.1 (with PTF SI33945) and IBM i 7.1, the following values can be placed in the data areas:

    CRTDTAARA DTAARA(QGPL/QC2M1211) TYPE(*CHAR) LEN(50) VALUE('_C_TS_dump_stack_vfy_heap_wabort')

    CRTDTAARA DTAARA(QGPL/QC2M1212) TYPE(*CHAR) LEN(50) VALUE('_C_TS_dump_stack_vfy_heap_wabort')

    This re-validates the heap and, if memory corruption is detected, aborts the job.


    Caution: This should be used in a test environment, because it can start throwing a lot of errors/exceptions, and with the abort option

    you will see more zombie processes.

    UNIX

    1) In the JDE.INI config file, under the [JDENET] section, set the following: HandleKrnlSignals=0 and krnlCoreDump=1.

    This will cause a core file to be dumped, provided the operating system allows it.

    2) If the Oracle client is being used to connect to an Oracle database, log in as the oracle userid that owns the Oracle Client

    install. Add the following line to the $ORACLE_HOME/network/admin/sqlnet.ora file:

    DIAG_SIGHANDLER_ENABLED=false

    3) Next, you must ensure that the operating system allows the creation of core files.

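    a) On the command line type the command: ulimit -c. This will show the current maximum size for core files.

    b) If the size is 0 (or very small), then no core file will be created.

    c) To change the size for the core file, on the command line, type: ulimit -c <size> where <size> is the new maximum size. The unit depends on the shell (commonly 512-byte or 1024-byte blocks rather than bytes); the value unlimited is also accepted.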

    d) Confirm the ulimit change by rerunning ulimit -c on the command line. If the value from step c above is not displayed, the hard limit may need to be raised by the root user. Changes to the /etc/security/limits

    e) If E1 Enterprise Server services are to be started from the command line using RunOneWorld.sh, start the E1

    Enterprise Server services from a login session where ulimit -c was run. The ulimit command has to be run

    for each new login session on the server that is used to run the RunOneWorld.sh script. If the E1 Enterprise Server

    needs to be stopped and restarted often, adding the ulimit -c command to the bottom of the

    $SYSTEM/bin32/toolsenv.sh script will ensure the ulimit command is run each time a new login session is opened.

    f) If the E1 Enterprise Server is to be stopped and restarted remotely via Server Manager, the Server Manager client on

    the Enterprise Server must be restarted from a login session where ulimit -c has been run. Run the ulimit

    command, then go to the jde_home/bin directory and run the command: restartAgent

    g) Test that core files are being created properly by selecting a jdenet_k process PID and running the following command:

    kill -15 <PID>

    This should generate a core file.
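
    The check in steps 3a through 3d can be wrapped in a small helper so that it can be repeated in every login session that starts E1 services. This is an illustrative sketch only: the function name classify_core_limit and its messages are assumptions, and it merely interprets the text that ulimit -c prints. It does not change any limit.

```shell
#!/bin/sh
# classify_core_limit VALUE
# Hypothetical helper: interprets the text printed by "ulimit -c".
# Returns 0 when core files can be written, 1 when they are disabled,
# and 2 when the value is not recognized.
classify_core_limit() {
    case $1 in
        unlimited)
            echo "ok: core file size is unlimited" ;;
        ''|*[!0-9]*)
            echo "unknown: unexpected value '$1'"; return 2 ;;
        0)
            echo "disabled: no core file will be created"; return 1 ;;
        *)
            echo "limited: core file size is capped at $1 (units depend on the shell)" ;;
    esac
}

# Typical use in the login session that will start the E1 services:
#   classify_core_limit "$(ulimit -c)"
```

    A "limited" result still deserves a second look, since a small cap can leave the core file missing or truncated.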

    4) When a core file is generated, it is written to the $SYSTEM/bin32 directory and always has the same name, unless the operating

    system is actively managing core file names and locations. The server may already be configured to put all core files in a

    central location. If so, the server may be reconfigured, or the core files can be copied to the $SYSTEM/bin32 directory to

    be read. The following options generate core files with unique names:

    a) On Sun Solaris, put the coreadm command in the user profile:

    coreadm -p core.%f.%p $$

    The above command will generate the core file with the following name format:

    core.<executable name>.<PID>

    b) On Linux, log in as root and edit the /etc/sysctl.conf file and add the following line:

    kernel.core_uses_pid = 1

    Anytime the /etc/sysctl.conf file is changed, the root user must run the following command to make the change effective

    immediately:

    sysctl -p

    Once this is run, every new login session will get the new settings. Stop and restart E1 following

    the directions in step 3e or 3f.

    c) If no other core naming options are available, create a script to detect the core file and rename it. See the following for

    example. Run the script from the $SYSTEM/bin32 directory in the background with nohup using this command:

    nohup rename_core &


    rename_core script sample

    #!/bin/ksh
    # This script just hangs around waiting for a core file to appear, and if
    # one does, renames it to a name based on the current date and time.
    while true
    do
        sleep 30
        if [ -f core ]
        then
            cname="core.$(date +%Y%m%d%H%M%S)"
            echo "renaming core to $cname"
            mv core "$cname"
        fi
    done

    5) Once the core files are captured, the core files must be opened at the customer site to get the call stack.

    6) Which platform is the customer using?

    HP LINUX AIX SUN

    HP

    1) Do you know what executable created the core file? Yes No

    2) On the command line type:

    file <core file>

    3) The above command will give you the executable name to be used in the Get HP Callstack (#4)

    Get HP Callstack

    4) Getting the callstack

    Command line:

    >gdb <executable> <core file>

    Example:

    >gdb jdenet_k core.xxxx.12345

    Once the core file is open, do the following:

    >info thread    This will give you a list of threads that were created within the jdenet_k process.

    >thread #       Open thread number #

    >where          List the callstack within that thread #

    >quit           Exit gdb


    LINUX

    Linux core files generally must be read on the same server they were created. Displaying the core file on a different server can

    produce incorrect output.

    1) Do you know what executable created the core file? Yes No

    2) On the command line type:

    file <core file>

    3) The above command will give you the executable name to be used in the Get Linux Callstack (#4)

    Get Linux Callstack

    4) Getting the callstack

    Command line:

    >gdb <executable> <core file>

    Example:

    >gdb jdenet_k core.12345

    Once the core file is open, do the following:

    >info thread    This will give you a list of threads that were created within the jdenet_k process.

    >thread #       Open thread number #

    >where          List the callstack within that thread #

    >quit           Exit gdb

    There is some optional information that can be collected along with the stack:

    show charset        Show the effective character set when the process crashed.

    show environment    Show the environment variables when the process crashed.
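
    When a core file contains many threads, the interactive gdb steps above can be run in one pass. The following sketch assumes a GNU gdb recent enough to support the -batch and -ex command-line options; the helper name dump_core_stacks and the GDB override variable are assumptions made for illustration and are not part of E1.

```shell
#!/bin/sh
# dump_core_stacks EXECUTABLE COREFILE
# Hypothetical helper: runs gdb non-interactively and prints the thread list
# followed by the callstack of every thread in the core file.
# GDB may be set to an alternate gdb path (an assumption of this sketch).
dump_core_stacks() {
    "${GDB:-gdb}" -batch \
        -ex "info threads" \
        -ex "thread apply all where" \
        "$1" "$2"
}

# Example, using the file names from the steps above:
#   dump_core_stacks jdenet_k core.12345 > callstack.txt
```

    Redirecting the output to a file, as in the commented example, produces a single text file that can be attached to a support case.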

    AIX

    1) Do you know what executable created the core file? Yes No

    2) On the command line type:

    file <core file>

    3) The above command will give you the executable name to be used in the Get AIX Callstack (#4)

    Get AIX Callstack

    4) Getting the callstack

    Command Line:

    dbx prog

    This will bring up dbx; the user has to press the Enter or Return key several times.

    >where          List the callstack


    SUN

    1) Simply type the following on the command line:

    Command Line:

    pstack <core file>

    This will list the callstack.


    Appendix D OS Tools for Obtaining a Call Stack from Running Code

    The following procstack / pstack commands are to be used when a process is either hung or running with high CPU usage. Please

    note that they should be used on systems that are pre-8.98.3.x; in 8.98.3.x and beyond, the same call stacks can be obtained

    from CPU Diagnostics in Server Manager (simply press CPU Diagnostics in Server Manager).

    Caution: This document may contain information, software, products, services which are not supported by Oracle Support

    Services and are being provided as is without warranty. Please refer to the following site for My Oracle Support Terms of

    use: https://support.oracle.com/CSP/ui/TermsOfUse.html

    UNIX

    The following should be run on the various UNIX platforms to dump call stacks:

    HP-UX: /usr/ccs/bin/pstack

    AIX: /usr/bin/procstack

    SUN: /usr/bin/pstack

    LINUX: /usr/bin/pstack

    More information on procstack can be found in the IBM documentation for the procstack command.
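
    Because a hung kernel usually needs to be sampled several times, the tools listed above can be driven from a small loop. The following is a minimal sketch: the helper name collect_stacks, its argument order, and the output file naming are all assumptions made for illustration. It simply runs the chosen tool against one PID repeatedly and saves each result.

```shell
#!/bin/sh
# collect_stacks TOOL PID COUNT INTERVAL OUTDIR
# Hypothetical helper: runs TOOL (for example /usr/bin/pstack or
# /usr/bin/procstack) against PID, COUNT times, INTERVAL seconds apart,
# saving each callstack to OUTDIR/stack.PID.N.txt.
collect_stacks() {
    tool=$1 pid=$2 count=$3 interval=$4 outdir=$5
    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        "$tool" "$pid" > "$outdir/stack.$pid.$i.txt" 2>&1
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Example: ten samples of process 12345, five seconds apart
#   collect_stacks /usr/bin/pstack 12345 10 5 /tmp/stacks
```

    The five-second interval in the example is arbitrary; any spacing that lets the kernel make progress between samples serves the same purpose.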

    WINDOWS

    Use the ADPlus tool to collect the call stack information on the Windows platform. For more information on how to use the tool,

    see the Microsoft article How to use ADPlus to troubleshoot "hangs" and "crashes".

    AS400

    The process below can be used to retrieve the program stack for a job with a single thread or the first thread of a multithreaded

    job.

    cmd: ADDLIBLE E900SYS

    cmd: SAW | Option 2 Work with Server Processes | Option 3 Display OneWorld Processes


    The following creates a spool file containing the program stack (call stack)

    cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OUTPUT(*PRINT) OPTION(*PGMSTK)

    The following creates a spool file containing the program stack (call stack)

    cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OPTION(*PGMSTK)


    1. Create a library and output queue to move the previously generated spool file items.

    cmd: CRTLIB JDETEMP

    cmd: CRTOUTQ JDETEMP/JDETEMP

    2. Copy the items found in output queue WRKOUTQ JDETEMP/JDETEMP via iSeries Navigator to a local Windows folder.

    a. Expand the host name node. Log in to the system. Expand the Basic Operations node. Right-click

    Printer Output, highlight Customize this View, and select Include.


    Change the Users value to All. Type JDETEMP/JDETEMP in the Output queues field as shown below.


    b. Highlight all of the spool files found in the right-hand window pane. Press Ctrl-C (to copy) and paste these

    files into a local Windows Explorer folder, e.g. SND2DENVER.