Troubleshooting E1 Kernels

Copyright Oracle 2011. All rights reserved

Including:

Types of Kernel Problems

Kernel Error Troubleshooting Procedure

Getting and Using an OS Core File

OS Tools for Obtaining a Call Stack from Running Code

Troubleshooting E1 Kernels 5/18/2011

Table of Contents

CHAPTER 1 - INTRODUCTION
Intended Audience
Structure of this Document
Related Materials

CHAPTER 2 - TYPES OF KERNEL PROBLEMS
Hung Kernel with Low CPU
Hung Kernel with High CPU
Zombie Process / Zombie Kernel
Out of Memory Kernel / Memory Leak Kernel

CHAPTER 3 - KERNEL ERROR TROUBLESHOOTING PROCEDURE
General Troubleshooting Philosophy
Troubleshooting Procedure: Identify Product Area of Problem
Interactive Problems
Enterprise Server Problem / Batch Problem
Batch Problem

CHAPTER 4 - ZOMBIE KERNELS
Call Object Kernels (COBK)
Metadata Kernel

CHAPTER 5 - HUNG KERNELS WITH HIGH CPU

CHAPTER 6 - HUNG KERNELS WITH LOW CPU
Is a Package Deployment Currently Underway?
Troubleshooting Low-CPU Hung Kernels

CHAPTER 7 - OUT OF MEMORY / MEMORY LEAK KERNELS
Memory Leaks
Overly-Aggressive Caching
Troubleshooting Out-of-Memory Issues

APPENDIX A VALIDATION AND FEEDBACK
Customer Validation
Field Validation

APPENDIX B GLOSSARY

APPENDIX C GETTING AND USING AN OS CORE FILE
Windows
AS400 iSeries
UNIX
HP
LINUX
AIX
SUN

APPENDIX D OS TOOLS FOR OBTAINING A CALL STACK FROM RUNNING CODE
Unix
Windows
AS400


Chapter 1 - Introduction

JD Edwards EnterpriseOne Kernels consist of several types of processes. The process definitions can be found in JDE.INI. On

the enterprise server, two process names are registered, JDENET_N and JDENET_K. The JDENET_N process services

incoming and outgoing requests for the JDENET_K processes.

The number of JDENET_N processes needed on an EnterpriseOne server can be calculated based on the number of connections

and maximum number of net processes. For a detailed JDENET calculation, please refer to the document, JD Edwards

EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The calculation is

described in the section, Understanding the jde.ini File Settings, [JDENET].

For example, the base guides for 8.98 are located here: http://download.oracle.com/docs/cd/E13780_01/jded/html/docset.html

The minimum and maximum numbers of each type of JDENET_K process are defined in JDE.INI. For each type of

JDENET_K kernel, there is a section titled [JDENET_KERNEL_DEF#] where # stands for 1, 2, etc. As of tools release 8.97,

there are 32 JDENET_KERNEL_DEF definitions. (Two new definitions, JDENET_KERNEL_DEF31 and

JDENET_KERNEL_DEF32, were introduced in 8.97, and they correspond to the XMLPublisher and Management Kernels

respectively.) For detailed definitions of the JDENET_K processes, please refer to the document, JD Edwards

EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The necessary

calculations are described in the section, Understanding the jde.ini File Settings, [JDENET_KERNEL_DEF#].
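As a rough illustration of how the [JDENET_KERNEL_DEF#] sections can be reviewed, the following Python sketch lists every such section found in INI-style text. The sample text and its key names (exampleName, exampleMax) are placeholders invented for this example and are not actual JDE.INI settings; consult the System Administration Guide for the real keys.

```python
import configparser
import re

# Hypothetical excerpt. The key names and values are placeholders for
# illustration only, not a statement of what a real JDE.INI contains.
SAMPLE_INI = """
[JDENET]
exampleSetting=1

[JDENET_KERNEL_DEF1]
exampleName=FIRST KERNEL
exampleMax=1

[JDENET_KERNEL_DEF2]
exampleName=SECOND KERNEL
exampleMax=4
"""

KERNEL_SECTION = re.compile(r"^JDENET_KERNEL_DEF(\d+)$")


def kernel_definitions(ini_text):
    """Return {definition number: {key: value}} for each kernel section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    definitions = {}
    for section in parser.sections():
        match = KERNEL_SECTION.match(section)
        if match:
            definitions[int(match.group(1))] = dict(parser[section])
    return definitions


for number, settings in sorted(kernel_definitions(SAMPLE_INI).items()):
    print(number, settings)
```

A real JDE.INI may contain lines that the standard INI parser rejects, so this sketch may need adjusting before it can read a production file.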

INTENDED AUDIENCE

This document is intended for use by three different groups: Customers, Consultants, and Oracle Global Customer Support

(GCS).

This document is primarily concerned with debugging kernel issues for tools releases prior to 8.98.3.0. Tools release 8.98.3.0

introduces several new utilities to aid in troubleshooting kernel issues. While the information in this document will still be

correct when applied to releases beyond 8.98.3.0, it provides only minimal coverage of the improved troubleshooting utilities

and methodologies that are available in newer tools releases.

STRUCTURE OF THIS DOCUMENT

This document provides guidance for self-diagnosing kernel issues using the pre-KRM methodology (pre-898_2.0).

The KRM documentation is available here:

OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384

Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

Keep in mind that Oracle updates this document as needed so that it reflects the most current feedback we receive from the

field. Therefore, the structure, headings, content, and length of this document are likely to vary with each posted version. To see

if the document has been updated since you last downloaded it, compare the date of your version to the date of the version

posted on My Oracle Support.

RELATED MATERIALS



We assume that our readers are experienced IT professionals, with a good understanding of JD Edwards EnterpriseOne. To

take full advantage of the information covered in this document, we recommend that you have a basic understanding of system

administration, basic Internet architecture, relational database concepts/SQL, and how to use Oracle JD Edwards applications.

This document is not intended to replace the documentation delivered with the CRM PeopleBooks. We recommend that before

you read this document, you read the PIA related information in the PeopleTools PeopleBooks to ensure that you have a well-

rounded understanding of our PIA technology. Note: Much of the information in this document will eventually be

incorporated into subsequent versions of the PeopleBooks.

Many of the fundamental concepts related to PIA are discussed in the following PeopleSoft PeopleBooks:

PeopleSoft Internet Architecture Administration (PeopleTools|Administration Tools|PeopleSoft Internet Architecture

Administration)

Application Designer (Development Tools|Application Designer)

Application Messaging (Integration Tools|Application Messaging)

PeopleCode (Development Tools|PeopleCode Reference)

Customers using tools release 8.98.3.0 or newer should also read KRM documentation for information on additional

troubleshooting techniques that are available to users of those releases as a supplement to the techniques described in this

document.

KRM Docs: OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384




Chapter 2 - Types of Kernel Problems

This document refers to several specific types of kernel issues that a customer may encounter. The most important categories of

kernel problems are explained below.

HUNG KERNEL WITH LOW CPU

Definition:

A hung kernel with low CPU refers to a kernel that has stopped functioning correctly but whose process continues to run with

very little CPU activity. Generally, this points to a root cause related to deadlock.

HUNG KERNEL WITH HIGH CPU

Definition:

A hung kernel with high CPU refers to a kernel that has stopped functioning correctly but whose process continues to run with

significant CPU activity. Generally, this points to a root cause related to an infinite loop.

ZOMBIE PROCESS / ZOMBIE KERNEL

Definition:

When an E1 server process crashes due to a programming error in some piece of code that it is running, the kernel stops

running from the perspective of the OS. The process is flagged as a zombie kernel within the E1 Enterprise Server, where some

of the process IPC data is saved in shared memory. The process is listed in Server Manager as a zombie process. There are

many potential causes of a zombie process, including but not limited to null or invalid pointer dereferences, heap memory

corruption, stack memory corruption, and race conditions.

OUT OF MEMORY KERNEL / MEMORY LEAK KERNEL

Definition:

An out of memory kernel is a kernel that has crashed because its memory footprint exceeded the maximum amount it is allowed

to allocate. Generally, this points to a memory leak or the caching of overly large quantities of data.



Chapter 3 - Kernel Error Troubleshooting Procedure

GENERAL TROUBLESHOOTING PHILOSOPHY

Oracle JD Edwards EnterpriseOne is a highly complex system with many interacting components. The remainder of this

chapter and the chapters that follow group similar problems together into a few broad categories and provide generalized

techniques to handle any problem in one of these categories. However, in many cases, a more specific troubleshooting

procedure may be necessary for a complex problem/issue.

Whenever a problem is encountered, the very first action on the part of the troubleshooter should be to examine any relevant

logfiles. Generally speaking, this means consulting jde_####.log, where #### is the Process ID (PID) of the relevant jdenet_k

and/or jdenet_n, and also jas.log. If there is a clear error message at or near the end of any of these logfiles, acting on that

message may be more efficient than following the procedure below.
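Since the most useful message is often at or near the end of a logfile, a small helper that shows only the last few lines can save time. The following is a minimal Python sketch; the function names and the default of 20 lines are choices made for this example.

```python
from collections import deque


def last_lines(lines, count=20):
    """Return the final `count` items of an iterable of lines."""
    return list(deque(lines, maxlen=count))


def tail_of_file(path, count=20):
    """Read a logfile and return its final `count` lines without newlines."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in last_lines(handle, count)]
```

For example, tail_of_file("jde_1234.log") would return the last 20 lines of the log for a kernel whose PID is 1234 (a made-up PID).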

Similarly, the procedure below is designed to guide a troubleshooter until he or she finds something that reveals the root cause

of the problem. If, at any point while following this procedure, the troubleshooter should find some clue to the root cause that

is too specific to be discussed below, he or she should go off-script and pursue that clue; if this search results in a dead-end,

the troubleshooter may resume the scripted procedure where he or she left off.

TROUBLESHOOTING PROCEDURE: IDENTIFY PRODUCT AREA OF PROBLEM

There are several types of issues that can cause an E1 User to receive a time-out message or a Web-Exception. The following

sections provide a question-and-answer decision tree to help identify the root cause of the problem.

First the E1 admin needs to determine whether the problem is an Interactive Problem, an Enterprise Server Problem, or a Batch

Problem.

INTERACTIVE PROBLEMS

General:

1) Did the user receive a Web Exception with the following message: "There was a problem with the server while running

business function"?

Yes Continue

No Go to Transaction Processing

2) Get the jas.log file.

a. Search within the jas logfile for the phrase Associated kernel not found. The message includes ####, the

process ID of the COBK.

b. Does the jas logfile contain the above phrase?

Yes Continue




3) Log in to SM and go to the Management Dashboard.

4) Select the Enterprise Server from the list of Managed Instances.

5) Select Runtime Metrics->Process Detail.

6) Does the process ID #### exist in the process detail list for the Enterprise Server?

Yes Continue

No Go to COBK Zombies:

7) From SM, does the process ID #### (COBK) have a status of zombie?

Yes Continue


8) Is the process ID #### (COBK) the only kernel with a status of zombie?

Yes Go to COBK Zombies:

No Go to Multiple COBK Zombies:

Transaction Processing:

1) Did the user receive a Transaction Rollback message?

Yes Go to Chapter 6 - Hung Kernels with Low CPU

No Go to High CPU

High CPU:

1) Determine how much CPU the COBK process is using. Platform specific instructions follow: (Note that, beginning in

Tools Release 8.98.2.0, this information is also available from Server Manager in the Runtime Metrics->Process Detail

page for the Enterprise Server.)

a. Windows

i. Launch Windows Task Manager. On the Performance tab, there is a graph showing overall CPU

activity.

ii. To see CPU activity specific to the COBK process, first select the Processes tab.

iii. Go to View->Select Columns and check the box for the PID column if it is not already enabled.

(The CPU Usage column should already be enabled, but if it is not, check that box as well.)

iv. Click OK, and when you return to the table of processes, click on the PID column to sort by that value.

v. Find the PID of the COBK, and check the value of the CPU Usage for that row.

b. AS/400 iSeries: From the terminal, type the command wrkactjob. This will show a table of processes running

on that machine. If you know the name of the specific library/subsystem, you may view relevant processes only

via the command wrkactjob sbs(<name>), where <name> is the appropriate library/subsystem.

c. Unix: SSH to the machine hosting the Enterprise Server and type the command top -p <PID>, where <PID>

is the Process ID (PID) of the COBK. Consult the %CPU column.

2) Is the COBK to which the user is connected using significant CPU?

Yes Go to Chapter 5 - Hung Kernels with High CPU

No Continue to Memory Leaks.

Memory Leaks:

1) Answer yes if any of the following are true:

The process's memory usage keeps increasing.

This can be observed by using any OS-supplied tool, such as Perfmon on Windows or Glance on HP-UX.

The process's amount of allocated memory is already extremely large.

An out-of-memory error has been observed.

Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

No Continue to Metadata Kernel
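To judge whether memory usage "keeps increasing", it can help to record the process's memory at regular intervals and compare the readings. The Python sketch below shows one possible test; the 80% threshold and the minimum of three samples are arbitrary assumptions for this example, not Oracle guidance.

```python
def keeps_increasing(samples_mb, min_fraction=0.8):
    """Return True if memory readings rise in most intervals and end higher.

    samples_mb: memory readings in chronological order (any consistent unit).
    min_fraction: share of intervals that must show growth (assumed value).
    """
    if len(samples_mb) < 3:
        return False  # too few readings to call it a trend
    steps = list(zip(samples_mb, samples_mb[1:]))
    rising = sum(1 for earlier, later in steps if later > earlier)
    return samples_mb[-1] > samples_mb[0] and rising / len(steps) >= min_fraction
```

A process that grows and then levels off will fail this test, which is the intended behavior: a steady plateau points to a large working set rather than continued growth.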

Metadata Kernel:

1) Are there any Metadata Zombie Kernels listed in Server Manager?

Yes Go to Chapter 4 - Zombie Kernels: Metadata Kernel

No Go to Chapter 4 - Zombie Kernels: Call Object Kernels (COBK)

ENTERPRISE SERVER PROBLEM / BATCH PROBLEM

1) Are there any outstanding requests for jdenet_k or jdenet_n from SM or NetWM? (If this is a UBE problem, or if this is a

multi-threaded kernel, answer no.)

Yes Go to Outstanding Requests.

No Continue

2) Are there one or more COBK / RUNBATCH zombies?

Yes Go to Chapter 4 - Zombie Kernels: COBK Zombies.

No Continue

3) Is the process using a significant amount of CPU?

Yes Go to Chapter 5 - Hung Kernels with High CPU

No Continue

4) Is the process's memory usage continuously and steadily increasing?

Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

No Continue

5) Is the process's memory usage constant but extremely large?

Yes Go to Chapter 7 - Out of Memory / Memory Leak Kernels

No Continue

6) Is the process otherwise hanging or not responding?

Yes Go to Chapter 6 - Hung Kernels with Low CPU

No Continue

7) It appears you have a very unusual issue. Contact Oracle GCS with as much information as is available. Especially make

sure to include any of the following that are available:

a) steps to reproduce the issue

b) jde_####.log for the kernel.

c) jde_####.log for the kernel's jdenet_n parent process.

d) jdedebug_####.log for the kernel

e) jdedebug_####.log for the kernel's jdenet_n parent process.

f) dumpfile, core file, or callstack

g) jas log

h) java logs for enterprise server
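Steps 2 through 7 above form a simple ordered decision tree. The following Python sketch restates that ordering; the function and parameter names are invented for this example, and the troubleshooter's own observations supply the answers.

```python
def triage_enterprise_server(has_zombies, high_cpu, memory_growing,
                             memory_very_large, unresponsive):
    """Return the next destination, checking the questions in order."""
    if has_zombies:                          # step 2
        return "Chapter 4 - Zombie Kernels"
    if high_cpu:                             # step 3
        return "Chapter 5 - Hung Kernels with High CPU"
    if memory_growing or memory_very_large:  # steps 4 and 5
        return "Chapter 7 - Out of Memory / Memory Leak Kernels"
    if unresponsive:                         # step 6
        return "Chapter 6 - Hung Kernels with Low CPU"
    return "Contact Oracle GCS"              # step 7
```

The order matters: a kernel that is both a zombie and was using high CPU is handled as a zombie first, because the questions are asked in sequence.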

Outstanding Requests

1) Is the number of processed requests increasing over time?

Yes The kernel is still processing requests, but it is unable to keep up with the rate at which new requests are coming

in, resulting in a backlog of queued operations. There may be a misconfiguration, or your hardware resources

may be insufficient to meet the demands of your userbase.

No Continue

2) Observe the trend in the number of outstanding requests over time. Is the number increasing, decreasing, or constant?

Return to Step 2 of Enterprise Server Problem above, but include this information if you end up contacting Oracle GCS.

BATCH PROBLEM

Refer to the corresponding Knowledge Experts or Documentation in Batch Area



Chapter 4 - Zombie Kernels

There are a myriad of programming errors that can cause a kernel to crash (resulting in a zombie kernel), including but not limited to null or invalid pointer dereferences, heap memory corruption, stack memory corruption, and race conditions. Furthermore, the crash may not occur until some time after the code containing the logic error executes.

The main focus of this chapter will be on localizing the crash to a specific business function (BSFN) containing the error. Once the BSFN has been identified, the code can be examined for any programming errors.

CALL OBJECT KERNELS (COBK)

Determining the cause of the zombie status:

COBK Zombies:

1) Open the log file for the COBK/UBE to which the user is connected.

Prior to tools release 8.98.3.0, this file will be named jde_####.log, where #### is either the Process ID (Windows and Unix) or the Job ID (iSeries) of the relevant COBK/UBE.

From tools release 8.98.3.0 onward you will be looking for a file with a name of the form jde_*_dmp.log. (This file is created when a kernel crashes, and * represents the PID of the kernel and the timestamp of the crash.)

2) Go to the end of the log file. Is there a call stack?

Yes Continue

No Go to JDENET_N Parent Process Log

3) Does the call stack show the BSFN?

Yes Continue

No Go to JDENET_N Parent Process Log

4) Can the issue be reproduced?

Yes Go to Reproducing the Issue.

No Continue to JDENET_N Parent Process Log

JDENET_N Parent Process Log

1) Obtain the jde_####.log where #### is the PID of the parent jdenet_n that spawned the zombie COBK/UBE. If you need instructions on finding the file, consult Obtaining the logfile for the Parent JDENET_N Process.

2) Search the logfile for the keywords zombie and died. (If there are no hits on either search term, try searching for the Process ID of the COBK/UBE.)

3) Is there a callstack associated with any of the search terms?

No Go to Getting an OS Core File.

Yes Continue

4) Does the call stack contain a BSFN?



No Go to Getting an OS Core File.

Yes Continue


No Go to Multiple COBK Zombies.

Yes Continue to Reproducing the Issue.

Reproducing the Issue

1) Turn on dynamic debugging before reproducing the issue.

2) Can the issue be reproduced with debugging turned on?

No Go to Check Tools Release

Yes Continue

3) Go ahead and reproduce the problem with debugging on.

4) Open the resulting debug logfile (jdedebug_####.log) and scroll to the end of the file.

5) Search upwards for the string BSFNLevel. This should tell you the last BSFN to run before the kernel crashed. Continue to Trouble with a Specific BSFN.
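Searching upwards from the end of a large debug log can also be done programmatically. The Python sketch below finds the last line that contains a given string. The sample lines in the usage note are invented and do not reflect the real jdedebug log format.

```python
def last_line_containing(lines, needle="BSFNLevel"):
    """Return (line_number, text) of the last line containing `needle`.

    Line numbers start at 1. Returns None when no line matches.
    """
    for number in range(len(lines), 0, -1):  # walk from the end upwards
        if needle in lines[number - 1]:
            return number, lines[number - 1]
    return None
```

For example, given the made-up lines ["start", "x BSFNLevel a", "y BSFNLevel b", "end"], the function returns (3, "y BSFNLevel b").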

Trouble with a Specific BSFN

1) Is this a customized BSFN?

Yes Go to Trouble Involving a Customized BSFN

No Continue

2) Is there an ESU for this BSFN?

Yes Apply the ESU. Generally, this will resolve the issue. If it persists go to Contacting Oracle GCS

No Go to Contacting Oracle GCS

Trouble Involving a Customized BSFN

1) Is it possible to try replacing the BSFN with the original code from the release?

Yes Continue.

No Consult with the developers who customized the BSFN for your purposes.

2) Try replacing the BSFN with the original code from the release. Does the problem disappear?

Yes Consult with the developers who customized the BSFN for your purposes.

No Continue

3) Is there an ESU for this BSFN?

Yes Continue


4) When the ESU is applied, does the problem go away?



Yes You will need to merge the changes you made to the original BSFN into the version of the BSFN supplied by the ESU.


Contacting Oracle GCS

1) Contact Oracle GCS with as much information as is available. Especially make sure to include any of the following that are available:

a) the name of the BSFN

b) whether the BSFN is customized

c) whether there are any ESUs for the BSFN

d) what tools release is in use

e) steps to reproduce the issue

f) jde_####.log for the kernel.

g) jde_####.log for the kernel's jdenet_n parent process.

h) jdedebug_####.log for the kernel

i) jdedebug_####.log for the kernel's jdenet_n parent process.

j) dumpfile, core file, or callstack

k) jas log

l) java logs for enterprise server

Multiple COBK Zombies:

1) Open all of the jde_####.log files for all jdenet_n parent processes. There are two ways to do this:

a) Option 1: If you have easy access to the machine hosting the Enterprise Server.

i) On the hosting machine, navigate to the log folder for your Enterprise Server.

ii) Grep (search within the text of these files) for the strings zombie and died.

iii) Open up any files that contain either of these expressions.

b) Option 2: If you have easy access to the Server Manager for your Enterprise Server.

i) Log in to SM and go to the Management Dashboard.

ii) Select the Enterprise Server from the list of Managed Instances.

iii) Select Runtime Metrics->Process Detail.

iv) Sort by Process Name.

v) For any jdenet_n (Network Listener) processes, click the link in the JDELOG File Size column for that row to view the logfile.

2) In each jde_####.log for a jdenet_n, locate the Business Functions (BSFN) call stack.

3) Is there a pattern that one BSFN stands out more than the others in the call stack?



Yes Continue

No Go to Consult the OS Core File


Yes Go to Reproducing the Issue

No Go to Consult the OS Core File

Check Tools Release

1) Is the customer on a supported release?

Yes Continue

No The customer should upgrade to a supported release or provide a compelling reason why this is not possible.

2) Is the customer on the current release?

Yes Skip to step 4.

No Continue

3) Can the customer upgrade to the current release?

Yes The customer should upgrade to the current release and see if the problem is resolved. If the problem persists, then continue.

No Continue

4) Is there a Solution Document or announcements document in the My Oracle Support Knowledge Base for the customer's issue?

Yes Follow the instructions in the document for resolving the issue.

No Go to Contacting Oracle GCS.

Obtaining the Logfile for the Parent JDENET_N Process.

1) If a COBK kernel has crashed, and there is no useful information in its log, there may be helpful information in the logfile for the parent JDENET_N process. This section will provide instructions on obtaining the file.

2) Log in to Server Manager and go to the Management Dashboard.

3) Select your Enterprise Server from the list of Managed Instances.

4) Select Runtime Metrics->Process Detail.

5) Is the zombie COBK listed?

Yes Continue

No The list of zombies has already been cleared. Skip to step #10

6) Click the name (CALL OBJECT KERNEL) of the COBK that has crashed (the zombie COBK).

7) Under General Information, find Parent Process ID. Is the Parent PID non-zero?

Yes Continue



No Skip to step #10

8) Return to the Runtime Metrics->Process Detail page, and find the JDENET_N process whose PID matches the Parent PID. Click on the size of its log file (the entry under JDELOG File Size for that row) to view the logfile.

9) Return to JDENET_N Parent Process Log.

10) If there is more than one JDENET_N, you will have to find all JDENET_N logfiles and grep (search within the text of these files) for the PID of the zombie COBK to determine the appropriate logfile.

If you have access to the machine hosting the Enterprise Server, the easiest way to do this is to connect to that machine, navigate to the log folder for the Enterprise Server, and search within jde_*.log

Alternatively, the JDENET_N logfiles can be accessed one-at-a-time from the Runtime Metrics->Process Detail page of Server Manager by clicking on the JDELOG File Size for each process that is a Network Listener.

11) Once you have identified the correct logfile, return to JDENET_N Parent Process Log.
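Where grep is not available (for example, on a Windows host), the search across jde_*.log files can be approximated with a short script. The Python sketch below is one way to do it; the function names are invented for this example, and the search terms can be zombie and died or the PID of the zombie COBK.

```python
import glob
import os


def matching_lines(lines, terms):
    """Yield (line_number, line) for lines containing any term, ignoring case."""
    lowered = [term.lower() for term in terms]
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(term in text for term in lowered):
            yield number, line.rstrip("\n")


def search_logs(log_dir, terms, pattern="jde_*.log"):
    """Return {file path: [(line_number, line), ...]} for files with hits."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path, errors="replace") as handle:
            found = list(matching_lines(handle, terms))
        if found:
            hits[path] = found
    return hits
```

Note that this is a plain substring search, so searching for a short PID such as 123 will also match 1234; the hits should be checked by eye.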

Consult the OS Core File

If it has proven impossible to obtain a (useful) callstack from any of the EnterpriseOne log files, it may still be possible to obtain a callstack from an OS-generated core file. If you are unfamiliar with generating and working with OS core dumps on your platform, information on doing so is available in Appendix C Getting and Using an OS Core File.

Once you have examined the callstack, if you can determine which BSFN was running at the time of the crash, go to Trouble with a Specific BSFN above.

If you cannot isolate a specific BSFN, you should consult Oracle GCS.

METADATA KERNEL

There are historical issues that exist with Metadata Kernel, particularly in terms of out-of-memory errors and UBE-not-processing errors. It is believed that these issues were all resolved by Tools Release 8.98.2.0.

If a customer is experiencing crashes of the Metadata Kernel, the customer should attempt to upgrade to a newer tools release.

If the customer is already running a recent release, or an upgrade is not practical, the customer should contact Oracle GCS. It will be helpful to Oracle GCS to have:

Any available logfiles for the kernel,

Steps to reproduce the issue,

A copy of the Java heap dump (see Enabling a Java Heap Dump).

Enabling a Java Heap Dump

The procedure for enabling a Java heap dump is specific to the JDK and OS in use. Because better methods are introduced frequently, it is best to contact the Kernel Support or Dev SMEs for the latest means of creating a Java heap dump.



Chapter 5 - Hung Kernels with High CPU

A non-responsive kernel with high CPU has not crashed per se. While the kernel is no longer performing its required duties, code continues to execute, most likely in some form of infinite loop. The first step in resolving this issue is to identify where in the code execution is taking place.

One can determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack and cstack. These tools are discussed in Appendix D OS Tools for Obtaining a Call Stack from Running Code. Note that customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

It is important to note that, while a high-CPU hung kernel is most likely engaged in some sort of infinite loop, that loop will generally not be contained in the innermost executing function of the callstack you obtain. Rather, the innermost functions are likely to be called from within the infinite loop. Therefore, it is necessary to repeat the process of obtaining a callstack several (five to ten) times. The outermost entries in the callstack will remain the same across all the callstacks collected, while the innermost entries will vary. The infinite loop most likely resides at the level of the innermost function that is common to all of the collected callstacks.
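The comparison of several callstacks can be sketched in code. The Python example below assumes each callstack has already been reduced to a list of function names ordered from outermost to innermost; the function names in the usage note are invented for illustration.

```python
def innermost_common_frame(callstacks):
    """Return the deepest frame shared by every callstack, or None.

    Each callstack is a list of function names, outermost first. Frames are
    compared position by position until the stacks first differ.
    """
    if not callstacks:
        return None
    common = None
    for frames in zip(*callstacks):  # stops at the shortest stack
        if any(frame != frames[0] for frame in frames):
            break
        common = frames[0]
    return common
```

For example, for the three made-up stacks ["main", "dispatch", "processOrders", "readRow"], ["main", "dispatch", "processOrders", "formatDate", "copyString"], and ["main", "dispatch", "processOrders", "readRow", "fetch"], the result is "processOrders", which is where the search for the loop would begin. Many OS tools print the innermost frame first, in which case each list should be reversed before it is passed in.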



Chapter 6 - Hung Kernels with Low CPU

IS A PACKAGE DEPLOYMENT CURRENTLY UNDERWAY?

When a package is currently being deployed to the Enterprise Server, the kernels temporarily suspend normal operation,

mimicking the behavior of a hung kernel with low CPU usage. Generally, package deployments are fairly quick to complete,

but under certain circumstances, deployments can require extended time. Once the package deployment completes or times out,

normal kernel operations will resume.

If a package deployment is not underway, proceed to the next section.

TROUBLESHOOTING LOW-CPU HUNG KERNELS

Similar to a hung kernel with high-CPU, a non-responsive kernel with low-CPU has also not crashed in the traditional sense.

Although the kernel is no longer performing its required duties, code continues to execute, most likely in some form of

deadlock.

A program is said to be in deadlock when two or more operations are each waiting for the other to finish, creating a situation in

which neither operation ever completes and both wait forever. Though not technically deadlock, a situation with similar

symptoms can arise when a single operation is waiting to obtain a lock on a resource, but that lock was not properly released

when a previous operation finished using the resource.

While UBE kernels are not multi-threaded, it is important to note that they are not immune from deadlock. Two separate UBEs

executing simultaneously (or, more likely, the same UBE being executed multiple times simultaneously) can compete for locks

on shared resources and end up in deadlock.

As in the previous chapter, the first step in resolving this issue is to identify where in the code the execution is. One can

determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal

error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack

and cstack. The tools are discussed in Appendix D OS Tools for Obtaining a Call Stack from Running Code. Note that

customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

After obtaining a call stack for all low-CPU hung kernels, the troubleshooter should examine the executing code to identify

what resource locks are currently held and what locks are pending. The troubleshooter should then study the remainder of the

code to determine where else these locks are obtained / released, and where the logical flaw resides.
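Once the held and pending locks have been written down for each hung process, checking for a circular wait is mechanical. The Python sketch below assumes the troubleshooter has recorded that information by hand; the data layout and the process and lock names are invented for this example and are not produced by any EnterpriseOne tool.

```python
def find_deadlock(holds, waits_for):
    """Return a list of processes that wait on each other in a cycle, or None.

    holds: {process: set of locks it currently holds}
    waits_for: {process: the single lock it is waiting to obtain}
    """
    owner = {lock: proc for proc, locks in holds.items() for lock in locks}
    for start in waits_for:
        chain = [start]
        current = start
        while True:
            holder = owner.get(waits_for.get(current))
            if holder is None:
                break  # not waiting, or waiting on a lock nobody holds
            if holder in chain:
                return chain[chain.index(holder):]  # circular wait found
            chain.append(holder)
            current = holder
    return None
```

For example, find_deadlock({"A": {"L1"}, "B": {"L2"}}, {"A": "L2", "B": "L1"}) returns ["A", "B"]. A result of None while a process is still waiting suggests the other situation described above: a lock that no running process holds because it was never properly released.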



Chapter 7 - Out of Memory / Memory Leak Kernels

MEMORY LEAKS

Generally speaking, a kernel suffering from a memory leak is discovered after it has crashed. The kernel crashes when a

memory allocation attempt fails because the process has reached its maximum allowed memory.[1] Sometimes examining the

callstack at the time of the crash can indicate where this failed memory allocation occurred, but that may or may not provide

useful information. Often, the failed memory allocation is merely the unrelated victim of a programming error elsewhere in the

code that prevents no-longer-needed memory from being recycled.

OVERLY-AGGRESSIVE CACHING

An out-of-memory error does not necessarily imply the existence of a memory leak per se. Misuse of the JDB cache is a

common source of out-of-memory errors. The JDB cache can be used to store the result of a frequent database query in

memory for improved performance. However, if the cache is used too liberally with large tables, free memory will fill up with

JDB cache entries.

Overly-aggressive caching can be an issue with call object kernels, but it more often causes problems in batch jobs, simply due

to the much higher volume of data batch jobs generally manipulate. If an out of memory error is encountered, the

troubleshooter should investigate what information is being stored in the JDB cache and verify that no unreasonably large

queries are being cached.

There are two ways that a query result may be stored in the JDB cache.

1. If the table over which the query is made has been registered in the F98613 table, then the query result will be placed

in the JDB cache. To check which tables' queries are being cached through this method, examine the F98613 table.

2. A BSFN can use the JDB_AddTableToDBCache API to have a table's query results added to the cache. To check

whether this has happened, debug logging must be enabled, and the debug log should be searched for the messages of

the form: Entering JDB_AddTableToDBCache (Table =)
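The debug-log search described in item 2 can be scripted. The following is a minimal sketch, not an Oracle-supplied tool: the function name list_cached_tables is hypothetical, and it assumes each log entry carries the table name inside the parentheses, for example "Entering JDB_AddTableToDBCache (Table = F0010)".

```shell
#!/bin/sh
# list_cached_tables LOGFILE
# Print "TABLE COUNT" for every table named in a JDB_AddTableToDBCache
# entry, most frequently cached first.
# ASSUMPTION: each entry looks like
#   ... Entering JDB_AddTableToDBCache (Table = F0010)
list_cached_tables() {
    grep 'Entering JDB_AddTableToDBCache' "$1" |
        sed -n 's/.*(Table *= *\([^) ]*\) *).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}
```

Running list_cached_tables against a kernel debug log gives a quick list of tables to review; any table holding business data that appears in the output deserves a closer look.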

Small, unchanging tables such as company constants are prime candidates for caching in the JDB cache. Except in very

unusual circumstances, tables containing business data should never be cached.

TROUBLESHOOTING OUT-OF-MEMORY ISSUES

If an out-of-memory error does not appear to be related to overly-aggressive caching, the best way to troubleshoot a kernel that

is running out of memory is to recreate the issue while using a memory profiling tool such as Purify, Valgrind, or Pex.

(Customers using tools release 8.98.3.0 and beyond have the additional options of using BMD or Jade.) Memory profiling

tools such as these will show the user what memory has been allocated and never been freed (reclaimed).

1 Even when there is plentiful total free memory, an attempt to allocate a large block of memory will still fail if there is no

adequately large block of contiguous free memory



It is important to note that using any of the above profiling tools will incur a heavy performance penalty. If it is at all possible,

this should be done on a non-production server.



Appendix A Validation and Feedback

This section documents the real-world validation that this document has received.

CUSTOMER VALIDATION

Oracle is working with PeopleSoft customers to get feedback and validation on this document. Lessons learned from these

customer experiences will be posted here.

FIELD VALIDATION

Oracle Consulting has provided feedback and validation on this document. Additional lessons learned from field experience

will be posted here.



Appendix B Glossary

Term Definition

BSFN Business Function

COBK Call Object Kernel

E1 Oracle JD Edwards EnterpriseOne

ESU Electronic Software Update

GCS Global Customer Support

MDK Metadata Kernel

PID Process Identifier (Process ID)

SAR Software Action Request

SM Server Manager

NetWM Network Work Management standalone utility shipped with Enterprise Server

that shows queues, outstanding requests, etc.

Callstack A list of currently executing functions organized hierarchically to show parent

(caller) to child (callee) relationships

UBE Universal Batch Engine

OS Operating System

Infinite Loop A program is said to be in an infinite loop when it continues to execute the same

section of code repeatedly forever.

Deadlock A program is said to be in deadlock when two or more operations are each

waiting for the other to finish, creating a situation where neither operation ever

completes and both wait forever. While not technically deadlock, a situation with

similar symptoms can arise when a single operation is waiting to obtain a lock on

a resource and that lock was not properly released when a previous operation

finished with the resource.

Management Dashboard The entry page to Server Manager (SM). The page has the title Managed

Homes and Managed Instances and can be reached by clicking a link in the

upper left corner of most SM pages.



Appendix C Getting and Using an OS Core File

In Tools Release 8.98.3.0, several new features were added to streamline the debugging of kernel issues. This document is

primarily intended for users of Tools Releases in the 8.98.2 family and earlier. Users of Tools Release 8.98.3 and beyond will

find a simpler, platform-independent set of instructions in the KRM documentation, which is available here:

OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384


This chapter provides instructions for obtaining a call stack and a dump file on the following platforms:

Windows Server

AS400 - iSeries

UNIX

WINDOWS

Prerequisite: This is for the Windows platform only.

1) The machine should have Debugging Tools for Windows installed. If it is not installed, please download and install it from the

following URL:

http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

Note: The above package installs WinDbg. Please note the path of windbg.exe, because it will be used to capture the crash dump.

2) Have the customer download this version:

Current Release version 6.11.1.402 - February 6, 2009

Install 32-bit version 6.6.7.5 [15.2 MB]

Steps to install UserDump:

1. Download Site (version 8.1)

http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF69-28AC08570B7E&displaylang=en&displaylang=en

a) Click Download



b) Click Run

c) After the download completes, a new folder, C:\kktools\userdump8.1, is created.

2. Setup

http://support.microsoft.com/kb/241215

a) In C:\kktools\userdump8.1\x86, click setup.exe

b) A folder C:\WINDOWS\system32\kktools will be created after the setup.

3. Capturing E1 COBK

a) Go to Control Panel->Process Dumper



b) Click New



c) Enter jdenet_k.exe and click OK



d) Click Rules:



e) Select Use custom rules

- Point the Dump file folder to a folder that is easily accessible. Make sure the folder exists.

- Keep all other settings as shown.

- Check Kill process after dumping

- Click OK



f) Optional: (unless instructed)

1) Check All Exceptions OR

2) Select specific exceptions

i) Access violation

ii) Array bounds exceeded

iii) Stack Overflow

iv) Invalid handle

v) Overflow

vi) Stack Check

g) Click Apply or OK



Getting Page Heap: (Optional)

http://support.microsoft.com/kb/267802

1. From the command line, go to the drive where the Debugging Tools for Windows folder is installed.

2. From the command line:

>gflags /p /enable runbatch.exe /full

/full = full page heap. This will use a lot of memory and resources.

3. Targeting specific DLLs

>gflags /p /enable jdenet_k.exe /dlls callbsfn.dll cruntime.dll

4. From the GUI interface of GFLAGS.

a) Go to Start -> All Programs

b) Debugging Tools for Windows -> Global Flags

c) Click on Image File tab page

d) Enter an executable name and TAB OUT - DO NOT HIT ENTER



- check the options as seen

e) To remove the settings, follow steps 4a through 4d, but uncheck all options.

AS400 ISERIES

When a C2M1211 or C2M1212 message is generated from a single-level store heap routine, the code checks for a *DTAARA

named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists, the program stack is dumped. If the data area does not

exist, no dump is performed.

Set up a data area to capture the call stack for the C2M1212 heap error message:



CRTDTAARA DTAARA(QGPL/QC2M1212) TYPE(*CHAR) LEN(1)

Set up a data area to capture the call stack for the C2M1211 heap error message in the same way, using the data area name

QGPL/QC2M1211. On V5R4, the C2M1211 data area requires PTFs SI27412 and SI28640.


Once the data area is in place, a spool file named QPRINT is created with dump information for every C2M1211 or C2M1212

message. This spool file can be read to determine which tools, application, or OS API is causing the memory overwrite; IBM may

also be able to interpret it.

The spool file is created for the user running the job that gets the message. For example, if the job getting the C2M1211

message or C2M1212 message is a server job or batch job running under userid ABC123, then the spool file is created in the

output queue for userid ABC123. Once the spool files containing stack tracebacks are obtained, the data area can be removed,

and the tracebacks analyzed.

To disable the dumps, delete the data area(s).

For further information, see "Diagnosing and Debugging Memory Problems: C2M1211 and C2M1212 Messages" on the

IBM website.

When a C2M1211 message or C2M1212 message is generated from a teraspace heap routine, the code checks for a *DTAARA

named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists and contains at least 50 characters of data, a 50

character string is retrieved from the data area. If the string within the data area matches one of the following strings, special

behavior is triggered.

_C_TS_dump_stack

_C_TS_dump_stack_vfy_heap

_C_TS_dump_stack_vfy_heap_wabort

_C_TS_dump_stack_vfy_heap_wsleep

If the data area does not exist, no dump or heap verification is performed. For further information, see

"Enablement for teraspace heap memory managers" on the IBM website.

The following example shows the data area value that causes _C_TS_malloc_debug to be called to verify the heap whenever a

C2M1211 message or C2M1212 message is generated.

On IBM i 6.1 (with PTF SI33945) and IBM i 7.1, the following value can be placed in the data area:


VALUE('_C_TS_dump_stack_vfy_heap_wabort')



This re-validates the heap and, if memory corruption is detected, aborts the job.



Caution: Use this only in a test environment, because it can generate a large number of errors and exceptions, and with the

abort option more zombie processes will appear.

UNIX

1) In the JDE.INI config file, under the [JDENET] section, set the following: HandleKrnlSignals=0 and krnlCoreDump=1.

This will cause a core file to be dumped, provided the operating system allows it.

2) If the Oracle client is being used to connect to an Oracle database, log in as the oracle userid that owns the Oracle Client

install. Add the following line to the $ORACLE_HOME/network/admin/sqlnet.ora file:

DIAG_SIGHANDLER_ENABLED=false

3) Next, you must ensure that the operating system allows the creation of core files.

a) On the command line type the command: ulimit -c. This will show the current maximum size for core files.

b) If the size is 0 (or very small), then no core file will be created.

c) To change the size for the core file, on the command line, type: ulimit -c <size>, where <size> is the new maximum core file size. Most shells interpret this value in blocks rather than bytes; the keyword unlimited removes the limit.

d) Confirm the ulimit change by rerunning ulimit -c on the command line. If the value from step c above is not displayed, the hard limit may need to be raised by the root user. Changes to the /etc/security/limits

e) If E1 Enterprise Server services are to be started from the command line using RunOneWorld.sh, start the E1

Enterprise Server services from a login session where ulimit -c <size> was run. The ulimit command has to be run

for each new login session on the server that is used to run the RunOneWorld.sh script. If the E1 Enterprise Server

needs to be stopped and restarted often, adding the ulimit -c <size> command to the bottom of the

$SYSTEM/bin32/toolsenv.sh script will ensure the ulimit command is run each time a new login session is opened.

f) If the E1 Enterprise Server is to be stopped and restarted remotely via Server Manager, the Server Manager client on

the Enterprise Server must be restarted from a login session where ulimit -c <size> has been run. Run the ulimit

command, then go to the jde_home/bin directory and run the command: restartAgent

g) Test that core files are being created properly by selecting the PID of a jdenet_k process and running the following command:

kill -15 <PID>

This should generate a core file.

4) When a core file is generated, it is written to the $SYSTEM/bin32 directory under the same name every time, unless the

operating system is actively managing core file names and locations. The server may already be configured to put all core files in a

central location. If so, the server may be reconfigured, or the core files can be copied to the $SYSTEM/bin32 directory to

be read. The following options generate core files with unique names:

a) On Sun Solaris, put the coreadm command in the user profile:

coreadm -p core.%f.%p $$

The above command will generate the core file with a name in the following format:

core.<executable name>.<PID>

b) On Linux, log in as root and edit the /etc/sysctl.conf file and add the following line:

kernel.core_uses_pid = 1

Any time the /etc/sysctl.conf file is changed, the root user must run the following command to make the change effective

immediately:

sysctl -p

Once this is run, every new login session will get the new settings. Stop and restart E1 following the directions in step 3e or 3f.

c) If no other core naming options are available, create a script to detect the core file and rename it, as in the

rename_core sample that follows. Run the script from the $SYSTEM/bin32 directory in the background with nohup using this command:

nohup rename_core &



rename_core script sample

#!/bin/ksh

# This script just hangs around waiting for a core file to appear, and if
# one does, renames it to a name based on the current date and time.

while true
do
    sleep 30
    # Only act when a file named "core" exists in the current directory.
    if [ -f core ]
    then
        cname="core.$(date +%Y%m%d%H%M%S)"
        echo "renaming core to $cname"
        mv core "$cname"
    fi
done

5) Once the core files are captured, the core files must be opened at the customer site to get the call stack.

6) Which platform is the customer using?

HP LINUX AIX SUN
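Before moving on to the platform-specific steps, the settings from steps 1 to 3 can be sanity-checked from the shell. The following is a minimal sketch under stated assumptions: the helper names ini_get, sighandler_disabled, and core_limit_status are hypothetical, JDE.INI is assumed to use plain [SECTION] and key=value lines, and the file paths are supplied by the caller.

```shell
#!/bin/sh
# ini_get FILE SECTION KEY
# Print the value of KEY inside [SECTION] of an INI-style file.
ini_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[/ {                      # a [SECTION] header line
            name = $0
            gsub(/^[ \t]*\[|\][ \t\r]*$/, "", name)
            in_section = (name == section)
            next
        }
        in_section {
            eq = index($0, "=")
            if (eq == 0) next              # not a key=value line
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t\r]+$/, "", k)
            gsub(/^[ \t]+|[ \t\r]+$/, "", v)
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# sighandler_disabled SQLNET_ORA
# Succeed if the file contains DIAG_SIGHANDLER_ENABLED=false.
sighandler_disabled() {
    grep -Eiq '^[[:space:]]*DIAG_SIGHANDLER_ENABLED[[:space:]]*=[[:space:]]*false' "$1"
}

# core_limit_status VALUE
# Classify the output of "ulimit -c".
core_limit_status() {
    case "$1" in
        unlimited)    echo "ok" ;;
        0)            echo "disabled" ;;
        ''|*[!0-9]*)  echo "unknown" ;;
        *)            echo "limited" ;;
    esac
}
```

For example, ini_get /path/to/JDE.INI JDENET krnlCoreDump should print 1 on a correctly configured server, and core_limit_status "$(ulimit -c)" should print anything other than disabled in the session that starts the E1 services. A result of limited still deserves a manual look, since a small limit can prevent a usable core file.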

HP

1) Do you know which executable created the core file? Yes No

2) On the command line type:

file <core file name>

3) The above command will give you the executable name to be used in Get HP Callstack (#4).

Get HP Callstack

4) Getting the callstack

Command line:

>gdb <executable> <core file name>

Example:

>gdb jdenet_k core.xxxx.12345

Once the core file is open, do the following

>info thread This will give you a list of threads that were created within jdenet_k process.

>thread # Open thread number

>where List the callstack within that thread #

>quit Exit gdb



LINUX

Linux core files generally must be read on the same server they were created. Displaying the core file on a different server can

produce incorrect output.



file <core file name>

3) The above command will give you the executable name to be used in Get Linux Callstack (#4).

Get Linux Callstack


Command line:

>gdb <executable> <core file name>

Example:

>gdb jdenet_k core.12345

Once the core file is open, do the following

>info thread This will give you a list of threads that were created within jdenet_k process.

>thread # Open thread number

>where List the callstack within that thread #

>quit Exit gdb

There is some optional information that can be collected along with the stack:

show charset Show the effective character set when the process crashed.

show environment Show the environment variables when the process crashed.
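The interactive gdb session above can also be run unattended, which helps when many core files must be processed. The following is a minimal sketch, assuming a gdb version that supports the -batch and -ex options; the function name dump_core_stacks and the output file argument are hypothetical.

```shell
#!/bin/sh
# dump_core_stacks EXECUTABLE COREFILE OUTFILE
# Write the thread list, every thread's call stack, and the optional
# charset and environment details to OUTFILE, with no interactive prompt.
dump_core_stacks() {
    exe=$1
    core=$2
    out=$3
    # Refuse to start gdb if either input file cannot be read.
    if [ ! -r "$exe" ] || [ ! -r "$core" ]; then
        echo "usage: dump_core_stacks EXECUTABLE COREFILE OUTFILE" >&2
        return 2
    fi
    gdb -batch \
        -ex "info threads" \
        -ex "thread apply all where" \
        -ex "show charset" \
        -ex "show environment" \
        "$exe" "$core" > "$out" 2>&1
}
```

A call such as dump_core_stacks jdenet_k core.12345 core.12345.txt, run from the directory that contains both files, produces one text file per core that can be attached to a support case.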

AIX



file <core file name>

3) The above command will give you the executable name to be used in Get AIX Callstack (#4).

Get AIX Callstack


Command Line:

dbx prog

This will bring up the dbx prompt; the user may have to press the Enter or Return key several times.

>where List the callstack



SUN

1) Simply type the following in the command line:

Command Line:

pstack <core file name>

This will list the callstack



Appendix D OS Tools for Obtaining a Call Stack from Running Code

The following procstack / pstack commands are to be used when a process is either hung or running with high CPU usage. Note

that they are intended for systems on tools releases earlier than 8.98.3; in 8.98.3 and beyond, the same call stacks can be obtained

from CPU Diagnostics in Server Manager.

Caution: This document may contain information, software, products, services which are not supported by Oracle Support

Services and are being provided as is without warranty. Please refer to the following site for My Oracle Support Terms of

use: https://support.oracle.com/CSP/ui/TermsOfUse.html

UNIX

Run the following command for the relevant UNIX platform, passing the process ID of the target process, to dump call stacks:

HP-UX: /usr/ccs/bin/pstack

AIX: /usr/bin/procstack

SUN: /usr/bin/pstack

LINUX: /usr/bin/pstack

More information on procstack can be found in the IBM documentation for the procstack command.
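The platform list above can be wrapped in a small helper so the same script works on every UNIX server. The following is a minimal sketch: the function names stack_tool_for and dump_live_stack are hypothetical, and the platform is identified by the value that uname -s reports.

```shell
#!/bin/sh
# stack_tool_for PLATFORM
# Print the call stack tool path listed above for a "uname -s" value.
stack_tool_for() {
    case "$1" in
        HP-UX) echo /usr/ccs/bin/pstack ;;
        AIX)   echo /usr/bin/procstack ;;
        SunOS) echo /usr/bin/pstack ;;
        Linux) echo /usr/bin/pstack ;;
        *)     return 1 ;;          # unknown platform
    esac
}

# dump_live_stack PID
# Run the tool for the current platform against a running process.
dump_live_stack() {
    tool=$(stack_tool_for "$(uname -s)") || {
        echo "no call stack tool known for this platform" >&2
        return 1
    }
    "$tool" "$1"
}
```

For a hung kernel, taking two or three dumps a few seconds apart with dump_live_stack and comparing them shows whether the process is stuck at one point or still moving.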

WINDOWS

Use the ADPlus tool to collect call stack information on the Windows platform. For more information on how to use the tool,

see the Microsoft article How to use ADPlus to troubleshoot "hangs" and "crashes".

AS400

The process below can be used to retrieve the program stack for a job with a single thread or the first thread of a multithreaded

job.

cmd: ADDLIBLE E900SYS

cmd: SAW | Option 2 Work with Server Processes | Option 3 Display OneWorld Processes



The following creates a spool file containing the program stack (call stack):

cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OUTPUT(*PRINT) OPTION(*PGMSTK)

When run interactively without OUTPUT(*PRINT), the same command shows the program stack on the display instead:

cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OPTION(*PGMSTK)



1. Create a library and output queue to move the previously generated spool file items.

cmd: CRTLIB JDETEMP

cmd: CRTOUTQ JDETEMP/JDETEMP

2. Copy the items found in output queue WRKOUTQ JDETEMP/JDETEMP via iSeries Navigator to a local Windows folder.

a. Expand the host name node and log in to the system. Expand the Basic Operations node. Right-click

Printer Output, highlight Customize this View, and select Include.



Change the Users value to All. Type JDETEMP/JDETEMP in the Output queues field.



b. Highlight all of the spool files found in the right-hand window pane. Press Ctrl-C (to copy) and paste these

files into a local Windows Explorer folder, e.g. SND2DENVER.
