
PARALLEL CONCURRENT PROCESSING FAILOVER AND LOAD BALANCING OF E-BUSINESS SUITE RELEASE 11I AND RELEASE 12 Mike Swing, TruTek

Abstract
Parallel Concurrent Processing Failover uses two mechanisms to detect a failure: dead connection detection, and detection of a failure by the process monitor for the Concurrent Managers, known as PMON (not to be confused with the database PMON process), which was introduced with Patch 6495206. Load balancing of the Concurrent Managers is critical if you expect parallel concurrent processing to function after a failover to the remaining node(s).

This paper reviews Concurrent Manager basics before we discuss the topics of failover and load balancing. One of the key components used by Concurrent Processing is Generic Service Management. The use of GSM with multiple nodes and seeded GSM services is discussed. Administering Concurrent Managers, managing control across nodes, starting and stopping the Concurrent Managers, and managing concurrent log files are skills needed to understand the configuration of Parallel Concurrent Processing failover and load balancing.

There are a number of ways that an E-Business Suite environment might be configured for failover:

• Database Failover

• Fast Connection Failover (FCF)

• Transparent Application Failover (TAF)

• Parallel Concurrent Processing Failover

• Concurrent Manager Failover

This paper will discuss Parallel Concurrent Processing Failover, ICM Failover, CRM Failover, and Concurrent Manager Failover. We’ll leave the discussion of Database Failover, Fast Connection Failover and Transparent Application Failover for another time.

The paper concludes with a discussion of load balancing and the issues that must be considered to properly configure an E-Business Suite environment to take advantage of Oracle’s load balancing features.

Concurrent Processing
Most user interactions with Oracle Applications data are conducted via the HTML interface or the Forms interface. However, reporting and interface programs may need to run periodically or on an ad hoc basis. Because these programs can be computationally intensive, they are run in the background, at a time and with a priority chosen so that the work of interactive users is not impeded. Such programs run on the Concurrent Processing server under Concurrent Managers. When a request to run a Concurrent Program is submitted through an Oracle Applications form or through Oracle Applications Manager (OAM), a row is inserted into the FND_CONCURRENT_REQUESTS table specifying the program to be run. Concurrent Managers read the requests from the table and start the appropriate Concurrent Programs. The Concurrent Processing Server:

• Allows scheduling of batch jobs called Concurrent Requests.
• Processes Concurrent Programs as a Concurrent Request.
• Requests can be grouped together into Request Sets.
• Different types of Concurrent Managers handle different types of requests.
• A Concurrent Program can be assigned to a responsibility, and that responsibility can be assigned to users, allowing them permission to run the Concurrent Program.
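As a rough illustration of how submitted requests can be examined directly (a sketch only; the table and column names below are from the standard FND schema, but verify them against your own release before relying on this), a query such as the following lists requests that are still in the Pending phase:

SELECT r.request_id,
       p.user_concurrent_program_name,
       r.phase_code,
       r.status_code,
       r.requested_start_date
  FROM fnd_concurrent_requests r,
       fnd_concurrent_programs_tl p
 WHERE r.concurrent_program_id  = p.concurrent_program_id
   AND r.program_application_id = p.application_id
   AND p.language               = 'US'
   AND r.phase_code             = 'P'   -- 'P' = Pending
 ORDER BY r.requested_start_date;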


• Concurrent Managers may have limits on the Concurrent Programs that can be run, and the times that they can be started.
• Concurrent Requests have priorities, statuses, and log and out files in $APPLCSF.

Definitions
The following are some acronyms that we will use throughout this paper:

• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM

Concurrent Requests
Figure 1 shows an example of the Concurrent Manager Requests screen.

Figure 1

Phase and Status of Concurrent Requests
Figure 2 shows the various Phases and Statuses that a Concurrent Program can have, with a description of what they mean:

Phase      Status      Description – Action
Pending    Normal      The request is waiting to be picked up by the next available manager.
Pending    Standby     Waiting for CRM to resolve conflict. CRM could be slow or an incompatible program is running.
Running    Normal      The request is running normally.
Completed  Normal      The request has finished successfully.
Completed  Error       The request has finished with an error. Check the logs.
Completed  Warning     The request has finished with a Warning. Check the logs.
Inactive   No Manager  Request won’t run without a manager. Specialization rules aren’t configured properly.

Figure 2

The Phase and Status tell us what is happening with each Concurrent Program.
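To see how many requests sit in each Phase and Status combination shown in Figure 2, a summary query along these lines can be used (a sketch; the single-letter phase codes decoded here are the commonly documented values and should be confirmed in your instance):

SELECT DECODE(phase_code, 'P', 'Pending',
                          'R', 'Running',
                          'C', 'Completed',
                          'I', 'Inactive', phase_code) AS phase,
       status_code,
       COUNT(*) AS request_count
  FROM fnd_concurrent_requests
 GROUP BY phase_code, status_code
 ORDER BY 1, 2;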

Concurrent Managers
Figure 3 shows the Concurrent Manager Administer screen. Oracle seeds a number of Concurrent Managers and assigns Concurrent Programs to those managers. Your Applications System Administrator can also define custom managers and assign Concurrent Programs to those managers.

Figure 3

Figure 4 shows the different types of Concurrent Managers, their Service Instance, and their Program Name. Your Applications System Administrator can adjust the Concurrent Managers and Transaction Managers, but the other types of managers must be left alone.

Manager Type                    Service Instance                       Program
Internal Concurrent Manager     Internal Manager                       FNDLIBR
Conflict Resolution Manager     Conflict Resolution Manager            FNDCRM
Internal Monitor                Internal Monitor:Node                  FNDIMON
Service Manager                                                        FNDSM
Concurrent Manager              Standard Manager                       FNDLIBR
Concurrent Manager              Inventory Manager                      INVLIBR
Concurrent Manager              Session History Cleanup                FNDLIBR
Concurrent Manager              PA Streamline Manager                  PALIBR
Transaction Manager             CRP Inquiry Manager                    CYQLIB
Transaction Manager             FastFormula Transaction Manager        FFTM
Transaction Manager             PO Document Approval Manager           POXCON
Transaction Manager             Transaction Manager                    FNDTMTST
Scheduler/Prerelease Manager                                           FNDSVC
                                OAM Generic Collection Service:Node    FNDSVC

Figure 4

Concurrent Processing Overview
This diagram provides an overview of how Concurrent Processing works.

[Diagram: the Web Browser (HTML interface via the Web Server, or Java interface via JInitiator and the Forms Server) and the Report Review Agent connect over SQL*Net; the Service Manager (FNDSM), ICM (FNDLIBR), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), Conflict Resolution Manager (FNDCRM), and Reports Server process requests and produce log and out files.]

In the diagram, you can see that:
1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. Log and Out files from Concurrent Programs are generated on the Concurrent Processing server. Log files show what occurred when the program ran, while out files are the output of the program.
3. The Concurrent Program log and output file from a request is passed back as a report to the Report Review Agent.
4. The Report Review Agent passes a file containing the entire report to the Forms server.
5. The Forms Services component passes the report back to the user’s browser one page at a time. Profile Options can be used to control the size of the files and pages passed, to suit report volume and available network capacity.

Concurrent Manager Processes

Internal Concurrent Manager
Internal Concurrent Manager (FNDLIBR process) - Communicates with the Service Manager.

• The Internal Concurrent Manager (ICM) starts, sets the number of active processes, monitors, and terminates all other concurrent processes through requests made to the Service Manager, including restarting any failed processes.
• The ICM also starts, stops, and restarts the Service Manager for each node.
• The ICM will perform process migration during an instance or node failure.
• The ICM will be active on a single node. This is also true in a Parallel Concurrent Processing environment, where the ICM will be active on at least one node at all times.
• The ICM does not have any scheduling responsibilities. It has nothing to do with scheduling requests, or deciding which manager will run a particular request. The function of the ICM is to run 'queue control' requests; that is, requests to start up or shut down other managers.


• The ICM is responsible for startup and shutdown of the whole concurrent processing facility, and it monitors the other managers periodically, and restarts them if they should go down. It can also take over the Conflict Resolution Manager's job, and resolve incompatibilities.

• If the ICM itself should go down, requests will continue to run normally, except for 'queue control' requests. Your Applications System Administrator can restart the ICM by running the 'startmgr' command; there is no need to kill the other managers first.

Figure 5 shows the definition of the Internal Manager.

In this example of the ICM definition, there is a Secondary Node defined for PCP details.

Figure 5

In Release 11i, if there is more than one possible Secondary Node and the Primary Node fails, PCP will fail over to any node that is available. Specifying a Secondary Node limits failover to that node only. An available node is any node, except AUTHENTICATION, in the FND_NODES table whose status is set to ‘Y’.
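A quick way to see which nodes are considered available (a sketch; the FND_NODES column names shown should be verified in your release) is to query FND_NODES directly, which is essentially what Figure 6 displays:

SELECT node_name,
       status,          -- 'Y' = available
       support_cp,      -- 'Y' = node runs Concurrent Processing
       support_forms,
       support_web
  FROM fnd_nodes
 WHERE node_name <> 'AUTHENTICATION'
 ORDER BY node_name;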

Figure 6

In Figure 6, the TCP connection to RH9 has been disconnected and it shows a status of ‘N’.


Service Manager (FNDSM process) - Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service processes.

• The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms, Apache Listeners, Metrics or Reports Server, and any other process controlled through Generic Service Management).

• When the ICM terminates, the SM that resides on the same node as the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after termination when there is a function it needs to perform (start or stop a process), so there may be periods of time when the SM is not active; this is normal.
• All processes initialized by the SM inherit the same environment as the SM.
• The SM’s environment is set by the APPSORA.env file and the gsmstart.sh script.
• The TWO_TASK setting used by the SM to connect to a RAC instance must match the instance_name from GV$INSTANCE.
• The apps_<sid> listener must be active on each Concurrent Processing node to support the Service Manager connection to the local instance.
• There should be a Service Manager active on each node where a Concurrent Manager or non-Manager service process will reside.

FNDSM failover as noted in the Concurrent Manager log:

Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener process on RH3 could not be contacted, or the listener failed to spawn the Service Manager process.
Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)
CONC-SM TNS FAIL Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL Call to PingProcess failed for FNDCPGSC
CONC-SM TNS FAIL Call to StopProcess failed for FNDOPP
CONC-SM TNS FAIL Call to PingProcess failed for OAMGCS
CONC-SM TNS FAIL Call to StopProcess failed for OAMGCS
Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)
Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32
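To check the instance names that the Service Manager's TWO_TASK must match (per the bullet above), GV$INSTANCE can be queried from any node; this is a standard RAC view, but treat the check itself as a sketch:

SELECT inst_id,
       instance_name,   -- must match the TWO_TASK used by the SM
       host_name,
       status
  FROM gv$instance
 ORDER BY inst_id;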

Internal Monitor (FNDIMON process) - Communicates with the Internal Concurrent Manager.

• This manager/service is used to implement Parallel Concurrent Processing.
• You do not need to run this manager/service unless you are using Parallel Concurrent Processing.
• The Internal Monitor (IM) monitors the Internal Concurrent Manager, and restarts any failed ICM on the local node. It monitors whether the ICM is still running, and if the ICM crashes, it will restart it on another node.
• During a node failure in a PCP environment, the IM will restart the ICM on a surviving node (multiple ICMs may be started on multiple nodes, but only the first ICM started will eventually remain active; all others will gracefully terminate).
• There should be an Internal Monitor defined on each node where the ICM may migrate.

Standard Manager (FNDLIBR process) - Communicates with the Service Manager and any client application process.


• The Standard Manager is a worker process that initiates and executes client requests on behalf of Applications batch and OLTP clients.

Figure 7 shows the Administer Concurrent Managers screen:

Notice that there are two nodes defined, RH7 and RH8

Figure 7 You can also see the Concurrent Managers from the OAM web page:

Figure 8

In Figure 9, the Standard Manager is active on RH9, even though no Primary Node is defined:

3 processes will run if the Standard Manager fails over

Figure 9

Since no Secondary Node is defined, the Standard Manager will not fail over.


Notice in Figure 8 that the Work Shifts definition now includes Failover Processes, which specify the number of processes that will run when the Standard Manager fails over to the Secondary Node.

Transaction Manager
Transaction Managers communicate with the Service Manager, and any user process initiated on behalf of Forms, or a Standard Manager request. A Transaction Manager:

• Supports synchronous processing of requests from a client program.
• Gets a request for a client program to run a server-side program synchronously.
• Returns a status/results to the client program.
• At runtime, a number of these managers are started, as defined.
• Doesn’t poll the concurrent request table for a new request.
• You only need 1 Transaction Manager per database, not 1 per instance.

Figure 10 shows some of the Transaction Managers in Release 12:

Figure 10

Note that between Release 11i and Release 12, the way that Transaction Managers work has changed:

Release 11i Transaction Managers use DBMS_PIPE:
• This does not work across RAC instances.
• RAC users must perform additional configuration.
• Requires complicated configuration or additional hardware.

Release 12 Transaction Managers use AQ:
• Works across RAC instances.
• Simplifies configuration.
• Reduces complexity.
• A Profile Option can switch between mechanisms.
• DBMS_PIPE can be used for non-RAC users if performance becomes an issue.

Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits for the program to complete and can receive program results from the server. As the client and server are two separate database sessions, the communication between them in Release 11i has been handled using the DBMS_PIPE package. Unfortunately, the DBMS_PIPE package does not extend to communications between sessions on different RAC instances. On an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to time out for long periods or fail completely. The current workaround is to manually set up Transaction Managers to connect to all RAC instances, which not only takes up additional resources, but may also require additional middle-tier hardware or a complicated configuration that is difficult to maintain.

In Release 12, the Transaction Managers use the AQ mechanism; on RAC, the Transaction Managers work when connected to either instance. This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports, DBMS_PIPE or AQ.

[Diagram: Client and Server (TM) process flows for the AQ-based Transaction Managers. The client gets a Concurrent Processor, starts a transaction, places a message on the AQ, and waits on the return queue for the results or a timeout (returning with an error on timeout). The server (TM) listens for transaction requests, processes each request, places the results on the return queue, and repeats until it receives a shutdown.]

Here we see the Client and Server Process flows for the AQ Transaction Managers. The client-side flow is:

1. The Client gets an active Concurrent Processor id which can process the transaction request.
2. The Client returns if it can’t find any processor id.
3. The Client places a message containing the transaction details on the transaction AQ with the processor id as the correlation id. This message can be addressed by any available Transaction Manager that can process the client request.
4. The Client listens on the return queue for a return message until one arrives or a timeout period expires.

The server-side flow is:

1. The Transaction Manager listens for any transaction requests that match its processor id.
2. The Transaction Manager processes the transaction request, if there is any, and puts the results back on the return AQ.
3. The Transaction Manager repeats steps 1 and 2 until it shuts down.

TO SET UP TRANSACTION MANAGERS FOR PCP WHEN USING RAC

These steps apply to both 11i and R12:

1. Shut down the application tier services on all the nodes.

2. Shut down all the database instances cleanly in the RAC environment, using the command:

SQL>shutdown immediate;

3. Edit $ORACLE_HOME/dbs/<context_name>_ifile.ora and add these parameters (from Note 362135.1):

_lm_global_posts=TRUE
_immediate_commit_propagation=TRUE

4. Start the instance on each database node.

5. Start up the Application tier services on all nodes.

6. Navigate to Profile > System and change the Profile Option ‘Concurrent: TM Transport Type' to ‘QUEUE', and verify that the Transaction Manager works across the RAC instances. ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes (Note 240818.1).

7. The Profile Option “Concurrent: TM Transport Type” can be set to PIPE or QUEUE.

8. Pipes are more efficient, but require a Transaction Manager to be running on each database instance.

9. Navigate to the Concurrent > Manager > Define screen, and set up the Primary and Secondary Node names for the Transaction Managers.

10. Restart the Concurrent Managers.
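After step 10, the current transport can be confirmed from SQL*Plus. This is a sketch: it assumes the internal name of the 'Concurrent: TM Transport Type' profile option is CONC_TM_TRANSPORT_TYPE, which should be verified in your instance before use.

-- Returns PIPE or QUEUE at the current level (internal profile name assumed)
SELECT fnd_profile.value('CONC_TM_TRANSPORT_TYPE') AS tm_transport_type
  FROM dual;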

Conflict Resolution Manager
Concurrent Managers read requests to start Concurrent Programs. When a Concurrent Program is started, Concurrent Managers read the request information from the FND Concurrent Request tables. The Conflict Resolution Manager checks Concurrent Program definitions for incompatibility rules. If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the Concurrent Managers from starting other programs in the same conflict domain. When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program from starting until any incompatible programs in the same domain have completed running. If a Concurrent Program cannot run on any Concurrent Manager, perhaps because it has been assigned to a Concurrent Manager that is disabled, then the Concurrent Requests will stack up in the Conflict Resolution Manager.

TO ENABLE/DISABLE THE CONFLICT RESOLUTION MANAGER
• Use the system Profile Option 'Concurrent: Use ICM':
  o Setting it to 'No' allows the CRM to be started.
  o Setting it to 'Yes' causes the CRM to be shut down, and the Internal Concurrent Manager (ICM) will take over the conflict resolution duties.
• Note that using the ICM to resolve conflicts is not recommended. The CRM's sole purpose is to resolve conflicts, while the ICM has other functions to perform as well. Only set this option to 'Yes' if you have a good reason to do so.

Internal Scheduler/Prereleaser Manager
The short name for this manager is FNDSCH. It is also known as the Advanced Scheduler/Prereleaser Manager. This manager is intended to implement Advanced Schedules. Its job is to determine when a scheduled request is ready to run. Advanced Schedules were not fully implemented in Release 11.0. They are implemented in Release 11i, but are not widely used by the various Applications modules. General Ledger uses FNDSCH for financial schedules based on different calendars and period types. It is then possible to schedule AutoAllocation sets, Recurring Journals, MassAllocations, Budget Formulas, and MassBudgets to run according to the General Ledger schedules that have been defined. If financial schedules in GL are not being used, then it is not a problem to deactivate this manager.

Internal Concurrent Manager Failover Definition

Release 11i
Define Primary and Secondary Nodes in Release 11i.

Figure 11

By not specifying a Secondary Node, the ICM can fail over to any node that is available. Consider a system that has three or more concurrent processing nodes, where two nodes go down, including primary node RH3. If a Secondary Node had been specified, there would be a chance that the Secondary Node would not be available. This capability, to fail over to an un-named Secondary Node, is available for all managers in 11i. In Release 12 this works differently.

Release 12
In Release 12, for failover to function properly, both Primary and Secondary Nodes must be specified. Most managers won’t start if a Primary Node is not assigned. However, a few managers, for example the Internal Concurrent Manager and the Conflict Resolution Manager, will start on any available node. If a Secondary Node is not defined, the manager will not fail over.


Figure 12

Release 12 Generic Services and Request Processing Managers

Figure 13

GENERIC SERVICES
Generic Services include the Internal Concurrent Manager and Conflict Resolution Manager.


Figure 14

REQUEST PROCESSING MANAGERS
Request Processing Managers include the Standard Manager and other Concurrent Managers.

Figure 15

GENERIC SERVICE MANAGEMENT
An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers, and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to be individually started and monitored by the Applications System Administrator. Management of these processes is complicated, since these services can be distributed across multiple host machines.

The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by providing a fault tolerant service framework and a central management console built into Oracle Applications Manager (OAM). Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on multiple host machines. With Service Management, virtually any application tier service can be integrated into this framework.


Figure 16

Figure 16 shows that beginning with Release 11i, services such as the Oracle Forms Listener, Oracle Reports Server, Apache Web listener, and Oracle Workflow Mailer can be run under Service Management. With Service Management, the Internal Concurrent Manager (ICM) manages the various service processes across multiple hosts. On each host, a Service Manager acts on behalf of the ICM, allowing the ICM to monitor and control service processes on that host. Applications System Administrators can then configure, monitor, and control services through a management console that communicates with the ICM. Figure 17 shows the Oracle Applications Manager (OAM) screen that an Applications System Administrator can use to manage the Concurrent Managers.


Figure 17

Service Management provides a fault tolerant system. If a service process exits unexpectedly, the ICM will automatically attempt to restart the process. If a host fails, the ICM may start the affected service processes on a secondary host. The ICM itself is monitored and kept alive by Internal Monitor processes located on various hosts.

TEST – KILL SERVICES TO SEE IF GSM RESTARTS THEM
In this example, we will kill the FNDSM process and the FNDCRM process to see if Generic Service Management correctly restarts each process:

Kill FNDSM

applvis   9007     1  0 11:53 ?        00:00:00 FNDSM
applvis   9159  9155  0 11:55 ?        00:00:00 FNDLIBR
applvis   9161  5683  0 11:55 pts/3    00:00:00 grep FND
[applvis@rh9 scripts]$ kill -9 9007
[applvis@rh9 scripts]$ ps -ef |grep FND
applvis   9159  9155  0 11:55 ?        00:00:00 FNDLIBR
applvis   9169     1  0 11:55 ?        00:00:00 FNDSM
applvis   9249  5683  0 11:57 pts/3    00:00:00 grep FND

Kill FNDCRM

[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis   8886     1  0 11:52 ?        00:00:00 FNDCRM APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B78C4B439EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318
[applvis@rh9 scripts]$ kill -9 8886
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis   9457  9392  0 12:09 ?        00:00:00 FNDCRM APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633DCB901267BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343

In each case, the killed service was restarted before I could enter the grep command to find the new process. Figure 18 shows that the entire set of system services may be started or stopped with a single action.

Choose an action from the pulldown to start or stop services

Figure 18

GSM AND MULTIPLE NODES
GSM enables users to manage Applications services across multiple middle-tier nodes. This includes services on Web/Forms nodes that previously have had no concurrent processing footprint. Users configuring GSM in a multiple-node system should be sure to have followed the instructions for setting up Parallel Concurrent Processing. This includes setting the environment variable APPLDCP=ON and assigning a Primary Node for all defined managers and services (if not already defined).

SEEDED GSM SERVICES
When configuring GSM, the following GSM Services are seeded automatically:


• Forms Listener
• Metrics Server
• Metrics Client
• Reports Server
• Apache Listener
• LINUX users should not activate the Reports Server under GSM.

These services, once seeded, may be managed under GSM and controlled via Oracle Applications Manager.

FNDSVCRG – SERVICE CONTROLLER UTILITY
FNDSVCRG is an executable introduced as a part of the Seeded GSM Services. It provides improved coordination between the GSM monitoring of these services and their command-line control scripts. The $FND_TOP/bin/FNDSVCRG executable is triggered from the concurrent processing control script before and after the script starts or stops the service. FNDSVCRG connects to the database and validates the configuration of the Seeded GSM Service. If a service is not enabled to be managed under GSM, the FNDSVCRG executable does nothing and exits. The script then continues to perform its normal start/stop actions. If a service is enabled for GSM management, the FNDSVCRG executable will update the service information in the database, including the environment context, the current service log file location, and the current state of the service.

VERIFY GSM

• To verify that GSM is working, start the Concurrent Managers.
• Once GSM is enabled, the ICM uses Service Managers to start all Concurrent Managers and activated services.
• If the ICM successfully starts the managers, then GSM has been configured properly.
• If managers and/or services fail to start, errors should appear in the ICM log file.

Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as the Concurrent Manager log files. It is useful to examine these log files when there are problems starting services. If you cannot locate the Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that needs troubleshooting.
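To confirm at the profile level that GSM is enabled, a query like the following can be used; it assumes the internal name of the 'Concurrent:GSM Enabled' profile option is CONC_GSM_ENABLED, so verify that name in your instance before relying on it.

-- 'Y' would indicate GSM is enabled (internal profile name assumed)
SELECT fnd_profile.value('CONC_GSM_ENABLED') AS gsm_enabled
  FROM dual;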

Parallel Concurrent Processing APPLDCP Profile Option
Starting with Release 11.5.10 (FND.H), the APPLDCP environment variable is ignored. Release 12 GSM requires the value of APPLDCP to be set to “ON”. The value is hard-coded in afpcsq.lpc version 115.35, thereby ignoring the value of APPLDCP. According to Oracle’s ATG Development in Note 753678.1:

“As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally hard-coded to "ON" when the Generic Service Management (GSM) is enabled--"keeping in mind, use of the GSM is required". In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is ignored--this is the "default behavior on all Release 12 releases." NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle Applications Release 11.5.10" (3140000) contains "afpcsq.lpc" version 115.37.”

Parallel Concurrent Processing
• In a Release 11i or Release 12 environment with Parallel Concurrent Processing enabled, the Primary Node assignment is optional for the Internal Concurrent Manager.
• The Internal Concurrent Manager can be started from any of the nodes (host machines) identified as concurrent processing server enabled.


• In the absence of a Primary Node assignment for the Internal Concurrent Manager, the Internal Concurrent Manager will stay on the node (host machine) where it was started.

• If a Primary Node is assigned, the Internal Concurrent Manager will migrate to that node if it was started on a different node.

• If the node on which the Internal Concurrent Manager is currently running becomes unavailable, the Internal Concurrent Manager will be restarted on an alternate concurrent processing node.

• If a Primary Node is not assigned, the Internal Concurrent Manager will continue to operate on the node where it was restarted.

• If a Primary Node has been assigned to the Internal Concurrent Manager, then it will be migrated back to that node whenever the node becomes available.
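One way to review the Primary, Secondary, and current Target Node assignments described above is to query FND_CONCURRENT_QUEUES (a sketch; the column names are from the standard FND schema and should be verified in your release):

SELECT concurrent_queue_name,
       node_name        AS primary_node,
       node_name2       AS secondary_node,
       target_node,
       max_processes,
       running_processes
  FROM fnd_concurrent_queues
 WHERE enabled_flag = 'Y'
 ORDER BY concurrent_queue_name;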

Release 11i Parallel Concurrent Processing
• In releases before Release 11i, there must be an assigned Primary and Secondary Node for each Concurrent Manager.
• Primary and Secondary Nodes need not be explicitly assigned. However, you can assign Primary and Secondary Nodes for directed load and failover capabilities.
• In Release 11i, with three or more nodes in the concurrent processing tier, it is recommended to not specify the Secondary Node for failover. This is because the specified Secondary Node may not be available when the Primary Node goes down.
• By not specifying the Secondary Node, GSM can find an available node with Concurrent Processing services that can be used during failover.

Release 12 Parallel Concurrent Processing
• With Release 12, if a Secondary Node is not specified, the processes will not fail over as they do in Release 11i. This is a critical difference between Release 11i and Release 12.

Parallel Concurrent Processing
Parallel concurrent processing allows distribution of Concurrent Managers across multiple nodes. The benefits are improved performance, availability, and scalability (load balancing). Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM. With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for Concurrent Managers and other service instances. There should be only one ICM and one CRM at any given time. However, the ICM and CRM can be configured to run on several of the nodes. Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down. The following diagram shows how Parallel Concurrent Processing works:


[Diagram: Parallel Concurrent Processing across two Concurrent Processing nodes. Each node runs a Service Manager (FNDSM), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), and FNDCRM, with the ICM (FNDLIBR) active on one node; both nodes connect to the database over SQL*Net and write their own request log and out files, while users reach the system through the Web Server/HTML interface or the Forms Server (JInitiator/Java interface) and review output through the Report Review Agent and Reports Server.]

Internal Concurrent Manager: The Internal Concurrent Manager can run on any node, and can activate and deactivate Concurrent Managers on all nodes. Since the Internal Concurrent Manager must be active at all times, it needs high fault tolerance. To provide this fault tolerance, parallel concurrent processing uses Internal Monitor Processes.

Internal Monitor Processes: The sole job of an Internal Monitor Process is to monitor the Internal Concurrent Manager and to restart that manager should it fail. The first Internal Monitor Process to detect that the Internal Concurrent Manager has failed restarts that manager on its own node. Only one Internal Monitor Process can be active on a single node. You decide which nodes have an Internal Monitor Process when you configure your system. You can also assign each Internal Monitor Process a Primary and a Secondary Node to ensure failover protection. Internal Monitor Processes, like Concurrent Managers, have assigned work shifts, and are activated and deactivated by the Internal Concurrent Manager.

However, automatic activation of PCP does not additionally require that Primary Nodes be assigned for all Concurrent Managers and other GSM-managed services. If no Primary Node is assigned for a service instance, the Internal Concurrent Manager (ICM) assigns a valid Concurrent Processing Server Node as the Target Node. In general, this node will be the same node where the Internal Concurrent Manager is running. In the case where the ICM is not on a Concurrent Processing Server Node, the ICM chooses an active Concurrent Processing Server Node in the system. If a Concurrent Processing Server Node is not available, a Target Node will not be assigned.

If a Concurrent Manager does have an assigned Primary Node, it will only try to start up on that node; if the Primary Node is down, it will look for its assigned Secondary Node, if one exists. If both the Primary and Secondary Nodes are unavailable, the Concurrent Manager will not start (the ICM will not look for another node on which to start the Concurrent Manager). This strategy prevents overloading any node in the case of failover.

The Concurrent Managers are aware of many aspects of the system state when they start up. When an ICM successfully starts up, it checks the TNS listeners and database instances on all remote nodes. If an instance is down, the affected managers and services switch to their Secondary Nodes. Processes managed under GSM will only start on nodes that are in Online mode. If a node is changed from Online to Offline, the processes on that node will be shut down and switched to a Secondary Node if possible.

Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers connecting to it switch to a secondary middle-tier node. However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using the TNS connection-time failover mechanism instead), use the Profile Option Concurrent:PCP Instance Check. When this Profile Option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will continue to provide middle-tier node failover support when a node goes down. For the Internal Concurrent Manager, you assign the Primary Node only.

To Set Up PCP with RAC
The following assumes a 2-node RAC cluster, where node1 is known as vip1 and node2 is known as vip2:

1. Check the configuration files tnsnames.ora and listener.ora located under the 8.0.6 ORACLE_HOME at $ORACLE_HOME/network/admin/<context>. Ensure that you have FNDSM and FNDFS entries for all the other concurrent processing nodes.

2. Restart the Applications listener processes on each application node.

3. Log in to Oracle E-Business Suite Release 11i as SYSADMIN and choose the System Administrator Responsibility. Navigate to the Install > Nodes screen, and ensure that each node in the cluster is registered.

4. Verify that the Internal Monitor for each node is defined properly, with the correct Primary and Secondary Node specifications and work shift details.

5. Confirm that the Internal Monitor manager is activated from Concurrent > Manager > Administrator, activating the manager as required. For example, Internal Monitor: Host2 might have the Primary Node as vip2 and Secondary Node as vip1.

6. On all Concurrent Processing nodes, set the $APPLCSF environment variable to point to a log directory on a shared file system.

7. On all Concurrent Processing nodes, set the $APPLPTMP environment variable to the value of the UTL_FILE_DIR entry in the init.ora file on the database nodes. This value should be a directory on a shared file system.

8. Do not use a load balanced TNS entry for the value of s_cp_twotask. The request may hang if the sessions are load balanced. Worker 1 connected to DB Instance 1 places a message in the pipe, and expects Worker 2 (which is connected to DB Instance 2) to consume the message. However, Worker 2 never gets the message since pipes are instance private. Optimizing the E-Business Suite with Real Application Clusters (RAC) - Ahmed Alomari

9. Set Profile Option 'Concurrent: PCP Instance Check'

o to 'ON' means that Concurrent Managers will fail over to a secondary application tier node if the database instance to which it is connected goes down.


o to 'OFF' if instance-sensitive failover is not required.
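As a quick check for step 7 above, the UTL_FILE_DIR setting that $APPLPTMP must match can be read from the database (a sketch; run it against each RAC instance):

SELECT value AS utl_file_dir
  FROM v$parameter
 WHERE name = 'utl_file_dir';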

Oracle Network Basics
There are four failover methods (and one method that we haven’t tested yet) that can be used once a TCP failure is detected: Dead Connection Detection, TCP Keepalive, ICM Process Monitor (PMON), Connection Failure Recovery (Release 12), and 10g Timeout Parameters (our untested method).

1. Dead Connection Detection – sqlnet.expire_time=1 (minute)
Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it. DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.

When the timer expires, SQL*Net on the server sends a "probe" packet to the client. The probe is an empty SQL*Net packet and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol. If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection's resources.

TCP/IP, for example, is a connection-oriented protocol, and as such, the protocol will implement some level of packet timeout and retransmission in an effort to guarantee the safe and sequenced order of data packets. If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out. After TCP/IP gives up, SQL*Net receives notification that the probe failed.

On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or $ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid. This is a server feature only. The client may be running any supported SQL*Net V2 release. DCD is much more resource-intensive than similar mechanisms at the protocol level. With DCD enabled, if the connection is idle for the duration of the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. This packet is sent using TCP/IP.

Both the Internal Concurrent Manager and the Internal Monitor can use the DCD functionality of the Network (TCP sqlnet).

The ICM is a client process connected to a DCD-enabled database dedicated server process. The ICM holds a named PL/SQL lock, the “ICM lock”. The IM continuously checks whether it can get the same named PL/SQL lock. As soon as the “ICM lock” is released by the database (via DCD), FNDIMON pings the ICM node to determine whether the ICM has crashed.
  o If the ping succeeds, we conclude that the ICM is fine. Obviously, the ICM can be down even if TCP is working, so this is bad logic that can lead to false positives.
  o If the ping fails, we further check whether it has been over four PMON cycles since the ICM updated the work_start column in the FND_CONCURRENT_QUEUES table.
  o If it has been more than four PMON cycles, we conclude that the ICM is dead.


The DCD comes into the picture here after the ICM has crashed and the database needs to identify that the ICM is gone.

The database needs to clean up the dedicated server process resource corresponding to the ICM client process.
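To see the dedicated server sessions that belong to the ICM and the other FNDLIBR managers (the sessions that DCD and PMON eventually clean up), a query along these lines can be used; matching on the FNDLIBR program name is an assumption that should be checked against your platform:

SELECT inst_id,
       sid,
       serial#,
       status,
       program,          -- typically contains FNDLIBR for manager sessions (assumption)
       logon_time,
       last_call_et
  FROM gv$session
 WHERE program LIKE 'FNDLIBR%'
 ORDER BY inst_id, logon_time;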

To Configure Dead Connection Detection (DCD)
Implement DCD by adding SQLNET.EXPIRE_TIME = 1 (minutes) to the sqlnet.ora file. With DCD enabled, if the connection is idle for the duration of the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. This packet is sent using TCP/IP. TCP/IP is a connection-oriented protocol. This protocol implements a level of packet timeout and retransmission to help guarantee the safe and sequenced order of data packets. If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out. After TCP/IP gives up, SQL*Net receives notification that the probe failed. If the client side of the connection is still connected and responsive, the client sends a response packet back to the database server, resetting the timer, and another probe will be sent when the next interval expires (assuming no other activity on the connection).

If the client fails to respond to the DCD probe packet:
• The server-side process is marked as a dead connection, and
• PMON performs the cleanup of the database processes / resources, and
• The client OS processes are terminated.

Dead Connection Detection:
1. DCD initiates cleanup of OS and database processes that have disconnected / terminated abnormally.
2. DCD will not initiate cleanup of sessions that are still connected but are idle / abandoned / inactive.

2. TCP Keepalive
Keep-Alive is a TCP/IP mechanism that allows a connection to detect if the partner has unexpectedly died. It is a function of the TCP stack in use and is NOT an Oracle mechanism, although Oracle can request for KeepAlive to be enabled or disabled for a given connection. SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding a parameter to the sqlnet.ora file. Adding this parameter turns on a TCP level facility which can detect the loss of a server. If the server dies then keepalive will notice this and signal an error to Oracle Net code. In a RAC environment, TAF notices this error and performs fail-over as if the remote instance had been aborted.

TCP KEEPALIVE PARAMETERS FOR LINUX:
tcp_keepalive_time    the time since the last data packet sent and the first keepalive probe
tcp_keepalive_intvl   the time between keepalive probes
tcp_keepalive_probes  the number of probes to be sent before declaring the connection dead

Initial Settings
tcp_keepalive_time = 200 seconds
tcp_keepalive_intvl = 20
tcp_keepalive_probes = 2

After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart. Then, TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.

tcp_retries1 (default: 3)   The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
tcp_retries2 (default: 15)  The maximum number of times a TCP packet is retransmitted in established state before giving up.


tcp_syn_retries (default: 5)  The maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.

Now let’s consider an example where the following TCP parameters are changed from their default values:

tcp_retries1 = 2 tcp_retries2 = 2 tcp_syn_retries = 2

In this example, the time to initialize the PCP failover was an average of 8 seconds after changing these TCP parameters. We found the following Linux parameters listed in the Metalink note: 249213.1

net.ipv4.tcp_keepalive_time   3000
net.ipv4.tcp_retries2         5
net.ipv4.tcp_syn_retries      1

By changing some of these parameters, the timeout period was reduced to about 20 seconds, with the following breakdown for the timeout:

• The client initiates a TCP/IP three-way handshake, but there is no response.
• The client waits a specified amount of time (usually OS configurable), like 200ms.
• It sends the SYN packet again, but still gets no response.
• It waits 400ms and tries again.
• Receiving no response, it waits 800ms and tries again.
• Again receiving no response, it waits 1600ms and tries again.
• After another wait of 3200ms, the client gives up.
• By now 6.2 seconds have passed by.

“Therefore it keeps trying every 3200ms until a magic interval occurs and it stops. On Sun this interval is tcp_ip_abort_cinterval and defaults to 3 minutes (180000ms).” (Note 249213.1)

Six seconds is very close to the time measured during tests with tcp_syn_retries and tcp_retries2 set to 2. The measured average was 8 seconds. Multiple measurements at 5 seconds recorded no change in connection status. However, one failover was initiated at a measured time of 6 seconds.

When configured correctly, Keepalive enables dead connections to be discovered and closed more quickly, freeing resources used on the server more quickly. At the time of this document, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding the ENABLE=BROKEN parameter to the SQL*Net connect string, or by adding this parameter to the sqlnet.ora file.

**WARNING** Keepalive intervals can typically be set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes). If there are a lot of IDLE connections on your network, then reducing keepalive can increase network traffic significantly.

Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause):

VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))))

3. ICM Process Monitor (PMON) – once TCP fails, this method, introduced with Patch 6495206, takes 2 minutes

• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine.
• If the ping fails, we further check if it has been over four PMON cycles since the ICM updated the WORK_START column of the FND_CONCURRENT_QUEUES table.
• If it has been more than four PMON cycles, we conclude that the ICM is dead.

Release 11i only uses PMON if Patch 6495206 has been applied. The PMON method is included in Release 12.

DEFAULT PMON SETTINGS
Figure 19 shows the Oracle Applications Manager screen with the PMON settings for this instance:

Click here to edit the PMON parameters

Figure 19

4. Connection Failure Recovery (Release 12) When Concurrent Managers fail due to a loss of the database connection:

• A Reviver process will be started


• When a database connection is possible, the Reviver will restart Concurrent Processing.
• Concurrent Processing can be started / stopped when the network or database is down.
• This should reduce processing down time, because Concurrent Processing restarts as soon as possible.
• This should reduce the Applications System Administrator’s workload, since he will no longer need to take the extra step of restarting the Concurrent Managers.

Of the first three methods, in Release 11i, the method that recognizes the failure first depends on the timeout settings of each method. In Release 12, Method 4 is used to perform failover.

5. 10g Timeout Parameters (Untested Solution) With the release of Oracle 10g, Oracle can time out within a desired period, instead of waiting for the TCP timeout to occur. The following settings can be used in the sqlnet.ora file on the client or server:

• sqlnet.inbound_connect_timeout (server)
• sqlnet.send_timeout (client and/or server)
• sqlnet.recv_timeout (client and/or server)

This method should provide automated recovery for Concurrent Managers after network or database failures. When a network failure occurs on a concurrent processing node, resulting in a loss of database connectivity, all Concurrent Managers running on that node will eventually be forced to shut down. In cases where multiple Concurrent Processing nodes are being used, and these other nodes retain their database connection, the managers will migrate to the working nodes. In the case where only a single Concurrent Processing node is being used, or when all Concurrent Processing nodes lose their database connection (for example, if the database node suffers a network failure), all running Concurrent Managers on the entire instance will be forced to shut down.

Without this feature, when the network comes back up, the managers must be restarted manually, as there is no automatic restart facility. This can lead to lost productivity between the time the network is restored and when the managers are restarted. With this new feature, the Concurrent Managers will restart automatically as soon as connectivity is restored. To achieve this, when a connection failure situation arises, a new monitor process, the Reviver, is started. This process will remain alive until it is able to obtain a database connection and restart Concurrent Processing.

In addition, this allows the Applications System Administrator to maintain control over Concurrent Processing even when a network or database failure has brought down Concurrent Processing. When the connection is down, an administrator can still start CP using the adcmctl.sh script, and by doing so it will start a Reviver process. When Concurrent Processing is down and a Reviver process is actively waiting to restart Concurrent Processing, the adcmctl.sh script can be used to stop Concurrent Processing, as it will detect the Reviver and shut it down.

There is no additional setup required to use Connection Failure Recovery. If you wish to disable Connection Failure Recovery, you can do so by setting the Concurrent Processing Reviver Process context file variable to “Disabled”.


From Aaron Weisberg at Oracle:

[Flowchart: ICM and Reviver flow. When the ICM starts to shut down, it checks whether it has lost its database connection; if so, it spawns the Reviver and exits. The Reviver sleeps, attempts to get a database connection, and on success kills the previous ICM database session, starts the ICM, and exits once the ICM is running; if it receives a shutdown, it exits.]

reviver.sh – code summary:
  Sleep 30
  Test_connection
  Kill_old_icm
    Get session
    Alter system kill session
  Check_running_icm
    Fnd_conc.ecm_alive
  Start_icm
    startmgr.sh

This example shows the reviver.log:

reviver.sh starting up...
[ Mon Jan 12 20:02:15 MST 2009 ] - Read APPS username/password.
[ Mon Jan 12 20:02:45 MST 2009 ] - Attempting database connection...
[ Mon Jan 12 20:02:45 MST 2009 ] - Successful database connection.
[ Mon Jan 12 20:02:45 MST 2009 ] - Killing previous ICM session...
1 row updated.
Commit complete.
[ Mon Jan 12 20:02:45 MST 2009 ] - Looking for a running ICM process...
[ Mon Jan 12 20:02:45 MST 2009 ] - ICM now running, reviver.sh complete.

Reviver Context Variables

• Concurrent Processing Reviver Process – s_cp_reviver
• Reviver Process PID Directory Location – s_cp_fndreviverpiddir: a writable directory location in which to create a pid file for the ICM reviver process
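A quick way to confirm these settings on a node is to look them up in the applications context file, as in this sketch (assumes the standard EBS environment file has set $CONTEXT_FILE):

# Show the Reviver-related variables in the applications context file
grep -E "s_cp_reviver|s_cp_fndreviverpiddir" $CONTEXT_FILE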


As part of its shutdown process, the ICM detects whether it is being forced to shut down because it has lost its database connection. It does this by looking for the specific error messages ORA-3113, ORA-3114, or ORA-1041. If one of these errors is detected:

• The ICM will assume that it has lost its database connection and will spawn the reviver process. The ICM passes the APPS username/password to the script using a secure protocol, along with the Oracle session id of the current ICM process. When the script starts, it attempts to make a database connection using SQL*Plus. If unsuccessful, it sleeps for 30 seconds before trying again, and continues until it either successfully makes a connection or receives a signal to shut itself down. When it successfully makes a connection, it first kills the old ICM database session to make sure any locks are released, then starts a new ICM using the normal startmgr script. It then checks that an ICM is successfully running; it will not exit until a new ICM is running. Once the ICM is restarted, it starts up any other managers that had been shut down, and normal processing resumes.
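The following shell sketch illustrates the retry loop just described; it is not the actual reviver.sh, and APPS_LOGON and OLD_SESSION are placeholders (the real script receives the credentials and the old ICM session id from the ICM itself):

#!/bin/sh
# Illustrative only: loop until a database connection succeeds, then clean up and restart the ICM
APPS_LOGON="apps/apps"       # placeholder; passed securely to the real script
OLD_SESSION="123,456"        # placeholder for the old ICM session's sid,serial#
while :
do
  if echo "exit" | sqlplus -s -L "$APPS_LOGON" > /dev/null 2>&1
  then
    # Connection works: kill the old ICM session so its locks are released
    echo "alter system kill session '$OLD_SESSION';" | sqlplus -s -L "$APPS_LOGON"
    # Restart the ICM with the normal start script, then stop looping
    startmgr.sh sysmgr="$APPS_LOGON"
    break
  fi
  sleep 30                   # wait 30 seconds before retrying, as reviver.sh does
done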

PCP Failover
Failover is the process of migrating the Concurrent Managers from the Primary Node to the Secondary Node because of a concurrent processing tier failure or listener failure. Failback is when the Primary Node becomes available again and the Concurrent Managers need to migrate back to their original Primary Node. If the Concurrent Managers are set up for PCP failover:

• Failover is triggered when a node running the ICM goes down
• When the ICM goes down, the connected database server process clears its resources (including the named PL/SQL "ICM lock")
• The database server process cleanup depends on the DCD mechanism of the network (SQL*Net)
• SQL*Net determines that a connected client has closed down through the DCD mechanism, and triggers the database server process cleanup

For example, if:
Primary Node = HOST1 – the managers assigned to the Primary Node are the ICM (FNDLIBR-cpmgr) and FNDCRM
Secondary Node = HOST2 – the manager assigned to the Secondary Node is the Standard Manager (FNDLIBR)

When HOST1 becomes unavailable (meaning TCP is no longer working), both the ICM and FNDCRM are migrated to HOST2. This can be seen from the Administer Concurrent Managers screen in the System Administrator responsibility. The $APPLCSF/log/.mgr logfile shows that HOST1 is being added to the unavailable list. The Log and Out directories must be on a shared disk. On HOST2, after the PMON cycle, FNDICM, FNDCRM, and FNDLIBR are migrated and running. FNDIMON and FNDSM run independently on each concurrent processing node. FNDSM is not a persistent process, and FNDIMON is a persistent process local to each node.


Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a Concurrent Manager log shows the case where a failure is detected:

fdpsrp() (running_processes correction): ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.
remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess

The PingProcess call at the end of this log continues until the Concurrent Manager processes resume, or a TCP failure is detected and failover begins.

ICM Failover in Release 11i
• The ICM and the IM (Internal Monitor, FNDIMON) use the DCD functionality of the network (TCP SQL*Net).
• The ICM is a client process connected to a DCD-enabled database dedicated server process.
• The ICM holds the named PL/SQL lock, the "ICM lock".
• The IM continuously checks whether it can obtain the same named PL/SQL lock.
• As soon as the "ICM lock" is released by the database/DCD after the ICM crash, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
• The DCD works after the ICM has crashed, and the database needs to identify that the ICM is gone.
• Then the database needs to clean up the dedicated server process resource corresponding to the ICM client process.
• If the "ICM lock" is not available, FNDIMON pings the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine.
  o Obviously, the ICM can be down even if TCP is working, so this logic is weak.
• If the ping fails, we further check whether it has been over four PMON cycles since the ICM updated the WORK_START column in the FND_CONCURRENT_QUEUES table.
• If it has been more than four PMON cycles, we conclude that the ICM is dead.
• Failover is triggered when the node running the ICM goes down.
• The ICM going down leads to the connected database server process clearing its resources (including the named PL/SQL lock).
• In turn, the database server process cleanup depends on the DCD mechanism of the network (sqlnet).
• That is, sqlnet determines that the connected client has closed down through the DCD mechanism and triggers the database server process cleanup.
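To see the heartbeat that this check relies on, you can query the WORK_START column mentioned above. This is only a sketch: the queue name FNDICM for the Internal Manager, the TARGET_NODE column, and the apps/apps logon are assumptions.

# When did the ICM last update its heartbeat, and where is it targeted to run?
sqlplus -s apps/apps <<'EOF'
select concurrent_queue_name, target_node,
       to_char(work_start, 'DD-MON-YYYY HH24:MI:SS') last_work_start
from   fnd_concurrent_queues
where  concurrent_queue_name = 'FNDICM';
EOF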

11i PCP Failure
The following steps occur in the order indicated:

• TCP failure
• The ICM lock is released; FNDIMON pings the ICM node; if the ping fails, check PMON
• PMON detects a "dead process" (the crashed ICM)
• reviver.sh
• DCD

R12 PCP Failure

• TCP failure
• PMON detects a "dead process"
• ICM shutdown
  o Looks for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD

Test PCP Failover Components
This test explores the effect of the DCD, PMON, and TCP failover methods. The variables are sqlnet.expire_time, the PMON sleep time and number of cycles, and the following TCP keepalive parameters:


• tcp_keepalive_time
• tcp_keepalive_intvl
• tcp_keepalive_probes
• tcp_retries1 (default: 3, new value 2)
• tcp_retries2 (default: 15, new value 2)
• tcp_syn_retries (default: 5, new value 2)
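A sketch of how these variables might be set for one test run on Linux (the values mirror one of the rows in the table below and are illustrative, not recommendations; sysctl changes require root and do not persist across reboots):

# Dead connection detection on the database server (minutes)
echo "SQLNET.EXPIRE_TIME = 1" >> $ORACLE_HOME/network/admin/sqlnet.ora
# TCP keepalive and retry tuning on the concurrent processing node
sysctl -w net.ipv4.tcp_keepalive_time=200
sysctl -w net.ipv4.tcp_keepalive_intvl=20
sysctl -w net.ipv4.tcp_keepalive_probes=2
sysctl -w net.ipv4.tcp_retries1=2
sysctl -w net.ipv4.tcp_retries2=2
sysctl -w net.ipv4.tcp_syn_retries=2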

Failover / Failback (secs)   Expire_time (min)   PMON Sleep   PMON Cycles   tcp_KA time   tcp_KA intvl   tcp_KA probes   tcp_retries1   tcp_retries2   tcp_syn_retries
241 /                        1                   30 secs      4             200           20             2               3              15             5
250 / 50                     5                   30 secs      4             200           20             2               3              15             5
262 / 100                    10                  30 secs      4             200           20             2               3              15             5
300 / 75                     1                   15 secs      2             200           20             2               3              15             5
285 / 35                     10                  30 secs      4             1000          60             10              3              15             5
8 / 105                      1                   30 secs      4             1000          60             10              2              2              2
10 / 42                      1                   30 secs      4             200           20             2               2              2              2
7 / 40                       10                  30 secs      4             200           20             2               2              2              2
6 / 34                       1                   15 secs      2             200           20             2               2              2              2

Test the Failover and Failback of Parallel Concurrent Processing
In Figure 20, Oracle Applications Manager (OAM) shows the details of the Internal Manager (ICM) activated on RH9:

Figure 20

In Figure 21, the ICM, CRM and Standard Managers all have their Primary Node as RH9.


Figure 21

In Figure 22, we can see that the Standard Manager is configured to fail over to the Secondary Node, RH7:

Figure 22

Disconnect TCP Connection from RH9

The Internal Concurrent Manager has encountered an error. Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55 -


Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
12-JAN-2009 15:22:55
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 1541.
The VIS_0112@VIS internal concurrent manager has terminated with status 1 - giving up.
Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26), manager=(0/1)

DB Node – RH8

Found dead process: spid=(1185), cpid=(1301550), ORA pid=(78), manager=(0/1)
Process monitor session started : 12-JAN-2009 15:18:27
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
CONC-SM TNS FAIL Call to PingProcess failed for XDPCTRLS
CONC-SM TNS FAIL Call to PingProcess failed for XDPQORDS

In Figure 23, OAM shows node RH9 is down, as well as all the application services on RH9.

Figure 23
[Diagram: the database and its listener on the DB node (RH8), with SQL*Net clients (sqlnet.ora) on PCP nodes RH7 and RH9; node RH9 is down, and TCP_KEEPALIVE takes 240 seconds before DCD starts.]

Figure 24 shows the CRM is down: Actual=0 and Target=1.


Figure 24 (the Conflict Resolution Manager is down)

The ICM tries to restart the CRM and the other failed processes, but can't:

CONC-SM TNS FAIL Found dead process: spid=(999999), cpid=(1301562), Service Instance=(1050)
Starting XDP_Q_EVENT_SVC Concurrent Manager : 12-JAN-2009 15:19:21
CONC-SM TNS FAIL Found dead process: spid=(999999), cpid=(1301563), Service Instance=(1051)

If we run the command ps -ef | grep applvis, we can see defunct processes:

The CRM and two other FNDLIBR processes are shutting down, but the FNDSM is still running. The ICM is still running in another FNDLIBR process, shown below:

The FNDSM Service Manager is still running. RH9 is shown as down, TCP is disconnected, and the Internal Manager has failed over to RH7, as shown in Figure 25:


Figure 25 (RH7 is now running the Internal Manager)

RH7 starts up the Conflict Resolution Manager in Figure 26:

Figure 26 (RH7 starts up the Conflict Resolution Manager)

In Figure 27, the Concurrent Managers have started processing Concurrent Requests on the Secondary Node, RH7:


Figure 27

Figure 28 shows the Oracle Applications Manager screens with RH7 activated:

Figure 28

It is important to note that, unlike Release 11i, Release 12 doesn't fail over a manager if no Secondary Node is defined. In Figure 29, only the Session History Cleanup, Standard Manager, and WMS Task Archiving Manager have Secondary Nodes defined. In this case, the Primary Node is RH9 and the Secondary Node is RH7.


The Inventory Manager, MRP Manager, and OAM Metrics Collection Manager will not fail over unless they are defined to do so.

Figure 29

ICM Failover
Figure 30 shows the Internal Manager processing migrating back to the Primary Node, RH9.

Starting Internal Concurrent Manager Concurrent Manager : 12-JAN-2009 15:19:45 : Started ICM on Target RH7.
Process monitor session ended : 12-JAN-2009 15:21:15 : Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 12-JAN-2009 15:21:45
The VIS_0112@VIS internal concurrent manager has terminated successfully - exiting.

Figure 30

In Figure 31, the Internal Manager is up on RH9 and the Conflict Resolution Manager is starting up on RH9:


Figure 31

Figure 32

Failover of the ICM and CRM from RH9 to RH7 is complete. In the next section, the TCP connection is restored and the failback from RH7 to RH9 is documented.

Connect TCP Connection – Failback from RH7 to RH9
Failback from RH7 to RH9 is starting:

Start of Failback
Starting Internal Concurrent Manager Concurrent Manager : 12-JAN-2009 15:12:35 : Started ICM on Target RH9.
Process monitor session ended : 12-JAN-2009 15:14:05 : Migration of ICM has completed.


Shutting down Internal Concurrent Manager : 12-JAN-2009 15:14:35
The VIS_0112@VIS internal concurrent manager has terminated successfully - exiting.
=======================================================================
Starting VIS_0112@VIS Internal Concurrent Manager -- shell process ID 14927
logfile=/d01/oracle/VIS/inst/apps/VIS_rh8/logs/appl/conc/log/VIS_0112.mgr
PRINTER=noprint mailto=applvis restart=N diag=N sleep=30 pmon=4 quesiz=1
Reviver is ENABLED
End of Failback

Administer Concurrent Managers

Figure 33

Target Nodes
Using the Service Instances page in Oracle Applications Manager (OAM) or the Administer Concurrent Managers form, you can view the Target Node for each Concurrent Manager in a parallel concurrent processing environment. The Target Node is the node on which the processes associated with a Concurrent Manager should run. It can be the node explicitly defined as the Concurrent Manager's Primary Node in the Concurrent Managers window, or the node assigned by the Internal Concurrent Manager if no Primary Node is defined.


Figure 34

If you have defined Primary and Secondary Nodes for a manager, then when its Primary Node and ORACLE instance are available, the Target Node is set to the Primary Node. Otherwise, the Target Node is set to the manager's Secondary Node (if that node and its ORACLE instance are available). During process migration, processes migrate from their current node to the Target Node.
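These assignments can also be checked directly in the data. A sketch follows; the column names NODE_NAME (Primary Node) and NODE_NAME2 (Secondary Node) and the apps/apps logon are assumptions for illustration.

# Compare each manager's defined nodes with the node it is currently targeted to run on
sqlplus -s apps/apps <<'EOF'
select concurrent_queue_name, node_name, node_name2, target_node
from   fnd_concurrent_queues
order  by concurrent_queue_name;
EOF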

Control Across Nodes
Using the Application Services category on the Site Map page in Oracle Applications Manager or the Administer Concurrent Managers form, it is possible to start, stop, abort, restart, and monitor Concurrent Managers and Internal Monitor Processes running on multiple nodes from any node in your parallel concurrent processing environment.


Figure 35

Figure 36 shows that it is not necessary to log onto a node to control concurrent processing on it. It is possible to terminate the Internal Concurrent Manager or any other Concurrent Manager from any node in your parallel concurrent processing environment using Oracle Applications Manager:


Figure 36

Starting the Concurrent Managers
The Internal Concurrent Manager starts first, followed by the Conflict Resolution Manager and then the other Generic Managers, Concurrent Managers, and Transaction Managers.

Figure 37

Start up parallel concurrent processing by running the adcmctl.sh script from the operating system prompt, as shown below:

adcmctl.sh start apps/apps

The Internal Concurrent Manager starts up on the node where the adcmctl.sh script is run. If it is assigned to a different node, the ICM will migrate to its Primary Node when that node becomes available.


After the Internal Concurrent Manager starts up, it starts all the Internal Monitor Processes and all the Concurrent Managers. It attempts to start Internal Monitor Processes and Concurrent Managers on their Primary Nodes, and resorts to a Secondary Node only if a Primary Node is unavailable. From the Concurrent Manager logs:

Starting VIS_0815@VIS_BALANCE Internal Concurrent Manager -- shell process ID 978956
logfile=/VIS/logs/apps/log/VIS_0815.mgr
PRINTER=noprint mailto=VIS restart=N diag=Y sleep=15 pmon=4 quesiz=1 (default)

Edit the ICM Runtime Parameters
Figure 38 shows that you can edit the ICM runtime parameters from Oracle Applications Manager:

Figure 38

In Figure 39, the defaults for the PMON settings are initially displayed:


Figure 39

Figure 40 shows that you can change the Sleep Interval to 15 seconds and keep the PMON cycles at 4. With these settings, a failure should be recognized about one minute (15 seconds × 4 cycles = 60 seconds) after TCP finds a "dead peer".


Figure 40

Once you've saved your changes, Figure 41 shows the screen that confirms them:

Figure 41


Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.

# FILENAME
#   batchmgr
# DESCRIPTION
#   fire up Internal Concurrent Manager process
# USAGE
#   batchmgr arg1=val1 arg2=val2 ...
#
#   Parameters may be sent via the environment.
#
# ARGUMENTS                                   DEFAULT
#   [appmgr|sysmgr]=username/password
#   [sleep=sleep_seconds]                     15
#   [mgrname=manager_name]                    icm
#   [logfile=log_filename]                    $FND_TOP/$APPLLOG/$mgrname.mgr
#   [restart=N|min minutes between restarts]  N
#   [mailto="user1 user2..."]                 current user
#   [PRINTER=printer_name]
#   [pmon=iterations]                         4
#   [quesiz=pmon_iterations]                  1
#   [diag=Y|N]                                N
#
# SYSMGR holds the Oracle user as whom the manager should run, and its password.
#
# SLEEP holds the number of seconds that the manager should wait between checks for new requests.
#
# MGRNAME is the name of the manager for locking and log purposes.
#
# LOGFILE is a filename in which the manager's own log is stored.
#
# RESTART is set to N if the manager should not restart itself after a crash. Otherwise, it is an
# integer number of minutes. The manager will attempt a restart after an abnormal termination if
# the past invocation lasted for at least RESTART minutes.
#
# MAILTO is a list of users who should receive mail whenever the manager terminates.
#
# PMON is the duration of time between process monitor checks (checks for failed workers). The
# unit of time is concurrent manager iterations (request table checks).
#
# QUESIZ is the duration of time between worker quantity checks (checks for number of active
# workers). The unit of time is process monitor checks.
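A quick sanity check, as a sketch, that the defaults in batchmgr.sh line up with the values set in OAM:

# List the lines in batchmgr.sh that reference the sleep, pmon, and quesiz settings
grep -n -E "sleep|pmon|quesiz" $FND_TOP/bin/batchmgr.sh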

Concurrent Processing is typically started from the command line by using one of two start scripts, startmgr.sh or adcmctl.sh.

startmgr.sh

• Schema logon is passed using the sysmgr parameter
• Apps logon may be passed using the appmgr parameter
• The Apps user must have the System Administrator responsibility
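A sketch of a direct startmgr.sh invocation using the parameters described here (the credentials and values are illustrative):

# Start the ICM with the schema logon (sysmgr) and an Applications user (appmgr)
# that holds the System Administrator responsibility
startmgr.sh sysmgr=apps/apps appmgr=SYSADMIN/welcome sleep=30 pmon=4 quesiz=1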

The startmgr.sh script accepts the schema logon passed as the sysmgr parameter. It will also accept an Applications user sign-on via the appmgr parameter. Note that the Applications user must have the System Administrator responsibility in order to successfully start Concurrent Processing.

adcmctl.sh


• Accepts a single username/password combination
• By default it is the schema logon
• Context file variable: Concurrent Processing Password Type
  o AppsSchema or AppsUser

The adcmctl.sh script is more commonly used. It accepts a single username/password combination. A context file variable determines whether this script expects a schema logon or an Applications logon; by default, the schema logon is expected. To use the Applications sign-on instead, edit the context file variable Concurrent Processing Password Type and set its value to AppsUser, then run AutoConfig to regenerate the adcmctl.sh script. The script will then expect an Applications username and password.
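A sketch of that switch (the adautocfg.sh location shown is the Release 12 convention, and the SYSADMIN/welcome logon is illustrative):

# After setting the Concurrent Processing Password Type context variable to AppsUser,
# regenerate adcmctl.sh with AutoConfig, then start CP with an Applications user logon
$ADMIN_SCRIPTS_HOME/adautocfg.sh
adcmctl.sh start SYSADMIN/welcome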

• Schema logon style:
  CONCSUB apps/appspass SYSADMIN 'System Administrator' SYSADMIN CONCURRENT FND FNDSCARU <parameters>
• New Apps user sign-on style:
  CONCSUB Apps:User SYSADMIN 'System Administrator' User/UserPass CONCURRENT FND FNDSCARU <parameters>

For this example we use the Concurrent Program FNDSCARU, the schema logon apps/appspass, and the Applications user logon User/UserPass. Previously, to submit a request to run FNDSCARU using CONCSUB, you would run the CONCSUB program from the command line with the schema logon, as shown above. Now you can choose to authenticate with an Applications username and password instead. To do so, in place of the schema logon, specify Apps:User as shown above; this indicates that an Applications user sign-on will be used. Then, for the Applications username parameter, append the corresponding password. If you pass the Apps:User parameter but do not supply a password for your specified Applications username, you will be prompted to enter the password.

Functional security is enforced for request submission. After the Applications username and password are authenticated, CONCSUB verifies that the user has the appropriate permission to submit the Concurrent Request. If the security check fails, an error message is printed to the screen.

Shutting Down Managers
You shut down parallel concurrent processing by issuing a "Stop" command on the OAM Service Instances page or a "Deactivate" command in the Administer Concurrent Managers form. All Concurrent Managers and Internal Monitor processes are shut down before the Internal Concurrent Manager shuts down. Run the adcmctl.sh script from $COMMON_TOP/admin/scripts/<Context Name> (Release 11i) or $INST_TOP/admin/scripts (Release 12):

adcmctl.sh stop apps/apps

After the failover test, sometimes the services would not fail back to RH9. Figure 42 shows the OAM Dashboard and indicates that RH9 and its applications services are unavailable. Remember, the test pulls the TCP cable from the host.


Figure 42

In order to restart the services on RH9, first stop all the services on RH9 with:

adstpall.sh apps/apps (sometimes a kill -9 -1 is necessary as the APPLMGR user)

Once the services are stopped, GSM is able to restart them, except for concurrent processing, which remains stopped.

Figure 43

In order to start the Concurrent Managers use:

adcmctl.sh start apps/apps

This starts concurrent processing on all nodes.

Figure 44

Concurrent Manager Log and Out Directories
The Concurrent Manager first looks for the environment variable $APPLCSF. If it is set, the manager builds paths from two other environment variables, $APPLLOG and $APPLOUT: log files are placed in $APPLCSF/$APPLLOG and output files in $APPLCSF/$APPLOUT. For example, with this environment set:

$APPLCSF = /d01/oracle/VIS/inst/apps/VIS_rh9
$APPLLOG = log
$APPLOUT = out

Then:

• Log files go to: /d01/oracle/VIS/inst/apps/VIS_rh9/log
• Out files go to: /d01/oracle/VIS/inst/apps/VIS_rh9/out
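A quick way to confirm where a node will write its files, using the environment variables above (a sketch; it assumes the EBS environment file has been sourced):

# Show the resolved log and output directories for this node
echo "Concurrent log files:    $APPLCSF/$APPLLOG"
echo "Concurrent output files: $APPLCSF/$APPLOUT"
# Both directories must exist and have the correct permissions
ls -ld $APPLCSF/$APPLLOG $APPLCSF/$APPLOUT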


If $APPLCSF is not set, the manager places the files under the product top of the application associated with the request. For example, a PO report would go under $PO_TOP/$APPLLOG and $PO_TOP/$APPLOUT. All of these directories must exist and have the correct permissions. All concurrent requests produce a log file, but not necessarily an output file. Concurrent Manager log files follow the same convention and will be found in the $APPLLOG directory.

Concurrent Processing Tables
Major tables that contain information about concurrent processing:

Table                           Description
FND_CONCURRENT_REQUESTS         Details of user requests, including status, start date and completion date
FND_CONCURRENT_PROGRAMS         Details of Concurrent Programs, including execution method, whether the program is constrained and whether there are incompatibilities
FND_CONCURRENT_PROCESSES        Cross reference between concurrent requests and queues; includes a history of Concurrent Manager requests
FND_CONCURRENT_QUEUES           Details about the Concurrent Manager queues
FND_NODES                       Node information, including availability status
FND_CONCURRENT_QUEUE_PARAMS     PMON and Reviver parameters
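A couple of illustrative queries against these tables (a sketch; the apps/apps logon and the predicates are assumptions):

sqlplus -s apps/apps <<'EOF'
-- Pending concurrent requests waiting for a manager
select request_id, concurrent_program_id, phase_code, status_code
from   fnd_concurrent_requests
where  phase_code = 'P';

-- Manager queues and the nodes they are defined on and targeted to
select concurrent_queue_name, node_name, target_node
from   fnd_concurrent_queues;
EOF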


Load Balancing

Types of Load Balancing
There are several types of load balancing:

• Concurrent Processing Load Balancing
• JSP-JDBC Load Balancing
• JVM Load Balancing
• Functionally Referenced Nodes

For this paper, we’ll only discuss Concurrent Processing Load Balancing.

Concurrent Processing Load Balancing
• Load balancing with both nodes running – no failover
• Load balancing during failover

Parallel Concurrent Processing has many benefits. Key among these is its ability to provide failover in case of node failure. When a node fails, the processes that were running on that node are restarted on Secondary Nodes (as defined by the System Administrator). This helps maintain throughput and keeps the business running during node failures.

However, a resource-intensive node (one with many processes) may inadvertently overtax the system when it fails over. A Secondary Node may not be able to handle its normal workload plus the additional burden of managers and processes from a failed node. If too many processes are running on the Secondary Node when the Primary Node fails over, the Secondary Node may not have the capacity to process the requests from the additional Concurrent Managers.

Release 12 introduces Failover Sensitive Workshifts. This enhancement allows the System Administrator to configure how many processes fail over for each workshift. With this added control, Applications System Administrators can enjoy the benefits of PCP failover while reducing the risk of performance issues caused by overloaded resources.

Figure 45

Processing capabilities during failover may be severely degraded on the remaining hosts unless failover processes are restricted. A host may be considered underutilized if its CPU utilization is less than 70%. A typical production environment may have two application tiers, each running Apache, Forms, and Concurrent Processing. Each node supports half the JSP and Forms users and half the Concurrent Requests, and has 70% average CPU utilization. Release 11i has no mechanism for decreasing the number of processes a manager can run during a failover. It is clearly not possible to process 140% of the workload (two nodes each at 70%) on the single remaining apps tier.


Figure 46

Example of Decreasing the Number of "Failover Processes" in Release 12
In order to compensate for further failovers, the hosts have received hardware upgrades that allow them to process 100% more workload. Now each host has an average CPU utilization of 35%, and the combined average workload during failover is 70%. This is approaching the limit where queuing theory indicates that minor increases in the number of running processes can cause major increases in wait times. It is clear that, to keep a Release 11i or Release 12 system running during a failover, there are two choices:

• Run the servers at 35% or less utilization
• Reduce the number of processes that are allowed during failover

For most businesses the second option is the most practical.


Figure 47

Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the failover processes; however, this doesn't work that way. The "failover processes" setting takes effect only when the node actually fails.

Figure 48

Figure 49


Application Affinity – How to Define Application Affinity
Define a Concurrent Manager to handle requests for a specific module. GL reports are commonly run under a GL manager, while Payroll requests typically run under a Payroll manager. By defining specialized managers, it is possible to direct concurrent requests to a specific concurrent processing node by defining the Primary/Secondary Node. Specialization rules allow requests to be excluded from managers and included in the appropriate manager at the application level. Related module requests should be directed to a specialized Concurrent Manager. This manager can have a Primary concurrent processing node that uses SQL*Net to direct the database traffic to a related node in a RAC cluster.

Quick note: it seems a little silly to go to all the trouble of creating a RAC cluster and then figure out ways to direct traffic to a specific node. Why not just get a bigger, monolithic SMP machine for the database server? For a more complete, serious discussion, please refer to Optimizing the E-Business Suite with Real Application Clusters (RAC) by Ahmed Alomari.


References

• 249213.1 - Performance Problems with Failover When TCP Network Goes Down
• 364171.1 - TAF Session Hangs, Select Fails To Complete W/ Loss Of NIC: Tune TCP Keepalive
• 211362.1 - Process Monitor Session Cycle Repeats Too Frequently
• 291201.1 - How To Remove a Dead Connection to the Target Database
• 362135.1 - Configuring Oracle Applications Release 11i with Oracle10g Release 2 Real Application Clusters and Automatic Storage Management
• Optimizing the E-Business Suite with Real Application Clusters (RAC) - Ahmed Alomari
• 240818.1 - Concurrent Processing: Transaction Manager Setup and Configuration Requirement in an 11i RAC Environment
• R12 ATG - Concurrent Processing Functional Overview - Aaron Weisberg
• 210062.1 - Generic Service Management (GSM) in Oracle Applications 11i
• 271090.1 - Parallel Concurrent Processing Failover/Failback Expectations
• 241370.1 - Concurrent Manager Setup and Configuration Requirements in an 11i RAC Environment
• 602899.1 - Some More Facts On How to Activate Parallel Concurrent Processing