
© Copyright International Business Machines Corporation 1987, 2009. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM ILOG ODM Enterprise V3.3

Building a High Availability ODM Enterprise environment V1.0

Table of Contents

1 Planning a HA ODM Enterprise environment
   A Overview of ODM Enterprise
      1) Databases
      2) Optimization Engines
      3) Job Processors
      4) Job Manager
   B Protecting the system
      1) Protecting the ability to run jobs
      2) Protecting the ability to manage jobs
   C Sample HA topology
   D IBM middleware protection strategies
   E System failure detection
   F Special considerations for ODM Clients

2 Configuring a HA ODM Enterprise environment
   A Introduction
   B Configuring DB2 HADR
   C Configuring WebSphere Application Server and HTTP Server
      1) Overview
      2) Procedure
   D Configuring the ODM Application

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
   A Workload Management capabilities of ODME 3.3.0.1
      1) Job Control and Administrative requests Workload Management
      2) Job solving Workload Management
   B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
      1) Operations continuity
      2) Operations recovery

4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
   A Job processor fails to extract OPL binaries upon restart
   B Solve cannot recover after WAS job-processor or odmsolver stops
   C Bad error reporting when Optimization Server loses connection to the Repository DB
   D ODME cannot start when WAS administrative security is enabled
   E ODM solver does not start


1 Planning a HA ODM Enterprise environment

This document describes the points to consider when planning to use the IBM® ILOG ODM Enterprise Optimization Server as part of a highly available solution.

A Overview of ODM Enterprise

IBM ILOG ODM Enterprise offers a platform for optimization-based planning and scheduling applications. One of its main components is the Optimization Server, used by planners to perform computations. At the Optimization Server's core is the Optimization Engine, which performs long-running, computationally expensive optimization "solves". Each solve requires access to Scenario Data, both as the input to the solve and as storage for solve results. A Job Processor is an application which runs on the same server as one or more Optimization Engines and initiates new solve jobs based on the contents of the Jobs database. The Jobs database is populated by the Job Manager, which receives requests from clients to schedule and prioritize solve jobs.

1) Databases

The ODME product uses relational databases: a Jobs database and one or more Scenario databases. The Jobs database stores information on pending jobs. The Scenario database holds the data used as input to optimization jobs and the results of previously completed solve operations. Failure of the Jobs database causes the failure of both the Job Manager and the Job Processor. Failure of a Scenario database causes the failure of any jobs using that database.

2) Optimization Engines

The Optimization Engines run as separate processes which wrap invocations of a native solve engine. Each engine retrieves scenario data from a database, executes the optimization solve, and finally writes the result back to the Scenario database. Database connectivity is provided by JDBC database drivers. A new instance of an Optimization Engine solver process is created for each new solve. Failure of an Optimization Engine solver process means that the optimization in progress cannot complete.


3) Job Processors

A Job Processor initiates new solve jobs based on the contents of the Jobs database. The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host). When the Job Processor has an Optimization Engine slot that is ready, it polls the database to check for new jobs. New jobs are solved by a newly launched Optimization Engine solve process. The Job Processor is also responsible for updating the Jobs database with solve progress and final status. Failure of the Job Processor means that new jobs cannot be picked for solving, and that the progress and results of completed jobs cannot be recorded. The Optimization Engine solver process maintains contact with the Job Processor while running jobs; if the Job Processor fails, the Optimization Engine will stop. The Job Processor runs as a Java EE application. Database connectivity is provided by a JDBC data source. The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS). JMS is used only for interacting with jobs, such as to cancel or accept the current solution, and not for regular solving. Multiple Job Processors can use the same Jobs database, and will mark jobs as in-progress as they are run.

4) Job Manager

The Job Manager receives requests from clients to schedule and solve jobs. Jobs are stored as records in a Jobs relational database. The Job Manager runs as a Java EE application. Database connectivity to the Jobs DB is provided by a JDBC XA data source. The Job Manager communicates with the Job Processor to interact with running jobs, using the Java Messaging Service (JMS). Clients interact with the Job Manager either through a SOAP/HTTP web service for solve job submission, or through a web-based console for solve queue administration. The Job Manager application holds no state information in memory (all state is in the database), so multiple instances of the application can be used. The Job Manager includes a timer which checks the status of running jobs. If a job has not been active (no reported heartbeat) for 120 seconds (the default value, settable through the jobActivityTimeout environment variable of the JobMonitor EJB), the Job Manager will mark the job as failed and available to be restarted. Failure of the Job Manager means that:

• No new jobs can be submitted
• Status of jobs cannot be queried
• Running solve jobs cannot be aborted
• Pending jobs can still be run, as long as other components (Job Processor, DBs) have not failed


B Protecting the system

This diagram shows the dependencies of ODME components on high-level middleware components.

Figure: Component dependency diagram

These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance. The same is true for the two logical database managers. This would, of course, mean that a failure could have a greater impact.

1) Protecting the ability to run jobs

Assuming job records are available in the Jobs database, the ability to run optimization solve jobs is based on the following components:

• Job Processor
• Optimization Engine(s)
• Jobs database
• Scenario database

From a middleware perspective, the following need to be protected:

• Application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above

2) Protecting the ability to manage jobs

The ability to manage jobs is provided by the following logical components:

• Job Manager application
• Jobs database

From a middleware perspective, the following need to be protected:

• HTTP server used by the Job Manager
• Application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above

C Sample HA topology

Here is a sample topology which can be used to protect the ODME solution.

This logical topology consists of:

Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be in operation at the same time; if one Optimization Server fails, the other will continue taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.

Database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up to date could be very difficult, so instead a passive backup should be kept. This backup needs to be up to date and ready to become active if the primary database server fails.

Load balancing server which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.

D IBM middleware protection strategies

There are many ways in which IBM middleware can be protected, including (but not limited to):

• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server, using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).

For example, the following describes a software-only topology enabling high availability using WebSphere Application Server Network Deployment and DB2 HADR.

Figure 1: ODME Software-only HA Topology example

This topology consists of:

Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.

Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.

Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration the WAS server is chosen on a round-robin basis.

E System failure detection

A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but coping with two or more may be difficult, so detecting and recovering from failures quickly is critical.

It is important to monitor at many levels. A failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or in a network connection.

There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.

F Special considerations for ODM Clients

ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.

Figure: ODM application development and deployment. (In development, an IT developer uses the ODM Enterprise IDE (Java Development Tools, OPL Studio, ODM Editors) to produce the ODM application configuration (odmapp). In deployment, ODM Studio (Planner and Reviewer Editions), custom clients and batch files use the odmapp to read/write the ODM Repository (SCENARIO DB) and to run solves on the Optimization Server.)

The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server JOBS data source.

When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its Data Source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This takes HA recovery into account both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.


2 Configuring a HA ODM Enterprise environment

A Introduction

This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability, using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:

• IBM ILOG ODM Enterprise V3.3.0.1

• IBM WebSphere Application Server Network Deployment V6.1.0.25

• IBM HTTP Server V6.1.0.25

• IBM DB2 Enterprise Server Edition V9.5 FixPack 3

This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles and Redbooks which describe the steps in more detail.

The next configuration steps describe how to configure the sample topology depicted in Figure 1 (ODME Software-only HA Topology example) above. This topology consists of:

Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.

Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.

Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.

Not represented in the previous topology: ODM client applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.

B Configuring DB2 HADR

DB2 has a feature called High Availability Disaster Recovery (HADR), which provides a high performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system will also be replicated on the standby system. At any point an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.
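For example, once HADR is running, the role swap can be triggered from the DB2 command line processor on the standby instance (a sketch; the database name JOBS is an assumption):

  # On the standby instance: swap roles with the current primary (graceful takeover).
  db2 takeover hadr on database JOBS

  # If the primary is down and unreachable, force the takeover instead; transactions
  # not yet replicated to the standby may be lost.
  db2 takeover hadr on database JOBS by force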

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:


• Identical operating system version and patch level.

• The primary and standby server machines should be connected by a high-speed TCP/IP connection, and be reachable over TCP/IP from the client application.

• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
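A quick way to verify the identical-version requirement is to compare the output of the db2level command on the two machines (a minimal sketch; run it as the instance owner):

  # Run on both the primary and the standby; the reported version, fix pack
  # and bit size (32/64) must match exactly.
  db2level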

DB2 HADR Setup

1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.

Tip: Before testing the DB2 HADR takeover behaviour, you need to verify that the connection between the DB2 HADR primary and standby machines works well.

2. Start the DB2 servers on both machines, if they are not already running.

3. Create your database and the required tables on the primary machine only. The databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)

• Optimization Server DB - used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.

• Scenario database - used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.

• SIBus database - used by the WAS Service Integration Bus.

Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
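A minimal command-line sketch of this step, assuming database names JOBS and SCENARIO, a user odmadmin, and archive/backup paths (all names and paths are assumptions):

  # On the primary machine only: create the databases.
  db2 create database JOBS
  db2 create database SCENARIO

  # Create the Optimization Server tables with the script shipped with ODME.
  db2 connect to JOBS user odmadmin
  db2 -tf server/database/db2-create-tables.sql
  db2 connect reset

  # Switch from the default circular logging to archive logging (required for HADR).
  db2 update db cfg for JOBS using LOGARCHMETH1 DISK:/db2/archive/
  db2 update db cfg for SCENARIO using LOGARCHMETH1 DISK:/db2/archive/

  # Changing the logging method leaves a database in backup pending state;
  # take a full offline backup before it can be used again.
  db2 backup database JOBS to /db2/backup
  db2 backup database SCENARIO to /db2/backup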

4. Configure HADR for each database from the primary machine, using the Setup HADR wizard as presented in:

http://www.redbooks.ibm.com/redbooks/SG247363/wwhelp/wwhimpl/js/html/wwhelp.htm

Tip: The easiest way to create the databases on the secondary machine is to do it during the HADR setup process, by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
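For reference, the wizard's backup method corresponds roughly to the following commands (a sketch only; host names, service ports, instance name and paths are assumptions):

  # On the primary: back up the database and copy the image to the standby host.
  db2 backup database JOBS to /db2/backup

  # On the standby: restore the image; the database stays in rollforward pending,
  # which is what HADR expects of a standby.
  db2 restore database JOBS from /db2/backup

  # On both machines: point each copy at its peer (swap local/remote on the standby).
  db2 update db cfg for JOBS using HADR_LOCAL_HOST dbprim.example.com \
      HADR_LOCAL_SVC 55001 HADR_REMOTE_HOST dbstby.example.com \
      HADR_REMOTE_SVC 55002 HADR_REMOTE_INST db2inst1 HADR_SYNCMODE NEARSYNC

  # Start HADR on the standby first, then on the primary.
  db2 start hadr on database JOBS as standby
  db2 start hadr on database JOBS as primary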

Useful Links

• DB2 V9.5 InfoCenter:
  http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp

• The IBM Redbook "High Availability and Disaster Recovery Options for DB2 on Linux, UNIX, and Windows" provides a useful guide:
  http://www.redbooks.ibm.com/abstracts/SG247363.html

• DB2 InfoCenter, "Automatic client reroute description and setup":
  http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html

C Configuring WebSphere Application Server and HTTP Server

1) Overview

WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered environment in WAS, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine; in the event of a failure on that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.

2) Procedure

The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.

a) Install WAS 6.1 Network Deployment as detailed in:

http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/rins_ndroadmap.html

Tip: Install the deployment manager node first and start it. For the other nodes used in the optimisation cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.

b) Start and connect to the deployment manager console, create a new cluster in Servers => Clusters, and define cluster members for the nodes created earlier.

c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify the database class path information for the cluster nodes.

d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the optimisation server cluster: the Jobs and SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; the alternate (standby) database definitions will be specified through additional DB connection properties.

11

Tip: Before testing the DB2 HADR takeover behaviour, you need to verify that the connections between the WebSphere Application Server host and the DB2 HADR primary and standby hosts work well.
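If a DB2 client is installed on the WAS host, one way to test this is to catalog and connect to each database server in turn (a sketch; node, host, port and user names are assumptions):

  # Catalog the primary DB2 server and the JOBS database on the WAS host.
  db2 catalog tcpip node DBPRIM remote dbprim.example.com server 50000
  db2 catalog database JOBS at node DBPRIM

  # Test the connection; repeat against the standby host after a takeover.
  db2 connect to JOBS user odmadmin
  db2 connect reset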

The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.

e) Set the custom properties of these JDBC data sources:

• currentSchema - the schema used when creating the DB2 database. This schema is by default the user ID that you used to create the Jobs DB tables.

• clientRerouteAlternateServerName - standby server name for client reroute. This is the HADR standby host name.

• clientRerouteAlternatePortNumber - standby server port number for client reroute.

• maxRetriesForClientReroute - limits the number of retries if the primary connection to the server fails. A good default is 2.

• retryIntervalForClientReroute - amount of time (in seconds) to sleep before retrying again. A good default is 15.

• fullyMaterializeInputStreams - set to true.

• progressiveStreaming - disable by setting a value of 2. This prevents odmapp unpacking issues.

f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.

g) Set the bus members for OptimizationServerBus at the cluster scope, and use a data store backed by the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.

h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):

• OptimizationServerTopic, named jms/optimserverTopic in JNDI

• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI

• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI

• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to the topic jms/optimserverTopic

i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.

j) Install IBM HTTP Server 6.1.

12

Tip:

• Note that the HTTP <port> defined during the install is the one that will be used in the Optimisation Server connection URL http://server:<port>/optimserver to deploy your developed ODM Application.

• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because this makes configuration easier.

k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.

l) Start the cluster nodes.

m) Start the cluster in Servers => Clusters.

n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
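A quick way to exercise the whole chain from a shell is to request the console through the HTTP server several times; with round-robin routing the requests are spread across both cluster members and should all return HTTP 200 (a sketch; the host name is an assumption):

  # Request the Optimization Server console through IHS and the WAS plug-in.
  for i in 1 2 3 4; do
    curl -s -o /dev/null -w "%{http_code}\n" http://httpserver.example.com/optimserver/console
  done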

Useful links

• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"

• "Roadmap: Installing the Network Deployment product":
  http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/rins_ndroadmap.html

• WAS InfoCenter, "Installing Web server plug-ins":
  http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/tins_webplugins.html

• "WebSphere Application Server Network Deployment V6: High Availability Solutions":
  http://www.redbooks.ibm.com/abstracts/sg246688.html?Open

• "Service integration high availability and workload sharing configurations":
  http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.pmc.nd.doc/concepts/cjt0007_.html

• SIBus configuration for high availability:
  http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.pmc.nd.doc/concepts/cjt0010_.html

D Configuring the ODM Application

When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.


Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. This connection rerouting occurs automatically.

To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application. Example:

<deploymentSettings>
  <repository multiUser="true">
    <connection>
      <JDBCDriverClass name="com.ibm.db2.jcc.DB2Driver"/>
      <JDBCURL>jdbc:db2://SERVER1:PORT1/ODM</JDBCURL>
      <authentication>
      </authentication>
    </connection>
    <managerClass name="ilog.odm.datasvc.persist.db2.IloDB2RepositoryManagerFactory"/>
    <appSchema name="SCHEMA"/>
    <properties>
      <property name="clientRerouteAlternateServerName" value="SERVER2"/>
      <property name="clientRerouteAlternatePortNumber" value="PORT2"/>
      <property name="maxRetriesForClientReroute" value="2"/>
      <property name="retryIntervalForClientReroute" value="15"/>
    </properties>
  </repository>
</deploymentSettings>

These additional properties are:

• clientRerouteAlternateServerName: alternate server name for client reroute
• clientRerouteAlternatePortNumber: alternate port number for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again

Notes:

• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at:
  http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html

Useful Links

• DB2 InfoCenter, "Automatic client reroute description and setup":
  http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html


3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Application Server Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.

HA is the ability for the system to continue operating when some of its hardware, network or software components encounter a failure.

A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, there are two different types of workload being processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management

Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and will be workload managed by the regular IHS+WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all job control activity, whether it originates from ODM Studio or the SolveAPI.

The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.

2) Job solving Workload Management

The Optimization Engine solver processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.

Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.

A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in active/standby mode, so only one of the two DB2 nodes will be handling DB requests.

Figure 2: Typical ODME 3.3.0.1 load balancing timeline. (The diagram shows clients submitting jobs A, B and C over SOAP/HTTP through the IHS WLM plug-in; the two WAS cluster members create and solve the jobs, recording status and progress in the JOBS DB, while the second DB2 node remains a hot standby.)

A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and consuming the load until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows load for short jobs of even solve durations; the X axis units are events, not linear time.

Figure 3: Typical balance of load on ODME 3.3.0.1. (The chart plots the queue depth falling from 500 to 0 while each of the two servers runs 2 to 3 concurrent jobs.)


The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment allows the overall system to be protected from failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures, and on the other hand, to perform some level of recovery of the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

Figure 4: Typical ODME 3.3.0.1 operations continuity timeline across a WAS node failure. (The diagram shows WAS 1 being stopped mid-solve; WAS 2 continues to receive SOAP/HTTP requests through the IHS WLM plug-in and keeps solving jobs, with job state stored in the JOBS DB and the standby DB2 node idle.)


Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been set up with appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs will likewise run against the alternate instance.

2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.


4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms

The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.

Queued jobs are not processed (they remain in the NOT_STARTED state).

Only one of the cluster members runs jobs, although the queue is saturated.

SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.

Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0 before starting the WAS server where the Optimization Server is deployed.

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started: chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0; this will force AIX not to cache the files.
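Put together, the remediation can be scripted around the server restart (a sketch; the installation path, profile path and server name are assumptions):

  # Remove the cached OPL binaries so the processor EAR can extract them again.
  rm -rf /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0/*

  # Restart the WAS server hosting the Optimization Server.
  /usr/IBM/WebSphere/AppServer/profiles/node1/bin/startServer.sh server1

  # Right after the restart, and before any solver instance starts, tighten the
  # permissions so that AIX does not cache (and lock) the files again.
  chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0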


B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail, because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses connection to the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException params=] when the connection to the JOBS DB is lost.

Explanation

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation

This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy the Optimization Server with security enabled.

This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.

Remediation

WAS administrative security may be turned on, but write access to JNDI must then be granted to the everyone group. This is achieved using the WAS Admin console, in the Environment => Naming => CORBA Naming Service Groups section: the group EVERYONE has to be added with the Cos Naming Read, Write, Create and Delete authorizations.

E ODM solver does not start

Symptoms

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
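Where many hosts are involved, the redistributable installer can typically be run unattended (a sketch; the /q silent switch is the usual quiet-mode option for vcredist_x86.exe, but verify it against your redistributable version):

  rem Install the VC++ runtime silently on each Optimization Server host (Windows).
  vcredist_x86.exe /q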

21

Page 2: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

Table of Contents 1 Planning a HA ODM Enterprise environment 3

A Overview of ODM Enterprise 3 1) Databases 3 2) Optimization Engines 3 3) Job Processors 4 4) Job Manager 4

B Protecting the system 5 1) Protecting the ability to run jobs 5 2) Protecting the ability to manage jobs 5

C Sample HA topology 6 D IBM middleware protection strategies 6 E System failure detection 8 F Special considerations for ODM Clients 8

2 Configuring a HA ODM Enterprise environment 9 A Introduction 9 B Configuring DB2 HADR 9 C Configuring WebSphere Application Server and HTTP Server 11

1) Overview 11 2) Procedure 11

D Configuring ODM Application 13 3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster 15

A Workload Management capabilities of ODME 3301 15 1) Job Control and Administrative requests Workload Management 15 2) Job solving Workload Management 15

B High Availablity capabilities of ODME 3301 on WAS‐NDampDB2 HA 17 1) Operations continuity 17 2) Operations recovery 18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment 19 A Job processor fails to extract OPL binaries upon restart 19 B Solve cannot recover after WAS job‐processor or odmsolver stops 20 C Bad error reporting when Optimization Server loses connection from the Repository DB 20 D ODME cannot start when WAS administrative security is enabled 20 E ODM solver does not start 21

2

1 Planning a HA ODM Enterprise environment This document describes the points to consider when planning to use the IBMreg ILOG ODM Enterprise Optimization Server as part of a highly available solution

A Overview of ODM Enterprise IBM ILOG ODM Enterprise offers a platform for optimization‐based planning and scheduling applications One of its main components is the Optimization Server used by planners to perform computations At the Optimization Servers core is the Optimization Engine which performs long running computationally expensive optimization ldquosolvesrdquo Each solve requires access to Scenario Data as the input to the solve and as storage for solve results A Job Processor is an application which runs on the same server as one or more Optimization Engines and initiates new solve jobs based on the contents in the Jobs database The Jobs Database is populated by the Job Manager The Job Manager receives requests from clients to schedule and prioritize solve jobs

1) Databases The ODME product uses relational databases a Jobs database and one or more Scenario databases The Jobs database stores information on pending jobs The Scenario database holds data used as input to optimization jobs and the results of the previously completed solve operations Failure of the Jobs database causes the failure of both the Job Manager and the Job Processor Failure of the Scenario database causes the failure of any jobs using that database

2) Optimization Engines The Optimization Engines run as separate processes which wrap invocations of a native solve engine It retrieves scenario data from a database executes the optimization solve and finally writes back the result to the Scenario database The database connectivity is provided by JDBC database drivers A new instance of an Optimization Engine solver process is created for each new solve Failure of an Optimization Engine solver process means that the optimization in progress cannot complete

3

3) Job Processors A Job Processor initiates new solve jobs based on the contents of the Jobs database The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host) When the Job Processor has an Optimization Engine slot that is ready it will poll the database to check for new jobs New jobs will get solved by a newly launched Optimization Engine solve process The Job Processor is also responsible for updating the Jobs database with solve progress and final status Failure of the Job Processor means that new jobs cannot be picked for solving and that recording of progress or results of complete jobs The Optimization Engine solver process maintains contact with the Job Processor while running jobs ndash if the Job Processor fails then the Optimization Engine will stop The Job Processor runs as a Java EE application Database connectivity is provided by a JDBC data source The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS) JMS is used only for interacting with jobs such as to cancel or accept the current solution and not for regular solving Multiple Job Processors can use the same Jobs database and will mark jobs as in‐progress as they are run

4) Job Manager The Job Manager receives requests from clients to schedule and solve jobs Jobs are stored as records in a Jobs relational database The Job Manager runs as a Java EE application Database connectivity to the Jobs DB is provided by a JDBC XA data source The Job Manager communicates with the Job Processor to interact with running jobs using the Java Messaging Service (JMS) Clients interact with the Job Manager using either a SOAPHTTP web service for solve jobs submission or by using a web‐based console for solve queue administration The Job Manager application holds no state information in memory (all state is in the database) so multiple instances of the application can be used The Job Manager includes a timer which checks the status of running jobs If a job has not been active (no reported heartbeat) for 120 seconds (default value settable in JobMonitors ejbjobActivityTimeout environment variable) the Job Manager will mark the job as failed and available to be restarted Failure of the Job Manager means that

bull No new jobs can be submitted bull Status of jobs cannot be queried bull Running solve jobs cannot be aborted bull Pending jobs can still be run as long as other components (Job Processor DBs) have not

failed

4

B Protecting the system This diagram shows the dependencies of ODME components on high‐level middleware components

Figure Component dependency diagram

These dependencies show the key consequences of the failure of any component For example if the application server for the Job Processor fails then so does the Job Processor application and any associated Optimization Engines If database manager for the Jobs DB fails then the Jobs DB Job Manager and Job Processor and Optimization Engine all fail While two logical application servers are depicted the solution may be deployed on a single physical application server instance The same is true for the two logical database managers This would of course mean that a failure could have a greater impact

1) Protecting the ability to run jobs Assuming job records are available in the Jobs database the ability to run optimization solve jobs is based on the following components

bull Job Processor bull Optimization Engine(s) bull Jobs database bull Scenario database

From a middleware perspective the following needs to be protected bull Application servers used by Job Processor bull The database servers used for the Jobs database bull The database servers used for the Scenario database bull The physical servers and network used by each of the above

2) Protecting the ability to manage jobs The ability to manage jobs is provided by the following logical components

bull Job Manager application bull Jobs database

5

From a middleware perspective the following needs to be protected bull HTTP server used by Job Manager bull Application server used by Job Manager bull The database server used for the Jobs database bull The physical servers and network used by each of the above

C Sample HA topology Here is a sample topology which can be used to protect the ODME solution

This logical topology consists of

Optimization Servers each running a Java EE application server with the Job Processor and Job Manager applications installed Each Optimization Server will host one or more Optimization Engines Both Optimization Servers can be operation at the same time ‐ if one Optimization Server fails then the other will continue taking jobs from the Jobs Database Any number of Optimization Servers could theoretically be used

Database server hosting both the Jobs and Scenario databases Keeping multiple copies of the same database active and up‐to‐date could be very difficult so instead a passive backup should be kept This backup needs to be up‐to‐date and ready to become active if the primary database server fails

Load balancing server which can route HTTP traffic to either of the two Job Managers The load balancer also needs to be backed‐up This backup can also be passive and ready to become active if the primary load balancer fails

D IBM middleware protection strategies There are many ways in which IBM middleware can be protected including (but not limited to)

bull Software bull WebSphere Application Server Network Deployment allows application servers to be

clustered and manages high availability across Java EE applications bull DB2reg can be used to keep an up‐to‐date replica of a database on a separate server using

its High Availability Disaster Recovery (HADR) feature bull Tivoli System Automation provides advanced clustering support for managing highly

6

available systems bull Hardware

bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i

bull Disk technology such as a Redundant Array of Independent Disks (RAID)

For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR

Figure 1 ODME Software‐only HA Topology example

This topology consists of

Optimization Servers running WebSphere Application Server (WAS) Each Optimization Server runs a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications is installed into the cluster Each Optimization Server hosts one or more ODM Optimization Engines A WAS Service Integration bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors ‐ the SIBus uses a single messaging engine with a HADR database store which WAS will automatically move to the other WAS server in the event of a failure In case of primary database failure DB2 HADR will switch to the alternate standby database server

Database servers running DB2 Enterprise Server Edition Both servers run the same software with one acting as the primary database server The primary database server replicates all database updates on the standby server using DB2s High Availability Disaster Recovery (HADR) feature In addition Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests

7

to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis

E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical

It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection

There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring

F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings

ODM application

configuration (odmapp

ODM Enterprise IDE

ODMRepositorySCENARIO

DB

Development Deployment

IT developer

Java Development

Tools

OPL Studio

ODM Editors

ODM Studio -Planner and

Reviewer Editions

Optimization Server

Custom Clients and Batch Files

odmapp

odmapp

odmapp

odmapp

solve

solve

Readwrite

Rea

dwr

ite

Readwrite

The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs

When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server

8

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster will perform the installation on each cluster member.

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered environment in WAS, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine; in the event of a failure in that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.

2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on the differences specific to a clustered deployment.

a) Install WAS 6.1 Network Deployment, as detailed in:

http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/rins_ndroadmap.html

Tip: Install the deployment manager node first, and start it. For the other nodes used in the optimisation cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.

b) Start and connect to the deployment manager console, create a new cluster in Servers => Clusters, and define cluster members for the nodes created earlier.

c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify database class path information for the cluster nodes.

d) With this provider, create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster: the Jobs and the SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.

Tip: Before testing the DB2 HADR takeover behaviour, you need to verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work well.

The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.

e) Set the custom properties of these JDBC data sources (a JDBC sketch using the same properties follows this list):

• currentSchema: the schema used when creating the DB2 database. This schema is by default the userID that you used to create the Jobs DB tables.

• clientRerouteAlternateServerName: standby server name for client reroute. This is the HADR standby host name.

• clientRerouteAlternatePortNumber: standby server port number for client reroute.

• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails. A good default is 2.

• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again. A good default is 15.

• fullyMaterializeInputStreams: set to true.

• progressiveStreaming: disable by setting a value of 2. This will prevent odmapp unpacking issues.
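For reference, these custom properties correspond to standard DB2 JCC (JDBC) driver connection properties, so the reroute behaviour can be tried outside WAS with a plain JDBC connection. The sketch below is illustrative only; the server names, port values, database name, schema, and credentials are placeholders consistent with the examples in this document, not ODME code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class RerouteConnectionDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.db2.jcc.DB2Driver");
            Properties props = new Properties();
            props.setProperty("user", "db2admin");                   // placeholder credentials
            props.setProperty("password", "password");
            props.setProperty("currentSchema", "DB2ADMIN");          // userID used to create the tables
            props.setProperty("clientRerouteAlternateServerName", "SERVER2"); // HADR standby host
            props.setProperty("clientRerouteAlternatePortNumber", "PORT2");   // must be numeric in practice
            props.setProperty("maxRetriesForClientReroute", "2");
            props.setProperty("retryIntervalForClientReroute", "15");
            props.setProperty("fullyMaterializeInputStreams", "true");
            props.setProperty("progressiveStreaming", "2");          // 2 disables progressive streaming
            // SERVER1/PORT1/JOBSDB are placeholders for the primary host, port, and Jobs database.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:db2://SERVER1:PORT1/JOBSDB", props)) {
                System.out.println("Connected, reroute configured: " + con.isValid(5));
            }
        }
    }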

f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.

g) Set Bus members for OptimizationServerBus at the cluster scope, and use the Data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.

h) Create the JMS resources in Resources => JMS for the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section); a JNDI lookup sketch using these resources follows this list:

• OptimizationServerTopic, named jms/optimserver/Topic in JNDI

• OptimizationServerTopicConnectionFactory, named jms/optimserver/TopicConnectionFactory in JNDI

• OptimizationServerQueueConnectionFactory, named jms/optimserver/QueueConnectionFactory in JNDI

• OptimizationServerTopicSpec, named jms/optimserver/TopicSpec and pointing to topic jms/optimserver/Topic
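Once created, these resources can be resolved through JNDI by any application in the cell. The sketch below shows only the generic JMS wiring under the JNDI names listed above (as reconstructed here); the messages exchanged on this topic are internal to the ODME management and processor modules, so this probe is purely illustrative and must run in an environment with the WAS JNDI context configured.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.Topic;
    import javax.naming.InitialContext;

    public class OptimServerTopicProbe {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            // JNDI names as configured in step h) above.
            ConnectionFactory cf =
                    (ConnectionFactory) ctx.lookup("jms/optimserver/TopicConnectionFactory");
            Topic topic = (Topic) ctx.lookup("jms/optimserver/Topic");
            Connection con = cf.createConnection();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(topic);
            con.start();
            // The payloads are ODME-internal; we only demonstrate the lookup and subscription.
            Message m = consumer.receive(10000); // wait up to 10 seconds for any message
            System.out.println(m == null ? "no message received" : "received: " + m);
            con.close();
        }
    }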

i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.

j) Install IBM HTTP Server 6.1.

Tip:

• Note that the HTTP <port> defined during install is the one that will be used in the Optimisation Server connection URL, http://server:<port>/optimserver, to deploy your developed ODM Application.

• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because this makes configuration easier.

k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.

l) Start the cluster nodes.

m) Start the cluster in Servers => Clusters.

n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
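This check can also be scripted. The following minimal sketch simply verifies that the console URL answers with HTTP status 200; the host name and port are placeholders for your IHS host and the HTTP <port> chosen above.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConsoleSmokeTest {
        public static void main(String[] args) throws Exception {
            // Replace "server" and 80 with your IHS host and HTTP port.
            URL url = new URL("http://server:80/optimserver/console");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            System.out.println("HTTP status: " + con.getResponseCode()); // expect 200
            con.disconnect();
        }
    }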

Useful links:

• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"

• "Roadmap: Installing the Network Deployment product":
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/rins_ndroadmap.html

• WAS InfoCenter, "Installing Web server plug-ins":
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/tins_webplugins.html

• "WebSphere Application Server Network Deployment V6: High Availability Solutions":
http://www.redbooks.ibm.com/abstracts/sg246688.html?Open

• "Service integration high availability and workload sharing configurations":
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.pmc.nd.doc/concepts/cjt0007_.html

• SIBus configuration for high availability:
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.pmc.nd.doc/concepts/cjt0010_.html

D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server, by rerouting the connection to an alternate server. This connection rerouting occurs automatically, with no application intervention.

To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (.odmsds) of your ODM Application. Example:

    <deploymentSettings>
      <repository multiUser="true">
        <connection>
          <JDBCDriverClass name="com.ibm.db2.jcc.DB2Driver"/>
          <JDBCURL>jdbc:db2://SERVER1:PORT1/ODM</JDBCURL>
          <authentication></authentication>
        </connection>
        <managerClass name="ilog.odm.datasvc.persist.db2.IloDB2RepositoryManagerFactory"/>
        <appSchema name="SCHEMA"/>
        <properties>
          <property name="clientRerouteAlternateServerName" value="SERVER2"/>
          <property name="clientRerouteAlternatePortNumber" value="PORT2"/>
          <property name="maxRetriesForClientReroute" value="2"/>
          <property name="retryIntervalForClientReroute" value="15"/>
        </properties>
      </repository>
    </deploymentSettings>

These additional properties are:

• clientRerouteAlternateServerName: alternate server names for client reroute.
• clientRerouteAlternatePortNumber: alternate port numbers for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails.
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again.

Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html

Useful Links:

• DB2 InfoCenter, "Automatic client reroute description and setup":
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Application Server Network Deployment and DB2 HADR.

This currently pertains to the HA configuration as depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Work Load Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.

HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.

A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, two different types of workload are processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management
Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and will be workload-managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all Job Control activity, whether it originates from ODM Studio or from the SolveAPI.

The Optimization Server Admin console is a stateless web application, and will also be load-balanced in a round-robin fashion by WAS.

2) Job solving Workload Management
The solver Optimization Engine processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.

Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion, whenever there are solve slots available. The resulting overall load balancing is a first-come, first-served scheme, where solves will be processed across the nodes depending on their capacity.
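As a mental model of this scheme, the following sketch shows the shape of a slot-limited FIFO polling loop of the kind each Job Processor runs. It is a sketch only: the JobsDb class and its methods are hypothetical stand-ins for the real Jobs DB access layer, which is not documented here.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class JobProcessorLoop {
        // Hypothetical stand-in for the real Jobs DB access layer (not ODME code).
        static class JobsDb {
            static synchronized String pollOldestPendingJob() { return null; } // FIFO pull
            static void markInProgress(String jobId) { }          // visible to other processors
            static void launchEngineAndSolve(String jobId) { }    // would fork an engine process
        }

        private final Semaphore slots = new Semaphore(3);          // one permit per solve slot
        private final ExecutorService pool = Executors.newCachedThreadPool();

        public void run() throws InterruptedException {
            while (true) {
                slots.acquire();                                   // wait until a solve slot is free
                final String jobId = JobsDb.pollOldestPendingJob();
                if (jobId == null) {                               // queue empty: release and retry later
                    slots.release();
                    Thread.sleep(1000);
                    continue;
                }
                pool.submit(() -> {
                    try {
                        JobsDb.markInProgress(jobId);
                        JobsDb.launchEngineAndSolve(jobId);
                    } finally {
                        slots.release();                           // free the slot for the next job
                    }
                });
            }
        }
    }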

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes, until the queue is drained.

A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in Active/Standby, so only one of the two DB2 nodes will be handling DB requests.

[Figure 2 content: a sequence diagram in which the IHS WLM plug-in distributes SOAP/HTTP requests (submit job A, submit job B, submit job C, job A query status, job A read status) across the mgmt and proc components of WAS 1 and WAS 2; jobs A, B, and C are created and stored in the JOBS DB (with the DB2 hot standby idle), solved on either node, and reported as running/complete with progress updates.]

Figure 2: Typical ODME 3.3.0.1 load balancing timeline

A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and consuming the load until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots each. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.

[Figure 3 content: a chart with the left Y axis (running jobs) graduated 0 to 7, the right Y axis (queue depth) graduated 0 to 600, and time (in events) on the X axis; series: server1, server2, and Queue Depth.]

Figure 3: Typical balance of load on ODME 3.3.0.1

The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
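The queue-draining pattern described here can be reproduced with a toy event-driven simulation. Everything in the sketch below is invented for illustration (500 equal-length jobs, two processors with 3 solve slots each, one completion per busy server per event); it is not ODME code, but it exhibits the same "2 or 3 jobs per processor until the queue empties" behaviour.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class QueueDrainSimulation {
        public static void main(String[] args) {
            Deque<Integer> queue = new ArrayDeque<>();
            for (int i = 0; i < 500; i++) queue.add(i);    // initial queue depth of 500 jobs
            int[] running = new int[2];                     // jobs in flight per server
            final int slots = 3;                            // solve slots per job processor
            int event = 0;
            while (!queue.isEmpty() || running[0] + running[1] > 0) {
                // Each processor pulls from the shared queue while it has free slots.
                for (int s = 0; s < 2; s++) {
                    while (running[s] < slots && !queue.isEmpty()) {
                        queue.poll();
                        running[s]++;
                    }
                }
                // Jobs have even durations, so one job per busy server completes per event.
                for (int s = 0; s < 2; s++) if (running[s] > 0) running[s]--;
                event++;
                System.out.printf("event=%d queueDepth=%d server1=%d server2=%d%n",
                        event, queue.size(), running[0], running[1]);
            }
        }
    }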

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment allows protection of the overall system from the failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures: Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails; the surviving cluster member will continue processing.

[Figure 4 content: the same sequence diagram as Figure 2, with WAS 1 stopped after job A is created; WAS 2 continues to receive the SOAP/HTTP requests routed by the IHS WLM plug-in, solves jobs A and B, and updates the JOBS DB (DB2 hot standby unchanged).]

Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster.

Operations continuity across DB2 failures: when DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been set up with appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for Jobs control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will likewise be solved against the alternate Scenario DB instance.

2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory, and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted, and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases when jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which will detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, mark the jobs as recoverable, and requeue them so that they are solved again.
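To make this detection concrete, here is a minimal sketch of the kind of stale-heartbeat sweep such a monitor could run, using the 120-second default job activity timeout described earlier. It is illustrative only: the JOBS table layout, column names, status values, and the data source binding name are assumptions, not the actual ODME schema or code.

    import java.sql.Connection;
    import java.sql.Statement;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    public class StaleJobSweep {
        // Mirrors the documented 120-second default job activity timeout.
        private static final int TIMEOUT_SECONDS = 120;

        // Flags running jobs whose last heartbeat is too old as recoverable.
        public static int markStaleJobs() throws Exception {
            DataSource ds = (DataSource)
                    new InitialContext().lookup("OptimizationServerDB"); // binding name assumed
            // Table and column names below are hypothetical.
            String sql = "UPDATE JOBS SET STATUS = 'FAILED_RECOVERABLE'"
                    + " WHERE STATUS = 'RUNNING'"
                    + " AND LAST_HEARTBEAT < CURRENT TIMESTAMP - " + TIMEOUT_SECONDS + " SECONDS";
            try (Connection con = ds.getConnection();
                 Statement st = con.createStatement()) {
                return st.executeUpdate(sql); // number of jobs requeued for solving
            }
        }
    }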

4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide some troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms:

• The optimserver-processor-ear Enterprise Application is not started on the server, although the optimserver-mgmt-ear is running.

• Queued jobs are not processed (they remain in the NOT_STARTED state).

• Only one of the cluster members runs jobs, although the queue is saturated.

• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix5.3_7.0/libcplex121.a (Text file busy)

Explanation:

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.

Remediation:

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix5.3_7.0 before starting the WAS server where the Optimization Server is deployed.

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), right after restarting and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755):

chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix5.3_7.0

This will force AIX not to cache the files.

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms:

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation:

In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail, because the scenario has released its solve lock.

Remediation:

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms:

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.

Explanation:

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation:

This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms:

Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy the Optimization Server with security enabled.

This results in an exception being raised during startup of the Optimization Server, reported in the SystemOut.log.

Explanation:

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.

Remediation:

WAS administrative security may be turned on, but write access to JNDI should then be granted to the "everyone" group. This is achieved using the WAS Admin console, in the Environment -> Naming -> CORBA Naming Service Groups section. The group EVERYONE has to be added with Cos Naming Read, Write, Create, Delete authorization.

E ODM solver does not start

Symptoms:

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation:

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation:

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory, on all machines where the Optimization Server will execute ODM solve jobs.


2) Protecting the ability to manage jobs The ability to manage jobs is provided by the following logical components

bull Job Manager application bull Jobs database

5

From a middleware perspective the following needs to be protected bull HTTP server used by Job Manager bull Application server used by Job Manager bull The database server used for the Jobs database bull The physical servers and network used by each of the above

C Sample HA topology Here is a sample topology which can be used to protect the ODME solution

This logical topology consists of

Optimization Servers each running a Java EE application server with the Job Processor and Job Manager applications installed Each Optimization Server will host one or more Optimization Engines Both Optimization Servers can be operation at the same time ‐ if one Optimization Server fails then the other will continue taking jobs from the Jobs Database Any number of Optimization Servers could theoretically be used

Database server hosting both the Jobs and Scenario databases Keeping multiple copies of the same database active and up‐to‐date could be very difficult so instead a passive backup should be kept This backup needs to be up‐to‐date and ready to become active if the primary database server fails

Load balancing server which can route HTTP traffic to either of the two Job Managers The load balancer also needs to be backed‐up This backup can also be passive and ready to become active if the primary load balancer fails

D IBM middleware protection strategies There are many ways in which IBM middleware can be protected including (but not limited to)

bull Software bull WebSphere Application Server Network Deployment allows application servers to be

clustered and manages high availability across Java EE applications bull DB2reg can be used to keep an up‐to‐date replica of a database on a separate server using

its High Availability Disaster Recovery (HADR) feature bull Tivoli System Automation provides advanced clustering support for managing highly

6

available systems bull Hardware

bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i

bull Disk technology such as a Redundant Array of Independent Disks (RAID)

For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR

Figure 1 ODME Software‐only HA Topology example

This topology consists of

Optimization Servers running WebSphere Application Server (WAS) Each Optimization Server runs a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications is installed into the cluster Each Optimization Server hosts one or more ODM Optimization Engines A WAS Service Integration bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors ‐ the SIBus uses a single messaging engine with a HADR database store which WAS will automatically move to the other WAS server in the event of a failure In case of primary database failure DB2 HADR will switch to the alternate standby database server

Database servers running DB2 Enterprise Server Edition Both servers run the same software with one acting as the primary database server The primary database server replicates all database updates on the standby server using DB2s High Availability Disaster Recovery (HADR) feature In addition Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests

7

to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis

E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical

It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection

There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring

F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings

ODM application

configuration (odmapp

ODM Enterprise IDE

ODMRepositorySCENARIO

DB

Development Deployment

IT developer

Java Development

Tools

OPL Studio

ODM Editors

ODM Studio -Planner and

Reviewer Editions

Optimization Server

Custom Clients and Batch Files

odmapp

odmapp

odmapp

odmapp

solve

solve

Readwrite

Rea

dwr

ite

Readwrite

The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs

When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server

8

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 5: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

B Protecting the system This diagram shows the dependencies of ODME components on high‐level middleware components

Figure Component dependency diagram

These dependencies show the key consequences of the failure of any component For example if the application server for the Job Processor fails then so does the Job Processor application and any associated Optimization Engines If database manager for the Jobs DB fails then the Jobs DB Job Manager and Job Processor and Optimization Engine all fail While two logical application servers are depicted the solution may be deployed on a single physical application server instance The same is true for the two logical database managers This would of course mean that a failure could have a greater impact

1) Protecting the ability to run jobs Assuming job records are available in the Jobs database the ability to run optimization solve jobs is based on the following components

bull Job Processor bull Optimization Engine(s) bull Jobs database bull Scenario database

From a middleware perspective the following needs to be protected bull Application servers used by Job Processor bull The database servers used for the Jobs database bull The database servers used for the Scenario database bull The physical servers and network used by each of the above

2) Protecting the ability to manage jobs The ability to manage jobs is provided by the following logical components

bull Job Manager application bull Jobs database

5

From a middleware perspective the following needs to be protected bull HTTP server used by Job Manager bull Application server used by Job Manager bull The database server used for the Jobs database bull The physical servers and network used by each of the above

C Sample HA topology Here is a sample topology which can be used to protect the ODME solution

This logical topology consists of

Optimization Servers each running a Java EE application server with the Job Processor and Job Manager applications installed Each Optimization Server will host one or more Optimization Engines Both Optimization Servers can be operation at the same time ‐ if one Optimization Server fails then the other will continue taking jobs from the Jobs Database Any number of Optimization Servers could theoretically be used

Database server hosting both the Jobs and Scenario databases Keeping multiple copies of the same database active and up‐to‐date could be very difficult so instead a passive backup should be kept This backup needs to be up‐to‐date and ready to become active if the primary database server fails

Load balancing server which can route HTTP traffic to either of the two Job Managers The load balancer also needs to be backed‐up This backup can also be passive and ready to become active if the primary load balancer fails

D IBM middleware protection strategies There are many ways in which IBM middleware can be protected including (but not limited to)

bull Software bull WebSphere Application Server Network Deployment allows application servers to be

clustered and manages high availability across Java EE applications bull DB2reg can be used to keep an up‐to‐date replica of a database on a separate server using

its High Availability Disaster Recovery (HADR) feature bull Tivoli System Automation provides advanced clustering support for managing highly

6

available systems bull Hardware

bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i

bull Disk technology such as a Redundant Array of Independent Disks (RAID)

For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR

Figure 1 ODME Software‐only HA Topology example

This topology consists of

Optimization Servers running WebSphere Application Server (WAS) Each Optimization Server runs a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications is installed into the cluster Each Optimization Server hosts one or more ODM Optimization Engines A WAS Service Integration bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors ‐ the SIBus uses a single messaging engine with a HADR database store which WAS will automatically move to the other WAS server in the event of a failure In case of primary database failure DB2 HADR will switch to the alternate standby database server

Database servers running DB2 Enterprise Server Edition Both servers run the same software with one acting as the primary database server The primary database server replicates all database updates on the standby server using DB2s High Availability Disaster Recovery (HADR) feature In addition Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests

7

to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis

E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical

It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection

There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring

F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings

ODM application

configuration (odmapp

ODM Enterprise IDE

ODMRepositorySCENARIO

DB

Development Deployment

IT developer

Java Development

Tools

OPL Studio

ODM Editors

ODM Studio -Planner and

Reviewer Editions

Optimization Server

Custom Clients and Batch Files

odmapp

odmapp

odmapp

odmapp

solve

solve

Readwrite

Rea

dwr

ite

Readwrite

The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs

When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server

8

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters, where the job-processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
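The toy simulation below (plain Java, not ODME code) mimics this scheme under the assumptions of the sample topology: two nodes with three solve slots each, pulling from one shared first-in/first-out queue. With a deep queue both nodes stay saturated until the queue drains; with a shallow one, the first free slots absorb everything.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    public class FifoSolveSimulation {
        public static void main(String[] args) {
            // Shared solve-pending queue, preloaded with 500 jobs as in Figure 3 below.
            BlockingQueue<Integer> pending = new LinkedBlockingQueue<>();
            for (int job = 1; job <= 500; job++) pending.add(job);

            for (String node : new String[] {"server1", "server2"}) {
                ExecutorService slots = Executors.newFixedThreadPool(3); // 3 solve slots per node
                for (int s = 0; s < 3; s++) {
                    slots.submit(() -> {
                        Integer job;
                        // Each free slot pulls the next pending job: first-come-first-served.
                        while ((job = pending.poll()) != null) {
                            System.out.println(node + " solving job " + job);
                            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                        }
                    });
                }
                slots.shutdown();
            }
        }
    }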

A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, the DB2 HADR instance is set up in Active/Standby, so only one of the two DB2 nodes will be handling DB requests.

[Figure 2: Typical ODME 3.3.0.1 load balancing timeline. Sequence diagram showing the client, the IHS WLM plug-in, the mgmt and proc modules of WAS 1 and WAS 2, and the JOBS DB with its hot standby; jobs A, B, and C are submitted over SOAP/HTTP, created in the JOBS DB, and picked up for solving by the two nodes while status queries report progress.]

A typical balancing of load is illustrated in Figure 3 below. The yellow line represents the queue depth, starting at 500 jobs and decreasing as the load is consumed until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve-processing slots each. Overall, both processors handle 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.

[Figure 3: Typical balance of load on ODME 3.3.0.1. Chart plotting the queue depth (scale 0 to 600) and the number of running jobs on server1 and server2 (scale 0 to 7) over time.]


The irregularities towards the end are due to administrator-triggered cleanup of processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

Figure 4 below illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

[Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure. The same sequence as Figure 2, but WAS 1 is stopped mid-timeline; WAS 2 continues to accept submissions and solve jobs while the JOBS DB hot standby remains idle.]


Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs will likewise be solved against the alternate Scenario DB instance.

2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-jobs detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
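The sketch below illustrates the principle of that detection loop under stated assumptions: the timeout value, names, and data structures here are ours, not ODME internals. A periodic sweep finds jobs whose last heartbeat is older than the timeout, marks them recoverable, and puts them back on the solve-pending queue.

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class FailedJobDetector {
        static final long TIMEOUT_MS = 60_000;  // assumed heartbeat timeout

        final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
        final Queue<String> solvePending = new ConcurrentLinkedQueue<>();

        // Called whenever a solve process reports progress for a job.
        void onHeartbeat(String jobId) {
            lastHeartbeat.put(jobId, System.currentTimeMillis());
        }

        // Called periodically: requeue jobs whose heartbeat has gone silent.
        void sweep() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) {
                    lastHeartbeat.remove(e.getKey());
                    solvePending.add(e.getKey()); // marked recoverable, solved again later
                    System.out.println("job " + e.getKey() + " timed out; requeued as recoverable");
                }
            }
        }
    }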


4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms

• The optimserver-processor-ear Enterprise Application is not started on the server, although the optimserver-mgmt-ear is running.

• Queued jobs are not processed (they remain in the NOT_STARTED state).

• Only one of the cluster members runs jobs, although the queue is saturated.

• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.

Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755), right after restarting and before any solver instance is started:

   chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/*

This forces AIX not to cache the files.


B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses connection to the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.

Explanation

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation

This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled.

This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.

Remediation

WAS administrative security may be turned on, but then write access to JNDI should be granted to the "everyone" group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
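For context, the startup write is of the kind sketched below; the JNDI name and value are hypothetical, used only to show why Cos Naming write authorization is required.

    import javax.naming.InitialContext;

    public class JndiStartupWrite {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            // With administrative security enabled and no write grant for EVERYONE,
            // this rebind fails (typically surfacing a CORBA NO_PERMISSION error).
            ctx.rebind("optimserver/sharedState", "initialized");
            ctx.close();
        }
    }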

E ODM solver does not start

Symptoms

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run vcredist_x86.exe from the redist\vcredist directory of the ODM Enterprise Developer Edition on all machines where the Optimization Server will execute ODM solve jobs.


Page 6: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

From a middleware perspective the following needs to be protected bull HTTP server used by Job Manager bull Application server used by Job Manager bull The database server used for the Jobs database bull The physical servers and network used by each of the above

C Sample HA topology Here is a sample topology which can be used to protect the ODME solution

This logical topology consists of

Optimization Servers each running a Java EE application server with the Job Processor and Job Manager applications installed Each Optimization Server will host one or more Optimization Engines Both Optimization Servers can be operation at the same time ‐ if one Optimization Server fails then the other will continue taking jobs from the Jobs Database Any number of Optimization Servers could theoretically be used

Database server hosting both the Jobs and Scenario databases Keeping multiple copies of the same database active and up‐to‐date could be very difficult so instead a passive backup should be kept This backup needs to be up‐to‐date and ready to become active if the primary database server fails

Load balancing server which can route HTTP traffic to either of the two Job Managers The load balancer also needs to be backed‐up This backup can also be passive and ready to become active if the primary load balancer fails

D IBM middleware protection strategies There are many ways in which IBM middleware can be protected including (but not limited to)

bull Software bull WebSphere Application Server Network Deployment allows application servers to be

clustered and manages high availability across Java EE applications bull DB2reg can be used to keep an up‐to‐date replica of a database on a separate server using

its High Availability Disaster Recovery (HADR) feature bull Tivoli System Automation provides advanced clustering support for managing highly

6

available systems bull Hardware

bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i

bull Disk technology such as a Redundant Array of Independent Disks (RAID)

For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR

Figure 1 ODME Software‐only HA Topology example

This topology consists of

Optimization Servers running WebSphere Application Server (WAS) Each Optimization Server runs a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications is installed into the cluster Each Optimization Server hosts one or more ODM Optimization Engines A WAS Service Integration bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors ‐ the SIBus uses a single messaging engine with a HADR database store which WAS will automatically move to the other WAS server in the event of a failure In case of primary database failure DB2 HADR will switch to the alternate standby database server

Database servers running DB2 Enterprise Server Edition Both servers run the same software with one acting as the primary database server The primary database server replicates all database updates on the standby server using DB2s High Availability Disaster Recovery (HADR) feature In addition Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests

7

to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis

E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical

It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection

There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring

F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings

ODM application

configuration (odmapp

ODM Enterprise IDE

ODMRepositorySCENARIO

DB

Development Deployment

IT developer

Java Development

Tools

OPL Studio

ODM Editors

ODM Studio -Planner and

Reviewer Editions

Optimization Server

Custom Clients and Batch Files

odmapp

odmapp

odmapp

odmapp

solve

solve

Readwrite

Rea

dwr

ite

Readwrite

The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs

When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server

8

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 7: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

available systems bull Hardware

bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i

bull Disk technology such as a Redundant Array of Independent Disks (RAID)

For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR

Figure 1 ODME Software‐only HA Topology example

This topology consists of

Optimization Servers running WebSphere Application Server (WAS) Each Optimization Server runs a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications is installed into the cluster Each Optimization Server hosts one or more ODM Optimization Engines A WAS Service Integration bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors ‐ the SIBus uses a single messaging engine with a HADR database store which WAS will automatically move to the other WAS server in the event of a failure In case of primary database failure DB2 HADR will switch to the alternate standby database server

Database servers running DB2 Enterprise Server Edition Both servers run the same software with one acting as the primary database server The primary database server replicates all database updates on the standby server using DB2s High Availability Disaster Recovery (HADR) feature In addition Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests

7

to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis

E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical

It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection

There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring

F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings

ODM application

configuration (odmapp

ODM Enterprise IDE

ODMRepositorySCENARIO

DB

Development Deployment

IT developer

Java Development

Tools

OPL Studio

ODM Editors

ODM Studio -Planner and

Reviewer Editions

Optimization Server

Custom Clients and Batch Files

odmapp

odmapp

odmapp

odmapp

solve

solve

Readwrite

Rea

dwr

ite

Readwrite

The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs

When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server

8

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are:

• clientRerouteAlternateServerName: the alternate server name for client reroute.
• clientRerouteAlternatePortNumber: the alternate port number for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails.
• retryIntervalForClientReroute: the amount of time (in seconds) to sleep before retrying again.

Notes:

• This property list can be extended with other DB2 properties to match your needs. The list is passed as-is to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
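The alternate server can also be registered on the DB2 server side, so that connected clients learn it at connect time even before these properties are read. A minimal sketch with the DB2 command line processor, assuming the database alias ODM and the SERVER2/PORT2 placeholders from the example above:

# Run on the primary DB2 instance; the alternate server definition is
# returned to clients when they connect
db2 "UPDATE ALTERNATE SERVER FOR DATABASE ODM USING HOSTNAME SERVER2 PORT PORT2"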

Useful Links:

• DB2 InfoCenter, "Automatic client reroute description and setup": http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html


3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members; it is a feature brought by WebSphere's Network Deployment edition.

HA is the ability of the system to continue operating when some of its hardware, network, or software components fail.

A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, two different types of workload are processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management

Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol and are workload-managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS is round-robin; it applies to all Job Control activity, whether it originates from ODM Studio or the SolveAPI.

The Optimization Server Admin console is a stateless web application and is also load balanced in a round-robin fashion by WAS.

2) Job solving Workload Management

The solver (Optimization Engine) processes are long-running, and their run duration may vary widely across job types. They are managed by the Job Processor independently on each node.

Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes according to their capacity.

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs start to be processed evenly by the two nodes until the queue is drained.

A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in Active/Standby mode, so only one of the two DB2 nodes handles DB requests.

[Figure 2 is a sequence diagram: a client submits jobs A, B, and C over SOAP/HTTP; the IHS WLM plug-in routes each request to the mgmt module on WAS 1 or WAS 2; jobs are created and stored in the JOBS DB, whose HADR partner remains in hot standby; the proc module on each node picks jobs up, solves them, and reports running/progress/complete status, which the client polls through read-status requests.]

Figure 2: Typical ODME 3.3.0.1 load balancing timeline

A typical balancing of load is illustrated below in Figure 3. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the two job processors, each of which has 3 solve processing slots. Overall, both processors handle 2 or 3 jobs each until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.

[Figure 3 is a chart: the X axis shows events over time; one Y axis (0 to 7) shows the number of jobs running on server1 and server2; the other Y axis (0 to 600) shows the queue depth.]

Figure 3: Typical balance of load on ODME 3.3.0.1


The irregularities towards the end are due to administrator-triggered cleansing of processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. It gives the system the ability, on the one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability of the Optimization Server to keep displaying the Admin console, keep accepting new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

[Figure 4 is a sequence diagram similar to Figure 2: the client submits jobs A and B over SOAP/HTTP through the IHS WLM plug-in; after WAS 1 is stopped, the mgmt and proc modules on WAS 2 take over job creation, solving, and status reporting against the JOBS DB, whose HADR partner remains in hot standby.]

Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure


Note that the Optimization Server Admin console also continues to be served by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails, and newly picked-up jobs are solved against the alternate Scenario database instance.
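One way to exercise this behavior in a test environment is to force an HADR role switch with the DB2 command line processor; this is a sketch, assuming the database alias ODM from the examples above:

# Run on the standby instance: graceful role switch while the primary is still up
db2 "TAKEOVER HADR ON DATABASE ODM"

# Or simulate a primary failure with a forced takeover
db2 "TAKEOVER HADR ON DATABASE ODM BY FORCE"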

2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and cannot store intermediate synchronization points, so the failure of a solver process while solving causes the solve to be aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.


4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide some troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms

The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.

Queued jobs are not processed (they remain in the NOT_STARTED state).

Only one of the cluster members runs jobs, although the queue is saturated.

SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.

Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in that directory to 750 (instead of the default 755) with chmod, right after restarting and before any solver instance is started; this forces AIX not to cache the files.
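Taken together, the remediation might look like the following sketch; the directory path and the wildcard over its contents are assumptions based on the message quoted in the symptoms above:

# Path of the cached OPL binaries (from the FileNotFoundException above)
OPL_BIN=/usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70

# Before starting the WAS server: remove the cached, locked binaries
rm -f "$OPL_BIN"/*

# Right after restart, and before any solver instance starts: drop the
# world-execute bit so that AIX does not cache and lock the files again
chmod 750 "$OPL_BIN"/*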


B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances, the odmsolver may complete solving a scenario and store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it fail because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user sees the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException params=] when the connection to the JOBS DB is lost.

Explanation

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation

This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.

This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.

Remediation

WAS administrative security may be turned on, but write access to JNDI must then be granted to the everyone group. This is done in the WAS Admin console, in the Environment => Naming => CORBA Naming Service Groups section: add the group EVERYONE with the Cos Naming Read, Write, Create, and Delete authorizations.

E ODM solver does not start

Symptoms

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer Edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.


Page 9: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

2 Configuring a HA ODM Enterprise environment

A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products

bull IBM ILOG ODM Enterprise V3301

bull IBM WebSphere Application Server Network Deployment V61 FP 025

bull IBM HTTP Server V61 FP 025

bull IBM DB2 Enterprise Server Edition V95 FixPack 3

This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details

Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of

Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster

Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server

Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis

Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server

B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers

9

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301


The irregularities towards the end are due to administrator-triggered cleansing of processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. On one hand, the system can continue operating across those failures; on the other, it can perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep accepting new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

[Figure 4 repeats the Figure 2 sequence diagram with WAS 1 stopped after job A is created: the IHS WLM plug-in reroutes the client's SOAP/HTTP requests to WAS 2, whose mgmt and proc modules create job B and solve both jobs A and B against the JOBS DB, while the second DB2 instance remains in hot standby.]

Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure


Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been given appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs are likewise solved against the alternate instance of the Scenario database through the odmapp datasource.
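To make the reroute concrete, here is a minimal sketch of the same client reroute settings expressed at the JDBC level. Host names, port, and database name are placeholders, and in this topology the values are entered as WAS datasource custom properties (see the configuration section) rather than set in code.

    import com.ibm.db2.jcc.DB2SimpleDataSource;

    public class JobsDataSourceSketch {
        static DB2SimpleDataSource jobsDataSource() {
            DB2SimpleDataSource ds = new DB2SimpleDataSource();
            ds.setServerName("db2primary.example.com");  // HADR primary (placeholder)
            ds.setPortNumber(50000);                     // placeholder port
            ds.setDatabaseName("JOBSDB");                // placeholder database name
            ds.setDriverType(4);                         // JCC type-4 connectivity
            // Automatic client reroute: where to go when the primary fails.
            ds.setClientRerouteAlternateServerName("db2standby.example.com");
            ds.setClientRerouteAlternatePortNumber("50000");
            ds.setMaxRetriesForClientReroute(2);         // retry twice before failing
            ds.setRetryIntervalForClientReroute(15);     // seconds between retries
            return ds;
        }
    }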

2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and has no ability to store intermediate synchronization points, so a failure of a solver process while solving results in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
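The detection pass can be pictured with a sketch like the following, assuming a hypothetical JOBS table layout (STATE, RECOVERABLE, and LAST_HEARTBEAT columns) rather than the actual Optimization Server schema.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class FailedJobDetectorSketch {
        // Requeue in-process jobs whose solver has not heartbeated for 2 minutes
        // (the timeout value is illustrative).
        static int requeueTimedOutJobs(Connection jobsDb) throws SQLException {
            String sql = "UPDATE JOBS SET STATE = 'NOT_STARTED', RECOVERABLE = 1"
                       + " WHERE STATE = 'PROCESSING'"
                       + " AND LAST_HEARTBEAT < CURRENT TIMESTAMP - 120 SECONDS";
            try (Statement st = jobsDb.createStatement()) {
                return st.executeUpdate(sql);  // each updated row is a job requeued for solving
            }
        }
    }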


4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operation and failure recovery are expected.

Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms

• The optimserver-processor-ear Enterprise Application is not started on the server, although the optimserver-mgmt-ear is running.

• Queued jobs are not processed (they remain in the NOT_STARTED state).

• Only one of the cluster members runs jobs, although the queue is saturated.

• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is therefore not allowed to extract them again and fails during its initialization.

Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started:

chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70

This forces AIX not to cache the files.


B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses its connection to the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.

Explanation

The connection to the JOBS DB is lost, and the Optimization Server Admin console cannot extract the job queue status for display.

Remediation

This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled.

This results in an exception being raised during startup of Optimization Server, reported in SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree.

Remediation

WAS administrative security may be turned on, but write access to JNDI must then be granted to the everyone group. This is achieved using the WAS Admin console, in the Environment -> Naming -> CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
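The kind of startup operation that requires this authority can be pictured as follows; the binding name is a placeholder, not the Optimization Server's actual JNDI name.

    import javax.naming.InitialContext;
    import javax.naming.NamingException;

    public class StartupBindingSketch {
        static void publishSharedState(String state) throws NamingException {
            InitialContext ctx = new InitialContext();
            // rebind() writes into the WAS JNDI tree; without Cos Naming
            // Write/Create authority for EVERYONE, this call fails at startup.
            ctx.rebind("optimserver/sharedState", state);
            ctx.close();
        }
    }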

E ODM solver does not start

Symptoms

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.

21

Page 10: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

bull Identical operating system version and patch level

bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application

bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path

DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well

2 Start the DB2 servers on both machines if they are not already running

3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)

bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema

bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE

bull SIBus database ndash used by the WAS Service Integration Bus

Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR

4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in

httpwwwredbooksibmcomredbooksSG247363wwhelpwwhimpljshtmlwwhelphtm

Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0

Useful Links bull DB2 V95 InfoCenter

httppublibboulderibmcominfocenterdb2luwv9r5indexjsp

bull The IBM Redbook ldquoHigh Availability and Disaster Recovery Options for DB2 on Linux UNIX and Windowsrdquo provides a useful guide

10

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 11: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

httpwwwredbooksibmcomabstractsSG247363html

bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

httppublibboulderibmcominfocenterdb2luwv9r5topiccomibmdb2luwadminhadocdocc0011976html

C Configuring WebSphere Application Server and HTTP Server

1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database

2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment

a) Install WAS 61 Network Deployment as detailed in

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled

b) Start and connect to the deployment manager console create a new cluster in Servers =gt

Clusters and define cluster members for nodes created earlier

c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes

d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties

11

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 12: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connections between WebSphere Application Server host and the DB2 HADR primary and stand‐by hosts machine work well

The JNDI name to use for the Jobs DB should be OptimizationServerDB which is the default binding name used in the Optimization Server enterprise modules

e) Set the custom properties of these JDBC data sources

bull currentSchema ndash the schema used when creating the DB2 database This schema is by default the userID that you used to create the Jobs DB tables

bull clientRerouteAlternateServerName standby server name for client reroute This is HADR standby host name

bull clientRerouteAlternatePortNumber standby server port number for client reroute

bull maxRetriesForClientReroute limit the number of retries if the primary connection to the server fails Good default can be 2

bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again Good default can be 15

bull fullyMaterializeInputStreams set to true

bull progressiveStreaming disable by setting to a value of 2 This will prevent odmapp unpacking issues

f) Create the SIBus named OptimizationServerBus in Service integration =gt Buses with no security enabled

g) Set Bus members for OptimizationServerBus at the cluster scope and use the Data store for the HA SIBus database created earlier You may need to specify an authentication alias for the SIBus database connection

h) Create JMS resource in Resources =gt JMS for the cluster scope using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section)

bull OptimizationServerTopic named jmsoptimserverTopic in JNDI

bull OptimizationServerTopicConnectionFactory named jmsoptimserverTopicConnectionFactory in JNDI

bull OptimizationServerQueueConnectionFactory named jmsoptimserverQueueConnectionFactory in JNDI

bull OptimizationServerTopicSpec named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic

i) Deploy optimserver‐mgmt‐ear and optimserver‐processor‐ear on the cluster scope

j) Install IBM HTTP Server 61

12

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are:

• clientRerouteAlternateServerName: alternate server names for client reroute
• clientRerouteAlternatePortNumber: alternate port numbers for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again

Note: this property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
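
For illustration, the same reroute properties can also be exercised directly against the DB2 JCC driver, outside of the deployment settings file. A minimal sketch, with SERVER1/PORT1/SERVER2/PORT2 kept as placeholders as in the example above:

    import java.sql.Connection;
    import java.sql.DriverManager;

    // Requires the DB2 JCC driver (db2jcc.jar) on the classpath. The JCC URL
    // syntax accepts properties after the database name, each terminated by ';'.
    public class RerouteCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:db2://SERVER1:PORT1/ODM"
                    + ":clientRerouteAlternateServerName=SERVER2"
                    + ";clientRerouteAlternatePortNumber=PORT2"
                    + ";maxRetriesForClientReroute=2"
                    + ";retryIntervalForClientReroute=15;";
            try (Connection con = DriverManager.getConnection(url, "user", "password")) {
                System.out.println("Connected: " + con.getMetaData().getURL());
            }
            // When the primary fails over, the driver reroutes the connection and
            // reports SQLCODE -4498 on the connection that was re-established.
        }
    }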

Useful links:

• DB2 InfoCenter "Automatic client reroute description and setup": http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html


3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members; it is a feature brought by WebSphere's Network Deployment edition.

HA is the ability of the system to continue operating when some of its hardware, network, or software components encounter a failure.

A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, there are two different types of workload being processed by OptimServer: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management

Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol and are workload managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin; it applies to all job control activity, whether it originates from ODM Studio or the SolveAPI.

The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.

2) Job solving Workload Management

The solver (Optimization Engine) processes are long-running, and their run duration may vary widely across job types. They are managed by the Job Processor independently on each node.

Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme where solves are processed across the nodes depending on their capacity.

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained. A minimal sketch of this pull model follows.
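
The sketch below uses hypothetical names; the real Job Processor reads the JOBS DB rather than an in-memory queue:

    import java.util.concurrent.*;

    // Each node's job processor pulls from a shared FIFO queue whenever one of
    // its solve slots is free, which yields the first-come/first-served
    // balancing described above.
    public class JobProcessorSketch implements Runnable {
        private final BlockingQueue<String> solvePending;      // stands in for the JOBS DB queue
        private final Semaphore solveSlots = new Semaphore(3); // 3 solve processing slots per node
        private final ExecutorService engines = Executors.newCachedThreadPool();

        JobProcessorSketch(BlockingQueue<String> solvePending) {
            this.solvePending = solvePending;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    solveSlots.acquire();              // wait for a free slot
                    String job = solvePending.take();  // FIFO pick-up
                    engines.submit(() -> {
                        try {
                            solve(job);                // long-running optimization solve
                        } finally {
                            solveSlots.release();
                        }
                    });
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void solve(String job) { /* placeholder for the Optimization Engine call */ }
    }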

A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in Active/Standby, so only one of the two DB2 nodes will be handling DB requests.

[Figure 2: Typical ODME 3.3.0.1 load balancing timeline. Sequence diagram of client submits for jobs A, B, and C, routed by the IHS WLM plug-in over SOAP/HTTP to WAS 1 and WAS 2, with job status stored in the JOBS DB and DB2 in hot standby.]

A typical balancing of load is illustrated in Figure 3 below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots each. Overall, both processors handle 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.

[Figure 3: Typical balance of load on ODME 3.3.0.1. Queue depth drains from 500 to 0 while server1 and server2 each run up to 3 jobs.]


The irregularities towards the end are due to administrator-triggered cleansing of the processed jobs from the log.
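
The drain pattern of Figure 3 can be reproduced with a small simulation; a sketch under the same assumptions (two nodes, 3 slots each, 500 short jobs of even duration):

    import java.util.concurrent.*;

    // Two nodes with 3 solve slots each consume a 500-job FIFO queue; each node
    // stays 2-3 jobs busy until the queue empties, as in Figure 3.
    public class DrainSimulation {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
            for (int i = 0; i < 500; i++) queue.add(i);
            CountDownLatch all = new CountDownLatch(500);

            ExecutorService[] nodes = {
                Executors.newFixedThreadPool(3),  // server1: 3 solve slots
                Executors.newFixedThreadPool(3)   // server2: 3 solve slots
            };
            for (ExecutorService node : nodes) {
                for (int slot = 0; slot < 3; slot++) {
                    node.submit(() -> {
                        while (queue.poll() != null) {   // FIFO pick-up until drained
                            try {
                                Thread.sleep(10);        // short, even solve duration
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                                return;
                            }
                            all.countDown();
                        }
                    });
                }
            }
            all.await();
            System.out.println("All 500 jobs processed");
            for (ExecutorService node : nodes) node.shutdown();
        }
    }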

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to keep displaying the Admin console, keep accepting new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

[Figure 4: Typical ODME 3.3.0.1 operations continuity timeline across WAS node failure. After WAS 1 stops, the IHS WLM plug-in routes all SOAP/HTTP traffic to WAS 2, which continues creating and solving jobs against the JOBS DB, with DB2 in hot standby.]


Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with the appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will likewise access scenario data (odmapp) through the alternate server.

2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.
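
A minimal sketch of this heartbeat-timeout detection, with hypothetical names and an assumed timeout value (the actual detection is internal to the Optimization Server):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Jobs registered as in-process must report heartbeats; a periodic sweep
    // marks silent jobs as recoverable and requeues them for another solve.
    public class FailedJobDetector {
        private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
        private final Duration timeout = Duration.ofMinutes(5); // assumed value

        public void heartbeat(String jobId) {
            lastHeartbeat.put(jobId, Instant.now());
        }

        public void sweep(JobQueue queue) {
            Instant now = Instant.now();
            lastHeartbeat.forEach((jobId, seen) -> {
                if (Duration.between(seen, now).compareTo(timeout) > 0) {
                    lastHeartbeat.remove(jobId);
                    queue.requeueAsRecoverable(jobId); // solved again on next pick-up
                }
            });
        }

        interface JobQueue { void requeueAsRecoverable(String jobId); }
    }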


4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.

A Job processor fails to extract OPL binaries upon restart

Symptoms

• The optimserver-processor-ear Enterprise Application is not started on the server, although the optimserver-mgmt-ear is running.

• Queued jobs are not processed (they remain in the NOT_STARTED state).

• Only one of the cluster members runs jobs, although the queue is saturated.

• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.

Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755), right after restarting and before any solver instance is started:

    chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/*

This forces AIX not to cache the files.

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.

C Bad error reporting when Optimization Server loses connection to the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException;params=] when the connection to the JOBS DB is lost.

Explanation

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation

This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy the Optimization Server with security enabled.

This results in an exception being raised during the startup of the Optimization Server, reported in the SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
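
For illustration only, a sketch of the kind of JNDI write that requires this permission; the name bound here is hypothetical:

    import javax.naming.InitialContext;

    // At startup the server binds shared state into the WAS JNDI tree; with
    // administrative security enabled, rebind fails unless the caller holds
    // Cos Naming Write/Create authorization.
    public class JndiStartupSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext(); // WAS-provided naming environment assumed
            ctx.rebind("optimserver/sharedState", "initialized"); // hypothetical name
            System.out.println("Bound: " + ctx.lookup("optimserver/sharedState"));
        }
    }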

Remediation

WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment -> Naming -> CORBA Naming Service Groups section: the group EVERYONE has to be added with the Cos Naming Read, Write, Create, and Delete authorizations.

E ODM solver does not start

Symptoms

All solve jobs end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.


Page 13: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

Tip

bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application

bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier

k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of

plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps

l) Start cluster nodes

m) Start the cluster in Servers =gt Clusters

n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole

Useful links

bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo

bull ldquoRoadmap Installing the Network Deployment productrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaerins_ndroadmaphtml

bull WAS InfoCenter ldquoInstalling Web server plug‐insrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherendmultiplatformdocinfoaeaetins_webpluginshtml

bull ldquoWebSphere Application Server Network Deployment V6 High Availability Solutionsrdquo

httpwwwredbooksibmcomabstractssg246688htmlOpen

bull ldquoService integration high availability and workload sharing configurationsrdquo

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0007_html

bull SIBus Configuration for high availability

httppublibboulderibmcominfocenterwasinfov6r1topiccomibmwebspherepmcnddocconceptscjt0010_html0

D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute

13

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 14: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically

To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example

ltdeploymentSettingsgt ltrepository multiUser=truegt ltconnectiongt ltJDBCDriverClass name=comibmdb2jccDB2Drivergt ltJDBCURLgtjdbcdb2SERVER1PORT1ODMltJDBCURLgt ltauthenticationgt ltauthenticationgt ltconnectiongt ltmanagerClass name=ilogodmdatasvcpersistdb2IloDB2RepositoryManagerFactorygt ltappSchema name=SCHEMAgt ltpropertiesgt ltproperty name=clientRerouteAlternateServerName value=SERVER2gt ltproperty name=clientRerouteAlternatePortNumber value=PORT2gt ltproperty name=maxRetriesForClientReroute value=2gt ltproperty name=retryIntervalForClientReroute value=15gt ltpropertiesgt ltrepositorygt ltdeploymentSettingsgt

These additional properties are

bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the

server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again

Notes bull This property list can be extended with other DB2 properties to match your

needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html

Useful Links bull DB2 InfoCenter ldquoAutomatic client reroute description and setuprdquo

bull httppublibboulderibmcominfocenterdb2luwv9r5topic

bull comibmdb2luwadminhadocdocc0011976html

14

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

21

Page 15: Building a High Availability ODM Enterprise environment · 2018. 11. 2. · Building a High Availability ODM Enterprise environment ... clustered, and manages high availability across

3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR

This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301

When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)

WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version

HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure

A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines

1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme

Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI

The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS

2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node

Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity

On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving

15

capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained

A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing

Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests

WAS 1 mgmt Create job A

Client submit job A

proc

JOBS DB

WAS 2

store running complete

mgmt

proc

submit job B

Create job B

Solve job B

hot standby

job A Q status

job A read status

Solve job A

readStatus

progress

submit job C

Solve job C

running completeprogress

running completeprogress

QS QS

IHS WLM plugin WLM WLM

SOAPHTTP SOAPHTTP SOAPHTTP

WLM WLM WLM

Create job C

Figure 2 Typical ODME 3301 load balancing timeline

A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time

0

1

2

3

4

5

6

7

time

0

100

200

300

400

500

600

running

server1

server2Queue Depth

Figure 3 Typical balance of load on ODME 3301

16

The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log

B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure

1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs

Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure

stopWAS 1

WAS 1mgmt Create job A

Client submit job A

proc

JOBSDB

WAS 2

store

running

complete

mgmt

proc

submit job B

Create job B

Solve job B

hot

standby

job A Q status

job A read status

Solve job A

readStatus

progress

IHSWLMplugin WLM WLM

SO

AP

HTTP

SOA

PH

TTP

SOAP

HTTP

17

Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster

Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will

2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter

Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again

18

4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment

There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME

Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected

Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues

A Job processor fails to extract OPL binaries upon restart

Symptoms

optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running

Queued jobs are not processed (remain in NOT_STARTED state)

Only one of the cluster members runs jobs although the queue is saturated

SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)

Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization

Remediation

Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed

In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files

19

B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock

Remediation

The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console

C Bad error reporting when Optimization Server loses connection from the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost

Explanation

The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display

Remediation

This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered

D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled

This results in an exception being raised during startup of Optimization Server reported in the

20

SystemOutlog

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree

Remediation

WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization

E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running

Remediation

Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs

