
ORACLE SOA SUITE 11G TROUBLESHOOTING METHODOLOGY

Harold A. Dost III, Raastech, Inc.

ABSTRACT Most troubleshooting guides simply list out solutions to common errors. This paper instead introduces a troubleshooting methodology surrounding performance, composite instances, deployment, and logging. The goal is to better equip the reader with the ability to solve most problems as they pertain to the SOA infrastructure and its executed transactions, as well as to learn where to look, what to look for, and what to do afterward.

TARGET AUDIENCE This paper is intended for every Oracle SOA Suite 11g developer and administrator.

EXECUTIVE SUMMARY There is no guarantee that every error is an easy fix away. However, this paper will provide the reader with a better understanding of where to look for errors, how to categorize them, and how to deal with them in an appropriate manner. With the tools explained later on, quicker resolutions may be achieved, which will produce more efficient development and better support.

BACKGROUND According to Splunk, one of their clients, Macy's, noted that tracking down the exact cause of a problem could be "exceedingly difficult." It often required a team of members from various IT functional areas to fix these problems, and even with these teams, resolutions still took days.

In the past, when an issue presented itself, the network admins were always the ones to be blamed. Technology in that area has improved over time, so by comparison, network issues at most companies are now few and far between. Everything is instead connected higher in the stack through various integration servers and technologies, so the blame is often shifted to the integration team. But much like blaming a browser for a bad Internet connection, this blame is often misdirected.

Customers hold a company responsible for maintaining near-continuous, reliable services, and by transitivity they hold the integration team responsible as well. This puts a lot of pressure on the integration team to quickly determine whether an error is something within their realm or whether it needs a different group's attention.

One of the biggest pains in tracking down issues in middleware is that it is composed of so many layers. For example, a web application might make a call. The payload first goes through Oracle Enterprise Gateway (OEG) because it is crossing from the Internet into the company intranet. The company then uses Oracle Service Bus (OSB) for all internal service calls to abstract the naming and versioning of services. Finally, the payload reaches Oracle SOA Suite, which goes on to call other systems. Since the focus here is on Oracle SOA Suite, below are a few example issues.

A custom ANT script is used to iterate through a list of composites and deploy them one at a time. After the 66th composite, an "OutOfMemoryError: PermGen space" error is thrown; an odd but repeatable error. A much more common error is "Unable to access endpoint…" This error can have many explanations, from a simple timeout to a security issue such as an invalid certificate. Not knowing how to diagnose the source of these symptoms will slow down even the most senior developers and administrators.
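If the PermGen error turns out to come from the JVM running the ANT script itself (rather than from the managed server), one common mitigation is to give that JVM more PermGen space before launching the deployment. Below is a minimal sketch of such a wrapper; the build file name, target name, and memory values are hypothetical, and the -XX:MaxPermSize flag applies to HotSpot JVMs prior to JDK 8.

import os
import subprocess

# Hypothetical wrapper: raise PermGen and heap for the ant JVM before running
# the custom deployment script described above. ANT_OPTS is honored by the
# standard ant launcher scripts.
env = dict(os.environ)
env['ANT_OPTS'] = (env.get('ANT_OPTS', '') + ' -Xmx1024m -XX:MaxPermSize=512m').strip()

# Build file and target are placeholders; adjust to the actual script.
subprocess.check_call(['ant', '-f', 'deployAllComposites.xml', 'deployAll'], env=env)

If the error is instead thrown by the managed server, the analogous change would be made to the server's memory arguments (for example, in the domain's environment scripts) rather than to ANT_OPTS.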

TECHNICAL DISCUSSIONS AND EXAMPLES Before learning how to solve these problems, it is a good idea to step back and acknowledge that troubleshooting is an art. Like any other art, it is part skill and part knowledge. For skill, each person has a certain natural inclination toward solving problems, some being better at it than others; much of it comes down to having a methodical, scientific approach. The other half, knowledge, refers to a person's familiarity with the product. Unless someone has the ability to deduce the topology of a system without ever using it, there needs to be time spent working with and understanding the various subsystems of the product. To understand how SOA Suite works and how to fix errors, there are many resources.

Many people, not having an answer to an issue, will immediately jump onto the Internet and perform a series of queries on their favorite search engine. This can lead to various blogs and even some Oracle-specific resources, such as the OTN discussion forums. This is often wasted time, leading to solutions that are not related to the problem at hand. Finding no resolution, many will then hop onto the Oracle support site to search for the existence of a patch. While none of these options are bad, being unable to properly direct the searches can make them very time consuming, frustrating, and wasteful. The Internet should not be the only resource used; one's own brainstorming and knowledge should also be a resource for determining the issue at hand. Ideally, once the source of the issue is tracked down, a resolution is obvious or at least achievable. If that is not the case, then it is time to resort to the aforementioned resources. The company may also have an error-tracking system and knowledge base of its own. Also, always remember that talking to coworkers is useful, since issues have often been previously solved and forgotten.

The first step in tracking down an error should be to classify the problem. For the purposes of this paper, let's start by placing issues into one of three major categories: deployment, runtime, and performance. Distinguishing between these categories may not be obvious at first, but after encountering a few different types of problems it becomes easier. Runtime errors are issues in the logic of the integration; this can be actual code or configuration on the server.

In certain cases the problem will be specific to a particular composite. Signs that only a composite is affected are usually obvious, since the only errors showing are related to that integration. However, there may also be issues that affect the entire infrastructure. For now, the focus is on singular composites and deployment.

The quickest, and usually easiest, issues to troubleshoot are deployment related. Deployment of a composite is broken into different phases: cleanup, validation, compilation, and the deployment itself. The cleanup phase should never fail, as it simply searches for existing packaged integrations and deletes them if they exist. Validation examines the code and catches many errors related to bad references and malformed XML. The compilation phase will surface further errors should they arise, but if successful it also packages the source into a JAR file to prepare for deployment. Finally, deployment occurs. The deployment process will reveal a number of issues; however, they may not all be displayed from the deployer's point of view. Normally that is not a problem, as most of the issues will be revealed at runtime. These issues are usually with the server configuration: data sources, queues, topics, etc. When dealing with a process that polls a database or file folder, the process will simply not start. The best way to identify the root cause here is to tail the out logs while performing a deployment. Commonly, the issue is a bad JNDI name or a directory that does not exist. Most of these require coordination with an application administrator, depending on the level of permissions that the developer has in the particular environment. Issues that can be determined by the developers themselves will be discussed with runtime errors.
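As a concrete illustration of tailing the out logs during a deployment, the following is a minimal sketch, assuming a Linux host and a typical domain layout; the log path and the list of suspect strings are examples, not an exhaustive set. It follows the managed server's .out file and highlights lines that commonly explain a composite that deploys cleanly but never starts polling.

import time

# Example path: adjust to the actual domain and managed server names.
LOG = '/u01/domains/soa_domain/servers/soa_server1/logs/soa_server1.out'
# Example strings that often point at server-side deployment problems.
SUSPECTS = ('JNDI', 'Unable to access', 'FileNotFound', 'OutOfMemory', 'SEVERE')

with open(LOG, 'r') as f:
    f.seek(0, 2)                      # start at the end of the file, like tail -f
    while True:                       # stop with Ctrl-C once the deployment is done
        line = f.readline()
        if not line:
            time.sleep(0.5)
            continue
        if any(s in line for s in SUSPECTS):
            print('>>> ' + line.rstrip())

The same effect can be had with a plain tail -f piped through grep; the point is simply to watch the server output at the moment the deployment happens.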

During runtime, any number of errors can occur, but not all of them will be caused by individual composites. Some can be overarching issues that affect multiple integrations. Similar to deployment issues, runtime issues may be caused by problems in the code or in server configuration. Most code-related issues will appear in the flow trace and will be obvious to solve. Most issues, even non-code-related ones, will manifest as an error in the console, but the root cause will be hidden in the logs.

In the case of Figure 1, the error is a missing organization. This is a business fault and should be handled by the integration code or passed back to the calling application. Other issues can include errors like "Cannot insert NULL into…" These issues may or may not need to be handled by the integration. Unfortunately, not all errors will appear in the logs all the time, or the error that does show is not descriptive enough to determine a resolution immediately. One such error is the "Unable to access the following endpoints…" error. Logging levels can be increased to obtain further information. However, there are many different loggers available, so knowing which logger to modify can be difficult. The best way to decide which logger to modify is to look at the header of the log message. Next, finding the right level of logging can be difficult, because trace logging at times can be overly verbose, leading to more time sifting through the noise. One of the best ways to find the right logging level is to increase it a couple of levels at a time until the true problem is revealed.
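As an illustration, logger levels can be changed from Enterprise Manager or with the Fusion Middleware logging commands in WLST. The following is a minimal sketch, assuming WLST is launched from the Oracle common home so that the ODL logging commands are available; the connection details, managed server name, and logger name are examples (the logger shown is the one appearing in the header of the Figure 1 message).

# Example connection details; adjust URL, credentials, and server name.
connect('weblogic', 'welcome1', 't3://adminhost:7001')

# List the SOA loggers and their current levels to pick the right one.
listLoggers(pattern='oracle.soa.*', target='soa_server1')

# Raise a single logger a couple of levels at a time, e.g. NOTIFICATION:1 ->
# TRACE:1 -> TRACE:16, rather than jumping straight to the most verbose level.
setLogLevel(target='soa_server1', logger='oracle.soa.mediator.serviceEngine',
            level='TRACE:1')

disconnect()

Dropping the level back down once the information has been captured avoids filling the logs with trace output.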

There are many signs that there is a problem with the performance of a system. Some of those signs include:

The Oracle Enterprise Manager Fusion Middleware Control is abnormally slow.

The completion time of composites is increased consistently across the board.

The size of the dehydration store is growing rapidly.

A large number of errors are appearing in the logs.

<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000>
<Got an exception: oracle.fabric.common.FabricInvocationException: javax.xml.ws.soap.SOAPFaultException:
Message: Organization 129024 not found. Stack trace: at Core.WebServices.Message.MessageWebService.SaveNotification(Organization organization, Notification notification) in c:\Data\1.0\Core\Message\MessageWebService.svc.cs:line 100, detail=javax.xml.ws.soap.SOAPFaultException:

Figure 1: Business Fault


If the server is experiencing any of the issues listed above, there is likely a performance problem.

There are a number of places to look to track down the root cause. First, check whether there is enough available space on the hard drives; a lack of space can result in drastic performance reductions. Second, check the processor, memory, and I/O statistics with a tool like vmstat to help narrow down exactly which process is hogging resources on the [virtual] machine. Other factors in performance can be the number of open files and the number of running processes. A runaway integration can consume all available file descriptors, thereby degrading performance across the rest of the system. If issues like this arise, it is often a good idea in development to clear the logs and restart WebLogic while watching the logs for any errors that may be a precursor to the "too many open files" error. If nothing specific to SOA Suite is found, check the other applications running, and be sure to check the OS logs (/var/log/messages). While errors can be a common reason for a slow environment, other issues could also be playing a role.
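The first two checks above are quick to script. The following is a minimal sketch, assuming a Linux host; the mount point and process ID are placeholders, and reading another user's /proc entries typically requires appropriate privileges.

import os

# Rough percentage of free space on a filesystem (example mount point).
def disk_free_pct(path='/u01'):
    st = os.statvfs(path)
    return 100.0 * st.f_bavail / st.f_blocks

# Count of file descriptors held by a process, via /proc (Linux only);
# compare against ulimit -n for the user running the managed server.
def open_fd_count(pid):
    return len(os.listdir('/proc/%d/fd' % pid))

print('Free space on /u01: %.1f%%' % disk_free_pct('/u01'))
print('Open FDs for PID 12345: %d' % open_fd_count(12345))  # 12345 is a placeholder PID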

A tuned JVM is the only one that will deliver the kind of performance demanded by a production-level environment; this is especially true when there are high volumes of transactions passing through the environment. If the application server is not already running on the JRockit JVM, switching to it is highly recommended, as speed increases can be realized with little configuration. Once JRockit is running, there are a number of tools, such as JRockit Flight Recorder (JFR), that come with the JVM to further tune the instance as necessary. As of the writing of this paper, the HotSpot and JRockit JVMs will ship as one product with the release of JDK 8, which means the benefits of JRockit will be available within the standard JVM. Tuning is not the only useful way to interact directly with the JVM's configuration; the JVM can also provide additional diagnostic information, for example by producing a heap dump when a memory error occurs. The JVM is not the only component that should be monitored.
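Whichever JVM is in use, its runtime statistics can also be read remotely. Below is a minimal WLST (Jython) sketch that reads current heap usage from the JVMRuntime MBean; the URL, credentials, and managed server name are examples.

# Example connection details; adjust to the environment.
connect('weblogic', 'welcome1', 't3://adminhost:7001')
domainRuntime()
cd('ServerRuntimes/soa_server1/JVMRuntime/soa_server1')

# HeapSizeCurrent and HeapFreeCurrent are reported in bytes.
used = cmo.getHeapSizeCurrent() - cmo.getHeapFreeCurrent()
print('Heap used: %d MB of %d MB' % (used / (1024 * 1024),
                                     cmo.getHeapSizeCurrent() / (1024 * 1024)))
disconnect()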

Data sources are another critical component that should be monitored in the case of performance issues. It is possible that the available connection pool has been saturated with connections and is causing a bottleneck. If there is consistently an issue with a particular connection pool, involve a DBA to help understand why the pool may be filling up. There may be some SQL tuning that can be done so that queries and procedures run more efficiently, shortening connection times.
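A quick way to see whether a pool is the bottleneck is to read its runtime statistics. Below is a minimal WLST sketch; the connection details, managed server name, and data source name (SOADataSource) are examples and should be adjusted to the environment.

# Example connection details; adjust to the environment.
connect('weblogic', 'welcome1', 't3://adminhost:7001')
domainRuntime()
cd('ServerRuntimes/soa_server1/JDBCServiceRuntime/soa_server1/'
   'JDBCDataSourceRuntimeMBeans/SOADataSource')

# Current, waiting, and high-water connection counts for the pool.
print('Active connections : %d' % cmo.getActiveConnectionsCurrentCount())
print('Waiting for conn   : %d' % cmo.getWaitingForConnectionCurrentCount())
print('High-water mark    : %d' % cmo.getActiveConnectionsHighCount())
disconnect()

A consistently non-zero waiting count, or a high-water mark equal to the pool's maximum capacity, is a strong hint that the pool is saturated.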

In the end, even this paper can only gloss over the very complex art that is troubleshooting. Many variables can come into play in determining the cause, such as security considerations, the operating system, the hardware, etc.

Most issues that arise can be narrowed into runtime or infrastructure errors, performance issues, and deployment issues. Identifying the category allows focus on where the true cause of the issue lies. For deployment issues, it is good to have an understanding of the overall deployment process. Also, knowing the purpose of the adf-config.xml file can provide insight into how the MDS is referenced, along with other important deployment-related information.

When dealing with errors, determining whether there is a code-specific issue or a system-wide issue can prevent many long hours of looking in the wrong place. Modifying logging levels can assist in this and allow for drilling into the true cause of the issue.

APPENDICES JVM Performance Tuning Documentation

http://docs.oracle.com/cd/E23943_01/web.1111/e13814.pdf

http://docs.oracle.com/cd/E15289_01/doc.40/e15060.pdf

Location of out.err (Used for deployment errors)

Unix/Linux:

/tmp/out.err

Microsoft Windows:

C:\Users\<user>\AppData\Local\Temp\out.err

Oracle ADF-config.xml Description http://docs.oracle.com/cd/E15586_01/web.1111/b31974/appendixa.htm#BGBIFEJE

REFERENCES

Splunk. Ensure the Availability and Performance of Your Critical Applications Using the Genius of Splunk. Retrieved from http://www.splunk.com/web_assets/pdfs/secure/Troubleshooting_Critical_Applications.pdf