IBM Datacap 9.0.1 Performance Scalability, Tuning, and Best Practices
April 2016
© Copyright IBM Corporation 2016
Enterprise Content Management
www.ibm.com
Do not reproduce any part of this document in any form by any means without prior written authorization of IBM.
This document is provided “as is” without warranty of any kind, either expressed or implied. This statement includes but is not limited to the implied warranty
of merchantability or fitness for a particular purpose. This document is intended for informational purposes only. It might include technical inaccuracies or
typographical errors. The information in this paper, and any conclusions that are drawn from it, are subject to change without notice. Many factors
contributed to the results described in this paper, and comparable results are not guaranteed. Performance numbers vary greatly depending upon system
configuration. All data in this document pertains only to the specific test configuration and specific releases of the software described.
Introduction
IBM® Datacap is a complete solution for document and data capture. Datacap scans, classifies, recognizes, validates, verifies, and exports data and document images quickly, accurately and cost effectively. Datacap provides libraries of hundreds of script-based and code-based (.NET) actions to combine the common recognition engines for OCR, ICR, OMR, and bar codes. Datacap accurately captures data from any type of structured, highly variable, or unstructured documents. Datacap can capture typeset text, typewriter and computer print, hand print, bar codes, and check box data. By using the Datacap rules engine, data capture can be tailored to fit the most demanding business requirements and can be changed quickly when business needs change. For indexing applications, Datacap streamlines the manual data entry of index entries by using recognition to automatically identify the index values on each document and to automate the document identification process.
Datacap components typically include the Datacap server, Datacap Studio, Application Manager, Rulerunner service, and the Datacap Navigator (Datacap plug-in on IBM Content Navigator). The Datacap Administration database, Engine database, and Fingerprint databases contain application workflow, job, and task definitions, store batch contents, and manage application fingerprints.
IBM® Datacap provides various software components, each of which supports a defined set of functions. For example, you can build software applications that capture data from documents by using the Datacap Studio software component. Another example is the Datacap Rulerunner Service software component, which runs batch processing tasks that do not require operator interaction. Datacap Rulerunner Service can be used with any Datacap application. The functions of Datacap components are not application-specific, and the components can be incorporated into any Datacap application. When you run the installation program to install Datacap, you can select individual installation options, each of which includes one or more Datacap components. From a runtime perspective, the Datacap architecture has two CPU-intensive components that run rules (OCR, ICR, OMR, and other rule processing): the Rulerunner service and the Datacap Web Services service. These two components are therefore the main factors in Datacap performance scalability. In addition, when a large number of documents are involved, lower-level factors such as database queries, network latency, and disk I/O also affect performance and need to be watched carefully.
The Transaction Endpoints comprise a new REST API in the Datacap Web Services component, which exposes Datacap application rules directly to clients without the need for a Datacap workflow or queuing. With the Web Services Transaction Endpoint, users who want to run OCR on their images no longer need to create a batch: they simply submit their images and wait for the OCR result. The Datacap Web Services Transaction Endpoint makes it possible to deploy a Datacap system in a cloud environment and offer the full set of Datacap features as services.
This white paper presents the excellent vertical and horizontal scalability demonstrated for three Datacap usage scenarios:
Non-interactive ingestion, by using Rulerunner, imports over 260 20-page batches per minute.
Interactive ingestion, by using Datacap Navigator, with "scan" and "verify" users processes over 70 20-page batches per minute.
REST API ingestion, by using Datacap Web Services Transactional Endpoint, imports over 130 20-page batches per minute.
The white paper also provides performance tuning and monitoring recommendations based on specific settings such as document size, batch size, and numbers of threads.
The performance results in the white paper represent data and workloads that were run on an isolated network on designated operating systems and configurations. Actual performance in a customer environment might vary depending on many factors such as system configuration, workload characteristics, and data volumes. These lab test results are not guaranteed to be repeatable in other systems.
Test Overview
Control Environment
The servers in the environment are all running on Windows 2012. The test environment is inside an
isolated 1Gbps network. The Datacap batch folder is on NAS storage.
Server Type                        Hardware                         Software Level
Datacap Server                     x3550 M4, 20 cores, 64 GB RAM    Datacap 9.0.1, .NET 4.5.2; batch folder on 100 GB NAS storage
Rulerunner Server (x3)             x3550 M4, 20 cores, 64 GB RAM    Datacap 9.0.1, .NET 4.5.2
Datacap Web Services Server (x3)   x3250, 4 cores, 32 GB RAM        Datacap 9.0.1, .NET 4.5.2
Datacap Navigator                  x3550 M4, 20 cores, 64 GB RAM    Navigator 2.0.3.4, Datacap plug-in 9.0.1, WebSphere 8.5.5.5
Content Platform Engine            x3550 M4, 20 cores, 64 GB RAM    Content Platform Engine 5.2.1, WebSphere 8.5.5.5
RDB                                x3550 M4, 20 cores, 64 GB RAM    DB2 10.5.0.5
Test Controller                    x3250, 4 cores, 32 GB RAM        Rational Performance Tester 8.5.1
LDAP                               x3250, 4 cores, 32 GB RAM        Windows 2012
Figure 1 Environment
Figure 2 Environment topology
(Topology: the Datacap server, NAS storage, three Rulerunner nodes, the database server, three wTM nodes, IBM Content Navigator with the Datacap plug-in, Content Platform Engine, and LDAP, fronted by an F5 hardware load balancer, all within an isolated network.)
Scenario #1: Non-Interactive (Rulerunner) End-to-End Scalability Test
Datacap Rulerunner Service is a Windows service that processes all unattended capture tasks on documents. In a typical production environment, Rulerunner is configured to run page identification and recognition tasks.
Test Matrix
For assessing performance of the Rulerunner component, vertical and horizontal scalability were tested. A set of tests ran for vertical scalability, by increasing the number of Rulerunner worker threads in each Rulerunner machine up to the number of cores, to demonstrate the trend of fully using system resources, especially CPU usage. For horizontal scalability, a fixed number of Rulerunner worker threads were configured on each Rulerunner node. Tests were run with 1, 2, and 3 Rulerunner nodes to demonstrate the trend of linear performance scale-out.
Vertical scalability:   1, 10, and 20 Rulerunner threads on one node
Horizontal scalability: 20 threads on one node; 40 threads on two nodes (20 threads/node); 60 threads on three nodes (20 threads/node)
Figure 3 Test Matrix
To simulate a real-world customer environment and scenario, 200,000 batches were pre-populated into the system with job-done status before the test was started. These batches were present but were not processed during these scalability tests. The TravelDocs sample application was used in the test. Each batch contained 20 TravelDocs sample images. To simulate a realistic customer scenario, new documents were continually copied to the MVScan polled input folder throughout the test.
Test Workflow
As shown in Figure 4, in Datacap Rulerunner, images were imported into Datacap through the MVScan
task (batches created) from disk, and then processed by PageID, Profiler, and Export tasks in sequence
automatically.
Figure 4 Rulerunner Non-Interactive Capture Workflow
Test Configuration
For each Rulerunner worker thread, a set of tasks was configured: MVScan, PageID, Profiler, and Export.
Figure 5 Rulerunner Configuration
Test Results
Vertical Scalability
The test result demonstrates that the Datacap Rulerunner Service component has linear throughput performance proportional to the number of Rulerunner threads. Meanwhile, the CPU usage of the Rulerunner server was proportional to the number of threads executing.
Figure 6 Non-Interactive End-To-End Vertical Scalability Test –Throughput
Figure 7 Non-Interactive End-To-End Vertical Scalability Test – Rulerunner CPU Usage
Horizontal Scalability
The test result shows that the Datacap Rulerunner Service component has near-linear throughput performance proportional to the number of Rulerunner server nodes.
Figure 8 Non-Interactive End-To-End Horizontal Scalability Test – Throughput
Figure 9 Non-Interactive End-To-End Horizontal Scalability Test - CPU Usage
Scenario #2: Interactive (Datacap Navigator) End-to-End Scalability Test
From version 9.0, Datacap provides an IBM Content Navigator thin-client user interface. In Datacap Navigator, users can perform document capture tasks such as “Scan”, “Upload”, “Verify”, and “Classify”, and administrative operations such as managing users and groups, modifying workflows, and maintaining shortcuts.
Test Matrix
To assess performance of the Datacap Navigator client, vertical and horizontal scalability was tested. A set of tests ran for vertical scalability, increasing the number of concurrent client users (test threads) to demonstrate the trend of fully using the system resources of the component, especially CPU usage. For horizontal scalability, the number of Datacap Web Services nodes was increased from 1 through 3 to demonstrate the trend of linear performance scale-out.
Vertical scalability:   50, 100, 150, 200, and 250 virtual users on one node
Horizontal scalability: 70 virtual users on one node; 140 on two nodes; 210 on three nodes
Figure 10 Test Matrix
To simulate a real world customer environment and scenario, 200,000 batches were pre-populated into the system with job done status before the test was started. These batches were present but were not processed during these scalability tests. The TravelDocs sample application was used in the test. Each batch contained 20 TravelDocs sample images.
Test Workflow
As shown in Figure 11, two groups of users, Scanners and Verifiers, represent two typical user roles in daily work. Scanners perform the “NScan” and “NUpload” operations to create batches in the Datacap system. Verifiers perform the “Verify” operation to review the submitted batches.
Figure 11 Datacap Navigator Interactive End_to_End Workflow
Test Methodology
Rational Performance Tester (RPT) drives the Datacap Navigator workload. Two execution groups are
defined in the workload schedule. The first one runs "NScan" and "NUpload" tasks, the second one runs
the "Verify" task. The system throughput is measured by the number of batches completed per minute.
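The throughput metric is simple to compute. A minimal sketch (the batch count and run duration are illustrative numbers, not measured test data, though the result happens to match the roughly 70 batches/minute reported for this scenario):

```python
def batches_per_minute(completed_batches, elapsed_seconds):
    """System throughput: completed batches per minute of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return completed_batches * 60.0 / elapsed_seconds

# Example: 2,100 batches completing "Verify" during a 30-minute measurement
# interval works out to 70 batches per minute.
print(batches_per_minute(2100, 30 * 60))  # prints 70.0
```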
Figure 12 Rational Performance Tester Test Schedule
Test Results
Vertical Scalability
Datacap Web Services (wTM) is the Datacap API used by Datacap Navigator to invoke “back end” server-side processing. These services include queuing batches and running rules (performing unattended tasks such as OCR). wTM scales vertically by spawning either a thread or a separate Windows process to handle each client request. This option is controlled by a Windows registry setting called eRRO. When eRRO is enabled, wTM spawns a separate process for each request (see the Monitoring & Tuning section for details of the configuration). Because enabling eRRO for the wTM component improves scale-up, this configuration was applied in the vertical scalability test. The test result demonstrates that the Datacap Navigator client component has throughput performance proportional to the number of concurrent users (test threads). Meanwhile, the CPU usage of the wTM server was also linear with the number of threads.
Figure 13 Interactive End-To-End Vertical Scalability Test - Throughput
Figure 14 Interactive End-To-End Vertical Scalability Test - wTM CPU Usage
Horizontal Scalability
The test result shows that the Datacap Navigator Client component has linear throughput performance with the increasing number of wTM server nodes.
Figure 15 Interactive End-To-End Horizontal Scalability Test – Throughput
Figure 16 Interactive End-To-End Horizontal Scalability Test - wTM CPU Usage
Scenario #3: wTM Transactional Endpoint Scalability Test
Datacap Web Services is a Windows web service or Microsoft IIS-based web service for interaction with Datacap through a simple, platform-independent, representational state transfer (REST) application programming interface (API). The Transaction endpoints process documents directly by executing rules immediately without using a Datacap Server, batch queuing or databases.
Test Matrix
For assessing performance of the wTM Transaction endpoints, vertical and horizontal scalability was tested. Vertical scalability was evaluated by varying the number of concurrent client users (test threads). These tests demonstrate the trend of fully using available system resources, especially CPU usage. For horizontal scalability, the number of Datacap wTM server nodes was increased from 1 through 3 to demonstrate the trend of linear performance scale-out.
Vertical scalability:   10, 50, 100, 150, and 200 virtual users on one node
Horizontal scalability: 150 virtual users on one node; 300 on two nodes; 450 on three nodes
Figure 17 Test Matrix of wTM Transactional Endpoint Scalability Test
To simulate a real world customer environment and scenario, 200,000 batches were pre-populated into the system with job done status before the test was started. These batches were present but were not processed during these scalability tests. The TravelDocs sample application was used in the test. Each batch contained 20 TravelDocs sample images.
Test Methodology
An API-based load test framework drives a wTM workload similar to that of Datacap Navigator. The test driver exercises the wTM Transactional Endpoints in sequence to complete a transaction: Logon -> TransStart -> UploadFiles -> OCR -> ReturnResultFile -> TransEnd -> Logoff. The system throughput is measured by the number of batches completed per minute.
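The endpoint sequence above can be sketched as a small test driver. This is a hypothetical sketch: the call order comes from this paper, but the base URL, the injected `post` helper, and the idea of one plain POST per step are illustrative assumptions, not the documented Datacap Web Services API.

```python
# Hypothetical driver for one wTM Transactional Endpoint sequence.
# The step order matches the load test described in this paper; everything
# else (URL shape, payloads) is an assumption for illustration only.
TRANSACTION_STEPS = [
    "Logon", "TransStart", "UploadFiles", "OCR",
    "ReturnResultFile", "TransEnd", "Logoff",
]

def run_transaction(post, base_url="http://wtm-server/dcrest"):
    """Drive one transaction by calling each endpoint step in order.

    `post` is any callable that takes a URL and returns a response object;
    injecting it keeps the sketch testable without a live wTM server.
    """
    responses = []
    for step in TRANSACTION_STEPS:
        responses.append(post(f"{base_url}/{step}"))
    return responses
```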
Figure 18 Test Workflow of wTM Transactional Endpoint Scalability Test
Test Results
eRRO is enabled on the Datacap wTM servers for both the vertical and horizontal scalability tests to achieve the best scalability.
Vertical Scalability
The test result demonstrates that the Datacap wTM component has linear throughput performance with increasing numbers of concurrent users (test threads). Meanwhile, CPU usage of the wTM server (the rule running component used by Datacap Navigator Client) increased linearly as well.
Figure 19 wTM Transactional Endpoint Vertical Scalability Test -Throughput
Figure 20 wTM Transactional Endpoint Vertical Scalability Test - wTM CPU Usage
Horizontal Scalability
The test result shows that the Datacap wTM component has linear throughput performance with the increasing number of wTM server nodes.
Figure 21 wTM Transactional Endpoint Horizontal Scalability Test - Throughput
Figure 22 wTM Transactional Endpoint Horizontal Scalability Test – wTM CPU Usage
Monitoring & Tuning
CPU/Disk/Network
To determine the optimal configuration for the Datacap and IBM Content Navigator, monitor the CPU,
memory, network, and disk. Depending upon the operating system, nmon or perfmon can be used to
monitor the server statistics and help identify areas for tuning.
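On Windows, perfmon counters can be exported to CSV (for example with `typeperf` or `relog`) and summarized offline. A minimal sketch, assuming the usual perfmon CSV layout (first column is the timestamp, one column per counter path); the sample data and server name are illustrative:

```python
import csv
import io

def average_counter(perfmon_csv, column_substring):
    """Average one counter column from a perfmon CSV export.

    Assumes the perfmon/relog CSV layout: the first column is the timestamp
    and the remaining columns are counter paths. `column_substring` selects
    the counter by a fragment of its header, e.g. "% Processor Time".
    """
    reader = csv.reader(io.StringIO(perfmon_csv))
    header = next(reader)
    idx = next(i for i, name in enumerate(header) if column_substring in name)
    values = [float(row[idx]) for row in reader if row[idx].strip()]
    return sum(values) / len(values)

# Illustrative two-sample export from a hypothetical Rulerunner server "RR1".
sample = (
    '"(PDH-CSV 4.0)","\\\\RR1\\Processor(_Total)\\% Processor Time"\n'
    '"04/01/2016 10:00:00","72.5"\n'
    '"04/01/2016 10:00:15","77.5"\n'
)
print(average_counter(sample, "% Processor Time"))  # prints 75.0
```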
Datacap monitoring and troubleshooting
The IBM Datacap V9.0.1 Knowledge Center describes multiple options for monitoring and troubleshooting an IBM Datacap system.
Examining Rulerunner thread logs and Windows Event Viewer logs can help troubleshoot
Rulerunner Server.
The IBM Content Navigator logs can be used to troubleshoot Navigator.
Enabling logging is useful to troubleshoot many types of errors in any Datacap component. The
logging option to “flush buffers” should not be enabled in most cases, as this has a severe
performance impact. “Flush buffer” is needed to force out all messages in case of a hang or
crash where the log is incomplete. Logging levels can be adjusted to limit the size of log files.
Enabling logging for Datacap Server service
Enabling logging for Rulerunner Service
Enabling logging for the Datacap Navigator
Enabling the Datacap Web Services log
Enabling logging for Datacap Desktop
Datacap Dashboard
Datacap dashboard is a new feature in Datacap V9.0.1, which can help in the following areas:
Provide a visual interface where overall capture system progress, success, problems,
and bottlenecks can be rapidly observed.
Highlight problems and exceptions on the system proactively.
Help the business user or IT administrator define threshold values on key metrics that
can be used to provide proactive notifications.
Best Practices and Tuning Recommendations
Performance tuning for IBM Content Navigator
Increasing JDBC connection pool sizes for WebSphere Application Server
If errors occur in IBM Content Navigator during high-volume conditions or long duration
transactions, you can increase the maximum connections for the JDBC connection pools. The
benefit of increasing the JDBC connection pool size is that you create more connections for users
to access IBM Content Navigator.
To increase the JDBC connection pool size:
1. In WebSphere® Application Server, browse to Resources > JDBC > Data Sources.
2. Select the data source to modify.
3. Click Connection pool properties.
4. Change the Maximum Connections setting from its default of 10 to a higher number, such as 100.
5. Click Apply.

Over time, gradually adjust the maximum value to allow for more connections if necessary. Increase the number of connections in increments of 100; if performance is still constrained, raise the value by another 100 connections.
Improving performance for WebSphere Application Server and Windows
Set the WebSphere Application Server webcontainer threadpool property to a minimum size
of 50 and a maximum size of 100.
Set the WebSphere Application Server heap size property to 768 - 1024 MB.
For the Windows TCPIP limit property, set the TCPIP connection number to 150 and the
TCPIP port number to 10000.
For more detailed information, see the “Performance tuning for IBM Content Navigator” section in the IBM Content Navigator V2.0.3 Knowledge Center.
Performance tuning for databases
Performance tuning for DB2
Use a buffer pool with a 32K page size for the DB2 Datacap databases. A sample script follows:
db2 connect to dcdb
db2 list tablespaces show detail
db2 create bufferpool dcdb_bp size 262144 pagesize 32K
db2 create regular tablespace dcdb_ts pagesize 32k bufferpool dcdb_bp
db2 create user temporary tablespace utemp32 pagesize 32k bufferpool dcdb_bp
db2 create system temporary tablespace stemp32 pagesize 32k bufferpool dcdb_bp
db2 drop tablespace USERSPACE1
db2 connect reset
Please adjust the size of the bufferpool (262144 is used in the above example) to a value suitable
for your system.
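As a sanity check when sizing, the pool's memory footprint is simply pages times page size; the sample script's 262,144 pages of 32 KB come to 8 GiB, which must fit comfortably within physical memory:

```python
def bufferpool_bytes(pages, page_size_kb=32):
    """Memory footprint of a DB2 buffer pool: page count times page size."""
    return pages * page_size_kb * 1024

# The sample script's pool: 262,144 pages x 32 KB = 8 GiB.
print(bufferpool_bytes(262144) / 2**30)  # prints 8.0
```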
Considerations when choosing the DB2 buffer pool size:
A buffer pool is associated with a single database and can be used by more than one table
space. When considering a buffer pool for one or more table spaces, you must ensure that
the table space page size and the buffer pool page size are the same for all table spaces that
the buffer pool services. A table space can only use one buffer pool.
The goal of buffer pool tuning is to help DB2 make the best possible use of the memory
available for buffers. The overall buffer size has a significant effect on DB2 performance,
because a large number of pages can significantly reduce I/O, which is the most time-
consuming operation.
Make sure that the total buffer size does not exceed the available physical memory. Otherwise, DB2 falls back to a minimal system buffer pool for each page size, and performance is sharply reduced.
With DB2 Version 8 and higher, you can change buffer pool sizes without shutting down the
database. The ALTER BUFFERPOOL statement with the IMMEDIATE option takes effect
right away, unless there is not enough reserved space in the database-shared memory to
allocate new space.
DB2 Version 9.1 and higher enables fully automated size management of a bufferpool. DB2
self-tuning memory manager (STMM) controls this automation process.
Performance tuning for Microsoft SQL Server
Make sure a database maintenance plan is in place and run periodically.
Once the system contains a large number of batches, or when the number of batches
changes dramatically, increase performance using the SQL Server Rebuild Index Task.
If you have full transaction logging enabled, put the logs on a separate disk from the data for
best performance.
Follow other Microsoft recommended DB optimization and maintenance practices.
Performance Best Practices for Datacap
Batches with one or two pages require more overhead than larger batches. If possible,
provide Datacap with batches containing at least 4 pages on average.
Configure enough Rulerunner Server threads to utilize up to 80% of the server’s available
CPU when fully loaded. Start by considering relatively CPU intensive tasks such as OCR or
document conversion. Configure a thread to perform all (any) of these tasks. Clone that
thread so that the total number of threads is between 100% and 150% of the number of CPU
cores available on that server. Then add as many threads as needed that spend much of
their time idle, such as vscan and I/O intensive tasks.
Larger input files with high resolutions require more time to process than smaller files with the same content. 200 to 300 DPI is recommended for black-and-white images, and 150 to 200 DPI for color and grayscale images.
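The thread-sizing guidance above (CPU-intensive Rulerunner threads at 100% to 150% of the server's core count, with I/O-bound threads added on top) can be sketched as:

```python
def rulerunner_thread_range(cpu_cores):
    """Suggested range for CPU-intensive Rulerunner threads per the guidance
    above: between 100% and 150% of the server's core count. Mostly idle,
    I/O-bound threads (such as vscan) are added beyond this range."""
    return cpu_cores, cpu_cores * 3 // 2

# On the 20-core lab servers used in this paper:
low, high = rulerunner_thread_range(20)
print(low, high)  # prints 20 30
```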
Optimizing Datacap performance in general, especially important for high volume
systems
Use Datacap Maintenance Manager or other tools to archive completed batches after some
time.
If the number of batches or queues in the system exceeds 0.5 to 1 million, monitor database performance.
Follow tuning and maintenance practices recommended by DBMS vendor
Provide high performance NAS and non-congested network paths for the Datacap application
folder and the Batches folder
In very high volume systems, consider spraying batches onto multiple folders (e.g. change
batch folder periodically)
Performance tuning for the Datacap System
During hardware planning, allocate servers with maximum CPU capacity to the Rulerunner
server function, since the Rulerunner component is the most CPU intensive component.
For the Rulerunner service, clear the “Reflush buffer on each message” for "Rulerunner Log"
and clear the “Log Reflush” for “RRS Log” through Rulerunner Manager as enabling them
can reduce performance.
Enabling eRRO can help the wTM server scale up.
1. Stop the wTM service on the wTM server.
2. Open the Registry Editor (“regedit” on Windows), and set the eRRO value under HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Datacap\RRS to 1 to force rules to be run in individual processes.
3. Start the wTM service on the wTM server.
Authors and Contributors
Authors
Mercury Li, ECM Datacap Performance Engineering
Wen Li Zeng, ECM Datacap Performance Engineering
Ang Li, ECM Datacap Performance Engineering
Zhan You Hou, ECM Datacap Performance Engineering
Contributors
Dave Royer, ECM Performance Engineering Architect
Noel Kropf, Datacap Software Architect