
SAP NetWeaver® Business Intelligence
IBM DB2 Database Partitioning Function (DPF)
IBM Tivoli Data Protection
IBM Server P5
IBM Server Storage DS8300

Extreme Business Warehousing Experiences with SAP NetWeaver Business Intelligence at 20 Terabytes

Proof of Concept

Whitepaper Series

IBM SAP International Competence Centre Walldorf, Germany

IBM PSSC – Customer Centre

Montpellier, France

DB2 /SAP Centre of Excellence, IBM Lab Böblingen, Germany

IBM eBusiness Technical Sales Support, Germany

SAP Solution Support – Centre of Expertise Business Intelligence

SAP AG, Walldorf, Germany

Version: 1.2, May 2007


Special Notices © Copyright IBM Corporation, 2007 All Rights Reserved. All trademarks or registered trademarks mentioned herein are the property of their respective holders.


Management Overview of Total Project Goals

SAP NetWeaver® Business Intelligence High-End Scalability Proof of Concept

As many large enterprises approach SAP system sizes which far exceed the boundaries of current experience, the need to pioneer new infrastructure designs, investigate new supporting technologies, and readdress application scalability becomes urgent.

Customer Need
Beginning in December 2005, IBM and SAP engaged in a cooperation on behalf of a large SAP NetWeaver Business Intelligence (SAP BI) customer to perform a proof of concept for the next generation of very high-end BI requirements. This proof of concept demonstrates the strength of the two companies' combined technologies and skills. The challenge was to demonstrate the scalability, manageability, and reliability of the BI application and the infrastructure design as BI requirements begin to necessitate multi-terabyte databases.

High Stakes – Actual and Future Customer Requirements
Based on actual customer data, business processes, and data models, this PoC reproduced the BI system's growth from 7TB to 20TB and finally 60TB. At each of these strategic phases of data growth, a series of critical tests was performed to verify the customer's critical KPIs. One set of these tests focused on infrastructure management, backup/restore, and disaster recovery scenarios. Here the infrastructure had to maintain the windows of time available for such activity as the database increased to nearly 9 times the baseline size. Another set of tests focused on the time-critical application processes, including the high-volume integration of new source data as well as online reporting performance, simulating a 24x7 SLA over multiple geographies, common to many large enterprises. These tests represented a combined load scenario in which cube maintenance, aggregate building, and online reporting, with fixed response-time requirements, ran simultaneously, with the load increasing over the lifetime of the project to 5 times this large customer's current "peak load" volume.

The Investment and Commitment of the Alliance
This project spanned 14 months, occupied a combined IBM/SAP team of 20-25 specialists, had an infrastructure "high-water" footprint of 240TB disk capacity, 448 server CPUs, and 3.5 terabytes of RAM, and included up to 3 test landscapes in parallel.


Results – Customer Satisfaction
The combination of SAP® BI and IBM infrastructure components showed unbeatable scalability, addressing and fulfilling the requirements of high-end BI scenarios in the industry today. This project proved the scalability of SAP NetWeaver® Business Intelligence, version 3.5, beyond all boundaries known to this time, and data from these tests was immediately able to address the concerns of other large companies also moving rapidly toward multi-terabyte Business Intelligence databases.

IBM
The highly scalable IBM System p5, with hardware multi-threading, provided the server basis, and the IBM System Storage DS8300, with FlashCopy technology, the storage infrastructure. The IBM DB2 database with the Database Partitioning Feature (DPF) provided the scalable parallel database infrastructure which mastered the application workload and allowed a full backup of the 20TB database in less than 7 hours.

SAP
The SAP® application and IBM infrastructure delivered a concurrent BI load of 125 million records per hour in cube load, 24 million records per hour in aggregation, and 1.9 reporting queries per second, with 20% of the queries hitting fact tables of 200 million records, at an average response time of 16 seconds.

The IBM/SAP Alliance
All challenges set by the customer were achieved, paving the way for the future BI architecture implementation for this customer world-wide, and providing a general very high-end proof point for best practices and directions for BI. This project is detailed in a joint IBM/SAP Redbook to be generally available in 2007.

Scope of this Document
This document contains a series of whitepapers resulting from phase 1 of this project. As this is written, phase 2 is drawing to a close. The primary objectives of phase 1 were to test the scalability design, make any modifications to the design deemed necessary based on the information gleaned from this phase, and design the high-end scalability hardware for phase 2. The whitepapers included here are fairly technical, but focus on the design, the criteria behind the design, and the perceived success of the implementation.


Table of Contents

Project Introduction
By Herve Sabrie, Project Manager, PSSC Montpellier
- Overview of the business background, goals, and results of this project.

AIX and the SAP NetWeaver BI Combined Load
By Carol Davis, Senior Certified IT Specialist System p SAP Solutions, ISICC
Franck Almarcha, AIX Performance Specialist for SAP, PSSC
Jean-Philippe Durney, IT Specialist System p, PSSC
Dr. Michael Junges, Senior Support Consultant, CoE Technology, SAP
Steffen Mueller, Senior Support Consultant, CoE Technology, SAP
- Insight into the SAP implementation and the technical implications of the customer load KPI goals on the p595 infrastructure.

Storage Design for a High-End Parallel Database
By Philippe Jachymczyk, IT Specialist for Storage Networking, PSSC
- Insights into the storage design and SAN infrastructure for the parallel DB2 database.

Backup/Restore/Recovery of a 20TB SAP NetWeaver Business Intelligence System in a DB2 UDB Multi-partitioned Environment
By Dr. Edmund Häfele, IT Specialist for SAP, eBusiness Technical Sales Support
Thomas Rech, Senior Consultant, SAP / DB2 Center of Excellence, Böblingen
- Insight into the technical decisions and implementation design for IBM Tivoli backup and restore functionality for the multi-partition DB2 UDB ESE DPF installation.


Project Introduction
In the fierce and increasing competition amongst corporations, it has become mandatory to make quick and sound crucial business decisions based on the analysis of business-critical data. This is where data warehouses come into play. SAP NetWeaver® Business Intelligence (SAP NetWeaver BI) consolidates external and internal sources of data into a single repository with powerful search and reporting functionality that aids organizations and enterprises in strategic planning, data management, and archiving.

The objective of this document is to share the experience gained from a large proof of concept running SAP NetWeaver BI under AIX with DB2 DPF, handling a database size of up to 20TB. The target audience for this document is the IT architect community. The motivation is to get the relevant design information gleaned from this project to the architect community (and executive level), with the detailed information to follow in a Redbook.

Background
Over a period of more than 14 months, a team of 20 people was involved in running this PoC. The activity took place at the IBM Products & Solutions Support Centre (PSSC), Montpellier, France. It involved various skills from the ISICC Walldorf, the DB2 Centre of Excellence, and our partner SAP, together with remote support from the IBM laboratories. This PoC was done at the request of a large international company currently running its business warehouse activity on IBM infrastructure. The objective was to chart the future direction this infrastructure should take in order to support the predicted growth of the BI system in terms of size and load. The goals of the PoC were modelled to fulfil the expectations of this customer's workload, based on actual business processes and actual design criteria.

The infrastructure used for this PoC was based on one 64-CPU System p595 connected to a System Storage DS8300. Three full parallel landscapes were installed to allow parallel testing. Two sets of tests were performed in order to demonstrate the stability and scalability of the proposed solution and to assess its performance. The infrastructure tests were run to validate the manageability of an equivalent production system, while the 'online' tests were run to reproduce the daily activity and associated workload.

The Target KPIs and Results The following summarizes the specific KPI requirements which were the challenge of this PoC.


Infrastructure Test KPIs:

KPI1: Simulate recovery of a single day's work after a failure or logical error on the current system. Perform a storage-system "flashback" to restore an earlier FlashCopy image, then roll the database forward, re-applying 500GB of log data.

KPI2: Simulate disaster recovery by restoring an older system image from tape and re-applying several days' worth of logs.
  • KPI2.1: Restore the database from tape.
  • KPI2.2: Roll forward 2TB of log data.

KPI3a: Simulate the FlashCopy backup scenario to be used for daily database backups by performing a FlashCopy and backing the copy up to tape.

KPI3b: Perform an incremental FlashCopy of a full day's modifications (modifications producing 500GB of database logs) with concurrent online activity.

KPI3c: Demonstrate the effect of an online backup (from database to tape) on the performance of the online reporting workload.

KPI4: Completely rebuild 3TB of indices.

KPI     Load Characteristic                        Challenge                                   Achieved
KPI1    Flashback + roll forward (500GB logs)      Restore + roll forward < 8 hrs              4 hr 30 min
KPI2    Disaster recovery from tape (2TB logs)     Restore + roll forward < 18 hrs             7:40 + 4:40 = 12 hr 20 min
KPI3a   Daily FlashCopy                            Tape backup of the FlashCopy < 8 hrs        6 hr 35 min
KPI3b   Incremental FlashCopy                      Update current copy < 8 hrs                 30 min
KPI3c   Online tape backup                         Backup with < 20% degradation to online     6 hr 58 min, < 15% degradation
KPI4    Recreate indices                           Rebuild indices < 2 hrs                     1 hr 5 min

Concurrent Load Test KPIs:

KPIA: Simulate the current "worst-case" load observed in customer production as the scalability baseline.
KPID: Simulate the short-term growth expectation. (ref: official run KPID-2.2)

KPI     Load Characteristic    Challenge                        Achieved
KPIA    Cube Loading           25 M records/hr                  25,353,800/hr
        Data Aggregation       5 M records/hr                   6,493,607/hr
        Reporting              50 navigations/min (< 20 sec)    77/min (13.6 sec)
KPID    Cube Loading           75 M records/hr                  125.38 M rec/hr
        Data Aggregation       15 M records/hr                  24.67 M rec/hr
        Reporting              75 navigations/min (< 20 sec)    115/min (11.76 sec)
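The restore and roll-forward steps behind KPI1 and KPI2 correspond to standard DB2 recovery commands. The following is a minimal sketch of that flow as a Python wrapper around the db2 command-line processor; the database alias, backup timestamp, and the use of TSM are placeholders, and on a real DPF system the restore is issued per partition (catalog partition first) before a single roll forward is run from the catalog node.

```python
# Minimal sketch of the KPI1/KPI2 recovery flow on DB2 with DPF.
# Database alias, timestamp, and storage manager are hypothetical placeholders.
import subprocess

DB = "BWP"                      # placeholder database alias
TIMESTAMP = "20061105123000"    # placeholder backup image timestamp

def db2(cmd: str) -> None:
    """Run one DB2 CLP command and fail loudly if it returns an error."""
    print(f"db2 {cmd}")
    subprocess.run(["db2", cmd], check=True)

# KPI2.1: restore the database from the tape (TSM-managed) backup image.
# On DPF this is issued per partition (catalog partition first), e.g. via db2_all.
db2(f"RESTORE DATABASE {DB} USE TSM TAKEN AT {TIMESTAMP} REPLACE EXISTING")

# KPI1 / KPI2.2: roll forward through the archived logs (500GB resp. 2TB)
# and complete, returning the database to a consistent, connectable state.
db2(f"ROLLFORWARD DATABASE {DB} TO END OF LOGS ON ALL DBPARTITIONNUMS AND COMPLETE")
```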


Executive Summary
Implementing and running a business warehouse is part of the business strategy of large customers. This repository of consolidated data, combining production data extracted from day-to-day operations as well as sales data representing market trends, is considered a strategic tool for informed business decisions and further business development. For international companies running businesses all over the world, such consolidated warehouses can quickly become very large. Positioned at the heart of their strategy, the reliability, scalability, and manageability of the underlying solution are essential and in every sense vital.

IBM System p and DB2 Universal Database software have repeatedly demonstrated record-setting results in the SAP standard benchmarks. By successfully running a complex proof of concept for SAP NetWeaver BI, we reached remarkable results based on 'real' customer data, reproducing 'real' production system behaviour, and, more importantly, we demonstrated the scalability of our solution and its ability to cope with future business growth.


AIX and the SAP NetWeaver BI Combined Load

Contents

Server Concept for the Combined Load Scenario
  Starting Point and Approach
  Overview of the Application Combined Load Tests
  Overall Concept of Reporting and Maintenance
  Applicability of the PoC to Real-Life BI Scenarios
  POWER5 in the PoC: Environment and Configuration
  SAP Application Servers and Load Balancing
PoC Results: Combined Load at 20TBs
  KPI Achievements
  Resource Requirements
Tuning of the Application Combined Load
  Characteristics and Tuning Options of Cube Load
  Characteristics and Tuning of Query
  Characteristics of Aggregation and Job-Chain Design
  Monitoring and Analysis of Combined Load
P5 Virtualization
Summary

Server Concept for the Combined Load Scenario
This whitepaper describes the load scenarios which were closely modelled on the activity and load monitored on the customer production system and simulated to drive the PoC. The load scenarios do not include the full production cycle but target those aspects defined as critical load. This section focuses on the SAP application servers used for the combined load and on the p595 systems hosting both the application servers and the database. The whitepaper covers the approach taken to achieve maximum efficiency in system utilization and the means of measurement, and it summarizes resource utilization. It follows the p595 servers through the landscape changes required to achieve the necessary combined load KPIs and to take advantage of technology changes beneficial to performance. The infrastructure basis went through a number of technical upgrades and migrations, paving the way for the customer migration planned for the implementation of the new system design. This paper documents the effect of the various modifications on the projected production load.

Starting Point and Approach
In this proof of concept, the starting point was the hardware and software infrastructure currently in production at the customer site. This hardware/software level provided the baseline at the start of the project and determined the progress through the infrastructure migration. The PoC is based on a clone of the customer system.


Fig. 1 Infrastructure Migration Path

Phase 1 (7TB - 20TB): AIX 5.2 / DB2 V8  →  AIX 5.3 / DB2 V8  →  AIX 5.3 / DB2 V9
Phase 2 (20TB - 60TB): AIX 5.3 with micro-partitions / DB2 V9

In Phase 1 of the project, covered by this document, the p5 servers were limited to dLPAR configurations, with the requirement to distribute system resources in a static manner. During the life-cycle of the project it was of course possible to reconfigure the server environment for better load balance; however, it was neither feasible nor desirable to attempt any dynamic reconfiguration during a test run. Where a non-optimal CPU distribution over LPARs, or a memory over-commitment, resulted in poor performance, the configuration was modified for the subsequent runs. As in a typical production environment, the system configuration selected needed to cope with the varying load profiles of the production day.

In the baseline phase of the project, load balancing was done without the use of hardware multi-threading (SMT), as this was not yet implemented at the customer. The configuration of the SAP environment was tailored to reflect the underlying hardware infrastructure, so that during the migration to AIX 5.3 and SMT the SAP instances were also rebalanced for improved scalability. From the experience gained in Phase 1, a proposal for a viable micro-partition implementation was designed, allowing dynamic processor sharing during the highest of the load tests. Phase 2 will be based on micro-partitioning; this is described later in detail. In addition to these changes across the landscape, the following infrastructure changes were validated:

Fig. 2 Infrastructure Upgrade Path

Baseline: DB server p595 POWER5 with 256GB, storage server DS8000, DB2 V8, AIX 5.2
- Software upgrade: AIX 5.2 to 5.3
- HW upgrade: DB server p595 memory from 256GB to 512GB
- Technology upgrade: storage server DS8300 to Turbo
- Technology upgrade: DB server from 1.9 to 2.3 GHz
- Technology upgrade: application servers from 1.9 to 2.3 GHz
- Software upgrade: DB2 from V8 to V9
- Architecture migration: DB2 LPARs split over 5 p595s

1. The starting point for the AIX implementation was the customer's version, AIX 5.2. This was upgraded to AIX 5.3 with SMT functionality.


2. The original hardware for phase 1 was a 64-way p595 with 256GB of memory. This proved to be a limitation for the database, and additional memory was added for the DB LPARs.

3. The DB storage server was upgraded to the new "Turbo" technology, increasing the speed of the FlashCopy and increasing the bandwidth. This played more of a role in the infrastructure tests than in the load tests, as the original storage server was not near its limits; the additional capacity therefore had little effect on the throughput or run times of the combined load tests.

4. The database LPARs were then upgraded to the new p595 technology, with a speed bump from 1.9 to 2.3 GHz. This improvement was visible in both database-heavy components of the combined load: query and aggregation.

5. The database started with the customer's current implementation on DB2 V8 and was upgraded during the project to the new V9. The result was improved scalability across more processors.

6. The application server machines were upgraded with a speed bump from 1.9 to 2.3 GHz. This improvement was very visible in the data load scenario, which is very much application-server heavy.

7. In preparation for the move to the new architecture in phase 2, the database was tested on the new hardware for verification at 20TB. This move included the migration of the DB from 5 LPARs on a single p595 to the same LPAR configuration but on 5 p595s (one LPAR on each), and the new storage configuration spanning multiple DS8300s.

The methodology for moving across the landscape was to maintain a reference point for each move. A change in test methodology, a change in OS settings or configuration parameters, or a change in middleware settings or versions was always done with a back-link to the previous runs. The intention is to ensure that run data can be compared across the test scenarios by means of either direct comparability or extrapolation.

Fig. 3 Back-link for each migration step

Load tests were back-linked across the configuration steps: p595 / DB2 V8 / 256GB → p595 / DB2 V8 / 512GB → p595+ / DB2 V8 / 512GB → p595+ / DB2 V9 / 512GB (Load Test 1 through Load Test 5).


Overview of the SAP NetWeaver BI Components

BI Processes: A simple SAP NetWeaver BI (SAP BI) upload and reporting process comprises the following steps (transactional data only):
1) Extraction of transactional data from a source system (e.g. SAP® ERP) and update of the data into the EDW layer (ODS object)
2) Update from the EDW ODS object to the InfoCube(s) using update rules
3) Rollup of the data into the InfoCube's aggregates
4) Reporting on the data in the InfoCube and its aggregates
Steps two to four are covered in the online workload scenario in the PoC.

BI objects used

DataSource: The DataSource comprises the extractor program, the extract structure (those fields that are delivered by the extractor), and the transfer structure (those fields that are transferred to the target system, in general a subset). The DataSource must be maintained in the source system (which in our case is the SAP BI system itself), and its metadata has to be replicated into the target system. When we speak of a DataSource, we often refer directly to the extractor. The extractor program itself is triggered and controlled by an InfoPackage. The DataSources used in the PoC are all so-called Export DataSources, built on SAP BI DataProviders (ODS objects). The naming convention is 8[name of the DataProvider]. The corresponding InfoSources follow the same naming convention.

InfoPackage: An InfoPackage contains the metadata that controls the behaviour of an extractor (which data is to be extracted, full or delta) and the data flow in SAP BI (which data targets are to be updated, including the PSA).


Triggering an InfoPackage essentially means starting the extractor in the source system, including the whole upload process in SAP BI.

InfoSource: An InfoSource essentially contains the mapping scheme from the transfer structure (in general, SAP R/3® fields) to the so-called communication structure (SAP BI InfoObjects). DataSources are assigned to InfoSources, and update rules can be defined from the InfoSource's communication structure to DataTargets.

Update Rules are used for the transformation and enrichment of data coming from a DataSource; they are DataTarget-specific. They can be a simple 1:1 mapping but can also contain complex ABAP routines performing lookups, calculations, etc. By creating update rules, multiple DataTargets can be assigned to an InfoSource (and thus to a DataSource).

Overview of the Application Combined Load Tests
This section describes the profile of the different load types in terms of their behaviour and resource consumption. One important consideration during a stress test of this kind is the granularity of the load; the granularity determines the level of fine tuning possible. In a stress test combining several different load profiles, the finer the granularity of each load type, the greater the chance that the loads can be balanced.

The combined load test scenario consisted of three different load types, each with a very different profile and a different KPI requirement. They represent the online query execution and the cube maintenance necessary to bring new data online for reporting. The online users are simulated by queries initiated by the Load Runner tool. Cube maintenance includes two activities, data load and data aggregation, both initiated by SAP process chains (batch).

Characteristics of the Cube Load Scenario: Transforming new data into the format defined for the reporting objects and loading the data into the target objects.
Profile: Pseudo-batch (a batch driver spawning massively parallel dia-tasks)

This scenario is the upload of data from the ODS into the InfoCubes. Data is extracted from the source repository, in this case an ODS, using selection criteria (DB2 select), processed through complex translation rules (CPU intensive), and then written into the target InfoCube (DB2 insert). This load allows for a wide variety of configuration options, the level of parallelism, the size of the InfoPackages, the number of target cubes per extractor, and the load-balancing mechanism being the most significant. The translation rules provided by the customer for this scenario were extremely complex and represented their "worst case".
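The extract-and-dispatch pattern just described (a batch extractor reading data blocks and handing each block to an asynchronous dialog task for translation and insert) can be sketched outside SAP as a driver with a worker pool. Everything below, including the package size, the fake source, and the "translation rule", is an illustrative stand-in rather than the PoC's actual implementation.

```python
# Illustrative sketch of the cube-load pattern: one "extractor" reads the source
# in packages and hands each package to an asynchronous worker ("dialog task")
# that applies CPU-intensive translation rules and writes to the target cube.
# All names and the rule itself are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

PACKAGE_SIZE = 80_000          # rows per InfoPackage request, as tuned in the PoC

def read_ods(package_size):
    """Stand-in for the extractor's DB2 SELECT against the source ODS."""
    source = range(1_000_000)                      # fake source rows
    rows = []
    for row in source:
        rows.append(row)
        if len(rows) == package_size:
            yield rows
            rows = []
    if rows:
        yield rows

def translate_and_insert(package):
    """Stand-in for the translation rules plus the INSERT into the target cube."""
    transformed = [r * 2 for r in package]         # placeholder "update rule"
    return len(transformed)                        # pretend these rows were inserted

with ThreadPoolExecutor(max_workers=16) as dialog_tasks:   # parallel dia-tasks
    futures = [dialog_tasks.submit(translate_and_insert, pkg)
               for pkg in read_ods(PACKAGE_SIZE)]
    loaded = sum(f.result() for f in futures)

print(f"records loaded: {loaded}")
```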


Fig. 4 Data Load Scenario

(Source ODS object → batch extractor job → asynchronous dialog tasks applying the translation rules to each data block → target InfoCubes)

The diagram above shows the upload process and its components. The batch "extractor" selects the data in blocks from the source and initiates an asynchronous dialog task to take over the job of processing the block through the translation rules and updating the target InfoCube(s). The dialog tasks can run locally or on another application server in the system.

Characteristics of Aggregate Build: Aggregating the new data according to rules defined to improve the access efficiency of known queries.
Profile: Batch

The aggregation of the loaded data is primarily database intensive. The rollup in these tests was done sequentially over 10 aggregates per cube. There is not much configuration or tuning possibility for the aggregate load. The three options available in this scenario were:

1. Type of cube: profitability analysis or sales statistics. The aggregate hierarchy structure defined for the cube type is decisive for the rollup performance.

2. Number of cubes to use: this was based on the number required to reach the KPI throughput. The granularity here was a batch job (one cube more or less), so there was not much room for fine tuning.

3. The block size used. Little guidance was available on this option, so the customer setting was used. Later tests verified this selection.

Characteristics of the Online Query: End-User Reporting
Profile: Online

The query load consists of 10 different query types with variants which cause them to range over 50 different InfoCubes used for reporting. The queries are designed such that some of them use the OLAP cache in the application servers, some use aggregates, and others go directly to the fact tables. This behaviour is defined in detail elsewhere in this document.


The query load is database focused and database "sensitive": competition for database resources is immediately translated into poor response times. The query KPI is the most delicate of the combined load types. Tuning of the queries was restricted to database methods to improve access paths and database buffer quality.

Combined Load
Fig. 5 Load Distribution – the relative ratio of CPU utilization of the load types; scaling the load will increase the CPU requirement in this ratio.

This graph shows the load distribution of the different load profiles in terms of the ratio of physical CPUs utilized (phyc) on the application servers vs. the database servers. It was created using calibration tests of the load types in isolation and then of all the load scenarios together (KPI-A). (KPI-A represents 25 million records per hour uploaded, plus 5 million records per hour aggregated, plus 0.8 queries per second with an average response time under 20 seconds.) Query is 2 to 1 in favour of the database and upload is 6 to 1 in favour of the application server, for example.

Targets for Combined Load: Statement of Work
Fig. 6 PoC Targets over 3 Scaling Points

Combined load requirements over the three scaling points:

KPI     Cube Load (M rec/hr)   Aggregation (M rec/hr)   Reporting (queries/sec)
KPIA    25                     5                        0.8
KPID    75                     15                       1.25
KPIG    125                    25                       2.08


The combined load scenarios in the PoC were to scale from the customer's current peak workload to 5 times this workload. KPIA was tested at 7 and 20 terabytes on hardware representing the current customer installation: one p595 and one DS8300 storage server for the database. KPID spans the current hardware and the new hardware for phase 2, where the database will be distributed across multiple p595s using micro-partitioning and multiple storage servers. KPID will span 20 to 60TB. KPIG will be carried out in phase 2 only, on the 60TB database and the phase 2 hardware.

Overall Concept of Reporting and Maintenance
Fig. 7 PoC Cube Use Categories: 50 cubes for daytime (reporting), 50 cubes for nighttime (maintenance)

One important point to note in this PoC is that the scenario is based on a "follow the sun" design, with several independent geographies or regions. Reporting is a daytime activity; maintenance is a nighttime activity. The objective is never to maintain cubes which are currently active for reporting. Therefore, in these scenarios, there are usage categories for the cubes: those active for reporting (the target of the queries), those being loaded, and those being aggregated. In the real-life scenario, cubes would be maintained during off-hours: loaded, aggregated, statistics updated, and then released for reporting. The throughput requirements for this PoC represent the windows of time required to do this maintenance.

Applicability of the PoC to Real-Life BI Scenarios
Unlike classic SAP ERP systems (e.g. SAP R/3), SAP NetWeaver BI systems are very flexible with regard to customizing data models, data transformation rules, and the design of reporting queries. Whereas in SAP R/3 mostly standard transactions are used, whose logic yields essentially the same workload in different customer installations since they are only customizable to a certain degree, the processes in SAP BI are highly adaptable to the customer's specific requirements. This applies to:

• the data model of the InfoCubes (and thus the star schemas) and the ODS objects

• the data flow for staging the data within SAP BI (e.g. using PSA, staging in (EDW)-ODS)

• the data transformation rules for modifying, enriching, and adjusting the data for the InfoProviders

• the design of the queries with regard to
  o the structure of the data that is to be displayed
  o the complexity of calculated key figures
  o the amount of data that is to be read



  o the number of InfoProviders that are accessed (MultiProvider)
• the connectivity to other systems

Transferability: There is no 'norm' as to how an SAP BI system should be built, and customers' systems look quite different even when similar processes (e.g. SCM Analytics, HR Reporting) are implemented. There is a wide variety of business reporting requirements across different companies, making it necessary to adjust the BI system in many different areas to the customer's needs. Hence, customer BI installations vary greatly with regard to the layout of the data model, the data flow, etc. The system that has been set up for the Proof of Concept therefore does not allow a detailed (1:1) transfer of the results to other customer BI installations, e.g. with respect to data throughput or the number of navigation steps per hour. The performance values achieved in the different areas (reporting, data upload, and aggregate rollup) depend heavily on the data model and the implemented business logic. Some of the factors that affect these key figures are:

• Reporting structure – whether a single InfoCube or a complex MultiProvider
• InfoCube design, aggregates, and OLAP features
• Complexity of the transfer and update rules
• InfoProvider design (number of characteristics, cube dimensions)
• Data flow design
• Type and number of source systems
• Aggregate structure, hierarchy, and size ratio

What has been set up in the Proof of Concept for producing the data load is intended to simulate a comparatively heavy workload in all three areas. All InfoCubes are updated using comparatively complex (i.e. CPU-intensive) update rules; we expect the average update logic in typical BI installations to be less complex and to produce less workload. The throughput numbers (records/hour) denote the number of records written to the fact table of the target InfoCube; the ratio between records read and records written was approximately 3:2, i.e. one third of the records read from the source are deleted in the update rules. The start routine performs lookups from various master data tables, ODS active-data tables, and DDIC tables, and can be considered more expensive than average, producing additional DB accesses. Additionally, only DataMarts are used in the data load scenario. This means that the whole extractor workload is also produced within SAP BI, which is not usually the case in customer installations (where data is extracted in other systems). All these aspects should be factored in when transferring the throughput figures to other installations.

The reporting scenario comprises a mixture of OLAP cache and aggregate use which, from our experience, can be considered representative of a typical customer installation: 50% OLAP cache usage, 80% aggregates. However, there are no general statistics on these figures across different customer systems and, of course, these ratios vary considerably in other installations, depending on the degree of optimization for such a scenario. All queries run on MultiProviders.


The use of MultiProviders is state of the art and is recommended for performance improvement. Instead of running a query on one large InfoCube containing data from a long time period, it is better to split the data with respect to a (time) characteristic across multiple InfoCubes and to run the query on a MultiProvider built on these InfoCubes. This allows parallelization of the InfoCube accesses, which is in general faster than access to one very large InfoCube fact table. The aggregates are structured in a relatively flat aggregate hierarchy (two levels), and the InfoCubes used for the rollup process have 10 and 21 aggregates, respectively. Certainly, the rollup performance depends heavily on the aggregate hierarchy, i.e. the size and structure of the source tables (fact tables of the InfoCube or basis aggregates). Basis aggregates are mainly used to support the rollup process by serving as a data source: aggregates can be built out of other aggregates (of higher granularity) to reduce the amount of data to be read and, hence, to improve the rollup performance. The aggregate hierarchy is determined automatically. Because of these dependencies, the throughput figures for the rollup process are only to a certain extent transferable to other scenarios.

Summary: Keeping in mind that customer BI installations vary in many aspects, the different KPI figures achieved in this PoC can be taken as performance indicators which can be transferred to other installations to a limited extent, provided the complexity of the tested scenarios is taken into account. Unlike the SAPS (SAP Application Performance Standard) values achieved by benchmarking SD applications in SAP ERP, the throughput values obtained in this PoC cannot be considered standardized figures, since they are specific to the implemented scenarios. However, the purpose of this PoC is to show that the SAP NetWeaver Business Intelligence application, together with the IBM infrastructure, can handle heavy online application activity in combination with infrastructure workload for a large (>20TB) database, while still providing stability and manageability of the solution.


POWER5 in the PoC: Environment and Configuration
Fig. 8 PoC P5 Configuration Baseline: Single p595

The original SoW defined a single p595 for Phase 1 of the project. The ratio of application server to DB server requirements for the data-load throughput targets forced the addition of further application servers. This diagram shows the logical implementation of the baseline installation. This remained the central configuration throughout phase 1, with additional application servers added to support the upload. The 33 database instances were distributed over 5 physical LPARs: the first LPAR was dedicated to DB2 node 0, the focus of the client activity, and each of the additional 4 LPARs housed 8 DB2 instances. This is described in detail in the DB2 section of this document.

A single LPAR was dedicated to the application servers. At the point the project began, it was not clear what the load distribution would be, and without micro-partitioning available, the best option was to combine the three instances in a single LPAR. The load types were separated into dedicated application servers so that resource utilization and behaviour could be tracked:
  o CI: administration and aggregation
  o Cube load
  o Online reporting (query)

LPAR0 (DB2): NODE0
LPAR1 (DB2): NODE6 - NODE13
LPAR2 (DB2): NODE14 - NODE21
LPAR3 (DB2): NODE22 - NODE29
LPAR4 (DB2): NODE30 - NODE37
LPAR5 (SAP): CI 00, APP 01, APP 02
IBM p595, 64 CPUs at 1.9 GHz, 256GB RAM


Fig. 9 Design of Application Server: Baseline

Throughout phase 1, the application server load was divided across different dedicated instances using network aliases. This allows the instances to appear as if they were installed on separate servers while retaining the flexibility of CPU sharing within the LPAR. Once the load distribution was known, it would be possible to separate the instances onto different servers without any change in the job chains, Load Runner access, or monitoring overviews. The picture shows the concept on a single Ethernet adapter, covering both the front-end network used for online access and the backbone server network connecting the application server(s) to the database.

Final Hardware Configuration for KPI-A 20TB
Fig. 10 Network and Communication Flow

SAP servers:
  sys3ci LPAR (virtual hosts sys3ci, sys3onl, sys3btc): 11 CPUs, 30GB
  sys3as03 / sys3as05 LPAR: 32 CPUs, 94GB
  sys3as04 / sys3as06 LPAR: 32 CPUs, 94GB
DB servers:
  sys3db0: 12 CPUs, 25GB
  sys3db1 - sys3db4: 10 CPUs, 47GB each
Networks: 10.3.13.x is the front-end user network (en0) and 10.10.10.x the backbone server network (en1). On the CI LPAR, the .52 address is the base network, with .59 and .50 as aliases for the other instances: DVEBMGS00 (central instance, admin activity, sys3ci/sys3cip), D02 (data loading, batch activity, sys3btc/sys3btcp), and D01 (online query load, sys3onl/sys3onlp).


Communication Flow
The above diagram shows the configuration used for the combined load tests, KPI-A, from 7 to 20 terabytes. The application server configuration has been considerably expanded due to the requirements of the data-loading scenario. The blue arrows represent the communication focus. The application server clients connect to DB0 only; all client-oriented traffic is between the application servers and DB0. DB0 has a unique role in the DPF environment and functions as a type of master coordinator for the other instances. All communication between the DB2 nodes is between DB0 and the others; the additional instances do not communicate amongst themselves. The DB0 node was therefore implemented first with a two-card EtherChannel and eventually with a four-card EtherChannel to handle this communication load. Although it would have been possible to implement a virtual Ethernet channel as a backbone network between the DB LPARs after the introduction of AIX 5.3, this was purposely not done, so as not to introduce a DB dependency on a single server. This decision was made in view of the new design requirements for the phase 2 hardware.

Application Servers
For KPIA, an additional 64 CPUs were added to the baseline configuration for application servers in order to handle the cube load requirement of 25 million records per hour. The data translation rules in effect between the ODS and the target cubes, as defined by the customer requirements, are complex and CPU intensive. The CPUs used for load were split into two physical LPARs as they resided on separate p595s. Each LPAR housed two SAP instances during KPI-A. This was done to improve the SAP load-balancing behaviour: using round-robin load balancing, each instance participated equally in the distribution, which improved the chances of an even distribution over the physical LPARs. This comes at the cost of redundant memory for the SAP buffer pools and instance-related memory structures required for the second instance. Four SAP instances were used for loading; the CI (00) was reserved for administration, the batch instance (01) was used for aggregation and load triggering, and the online instance (02) was dedicated to query.

Fig. 11 Round Robin Load Balancing

The batch trigger dispatches load packages round-robin (RR) across the four dedicated load instances as03, as04, as05, and as06.
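The round-robin distribution behind Fig. 11 simply rotates each new InfoPackage across the four load instances. A minimal illustration of the idea follows; only the instance names come from the PoC, the dispatch loop itself is hypothetical.

```python
# Minimal illustration of round-robin dispatch of data packages over the four
# dedicated load instances (as03..as06); the dispatch loop is hypothetical.
from itertools import cycle

instances = ["as03", "as04", "as05", "as06"]
rr = cycle(instances)

assignments = {name: 0 for name in instances}
for package in range(20):          # 20 pretend InfoPackage requests
    target = next(rr)
    assignments[target] += 1

print(assignments)                 # each instance receives an equal share (5 each)
```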


SAP Application Servers and Load Balancing
The main effort of the combined load scenario was focused on the cube-load design. This had the highest throughput requirements and the greatest flexibility of load design. The objective was to achieve the highest possible throughput with the least load on the database, the database being limited by specification to a single p595 in phase 1. This was to be done while maintaining an approach that could be implemented in a production system. The designs selected for the data-load tests were verified for feasibility by both SAP and the customer, to ensure the approach did not stray into impracticality just to achieve the throughput. The first decision was to use dedicated cube-load application servers which could be driven to capacity. It would normally not be possible or practical to reserve such hardware capacity in a production system. However, with the possibility of using virtualization in Phase 2, reserving these CPU resources would no longer be necessary in the final configuration.

KPID 20TB Hardware Configurations
Fig. 12 Phase 1 Hardware for KPID

The major differences in the landscape used to run KPID on the phase 1 hardware at 20TB are the addition of two application servers for data load and an increase in the EtherChannel adapters in the backbone network.

KPI-D landscape:
SAP servers:
  sys2ci LPAR (sys2ci, sys2onl, sys2btc): 32 CPUs, 80GB
  sys2as03, sys2as04, sys2as05, sys2as06: 32 CPUs, 128GB each
DB servers:
  sys2db0: 12 CPUs, 25GB
  sys2db1 - sys2db4: 10 CPUs, 47GB each
(Front-end addresses in 10.3.13.x, backbone addresses in 10.10.10.x, as in the KPI-A landscape.)


The DB0 LPAR is now running with a 4 x Gbit EtherChannel, while the large application servers and the other DB LPARs each use 2 x Gbit EtherChannels.

During the KPID tests, the overall memory allocation was modified:
  Load application servers (sys2as03 - sys2as06): 92GB each
  DB0: 32GB
  DB1 - DB4: 116GB each
The DB p595 was upgraded from 256GB to 512GB to allow this increase for the DB2 instances (32GB + 4 x 116GB = 496GB) and to improve overall buffer quality. This was a requirement for KPID success.

PoC Results: Combined Load at 20TBs

KPI Achievements

Query Results
Fig. 13 Query Throughput and Response Times

The graphs above depict the goals and achievements at 20TB. The query throughput was intentionally over-achieved: the granularity of the load was plus or minus one virtual user in Load Runner, and it was felt better to remain conservative. The response-time criterion was to achieve the throughput with average response times under 20 seconds; both KPIA and KPID were achieved with large margins. Between KPIA and KPID there was a change in the requirements for the queries which increased the weight of several queries that access the fact tables directly. It was necessary to implement a limited number of DB2 "statviews" (statistical views) to achieve the KPID results.

Query throughput (transactions/sec): KPIA requirement 0.8, achieved 1.29; KPID requirement 1.25, achieved 1.46.
Average response time (SoW limit 20 seconds): KPIA 16.1 seconds, KPID 11.5 seconds.
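The "statviews" referred to above are DB2 statistical views, which give the optimizer distribution statistics across a join, typically between a fact table and a dimension or master-data table. Below is a minimal sketch of how one is defined and enabled, issued through the db2 command-line processor for consistency with the earlier sketch; the schema, table, view, and column names are purely illustrative placeholders, not the PoC's actual objects.

```python
# Sketch of defining a DB2 statistical view ("statview") so the optimizer gets
# distribution statistics across a fact/dimension join. Table, view, and column
# names are illustrative placeholders, not the PoC's actual objects.
import subprocess

def db2(cmd: str) -> None:
    subprocess.run(["db2", cmd], check=True)

db2("""CREATE VIEW SAPR3.SV_SALES_TIME AS
         SELECT f.*, d.CALMONTH
         FROM SAPR3."/BIC/FZSALES01" f
         JOIN SAPR3."/BIC/DZSALES01T" d ON f.KEY_ZSALES01T = d.DIMID""")

# Mark the view as a statistical view and collect statistics on it.
db2("ALTER VIEW SAPR3.SV_SALES_TIME ENABLE QUERY OPTIMIZATION")
db2("RUNSTATS ON TABLE SAPR3.SV_SALES_TIME WITH DISTRIBUTION")
```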


Data Load Achievements
Fig. 14 Data-Load Throughput

This graph depicts the data-load targets and achievements at 20TB on the single-server hardware (DB on one p595 and one DS8300 storage server). After an intensive study of the scalability factors which affect the upload design, scaling this load became a question of the number and size of the application servers handling the translation rules. The KPIA landscape had 64 CPUs available for this, and the KPID landscape had 128.

Aggregation Achievement
Fig. 15 Aggregation Throughput

Aggregation presented the greatest problems in scalability. In the case of KPID the target was not fully achieved, although the result was accepted by the customer. Part of this was due to the way the tests were designed: in KPIA and KPID, the concurrent cube aggregations were all triggered at the same time. Each of these 2 to 8 cubes (depending on the KPI) had the same layout. They began with a serial aggregation of 3 very large and complex aggregates and then finished with 7 much smaller and simpler aggregates.

Data load (million records/hr): KPIA requirement 25, achieved 27.5; KPID requirement 75, achieved 77.
Aggregation (million records/hr): KPIA requirement 5, achieved 6.9; KPID requirement 15, achieved 14.


Fig. 16 Aggregate Complexity: runtime and complexity of the sequentially processed aggregates (10 aggregates per InfoCube)

The customer themselves implement a means of parallelizing the aggregation within a cube, as long as there is no restricting hierarchy. This would possibly have allowed the lighter aggregates to overlap with the complex aggregates and improved the overall throughput per hour. This method was not implemented in phase 1. With the introduction of DB2 V9 and the p595+ for the database server, a major improvement in aggregation was achieved. The graph below depicts the improvements in aggregation for these two changes: first the p595+ for the DB LPARs, then DB2 V9 on the p595+. Together these changes doubled the aggregation throughput.

Fig. 17 Improvements in DB Environment

Aggregation throughput against the KPID requirement of 15 million records/hr: DB2 V8 on the p595 achieved 14; the p595+ achieved 16.7; DB2 V9 on the p595+ achieved 28.6 (KPIA: requirement 5, achieved 6.9, shown for reference).
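The benefit of overlapping the small aggregates with the three large ones, the parallelization the customer uses in production but which was not enabled in phase 1, can be illustrated with a simple scheduling estimate. The aggregate runtimes below are invented for illustration, not measured PoC values.

```python
# Illustration of why overlapping light aggregates with the heavy ones shortens
# the rollup window. Runtimes (minutes) are invented, not measured PoC values.
import heapq

aggregates = [90, 80, 70, 10, 10, 8, 8, 6, 5, 5]   # 3 complex + 7 small aggregates

sequential = sum(aggregates)                        # strictly serial rollup

def makespan(jobs, workers):
    """Greedy longest-processing-time schedule on a fixed number of workers."""
    finish = [0.0] * workers
    heapq.heapify(finish)
    for job in sorted(jobs, reverse=True):
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + job)
    return max(finish)

print(f"serial rollup:      {sequential} min")
print(f"2 parallel rollups: {makespan(aggregates, 2):.0f} min")
print(f"3 parallel rollups: {makespan(aggregates, 3):.0f} min")
```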


Resource Requirements

Physical CPU Requirements
Fig. 18 Price/Performance on CPU Utilization

While moving through the changes in the landscape, a "price/performance" chart was maintained to follow the trend in throughput per physical CPU utilized end to end. The chart is based on the upload throughput, as this consumes the most CPUs, but the data comes from a full combined load. The yellow trend line shows the number of million records loaded per CPU consumed; an increasing trend indicates improving efficiency. The bars depict the overall data-load throughput achieved during the run. Note the big "efficiency jump" achieved in KPIA, resulting from the simple move to AIX 5.3 and SMT.

Load throughput (million records/hr, bars) and million records/hr per CPU (trend line) across the landscape steps: 7TB, 20TB, AIX 5.3, KPID, Turbo, p595+.


Fig. 19 Total Resources Available

Baseline (64 CPUs): one p595 hosting LPAR0 - LPAR4 for the DB2 partitions (partition 0, 6-13, 14-21, 22-29, 30-37) and LPAR5 for SAP (CI 00, APP 01, APP 02).
KPI-A (128 CPUs): two new 32-CPU application-server LPARs added, AS01 (sysxas03/sysxas05) and AS02 (sysxas04/sysxas06).
KPI-D (224 CPUs): the SAP instances moved off the DB p595, extending the database to the full 64 CPUs; the CI, batch, and online instances moved to a new 32-CPU LPAR; two further 32-CPU application-server LPARs added, giving AS01 - AS04 (sysxas03 - sysxas06) with one load instance each.
(These are the CPUs available in each configuration, not the CPUs consumed for the tests.)

This overview shows the growth of the landscape from the baseline to KPIA and then KPID. The graph below shows the physical CPU resources consumed per KPI achieved. This is the sum of the physical CPUs consumed over all application servers and all DB LPARs, based on the peak load, since the KPI can only be achieved with the documented throughput and runtime by covering the peak load requirement. A "shrink wrapping" exercise has not been done here: for that, the resources would be reduced to a high average for price/performance and the strict KPI runs repeated with the limited resources, to determine the smallest system which could achieve the KPIs. As the target is still a larger KPI (KPIG), this has not yet been done. KPIA (AIX 5.3) used 101 physical 1.9 GHz p5 CPUs, and KPID used 174 of the same.


Fig. 20 Physical CPUs consumed end-to-end – 1.9 GHz p5 Processors

SAP Component Balance: AIX 5.2 vs AIX 5.3

AIX 5.2: with no hardware multi-threading available, the best balance was determined as:
  Parallel dia-processes to SAP dialog processes: 1.1 : 1
  Dialog processes to physical CPUs: 1 : 1

AIX 5.3: with hardware multi-threading there are two logical processors for each physical CPU, and the best parallel throughput takes full advantage of this:
  Parallel dia-processes to SAP dialog processes: 1.1 : 1
  Dialog processes to physical CPUs: 2 : 1 (one process per SMT thread)

The increase in the number of parallel dialog tasks is reflected in the throughput, but also in the memory utilization: the more parallel dialog tasks, the more user contexts are active simultaneously.
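Applied to one of the 32-CPU application server LPARs used in the PoC, these ratios translate into the following rough sizing. This is a back-of-the-envelope sketch of the arithmetic, not an SAP sizing rule.

```python
# Back-of-the-envelope application of the balance ratios above to a 32-CPU LPAR.
def dialog_sizing(physical_cpus, smt_enabled):
    threads_per_cpu = 2 if smt_enabled else 1          # SMT gives 2 logical CPUs
    dialog_processes = physical_cpus * threads_per_cpu # 1 dialog process per thread
    parallel_dia_tasks = round(dialog_processes * 1.1) # 1.1 : 1 ratio from above
    return dialog_processes, parallel_dia_tasks

for aix, smt in (("AIX 5.2", False), ("AIX 5.3 + SMT", True)):
    wp, tasks = dialog_sizing(32, smt)
    print(f"{aix}: {wp} dialog work processes, ~{tasks} parallel dia-tasks")
# AIX 5.2:       32 dialog work processes, ~35 parallel dia-tasks
# AIX 5.3 + SMT: 64 dialog work processes, ~70 parallel dia-tasks
```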


Fig. 25 AIX 5.3 Throughput Benefits

Total load throughput: AIX 5.2 approximately 32,277,692 records vs. AIX 5.3 43,277,500 records, with throughput per CPU shown on the secondary axis.

The chart above shows the load throughput achieved using the same load configuration and hardware for AIX 5.2 and AIX 5.3. For each, the optimal balance of components was used. With AIX 5.3, the number of dialog processes and the parallelism could be increased; attempting the same parallelism on AIX 5.2 proved counterproductive, as the optimal balance could not be achieved. This comparison shows that not only was the total throughput significantly increased, the throughput per CPU also showed a major improvement: on AIX 5.2, 108.5 physical CPUs were used; on AIX 5.3, only 99.8.

Memory Requirements
For the purpose of sizing, the maximum memory utilization must be taken into account, as a memory over-commitment would result in paging and change the response time and throughput behaviour considerably.

Fig. 21 Memory Utilization in KPI-A (working storage, MB)

Component   7TB Average   20TB Average   7TB Max     20TB Max
DB          186,849       203,262.90     192,870     206,546.50
APPs        123,257.4     82,098.42      141,241.2   99,826.20
Total       310,106.4     285,361.33     334,111.2   306,372.70

For KPIA, the system utilized between 306 and 334 gigabytes of application working storage.


Using NMON, the average and maximum memory utilization was captured. Here we look only at the working storage, not including client or file-system cache. The reason is that the working storage is the application footprint, whereas non-computational memory is volatile and will often expand to fill any remaining capacity. The working storage does include the OS computational requirements.

The application server memory requirement is primarily driven by the data-load process, which runs in massively parallel mode and therefore has many parallel user contexts. The size of the data block being processed by each of the parallel processes has a large effect on the size of the individual user contexts and therefore on the total memory requirement. In the 7TB test, a block size of 160,000 rows was selected; for the 20TB tests, a block size of 80,000 rows was used. This is reflected in the increased memory requirement for the application servers in the baseline (7TB) statistics in Fig. 21 above.

Fig. 22 Throughput Achievement for KPID at 20TB (5 November)
High-load phase: 05.11.2006 12:30:00 - 05.11.2006 16:50:00
Average load: 76.21 million records/hr; aggregation: 13.31 million records/hr; average query rate: 1.55 txn/sec

For the KPID achievements documented for 5 November, for example, the following graph shows the amount of real memory configured and the amount utilized by working storage over all LPARs. The database (green) is using just short of 500GB, the application servers (blue) 380GB.

Fig. 23 Memory Utilization Summary over all LPARs for KPID: 5 November


The graph below shows the working storage memory requirement across 16 KPID runs.

Fig. 24 Memory Utilization in KPID

Working Storage Memory

0

200000

400000

600000

800000

1000000

1200000

KPID

Gig

aByt

e

APPS-WS

DB-WS

In the last 10 runs depicted on this chart, the memory footprint for both the DB and the application servers has stabilized at just under 800GB. The throughput is affected by the balance of the components (parallelization), the speed of the processors, and other factors in the load design. The memory on the DB is a result of buffer pool settings and the number of active connections. The application server memory is influenced by the number of instances on the LPAR and the level of parallelization. It was discovered, for example, that each parallel dialog process, using a block-size of 80,000 records, can consume nearly 1GB of memory for its SAP user context. Maintaining the same settings on the DB, the memory requirement will increase with the further parallelization expected for KPI-G, primarily in the application servers. To further parallelize the load, additional application server resources will be necessary, resulting in more client connections to DB2. To utilize the additional resources, more parallel dialog-tasks will be started, increasing the memory requirement for user context storage.
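To illustrate the sizing arithmetic above, the following minimal sketch estimates the application-server working storage from the level of parallelization. The figure of roughly 1 GB per user context at a block size of 80,000 records is the observation quoted above; the per-instance overhead value is an assumption added purely for illustration.

```python
# Rough sizing sketch (not a measured model): application-server working storage
# needed for the massively parallel data load.

def app_server_working_storage_gb(parallel_dialog_tasks: int,
                                  gb_per_user_context: float = 1.0,       # observed at 80,000-record blocks
                                  sap_instances: int = 2,                 # 2 SAP instances per load LPAR
                                  gb_per_instance_overhead: float = 10.0  # assumption for buffers/work processes
                                  ) -> float:
    """Estimate the working storage (GB) one load LPAR needs."""
    return (parallel_dialog_tasks * gb_per_user_context
            + sap_instances * gb_per_instance_overhead)

# Example: 64 parallel dialog tasks on an LPAR hosting 2 SAP instances.
print(f"{app_server_working_storage_gb(64):.0f} GB")  # -> 84 GB of working storage
```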


Summary AIX 5.2 vs AIX 5.3 in full KPIA Scenario

Fig. 25 CPU utilization efficiency improvement with AIX 5.3

RUN | DB      | APP     | Upload   | PHYC   | Rec/CPU | Comments
10  | AIX 5.2 | AIX 5.3 | 38209492 | 107.48 | 355503  | SMT on for app servers
12  | AIX 5.3 | AIX 5.3 | 36167445 | 101.4  | 356680  | SMT on all
13  | AIX 5.3 | AIX 5.3 | 43277500 | 99.8   | 433642  | SMT on all
14  | AIX 5.3 | AIX 5.3 | 32277629 | 108.5  | 297489  | SMT off all (simulating AIX 5.2)

The table above shows the load throughput achieved using the same load configuration and hardware for AIX 5.2 and AIX 5.3. For each run, the optimal balance of components was used. For AIX 5.3, the number of dialog processes and the parallelism could be increased; attempting the same parallelism on AIX 5.2 proved counterproductive, as the optimal balance could not be achieved. The implementation of AIX 5.3 was done in two stages: first the DB servers, and then the application servers. The major improvement comes with the simultaneous multi-threading (SMT) functionality. The runs show first the improvement gained by the upgrade of the DB servers alone, and then the difference over the whole configuration with and without SMT. Not only was the total throughput significantly increased, but the throughput per CPU also shows a major improvement: on AIX 5.2 (no SMT), 108.5 physical CPUs were used, on AIX 5.3 only 99.8.
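The Rec/CPU column in the run table is simply the uploaded record count divided by the physical CPUs consumed; the short sketch below recomputes it from the table values.

```python
# Recompute the Rec/CPU column of the run table: uploaded records / physical CPUs consumed.
runs = {
    10: (38_209_492, 107.48),   # SMT on for app servers only
    12: (36_167_445, 101.4),    # SMT on all
    13: (43_277_500, 99.8),     # SMT on all (AIX 5.3 everywhere)
    14: (32_277_629, 108.5),    # SMT off all (simulating AIX 5.2)
}

for run, (uploaded, physical_cpus) in runs.items():
    print(f"run {run}: {uploaded / physical_cpus:,.0f} records per CPU")
# run 13 vs run 14: about 433,642 vs 297,490 records per CPU, the AIX 5.3 / SMT gain.
```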


Tuning of the Application Combined Load

Characteristics and Tuning Options of Cube Load

Fig. 26 Components of Cube Load

Number of extractors
The extractor processes are batch jobs. These are also called “infopackages”, as they define the relationship between the source ODS, the target cube(s), the selection criteria, and the processing method. The batch job reads from the ODS a number of records related to the <package-size>, and uses the RFC method to launch an asynchronous dialog task to process this data package, while it returns to reading for the next package.

Number of Target Cubes
A single package read from the ODS can have multiple destination cubes. The dialog task handling this package will process it through the applicable translation rules for each of the target infocubes.

Size of Infopackages
The size of the infopackages affects the speed at which the extractor can “spin off” the dialog tasks and the speed at which the dialog tasks finish their processing (turn-around time). The size of the packages also affects the memory requirements on the application server handling the dialog tasks: the larger the package, the larger the user context for each of the tasks, and thereby the combined app-server memory footprint. If the packages are too small, there is an overhead in housekeeping and monitoring which can lead to serialization of the tasks.

[Diagram: an extractor reads from the ODS (DB-focused), spawns dialog tasks (application-server-focused) which write to the target infocubes (DB-focused)]

Factors which affect cube load behaviour:
• Number of extractors
• Number of target cubes
• Size of the infopackages
• Number of dialog processes spawned per extractor


Number of Dialog Tasks Spawned
The target servers to perform the dialog tasks are defined in a logon group, along with control quotas for the number of dialog processes in each server which are available for asynchronous work. These quotas are intended to protect a server from being overrun by this batch-related work, and to reserve available capacity for other online activities. In our case, these application servers were dedicated to this workload and the objective was to achieve, and sustain, scalability over the entire available CPU resource.

Fig. 27 Multiple Instances Per LPAR

In order to avoid any slowdown due to gateway or dispatcher congestion, and to smooth out the dialog task distribution "bursts", 2 SAP instances were installed on both application servers.

[Diagram: data loading load balancing - the batch instance sys3ci_02 runs the extractor batch jobs, and its initiating gateway sapgw02 distributes the dialog work through a logon group (members and quotas) to two data-load instances on each application server (sys3as03/sys3as05 and sys3as04/sys3as06, each with its own gateway sapgw03-sapgw06), reached over 1 Gbit network aliases]

The number of dialog tasks spawned depends on the number of dialog processes which are available for RFC logins, as defined in the logon group, and on the limitation for each extractor as determined by the MAXPROC settings for the source ODS. In the target logon group, each application server is entered and a quota for the available dialog processes is configured. Each ODS object has a configured limit for the number of parallel processes allowed. It was expected that the extractor would spawn dialog tasks either to the limit of the available dialog processes or to the limit of allowed parallel processes for the ODS; a restriction based on a percentage of available resources would be more flexible and more secure than a fixed value, which only takes effect at the beginning of the load.

Lesson: This assumption, in fact, was in error. During these tests it was discovered that the dialog tasks are limited only by the ODS-relevant settings, and if there are too few dialog processes available at the participating logon group servers, the requests are queued at the target server until a dialog process becomes available. The “user contexts” related to these RFC requests consume significant memory on the target servers, and each initiated RFC requires a gateway session, from the initiating GW to the target GW. The result of this, experienced in the tests, was a memory over-commitment causing paging on the application server, an overrun of the dispatcher queue, dispatcher errors at the target server, and an eventual collapse of the initiating GW in reaction to the general instability and timeouts. The workaround is to carefully balance the ODS parallelization with the available resources on the application servers. This has implications for a real-life scenario, in that the loss of an application server can have unexpected results coming from an unplanned imbalance between the parallelization level and the available resources. In addition, there can be no dynamic use of new resources made available after the load has started.

Total Number of Packages in a Request
The number of packages in an upload request is the number of source packages multiplied by the number of target infocubes. The smaller the package size, and the more target infocubes there are, the greater the number of packages in a load request. In this version of SAP NetWeaver Business Intelligence, the status monitoring tools which keep track of the “health” of a load request must manage each of these packages.

Lesson: For very large requests, the throughput begins to degrade after a given number of packages. The overhead of managing the increasing request size cannot keep up with the potential throughput, and each package takes longer and longer to complete. The recommendation is < 1500 packages per request.
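The request-size arithmetic described above can be made concrete with a small sketch; the data volume used here is an assumption for illustration, not a PoC figure.

```python
# Sketch of the request-size arithmetic: how many packages the request monitor has
# to track, given the data volume, the package (block) size and the number of
# target InfoCubes. The 40-million-record volume is an assumed example.
import math

def packages_per_request(records: int, package_size: int, target_cubes: int) -> int:
    """Source packages multiplied by the number of target InfoCubes they are processed for."""
    return math.ceil(records / package_size) * target_cubes

records_per_request = 40_000_000   # assumed volume, for illustration only
for package_size in (160_000, 80_000):
    n = packages_per_request(records_per_request, package_size, target_cubes=4)
    print(f"block size {package_size}: {n} packages "
          f"({'within' if n < 1500 else 'above'} the <1500 recommendation)")
# Splitting the same volume over two extractors (the dual extractor below) halves
# the per-request package count, which is one reason that design was chosen.
```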

Fig. 28 Dual Extractor
[Diagram: two extractors reading different selections (select to n, select from n) from the same ODS, each spawning its own dialog tasks to write to the target infocubes]

In order to achieve maximum throughput to a limited number of target cubes, a dual extractor was used. This is simply two “infopackages” with the same source and destination(s), but with different data selection criteria. In this manner it is possible to reduce the request size as well, as each extractor is tracked as a separate request.

Number of Target Info Cubes
In the combined load scenario, any means of reducing load against the database is beneficial, as the database must ultimately be the contention point. In regard to the design of the upload scenario, it is possible to avoid reading the same data to process it for different target cubes by using a “1 to many” configuration. In this design, a single extractor, or infopackage, is defined with multiple targets. Each block that is read, and given to a dialog-task for processing, is processed and written for each of the target cubes. This has two benefits in our scenario, and possibly in a production environment: it reduces the read requirement against the database, and it reduces the scheduling overhead for initiating a dialog task per block per target infocube. In our case it also worked positively on behalf of the load balancing.

Fig. 29 Read Once, Write Many
[Screenshot: reads once per extractor (Read/hour), writes many times per target (#Update and #Written)]

The diagram above shows the read/write behaviour of the “1 to 4” configuration used in KPI-A. In this case the input is read once (depicted by the Read/hour) and written 4 times to 4 separate infocubes (depicted by #Update and #Written). The following chart summarizes the calibration tests around the design for the data-load. This was the first step in the job-chain design for this requirement. There were a number of recommendations on ratios of ODS to target cubes, and dialog-tasks per target (parallelization), but these had never been verified in an environment of this size and capacity. We therefore did a trend analysis with the objective of quickly establishing the direction to take and verifying the previous recommendations.

Starting Point and Trend Directions

• The customer is using block sizes of 100K records. Is this optimal? Would larger or smaller block sizes have a benefit on the overall efficiency of the system or throughput?

• The recommendation for degree of parallelism suggested a limitation of 8 dialog tasks. This represents a challenge to scalability and there was no clear statement of justification for this limit. What would limit additional parallelism?

• Given that there is a benefit in reusing a data block from the ODS to load multiple target cubes, what is the optimal ODS to target cube relationship?


Fig. 30 Trends toward Optimal Parallelism

This is a graph on two axes. The columns and the left-hand axis show the throughput achieved per dialog-task. The number of active dialog tasks in parallel (depicted by the right-hand axis) was taken from the job logs of the batch extractor job. This represents the number of dialog requests the extractor actually spawned, but does not really show how many were actually able to get dialog process resources simultaneously. The graph also spans two different block sizes, starting at 160K and moving to 80K.

Trend for ODS to Target Cube Relationship: From the point of view of the application server load, ignoring the less efficient DB utilization, the best “price/performance” is a “one to two” model: 1 ODS, 2 target cubes. This can be seen in the 2nd point in the graph above: a high throughput per dialog task. This configuration, however, would increase the DB load and reduce parallelization, as we are not able to go beyond 24 to 26 parallel dialog-tasks when the extractor has to read a block for each two written (see the right axis for the maximum parallelization achieved). The next best trend point for throughput was a “one to four”, as we are able to achieve a parallelism of over 60 dialog-processes with a single extractor, with good throughput per dialog-task. An improvement on this is the “one to four” using 80K block sizes: here we see a drop in parallelism but an increase in throughput per dialog process. The “one to seven” infocube configuration has an even better degree of parallelization, achieving around 64 parallel dialog-tasks, with an even higher throughput per dialog-task. (Attempting parallelism beyond this point is counterproductive, as there are only 64 SAP dialog processes.) The trend shows a benefit in smaller block sizes and a high number of target infocubes.

[Chart: throughput per dialog process (Recs/DiaProc, left axis) and number of dialog tasks active in parallel (right axis) for 1, 2, 4, 5 and 7 target cubes at block size 160K, and 4 and 7 target cubes at block size 80K]


Conclusion
KPIA selected the 1:4 design as the best possible data layout for phase 1. In this case we were limited by the number of target cubes actually available for data load. The recommended limitation of 8 parallel dialog-processes per target cube did not appear to be a limit of the target cube, but of the number of SAP dialog processes available: in the case of 4 target cubes the ratio is 16 per cube, in the case of 7 target cubes the ratio is 9 per cube.

Fig. 31 Price/Performance Verification

This graph is similar to the previous graph, with the difference that it shows the total data-load over all the dialog tasks. If we consider that the blue line, representing the number of active dialog processes, is an indication of CPU resource utilization on the app-server, we can derive a somewhat imprecise cost/performance trend. In this case the blue line is cost, and the column is performance. The trend shows the 4-way at 80K blocks to be a good performer; the 7-way is better.

Lesson: The parallelization ratio for AIX 5.2 is 1/1/1: one dialog-task per SAP dialog-process and one SAP dialog-process per CPU. The trends indicate that the more target cubes there are per extractor, the better the throughput and price/performance. At least this trend held true for the tests of up to 7 target infocubes.

[Chart: total throughput (Inserts/Hr, left axis) vs number of active dialog tasks (right axis) for 1, 2, 4, 5 and 7 target cubes at block size 160K, and 4 and 7 target cubes at block size 80K]


Verification of the Scenario: Single or Dual Extractor
Having determined a trend for the design of the job-chains for data loading, it was necessary to verify the “cost/effectiveness” of this design end-to-end by including the database server utilization as well. The two scenarios were tested using the same parallelization capability: the total “maxproc” setting for the Operational Data Store (ODS) sources. This setting, which controls the number of parallel dialog processes which can be spawned, was set such that in both cases the maximum could not exceed 116 dialog-tasks. The number of SAP dialog-processes was 160 in both cases, to ensure there would not be any contention for SAP dialog-process capacity. The cost/performance was determined by the number of records loaded per physical CPU consumed in total on the system, end to end.

Fig. 32 Various Designs for Load Job Chains
[Diagram: the "single extractor per ODS" design (four extractors, each with maxproc 29 and 4 target infocubes, from one source batch) versus the "one ODS with two extractors" design (two extractors with maxproc 58 each, each with 4 target infocubes)]

Fig. 33 Comparative Throughput

Configuration    | Total Throughput | PHYC | Recs per Hr/CPU
Single Extractor | 29.8 Mil/Hr      | 65.6 | 454K
Dual Extractor   | 41 Mil/Hr        | 74.8 | 548K

Conclusion: the dual extractor is the best performer overall and the most efficient end to end.
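The records-per-hour-per-CPU figures in Fig. 33 follow directly from the throughput and the physical CPU consumption; a minimal recomputation from the table values:

```python
# Recompute the cost/performance column of Fig. 33:
# records per hour per physical CPU consumed, end to end.
configs = {
    "single extractor": (29.8e6, 65.6),
    "dual extractor":   (41.0e6, 74.8),
}
for name, (recs_per_hour, physical_cpus) in configs.items():
    print(f"{name}: {recs_per_hour / physical_cpus / 1000:.0f}K records/hr per CPU")
# -> ~454K vs ~548K, matching the table and the conclusion that the dual extractor
#    is the more efficient design end to end.
```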


Characteristics and Tuning of Query
The online reporting load is generated via the load simulation tool Mercury LoadRunner®. This tool uses sophisticated scripting to simulate online users. These scripts were designed by SAP Solution Support to meet the specifications of the customer. The scripts simulate HTML queries.

Target Query Cubes
As the customer uses multi-providers for the queries, this method was also employed in the PoC. Of the 100 infocubes which formed the basis of the test scenarios, 50 were used for reporting. The cubes were accessed via 8 multi-providers. In this manner the queries could move across the full range of the cubes by simply selecting a different multi-provider and entry. This was the means used by the testing tools which simulated the query load.

Fig. 34 Classes equate to multi-providers: each multi-provider is associated with several infocubes (ZGTF..)

CubeClass1 | CubeClass2 | CubeClass3 | CubeClass4
ZGTFC013   | ZGTFC014   | ZGTFC040   | ZGTFC039
ZGTFC015   | ZGTFC016   | ZGTFC042   | ZGTFC041
ZGTFC017   | ZGTFC018   | ZGTFC044   | ZGTFC043
ZGTFC019   | ZGTFC020   | ZGTFC046   | ZGTFC045
ZGTFC021   | ZGTFC022   | ZGTFC048   | ZGTFC047
ZGTFC023   | ZGTFC024   | ZGTFC050   | ZGTFC049
ZGTFC025   |            |            |

CubeClass5 | CubeClass6 | CubeClass7 | CubeClass8
ZGTFC063   | ZGTFC064   | ZGTFC090   | ZGTFC089
ZGTFC065   | ZGTFC066   | ZGTFC092   | ZGTFC091
ZGTFC067   | ZGTFC068   | ZGTFC094   | ZGTFC093
ZGTFC069   | ZGTFC070   | ZGTFC096   | ZGTFC095
ZGTFC071   | ZGTFC072   | ZGTFC098   | ZGTFC097
ZGTFC073   | ZGTFC074   | ZGTFC100   | ZGTFC099
ZGTFC075   |            |            |

According to an analysis of the query statistics on the production system, done by the customer, it was observed that 80% of all queries return less than 1,000 rows, and another 17% return less than 10,000 rows. This distribution was reflected in the query design for the PoC. It was agreed that for the KPI tests, up to 60% of the queries could be satisfied either directly from the OLAP cache in the application server or from the aggregates maintained in the database aggregate buffer pool. These queries represent typical reports for which aggregates or bookmarks have been prepared. Highly selective (ad hoc) reporting runs less efficiently and directly against the database fact tables; 40% of the queries were required to simulate this behaviour. One of the objectives of the customer in selecting this criterion was to duplicate the behaviour of long running queries in production, so that a solution found in the PoC could help alleviate these problems in production as well.


Fig. 35 Reporting Criteria

[Diagram: reporting access paths from sys<n>onl - the application server OLAP cache (shortest path), the DB aggregates, and the DB F-fact tables (longest path)]

The figure above depicts the PoC reporting criteria, with access to the SAP OLAP cache being the shortest path, the DB aggregate cache in second place, and access to the DB fact tables being the longest. The online workload profile was:

10 Queries
• 80% on aggregates, 20% selectively on fact tables
• 50% using OLAP cache, 50% not using OLAP cache

or, in a matrix:

Query | MultiProvider type            | OLAP Cache | Aggregate
1     | Sales Statistics              | Yes        | Yes
2     | Sales Statistics              | No         | Yes
3     | Profitability Analysis Contr. | Yes        | Yes
4     | Profitability Analysis Contr. | No         | Yes
5     | Sales Statistics              | Yes        | Yes
6     | Profitability Analysis Contr. | No         | Yes
7     | Profitability Analysis Contr. | Yes        | Yes
8     | Profitability Analysis Contr. | No         | No
9     | Sales Statistics              | Yes        | Yes
10    | Profitability Analysis Contr. | No         | No

Thus, queries 8 and 10 are the expensive fact-table queries, queries 2, 4 and 6 access aggregates on the database, and the others are served by the OLAP cache.
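The workload percentages and the matrix are consistent with each other; the short check below encodes the ten queries from the matrix and recounts the mix.

```python
# Encode the query matrix above and verify the stated workload mix:
# 80% aggregate-based / 20% fact-table, and 50% OLAP-cached.
queries = {
    #  no: (olap_cache, aggregate)
    1:  (True,  True),  2: (False, True),  3: (True,  True),  4: (False, True),
    5:  (True,  True),  6: (False, True),  7: (True,  True),  8: (False, False),
    9:  (True,  True), 10: (False, False),
}

total      = len(queries)
cached     = sum(c for c, _ in queries.values())
aggregates = sum(a for _, a in queries.values())
print(f"OLAP cache hits : {cached}/{total}  ({100 * cached // total}%)")       # 50%
print(f"Aggregate-based : {aggregates}/{total}  ({100 * aggregates // total}%)")  # 80%
print(f"Fact-table      : {total - aggregates}/{total}  ({100 * (total - aggregates) // total}%)")  # 20%
```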


Fig. 36 Comparison between “Cached and not Cached” Queries
[Screenshots: queries using the SAP OLAP cache show no DB time; queries with no cache go directly to the fact tables and, compared to those using the OLAP cache, these “ad hocs” have erratic response times]

These two snapshots are taken after considerable tuning for the “ad hoc” queries. Nevertheless, the effect of the caching is very evident. The non-cached queries have much more variance in their response times (6 to 14 seconds) and spend up to 70% of the response time in the database.


DB2 Statviews
The only tuning allowed for the queries was the use of DB2 “statviews”. This is described in depth in the DB2 documentation for this PoC. A statview is basically a DB2 V8/V9 mechanism for optimizing the access path and data joins for cross-table selects. The chart below, taken directly from LoadRunner, shows the instant drop in response time at the point in time when the statviews were activated for the long running queries.

Fig. 37 Effect of DB2 Statviews on Long Running Queries

[Chart annotation: statviews activated - the response times of queries to the fact tables drop]

Characteristics of Aggregation and Job-Chain Design
Aggregates are defined on infocubes to improve performance for critical query navigations. The requirements of these navigations are known or pre-defined, to allow an effective aggregate to be built. As new data is added to the infocube, the aggregates must be modified to reflect this new data. In general, the aggregates should be much smaller than the data target, and they normally contain a compressed version of the infocube data, in that similar data is consolidated. A typical customer scenario would take the following steps:

1. New data is loaded into the target infocube.
2. A “rollup” of the new data would be done into the existing aggregates.
3. Data in the aggregates would be compressed to reduce the number of records of similar kind in the aggregate.

The smaller the aggregate, the more chance it has of being maintained in memory in the DB aggregate buffer pool, thereby improving the response time of queries to this aggregate. From an SQL perspective, compression results in both INSERT and UPDATE statements, depending on the number of records with the same characteristics combination, compared to only INSERT statements when not compressing the aggregates. Not compressing aggregates would only make sense if single requests are frequently deleted out of the cube.
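To make the INSERT versus UPDATE point concrete, the sketch below models what a compressing rollup does logically: records sharing the same characteristics combination are consolidated by summing their key figures, so an existing combination results in an update while a new one results in an insert. The record layout and values are invented for illustration and are not the actual aggregate structures used in the PoC.

```python
# Illustrative-only model of aggregate compression: records sharing the same
# characteristics combination are consolidated by summing their key figures.

new_request = [
    # (characteristics combination, key figure) -- invented example data
    (("2006-11", "DE"), 100.0),
    (("2006-11", "DE"),  50.0),   # same combination -> consolidated with the row above
    (("2006-11", "FR"),  75.0),
]

aggregate = {("2006-11", "DE"): 400.0}   # existing aggregate content

for combo, amount in new_request:
    if combo in aggregate:
        aggregate[combo] += amount       # would be an UPDATE on the aggregate table
    else:
        aggregate[combo] = amount        # would be an INSERT
print(aggregate)   # {('2006-11', 'DE'): 550.0, ('2006-11', 'FR'): 75.0}
```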

For the purpose of the PoC, this scenario makes a slight deviation. This is necessary to prepare a repeatable scenario with (as near as possible) identical load, and the time to reset to the starting point for a rerun must remain feasible. In the case of the PoC, it is not possible to simulate the entire data cycle, nor was it a prerequisite of the KPI definitions.

PoC Deviation from the Typical Cycle

1. Data loading is separated from aggregate rollup. It is not intended to simulate a complete data cycle.
2. The goal is to have a reproducible initial state of all aggregates. So before rolling up, the aggregates are deactivated (a complete data deletion), then reactivated. An initial fill of the aggregates is used instead of rolling up into existing data.
3. A single request is then loaded into 10 different aggregates per infocube. As all used aggregates are empty, there isn’t technically any difference between this rollup and a reconstruction. As no compression of the aggregates is done, and no merging with existing aggregate data, no updates to the F-fact table data are done, only inserts.

Monitoring and Analysis of Combined Load
To monitor the resource consumption and load distribution over multiple LPARs, and to be able to compare these metrics with the same in future configurations, a special tool was developed. This tool takes NMON statistics from multiple LPARs, combines them, and summarizes selected metrics. As NMON is “workload manager aware”, components sharing an LPAR were separated into WLM classes. In this manner it was possible to monitor the memory and CPU resources for each class. Within each DB2 LPAR there are, for example, 8 DB2 nodes (with the exception of DB0); the load distribution and resource utilization is monitored for each.

Fig. 38 Consolidated NMON Analysis Tool: Tagging and Tools
[Charts: load distribution over the individual DB2 nodes in an LPAR (% CPU utilization), physical CPU (PhyC) for each LPAR in the system, and a summary of resource utilization]


To be able to compare CPU utilization for KPI runs over a changing server landscape, it was necessary to have a common denominator. The CPU utilization is collected in units of physical CPUs consumed, rather than as a percentage of CPU utilization. This allows comparisons across different server sizes and also the future comparison to a shared processor pool implementation. The combined information is also summarized for a number of defined metrics, with the objective of showing changes in pattern. The example above of SAN adapter utilization, depicting the activity on the SAN during a KPI run, helps to recognize such a change. This type of summary data is collected for the SAN interfaces and the backbone network, as well as for CPU utilization and memory.

Real-Time Monitoring
The AIX XMPERF graphical monitor is used to monitor the system behavior during the runs. This allows a deviation in any component of the landscape to be recognized quickly, so that analysis of the situation can begin while it is happening. This tool also allows record and playback of runs which take place during the night.

Fig. 39 XMPERF Real-time Graphical Monitor of KPID Load


This is an example of an XMPERF monitor used during KPID. There are 5 individual monitors tracking a number of different components. The monitors are numbered and described below:

1. This shows the ramp-up of the cube-load over the 4 dedicated application servers, how evenly the load is distributed, and the level of the loading. Here the 4 application servers are quite evenly loaded and just entering the high-load phase.
2. This instrument tracks the traffic on the backbone network between the DB servers and DB0.


3. These speedos show any paging activity on the DB servers.
4. This shows the CPU utilization and distribution on the DB servers. Green is user, red is I/O wait, yellow is kernel overhead, and a blue background is idle.
5. These “pies” or “butterflies” track the load distribution over the 8 DB nodes in each of the 4 LPARs. The first instrument is DB0, which has only one node.

Throughput Analysis KPIA
The figure below, produced by an analysis tool, depicts the overall throughput over the three load categories for KPI-A. The left y-axis is the records-per-hour throughput of data-load and aggregates on a per-cube basis. The right y-axis is the average response time of the query load.

Fig. 40 Throughput Analysis Tool for KPIA

[Chart: rollup and upload throughput (records/h, left axis) for 2 cube aggregations and 8 load requests (both from ST03), together with the query response times (from LoadRunner, right axis), plotted over the run from roughly 14:25 to 18:07]

This tool was very successful for the baseline calibration and KPIA, but became difficult to manage for KPID and beyond, as the number of load and aggregation requests increased by a factor of 4. For each request, the SAP statistics had to be extracted and updated separately. The tool also uses a combination of SAP and LoadRunner® statistics for the graph. The LoadRunner® statistics include front-end network time, which 1) wasn’t part of the KPI and 2) wasn’t ideally configured. The difference between the KPI response times and the graph was therefore confusing.


Throughput Analysis KPID to KPIG
Throughput data is extracted from SAP statistics tables using extraction tools designed by SAP during the PoC. The reports are able to extract query and load data on matching timelines. The aggregation throughput is still taken from the summary data generated at the end of the aggregation run; no more detailed, time-interval-based information on aggregation is available.

Fig. 41 Throughput Analysis for KPID and KPIG

This results analysis tool is based on monitoring statistics taken directly from SAP tables. A new report, written by SAP, allows the extraction of the data-load throughput, and the query throughput and response time, on an identical time axis. Using this output, the Excel analysis tool can construct the graph above and calculate the results of the high-load phase. The yellow boundaries delimit the calculated high-load phase. The high-load phase for the data upload starts when the last extractor has finished its ramp-up (the full number of dialog processes has been spawned) and ends when the first enters ramp-down. Most tests were run with a two-phase upload, and therefore contain a full ramp-down and ramp-up phase in the middle of the high-load phase. The actual peak throughput for upload was therefore much higher than the calculation. The reason for this method is explained later in the upload details.
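A minimal sketch of the high-load-phase calculation described above, using invented timestamps and an assumed record count: the window starts when the last extractor has finished its ramp-up, ends when the first one begins its ramp-down, and the throughput is averaged only inside that window.

```python
# Sketch of the high-load window calculation used by the analysis method.
from datetime import datetime

# (ramp-up finished, ramp-down started) per extractor -- invented example times
extractors = [
    (datetime(2006, 11, 5, 12, 20), datetime(2006, 11, 5, 16, 55)),
    (datetime(2006, 11, 5, 12, 30), datetime(2006, 11, 5, 16, 50)),
    (datetime(2006, 11, 5, 12, 25), datetime(2006, 11, 5, 17, 5)),
]

high_load_start = max(start for start, _ in extractors)   # last ramp-up finished
high_load_end   = min(end for _, end in extractors)       # first ramp-down started

records_in_window = 330_000_000   # assumed count of records loaded inside the window
hours = (high_load_end - high_load_start).total_seconds() / 3600
print(f"high-load phase: {high_load_start:%H:%M} - {high_load_end:%H:%M}, "
      f"{records_in_window / hours / 1e6:.1f} MilRec/hr")
```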


SAP Load Monitoring Statistics

The load data statistics are taken directly from the SAP monitoring table RSMONMESS, based on the start and end timestamp. This table tracks each record inserted into a target cube. Using this information, the throughput is calculated per selected time period.

Furthermore, the end of the ramp-up and the starting point of the ramp-down phase are calculated. The peak load phase is bounded by the point at which the extraction process has initiated all DIA processes on the one hand, and the point at which the extraction job has finished reading on the other.

Query Monitoring

The standard SAP BI reporting for query is based on the given OLAP statistic values that are stored in table RSDDSTAT. ST03 reads the original statistics data, merges corresponding records and aggregates the statistical records based on the selected timeframe. For the analysis tool, the query statistical records are aggregated within time intervals and the results integrated into the ST03 in an extra-developed profile called "time profile". Important KPI values are:

• Number of navigation steps / second

• Avg. Query runtime [seconds]


P5 Virtualization
During phase 1, load profiling was done for all three of the different loads to determine their CPU utilization and load distribution over DB and APPs. The following diagrams, taken from the NMON reports, show the combined utilization over all the LPARs in the system.

Fig. 42 Data Load CPU Utilization Profile

[Chart: application servers running at full capacity (32 processors); DB servers using 10 to 12 physical CPUs]

The load profile in this example is the data-load. It is very application server heavy and has a very constant behavior. This load type can be designed to use all available application server capacity. Its requirements are easy to predict, and are controlled by job-design and settings.


Fig. 43 Heavy Query Load CPU Utilization Profile

[Chart: CPU utilization of the application servers for online load and of the DB servers (not stacked)]

The figure above shows the profile of a very heavy query load, done during the calibration tests. There were a number of attempts to profile the query load; this is an example of one, and they are all quite different. In general the load represents a 2:1 ratio of DB utilization to application server load, but it is highly erratic.

Fig. 44 Aggregation Load CPU Utilization Profile

[Chart: CPU utilization of the application server for batch, DB0, and the DB load (not stacked)]

The aggregation load is also very erratic. The load on DB0 at the beginning is thought to be the effect of the heavy aggregation rules for the first 3 aggregates; the load tapers off for the smaller, lighter aggregates. The batch load on the application server represents about 1/3 of the overall load. The aggregation method has the effect of function shipping: the jobs trigger complex SQL requests toward the database and then wait for the DB to return the results. This makes this load somewhat difficult to control, as the major activity takes place in the database, where it is anonymous to any kind of workload control tools. Overall, the profile of the job shows peaks and troughs in CPU activity.

Fig. 45 KPID Combined Load CPU Utilization Profile

[Chart: the 4 loading application servers, the combined application servers for online and batch, and the DB servers (not stacked)]

The combined load looks like the example above. In this picture, the individual dedicated LPARs are placed side by side; no sharing of physical resources is done. However, from this picture one can get an idea of what sharing could be possible in a micro-partitioning environment. The load peaks are shown here in relatively large measurement intervals, whereas processor sharing takes place at a 10ms interval. There would appear to be a large potential to improve cost/performance by resource sharing in this combined profile. For phase 2 of the PoC, micro-partitioning will be introduced to investigate this theory.

Starting Point in the Phase 2 Landscape
For Phase II, we will have 5 p595 servers with 64 POWER5+ CPUs each for the database and the application servers. Additional LPARs will be used in order to better separate the various load profiles, and the CPU virtualization capability of the POWER5 systems will be used for dynamic resource sharing. The CPU power will be distributed to the different LPARs by means of a defined resource policy.

For the first p595 there will be:

- One LPAR for DB2 Node 0 and the SAP central instance (CI)
- One LPAR for Online Activities “Queries”
- One LPAR for SAP batch: Dedicated to Aggregates
- One LPAR for SAP batch: Data-Load Extractors
- One LPAR for the Storage Agent used by TDP for Advanced Copy Services

The intention is to give DB2 Node0 and the CI resource priority. Currently the intention is to place these on the same p595. Both Node0 and the CI provide global functionality which has an effect on their “sub-components” system-wide. Therefore the reaction time of these critical resources has a general effect, even though these two components themselves are not normally the high-load focus. Initially, the data-load extractors are being placed in an LPAR on the same system as the aggregate batch; this is the only way to prioritize load vs. aggregation. The remaining 4 p595s will have the following configuration:

- One LPAR for DB2 with 8 DB2 Nodes
- One LPAR for Online Activities “Queries”
- One LPAR for SAP batch: Dedicated to Data load
- One LPAR for the Storage Agent used by TDP for Advanced Copy Services

On each p595 a similar policy for managing the CPU capacity will be implemented. The initial idea is to give priority to DB2 and the online activities (query) to guarantee good and constant response times. The aggregates drive the database, which has unlimited priority, but are restricted by the number of actual batch jobs running aggregation. The data-load has two load profiles: on the extractor side it is batch oriented, on the load side massively parallel. The initial policy here will be to give priority to the aggregates, with a limitation. The data-load will have no limitation, but will have the lowest priority. The expectation is that the massively parallel part of the data-load will consume any CPU capacity unused by the priority workloads.

Fig. 46 LPARs in the Shared Processor Pool per p595

[Diagram: the LPARs in the shared processor pool on each of the 5 p595s - DB0 with the CI, DB1 to DB4, Online, Aggregates with the Extractors, Upload, and Storage Agent - with priorities 1 to 4 controlled by entitlement, virtual processors, and weight]

Each of the columns in the diagram above represents one of the 5 p595s in the Phase 2 hardware configuration.


Fig. 47 Micro-Partition Configuration: Starting Point

LPAR        | CPU guarantee (Capacity Entitlement) | Maximum CPUs (Virtual Processors) | Priority (Weight)
DB2         | 8                                    | 42                                | 128
On-line     | 16                                   | 22                                | 64
Upload      | 16                                   | 58                                | 16
Aggregate   | 16                                   | 42                                | 32
Storage Agt | 5                                    | 8                                 | 0

The table above shows the micro-partitioning configuration which will be used at the start of phase 2. The objective is to share as much as possible but maintain strict control of the resources. This will prove an interesting challenge!
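As a quick sanity check of this starting point (a sketch using the table values as reconstructed above): the guaranteed capacity entitlements must fit within the 64 physical processors of a p595, while the virtual-processor counts may heavily overcommit the shared pool.

```python
# Sanity check of the phase-2 micro-partitioning starting point (values from Fig. 47).
lpars = {
    #  name:        (entitlement, virtual_processors, weight)
    "DB2":          (8, 42, 128),
    "On-line":      (16, 22, 64),
    "Upload":       (16, 58, 16),
    "Aggregate":    (16, 42, 32),
    "Storage Agt":  (5, 8, 0),
}

physical_cpus = 64   # per p595
total_entitlement = sum(e for e, _, _ in lpars.values())
total_vps         = sum(v for _, v, _ in lpars.values())
print(f"guaranteed entitlement: {total_entitlement} of {physical_cpus} CPUs")     # 61 of 64
print(f"virtual processors:     {total_vps} (overcommitted, capped by the pool)")  # 172
assert total_entitlement <= physical_cpus
```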

Summary
The starting point of this project was a clone of the current customer landscape on AIX 5.2 and DB2 V8, using a single p595 server and a single DS8300 storage server for the database. During phase 1 of the PoC, documented here, the business processes of the day-to-day activities were studied and a best “price/performance” approach was developed for the critical tasks. During this phase, a number of beneficial changes were introduced into the landscape, following a migration path which the customer either has in plan or is considering. AIX 5.3 was introduced and showed the benefits of simultaneous multi-threading on system-wide throughput. DB2 was upgraded to V9, and an immediate improvement in the “DB-bound” workload was demonstrated. The new POWER5 technology, at 2.3 GHz, was introduced into the DB and storage servers (DS8300 Turbo), with additional throughput and response time improvements.

Phase 1 of the project, at 7-20 terabytes, has brought an in-depth understanding of the load requirements and has shown that the architecture currently in place at the customer is capable of scaling to manage the immediate next phase of their business requirements. Phase 1 also lays the groundwork for the next phase of scalability. The building blocks of the phase 1 hardware, 5 LPARs for DB2 with separate data paths and external backbone communication, prove the scalability of the application stack. This design is the basis for the distribution of the DB over 5 separate physical servers to extend the hardware scalability into phase 2: 60TB and the customer’s high-end performance expectations. The study of the combined load behaviour over both database and application servers is the basis for the introduction of the POWER5 virtualization design for phase 2, using the shared processor pool to increase the scalability of the entire solution.



Storage Design for a High-End Parallel Database

Document Scope
This whitepaper covers the portion of the PoC around the storage design and layout in support of a shared nothing implementation and the requirement for high-performance parallel backup/restore functionality. The storage design in the 20TB phase was restricted by customer requirements to a single DS8300; this reflected the infrastructure planned by the customer in its capacity planning for 20TB. The tests documented for the 20TB part of the PoC were to provide the basis for the follow-on design of the high-end landscape of 60TB. In the storage design phase, several current “best-practice” approaches for large system implementations were considered, each having credible references, and each having benefits and limitations. As the storage implementation could be done only once, a significant effort went into evaluating the options and selecting the approach to be taken. This whitepaper describes these options, the path chosen, and the results.

Contents

Design for a High-End Parallel Database
Document Scope
Contents
The Storage Environment
Storage design description
The storage and AIX file systems layout
The SAN design
The backup and FlashCopy design and implementation
Backup
FlashCopy
DS8300 internal addressing and total capacity consideration
Disk Storage Design Summary

The Storage Environment

This chapter describes the storage implementation used in the Proof of Concept and describes the project from a storage expert's perspective. It is subdivided into the following topics:

• “Storage design description”: describes the storage architecture of the test environment.

• “The storage and AIX file systems layout” describes how the disks have been configured.

• “The SAN design” describes the Storage Area Network components.


• “The backup and FlashCopy design and implementation” provides the details for the two main features used for the tests.

• “DS8300 internal addressing and total capacity consideration” provides information on the storage capacity and the options chosen.

Storage design description

The objective of this project was to implement a DB2 multi-partition shared nothing architecture which means, ideally, that each DB2 partition gets dedicated CPUs, RAM and disks. This configuration:

• Offers a nearly linear scalable architecture with the number of partitions.

• Increases DB2 parallelism (and so performance).

• Gives a flexible architecture: it is possible to add and/or move DB2 partitions according to the needs.

For the 20 TB configuration 33 DB2 partitions were used, with a usable capacity of 27.6 TB for the production and a usable capacity of 24 TB for the backup. The total capacity was 51.6 TB.

The DS8000 disk space is divided into data, indexes, the DB2 logger, and temporary tablespaces.

600 GB have been allocated for the base tables, master tables and dimension tables.

The split between the data and indexes, the DB2 logger, and the temporary tablespaces is the following:

• DB2 logger: 3.25 TB • DB2 temporary tablespace: 1.75 TB • DB2 data and index: 20 TB

For each DB2 partition, the following amounts have been allocated:

• Partition 0:

− DB2 logger: 400 GB − DB2 temporary tablespace: 400 GB − DB2 data and index: 1200 GB

• Partition 6-37 (each)

− DB2 logger: 100 GB − DB2 temporary tablespace: 100 GB − DB2 data and index: 600 GB

27.6 TB for production data plus 24 TB for backup data use 512 disk drives.
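The per-partition allocations add up to the stated production capacity; a one-line verification:

```python
# Verify that the per-partition allocations add up to the stated capacities.
partition_0_gb = 400 + 400 + 1200    # logger + temporary + data/index, partition 0
partition_n_gb = 100 + 100 + 600     # logger + temporary + data/index, partitions 6-37 (each)
production_tb  = (partition_0_gb + 32 * partition_n_gb) / 1000
backup_tb      = 24
print(f"production: {production_tb} TB, total with backup: {production_tb + backup_tb} TB")
# -> production: 27.6 TB, total with backup: 51.6 TB, as stated above
```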


All the DB2 assumptions taken for sizing the storage are summarized in the table below.

− The “minimum required” lines show the minimum capacities as defined in the initial requirement for each component.
− The “allocated” lines show the real capacity formatted in the DS8000 after the complete layout study, including an optimum size and number of LUNs (for example, they include some extra space needed for the LUN definition and assignment).

Storage sizing (in TB)

                          | Data and Index TS | Temporary TS | Logger | Total
DB2 partition 0           |                   |              |        |
  Minimum required        | 0.60              | 0.36         | 0.25   | 1.21
  Allocated               | 1.20              | 0.40         | 0.40   | 2
DB2 partition 6-37 (each) |                   |              |        |
  Minimum required        | 0.606             | 0.043        | 0.093  | 0.742
  Allocated               | 0.6               | 0.1          | 0.1    | 0.8
DB2 total                 |                   |              |        |
  Minimum required        | 20                | 1.75         | 3.25   | 25
  Allocated               | 20.4              | 3.6          | 3.6    | 27.6

One DS8300 model from the DS8000 series was used in the PoC environment [1]; it is a frame storage subsystem with 80 arrays for a maximum of 640 disks, which provides an effective capacity of 68,560 GB (68.5 TB). Each disk drive module has a capacity of 146 GB and a rotation speed of 15 krpm. Only 512 disks are actually used in our environment, providing a usable capacity of 54 TB (54,016 GB). Of this total capacity, 27.6 TB are reserved for the production data and 24 TB for the backup (FlashCopy target) LUNs.

[1] To know more about the DS8000 architecture and models, you can read IBM System Storage Solutions Handbook, SG24-5250.

The main options and choices that determine the storage architecture


The different options for the storage layout for the SAP NetWeaver Business Intelligence (SAP BI) database can be:

• For the DB2 production LUNs:

– Either spread all the DB2 partitions on all the arrays by using small LUNs and having one LUN for each DB2 partition in each array.

– Or dedicate a group of arrays for each DB2 partition by using large LUNs in the group of array.

Our choice was the second option: with a dedicated group of arrays for each DB2 partition and a small number of LUNs on an array, the potential I/O contention on each array is reduced and the administrative tasks are simplified.

• For the Flashcopy LUNs:

– Either dedicate the arrays,

– or share the same arrays for the production and the FlashCopy LUNs.

Our choice was the second option: by sharing the same arrays, we optimize the production workload by providing more physical drives.

• For data/index, temporary files and log production LUNs:

– Either keep them separate, dedicating different physical arrays to each type (data, temp, log),

– or share the same group of arrays for all three types.

Our choice was the second option: by sharing the same group of arrays for the data/index, temp and log LUNs, the number of disks is minimized and the shared nothing architecture between the DB2 partitions is kept.

The high level implementation of the storage layout is shown in Figure 4-2 “Storage mapping between the DB2 servers and the DS8000”. Each AIX LPAR has its own set of source LUNs for the production database and target LUNs for the backup.

A link between the source LUNs and target LUNs, called the Flashcopy relationship, is created. The mode used to synchronize the source LUNs (production database) and the target LUNs (backup database) is called incremental mode.

A picture of a DS8000 unit is provided in figure 4-1 “DS8000 front view”.


Figure 4-1 DS8000 front view

Figure 4-2 Storage mapping between the DB2 servers and the DS8000

As shown in the figure 4-3 “The DS8000 main components”, the DS8300 is mainly composed of:

• Two controllers (POWER5 technology)

• Fibre Channel Host Adapters (HA), to connect to the servers. 24 FC ports are used in our environment.

• Eight device adapters (DA), to connect to the switched back end.


The storage is designed to balance the resources between the controllers: half of the logical volumes (LUNs) are managed by the controller/server 0 and the other half is managed by the controller/server 1.

Figure 4-3 The DS8000 main components

Figure 4-4 “Arrays implementation” shows the configuration with a total of 64 arrays and 8 device adapters (DA):

• 32 arrays with 6 hard disk drives (HDD) for data, 1 HDD for parity and 1 HDD for spare.

• 32 arrays with 7 hard disk drives (HDD) for data and 1 HDD for parity.

Every DA is connected to a group of 8 arrays (4 of each type). These arrays are more detailed in the figure “Summary of the implementation in our environment”.


Figure 4-4 Arrays implementation

The next section will define the distribution of the data on those different arrays in order to optimize the I/O parallelism.

The storage and AIX file systems layout

In summary, as depicted in Figure 4-5 “Summary of the implementation in our environment”, we have four arrays/ranks for each DB2 partition with the following rules:

• No arrays shared between DB2 partitions 1 to 16

• No arrays shared between DB2 partitions 17 to 32

• The same array is shared between DB2 partitions 1 and 17, 2 and 18, n and n+16, until 16 and 32

• The same array is shared between DB2 partition 0 and partitions 1, 17, 5, 21, 3, 19, 7, 23.


Figure 4-5 Summary of the implementation in our environment

The following concepts have been followed in the storage layout:

• The data/index, temporary and log LUNs share the same group of array.

• Each DB2 partition is mapped with four arrays with six HDDs for data and one for parity, and four arrays with seven HDDs for data and one for parity; this configuration makes the best use of the disks and the internal servers. A minimal number of LUNs means better management in production and better performance for backup/restore, disaster recovery and FlashCopy.

• Only 2 sizes for the LUNs: 25 GB, 150 GB have been used.

These concepts provide an easy way to monitor, predict and understand performance issues, and they also made it easy to migrate DB2 partitions from one AIX LPAR to another and from one DS8300 to another.

Each DB2 partition (6 to 37) has four file systems for the data/index tablespaces, four file systems for the temporary tablespaces and one file system for the logs. In each file system a DB2 container is defined, and DB2 spreads the content of each table over those four containers, and so over the 4 file systems, using a block size of 32 KB. Each group of four file systems is hosted on four LUNs in the DS8000. Figure 4-6 “Details of the arrays implementation for the DB2 partition 6” provides an example of this implementation for the DB2 partition 6.


Figure 4-6 Details of the arrays implementation for the DB2 partition 6

Two options are possible:

1. Either use one LUN for one file system, as shown in figure 4-7 “Array implementation option: one LUN per file system”.

Figure 4-7 Array implementation option: one LUN per one file system


2. Or use the four LUNs grouped together, using AIX LVM spreading (max policy on), for the 4 file systems, as shown in figure 4-8 “Array implementation option: LUNs grouped together”.

The second option was implemented in our environment for reasons of manageability and scalability. The first option might still improve performance, but it has not been tested in this PoC. We expect that, because of the large number of tables (several hundred) and the random distribution of those tables in the file systems, the difference between the two options would not be significant in this environment.

Figure 4-8 Array implementation option: LUNs grouped together

The following file systems have been defined for the DB2 partitions 6 to 37:

• Four file systems per DB2 partition for data and index tablespace (sapdata1 to sapdata4) mapped to 8 LUNs.

– DB2/EB8/sapdata1/NODE000X – DB2/EB8/sapdata2/NODE000X – DB2/EB8/sapdata3/NODE000X – DB2/EB8/sapdata4/NODE000X

• Four file systems per DB2 partition for the temporary tablespace (sapdatat1 to sapdatat4) mapped to 8 LUNs.

– DB2/EB8/sapdatat1/NODE000X – DB2/EB8/sapdatat2/NODE000X – DB2/EB8/sapdatat3/NODE000X – DB2/EB8/sapdatat4/NODE000X

• One file system per DB2 partition for logger (log_dir) mapped to 8 LUNs

– DB2/EB8/log_dir/NODE000X


In summary, as shown in figure 4-9 “Arrays and file system distribution for the DB2 partitions 6 to 37 (LPAR1 as an example)”, each DB2 partition (6 to 37) has 9 file systems on a total of 24 LUNs. Overall, 313 file systems (25 + 32 x 9 = 313) have been defined, with a maximum of 72 file systems per System p server (8 x 9 = 72).

Figure 4-9 Arrays and file system distribution for the DB2 partitions 6 to 37 (LPAR1 as an example)

The following file systems have been defined for the DB2 partition 0:

• 12 file systems for DB2 partition 0 for data and index tablespace (sapdata1 to sapdata12) mapped to 12 LUNS.

– DB2/EB8/sapdata1/NODE0000 – DB2/EB8/sapdata2/NODE0000 – ……. – DB2/EB8/sapdata11/NODE0000 – DB2/EB8/sapdata12/NODE0000

• 12 file systems for DB2 partition 0 for Temporary tablespace (sapdatat1 to sapdatat12) mapped to 24 LUNS.

– DB2/EB8/sapdatat1/NODE0000 – DB2/EB8/sapdatat2/NODE0000 – ……. – DB2/EB8/sapdatat11/NODE0000 – DB2/EB8/sapdatat12/NODE0000

• One file system for DB2 partition 0 for logger (log_dir) mapped to 24 LUNs

– DB2/EB8/log_dir/NODE0000

In summary, as shown in figure 4-10 “Arrays and file systems distribution for the DB2 partition 0”, 25 file systems have been defined for the DB2 partition 0, with a total of 60 LUNs.


Figure 4-10 Arrays and file systems distribution for the DB2 partition 0

The difference between the DB2 partition 0 and the other partitions in the number of file systems is purely the result of our successive tests: the first system had 6 DB2 partitions with 12 file systems each; when we extended the DB2 architecture from 6 to 33 DB2 partitions, the possible number of file systems was between 1 and 12, and 4 was chosen as a compromise between manageability, performance and the number of arrays reserved for each DB2 partition.
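To make the naming scheme and the file-system count concrete, the sketch below generates the paths exactly as listed above: 25 file systems for partition 0 and 9 for each of the partitions 6 to 37.

```python
# Generate the DB2 file system layout described above and confirm the counts.
def filesystems_for_partition(node):
    base = f"NODE{node:04d}"
    if node == 0:
        data = [f"DB2/EB8/sapdata{i}/{base}" for i in range(1, 13)]     # 12 data/index
        temp = [f"DB2/EB8/sapdatat{i}/{base}" for i in range(1, 13)]    # 12 temporary
    else:
        data = [f"DB2/EB8/sapdata{i}/{base}" for i in range(1, 5)]      # 4 data/index
        temp = [f"DB2/EB8/sapdatat{i}/{base}" for i in range(1, 5)]     # 4 temporary
    return data + temp + [f"DB2/EB8/log_dir/{base}"]                    # + 1 log

all_fs = [fs for node in [0] + list(range(6, 38)) for fs in filesystems_for_partition(node)]
print(len(filesystems_for_partition(0)), len(filesystems_for_partition(6)), len(all_fs))
# -> 25 9 313   (25 + 32 x 9 = 313, as stated above)
```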

The SAN design

The zoning at the SAN level and the LUN masking at the DS8300 level is defined in order to have a maximum of 4 paths for every LUN.

• A zone will include all the ports of a DS8300 and the POWER5-595 LPAR connected to it.

• Each group of LUNs in a DS8300 belongs to a DS8000 Volume Group associated with 4 hostconnects, each with a maximum of one I/O port, in order to have at most 4 paths.

• LPAR 0 (DB2 partition 0) shares one FC port in each group of 4 FC ports used for LPAR1 to LPAR4.

Hostconnect, I/O port and Volume Group are specific DS8000 definitions and should not be interpreted as UNIX terminology. For more information about the DS8000, refer to IBM System Storage DS8000 Series: Architecture and Implementation, SG24-6786.

Figure 4-11 “The SAN fabric” shows the SAN components set up in our environment.


Figure 4-11 The SAN fabric

The backup and FlashCopy design and implementation

In this section we describe the backup and FlashCopy processes and options. For details about these functions, see IBM TotalStorage DS8000 Series: Copy Services in Open Environments, SG24-6788. This section discusses only the functions used for our tests.

Backup

Three types of backup are available, as described in figure 4-12 “Types of backup with Tivoli Storage Manager”: backup via the LAN, LAN-free backup and server-free backup.

• Backup via the LAN

In a traditional LAN environment the Tivoli Storage Manager backup and archive client or application reads data from locally attached disks and then sends it over the LAN to the Tivoli Storage Manager backup server. The server receives the data and then writes it out to its storage pool based on predefined policies and server configuration. Data is read and written by both the Tivoli Storage Manager client and Tivoli Storage Manager server machines. In addition, control information is also sent over the LAN to the Tivoli Storage Manager server.

• LAN-free backup

SAN technology provides an alternative path for data movement between the Tivoli Storage Manager client and the server. Shared storage resources (disk, tape) are accessible to both the client and the server through the SAN. Data movement is off-loaded from the LAN and from the server processor. LAN-free backups decrease the load on the LAN by introducing a Storage Agent. The Storage Agent handles the communication with the Tivoli Storage Manager server over the LAN but sends the data directly to SAN-attached tape devices, relieving the Tivoli Storage Manager server from the actual I/O transfer.

• Server-free backup

Server-free backup/restore capability was introduced in Tivoli Storage Manager Version 5. In a server-free backup environment, data is copied directly from the SAN-attached Tivoli Storage Manager client disk to the SAN-attached tape drive via the SAN Data Gateway data mover. The Storage Agent used in LAN-free backups is not used. The data movement is actually done by a SAN Data Gateway (SDG) or a similar device on the SAN. Therefore, neither the Tivoli Storage Manager client nor the server machine has to read and write the data at all. The Tivoli Storage Manager server sends commands to the SDG device to tell it which blocks to move from which SAN-attached disk to which SAN-attached tape device. The data is actually copied rather than moved from one location to another. This provides a way to back up and restore large volumes of data between client-owned disks and storage devices using a method that considerably reduces overhead on the Tivoli Storage Manager server and the client.

Only volume images and not individual files can be moved by server-free data movement. The data is transferred block by block rather than by doing file I/O. Both raw and Windows NT® file system (NTFS) volumes can be backed up using the server-free backup capability. Data that has been backed up using this technique can be restored over a server-free path, over a LAN-free path, or over the LAN itself. The impact on application servers is now minimized with this type of backup. It reduces both Tivoli Storage Manager client and server CPU utilization.

The data mover device can be anywhere in the SAN, but it has to be able to address the LUNs for both the disk and tape devices it is moving data between.



Figure 4-12 Types of backup with Tivoli Storage Manager

The server-free backup was chosen for our tests: it has the benefit of using neither the LAN nor the resources of the production DB2 server.

FlashCopy

The FlashCopy creates a copy of a logical volume at a specific point-in-time, which we also refer to as a Point-in-Time Copy, instantaneous copy, or t0 copy (time-zero copy).

By doing a FlashCopy, a relationship is established between a source and a target; both are considered to form a FlashCopy pair. As a result of the FlashCopy, either all physical blocks are copied (full copy), or only those blocks that change in the production data after the FlashCopy has been established (using the nocopy option).

The three main steps of a FlashCopy operation are the creation of the FlashCopy relationship, the reading from the source and the writing to the target.

• Establish FlashCopy relationship.

When the FlashCopy is started, the relationship between the source and the target is established within seconds. This is done by creating a pointer table including a bitmap for the target.

Let us assume all bits for the bitmap of the target are set to their initial values. This represents the fact that no data block has been copied so far. The data in the target will not be touched during the setup of the bitmaps. Once the relationship has been established it is possible to perform read and write I/Os on both the source and the target.

• Reading from the source.

The data can be read immediately after the creation of the FlashCopy relationship.

• Writing to the source.

Whenever data is written to the source volume while the FlashCopy relationship exists, the storage subsystem makes sure that the time-zero-data is copied to the target volume prior to overwriting it in the source volume.

Figure 4-13 “The FlashCopy process” shows the FlashCopy process.

With a normal FlashCopy, a background process is started that copies all data from the source to the target. Incremental FlashCopy provides the capability to refresh a FlashCopy relationship. With incremental FlashCopy, the initial relationship between a source and a target volume is maintained.

In our tests we used the incremental FlashCopy with the copy option (full volume copy). In that case, during a refresh, the updates that took place on the source volume since the last FlashCopy will be copied to the target volume. Also, the updates done on the target volume will be overwritten with the contents of the source volume.

Figure 4-13 The FlashCopy process

The target LUNs are used during the server-free backup with FlashCopy:

• Only the data/index and temporary tablespace LUNs will be flashcopied

• The target FlashCopy LUNs are spread on all the arrays of the DS8000 (no dedicated arrays assigned to the target FlashCopy LUNs)

• The target FlashCopy LUNs are on different arrays than the source FlashCopy LUNs, as shown in figure 4-14 “Target LUNs mapping with DB2 partitions”. In particular, when comparing figure 4-5 “Summary of the implementation in our environment” and figure 4-14 “Target LUNs mapping with DB2 partitions”, you can notice the shift of the device adapter (DA): for example, DB2 partition 6 has moved from DA2 to DA0.


Figure 4-14 Target LUNs mapping with DB2 partitions

• The FlashCopy LUN pairs are across device adapters and on the same internal server (server/controller 0 or 1).

• All the FlashCopy relationships are started almost at the same time, so the completion of the FlashCopy is very fast and the source and target LUNs are available quickly.

• A maximum of 16 background copies occur in parallel. This is because each device adapter of the DS8300 allows 4 background copies: 2 outbound and 2 inbound. In our implementation we have 8 device adapters and, with all the relationships across the device adapters, we obtain a total of eight times 2 background copies.

The LUNs defined and copied or refreshed by FlashCopy are used by Tivoli Storage Manager Data Protection for FlashCopy, DB2 and SAP.


DS8300 internal addressing and total capacity consideration

The internal addressing is used to identify and assign a number to every LUN within the DS8300. This address is used on the server side in the different LPARs to map the hdisk² to the correct file system. There are 3 levels of hierarchy.

• The first level (noted X in figure 4-15 “LUN identification”) also called address group identifies the LPAR and the production versus the backup LUNs:

− For the LPAR 0 (DB2 partition 0) all the production LUN identification numbers start with "0" and the backup LUN identification numbers with an "A"

− For the LPAR 1 (DB2 partition 6 to13) all the production LUN identification numbers start with "1" and the backup LUN identification numbers with a "B"

− For the LPAR 2 (DB2 partition 14 to 21) all the production LUN identification numbers start with "2" and the backup LUN identification numbers with a "C"

− For the LPAR 3 (DB2 partition 22 to 29) all the production LUN identification numbers start with "3" and the backup LUN identification numbers with a "D"

− For the LPAR 4 (DB2 partition 30 to 37) all the production LUN identification numbers start with "4" and the backup LUN identification numbers with an "E"

• The second level (noted Y in figure 4-15), also called Logical Subsystem (LSS), identifies the DB2 partition within an LPAR and also assigns a LUN to controller/server 0 (if the number is even) or to controller/server 1 (if the number is odd). For instance, DB2 partition 16 in LPAR 2 has the addresses 24 and 25.

• The third level (noted A and B in figure 4-15) identifies the data type (for all DB2 partitions except 0); a worked example follows the list:

– Data / index: 00, 01
– Temp: 02, 03
– Log: 04, 05
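For example, combining the three levels (an illustration derived from the scheme above): a LUN with the identifier 2400 is a data/index LUN of DB2 partition 16, because the address group "2" designates LPAR 2, the LSS "24" designates DB2 partition 16 on controller/server 0 (even number), and the volume number "00" designates data/index. The backup (FlashCopy target) LUN identifiers for LPAR 2 start with "C" instead, for example C400.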

² An hdisk is an AIX term representing a logical unit number (LUN) on an array as seen from the OS level.



Figure 4-15 LUN identification

Figure 4-16 “Total LUNs layout” summarizes the total capacity and the distribution between the LPARs and the types of LUNs (production, Fiber Channel, logger, temporary, data and index).


Figure 4-16 Total LUNs layout

Disk Storage Design Summary

The challenge of the storage design was to balance two conflicting requirements: on one side obtain the best performance, and on the other side use the smallest possible capacity for the storage subsystem, while still following the guidelines for the 32-partition DB2 database: separate the different types of data (loggers, temporary and data/index tablespaces) on different arrays, follow the shared-nothing architecture, and provide enough read and write I/Os for every data type and every workload. One solution could have been to use more than 2000 physical disks for a 20 TB SAP NetWeaver Business Intelligence database, but this was not compatible with the customer requirement of a single DS8300 storage subsystem.

The storage design resulted in the implementation of a DS8300 with 512 physical disks (146 GB each, 15 krpm) and 24 FC ports. All 512 physical disks are available for the production LUNs and shared with the backup LUNs. The SAN is completely separated between production and backup activities, from the servers via the FC directors up to the DS8300 FC ports (16 for production and 8 for backup). Each DB2 partition is mapped to 4 disk arrays. The most important data layout characteristic is a complete balancing and even distribution: the 8 internal paths (loops) are evenly used, with the same number of disks (64) and the same number of arrays with 6 or 7 data disks, which guarantees the same level of performance for all the DB2 partitions.

The results demonstrated an excellent overall performance of the design (up to 80,000 I/Os per second with a read/write ratio of 90/10) and confirmed the design choices:

• An even distribution of the workload on all the DB2 partitions, which is consistent with 4 dedicated arrays per DB2 partition (shared nothing).
• The main workload is on the data/index tablespace LUNs and is not concurrent with the workload on the logger and temporary tablespace LUNs. This does not conflict with mixing the logger, data/index and temporary tablespace LUNs of each DB2 partition on the same group of 4 physical arrays.
• Only a small impact (less than 20 %) of the server-free backup (using the FlashCopy target disks) on high-load online query activities, even though the same arrays are shared by the application and backup LUNs.

Some lessons learned:

1. It is important to separate the physical FC paths for production and backup traffic.
2. Monitoring is easy because of the dedicated arrays for each DB2 partition and the dedicated LUNs for each data type.
3. Keep some extra capacity (at least 20 %) in order to provide flexibility and manageability of the SAP NetWeaver Business Intelligence multi-partition database.


Backup/ Restore/ Recovery of a 20 TB SAP NetWeaver Business Intelligence System in a DB2 UDB Multi-partitioned Environment

Scope

This document covers the portion of the PoC around backup, restore and recovery using the IBM Tivoli functionality. It covers backup/ restore/ recovery for a 20 TB SAP NetWeaver Business Intelligence (SAP BI) system in a multi-partition DB2 UDB ESE DPF installation, using IBM Tivoli Storage Manager for Hardware Data Protection for Disk Storage and SAN VC for SAP with DB2 UDB together with IBM DS8000 storage servers and IBM System p5 (POWER5) servers running the IBM AIX operating system.

Contents

Backup/ Restore/ Recovery of a 20 TB SAP NetWeaver Business Intelligence System in a DB2 UDB Multi-partitioned Environment  75
  Scope  75
  Contents  75
  The Environment  76
  Introduction and motivation for the tests  77
  Building blocks of the solution  78
    SAP Business Warehouse  79
    DB2 UDB Database partitioning feature  80
    DB2 UDB backup/ restore  80
    DB2 UDB database split-mirror  82
    IBM Tivoli Storage Manager for Hardware Data Protection for Disk Storage and SAN VC for SAP with DB2 UDB (DP for FlashCopy)  83
    Tivoli Data Protection for Enterprise Resource Planning SAP for DB2 UDB (DP for SAP)  83
    IBM System p5 595  84
    IBM TS3500 Tape Library and TS1030 LTO Ultrium 3 Tape Drive  85
    IBM System Storage DS8300  85
    FlashCopy  85
    CIM agent / DS 8000 Open API  86
  IT infrastructure requirements and planning  86
    Software stack  86
    Database Partition Layout  87
    Log file management in a DB2 UDB DPF environment  89
      DB2 log manager  89
      DB2 log file states  90
      Log Retrieve and collocation  90
    Storage Layout  93
      Volume Group and Filesystem Layout  93
  Hints and Tips for the setup  97
    TCP Ports  97
    ssh configuration  98
    socket servers  98
    CIM agent / DS Open API  99
    Additional DB2 configuration parameters on the backup server  99
      Disable Health monitor  99
      Overwrite bufferpool settings  100
  Infrastructure tests  101
    Influencing factors for backup throughput and runtime  101
    Online Tape Backup  102
    FlashCopy backup  105
      Phase 1: Pre Checks  106
      Phase 2: Invoke FlashCopy  106
      Phase 3: Access FlashCopied storage on backup server(s)  106
      Phase 4: Run backup to tape on backup server(s)  107
      Phase 5: Cleanup  110
      Observations for the background copy in the DS8k  110
    FlashCopy Restore  111
    Database Restore from Tape  113
    Roll-forward Recovery  116
    Index Creation  120
  Summary  124

The Environment

Managing backup, restore and recovery of IBM DB2 UDB databases, so that business-essential data can be rebuilt with integrity within a very short period of time after a failure (logical or physical error), has always been a major IT challenge. IBM DB2 UDB databases support the backbone of enterprise business applications, and the stable and reliable operation of large and fast-growing databases is a key to business success. One popular business application in this context is SAP NetWeaver® Business Intelligence (SAP NetWeaver BI), which paints a complete picture of the business to satisfy the diverse needs of end users, IT professionals, and senior management. SAP NetWeaver BI benefits from an underlying IBM DB2 UDB database managing all the data items and brings together a powerful business intelligence infrastructure, a comprehensive set of tools, planning and simulation capabilities, and data-warehousing functionality.

Supporting the global business of a large company, the database of an SAP NetWeaver BI installation may reach a significant size. To demonstrate feasibility, scalability and reliable operation of a large SAP NetWeaver BI installation based on IBM DB2 UDB ESE, a PoC environment targeting a database size of 20 terabytes was set up. The purpose of this white paper is to provide a comprehensive set of best practices and procedures for backup/ restore/ recovery when deploying SAP NetWeaver BI together with IBM DB2 UDB databases on IBM System Storage DS8000 servers and IBM System p5 servers using the AIX operating system.


Introduction and motivation for the tests

The PoC basically consists of two parts that are tightly connected and dependent on each other. The performance and scalability part investigates the throughput of running queries, loading data into the InfoCubes and building the aggregates. The infrastructure part addresses the manageability of a 20 TB database from the point of view of a recovery from a disaster. This part of the test addresses the feasibility of operations like FlashCopy, backup, restore and recovery. In a few words, the PoC verifies the following customer requirements:

• FlashCopy in less than 8 hours, in parallel with a defined workload.
• Tape backup of a FlashCopy in less than 8 hours.
• Online tape backup, in parallel with simulated query and data load activities.
• FlashCopy restore and simultaneous roll-forward of 500 GB of logs in less than 8 hours.
• Database restore from tape using TSM and roll-forward of 2 TB of logs in less than 18 hours.
• Rebuild of indexes (3 TB of data in total) in less than 2 hours.

These test scenarios reflect business requirements such as being able to recover from a local issue within one business day, which can be achieved using a FlashCopy restore. Another scenario that needs to be addressed is recovering from a disaster that requires a full database restore. This recovery scenario has to be completed within a timeframe of 24 hours: during the tests, the tape restore, immediately followed by the roll-forward operation for 2 TB of log files and the (re-)creation of indexes in a combined run, took about 20 hours. The remaining 4 hours may be needed to set up the hardware and application infrastructure. The amount of log files reflects their average generation during normal production workload: 500 GB of log files may be generated as a peak amount during one day with high workload, and typically 2 TB of log files are generated during one week.

The backups are not only the foundation for successful restores and recovery, but need to meet further requirements: backups need to be taken on a daily basis within a certain timeframe to be able to recover from a local issue within 8 hours. In the case of the PoC this backup timeframe is defined as 16 hours. This timeframe can be met by using the FlashCopy functionality of the storage server and a subsequent backup to tape with TSM from another server/LPAR: using this approach, the backup is offloaded from the production database and therefore does not impact production performance. Another approach to limit the impact on the production environment, which was also tested, is the capability of DB2 to throttle a database backup.

In summary, all the infrastructure tests address the requirement of backing up the 20 terabyte database daily without significant impact on production, and the capability to recover from a system failure within a business day or to completely rebuild the database environment within 24 hours.


Building blocks of the solution

The pictures below describe the component model and the specific hardware and software setup used in the PoC installation. All these individual components and the relationships amongst them are discussed in the next sections. From an IT infrastructure view, the following main setup was available for the tests:

[Figure: PoC infrastructure setup. A production IBM System p5 595 (64 cores, 1.9 GHz, 256 GB RAM) hosts LPAR 0 (sys1db0p, DB2 partition 0), LPAR 1 (sys1db1p, DB2 partitions 6 … 13), LPAR 2 (sys1db2p, partitions 14 … 21), LPAR 3 (sys1db3p, partitions 22 … 29), LPAR 4 (sys1db4p, partitions 30 … 37) and LPAR 5 (sys1ci) with the SAP central instance and application servers AS1/AS2. A second System p5 595 hosts LPAR 6 (deats005) running the TSM server. Each database server runs the AIX operating system, DB2 UDB ESE, Tivoli Data Protection for SAP (DB2), the Tivoli Storage Manager API client, Tivoli Storage Manager for Hardware (management and workflow for the FlashCopy backup) and the Tivoli Storage Agent for LAN-free backup/restore; log archiving and online backup/tape restore traffic goes to the TSM server over the LAN. The DS8300 holds the FlashCopy source and target LUNs and is attached via 16 Fibre Channel adapters for the FC disks; a tape library with 16 LTO3 tape drives is attached via 16 Fibre Channel adapters, with LTO3 tape delegation for LAN-free backups. A Windows 2000 server runs the CIM agent (http port 5988) as the interface to the FlashCopy copy services.]


SAP Business Warehouse

Business information warehousing is a means for companies to take advantage of the data they gather to better understand their business and how to increase efficiency, retain customers, and grow revenues. SAP NetWeaver® Business Intelligence (SAP NetWeaver BI) grew within the SAP product offering from the need for more flexible and powerful reporting capabilities to process the data stored in SAP® ERP R/3 systems. SAP NetWeaver BI is fully integrated within the SAP NetWeaver 2004 portfolio; you can use SAP NetWeaver BI to integrate data from across the enterprise and beyond, and then transform it into practical, timely business information to drive sound decision making, targeted action, and solid business results. SAP NetWeaver BI provides tools for Extraction, Transformation and Loading (ETL), Data Warehouse Management and Business Modeling comprised in the Administrator Workbench, and tools for Query Design, Managed Reporting and Analysis and Web Application Design. Furthermore it supports (among others) functionalities like Data Mining, Alert Reporting, Business Planning and Simulation. Pre-configured Business Content can be adjusted to the company's business needs as an easy starting point.

Technically, the SAP NetWeaver BI information model uses InfoObjects as its main building block. InfoObjects are used to represent business entities and information items (business evaluation objects) and can be reused in other key elements of the SAP information model such as InfoSources, Persistent Staging Area (PSA), Operational Data Store (ODS) objects, and InfoCubes. They are divided into characteristics (e.g. customer, material), key figures (e.g. amount, revenue), and units. InfoSources are groups of InfoObjects that logically belong together from a business point of view. They are used to model transactional data structures from Online Transactional Processing (OLTP) systems and master data, which contains data such as customer addresses that do not change over time. ODS objects are data sets resulting from the merging of one or more InfoSources and are mainly used as intermediate data containers for preparing reports and for quality assurance purposes. Together with PSA tables, which contain similar information, they typically are the largest objects in the database of an SAP NetWeaver BI system.

The most important InfoProviders are InfoCubes, which are defined and created using an extended star schema database design to organize data for reporting and analysis. A basic star schema contains independent dimension tables that are linked together through a fact table. The extended star schema adds to the star schema capability by also storing master data about attributes, hierarchies, and text in separate tables, which are shared between InfoCubes. This means that data redundancy is reduced because master data only needs to be stored once but can be used by various InfoCubes. Each InfoCube is made up of two fact tables and a set of dimension tables. Fact tables contain data that describes specific events within a business and are the largest tables in a star schema; they can contain billions of entries. Dimension tables contain further information about attributes in a fact table but are independent of each other and are linked only through fact tables. Since the information stored in the InfoCubes is intended to be used for reporting, the data granularity is in general lower than in ODS objects, which means that InfoCubes normally don't grow as large as ODS objects or PSA tables.


Aggregates are performance-enhancing features of InfoCubes. They are smaller versions of the InfoCubes, containing aggregated data with respect to certain conditions that suit the data selections of a set of business queries.

DB2 UDB Database partitioning feature

The database partitioning feature (DPF) of IBM's DB2 UDB Enterprise Server Edition (DB2 ESE) allows scaling of a single database across more than one server or LPAR. DPF enables the distribution of a large database over multiple partitions using a shared-nothing architecture. For SAP system environments, the DB2 UDB database partitioning feature is supported for "OLAP"-type systems (SAP NetWeaver BI, SAP Supply Chain Management application) only. If there is a one-to-one relationship between database partitions and OS images installed either on servers or LPARs, the database partitions are called physical database partitions. If each OS image holds more than one database partition, the database partitions are called logical database partitions. DB2 UDB ESE DPF can be set up on standalone SMP servers by installing several logical database partitions within one OS image: server utilization may be increased by "scaling up" with multiple logical database partitions. "Scaling out" across a set of servers may be set up either using physical database partitions only, or by distributing many logical database partitions across all available servers/LPARs. If sufficient hardware resources (tape drives, CPU power) are available, backup/restore/recovery procedures will also benefit from partitioning, as tasks will be split across multiple partitions in parallel.

The DB2 version used in this proof of concept was DB2 V8.2 Fix Pack 12, 64-bit, running in 5 LPARs on a System p5 595 machine. One LPAR was dedicated to DB2 partition 0 containing the SAP basis tables, the dimension and master tables. This LPAR additionally had DB2 partitions 1 to 5, which are placeholders and do not have any tablespaces or other data assigned. The database layout was changed during this proof of concept from 6 partitions to 33 partitions; thus, partitions 1 to 5 were used in the old layout but not in the final PoC configuration. The other 4 LPARs, containing the remaining 32 DB2 partitions, held the large ODS, PSA and InfoCube objects. Each of these LPARs was configured with 8 database partitions. The CPU and memory configuration of the partitions changed during the various phases of the test. The illustration below shows the principle of the implementation.

DB2 UDB backup/ restore

Backup and restore commands are integrated directly in DB2 UDB. While performing a backup/ restore, DB2 UDB will fork additional processes:

• Buffer manipulator processes read data blocks from the tablespaces and store them into intermediate buffers. The number of buffer manipulator processes forked during the backup is controlled via the “PARALLELISM” clause in the DB2 BACKUP/ DB2 RESTORE command.

• Media controller processes read the content of the intermediate buffers and transmit it to the target device. For a TSM or VENDOR target, the number of media controller processes is controlled via the "OPEN SESSIONS" clause. On the TSM server, sufficient resources (tape drives) have to be available to allow opening all sessions in parallel.

• The number and size of the intermediate buffers are controlled via the "WITH BUFFERS" and "BUFFER" clauses. To be able to allocate the required memory for the specified buffers, the DB2 database configuration parameter "UTIL_HEAP_SZ" has to correspond with the memory usage. In this PoC, the value was set to 30720 4 KB pages.

The database manager configuration parameter util_impact_lim makes it possible to limit the performance impact of the backup activities on the database workload during the runtime of the backup. By default, no throttling for backup or any other utility is enabled (util_impact_lim = 100). During the runtime of the backup the database workload will typically be affected and suffer degradation. If the impact on the application during the backup runtime is too high, the backup may be throttled. Backup throttling is enabled by setting util_impact_lim to a value less than 100 (in addition, the backup has to be invoked with a non-zero priority). Specifying an util_impact_lim value of e.g. 10 should restrict the maximum impact of the backup on the database workload to 10 percent or less. A throttled backup will usually take longer to complete than an unthrottled one, but causes less degradation. In a DPF setup, the DB2 backup/restore commands have to be invoked on all database partitions: the partition containing the database catalog is always backed up/restored first and sequentially before the others. All further partitions can be backed up/restored sequentially or in parallel.

[Figure: DB2 backup architecture on one database partition. Buffer manipulator processes (db2bm, number set by "PARALLELISM k") read from the DB2 tablespaces and fill the intermediate buffers; media controller processes (db2med, number set by "OPEN i SESSIONS") read the buffers and write the data to the target device (DISK, TSM or a VENDOR library). Generic command: db2 backup database <name> [TO /dir | USE TSM | LOAD lib] OPEN [i] SESSIONS WITH [j] BUFFERS BUFFER [size] PARALLELISM [k].]
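As an illustration of the clauses and the throttling described above, the following is a minimal sketch for two database partitions of this PoC; the database name EB8 is taken from the PoC, while the DP for SAP library path, the session/buffer values and the priority are illustrative:

   # optionally limit the impact of utilities on the database workload to about 10 %
   db2 update dbm cfg using UTIL_IMPACT_LIM 10

   # attach to the catalog partition (0) first and back it up with a non-zero priority
   # (each command must be run on the host that owns the partition)
   export DB2NODE=0
   db2 terminate
   db2 "BACKUP DATABASE EB8 ONLINE LOAD /usr/tivoli/tsm/tdp_r3/db264/libtdpdb264.a OPEN 2 SESSIONS WITH 4 BUFFERS BUFFER 4096 PARALLELISM 4 UTIL_IMPACT_PRIORITY 50"

   # repeat for the remaining partitions (sequentially or in parallel), e.g. partition 6
   export DB2NODE=6
   db2 terminate
   db2 "BACKUP DATABASE EB8 ONLINE LOAD /usr/tivoli/tsm/tdp_r3/db264/libtdpdb264.a OPEN 2 SESSIONS WITH 4 BUFFERS BUFFER 4096 PARALLELISM 4 UTIL_IMPACT_PRIORITY 50"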


DB2 UDB database split-mirror

A split mirror is an independent, instantaneous copy of a DB2 UDB database on a 2nd set of disk volumes. The 2nd set of disk volumes can then be attached to a different set of servers/LPARs, and an identical image of the DB2 UDB database can be started on them for performing database backups, for the initial setup of a standby database, or for database system cloning purposes. To have an independent and consistent image of the DB2 UDB database, all control and data files have to be present on the 2nd disk set. This includes the entire contents of the database directory and all directories hosting tablespace containers³.

To have a consistent state of the 2nd disk set, it is important to ensure that there are no partial page writes occurring on the source database: DB2 UDB provides the ability to suspend I/O on the source database, allowing the mirror to be split while the database is still online, without requiring any downtime. After a "set write suspend for database <SID>" is issued on all database partitions of the source database, all write activities to the tablespaces are suspended⁴. The FlashCopy operations are then started. To continue with normal operation once the FlashCopy is initiated, a "set write resume for database <SID>" is issued on all database partitions.

To allow the split mirror database to be opened by an application or a DBA for maintenance, the database must be initialized. To do so, the DB2 command "db2inidb" is used. The command can initialize the database for different purposes. Using "db2inidb <SID> as standby" allows the setup of a shadow database by log shipping. This option also allows taking a DB2 backup using the "backup database" command of DB2. The other options for "db2inidb" are "mirror" and "snapshot". The snapshot parameter initializes the database, rolls back transactions that are in flight when the split occurs, and starts a new log chain sequence so that any logs from the primary database cannot be replayed on the cloned database. This option also requires the active log directory to be moved to the target system. The option "mirror" is intended to provide a valid backup image for a fast restore of the database. If the split mirror is intended for this backup purpose, and a possible scenario is to revert the "split" process and copy back the image as a "fast restore" to the 1st disk set, the active log directory must not be part of the 2nd disk set: if the active logs were present on the 2nd disk set, the active log directory of the 1st disk set would be overwritten during the copy-back process, and logs required for recovery to the actual point in time would be lost.

The instantaneous copy from the 1st disk set to the 2nd depends on the storage vendor's implementation: using an IBM DS8000 storage subsystem, the split mirror from the 1st volume set to the 2nd volume set can be done using FlashCopy. A minimal command sequence is sketched below.

³ All tablespaces must be created as DMS tablespaces; TEMPORARY tablespaces may be set up as SMS or DMS.
⁴ If each database partition is suspended independently, the I/O write on the catalog partition (partition 0) must be suspended last. In a multi-partitioned database, an attempt to suspend I/O on any of the non-catalog partitions will require a connection to the catalog node for authorization. If the catalog partition is suspended, then the connection attempt may hang.
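A minimal sketch of the sequence described above for the PoC database EB8 (partition numbers and hostnames as in this PoC; the FlashCopy itself is triggered by DP for FlashCopy or the DS CLI and is only indicated as a comment):

   # 1. suspend I/O: non-catalog partitions first, catalog partition 0 last
   #    (each command must be issued on the host that owns the partition)
   export DB2NODE=6;  db2 terminate; db2 connect to EB8; db2 set write suspend for database
   # ... repeat for partitions 7 to 37 ...
   export DB2NODE=0;  db2 terminate; db2 connect to EB8; db2 set write suspend for database

   # 2. invoke the FlashCopy for all source/target volume pairs (DP for FlashCopy / DS CLI)

   # 3. resume normal operation on all partitions
   export DB2NODE=0;  db2 terminate; db2 connect to EB8; db2 set write resume for database
   # ... repeat for partitions 6 to 37 ...

   # 4. on the backup server(s), initialize the FlashCopy image, for example as a standby
   db2start
   db2_all "db2inidb EB8 as standby"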


IBM Tivoli Storage Manager for Hardware Data Protection for Disk Storage and SAN VC for SAP with DB2 UDB (DP for FlashCopy)

DP for FlashCopy manages the whole workflow required for such a procedure. DP for FlashCopy supports FlashCopy backups for DB2 UDB ESE environments for both logical and physical database partitions. Two main possibilities are available for performing the backup:

FlashCopy backup to disk only

The production database is set to write suspend mode, and the FlashCopy operations are invoked for all source-target volume pairs. Then the database is set to "write resume" and continues with normal operation. All blocks of the source volumes are copied to the target volumes during the background copy in one or multiple storage subsystem(s). Typically, the FlashCopy is scheduled in incremental mode: only the blocks changed since the last FlashCopy backup are copied. The FlashCopy image of the database (target LUNs) is accessed on one or more backup server(s) to check the integrity of volume groups and file systems. After the background copy is complete, the image is available for a potential FlashCopy restore operation.

FlashCopy backup followed by a database backup to TSM on the backup server(s)

The production database is set to write suspend mode, and the FlashCopy operations are invoked for all source-target volume pairs. Then the database is set to "write resume" and continues with normal operation. Depending on the scenario, the FlashCopy is started either with the "NOCOPY" option (transfer of changed blocks to the target volumes only) or with the "COPY" option (start background copy): to be able to use the FlashCopy image for a FlashCopy restore operation, the "COPY" option is mandatory. The FlashCopy is typically scheduled in incremental mode also. The FlashCopy image of the database is accessed on one or more backup server(s). On the backup servers, DB2 UDB is started and a database backup of the FlashCopy image is taken. DB2 UDB uses IBM Tivoli Storage Manager for ERP Data Protection for SAP (DP for SAP) as load library to send the data to the TSM API client during the DB2 backup.

DP for FlashCopy uses the DS Open Application Programming Interface to interact with the FlashCopy copy services of the DS8000 storage server. The DS Open API is based on the CIM model.

Tivoli Data Protection for Enterprise Resource Planning SAP for DB2 UDB (DP for SAP)

The focus of DP for SAP is on the backup/restore of database objects: it is used for the backup and restore of database contents, control files, and offline DB2 log files. A backup or restore operation is started in DB2 via the DB2 Command Line Processor (CLP): the CLP interprets the backup/restore commands for the DB2 database and passes control to a DB2 server process. This process triggers the backup or restore, loads the DP for SAP shared library dynamically and communicates with it through the vendor API.


During a “BACKUP DATABASE” command, the DB2 server process creates a unique timestamp for the backup, reads the DB2 configuration files and loads DP for SAP dynamically as a shared library. DB2 buffer manipulator processes read data from the database containers and write the data into the buffers. DB2 media controller processes then move the data blocks from the buffers to the data mover part of DP for SAP. The DP for SAP shared library passes the data blocks to the TSM server (via the TSM API client). The TSM server then writes the data blocks to the storage media (tape or disk devices). At the end of the backup process, the DB2 server process logs the backup in the Recovery History File. Backup images actually stored on the TSM server can be queried using the Backup Object Manager query commands. During a “RESTORE DATABASE” command, the DB2 server process loads DP for SAP dynamically as a shared library and requests the backup data (based on the backup timestamp information) from it. The DP for SAP shared library checks with the TSM server if the backup image is available. If available, it retrieves the data blocks from TSM and passes them to the DB2 Server Process. The DB2 server process restores the DB2 data to the database containers and logs the restore in the Recovery History File.

The DB2 Log Manager also loads DP for SAP as a shared library: when a log file is ready to be archived, the DB2 Log Manager transfers all the data blocks to DP for SAP. DP for SAP then passes the data to TSM. In case of a roll-forward recovery operation, the DB2 Log Manager first checks if the log files are already available in the log path or the overflow log path. If the log files cannot be found at one of these locations, the DB2 Log Manager checks with DP for SAP whether the log images can be found on the TSM server and requests the data from there: DP for SAP retrieves the data from TSM and passes it to the DB2 Log Manager. The DB2 Log Manager writes the log files to the log directory in the file system. The log files are applied to the database during the actual roll-forward phase of the recovery.

IBM System p5 595

The database and TSM servers were running within LPARs of IBM System p5 595 servers. During the PoC, the servers were running with POWER5 1.9 GHz CPUs and 256 GB main memory. One requirement of the customer was the implementation of AIX 5.2 in a first step: due to that, the p5 servers were limited to dLPAR configurations, with system resources distributed in a static manner. During the life cycle of the project it was of course possible to reconfigure the server environment for better load balance; however, it was not feasible to attempt any dynamic reconfiguration during a test run. In the project, the load balancing was done without the use of simultaneous multi-threading (SMT), as this was not yet implemented at the customer.


IBM TS3500 Tape Library and TS1030 LTO Ultrium 3 Tape Drive

In the PoC, the IBM TS3500 Tape Library was used. The TS3500 Tape Library includes an enhanced power architecture and frame control assembly and is designed to provide an excellent network data backup/archive solution. As the machine was equipped with a total of 16 LTO tape drives, one expansion frame was added to the base frame to tailor the library to the system capacity and performance needs. The IBM TS3500 Tape Library was equipped with 16 IBM TotalStorage TS1030 LTO Ultrium 3 tape drives, which combine IBM tape reliability and performance at open systems prices. These tape drives support the infrastructure with a maximum data transfer rate of up to 80 MB/sec and up to 400 GB native physical capacity per cartridge (800 GB with 2:1 compression).

IBM System Storage DS8300

An IBM System Storage DS8300 is implemented as the storage subsystem. The models of the DS8000 series offer high-performance, high-capacity storage systems that are designed to deliver scalability, resiliency and total value for medium and large enterprises. The DS8300 was equipped with 512 disk drives (146 GB / 15,000 rpm) located in one base frame and two expansion frames. Formatting the arrays as RAID-5 leads to a total capacity of 54,016 GB. About half of the capacity is used for the database of the production system; the remaining amount is intended for the FlashCopy target volumes.

FlashCopy

The FlashCopy function of the IBM DS8000 series allows making a point-in-time, full volume copy of data from a source volume to a target volume. The copy is immediately available for read and write access. When a FlashCopy operation is initiated, a "mapping" between a source volume and a target volume is created (FlashCopy relationship). The FlashCopy relationship between this volume pair exists from the time the FlashCopy operation is initiated until the storage unit has copied all data from the source to the target volume. In case of the "NOCOPY" option, no background copy takes place and only "original" blocks that are changed on the source volume are copied to the target volume in advance of the update. If the "COPY" option is used, all data of the source volume is physically copied to the target volume by a background process. If the FlashCopy relationship is "persistent", the relationship stays valid after the end of the background copy until it is explicitly withdrawn. The activation of "change recording" for the volume pairs allows performing "incremental" FlashCopies. A "reverse FlashCopy relationship" allows changing the direction between source and target volume: the volume that was previously defined as the target becomes the source for the volume that was previously defined as the source (and is now the target). The data that has changed is copied to the volume previously defined as the source. This feature can be used for fast restores.
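In the PoC these operations were driven by DP for FlashCopy, but the same functions can be illustrated with the DS8000 command-line interface (DS CLI). The following sketch assumes an illustrative storage image ID and a single source/target volume pair named after the addressing scheme of the previous chapter (1000 production, B000 backup); exact options depend on the installed DS CLI level:

   # initial FlashCopy with background copy, persistent relationship and change recording
   dscli> mkflash -dev IBM.2107-7512345 -persist -record 1000:B000

   # incremental refresh: copy only the blocks changed since the last FlashCopy
   dscli> resyncflash -dev IBM.2107-7512345 -persist -record 1000:B000

   # fast restore: reverse the direction of the relationship (target becomes source)
   dscli> reverseflash -dev IBM.2107-7512345 1000:B000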


CIM agent / DS 8000 Open API

The Common Information Model (CIM) is a set of standards developed by the Distributed Management Task Force (DMTF). CIM provides an open approach to the design and implementation of storage systems, applications, databases, networks, and devices. It defines common object classes, associations, and methods.

CIM agent server: A Common Information Model (CIM) agent runs on a dedicated server and contains the CIM agent software and its components.

CIM agent client: A Common Information Model (CIM) agent client is an application programming interface (API) that manages storage and initiates requests to a device or a data storage server such as the IBM System Storage DS8000.

CIM agent object classes: Object classes are the building blocks of the Common Information Model (CIM) agent and provide functionality such as Copy Services, storage configuration, or logical unit number (LUN) masking.

IT infrastructure requirements and planning

In this section, different implementation possibilities and their requirements are discussed. The first paragraph outlines the software stack used in the PoC environment; the second discusses the different possibilities for the layout of the database partitions and their impact on the backup infrastructure using DP.

Software stack

All production and backup servers need to have the same software stack installed regarding DB2 UDB ESE, DP for FlashCopy and DP for SAP. As an example, the following components were installed on all production and backup servers in the PoC environment (a way to verify the installed filesets is sketched after the list):

• AIX 5.2/5.3
• IBM Tivoli Storage Manager for Hardware 5.3.1.2
  – Data Protection for Disk Storage and SAN VC for SAP with DB2 UDB (DP for FlashCopy) 5.3.1.2
• IBM Tivoli Storage Agent 5.3.2.0
• Base Providers for AIX OS (sysmgt.pegasus.osbaseproviders) 1.2.5.0
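One simple way to check that the same levels are present on every production and backup LPAR is to query the installed AIX filesets (a sketch; the output format depends on the AIX level):

   # check the CIM base providers fileset named in the list above
   lslpp -L sysmgt.pegasus.osbaseproviders

   # list all installed Tivoli filesets with their levels
   lslpp -L | grep -i tivoli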


Database Partition Layout

Some boundary conditions have to be met to enable DP for FlashCopy to take a backup of a FlashCopy image of the production database.

• The number n of all logical/ physical database partitions on the production database servers/ LPARs is equal to the number n of database partitions distributed across the backup servers/ LPARs.

• The DB2 instance for the backup servers (hosting the FlashCopy of the production database later on) has to have the same partition numbers for all the database partitions as the DB2 instance of the production database. (The partition numbers are defined in the configuration file).

• One backup server/LPAR may handle the FlashCopy backup for one or more production servers: the number of production database servers/LPARs has to be greater than or equal to the number of backup servers/LPARs.

• The backup server inherits the structure of all database partitions hosted on one production server: the database backup for all database partitions of one production server has to be handled by one backup server/LPAR.

The partition layout for the DB2 UDB ESE database is defined in the configuration file db2nodes.cfg located in the instance directory tree. For each database partition, one line contains the partition number, the hostname of the server/LPAR where the database partition runs, and the port number used for inter-partition communication. Physical partitions (one-to-one mapping between database partitions and OS images) always have port number 0. In case of logical partitions, database partitions located in one OS image have consecutive port numbers. The production servers and the backup servers will have their own db2nodes.cfg file reflecting the database distribution for the production database and the FlashCopy image.
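As a sketch, the db2nodes.cfg of the production instance for the example layout discussed below (33 partitions across five LPARs) could look like this; the hostnames are those used in the examples, and the intermediate lines are omitted:

   0  sys1db0p  0
   1  sys1db1p  0
   2  sys1db1p  1
   ...
   8  sys1db1p  7
   9  sys1db2p  0
   ...
   32 sys1db4p  7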

The following pictures illustrate some examples of a possible DP for FlashCopy backup setup. In the scenario, the production database is distributed across 5 production server LPARs and is partitioned into 33 database partitions. Partition 0 (acting as catalog partition and also as coordinating partition) is installed as a physical database partition on one LPAR. Each of the other four LPARs hosts eight logical database partitions. In the first example, one single backup server LPAR hosts all 33 database partitions for the FlashCopy image.

[Figure: Example 1. Production servers: LPAR0 (sys1db0p) with DB2 partition 0 as a physical database partition, and LPAR1 to LPAR4 (sys1db1p … sys1db4p) with eight logical database partitions each (1 … 8, 9 … 16, 17 … 24, 25 … 32). Backup server: a single LPAR5 (deats005) hosting all 33 database partitions 0 … 32.]
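For this first example, the db2nodes.cfg of the backup instance would map the same partition numbers to the single backup LPAR with consecutive logical ports (a sketch, intermediate lines omitted):

   0  deats005  0
   1  deats005  1
   2  deats005  2
   ...
   32 deats005  32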


In the second example, the layout of the FlashCopy image on the backup servers is symmetric to the production database: the database on the backup servers follows the same layout. Partition 0 (acting as catalog partition and also as coordinating partition) is installed as a physical database partition on one LPAR, and each of the other four LPARs hosts eight logical database partitions.

The third example is in-between the two scenarios mentioned above: The FlashCopy image of the database on the backup servers is distributed across two backup server LPARs. The first LPAR hosts 17 database partitions including partition 0. The second backup server LPAR hosts the remaining 16 database partitions.

[Figures: Example 2. The backup servers mirror the production layout: LPAR5 (sys1db5p) hosts DB2 partition 0, and LPAR6 to LPAR9 (sys1db6p … sys1db9p) host eight logical partitions each (1 … 8, 9 … 16, 17 … 24, 25 … 32). Example 3. Two backup servers: LPAR5 (sys1db5p) hosts DB2 partitions 0 … 16 and LPAR6 (sys1db6p) hosts DB2 partitions 17 … 32.]


An optimal number of database partitions and their distribution across production servers and backup servers has to be defined individually in each case and will depend on the requirements for

• Total size of the database
• Performance requirements for database operation
• Backup/ Restore strategy
• Performance requirements for backup/ restore
• Available hardware

Log file management in a DB2 UDB DPF environment

In previous DB2 versions, a user exit was called for archiving/retrieving inactive database logs either to an archive filesystem or directly to a TSM server. In SAP implementations, the inactive log files typically were moved to an archive filesystem first. Another process ("brarchive") was then scheduled regularly to transfer them to the TSM server; an "Admin database" (created within the same DB2 instance) kept track of the history for an eventually required retrieve action. Since DB2 UDB V8, the design has changed and the DB2 log manager (integrated within DB2 UDB) is now available for managing the inactive log files. In the PoC environment, the DB2 log manager is activated to manage the database log files for all database partitions independently.

DB2 log manager The DB2 log manager is controlled by different parameters: The database configuration parameter LOGARCHMETH1 specifies the primary destination for archiving the inactive logs; possible destinations are DISK (archiving to a filesystem), TSM (archiving using the standard TSM API), or VENDOR (specifying an additional load library). Options may be specified using LOGARCHOPTS1. If archiving a log file was unsuccessful NUMARCHRETRY times (with a delay of ARCHRETRYDELAY seconds between two attempts), the log file is archived to an alternate (disk) destination, FAILARCHPATH. It is also possible to archive a second copy of each log file by specifying an optional redundant second method, LOGARCHMETH2. During a roll-forward recovery, log files that are required for recovery but are not present are retrieved to the OVERFLOWLOGPATH by the DB2 log manager, independently for each database partition.
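As an illustration, the parameters above can be set with database configuration updates along the following lines (a minimal sketch only; the database name EB8 is taken from the PoC examples, while the paths and retry values are placeholders and not the actual PoC settings). In a DPF environment the commands have to be applied to every database partition, e.g. by wrapping them with db2_all:

   db2 "UPDATE DB CFG FOR EB8 USING LOGARCHMETH1 TSM"
   db2 "UPDATE DB CFG FOR EB8 USING FAILARCHPATH /db2/EB8/log_archive"
   db2 "UPDATE DB CFG FOR EB8 USING NUMARCHRETRY 5 ARCHRETRYDELAY 120"
   db2 "UPDATE DB CFG FOR EB8 USING OVERFLOWLOGPATH /db2/EB8/log_retrieve"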


DB2 log file states A DB2 log file passes through one of the following four states during its life cycle:

• online active DB2 UDB currently logs transactions in the log file.

• online retained The log file is no longer used for logging transactions, but it still contains log records for data pages that were changed in the buffer pool and have not yet been written to disk. DB2 UDB will need these log entries in case of a crash recovery or rollback. The DB2 log manager nevertheless copies the filled online log file to the archive location.

• offline retained The log file is no longer used by DB2 and does not contain log records for any unwritten data page; it is not needed for a crash recovery or rollback. The DB2 log manager copies the log to the archive location. DB2 UDB will reclaim the space in the filesystem by deleting or overwriting the offline retained log files in the database log directory.

• archived A filled or closed log file that was successfully archived (to TSM).

During a roll-forward recovery following a restore operation, all archived log files required for a roll-forward or rollback activity have to be retrieved.

Log Retrieve and collocation This section discusses some aspects of archiving/retrieving log files of multiple database partitions to/from a storage manager (TSM) with its own storage pool hierarchy. In the PoC environment, DP for SAP is used to send/retrieve the archived logs to/from TSM. With the database distributed across several database partitions, each database partition has its own DB2 log manager. However, the log files of different database partitions may be related to each other. The following scenario has to be avoided while running a roll-forward operation on multiple database partitions in parallel:

[Figure: DB2 log manager – the DB2 engine writes active logs to log_dir (NEWLOGPATH) and mirrorlog_dir (MIRRORLOGPATH); the DB2 log manager archives inactive logs via LOGARCHMETH1/LOGARCHOPTS1 and LOGARCHMETH2/LOGARCHOPTS2 to a DISK, TSM or VENDOR destination, falling back to FAILARCHPATH after NUMARCHRETRY attempts with ARCHRETRYDELAY, and retrieves logs to the OVERFLOWLOGPATH.]


• The logs were archived independently to the TSM server for all database partitions, possibly to a disk storage pool first. Finally, the logs were migrated to tape cartridges in a tape storage pool.

• One tape cartridge contains log files of multiple, different database partitions.

During the roll-forward operation running in parallel on the database partitions, several log managers request log files required for the recovery from the TSM server independently. It may happen that two or more database partitions request log files located on one and the same tape cartridge:

[Figure: log retrieve contention – the production LPARs (LPAR0 sys1db0p: partition 0, LPAR1 sys1db1p: partitions 6 … 14, LPAR2 sys1db2p: 15 … 21, LPAR3 sys1db3p: 22 … 29, LPAR4 sys1db1p: 30 … 37) request logs from the TSM server on LPAR 6 (deats005). Tape cartridge PS0098L3 holds archived logs of NODE0000 as well as of NODE0007, NODE0008, NODE0010 – NODE0012 and NODE0014 – NODE0016. While logs S0095554 – S0095556 for partition 0 are being retrieved from tape PS0098L3, the retrieves of S0002866.LOG (NODE0007), S0002381.LOG (NODE0008) and S0003491.LOG (NODE0014) have to wait for the same tape.]


The log manager and roll-forward processes of some database partitions will have to wait until the tape cartridge is released by the first session. The roll-forward activity of one database partition may thus serialize the roll-forward activities of further database partitions, and timeouts and termination of roll-forward processes may occur on the database partitions that have to wait for the resources. To avoid this risk, the logs of the different database partitions have to be retrieved either from

• a common DISK storage pool: a disk storage pool allows logs for different database partitions to be retrieved in parallel,

or

• different tape cartridges: the logs of different partitions are stored on different tape cartridges.

The second option can be achieved by using a unique TSM node name for each database partition. In addition, "collocation" has to be activated for the tape storage pool holding the archived logs. These settings ensure that logs of different database partitions are stored on different tape cartridges. Of course, sufficient tape drives have to be available so that each DB2 log manager can occupy its own tape drive for the log retrieve during the roll-forward operation.
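On the TSM server side, such a setup could look roughly as follows (a sketch only; the storage pool name LOGPOOL and the node names EB8nn are taken from the figure below, while the password and policy domain are placeholders). Each database partition then has to archive under its own node name in its DP for SAP profile:

   dsmadmc> update stgpool LOGPOOL collocate=node
   dsmadmc> register node EB800 <password> domain=<policy_domain>
   dsmadmc> register node EB806 <password> domain=<policy_domain>
   dsmadmc> register node EB807 <password> domain=<policy_domain>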

[Figure: archived logs collocated per DB partition – each database partition archives its logs under its own TSM node name (partition 0: EB800, partition 6: EB806, partition 7: EB807, partition 8: EB808, partition 14: EB814, …) into the tape storage pool LOGPOOL on the TSM server (LPAR 6, deats005).]


Storage Layout One important design aspect of the implementation is the overall storage layout. This section presents different aspects of the volume group and filesystem layout, both for "locally" mounted and for NFS-mounted filesystems. The chapter focuses on the requirements for the implementation of the FlashCopy solution with DP for FlashCopy. From the point of view of database performance or backup/restore performance, further considerations have to be taken into account, e.g. the distribution of the data across the disk arrays in the storage box, the number of Fibre Channel adapters etc. These items are not in the scope of this section, but will be discussed in another whitepaper of this series.

Volume Group and Filesystem Layout

An SAP system with an underlying DB2 database follows some general conventions: Assuming that the SAP system has the SAP system name <SID> (and is not installed as MCOD system), the DB2 UDB database is named <SID> also. The database instance is owned by user db2<sid>. User db2<sid> has the same UID on all the servers/ LPARs. Home directory of user db2<sid> is /db2/db2<sid>, which is shared across all servers/ LPARs via NFS and contains the database instance directory also. The environment variable $INSTHOME points to the directory /db2/db2<sid>. The directory tree /db2/<SID> and below contains

• the database directory and the local database directory, located below /db2/<SID>/db2<sid>/NODEnnnn, where nnnn reflects the database partition number

• the database tablespaces (data tablespaces, temporary tablespaces)

• the active log directory, located in /db2/<SID>/log_dir/NODEnnnn, where nnnn reflects the database partition number

• directories for database logs etc.

During planning for volume group and file system layout for the individual database partitions, several preconditions have to be met to be able to perform the backup of the FlashCopy image of the database:

• the content of the database directory, all tablespace containers and any existing file systems for SMS TEMP tablespaces have to be part of the FlashCopy image

• the file system for the online logs must not be part of the FlashCopy image

• the file systems for the database instance and for the archived logs are not part of the FlashCopy image


Different volume groups have to be created for the database. From the point of view of a single database partition, at least three different kinds of volume groups have to be created:

• Category 1: Database directory and database data One (or more) volume group(s) containing the LUNs dedicated to the database directory, the database data and the TEMP tablespaces. For each LUN that is a member of one of these volume groups (see footnote 5), a FlashCopy target LUN has to be defined in the storage system as well. The FlashCopy target LUN is assigned to the corresponding backup server.

• Category 2: Active logs One (or more) volume groups containing the file systems for active database logs

• Category 3: Further DB-related file systems (shared) One volume group containing the remaining database file systems (and possibly SAP file systems); the corresponding JFS/JFS2 file systems are available at partition 0 only and are NFS-exported to all the servers/LPARs hosting database partitions. Within a DB2 UDB ESE DPF installation in an SAP environment, this volume group includes the following file systems:

o /db2/db2<sid> Home directory of user db2<sid> and instance directory

o /db2/<SID>/db2dump Central file system for the log (db2diag.log) and trace files of the database

o /db2/<SID>/log_archive, /db2/<SID>/log_retrieve Central file systems for archived and retrieved logs (in case of indirect archiving only)

o /db2/<SID>/dbs Shared file system for DP for FlashCopy

Names for the filesystems, their corresponding logical volumes, and also for the assigned volume groups should be unique within the whole database environment: this avoids potential renaming problems when the volume groups are imported on the backup servers during a FlashCopy backup/restore.

5 If all logical volumes of the volume group are mirrored using LVM mirroring with the ‘super strict’ option (One LUN hosting one copy set of a logical volume is not allowed to host the 2nd copy set of the same or another logical volume. Hosting the same copy set of another logical volume is possible), then exactly one half of the LUNs (“one complete copy set”) is sufficient.
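Purely as an illustration of such a partition-specific naming scheme, the volume group and one filesystem of the data category for database partition 6 could be created as follows (an AIX sketch with illustrative device names, logical volume names and sizes, not the actual PoC layout):

   mkvg -S -y DB2P06_DATA_VG vpath10 vpath11
   mklv -y p06sapdata1lv -t jfs2 DB2P06_DATA_VG 512
   crfs -v jfs2 -d p06sapdata1lv -m /db2/EB8/sapdata1/NODE0006 -A no
   mount /db2/EB8/sapdata1/NODE0006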


The next picture illustrates the different kinds of volume groups and the file systems located within them.

If multiple logical database partitions are hosted within one OS image, filesystems of different database partitions that belong to the same category may share one and the same volume group. The following considerations apply to sharing the same volume groups for several database partitions:

Advantages of sharing the volume groups for several database partitions

• Less total number of volume groups to handle

• Free space within one volume group can be utilized by multiple database partitions: if the data distribution is not absolutely equal across all database partitions, the disk space will be used more efficiently.

Disadvantages of sharing the volume groups for several database partitions

• Data of several database partitions will share the same LUNs (no pure “shared-nothing” architecture)

• Relocation of a database partition to another OS image has to be done via backup/restore instead of moving the volume group to the other OS image.

[Figure: volume group layout per database partition n = {0,1,2,…,32} – DB2Pn_DATA_VG (flashcopied) contains the local database directory /db2/EB8/db2eb8/NODE00nn and all filesystems containing tablespace containers (/db2/EB8/sapdata1…4/NODE00nn, /db2/EB8/sapdatat1…4/NODE00nn); DB2Pn_LOG_VG (no FlashCopy) contains /db2/EB8/log_dir/NODE00nn; vg101 (partition 0 only, no FlashCopy) contains /db2/db2eb8, /db2/EB8/log_archive, /db2/EB8/db2dump, /sapmnt/EB8 and /usr/sap/trans.]


NFS Layout

If the database partitions of the production database are distributed across multiple LPARs/ servers, some directories containing “global” content are NFS exported from the LPAR/ server hosting database partition 0 and are mounted on all further production servers.

Typically, this includes

• the instance home directory (/db2/db2<sid>) • the global directory for diagnostic files (/db2/<SID>/db2dump) • directories for logfile management

(/db2/<SID>/log_archive, db2/<SID>/log_retrieve)

In addition, the directory /db2/<SID>/dbs is NFS exported from the production server hosting database partition 0 and is mounted on all further production servers and all backup servers. This directory contains configuration files, customization files, temporary files and log files for DP for FlashCopy.

If the database partitions of the FlashCopy image are distributed across multiple LPARs/servers, some "global" directories are likewise NFS exported from the backup server LPAR hosting database partition 0 and mounted on all further backup servers. Typically, this includes (see the export/mount sketch below):

• the instance home directory (/db2/db2<sid>)
• the global directory for diagnostic files (/db2/<SID>/db2dump)
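For illustration, the NFS setup on the production LPAR hosting database partition 0 could look roughly as follows (a sketch only; host names are taken from the layout figures, the export options are simplified assumptions):

   # /etc/exports on sys1db0p (production server hosting partition 0)
   /db2/db2eb8        -root=sys1db1p:sys1db2p:sys1db3p:sys1db4p
   /db2/EB8/db2dump   -root=sys1db1p:sys1db2p:sys1db3p:sys1db4p
   /db2/EB8/dbs       -root=sys1db1p:sys1db2p:sys1db3p:sys1db4p:sys1db5p:sys1db6p

   exportfs -a

   # on the other production servers
   mount sys1db0p:/db2/db2eb8 /db2/db2eb8
   # on all further production servers and on all backup servers
   mount sys1db0p:/db2/EB8/dbs /db2/EB8/dbs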


The picture below summarizes the global storage layout:

[Figure: global storage layout – on the production servers, every database partition n has a data volume group vgn0 (flashcopied: DB directory /db2/<SID>/db2<sid>/NODE000n, base (ABAP+Java) and BI tablespaces /db2/<SID>/sapdata1…N/NODE000n, temporary tablespaces /db2/<SID>/saptemp1/NODE000n) and a log volume group vgn1 (not flashcopied: active log files /db2/<SID>/log_dir/NODE000n). The LPAR hosting partition 0 additionally owns vg02 (or rootvg) with the instance directory /db2/db2<sid>, the diagnostic directory /db2/<SID>/db2dump, the log file management directories /db2/<SID>/log_archive and /db2/<SID>/log_retrieve, and the shared TDP directory /db2/<SID>/dbs; these are NFS-mounted on all further production servers. On the backup servers, the flashcopied data volume groups are accessed, while the instance directory, the diagnostic directory and /db2/<SID>/dbs are again provided via NFS.]

Hints and Tips for the setup It is not the intention in this chapter to provide full and detailed information about the setup: Please refer to the product information for the complete description of the setup of DP for FlashCopy and DP for SAP. This chapter will just provide some further comments and hints that may be helpful.

Users & Groups The groups dba, db2asgrp, db<sid>adm and db<sid>ctl have to be available on all production and backup servers using the same GID. The user db2<sid> has to be available on all production and backup servers using the same UID.
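As an illustration only (GIDs, UIDs and the SID EB8 are placeholders, not the PoC values), the required groups and the instance owner can be created identically on every LPAR with standard AIX commands:

   mkgroup id=301 dbeb8adm
   mkgroup id=302 dbeb8ctl
   mkgroup id=303 db2asgrp
   mkgroup id=304 dba
   mkuser id=310 pgrp=dbeb8adm groups=dba,db2asgrp,dbeb8ctl home=/db2/db2eb8 db2eb8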

TCP Ports Several service ports for DB2 UDB, DP for FlashCopy and DP for SAP have to be defined. If multiple logical database partitions are operated within one OS image, sufficient ports have to be available for them:

   # DB2 port range
   DB2_db2<sid>     5934/tcp
   DB2_db2<sid>_1   5935/tcp
   ...
   DB2_db2<sid>_n   59xx/tcp

A similar port range has to be available for the socket servers on the production servers:


   # idscntl port range
   idscntl<SID>      57410/tcp
   idscntl<SID>_0    57411/tcp
   ...
   idscntl<SID>_n    574yy/tcp

The prole agent, which is involved in sending the data within DP for SAP, requires an additional port:

   # Port for prole agent
   tdpr3db264   57324/tcp

ssh configuration In the PoC setup, OpenSSH is configured between all production and backup servers for the instance-owning user db2<sid>. User db2<sid> is allowed to log in to each other LPAR without being prompted for any additional verification. The environment variable DB2RSHCMD is set to use secure remote execution as well, so DB2's "db2_all" and "rah" remote commands will use "ssh" instead of "rsh". See the document "Configure DB2 Universal Database for UNIX to use OpenSSH" for details about the setup: http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0506finnie/index.html
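A minimal sketch of this configuration (assuming the NFS-shared home directory described above, so the key pair only has to be generated once; paths and options are illustrative):

   su - db2eb8
   ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
   cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
   db2set DB2RSHCMD=/usr/bin/ssh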

socket servers The socket servers handle communication and synchronization for the TDP processes running on the backup server. On each production server, the following processes have to be active:

• one "base" socket server process (without the "-s" option specified)

• for each database partition located on the production server: one additional socket server process (with the "-s" option specified; the value corresponds to the port entry for the partition in db2nodes.cfg)

The socket servers are started (and will respawn in case of termination) automatically out of the inittab.

The inittab will include entries like

sock<SID>:2:respawn:su - db2<sid> -c 'cd /db2/<SID>/dbs ; /db2/<SID>/dbs/tdphdwdb2 -f initsocket -p /db2/<SID>/dbs/$(hostname)/init<SID>.fcs'>/dev/null 2>&1
sock<SID>_0:2:respawn:su - db2<sid> -c 'cd /db2/<SID>/dbs ; /db2/<SID>/dbs/tdphdwdb2 -f initsocket -p /db2/<SID>/dbs/$(hostname)/init<SID>.fcs -s 0'>/dev/null 2>&1
...
sock<SID>_n:2:respawn:su - db2<sid> -c 'cd /db2/<SID>/dbs ; /db2/<SID>/dbs/tdphdwdb2 -f initsocket -p /db2/<SID>/dbs/$(hostname)/init<SID>.fcs -s n'>/dev/null 2>&1


The entries in /etc/inittab are generated during the execution of the setup procedure; however, it should be checked that all required entries are available in the inittab and that all processes are active.

CIM agent / DS Open API

Install the CIM agent on one LPAR/ external server as outlined in the documentation “DS Open Application Programming Reference”.

Perform the CIM agent configuration steps:

• Change CIM agent to communicate in “unsecured” mode Update the configuration file cimom.properties with the properties:

ServerCommunication=HTTP Port=5988 DigestAuthentication=False

• Add Storage Units o Add all Storage Units to the CIM configuration

setdevice >>> addess <IP address of Storage Unit> <username> <password>

o Add all Copy Services Servers to the CIM configuration: setdevice >>> addessserver <IP address of Copy Services Server> <username> <password> [<alternate address for Copy Services Server>]. Restart the CIM server afterwards.

• Add CIM User setuser >>> adduser <cimuser> <username> <cimpass>

This user has then to be entered to the TDP configuration profile, and its password has then to be stored (using tdphdwdb2 -p <profile> -f configure) in the encrypted configuration file.

Additional DB2 configuration parameters on the backup server The following paragraphs describe some additional parameter settings for the database on the backup server(s). These settings have to be applied manually after the DB2 instance is created on the backup server(s), but before the first FlashCopy operation is performed; the parameters then become active during the "db2start" of the split-mirror image after the FlashCopy operation.

Disable Health monitor With the health monitor enabled for the split-mirror image on the backup server(s), the backup tasks were stopped by it during one FlashCopy execution test, and so the


backup failed. It is recommended to disable the health monitor for the split-mirror image by setting the database manager parameter HEALTH_MON to OFF:

   db2 "UPDATE DATABASE MANAGER CONFIGURATION USING HEALTH_MON OFF"

Overwrite bufferpool settings The flashcopied database on the backup servers inherits the database configuration parameters from the "original" database, whose settings are tuned for productive database operation on the production servers. Using the same buffer pool sizes for the split-mirror image would require a huge amount of memory on the backup server; even worse, a backup run would not benefit from this allocated memory. By using the registry variable DB2_OVERRIDE_BPF, the "standard" settings contained in the split-mirror image are overridden and minimized buffer pool sizes are used:

   db2set DB2_OVERRIDE_BPF=1000


Infrastructure tests The Proof of Concept basically consists of two parts that are tightly integrated but have a different focus. One part addresses the performance and scalability of an SAP NetWeaver® Business Intelligence system, whereas this part describes the infrastructure tests. This section concentrates on finding the optimal solution for backing up large databases, on understanding the required infrastructure and on preparing for the optimal recovery process.

Influencing factors for backup throughput and runtime This section discusses some common factors that affect the throughput during backup and restore to/from tape. The focus is on database size, tape throughput and FC adapter throughput on the AIX LPAR. The list is far from complete: e.g. the storage subsystem and its SAN connection to the server, the LUN layout and its distribution across the disk arrays, the number of fibre channel adapters on the storage box etc. may affect backup/restore runtime as well.

DB2 database size and distribution

[Chart: distribution of data across all 33 database partitions (total amount of data backed up: 15770 GByte) – number of partitions per size class of the backup image per partition [GB].]

The total database size is one key factor for the backup duration. If the data is distributed non-uniformly across the database partitions, one big partition may dominate the overall runtime (especially if all backups are run in parallel). In the PoC tests, the data transferred from DB2 to TSM during the backup was about 15,5 TB in total. Except for partition 0, about 480 GB of data were transferred on average during the backup of each individual database partition. However, the distribution of data among the different database partitions was not perfectly equal: the standard deviation of the distribution was about 30 GB, or 6,25%.
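To check how evenly a large table is spread across the partitions, a simple query of the following form can be used (a sketch only; the table and column names are placeholders, and any column of the table may be passed to DBPARTITIONNUM):

   db2 "SELECT DBPARTITIONNUM(<key column>) AS PARTNUM, COUNT(*) AS NUM_ROWS
        FROM <fact table>
        GROUP BY DBPARTITIONNUM(<key column>)
        ORDER BY 1"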


Throughput LTO2 vs. LTO3 To compare the throughput capability of LTO2 and LTO3, a unit test was performed with a backup of two single partitions in parallel, using LTO2 drives in the first run and LTO3 drives in the second run.

The measured throughput of the LTO3 drives in the PoC environment is about 141% of that of the LTO2 drives. Compared to the nominal "uncompressed" data rate of 80 MB/s for LTO3, a typical ("compressed") throughput of 140 MB/s is observed in the PoC environment.

Throughput of a single FC adapter A unit test was performed using multiple database backups in parallel, but with the tapes connected via one FC adapter only. Using only a single adapter led to saturation at the adapter level.

Combining the results of tape throughput and FC adapter throughput, it is best to dedicate one FC adapter (2 Gbit) per connected tape drive. If, for example, two tape drives are connected via one adapter only, the throughput will be limited to about 62% (LTO-3) or 88% (LTO-2) of the maximum value.

Online Tape Backup Besides the option to back up a database from a FlashCopy image, it is the natural way to back up a database directly from the production system. DB2 delivers two different flavors of the backup command – running offline or running online. While offline backups are feasible for small databases or environments with less demanding disaster recovery requirements, the default for large production databases is to perform the backup online. DB2 fully supports the backup of databases while user workload is running. However, as the backup creates read access to the data in the database, it generates additional workload that needs to be minimized. This is especially true if the database is running 24x7 with equally distributed workload. To allow this, DB2 delivers various options to optimize and adapt the backup to the company's environment. The database can be backed up using the incremental backup functionality of DB2; using compression for backups might also help to adapt


the backup processes; finally, there is the concept of throttling the backup so that it limits the impact on production. The tests limit the impact of the DB2 online backup by using the configuration parameters UTIL_IMPACT_LIM and UTIL_IMPACT_PRIORITY, either with the DB2 backup command or manually while the backup is running. Another inhibitor for overall performance can be internal B locks while running the backup in online mode. Internal B locks are related to the lifeLSN (Log Sequence Number) of a database object. Each database object has a lifeLSN, which may be changed by operations such as create index, import replace, load and some others. DB2 UDB does not allow the lifeLSNs to change during an online backup: the backup acquires an internal B lock, and thus other operations that change the lifeLSN will wait for the online backup to release the internal B lock. As this is an internal lock, the lock timeout database configuration parameter does not control it and cannot be used to affect the persistence of this lock. To minimize the impact of this, the DB2 registry variable DB2_OBJECT_TABLE_ENTRIES has to be set at tablespace creation time. This reserves contiguous storage for object metadata during table space creation, which reduces the chance that an online backup will block operations that update entries in the metadata (for example, CREATE INDEX, IMPORT REPLACE). The online workload consists of online queries (25 navigation steps per minute with a target average response time below 20 seconds), cube load with a target of 25 million records per hour and aggregate rollup with a target of 5 million records per hour.

[Chart: Throughput – rollup and upload throughput (records/h, left scale) for the jobs Rollup 1 – 2 and Upload 1 – 8, and query response time (seconds, right scale), over the period 20:02 to 00:44.]

Workload Graph The graph shows the combination of the workload during a period of time. The query response time is measured at given checkpoints. Due to the duration of the upload (depicted as 8 jobs of equal length in the high-load phase) and of the aggregate build (2 cubes, shown here as the basis) as building blocks, the graphical representation differs: it shows the total number of records processed during the whole period of measurement. The graph shows an average of 27.6 million records/hour for the cube load and 6.9 million records/hour for the aggregate rollup (left scale), and the blue curve shows the response time of the queries (right scale). On average, the response time for queries is 11.55 seconds.


The backup with the workload was done on DB2 partition 0 first with 4 sessions, and then one DB2 partition sequentially per LPAR with 2 sessions. With the 4 LPARs for DB2 (partitions 6 to 37), 8 drives were used in parallel. The backup was done with UTIL_IMPACT_LIM set to 50% in DB2 to limit the impact of the backup. This graph shows the result of the same workload during the tape backup. The InfoCube upload is not impacted by the backup and still stays at 27,8 million records per hour: this activity is mostly CPU-intensive on the application servers and does not generate much workload on the database. The aggregate build is a little more impacted by the backup and decreases by 14,6 percent to 6,0 million records per hour. The most impacted activity is the queries: the query activity heavily loads the database and generates read I/O on the disk storage subsystem. With a UTIL_IMPACT_LIM of 50, the performance of the queries degrades by 35% to an average response time of 17,7 seconds. Decreasing UTIL_IMPACT_LIM to a lower value would reduce this impact some more, but as the results with the backup running in parallel are within the target of 20 seconds, no additional tests were performed. The CPU usage comparison in the next graph shows that the "user time" is somewhat higher during the run with backup compared to the run without backup in parallel. The system time is affected more by the backup and doubles in relation to the workload-only test run: the average system time increases from 16% to 29%. This is due to the high number of I/Os done for the backup and also the usage of the "loop back" communication between DB2 and the TSM Storage Agent.

[Chart: CPU Total sys3db1p 8/8/2006 – User%, Sys% and Wait% from 21:35 to 23:59.]

CPU usage during the calibration run without backup


[Chart: CPU Total sys3db1p 8/11/2006 – User%, Sys% and Wait% from 16:44 to 18:36.]

CPU usage during test run with online backup.

A different view on the behavior of the UTIL_IMPACT_LIM parameter is to monitor the tape throughput compared to a full-speed backup run without any workload on the system. The test shows that the impact of the tape backup on the production activity is mainly on the users' activities; the impact is moderate or very small on batch jobs like the InfoCube upload, because they are CPU-intensive on the application servers. Besides limiting the impact on production, the throttled backup increases the duration of the backup run from 5,5 hours to about 7,1 hours when using a UTIL_IMPACT_LIM of 50%.
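For illustration, the throttled online backup described above corresponds to commands along the following lines (a sketch only; the database name EB8 is taken from the PoC, while session counts, the limit value and the registry value are illustrative, and in the DPF setup one backup per database partition is scheduled as described in this section):

   db2 "UPDATE DBM CFG USING UTIL_IMPACT_LIM 50"
   db2set DB2_OBJECT_TABLE_ENTRIES=<value>     (must be set before the tablespaces are created)
   db2 "BACKUP DATABASE EB8 ONLINE USE TSM OPEN 2 SESSIONS UTIL_IMPACT_PRIORITY 50"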

FlashCopy backup For each DB server LPAR hosting one or more database partitions, one DP for FlashCopy process (tdphdwdb2) has to be started, and all these processes have to be started in parallel for a FlashCopy backup run. As the production database is distributed across five LPARs, five tdphdwdb2 processes are started in parallel. After the processes are started, one individual FlashCopy backup run using DP for FlashCopy can be structured into several phases:

Phase 1 (some ten minutes): Check the DB2 database state, check the storage layout, match FlashCopy source-target pairs and check the FlashCopy source-target volume relationships
Phase 2 (some seconds): Invoke the FlashCopy operations via the CIM agent
Phase 3 (some ten minutes): Discover the target volumes and mount the filesystems on the backup server(s)
Phase 4 (some hours): If the "backup" (to tape) option is chosen: start DB2 and perform the backup to tape
Phase 5 (some minutes): If the "unmount" option is chosen: unmount the filesystems, remove the VGs and disk devices

The duration of each individual step influences the duration of the overall process.


Phase 1: Pre Checks In the first phase, DP for FlashCopy checks the configuration on the DB servers, the state of the DB2 database and the state of all FlashCopy source-target volume relationships. The (initial) state of the FlashCopy target volumes influences the duration of phase 1. In most cases, the FlashCopy backup is run in incremental mode. When it is run for the first time, without any existing FlashCopy source-target relationships, there is no need for "tdphdwdb2" to check the status of all pairs in the check phase. Repeating the FlashCopy a second time (with all incremental FlashCopy relationships available) will check the state of all source-target pairs and will take some minutes longer: checking all 280 source-target relationships for the LUNs in the PoC setup takes about 30 minutes. In the backup run discussed here, the relationships had been withdrawn before, no incremental FlashCopy source-target pairs were available and so no such checks had to be done: after about 10 minutes, all checks are finished and DP for FlashCopy enters the second phase.

Phase 2: Invoke FlashCopy In the second phase, DP for FlashCopy invokes the FlashCopy for all relevant DB volumes. All DB partitions are set to "write suspend" first. The FlashCopy operations are then started for all volumes: TDP sends the FlashCopy commands to the CIM agent, which forwards them to the storage server. After the FlashCopy is invoked for all LUNs, the DB partitions are set to "write resume" and the database continues with normal operations. The total duration of this "offline" step is some 10 seconds and depends on the actual workload on the database: the db2 "write suspend" commands have to compete with the current workload on the database server. With a high workload on the database server, it may take more time (up to a minute) to get the commands executed on all partitions. Compared to the duration of the backup to tape this is a negligible amount of time; however, it impacts the "offline" time during which the database is in "write suspend" mode. The production database is then fully operational again and DP for FlashCopy enters the third phase.
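Conceptually, the suspend/resume bracket around the FlashCopy corresponds to the following DB2 commands, issued on every database partition (a simplified sketch of what DP for FlashCopy drives internally; in practice the tool handles ordering, synchronization and error handling):

   db2_all "db2 CONNECT TO EB8; db2 SET WRITE SUSPEND FOR DATABASE"
   (FlashCopy of all source LUNs is invoked via the CIM agent)
   db2_all "db2 CONNECT TO EB8; db2 SET WRITE RESUME FOR DATABASE"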

Phase 3: Access FlashCopied storage on backup server(s) In the third phase, the flashcopied storage image is accessed on the backup server(s). To avoid locking issues on devices, the five tdphdwdb2 processes synchronize themselves and run through this phase sequentially: each tdphdwdb2 process runs AIX configuration manager processes ("cfgmgr"). When all FlashCopy LUNs are available for the relevant storage set, the volume groups are redefined and varied online. Afterwards, all the filesystems are mounted. The duration of this step strongly depends on the number of storage elements (LUNs/disk devices, number of volume groups and number of filesystems). Within the PoC setup, 280 LUNs spread across 66 volume groups have to be discovered. Depending on the number of paths in the SDD configuration, one vpath (LUN) maps to several hdisks. Some tests were done with different numbers of paths and thus a different number of disk devices to discover:


The time required for the configuration of all LUNs on the backup server scales nearly linearly with the number of SDD paths, so the number of SDD paths should be kept to a minimum. On the other hand, having too few paths to the LUNs may limit the throughput during the DB2 backup (by restricting the read rate from the disks): in fact, having only two paths appeared to be a limit during the tests. To optimize the overall process, a compromise between the time required for accessing the disks and the throughput during the backup has to be found. During the test run discussed here in detail, two paths are defined to the LUNs: about 55 minutes after invoking the tdphdwdb2 processes, the backup server has successfully acquired all storage elements. All FC disk devices are discovered, the volume groups are varied on, and all filesystems are mounted. If the FlashCopy is performed in "COPY" mode and phase 3 is finished successfully, the backup image is available for a FlashCopy restore.
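For orientation, the storage acquisition that tdphdwdb2 automates in this phase corresponds roughly to the following manual AIX sequence per volume group (a simplified sketch only; device and volume group names are illustrative):

   cfgmgr                                 (discover the FlashCopy target LUNs)
   importvg -y DB2P06_DATA_VG vpath42     (redefine the volume group from one of its target LUNs)
   varyonvg DB2P06_DATA_VG
   mount /db2/EB8/sapdata1/NODE0006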

Phase 4: Run backup to tape on backup server(s) If it is intended to back up the FlashCopy image to tape, this is done in phase 4. DP for FlashCopy runs "db2start" and then issues the "db2inidb as standby" command for all database partitions: the split-mirror image is thereby prepared for taking a backup (a command-level sketch follows after the scheduling discussion below). From a functional perspective, the database backup scheduled afterwards is similar to running an online backup of the production database to tape; however, instead of running on the production system, the backup is executed on the split-mirror image on the backup server. Several backup scheduling and configuration capabilities are available, which have to be aligned with the available hardware (e.g. tape drives available for backup purposes) and so may influence the duration. The backup of the coordinating partition (partition 0) always has to be completed before the start of a backup of another partition. When the backup of partition 0 is done, all remaining database partitions can be backed up either sequentially or in parallel. The overall scheduling behavior for the remaining backups (except partition 0) is controlled by the TDP profile parameter DB2_EEE_PARALLEL_BACKUP:

• If DB2_EEE_PARALLEL_BACKUP is set to NO (default setting) all database backups aligned to one tdphdwdb2 process are done sequentially. For the PoC setup five tdphdwdb2 processes are running in parallel: Backups scheduled out of these different processes will run in parallel.

• If DB2_EEE_PARALLEL_BACKUP is set to YES (and the backup of the coordinating partition is finished), all remaining DB2 backups are started in parallel. To be able to run all backups in parallel, sufficient tape drives need to be available. In the PoC environment, only 16 LTO3 tape drives were available, which is less than the 33 partitions – 1 (partition 0) = 32 remaining partitions. However, it is possible to start 32 backups in parallel


even with only 16 tape drives available: DB2_EEE_PARALLEL_BACKUP has to be set to YES, and the TSM server options RESOURCETIMEOUT and MAXSESSIONS have to be set to sufficiently high values. All backups will then be started in parallel; however, the second half of the backups has to wait for the first half to complete in order to get access to a tape drive. For a production environment it is not recommended to "overbook" tape drive resources in such a manner.
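The steps DP for FlashCopy performs on the backup servers in this phase correspond, at the command level, to roughly the following (a sketch only; the scheduling of the individual partition backups is controlled by DB2_EEE_PARALLEL_BACKUP and the tdphdwdb2 processes as described above, and the db2_all partition prefixes are shown only to indicate "partition 0 first, then the rest"):

   db2start
   db2_all "db2inidb EB8 as standby"
   db2_all "<<+0< db2 BACKUP DATABASE EB8 USE TSM OPEN 2 SESSIONS"
   db2_all "<<-0< db2 BACKUP DATABASE EB8 USE TSM OPEN 2 SESSIONS"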

If the backups of one set (i.e. out of one tdphdwdb2 process) are done with the same parameterization, it is sufficient to specify that value once in the TDP configuration profile. If individual "sessions" settings are to be used for the backups of different database partitions, the "sessions" values for all database partitions need to be set in each TDP configuration profile. The test run discussed here in detail uses DB2_EEE_PARALLEL_BACKUP set to NO and runs the backup of all partitions with 2 sessions each: the backup of partition 0 is started first (using two sessions in the test run) and finishes about 20 minutes later. The five tdphdwdb2 processes then synchronize and the backups of all remaining database partitions are started. Each tdphdwdb2 process handles its backups sequentially, so, with one "stream" per DB server, 4 backup processes (2 sessions each) are running in parallel. The picture below illustrates the backup schedule: after the FlashCopy and the mount of the filesystems have been completed, the backup of partition 0 is started; then four streams of backups run in parallel. The total backup is finished with the end of the backup of the last partition.

[Chart: Backup scheduling and runtime – after the FlashCopy and mount phase, partition 0 is backed up on sys1db0p; then four parallel backup streams run on sys1db1p, sys1db2p, sys1db3p and sys1db4p, each backing up its local database partitions (6 – 37) sequentially; the total backup completes after roughly six hours.]


The graph below shows the corresponding throughput (read from the disks) during the backup time. Using 4 backup streams in parallel with 2 sessions each, the average throughput is about 913 MB/s.

[Chart: total throughput (fibre channel disk adapters) vs. time, 4 backups running in parallel using 2 sessions each – total read in MB/s over roughly 8 hours; after the FlashCopy and mount phase, the backup of partition 0 (2 sessions) averages 265 MB/s, and the 4 parallel backup streams (2 sessions each) average 913 MB/s.]

The backup server is configured with 8 CPUs (POWER5, 1,9 GHz). During the backup of four database partitions in parallel, they were utilized at about 70%; an equivalent of at least about 5.5 CPUs is required to sustain that rate.

[Chart: CPU utilization ((%sys + %usr) * numCPU) over time – number of CPUs utilized versus the total of 8 CPUs.]

The total throughput correlates quite linearly with the amount of CPU computational power used. During the FlashCopy run, the total of eight CPUs could not be fully exploited, so another bottleneck, e.g. limits in reading from the disks, writing to


tapes, or other boundary conditions like the number of FC adapters, limited the throughput. In the case of the PoC, it seems that having only two paths to each of the LUNs caused the bottleneck.

[Chart: total throughput [MB/s] vs. CPU consumption ((%sys + %usr) * numCPU), 8 CPUs installed, 4 DB2 backups using 2 sessions each – linear fit y = 168,28x + 75,771, R² = 0,9763.]

About 6h 15 min after the invocation of the tdphdwdb2 processes, the backup for all partitions is completed.

Phase 5: Cleanup TDP then starts cleanup actions on the backup server (unmount the filesystems, export the volume groups, remove the disk devices). The total backup including the cleanup actions is completed about 6 h 35 min after the start.

Observations for the background copy in the DS8k This chapter discusses some observations on the behavior of the background copy in the DS8300 storage box. The flashcopies during the PoC were typically executed as "incremental" ones. During this special test, approx. 65 percent of the tracks of the full storage amount of the database (24 TB) had been changed since the last FlashCopy run, so the background copy had to process about 15 TB. DP for FlashCopy polls the CIM server about the status of the background copy at regular intervals and so provides data on the progress of the background copy. This data is shown in the graph below:


[Chart: DS background copy rate in GB/h (sum over all volumes, smoothed) from 15:00 to about 03:00, with markers for the start and end of the generation of 500 GB of logs and the start and end of the db2 backups; average rates of 1490 GB/h and 1835 GB/h are indicated.]

The FlashCopy backup procedure is started at 15:00. As there are already existing source-target relationships ("incremental FlashCopy"), DP for FlashCopy needs about 45 minutes to identify all source and target LUNs and to check their relationships. At about 15:45 the FlashCopy commands are executed via the CIM server on the DS8k storage box, and the database is set back into normal operation. At 16:31 all FlashCopy volumes and filesystems are mounted on the backup server and the backup of partition 0 is started. In this example, all backups are finished at 22:11; all backups are scheduled using three tape sessions each. At about 17:00, a high workload is started on the production database for about 2 hours: during this high-volume phase, about 500 GB of database logs are generated within two hours. During this high workload on the production database, the data rate of the background copy decreases from about 1600 GB/h to roughly 1000 GB/h: large read/write I/O activity takes place on the disks of the source system, and in parallel, due to the backup of the FlashCopy database to tape, large read I/O operations are performed on the FlashCopy target disks. The backup is still running after the high workload on the production database has finished: the transfer rate of the background copy increases to nearly 1500 GB/h again (with large read I/O operations on the FlashCopy target disks still running in parallel to the background copy). Four backup streams are running in parallel, and some streams finish a bit earlier than others. During the ramp-down of the backup (21:50 – 22:10) the background copy rate increases. After all backups are finished, the background copy rate increases to about 1835 GB/h. The total background copy of the 15 TB is completed at about 02:00.

FlashCopy Restore On each of the five LPARs hosting database partitions, one tdphdwdb2 process is started. The five processes have to be started in parallel and synchronize themselves


among each other. All of them require some user input, so they cannot be started in background mode. The execution of the FlashCopy restore process is comparable to phases 1 to 3 of the FlashCopy backup procedure, followed by the activation of the DB2 database. At first, DP for FlashCopy checks the backup log for all available backups (either the last FlashCopy or a tape backup) and the configuration on the DB servers. One of the valid and available backup images has to be selected. DP for FlashCopy checks all FlashCopy source-target volume relationships, the filesystems etc. to validate the status of the selected backup image; the total runtime of this phase is about 15 minutes in the PoC environment. TDP queries the user before starting the "real" FlashCopy restore procedure: DB2 is stopped, all filesystems are unmounted and removed, and all volume groups are exported. Then the "inverse FlashCopy" is started for all volumes. All disk devices are (re-)acquired, the volume groups are imported and varied on, and all filesystems are mounted. This second step takes about 25 minutes in the PoC environment, so about 40 minutes after the start of the procedure all filesystems are available again. Afterwards, DB2 is started again and a "db2inidb as mirror" is run for all database partitions. This last step runs very quickly, so about 55 minutes after the start of the procedure TDP processing is finished and the roll-forward recovery can be started.
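At the command level, the final activation and the subsequent roll-forward correspond roughly to the following (a sketch only; in the PoC these steps are driven by DP for FlashCopy and the DB2 tooling, and the roll-forward is issued once from the catalog partition for all database partitions):

   db2_all "db2inidb EB8 as mirror"
   db2 "ROLLFORWARD DATABASE EB8 TO END OF LOGS AND COMPLETE"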


Database Restore from Tape The restore was not done on the TSM server, but back to the production database, in order to test the recovery procedure. The same scheduling was used for the restore of the database as for the backup procedure; for details about the scheduling, please refer to the section above covering this topic. Prior to the restore, the source database was dropped to simulate a full recreation of the database. To speed up the creation of tablespaces, the DB2 registry variable "DB2_USE_FAST_PREALLOCATION" was set to ON. After dropping the database, the restore was started at 6:35 p.m. and ran until 02:03 a.m. the following day, so the duration of the restore was 7 hours and 38 minutes.
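A minimal sketch of the corresponding commands (the timestamp is a placeholder; as for the backup, partition 0 is restored first and the remaining partitions follow, with the actual stream scheduling handled as described above):

   db2set DB2_USE_FAST_PREALLOCATION=ON
   db2_all "<<+0< db2 RESTORE DATABASE EB8 USE TSM OPEN 2 SESSIONS TAKEN AT <timestamp>"
   db2_all "<<-0< db2 RESTORE DATABASE EB8 USE TSM OPEN 2 SESSIONS TAKEN AT <timestamp>"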

[Chart: Restore scheduling and runtime, 4 parallel restore streams with 2 sessions each – partition 0 is restored first on sys1db0p, then four parallel restore streams run on sys1db1p, sys1db2p, sys1db3p and sys1db4p for partitions 6 – 37, over roughly eight hours.]


The following graphs give a good indication of the workload scheduling and distribution during the restore. The CPU configuration for the restore and roll-forward tests is 8 CPUs for the LPAR with DB2 partition 0 and 4 CPUs for all other LPARs. The LPAR with the DB2 catalog partition is equipped with 40 GB of memory, the other LPARs with 24 GB.


[Chart: CPU usage during the restore process.]

The graph below shows the corresponding throughput (write to the disks) during the restore time: Using 4 restore streams in parallel with 2 sessions each, the average throughput is about 688 MB/s.

[Chart: total throughput (fibre channel disk adapters) vs. time, 4 restore streams running in parallel using 2 sessions each – data transfer rate during the restore process.]


Roll-forward Recovery After restoring the database, it is necessary to roll forward to a specific point in time or to the end of the available log files. DB2 with the database partitioning feature supports this critical part of the recovery process with its parallel architecture, so all partitions are rolled forward in parallel. In this test, 2 terabytes of logs had to be rolled forward, and the goal was to complete the restore and roll-forward process within 18 hours. The log files used during this test were created using the workload test scenarios and some artificial scripts to speed up the process. The two graphs below show the behavior of the roll-forward process: the gigabytes rolled forward per DB2 partition and the evenly spread CPU usage of all DB2 partitions on the selected LPAR.

[Chart: parallel roll-forward of all DB2 partitions – gigabytes rolled forward per DB2 partition.]

The graph above displays the amount of data rolled forward during this test. A parallel recovery of about the same amount of data can be seen on each DB2 partition. However, DB2 partition 0 runs longer, and its throughput increases as the other partitions finish their work. The throughput during the roll-forward is not only bound to the CPU usage on the system, as shown in a graph below, but also to the network capacity for retrieving the DB2 log files. So, after the network traffic decreases once the majority of the DB2 partitions have finished, more bandwidth is available for DB2 partition 0. The fact that DB2 partition 0 applies more data is not due to an architectural design, but simply because more logs are generated on this partition by the workload running in the PoC. This could change, depending on the workload of other SAP NetWeaver BI installations.


[Chart: CPU usage by DB2 partition (WLM classes) on the selected LPAR.]

The roll-forward starts at 09:30 a.m. and finishes at 01:54 p.m. the same day. The forward recovery phase was completed at 01:24 p.m., which leaves 30 minutes for the rollback phase of the recovery process. The CPU usage of the DB2 partitions during this roll-forward is shown below. As the database partitions 6 – 37, and with them the four corresponding LPARs, show the same behavior, only the LPAR with DB2 partition 0 (the system catalog) and the LPAR hosting DB2 partitions 6 – 13 are shown.

[Chart: CPU usage on the LPAR with the DB2 system catalog.]

[Chart: CPU usage on the LPAR with DB2 partitions 6 – 13.]

The CPU usage on the LPARs holding the database partitions 6 – 37 is very high and is the limiting factor during this test. The CPU usage on the DB2 system catalog


is within the limits, but the other LPARs run at a very high level of system and user CPU consumption. The CPU usage grows almost instantly to about 100% after starting the roll-forward and decreases at the end. The slower decrease at the end can be explained by the fact that individual DB2 partitions have finished the recovery process, so less CPU is used for the remaining roll-forward processes. The way the usage decreases also indicates that the CPU usage during the roll-forward process is about 100% but the system is not heavily stressed beyond this limit. Comparing the CPU usage of DB2 partition 0 and the other partitions shows that partition 0 runs longer than the other partitions. This may vary in other situations, as the relationship between the log volumes produced on DB2 partition 0 and on the other partitions depends on the system usage. However, the roll-forward on the DB2 partitions 6 – 37 should behave the same within some limits, which also indicates the equal usage of all partitions. The roll-forward process generates I/O requests on the disks, both read and write requests: DB2 reads from the log files and writes to the database. The following graphs show the I/O profile during the roll-forward process.

Data Read from Disk during Roll-forward

Data written to disk during Roll-forward

The I/O profile reflects the CPU usage during the process, with an immediate increase to the maximum values and decreasing rates at the end. While the average read rate during the roll-forward is 201 MB/s, the average write rate is 346 MB/s. The combined average workload on the I/O subsystem is therefore about 550 MB/s, and DB2 writes almost twice as much data as it reads from disk. The log files for the roll-forward are retrieved from the TSM server over the network, so the network workload must also be taken into account in such a scenario. The system landscape is equipped with two Gigabit Ethernet networks: one network is dedicated to the DB2 communication between the partitions (via the en1 adapter), while the second network, configured on the en0 adapter, is used for external traffic to TSM or SAP. The graphs below show the network traffic of the LPAR with DB2 partition 0 and the LPAR with DB2 partitions 6 – 13. As the other partitions show a similar behavior, their graphs are not included in this document. Like the CPU usage, the network traffic on the LPAR with the DB2 system catalog increases at the end, as DB2 partition 0 is still working on the roll-forward and can now use all available resources exclusively. The main workload on the network consists of read operations on adapter en0, which reflects the retrieval of the log files from TSM.

Network usage during Roll-forward Process
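The per-adapter network traffic plotted above can also be inspected on the AIX command line; a minimal sketch, assuming the adapter layout described above (en0 for TSM/SAP traffic, en1 for DB2 communication):

    # Packet counters per network interface (en0, en1, ...):
    netstat -i

    # Detailed adapter statistics, including bytes sent and received per adapter:
    netstat -v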

Index Creation Index creation may show up at the end of the roll-forward operation. If index creations are part of the recovered log records, they are, by default, not performed during the roll-forward. Instead, the index creation is performed on database restart or on first access of a table with invalid indexes, and it can take several hours depending on the size and number of indexes to be recreated. As SAP NetWeaver BI drops and creates indexes during normal operation, e.g. when loading data or rebuilding aggregates, this can influence the total recovery time significantly. The DB2 configuration parameter INDEXREC, which can be defined at both the database manager and the database level, influences when the index recreation starts. This parameter indicates when the database manager will attempt to rebuild invalid indexes, and whether or not any index build will be redone during DB2 roll-forward.
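The current settings can be inspected at both levels; a minimal sketch, again using the placeholder database name BWP:

    # Index recreation setting at the database manager level:
    db2 get dbm cfg | grep -i INDEXREC

    # Index recreation and index-build logging settings at the database level:
    db2 get db cfg for BWP | grep -i -E "INDEXREC|LOGINDEXBUILD"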

When INDEXREC is set to SYSTEM, the value specified at the database manager level is used to decide when invalid indexes are rebuilt and whether any index build log records are redone during DB2 roll-forward. When it is set to ACCESS, invalid indexes are rebuilt when the index is first accessed, and any fully logged index builds are redone during DB2 roll-forward. With ACCESS_NO_REDO, invalid indexes are rebuilt when the underlying table is first accessed, but fully logged index builds are not redone during DB2 roll-forward, and those indexes are left invalid. The default value for INDEXREC is RESTART: invalid indexes are rebuilt when a RESTART DATABASE command is issued either explicitly or implicitly, and any fully logged index builds are redone during DB2 roll-forward.

The database configuration parameter LOGINDEXBUILD can be changed from its default value OFF to ON. DB2 will then log index builds completely and recreate the indexes during the roll-forward operation, so the index recreation does not show up at first table access or at restart of the database. However, this setting increases the amount of data written to the log files and may have an overall negative impact on the log file management processes or on the roll-forward itself. Especially in SAP NetWeaver BI implementations, where many indexes are dropped and recreated during data loads or aggregate rebuilds, this may affect the overall performance.

Marking the indexes of the largest tables invalid using the db2dart utility simulates the index recreation scenario. The tables are about 3 terabytes in size in total and consist of fact tables, ODS tables and PSA tables. The test starts with the restart of the database and with the INDEXREC parameter set to RESTART. The process started at 2:47 p.m. and ended at 3:55 p.m., which gives an overall runtime of 1 hour and 8 minutes. As this is a very special part of a system's life, the CPU and memory configuration was changed to meet the expected high CPU workload. The CPU configuration for the index creation is 12 CPUs for the LPAR with DB2 partition 0 and 8 CPUs for each of the other LPARs. The LPARs are equipped with 30 GB of memory for the DB2 catalog partition and 40 GB for the other LPARs.
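The parameter changes and the db2dart invocation described above could look roughly as follows. This is a sketch only: BWP, the table space ID and the object ID are placeholders, and the db2dart /MI action requires the database to be offline.

    # Log index builds completely so that they can be redone during roll-forward:
    db2 update db cfg for BWP using LOGINDEXBUILD ON

    # Rebuild invalid indexes at database restart (the setting used in this test):
    db2 update db cfg for BWP using INDEXREC RESTART

    # Mark the index object of one table invalid to simulate the scenario;
    # the table space ID and object ID can be taken from SYSCAT.TABLES
    # (columns TBSPACEID and TABLEID):
    db2dart BWP /MI /TSI 4 /OI 15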

CPU Usage during Index Creation

As all of the large tables reside on DB2 partitions 6 – 37, the LPAR with the DB2 system catalog is not used very heavily. The other LPARs show a similar profile, so only the LPAR for DB2 partitions 6 – 13 is displayed. The CPU profile shows a very large portion of I/O wait, which was not expected initially but illustrates how demanding this scenario is on I/O throughput. The I/O profile mainly shows read operations from disk, whereas write operations are only a fraction of the overall workload on the disk subsystem.

Total Data Rate during Index Creation.

Write Data Rate during Index Creation

The overall workload on the disks is within the limits of the storage subsystem used. Given these results, the workload on the fibre channel adapters provides further hints for optimizing this scenario. The following graph shows the read throughput per fibre channel adapter. As the practical throughput of the adapters used is about 200 MB/s, the graph shows that some adapters are heavily utilized during this test. Increasing the number of adapters from 3 to 4 or more would therefore probably have a positive impact on the throughput. However, as the overall results of this specific test are excellent, further tests and hardware changes were not performed during the PoC.

Data Rate measured on each fibre channel adapter.
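On AIX 5.3, the traffic per fibre channel adapter can also be checked with the fcstat command; a minimal sketch, assuming the three HBAs are the placeholder device names fcs0 to fcs2:

    # Print the statistics of each fibre channel adapter, including the frames
    # and words transmitted and received since the counters were last reset:
    for hba in fcs0 fcs1 fcs2; do
        echo "=== $hba ==="
        fcstat $hba
    done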

In addition, some further DB2 optimizations are possible but were not performed during this test. These include enabling intra-partition parallelism (INTRA_PARALLEL) for more parallelism during index creation, or increasing the sort memory and decreasing the buffer pools to allow in-memory sorts and index creation.
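Such changes could be sketched as follows; the database name BWP, the buffer pool name and all memory values are placeholders and would have to be sized for the individual system:

    # Allow a single index build to use several CPUs within one partition:
    db2 update dbm cfg using INTRA_PARALLEL YES

    # Shift memory from the buffer pools towards sort space for the index
    # build phase (values are in 4 KB pages):
    db2 update dbm cfg using SHEAPTHRES 500000
    db2 update db cfg for BWP using SORTHEAP 100000
    db2 "ALTER BUFFERPOOL IBMDEFAULTBP SIZE 100000"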

Summary While backup and restore with DB2 also scale in a single-partition environment, the tests performed here show the flexibility and parallelism of DB2 in a multi-partition implementation. With the option to restore the DB2 partitions in parallel after restoring the catalog partition, DB2 is well equipped for a disaster recovery scenario. The flexibility to restore single partitions and to schedule the restore in different ways also gives the customer the possibility to handle various outages accordingly. The tests and experiences also show the excellent integration of DB2 with TSM and the Tivoli Data Protection products.

This large Proof of Concept shows the feasibility of managing a 20-terabyte SAP NetWeaver Business Intelligence system running with DB2 on IBM infrastructure. The tests performed reflect a real customer scenario with given business needs and availability requirements, which will be typical of other large installations.

Establishing the FlashCopy and backing up the image to tape could be done in about 6 h 35 min while using 8 LTO3 tape drives in parallel. In this scenario, about 1 h had to be spent invoking the FlashCopy itself and mounting the many file systems on the backup server. Most of that time was spent "discovering" the devices and mounting the file systems. The time required for this step can be optimized by reducing the total number of LUNs (and SDD paths), volume groups and file systems. Due to constraints with changing the setup, the number of LUNs, VGs and file systems could not be changed in the PoC environment; only the number of SDD paths was varied. However, having too few SDD paths per LUN may in turn have a negative impact on the backup time, so a compromise needs to be found.

With the database distributed across 5 LPARs (partition 0 on sys1db0p, eight partitions on each of the other LPARs), a "natural" approach is to use up to four, eight, twelve, … tapes in parallel. The speed of the backup depends on the number of parallel target tape drives. Assuming 32 tapes were available for the backup, the highest throughput would be achieved by running all the database backups in parallel with one session each. As a rough sizing estimate for the backup server / data mover, about 6 CPUs are required to back up 4 streams in parallel (2 sessions each, i.e. 8 LTO3 tapes in total) and achieve a rate of about 1 GB/s.
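As an illustration of the restore sequence referred to above (catalog partition first, remaining partitions in parallel), the commands could look roughly as follows. The database name BWP, the backup timestamp and the db2_all prefix syntax are assumptions to be verified against the DB2 documentation rather than the exact commands used in the PoC.

    # 1) Restore the catalog partition (partition 0) first:
    export DB2NODE=0
    db2 terminate
    db2 "RESTORE DATABASE BWP USE TSM TAKEN AT 20060915103000 REPLACE EXISTING"

    # 2) Restore the remaining partitions in parallel; '<<-0<' excludes
    #    partition 0 and ';' runs the command on all partitions concurrently:
    db2_all "<<-0<; db2 RESTORE DATABASE BWP USE TSM TAKEN AT 20060915103000 REPLACE EXISTING"

    # 3) Roll forward to the end of the logs from the catalog partition:
    export DB2NODE=0
    db2 terminate
    db2 "ROLLFORWARD DATABASE BWP TO END OF LOGS AND COMPLETE"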

The test investigating the roll-forward process clearly shows a parallel recovery of about the same amount of data on each DB2 partition. However, DB2 partition 0 runs longer, and its throughput increases as the other partitions finish their work. The database was open again after 4 hours. Looking at the utilization of the various resources, such as CPU, fibre channel adapters and network, the results show a well-balanced system; increasing one kind of resource would be expected to shift the limiting factor to other parts. These tests clearly demonstrate the benefits of the DB2 shared-nothing concept, not only in day-to-day performance but also in critical recovery scenarios.

The creation of indexes for 3 TB of data finished after a runtime of 1 hour and 8 minutes, with the given CPU configuration of 12 CPUs for the LPAR with DB2 partition 0 and 8 CPUs for each of the other LPARs, a fibre channel configuration of 3 adapters, and the given Gigabit Ethernet connections. The system performance for this specific test might improve further when using more fibre channel adapters to serve the read requests generated during this operation. Another interesting finding is the heavy read activity compared to the write activity, which might change with different index definitions on an individual system.

In the end, all requirements were fulfilled and exceeded, giving large installations the confidence to run such enormous SAP NetWeaver Business Intelligence systems on IBM infrastructure, as well as verifying the parallel database design for the infrastructure configuration of the second phase of the Proof of Concept. Phase 2 of the PoC will perform similar tests running a 60-terabyte SAP NetWeaver Business Intelligence system on IBM infrastructure – so stay tuned.

Appendix

SAP Software Stack
SAP Kernel 640_Rel Patch 129
SAP BI 350 SP 12
SAP_ABA 640 SP12
SAP_BASIS 640 SP16
PI_BASIS 2004_1_640 SP9
ST_PI 2005_1_640 SP1
BI_CONT 352 SP7
ST-A/PI O1F_BCO640

AIX Software Stack
AIX 5L Version 5.2 ML 6

AIX 5L Version 5.3 ML 5 SP4

Tivoli and DB2 Software Stack
DB2 UDB ESE 8.2 FP12
DB2 9 for Linux, UNIX and Windows 9.0 FP0
IBM Tivoli Storage Manager for Hardware 5.3.1.2: Data Protection for Disk Storage and SAN VC for SAP with DB2 UDB (DP for FlashCopy)
IBM Tivoli Storage Manager for ERP 5.3.2.2: Data Protection for SAP (DP for SAP)
IBM Tivoli Storage Agent 5.3.2.0
TSM Client – Application Programming Interface 5.3.0.12
TSM Client – Backup/Archive 5.3.0.12
Pegasus CIM Server Runtime Environment 2.5.0.0 (sysmgt.pegasus.cimserver.rte)
Base Providers for AIX OS (sysmgt.pegasus.osbaseproviders) 1.2.5.0

Copyrights and Trademarks © IBM Corporation 1994-2005. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L(logo), DB2, e(logo)server, IBM, IBM(logo), System P5, System/390, z/OS, zSeries. The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: Advanced Micro-Partitioning, AIX/L(logo), AIX 5L, DB2 Universal Database, eServer, i5/OS, IBM Virtualization Engine, Micro-Partitioning, iSeries, POWER, POWER4, POWER4+, POWER5, POWER5+, POWER6. A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. SAP, the SAP logo, and SAP R/3 are trademarks or registered trademarks of SAP AG in Germany and in many other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other company, product or service names may be trademarks or service marks of others. Information is provided "AS IS" without warranty of any kind. Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. More about SAP trademarks at: http://www.sap.com/company/legal/copyright/trademark.asp


ISICC-Press CTB-2007-1.2 IBM SAP International Competence Center, Walldorf