Informatica Data Explorer Performance Tuning

© 2011 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation.


Abstract

The system resource guidelines for Informatica Data Explorer include resource recommendations for the Profiling Service Module, the Data Integration Service, profile warehouse, and hardware settings for different profile types. You can follow the guidelines for mapping memory and disk size configuration for profiles with Data Quality transformations in them. This article describes the system performance guidelines for Informatica Data Explorer.

Supported Versions

- Informatica Data Explorer 9.1.0

Table of Contents

System Performance Guidelines Overview
Resource Guidelines
    Profiling Service Module
    Data Integration Service
    Hardware Considerations for Flat File and Mainframe Sources
    Hardware Considerations for Relational Sources
    Profile Warehouse Guidelines for Column Profiling
    Profile Warehouse Guidelines for Key and Functional Dependency Discovery
    Profile Warehouse Guidelines for Foreign Key and Overlap Discovery
Resource Guidelines for Profiles with Data Quality Transformations
    Mapping Memory and Disk Size Guidelines for Standard Transformations
    Mapping Memory and Disk Size Guidelines for Reference Data Transformations

System Performance Guidelines Overview

Effective performance tuning of Informatica Data Explorer depends on how well you balance system resources for the Data Integration Service, the Profiling Service Module, and profile warehouse. It is important to organize mapping memory and disk size for profiles with Data Quality transformations.

Resource Guidelines

Resource guidelines include resource recommendations such as number of CPUs, amount of memory, disk space, and disk speed. The optimal use of these resources can lead to improved performance of the Profiling Service Module, the Data Integration Service, and profile warehouse.

The system resource guidelines depend on profile types. Column profiling guidelines depend on the data source type and hardware capacity. Other types of profiling such as key discovery, functional dependency discovery, foreign key discovery, and overlap discovery have specific hardware resource guidelines.


Profiling Service Module

The Profiling Service Module interacts with profile warehouse and data sources such as relational databases and nonrelational databases. Modern relational databases are optimized to process the data stored in them.

The Profiling Service Module requires additional resources to read a nonrelational database source. Nonrelational sources can be SAP resources or mainframe sources, such as IMS or VSAM. For mainframe sources, the Profiling Service Module performs most of the data processing tasks to minimize the data access costs.

The following table describes the system resource requirements for the Profiling Service Module:

System Resource     Requirement
---------------     -----------
CPU                 Informatica Data Explorer uses less than 1 CPU. Each profile type has different CPU requirements:
                    - Relational systems require less than 1 CPU for each Data Transformation Manager thread.
                    - Flat files use approximately 2.3 CPUs for each Data Transformation Manager thread.
                    - Key and functional dependency discovery require 1 CPU for each Data Transformation Manager thread.
                    - Join, foreign key, and overlap discovery require 2 CPUs for each Data Transformation Manager thread.
Memory              Minimum memory required to run the profile.
Disk                No disk space is required.
Operating System    Use a 64-bit operating system if memory requirements are greater than 3 GB.
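The per-thread CPU figures in the table can be combined into a rough capacity estimate. The following Python sketch is illustrative only: the function name and dictionary keys are hypothetical, and "less than 1 CPU" for relational sources is rounded up to 1.0 as an assumed upper bound.

```python
# CPUs per Data Transformation Manager (DTM) thread, taken from the table above.
# ASSUMPTION: "less than 1 CPU" for relational sources is rounded up to 1.0.
CPUS_PER_DTM_THREAD = {
    "relational": 1.0,
    "flat_file": 2.3,
    "key_or_functional_dependency": 1.0,
    "join_foreign_key_or_overlap": 2.0,
}


def estimate_cpus(thread_counts):
    """Sum the CPU demand for a mix of concurrent DTM threads.

    thread_counts maps a profile type (a key of CPUS_PER_DTM_THREAD)
    to the number of DTM threads of that type running at the same time.
    """
    return sum(CPUS_PER_DTM_THREAD[kind] * count
               for kind, count in thread_counts.items())


if __name__ == "__main__":
    # Two flat file threads and three relational threads running concurrently.
    print(round(estimate_cpus({"flat_file": 2, "relational": 3}), 1))
```

Treat the result as a planning ceiling rather than a measurement, because actual CPU use depends on the data and the database.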

Data Integration Service

The Data Integration Service runs the Profiling Service Module. The Data Integration Service has fixed memory and variable memory requirements. The CPU requirements are not significant.

The following table describes the memory requirements:

Memory Type    Description
-----------    -----------
Fixed          The amount of memory required to run the Java Virtual Machine that the Data Integration Service uses. The requirement is approximately 500 MB.
Variable       The amount of memory required to run each Data Transformation Manager thread. One Data Transformation Manager thread is required to run each mapping that computes a part of a profile job. This overhead is dependent on the Maximum Execution Pool Size property in the service properties. The default value of this property is 10 and the overhead is approximately 1000 MB.
               Note: A profile that reads the output of an address validation rule may incur an additional 1 GB in memory to read and cache the address validation reference data.
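As a rough illustration, the fixed and variable figures above can be combined into a single memory estimate. The Python sketch below rests on one assumption that the article does not state: that the variable overhead scales linearly with the Maximum Execution Pool Size (about 100 MB per pool slot, derived from 1000 MB at the default of 10). The function and constant names are hypothetical.

```python
FIXED_JVM_MB = 500            # approximate JVM requirement from the table
DEFAULT_POOL_SIZE = 10        # default Maximum Execution Pool Size
DEFAULT_VARIABLE_MB = 1000    # approximate overhead at the default pool size
ADDRESS_VALIDATION_MB = 1024  # about 1 GB extra per the note on address validation


def estimate_dis_memory_mb(max_execution_pool_size=DEFAULT_POOL_SIZE,
                           address_validation_profiles=0):
    """Estimate Data Integration Service memory in MB.

    ASSUMPTION: the variable overhead grows linearly with the pool size.
    The article only gives the figure for the default pool size of 10.
    """
    per_slot_mb = DEFAULT_VARIABLE_MB / DEFAULT_POOL_SIZE
    variable_mb = per_slot_mb * max_execution_pool_size
    extra_mb = ADDRESS_VALIDATION_MB * address_validation_profiles
    return FIXED_JVM_MB + variable_mb + extra_mb


if __name__ == "__main__":
    print(estimate_dis_memory_mb())        # default pool size
    print(estimate_dis_memory_mb(20, 1))   # larger pool, one address validation profile
```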


Hardware Considerations for Flat File and Mainframe Sources

When you run a profile job on a flat file, the Profiling Service Module generates mappings that infer the metadata for the columns and virtual columns. Each mapping can run serially or in parallel.

The Profiling Service Module may generate a second type of mapping to cache the source data. This mapping always runs in parallel with the column profiling mappings because it takes longer than a column profile mapping.

The following section describes the hardware requirements for running different profiles on flat file and mainframe sources:

Column Profile for a Column Profile Mapping

A column profile mapping has the following requirements:

CPU
    2.3
Memory
    The minimum resource required is 10 MB, representing 2 MB * 5 columns. The maximum resource required is 72 MB, representing a 64 MB buffer for one high-cardinality column and 8 MB for the remaining four low-cardinality columns.
Disk Space
    2 * Number of columns per mapping * Maximum number of rows * ((2 bytes per character * Maximum string size in characters) + frequency bytes)
Disk Speed
    7200 RPM is the minimum required disk speed.
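The disk space formula for a column profile mapping can be evaluated with a short Python sketch. The function name is hypothetical, and the default of 8 frequency bytes is an assumption, because this section does not give a value (the relational section mentions 4 or 8 bytes).

```python
def column_profile_disk_bytes(columns_per_mapping, max_rows, max_string_chars,
                              frequency_bytes=8, bytes_per_char=2):
    """Disk space in bytes for one column profile mapping on a flat file.

    Implements:
      2 * columns per mapping * maximum rows
        * ((2 bytes per character * maximum string size in characters)
           + frequency bytes)

    ASSUMPTION: frequency_bytes defaults to 8; adjust it for your environment.
    """
    bytes_per_value = bytes_per_char * max_string_chars + frequency_bytes
    return 2 * columns_per_mapping * max_rows * bytes_per_value


if __name__ == "__main__":
    # 5 columns per mapping, 1 million rows, strings of up to 50 characters.
    size = column_profile_disk_bytes(5, 1_000_000, 50)
    print(size)  # 1080000000 bytes, a little over 1 GB
```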

Column Profile for a Profile Cache Mapping

A profile cache mapping has the following requirements:

CPU
    1.5
Memory
    Memory required for Data Transformation Manager
Disk Space
    No disk space is required.
Disk Speed
    Not applicable for a flat file source. For a mainframe source, 7200 RPM is the minimum required disk speed.

Key and Functional Dependency Discovery

Key and functional dependency discovery have the following requirements:

CPU
    1
Memory
    256 MB, in addition to the mapping memory
Disk Space
    A minimum of 128 GB
Disk Speed
    7200 RPM is the minimum required disk speed.

Foreign Key and Overlap Discovery

Foreign key and overlap discovery have the following requirements:

CPU
    2
Memory
    64 MB
Disk Space
    No disk space is required.
Disk Speed
    Not applicable

Hardware Considerations for Relational Sources

The Profiling Service Module transfers as much processing as it can to the machine hosting the relational database. The division of work between the Profiling Service Module and the database can be challenging when you estimate resources for each machine.

The following section describes resource considerations based on a single mapping that pushes the profiling logic down to the relational database for each column:

CPU
    Based on the relational database, at least one CPU processes each query. If the relational database provides a mechanism to increase this, such as the parallel hint in Oracle, the number of CPUs utilized increases accordingly.
Memory
    The relational database requires memory in the form of a buffer cache. The greater the buffer cache, the faster the relational database runs the query. Use at least 512 MB of buffer cache.
Disk
    Relational systems use temporary table space. The formula for the maximum amount of temporary table space required is as follows:

    2 * maximum number of rows in any table * (maximum column size + frequency bytes)

    where
    - 2 = two passes (some analyses need two passes).
    - Maximum column size = the number of bytes in any column in a table that is not one of the very large datatypes, for example CLOB, that you cannot run a profile on. The column size must take into account the character encoding, such as Unicode or ASCII.
    - Frequency bytes = 4 or 8 bytes to store the frequency during the analysis. This is the default size that the database uses for COUNT(*).
Operating System
    Use a 64-bit operating system if memory requirements are greater than 3 GB.
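The temporary table space formula for relational sources can be sketched in Python as follows. The function name is hypothetical, and the choice of 8 frequency bytes as the default is an assumption within the 4 or 8 byte range that the article gives.

```python
def temp_tablespace_bytes(max_rows_in_any_table, max_column_bytes,
                          frequency_bytes=8):
    """Maximum temporary table space in bytes for a relational source.

    Implements:
      2 * maximum number of rows in any table
        * (maximum column size + frequency bytes)

    max_column_bytes must already account for the character encoding
    and must exclude very large datatypes such as CLOB.
    ASSUMPTION: frequency_bytes defaults to 8 (the article says 4 or 8).
    """
    passes = 2  # some analyses need two passes
    return passes * max_rows_in_any_table * (max_column_bytes + frequency_bytes)


if __name__ == "__main__":
    # Largest table has 10 million rows; widest profiled column is 200 bytes.
    print(temp_tablespace_bytes(10_000_000, 200))  # 4160000000
```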


Profile Warehouse Guidelines for Column Profiling

The profile warehouse stores profiling results. The main resource for the profile warehouse is disk space. The disk size calculations depend on the expected storage sizes of integers. Some databases, such as Oracle, use a compressed number format and they require less disk size.

Column profiling stores statistical and bookkeeping data, value frequencies, and staged data in the profile warehouse.

Following are the profile warehouse guidelines for column profiling:

Statistical and Bookkeeping Data Guidelines
    Each column contains a set of statistics, such as the minimum and maximum values. The profile warehouse contains a set of tables that store bookkeeping data, such as profile ID. These tables take up very little space and you can exclude them from disk space calculations.

Value Frequency Calculation Guidelines
    Value frequencies are a key element in profile results. They list the unique values in a column along with a count of the occurrences of each value. Low cardinality columns have very few values, but large cardinality columns can have millions of values. The Profiling Service Module limits the number of unique values it identifies to 16,000 by default. You can change this value.

    Use the following formula to calculate disk size requirements:

    Number of columns * number of unique values * (average value size + 64)

    where
    - Number of columns = the sum of columns and virtual columns in the profile run.
    - Average value size includes Unicode encoding of characters.
    - 64 bytes for each value = 8 bytes for the frequency and 56 bytes for the key.
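The value frequency formula can be sketched in Python as shown below. The function and parameter names are hypothetical. The sketch also assumes that the configured unique value limit (16,000 by default) caps the number of values stored for each column, which is an interpretation of the text rather than a documented behavior.

```python
def value_frequency_bytes(num_columns, unique_values_per_column,
                          avg_value_bytes, unique_value_limit=16_000):
    """Profile warehouse disk size in bytes for value frequencies.

    Implements:
      number of columns * number of unique values * (average value size + 64)

    num_columns counts both columns and virtual columns in the profile run.
    ASSUMPTION: unique values per column are capped at unique_value_limit.
    """
    per_value_overhead = 64  # 8 bytes for the frequency + 56 bytes for the key
    stored_values = min(unique_values_per_column, unique_value_limit)
    return num_columns * stored_values * (avg_value_bytes + per_value_overhead)


if __name__ == "__main__":
    # 50 columns, high cardinality (capped at 16,000), 20-byte average values.
    print(value_frequency_bytes(50, 1_000_000, 20))  # 67200000
```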

Cached Data Guidelines
    Cached data is also known as staged data. It is a copy of the source data that is used for drilldown operations. Depending on the data source, this can use a very large amount of disk space.

    Use the following formula to calculate disk size requirements for cached data:

    Number of rows * number of columns * (average value size + 24)

    where 24 is the cache key size. Sum the results of this calculation for all cached tables.

Other Resource Needs
    The profile warehouse has the following memory and CPU requirements:

    Memory
        The queries run by the Profiling Service Module do not use significant amounts of memory. Use the manufacturer's recommendations based on the table sizes.
    CPU
        Use 1 CPU for each concurrent profile job. This applies to each relational database or flat file profile job, not to each profile mapping. If the data is cached, use 2 CPUs for each concurrent profile job.
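The cached data formula in this section sums one term for each cached table. A minimal Python sketch, with a hypothetical function name and input shape, might look like this:

```python
CACHE_KEY_BYTES = 24  # cache key size from the cached data formula


def cached_data_bytes(tables):
    """Profile warehouse disk size in bytes for cached (staged) data.

    tables is an iterable of (rows, columns, avg_value_bytes) tuples,
    one per cached table. For each table the sketch computes
      rows * columns * (average value size + 24)
    and sums the results, as the guideline describes.
    """
    return sum(rows * columns * (avg_value_bytes + CACHE_KEY_BYTES)
               for rows, columns, avg_value_bytes in tables)


if __name__ == "__main__":
    cached_tables = [
        (1_000_000, 10, 20),  # 1 million rows, 10 columns, 20-byte values
        (500_000, 4, 16),     # 500,000 rows, 4 columns, 16-byte values
    ]
    print(cached_data_bytes(cached_tables))  # 520000000
```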

Profile Warehouse Guidelines for Key and Functional Dependency Discovery

The disk space for key and functional dependency discovery depends on the number of inferred keys, functional dependencies, and their dependency violations. These items can take up a large amount of space in the profile warehouse if you set a large number for key and functional dependency discovery.

You can use the following formulas to compute the disk space. If you set the confidence parameter to 100%, the profile warehouse does not store violating rows and you can omit that part of the computation.

Keys
    Use the following formula to compute the disk space for key discovery:

    Number of Inferred Keys * Average Number of Columns in the Key * 32 + Number of Keys * (32 + (2 * Average Column Size) * Average Number of Key Columns * Average Number of Rows that Violate the Key)

    where
    - 32 is the number of bytes used to store one column in the key.
    - 2 is the typical number of bytes used for a single Unicode character.
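The key discovery formula can be sketched in Python as follows. The names are hypothetical. The sketch assumes that "Number of Inferred Keys" and "Number of Keys" refer to the same count, and that the average column size is measured in characters, since it is multiplied by 2 bytes per Unicode character.

```python
def key_discovery_bytes(inferred_keys, avg_key_columns, avg_column_chars,
                        avg_violating_rows, include_violations=True):
    """Profile warehouse disk space in bytes for key discovery.

    First term:  keys * average columns in the key * 32
    Second term: keys * (32 + (2 * average column size)
                              * average key columns
                              * average rows that violate the key)

    ASSUMPTION: both terms use the same key count.
    Set include_violations=False when the confidence parameter is 100%,
    because violating rows are then not stored.
    """
    bytes_per_key_column = 32
    bytes_per_char = 2  # typical size of one Unicode character
    key_term = inferred_keys * avg_key_columns * bytes_per_key_column
    if not include_violations:
        return key_term
    violation_term = inferred_keys * (
        32 + (bytes_per_char * avg_column_chars)
        * avg_key_columns * avg_violating_rows
    )
    return key_term + violation_term


if __name__ == "__main__":
    # 10 keys, 2 columns each, 20-character columns, 100 violating rows.
    print(key_discovery_bytes(10, 2, 20, 100))  # 80960
```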

Functional Dependency
    Use the following formula to compute the disk space for functional dependency:

    Number of Inferred Functional Dependencies * (Average Number of LHS Columns + 1) * 32 + Number of Inferred Functional Dependencies * (32 + (2 * Average Number of Characters in Columns) * (Average Number of LHS Columns) * Average Number of Rows that Violate the Functional Dependency)

    where
    - Average Number of LHS Columns is the average number of columns in the determinant of the functional dependency. One is added for the dependent column.
    - 32 is the number of bytes used to store one column in the functional dependency.
    - 2 is the typical number of bytes used for a single Unicode character.
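The functional dependency formula can be sketched the same way. The function name is hypothetical. The printed formula leaves its last parenthesis unclosed, so the sketch assumes the second term closes after the violating row count, mirroring the key discovery formula.

```python
def functional_dependency_bytes(inferred_fds, avg_lhs_columns, avg_column_chars,
                                avg_violating_rows, include_violations=True):
    """Profile warehouse disk space in bytes for functional dependencies.

    First term:  FDs * (average LHS columns + 1) * 32
    Second term: FDs * (32 + (2 * average characters in columns)
                             * average LHS columns
                             * average rows that violate the dependency)

    ASSUMPTION: the closing parenthesis of the second term follows the
    violating row count, as in the key discovery formula.
    Set include_violations=False when the confidence parameter is 100%.
    """
    bytes_per_fd_column = 32
    bytes_per_char = 2  # typical size of one Unicode character
    # One is added to the LHS column count for the dependent column.
    fd_term = inferred_fds * (avg_lhs_columns + 1) * bytes_per_fd_column
    if not include_violations:
        return fd_term
    violation_term = inferred_fds * (
        32 + (bytes_per_char * avg_column_chars)
        * avg_lhs_columns * avg_violating_rows
    )
    return fd_term + violation_term


if __name__ == "__main__":
    # 20 dependencies, 2 LHS columns, 15-character columns, 50 violating rows.
    print(functional_dependency_bytes(20, 2, 15, 50))  # 62560
```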

Profile Warehouse Guidelines for Foreign Key and Overlap Discovery

The disk space for foreign key and overlap discovery depends on the number of inferred foreign keys and overlapping column pairs. These items can take up a large amount of space in the profile warehouse if you set a large number for foreign key and overlap discovery.

The Profiling Service Module computes column signatures once for foreign key and overlap discovery. You can use the following formula for computing the disk space for column signatures:

Signatures
    Number of Columns in Schema * 3600

    where
    - Number of Columns in Schema is the total number of columns in the profile model. After the Profiling Service Module generates the column signature for a profile task, subsequent profile tasks reuse the signature.
    - 3600 is the amount of space required to store the signatures for one column.

Foreign Keys
    Use the following formula to compute the disk space for foreign keys:

    Number of Inferred Foreign Keys * 2 * (Average Number of Columns in the Primary or Foreign Key) * 32 + Number of Foreign Keys * (32 + (2 Bytes per Character * Average Number of Characters in the Columns) * Average Number of Key Columns * Average Number of Rows that Violate the Foreign Key Either in the Parent Table or Child Table)

    where
    - 2 is the multiplier to get the total number of columns for the foreign key.
    - 32 is the number of bytes to store one column in the key.
    - 2 Bytes per Character is the typical number of bytes for a single Unicode character.

Overlap Discovery
    Use the following formula to compute the disk space for overlap discovery:

    Number of Inferred Overlap Pairs * 2 * 32

    where
    - 2 is the number of columns in the pair.
    - 32 is the number of bytes required to store one column in the overlap pair.
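The signature, foreign key, and overlap formulas in this section can be sketched together in Python. All function names are hypothetical. As with the key formula, the sketch assumes that both terms of the foreign key formula use the same foreign key count and that the unclosed parenthesis in the printed formula ends after the violating row count. The unit of the 3600 figure is not stated in the article and is assumed to be bytes.

```python
def signature_bytes(columns_in_schema):
    """Space for column signatures: columns in schema * 3600.

    ASSUMPTION: the 3600 figure is in bytes.
    """
    return columns_in_schema * 3600


def foreign_key_bytes(inferred_fks, avg_key_columns, avg_column_chars,
                      avg_violating_rows):
    """Disk space in bytes for inferred foreign keys.

    First term:  FKs * 2 * average key columns * 32
    Second term: FKs * (32 + (2 bytes per character * average characters)
                             * average key columns
                             * average violating rows in parent or child)

    ASSUMPTION: both terms use the same foreign key count.
    """
    key_term = inferred_fks * 2 * avg_key_columns * 32
    violation_term = inferred_fks * (
        32 + (2 * avg_column_chars) * avg_key_columns * avg_violating_rows
    )
    return key_term + violation_term


def overlap_bytes(inferred_overlap_pairs):
    """Disk space in bytes for overlap discovery: pairs * 2 columns * 32."""
    return inferred_overlap_pairs * 2 * 32


if __name__ == "__main__":
    print(signature_bytes(200))               # 720000
    print(foreign_key_bytes(5, 1, 10, 200))   # 20480
    print(overlap_bytes(30))                  # 1920
```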

Resource Guidelines for Profiles with Data Quality Transformations

The memory and disk overhead are critical when you run profiles with Data Quality transformations. When you determine your resource needs, consider the number of concurrent mappings submitted to the server, the types of transformation used in each mapping, and the size of the source data sets.

Mapping Memory and Disk Size Guidelines for Standard Transformations

The standard transformations, in the performance context, are Comparison, Decision, Weighted Average, and Merge. The memory or disk usage of these transformations does not vary with the size of the data processed.

These components process data rows in small batches and send them to the next component in the mapping immediately. The standard transformations do not incur additional costs in memory or disk usage beyond the standard running size.

Mapping Memory and Disk Size Guidelines for Reference Data Transformations

Reference data transformations such as Case Converter, Labeler, Parser, and Standardizer process data immediately, but they have initialization costs that increase memory use according to their configuration.

The reference table data is managed in the database. At run time, the data is held in memory for performance reasons. To optimize data throughput, this in-memory storage is designed for speed rather than space efficiency. Each transformation has its own copy of the in-memory reference data.

To estimate the in-memory storage, multiply the number of bytes in each column of the reference table by the number of rows in the reference table. Then multiply the total by 1.3. For example, following is the in-memory requirement for a reference table with 10000 rows, 6 columns, and an average byte count of 25:

10000 * 6 * 25 * 1.3

The total value equals approximately 2 MB.

Data Quality uses reference tables to enable operations such as standardization, labeling, and parsing. Each reference data set is carried in a table and has a size in the database equivalent to its disk size. Use the following formulas to calculate reference data table size:

number of data rows * number of columns * number of characters per column

Note: This formula applies if all columns have the same average data size.

number of data rows * (characters in column 1 + characters in column 2 + characters in column n)

Note: This formula applies when table columns have different sizes.
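The reference data estimates above can be sketched in Python. The function names are hypothetical. The table size functions return a character count, as the formulas are stated in characters; converting that count to bytes depends on the database encoding and is not covered here.

```python
IN_MEMORY_FACTOR = 1.3  # multiplier from the in-memory storage guideline


def reference_in_memory_bytes(rows, columns, avg_bytes_per_column):
    """In-memory size of one copy of a reference table, in bytes.

    Each transformation holds its own copy, so multiply the result by the
    number of transformations that use the table.
    """
    return rows * columns * avg_bytes_per_column * IN_MEMORY_FACTOR


def reference_table_chars_uniform(rows, columns, chars_per_column):
    """Table size in characters when all columns have the same average size."""
    return rows * columns * chars_per_column


def reference_table_chars_varied(rows, chars_by_column):
    """Table size in characters when columns have different sizes.

    chars_by_column lists the average character count of each column.
    """
    return rows * sum(chars_by_column)


if __name__ == "__main__":
    # The worked example from the text: 10000 rows, 6 columns, 25 bytes each.
    print(reference_in_memory_bytes(10_000, 6, 25))      # about 1.95 million bytes
    print(reference_table_chars_uniform(10_000, 6, 25))  # 1500000
    print(reference_table_chars_varied(10_000, [10, 20, 30]))  # 600000
```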

Author

Rajesh Sivanarayanan
Lead Technical Writer

Acknowledgements

The author would like to acknowledge Jeff Millman and Venkatakrishnan Swaminathan for their contributions to this article.
