Confidential & Proprietary
A Practical Introduction to Ab Initio Software
Course Structure
Part 1: Basic Concepts and DML
Part 2: Building Applications
Part 3: Partitioning, Layouts, Checkpoints
Database Connectivity
Intermediate Exercises & Finger Exercises
Day 1
Day 2
Part 4: Lookups, Partitioners, Variables
Testing/Validation
Confidential and Proprietary!
All software and training material is:
Copyright © 1994-2001
Software Corporation
Presentations, on-line files, and printed matter are covered by nondisclosure agreement(s).
Course material and documentation are not to be circulated to organizations or individuals not under nondisclosure.
A Practical Introduction to Ab Initio Software
Part 1: Basic Concepts and DML
V14
Outline
Ab Initio Overview
First Principles
Parallel Computer Architecture
Sample Applications using Ab Initio Software
Ab Initio Product Architecture
The Graph Model
DML to Describe Data (Data Formats)
DML to Transform Data
What Does “Ab Initio” Mean?
• Ab Initio is Latin for “From the Beginning.”
• From the beginning our software was designed to support the largest, most complex business applications. Crucial capabilities like parallelism and checkpointing can’t be added after the fact.
• The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.
Ab Initio’s focus
“Big Data” problems: high volume, high complexity
High performance, scalable solutions
High productivity development
Ab Initio Software
• Ab Initio software is a general-purpose data processing platform for enterprise-class, mission-critical applications such as:
  • Data warehousing
  • Batch processing
  • Click-stream analysis
  • Data movement
  • Data transformation
Parallel Computer Architecture
• Computers come in many “shapes and sizes”:
  • Single-CPU
  • Multi-CPU
  • Network of single-CPU nodes
  • Network of multi-CPU nodes
• Multi-CPU machines are often called SMPs (for Symmetric Multi Processors).
• Specially-built networks of machines are often called MPPs (for Massively Parallel Processors).
A Single-CPU Computer
Processor
Disk
Memory
Bus
A Multi-CPU Computer (SMP)
A Network of Single-CPU Nodes
Network
If all of these comprise one computer, it may be an MPP
A Network of Multi-CPU Nodes
A Network of Networks
Ab Initio Provides For:
• Distribution - a platform for applications to run on collections of CPUs.
• Complexity - the ability for applications to run in parallel on any combination of single-CPU computers, multi-CPU computers, and networks of computers.
Applications of Ab Initio Software
• “Big Data” processing.
• Parallel execution of existing applications.
• Parallel sort/merge processing.
• Data transformation.
• Rehosting of corporate data.
Applications of Ab Initio Software
• Front end of Data Warehouse:
  • Transformation of disparate sources
  • Aggregation and other preprocessing
  • Referential integrity checking
  • Database loading
• Back end of Data Warehouse:
  • Extraction for external processing
  • Aggregation and loading of Data Marts
Ab Initio Product Architecture
Layers, from top to bottom:
User Applications
Development Environments: GDE, Shell, C++
Component Suite (Partitioners, Transforms, ...), 3rd Party Components, User Components
The Co>Operating System
Native Operating Systems (Unix, Windows, OS/390)
Co>Operating System Runs on:
• Sun Solaris 2.6, 7, and 8 (SPARC)
• IBM AIX 4.2 and 4.3
• Hewlett-Packard HP-UX 10.20, 11.00, and 11.11
• Siemens Pyramid Reliant UNIX Release 5.43
• IBM DYNIX/ptx 4.4.6, 4.4.8, 4.5.1, and 4.5.2
• Silicon Graphics IRIX 6.5
• Red Hat Linux 6.2 and 7.0 (x86)
• Windows NT 4.0 (x86) with SP 4, 5 or 6
• Windows NT 2000 (x86) with no service pack or SP1
• Digital UNIX V4.0D (Rev. 878) and 4.0E (Rev. 1091)
• Compaq Tru64 UNIX Versions 4.0F (Rev 1229) and 5.1 (Rev 732)
• IBM OS/390 Version 2.8, 2.9, and 2.10
• NCR MP-RAS 3.02
Connectivity to Other Software
• Common, high performance database interface:
  • IBM DB2, DB2/PE, UDB
  • Oracle
  • Informix XPS
  • Sybase
  • Teradata
  • MS SQL Server 7
• Other software packages:
  • SAS
  • Trillium
  • Postalsoft
  • ...
Co>Operating System Services
• Parallel and distributed application execution
  • Control
  • Data Transport
• Transactional semantics on the application level.
• Checkpointing.
• Monitoring and debugging.
• Parallel file management.
• Metadata-driven components.
The Graph Model
The Graph Model: Naming the Pieces
Datasets
Components
Flows
The Graph Model: Some Details
Ports
Record format metadata
Expression metadata
Components
• A component is a program.
• Components may run on any computer running the Co>Operating System.
• Different components do different jobs.
• The particular work a component accomplishes depends on its parameter settings.
• Some parameters are computational metadata.
Datasets
• A dataset is a source or destination of data. It can be a file, a database table, a SAS dataset, ...
• Datasets may reside on any machine running the Co>Operating System.
• Datasets may reside on other machines if connected by FTP or database middleware.
• Data is described by record format metadata.
Dataset: Records and Fields
0345John  Smith
0212Sam   Spade
0322Elvis Jones
0492Sue   West
0121Mary  Forth
0221Bill  Black
Figure labels: Dataset (the whole table), Records (the rows), Fields (the columns).
A dataset is made up of records; a record consists of fields.
Analogous database terms are rows and columns; analogous SAS terms are observations and variables.
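To make the record/field split concrete, here is a minimal Python sketch (not Ab Initio code). It assumes the layout suggested by the sample data: a 4-character id followed by two 6-character, space-padded name fields. The helper names are hypothetical.

```python
# Assumed layout: 4-character id, then two 6-character space-padded names.
FIELDS = [("id", 4), ("first_name", 6), ("last_name", 6)]
RECORD_LENGTH = sum(width for _, width in FIELDS)


def split_records(data):
    """Cut a flat string into fixed-length records."""
    return [data[i:i + RECORD_LENGTH] for i in range(0, len(data), RECORD_LENGTH)]


def split_fields(record):
    """Cut one record into named fields, keeping the padding."""
    fields = {}
    offset = 0
    for name, width in FIELDS:
        fields[name] = record[offset:offset + width]
        offset += width
    return fields


dataset = "0345John  Smith 0212Sam   Spade 0322Elvis Jones "
records = [split_fields(r) for r in split_records(dataset)]
for rec in records:
    print(rec)
```

The field boundaries are not stored in the data itself; they come entirely from the description in FIELDS, which plays the role that record format metadata plays for a dataset.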
Sources of Record Format Metadata
• Record formats can be generated manually (hand coding / typing) or automatically from:
  • Database catalogs
  • COBOL copybooks
  • SAS datasets
  • Other third-party products
Record Format Metadata in GDE
Editing Types in GDE
Field name Field type Field length
The Record Format in Text
record
  decimal(4) id;
  string(6) first_name;
  string(6) last_name;
  string(5) newfield;
end
Field Names
Names consist of letters, digits, and underscores: a … z, A … Z, 0 … 9, _
Note: No spaces, hyphens, $’s, #’s, %’s, or other symbols that may be acceptable to an RDBMS.
Case matters! ABC and abc are different!
Some words are reserved (record, end, date, …).
Field Type and Field Length
There are several built-in types available via the drop-down menu. This course uses three types: string, decimal (for all numbers), and date.
A date requires a format specifier that is an exact representation of the date (e.g., “MM-DD-YY”).
A field length is either a number for fixed-length fields, or the delimiter that terminates the field for variable-length fields.
What Data Can Be Described?
• There are both fixed-size and variable-length types.
• ASCII, EBCDIC, UNICODE character sets are supported.
• Supported types can represent strings, numbers, binary numbers, packed decimals, dates …
• Complex data formats can consist of vectors, nested records, ...
Access to Field Characteristics
• Some aspects of field descriptions (e.g., date formats) must be accessed via the attribute pane.
• To see additional attributes, use the ‘Attributes’ item on the Record Format Editor’s View Menu or use the Attributes button.
More Record Format Editing
View… Attributes.
Field Type drop-down
Length can be delimiter string
Date format goes here
Text Record Format for Date Field
record
  decimal(4) id;
  string(6) first_name;
  string(6) last_name;
  date("YYYY-DD-MM") newfield;
end;
Viewing Data (figure-01)
1. Right click on dataset.
2. Select “View Data...”
Expressions in DML
• Computations are expressed in the algebraic syntax of C, Pascal, etc.
• Field names act as variables.
• Arithmetic operators: +, -, *, ...
• Comparison operators: >, <, ==, !=, ...
• Many built-in functions: string_concat, string_trim, today, date_day_of_week, …
• (See Chapter 4 of the Data Manipulation Language Reference for more information on expressions and built-in functions.)
Evaluating Expressions from View Data
Type in an expression...
…or use the expression editor
Expression Editor
Expression text
Fields Functions Operators
Exercise 1: Writing DML
• Open examples\intro-course\ex1.
• The data file ex1.dat contains these lines:
  Smith,John,1992.02.23,2400
  Jones,Jane,1993.10.29,320
  Warren,Jake,1994.11.02,9045
• Use the Record Format Editor (New) to create a description of this data with the fields lastname, firstname, pur_date, and amt. Then use View Data to verify that the description is correct.
• (Hint: Newline delimiters are written: "\n".)
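As a rough illustration of delimiter-terminated fields (in Python, not DML), the sketch below reads each field up to its own delimiter. The field names come from the exercise; the delimiter table and the function name are assumptions made for this example only.

```python
# Each field runs up to its own delimiter; the last field ends at a newline.
# The delimiter choices are an assumption based on the sample lines.
FIELDS = [("lastname", ","), ("firstname", ","), ("pur_date", ","), ("amt", "\n")]


def read_records(data):
    """Split delimited text into a list of field dictionaries."""
    records = []
    pos = 0
    while pos < len(data):
        record = {}
        for name, delimiter in FIELDS:
            end = data.index(delimiter, pos)  # ValueError if a delimiter is missing
            record[name] = data[pos:end]
            pos = end + len(delimiter)
        records.append(record)
    return records


data = (
    "Smith,John,1992.02.23,2400\n"
    "Jones,Jane,1993.10.29,320\n"
    "Warren,Jake,1994.11.02,9045\n"
)
rows = read_records(data)
for row in rows:
    print(row)
```

Unlike fixed-length fields, the width of each value can vary from record to record (2400 versus 320), because the delimiter rather than a count marks where the field ends.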
Simple Components
• In these components the record format metadata does not change from input to output.
The Filter by Expression Component
• Reads records from the input port and evaluates the select_expr parameter for each. If the expression is true (non-zero), the record is written to the out port.
• Optionally, if the expression is false (zero), the record is written to the deselect port.
  • One port must be connected downstream.
  • Both flows can be used.
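The routing rule described above can be sketched in a few lines of Python. This is only an analogy for the component's behavior, not its implementation; the function name and the sample records are hypothetical.

```python
def filter_by_expression(records, select_expr):
    """Route each record to 'out' or 'deselect' based on select_expr."""
    out, deselect = [], []
    for record in records:
        # A true (non-zero) result selects the record.
        if select_expr(record):
            out.append(record)
        else:
            deselect.append(record)
    return out, deselect


records = [
    {"id": 345, "last_name": "Smith"},
    {"id": 212, "last_name": "Spade"},
    {"id": 121, "last_name": "Forth"},
]
out, deselect = filter_by_expression(records, lambda r: r["id"] > 300)
print([r["id"] for r in out], [r["id"] for r in deselect])
```

Every input record lands on exactly one of the two lists, which mirrors the out and deselect ports.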
Filter Data (Selection) (figure-02)
1. Push “Run” button.
2. View monitoring information.
3. View output data.
Expression Parameter
Exercise 2: Data Filtering (Selection)
• Using example graph figure-02.mp, change the select expression parameter of the Filter by Expression component to select records with id greater than 215.
• Run the application and examine the resulting data.
Keys
• A key identifies a field or set of fields used to organize a dataset in some way.
• Single field: id
• Multiple field: { last_name; first_name }
• Modifiers: { id descending }
• Used for sorting, grouping, partitioning.
• (See Chapter 8 of the Data Manipulation Language Reference for more information on keys. Note: keys are also called collators.)
The Sort Component
• Reads records from input port, sorts them by key, and writes result on output port.
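The key forms from the Keys slide ({ last_name; first_name } and { id descending }) have close analogues in Python's sorted(). The sketch below is an analogy only; the record with id 100 is a hypothetical addition so that the second key field actually breaks a tie.

```python
from operator import itemgetter

records = [
    {"id": 345, "first_name": "John", "last_name": "Smith"},
    {"id": 212, "first_name": "Sam", "last_name": "Spade"},
    {"id": 322, "first_name": "Elvis", "last_name": "Jones"},
    {"id": 492, "first_name": "Sue", "last_name": "West"},
    {"id": 100, "first_name": "Adam", "last_name": "Smith"},  # hypothetical record
]

# Analogue of the key { last_name; first_name }
by_name = sorted(records, key=itemgetter("last_name", "first_name"))

# Analogue of the key { id descending }
by_id_desc = sorted(records, key=itemgetter("id"), reverse=True)

print([r["id"] for r in by_name])
print([r["id"] for r in by_id_desc])
```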
Sorting (figure-03)
Exercise 3: Sorting
• Using example graph figure-03.mp, change the key parameter of the Sort component to sort the data by first_name.
• Run the application and examine the resulting data.
More Complex Components
• In these components the record format metadata typically changes (goes through a transformation) from input to output.
Data Transformation
Input record: 0345,090263John,Smith;
Output record: 1000345Smith   1963.09.02
Operations shown in the figure: Drop, id+1000000, Reformat, Reorder
Input record format:
record
  decimal(",") id;
  date("MMDDYY") bday;
  string(",") first_name;
  string(";") last_name;
end
Output record format:
record
  decimal(7) id;
  string(8) last_name;
  date("YYYY.MM.DD") bday;
end
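The transformation on this slide can be imitated in Python to show what each output field is derived from. This is a sketch under stated assumptions, not DML: the function names are hypothetical, and the century used to expand the two-digit year is an explicit assumption.

```python
def parse_input(line):
    """Split one input record according to the input record format."""
    id_text, rest = line.split(",", 1)        # decimal(",") id
    bday, rest = rest[:6], rest[6:]           # date("MMDDYY") bday, 6 characters
    first_name, rest = rest.split(",", 1)     # string(",") first_name
    last_name = rest.split(";", 1)[0]         # string(";") last_name
    return {"id": int(id_text), "bday": bday,
            "first_name": first_name, "last_name": last_name}


def reformat(rec, century=1900):
    """Build the output record; 'century' is an assumption for 2-digit years."""
    month, day, year = rec["bday"][0:2], rec["bday"][2:4], int(rec["bday"][4:6])
    new_id = rec["id"] + 1000000              # id+1000000, written as decimal(7)
    last_name = rec["last_name"].ljust(8)     # string(8), space padded
    # first_name is dropped; bday moves to the end in YYYY.MM.DD form.
    return f"{new_id:07d}{last_name}{century + year:04d}.{month}.{day}"


result = reformat(parse_input("0345,090263John,Smith;"))
print(repr(result))
```

All four operations from the figure appear here: a field is dropped, the id is recomputed, the date is reformatted, and the remaining fields are reordered.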
The Reformat Component
• Reads records from input port, reformats them according to a transform function, and writes the result records to output (out0) port.
• Additional output ports (out1, ...) can be created by adjusting the count parameter.
Transform Function
• A transform function specifies the rules used to create the output record.
• Each field of the output record must be assigned a value. Partial output records are not allowed!
• The transform editor is used to create a transform function in a graphical manner.
Transform Editor
Text DML: Transform Function Syntax
• Functions look like:
  output-variables :: name ( input-variables ) =
  begin
    assignments
  end;
• Assignments look like:
  output-variable.field :: expression ;
(See Chapter 6 of the Data Manipulation Language Reference for more information on transform functions.)
A Look Inside the Reformat Component
out :: trans(in) =
begin
  out.x :: in.b - 1;
  out.y :: in.a;
  out.z :: fn(in.c);
end;
Input record (fields b, c, a): 45 QF9
Result record (fields x, z, y): 44 RG9
1. Record arrives at input port
2. Record is read into component
3. Transform function is evaluated
4. Transform function yields a result record
5. Result record is written to output port
Exercise 4: Reformat Data
• Using graph figure-04.mp, write a record format with an id from the simple dataset and a single name field of 20 characters.
• Write a transform function to produce a dataset in this format passing through the id and concatenating first_name and last_name using string_concat.
• Run the graph and examine the results.
• Modify the transform to trim the spaces from the first name before concatenating it with the last name, to get “John Smith” rather than “John  Smith” (first_name is padded to its full field width).
Data Aggregation
Input (id, name, city, amount):
0345  Smith  Bristol   56
0212  Spade  London     8
0322  Jones  Compton   12
0492  West   London    23
0121  Forth  Bristol    7
0221  Black  New York  42
Output (city, total amount):
Bristol   63
Compton   12
London    31
New York  42
Data Aggregation of Sorted/Grouped Input
Input grouped by city (id, name, city, amount):
0345  Smith  Bristol   56
0121  Forth  Bristol    7
0322  Jones  Compton   12
0212  Spade  London     8
0492  West   London    23
0221  Black  New York  42
Output (city, total amount):
Bristol   63
Compton   12
London    31
New York  42
The Rollup Component
• By default, Rollup reads sorted records from the input port, aggregates them as indicated by key and transform parameters, and writes the resulting aggregated records on the out port.
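Aggregating input that is already grouped by key can be sketched with Python's itertools.groupby, which, like the default behavior described above, relies on records with the same key being adjacent. This is an analogy rather than the component's implementation; the function name is hypothetical and the data is the city-grouped sample from the aggregation slides.

```python
from itertools import groupby
from operator import itemgetter

# (id, name, city, amount), already grouped by city
visits = [
    (345, "Smith", "Bristol", 56),
    (121, "Forth", "Bristol", 7),
    (322, "Jones", "Compton", 12),
    (212, "Spade", "London", 8),
    (492, "West", "London", 23),
    (221, "Black", "New York", 42),
]


def rollup_sorted(records, key, amount):
    """Sum 'amount' over each run of adjacent records sharing 'key'."""
    return [(k, sum(amount(r) for r in group))
            for k, group in groupby(records, key=key)]


totals = rollup_sorted(visits, key=itemgetter(2), amount=itemgetter(3))
print(totals)
```

Because only one group is held at a time, this style needs very little memory, but it gives wrong answers if the input is not grouped by the key.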
Built-in Functions for Rollup
• The following aggregation functions are predefined and are only available in the rollup component:
  avg, count, first, last, max, min, product, sum
Rollup Wizard
Note the use of an aggregation function in the expression.
Exercise 6: Rollup Data
• Using example graph figure-05.mp, modify the transform function to count the number of records for the same city.
• Run the application and examine the results.
Joining Data
in0 (id, name, city, amount):
0345  Smith  Bristol   56
0212  Spade  London     8
0322  Jones  Compton   12
0492  West   London    23
0121  Forth  Bristol    7
0221  Black  New York  42
in1 (id, dt, cost):
0322  970402   1242.50
0345  970924    923.75
0121  961211  12392.00
0492  971123    234.12
0666  950616   2312.10
out (id, city, amount, dt):
0345  Bristol   56  1997/09/24
0212  London     8  1900/01/01
0322  Compton   12  1997/04/02
0492  London    23  1997/11/23
0121  Bristol    7  1996/12/11
0221  New York  42  1900/01/01
Joining Sorted Data
in0 sorted by id (id, name, city, amount):
0121  Forth  Bristol    7
0212  Spade  London     8
0221  Black  New York  42
0322  Jones  Compton   12
0345  Smith  Bristol   56
0492  West   London    23
in1 sorted by id (id, dt, cost):
0121  961211  12392.00
0322  970402   1242.50
0345  970924    923.75
0492  971123    234.12
0666  950616   2312.10
out (id, city, amount, dt):
0121  Bristol    7  1996/12/11
0212  London     8  1900/01/01
...
Building the Output Record
out:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end
in0:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end
in1:
record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end
What if in1 record is missing?
• The record formats are those of the previous slide (out, in0, in1).
• When no in1 record matches an in0 record, in1.dt is missing: what value should out.dt receive?
Prioritized Assignment
• In DML, a missing value (say, if there is no in1 record) causes an assignment to fail.
• If an assignment for a left hand side fails, the next priority assignment is tried. There must be one successful assignment to each output field.
out.dt :1: in1.dt;
out.dt :2: "1900/01/01";
Each assignment reads: destination, priority, source.
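The "try the next priority when an assignment fails" rule can be modeled in Python as a list of candidate sources tried in order. This is a sketch of the idea only: a missing in1 record is modeled as None, and the names assign, out_dt, and AssignmentFailed are hypothetical.

```python
class AssignmentFailed(Exception):
    """Raised when no prioritized source yields a value."""


def assign(*sources):
    """Try each source in priority order; the first that succeeds wins."""
    for source in sources:
        try:
            return source()
        except (KeyError, TypeError):  # missing field or missing record
            continue
    raise AssignmentFailed("no source succeeded")


def out_dt(in1):
    # Priority 1: in1.dt; priority 2: the default date.
    return assign(lambda: in1["dt"], lambda: "1900/01/01")


print(out_dt({"dt": "1997/09/24"}))  # an in1 record is present
print(out_dt(None))                  # no in1 record
```

If every source fails, the sketch raises an error, which corresponds to the rule that each output field needs one successful assignment.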
Assigning Priority to Business Rules
Resulting Display
The Join Component
• Join performs a join of inputs. By default, the inputs to join must be sorted and an inner join is computed.
• Note: The following slides and the on-line example assume the join-type parameter is set to ‘Outer’, and thus compute an outer join.
Joining (figure-06)
A Look Inside the Join Component*
*join-type = Full Outer join
out :: join(in0, in1) =
begin
  out.a :: in0.a;
  out.x :1: in1.r + 20;
  out.x :2: in0.b + 10;
  out.q :1: in1.q;
  out.q :2: "XX";
end;
Inputs are aligned by key a. Field order in the figure: in0 is (b, c, a), in1 is (q, r, a), out is (x, a, q).
First pair of records: in0 = 234 42 G, in1 = NY 4 G
1. Records arrive at inputs
2. Records read into component
3. Keys compared
4. Aligned records passed to function
5. Transform evaluated
6. Result record generated: 24 G NY
7. Result record written
Second pair of records: in0 = 79 23 H, in1 = IL 8 K
8. Records arrive at input
9. Records read into component
10. Keys compared
11. Aligned records passed to function (only the in0 record; the keys differ, so IL 8 K stays behind)
12. Transform evaluated
13. Result record generated: 89 H XX
14. Result record written
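The walk-through above can be approximated in Python as a merge of two key-sorted inputs. This is a simplified sketch, not the Join component: it assumes unique keys on each input, models a missing record as None, and treats a record for which out.a cannot be assigned as rejected. The function names are hypothetical.

```python
def join_transform(in0, in1):
    """Python rendering of the slide's transform; None means 'rejected'."""
    if in0 is None:
        return None                      # out.a :: in0.a has no fallback
    x = in1["r"] + 20 if in1 is not None else in0["b"] + 10
    q = in1["q"] if in1 is not None else "XX"
    return {"x": x, "a": in0["a"], "q": q}


def outer_join_sorted(in0, in1, key, transform):
    """Merge two lists sorted by 'key'; unmatched records pair with None."""
    results, rejects = [], []
    i = j = 0
    while i < len(in0) or j < len(in1):
        if j >= len(in1) or (i < len(in0) and key(in0[i]) < key(in1[j])):
            pair = (in0[i], None)        # in0 record has no partner
            i += 1
        elif i >= len(in0) or key(in1[j]) < key(in0[i]):
            pair = (None, in1[j])        # in1 record has no partner
            j += 1
        else:
            pair = (in0[i], in1[j])      # keys match
            i += 1
            j += 1
        out = transform(*pair)
        if out is None:
            rejects.append(pair)
        else:
            results.append(out)
    return results, rejects


in0 = [{"b": 234, "c": 42, "a": "G"}, {"b": 79, "c": 23, "a": "H"}]
in1 = [{"q": "NY", "r": 4, "a": "G"}, {"q": "IL", "r": 8, "a": "K"}]
results, rejects = outer_join_sorted(in0, in1, key=lambda r: r["a"],
                                     transform=join_transform)
print(results)
print(rejects)
```

The first two results match the records generated in steps 6 and 13. In this sketch the leftover in1 record with key K ends up in rejects, because nothing supplies out.a when in0 is missing.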
Exercise 7: Join Data
• Using example graph figure-06.mp, modify the transform function to join visits.dat and last-visits.dat so that no records are rejected.
• Run the application, and examine the results. The Unmatched Last Visits dataset should be empty.
Exercise 8 (if time): Join Retaining All Fields
• Building upon the graph you created in Exercise 7, create a new output record format and transform function to join visits.dat and last-visits.dat according to the following rules:
  • Retain all fields from each dataset.
  • Supply defaults where necessary.
• Change the necessary parameters, run the application, and examine the results.
Mouse and Key Shortcuts
Action | On What? | Does This
Shift-<double click> | Components | Open Editor
<double click> | A parameter in Parameter Tab | Open Editor
<double click> | Port | Open Record Format Editor
Drag input field to blank space in output field pane | Transform editor | Adds field to output record format
A Practical Introduction to Ab Initio Software
Part 2: Building Applications
Outline
• Constructing Applications
• Parallelism
• Data Partitioning
• Multifiles
Steps in Building an Application
• Add datasets.
• Add components.
• Add flows.
• Modify as needed.
• Configure datasets and components along the way; let the yellow “To Do” cues guide you.
• Generally, you should configure your input and output metadata (record formats) before adding flows.
Adding an Input Dataset
1. Click on Component Button
2. Open Datasets Category
3. Choose InputFile
Configuring the Input Dataset
1. Browse to find simple.dat
2. Browse to find simple.dml
3. Change label to something descriptive
Adding a Filter by Expression Component
1. Open Transform Category
2. Choose Filter by Expression
Adding an Output Dataset
Choose OutputFile
Configuring the Output Dataset
1. Browse to see directory
2. Enter name of output file
Adding Flows
1. Click on source (hold)
2. Drag to destination (release)
Configuring Filter by Expression
Enter expression
Flows Can Propagate Configuration
• One way to “Get rid of yellow” is to configure datasets or components.
• Hooking up flows allows the GDE to automatically propagate many kinds of information, like record format metadata; sometimes, connecting things is all you need to do to “Get rid of yellow.”
Tip: Let Propagation Do the Work!
• Define record formats for input datasets.
• Define record formats for output datasets only when they differ from input datasets; let propagation do as much as possible.
• If record formats change, this minimizes the impact on the graph.
• Sometimes you will need to set record formats on components. In such cases, usually you should set the format on the output port.
Tip: Look Before Deleting Components!
• Before deleting a component in a graph, look to see whether the component defines record formats for any of its ports. If you delete a component with record format definitions, you may lose the definitions.
• To safely delete such a component: For each port with a record format definition, go to the other end of the flow for that port (which will be some other component or dataset) and uncheck the ‘propagate from neighbor’ box for the associated port.
Running the Application
1. Push “Run” button.
2. View monitoring information.
3. View output data.
Diagnostic Ports: Reject, Error
• Reject: Input records that caused errors.
• Error: Error messages.
Instrumentation Parameters: Reject-threshold
• A drop-down menu specifying the number of errors to tolerate. The choice “Use limit/ramp” allows for other possibilities.
Diagnostic Port: Log
• Log: Logging records.
Instrumentation Parameters: Log
• Syntax: event OR event/n (where n is a power of 10)
• Logs records of type event. If n is specified, only 1 of every n records is logged. Valid events are:
• input, output, reject, intermediate
Logging Record Format
• Logging flows have predefined metadata.
• The record format is:
record
  string("|") node;
  string("|") timestamp;
  string("|") component;
  string("|") subcomponent;
  string("|") event_type;
  string("|\n") event_text;
end
Component: Gather Logs
• Reads logging records from multiple flows connected to the input port and writes them to the specified file outside of the application’s transactional context. The start-text and end-text parameter values are written to the log at the beginning and end.
Component: Replicate
• Copies records from input port to multiple flows connected to output port.
Sample Graph
Exercise 9: Creating a Reformatting Application
• Create a new graph that:
Reads data from simple.dat with record format simple.dml.
Reformats that data with simple-out.xfr.
Writes the results to simple-out.dat with record format simple-out.dml.
• Run it and verify the results.
Exercise 10: Obtaining Log Information
• Add a Gather Logs component to the application.
• Configure the component. Don’t forget to provide a log file name.
• Connect it to the Reformat’s log port.
• Run the application.
• View the log file on the server.
Exercise 11: Creating an Aggregation Application
• Create an application that:
Reads data from visits.dat with record format visits.dml.
Sorts it by city.
Aggregates it (using Rollup component) by city with visits-to-city-rollup.xfr.
Writes the results to visits-to-city.dat with record format visits-to-city.dml.
Logs input,output,intermediate events.
Computing without Sort
Some components do not require pre-sorted inputs.
These components work by keeping some or all of the inputs in memory.
These components usually have a sorted-input parameter, or have the word hash in their name.
There are rules of thumb about when to use “in-memory” sorting or grouping vs sorting before the component.
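The trade-off can be seen in a small Python sketch of grouping without a prior sort. It keeps one running total per key in memory, so the input order does not matter, but memory use grows with the number of distinct keys. This is an illustration of the general idea, not of how any Ab Initio component is implemented; the function name is hypothetical.

```python
# (id, name, city, amount) in arbitrary order
visits = [
    (345, "Smith", "Bristol", 56),
    (212, "Spade", "London", 8),
    (322, "Jones", "Compton", 12),
    (492, "West", "London", 23),
    (121, "Forth", "Bristol", 7),
    (221, "Black", "New York", 42),
]


def rollup_in_memory(records, key, amount):
    """Aggregate unsorted input by keeping one total per key in a dict."""
    totals = {}
    for record in records:
        k = key(record)
        totals[k] = totals.get(k, 0) + amount(record)
    return totals


totals = rollup_in_memory(visits, key=lambda r: r[2], amount=lambda r: r[3])
print(totals)
```

With few distinct keys the dictionary stays small and the sort is avoided entirely; with very many distinct keys, sorting first and aggregating one group at a time may be the better choice.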
Exercise 12: Rollup without Sort
• Open figure-05.
• Save As... to figure-05-nosort.
• Delete the Sort component.
• Change the sorted-input parameter of the Rollup component to “in-memory…”
• Run the application and examine the results.
Exercise 13: Join without Sort
• Open figure-06.
• Save As... to figure-06-nosort.
• Delete both Sort components.
• Change the sorted-input parameter of the Join component to “in-memory…”
• Run the application and examine the results.
Forms of Parallelism
• Component parallelism
• Pipeline parallelism
• Data parallelism
Component Parallelism
Sorting Customers
Sorting Transactions
Component Parallelism
• Comes “for free” with graph programming.
• Limitation:
  • Scales to the number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100
Processing Record: 99
Pipeline Parallelism
• Comes “for free” with graph programming.
• Limitations:
  • Scales to the length of “branches” in a graph.
  • Some operations, like sorting, do not pipeline.
Data Parallelism
Partitions
Two Ways of Looking at Data Parallelism
Global View:
Expanded View:
Data Parallelism
• Scales with data.
• Requires data partitioning.
• Different partitioning methods for different operations.
Data Partitioning
Expanded View:
Global View:
Data Partitioning: The Global View
Fan-out Flow
Degree of Parallelism
Component: Partition by Round-robin
• Reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in round-robin fashion, with block-size records going to one partition before moving on to the next.
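A minimal Python sketch of the dealing rule: record number i goes to partition (i // block_size) % partitions. The function name, the letters used as records, and the default block size of 1 are assumptions for illustration only.

```python
def partition_round_robin(records, partitions, block_size=1):
    """Deal records to partitions in turn, block_size records at a time."""
    out = [[] for _ in range(partitions)]
    for index, record in enumerate(records):
        out[(index // block_size) % partitions].append(record)
    return out


records = list("ABCDEFGHIJ")  # hypothetical input records
one_at_a_time = partition_round_robin(records, partitions=3)
two_at_a_time = partition_round_robin(records, partitions=3, block_size=2)
print(one_at_a_time)
print(two_at_a_time)
```

The partitions end up nearly equal in size regardless of the record contents, since the assignment depends only on each record's position in the input.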
Round-robin Partitioning
(Figure, two frames: a stream of records is dealt out across Partition 0, Partition 1, and Partition 2.)
A Data Parallel Application: The Expanded View
Exercise 14: Data Parallel Reformatting (Expanded)
• Open figure-04.
• Save As... to figure-04-expanded.
• Create a copy of the Reformat and the Simple-Out dataset (use Edit...Copy and Edit…Paste).
• Change the path for the copy of Simple-Out.
• Add a Partition by Round-robin component before the Reformat components; hook them up with flows.
• Run the application and examine the results.
A Data Parallel Application: The Global View
Fan-out Flow
Multifile
Degree of Parallelism (Abstract)
What is a Multifile?
• A multifile is essentially the “global view” of a set of ordinary files, each of which may be located anywhere.
• Each partition of a multifile is an ordinary file.
• By using the global view and multifiles, you can avoid having to draw data parallelism explicitly.
• Ab Initio utilities let you manipulate (copy, rename, delete, etc.) multifiles as easily as ordinary files.
• Note that the icon for a multifile has 3 platters instead of 2.
Multifiles
• Multifiles reside in multidirectories.
• Multidirectories and multifiles are identified using URL syntax with “mfile” as the protocol part:
  • mfile:/users/training-07/test-mfs/
  • mfile:mfs2/transactions/
  • mfile://mktg-mpp/vol3/big-mfs/january/sales.dat
• These URLs are simply abbreviations for the many pieces making up a multidirectory or multifile.
(See Chapter 2 of the Co>Operating System Administrator’s Guide for more information on multifiles.)
A Multidirectory
mfile://host1/u/jo/mfs
A single name for three directories:
Control Partition: //host1/u/jo/mfs/
Data Partition: //host1/vol4/pA/
Data Partition: //host2/vol3/pB/
Data Partition: //host3/vol7/pC/
Confidential & Proprietary
A Multifile
mfile://host1/u/jo/mfs/a.dat
A single name for three files:
Control Partition: //host1/u/jo/mfs/a.dat
Data Partition: //host1/vol4/pA/a.dat
Data Partition: //host2/vol3/pB/a.dat
Data Partition: //host3/vol7/pC/a.dat
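To make the "single name for several files" idea concrete, here is a small Python model of the multidirectory from the slide. It is a hypothetical sketch: the `Multidirectory` class and its `partition_paths` method are invented for illustration and say nothing about how the Co>Operating System actually stores or resolves multifiles.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Multidirectory:
    """Toy model: one control directory plus one directory per data partition."""
    control: str
    data: List[str]

    def partition_paths(self, relative_name: str) -> List[str]:
        """Return the ordinary-file path of relative_name in each data partition."""
        return [d.rstrip("/") + "/" + relative_name for d in self.data]


# Directories taken from the slide; the class itself is an assumption.
mfs = Multidirectory(
    control="//host1/u/jo/mfs/",
    data=["//host1/vol4/pA/", "//host2/vol3/pB/", "//host3/vol7/pC/"],
)

if __name__ == "__main__":
    for path in mfs.partition_paths("a.dat"):
        print(path)
```

The length of `mfs.data` plays the role of the degree of parallelism: a three-way multidirectory yields three ordinary files per multifile.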
Confidential & Proprietary
Additional Multidirectories
mfile://host1/u/jo/mfs/dir1
Control Partition: //host1/u/jo/mfs/ (contains a.dat, dir1/)
Data Partition: //host1/vol4/pA/ (contains a.dat, dir1/)
Data Partition: //host2/vol3/pB/ (contains a.dat, dir1/)
Data Partition: //host3/vol7/pC/ (contains a.dat, dir1/)
Confidential & Proprietary
Additional Multidirectories
mfile://host1/u/jo/mfs/dir2
Control Partition: //host1/u/jo/mfs/ (contains a.dat, dir1/, dir2/)
Data Partition: //host1/vol4/pA/ (contains a.dat, dir1/, dir2/)
Data Partition: //host2/vol3/pB/ (contains a.dat, dir1/, dir2/)
Data Partition: //host3/vol7/pC/ (contains a.dat, dir1/, dir2/)
Confidential & Proprietary
A Multidirectory Hierarchy
mfile://host1/u/jo/mfs/dir2/b.dat
Control Partition: //host1/u/jo/mfs/
Data Partition: //host1/vol4/pA/
Data Partition: //host2/vol3/pB/
Data Partition: //host3/vol7/pC/
Each of the four directories contains a.dat, dir1/, and dir2/. The subdirectories hold the files b.dat (in dir2/) and x.dat.
Confidential & Proprietary
Adding a Multifile Dataset
1. Drill into multidirectory
2. Type in filename
Confidential & Proprietary
Exercise 15: Data Parallel Reformatting (Global)
• Open figure-04.
• Save As... to figure-04-global.
• Add a Partition by Round-robin component.
• Change the Simple-Out dataset to a multifile.
• Run the application and examine the results (use the “Partition” option in View Data).
Confidential & Proprietary
Data Aggregation in Parallel
First partition, input:
0345Smith Bristol 56
0322Jones Compton 12
0121Forth Bristol 7
First partition, output:
Bristol 63
Compton 12
Second partition, input:
0212Spade London 8
0492West London 23
0221Black New York 42
Second partition, output:
London 31
New York 42
Confidential & Proprietary
Data Aggregation of Grouped Input in Parallel
First partition, grouped input:
0345Smith Bristol 56
0121Forth Bristol 7
0322Jones Compton 12
First partition, output:
Bristol 63
Compton 12
Second partition, grouped input:
0212Spade London 8
0492West London 23
0221Black New York 42
Second partition, output:
London 31
New York 42
Confidential & Proprietary
Key-Dependent Data Parallelism
• Aggregation processes records in groups defined by key values.
• Parallel aggregation requires partitioning based on key value.
• Parallel aggregation takes three steps, using the same key in each step:
  • Partition by key.
  • Sort by key.
  • Aggregate by key.
Confidential & Proprietary
Component: Partition by Key
• Reads records from its input port and writes them to the flow partitions connected to its output port. A hash code computed using the key determines which partition a record will be written on, meaning that records with the same key value will go to the same partition.
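A minimal Python sketch of hash partitioning follows. The hash function here (`zlib.crc32` over the key's string form) is a stand-in chosen because it is deterministic across runs; the slide does not say which hash the component uses, and the function name `partition_by_key` is an assumption.

```python
import zlib
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def partition_by_key(records: Iterable[T], key: Callable[[T], object],
                     n_partitions: int) -> List[List[T]]:
    """Send each record to the partition selected by a hash of its key."""
    partitions: List[List[T]] = [[] for _ in range(n_partitions)]
    for record in records:
        # Stand-in hash: equal keys always produce the same hash code,
        # so they always select the same partition.
        code = zlib.crc32(str(key(record)).encode("utf-8"))
        partitions[code % n_partitions].append(record)
    return partitions


if __name__ == "__main__":
    flow = list("BCDFCDBGBDFD")
    for number, part in enumerate(partition_by_key(flow, key=lambda r: r, n_partitions=3)):
        print(number, part)
```

Which partition a given key lands in depends on the hash, but every record sharing that key lands there too, which is the property parallel aggregation relies on.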
Confidential & Proprietary
Partitioning by Key
[Figure, in two frames: records labeled A through G are routed by key to Partition 0, Partition 1, and Partition 2, so that all records with the same letter land in the same partition.]
Confidential & Proprietary
Partition by Key + Sort = Parallel Grouping
[Figure: after partitioning by key, the records in each of Partition 0, Partition 1, and Partition 2 are sorted, leaving records with the same key adjacent within a partition.]
Confidential & Proprietary
Common Mistakes
•Incorrect results if:
  •Keys for partition, sort, or aggregate differ.
  •Data is partitioned, but is never sorted.
•Computationally expensive if:
  •Data is sorted before it is partitioned.
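The "partitioned but never sorted" mistake is easy to reproduce in Python with an aggregator that, by assumption, expects grouped input. `itertools.groupby` only combines adjacent records with equal keys, so it stands in for such an aggregator here; the function name `aggregate_grouped` is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Record = Tuple[str, int]  # (city, amount)


def aggregate_grouped(records: List[Record]) -> List[Record]:
    """Sum amounts over runs of adjacent records that share a city."""
    return [(city, sum(amount for _, amount in group))
            for city, group in groupby(records, key=lambda r: r[0])]


# One partition of the sample data, in arrival order (not sorted by city).
partition = [("Bristol", 56), ("Compton", 12), ("Bristol", 7)]

if __name__ == "__main__":
    # Without a sort, Bristol is split into two groups: an incorrect result.
    print(aggregate_grouped(partition))
    # Sorting on the same key first yields one total per city.
    print(aggregate_grouped(sorted(partition, key=lambda r: r[0])))
```

Sorting before partitioning would not help: the partitioner scatters the sorted records again, so the sort has to be repeated in each partition anyway.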
Confidential & Proprietary
Exercise 16: Data Parallel Aggregation
•Start with figure-05.
•Save As... to figure-05-parallel.
•Add a Partition by Key component.
•Change the output file to a multifile.
•Run the application and examine the results.
Confidential & Proprietary
Part 3: Intermediate Topics
A Practical Introduction to Ab Initio Software
Confidential & Proprietary
Outline
•Departitioning
•Deadlock
•Repartitioning
•Layouts
•Phases and Checkpoints
•Anatomy of a Running Job
•Sample Applications
Confidential & Proprietary
Departitioning
Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning.
Each departition component combines flows in a different manner.
Confidential & Proprietary
Departitioning
[Figure: expanded view and global view of a graph in which Score 1, Score 2, and Score 3 feed a Departition component that writes to Output File.]
Confidential & Proprietary
Departitioning
• For the various departitioning components:
  • Key-based?
  • Result ordering?
  • Effect on parallelism?
  • Uses?
[Figure label: Fan-in Flow]
Confidential & Proprietary
Departitioning: Performance
[Figure: an input buffer and an output buffer, each divided into free space and used space.]
Confidential & Proprietary
Concatenation
Globally ordered, partitioned data:
Partition 0:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
Partition 1:
47Bill 02114 14
46Rick 02116 23
Partition 2:
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Sorted data, following concatenation:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Confidential & Proprietary
Concatenation: Performance
[Figure: Concatenate reads a single flow in its entirety. Labels: Blocked components; Running components.]
Confidential & Proprietary
Concatenation
• Not key-based.
• Result ordering is by partition.
• Serializes pipelined computation.
• Useful for:
  • creating serial flow from partitioned data
  • appending headers and trailers
  • writing DML
• Used infrequently.
Confidential & Proprietary
Merge
Round-robin partitioned and sorted by amount:
Partition 0:
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Partition 1:
49Jane 02241 2
43Mark 02114 9
46Rick 02116 23
Partition 2:
44Bob 02116 8
47Bill 02114 14
Sorted data, following merge on amount:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Confidential & Proprietary
Merge: Performance
[Figure: components running roughly in lock-step.]
If keys are evenly distributed: Merge reads the flows roughly evenly.
If keys are globally sorted or nearly globally sorted: Merge reads a single flow in its entirety, and the other components are blocked.
Confidential & Proprietary
Merge
• Key-based.
• Result ordering is sorted if each input is sorted.
• Possibly synchronizes pipelined computation; may even serialize.
• Useful for creating ordered data flows.
• Used more than Concatenate, but still infrequently.
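The merge behavior can be illustrated with Python's standard `heapq.merge`, which combines already-sorted inputs into one sorted output. This is a sketch of the idea using the sample customers (name and amount only), not the Merge component itself; `merge_flows` is a hypothetical helper name.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, int]  # (name, amount)


def merge_flows(flows: Sequence[List[Record]],
                key: Callable[[Record], int]) -> List[Record]:
    """Combine flows that are each sorted on key into one sorted flow."""
    return list(heapq.merge(*flows, key=key))


# The three partitions from the slide, each already sorted by amount.
p0 = [("John", 30), ("Mary", 38), ("Sue", 92)]
p1 = [("Jane", 2), ("Mark", 9), ("Rick", 23)]
p2 = [("Bob", 8), ("Bill", 14)]

if __name__ == "__main__":
    for record in merge_flows([p0, p1, p2], key=lambda r: r[1]):
        print(record)
```

At each step the merge needs the next record from every input before it can choose the smallest, which is why the upstream components tend to run in lock-step, and why a globally sorted input drains one flow at a time.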
Confidential & Proprietary
Interleave
Round-robin partitioned and scored:
Partition 0:
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
Partition 1:
43Mark 02114 9C
46Rick 02116 23B
49Jane 02241 2C
Partition 2:
44Bob 02116 8C
47Bill 02114 14B
Scored dataset in original order, following interleave:
42John 02116 30A
43Mark 02114 9C
44Bob 02116 8C
45Sue 02241 92A
46Rick 02116 23B
47Bill 02114 14B
48Mary 02116 38A
49Jane 02241 2C
Confidential & Proprietary
Interleave: Performance
[Figure: components running in lock-step; Interleave reads the flows in round-robin sequence.]
Confidential & Proprietary
Interleave
• Not key-based.
• Result ordering is inverse of round-robin.
• Synchronizes pipelined computation.
• Useful for restoring original order following a record-independent parallel computation partitioned by round-robin.
• Used in rare circumstances
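As a sketch of how interleaving undoes a round-robin partition with a block size of 1, the Python below takes one record from each flow in turn. The helper name `interleave` is an assumption, and record ids from the sample data stand in for whole records.

```python
from itertools import chain, zip_longest
from typing import List, Sequence, TypeVar

T = TypeVar("T")

_MISSING = object()  # marks the end of a flow that ran out early


def interleave(flows: Sequence[List[T]]) -> List[T]:
    """Read one record from each flow in turn until all flows are drained."""
    rounds = zip_longest(*flows, fillvalue=_MISSING)
    return [r for r in chain.from_iterable(rounds) if r is not _MISSING]


# Customer ids as they sit in the three partitions after round-robin.
p0 = [42, 45, 48]
p1 = [43, 46, 49]
p2 = [44, 47]

if __name__ == "__main__":
    print(interleave([p0, p1, p2]))
```

The original order comes back only if each partition preserved the order of its own records, which is why this suits record-independent computations.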
Confidential & Proprietary
Gather
Round-robin partitioned and scored:
Partition 0:
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
Partition 1:
43Mark 02114 9C
46Rick 02116 23B
49Jane 02241 2C
Partition 2:
44Bob 02116 8C
47Bill 02114 14B
Scored dataset in random order, following gather:
43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Confidential & Proprietary
Gather: Performance
Reading flows as data is available
Confidential & Proprietary
Gather
• Not key-based.
• Result ordering is unpredictable.
• Neither serializes nor synchronizes pipelined computation.
• Useful for efficient collection of data from multiple partitions and for repartitioning.
• Used most frequently.
Confidential & Proprietary
Summary of Departitioning Methods
Method       Key-based?  Ordering?               Uses
Merge        Yes         Sorted                  Creating ordered serial flow
Concatenate  No          Global                  Creating serial flow from partitioned data
Interleave   No          Inverse of round-robin  “Undoing” round-robin partitioning
Gather       No          Unpredictable           Unordered departitioning, repartitioning
Confidential & Proprietary
Deadlock
[Figure: a deadlocked graph, with some components blocking on read and others blocking on write.]
Confidential & Proprietary
Avoiding Deadlock
•Use Concatenate, Interleave, and Merge with care.
•Use flow buffering.
•Insert phase break before departition.
•Don’t serialize data unnecessarily; repartition instead of departition.
Confidential & Proprietary
Repartitioning
Use repartitioning to redistribute records across partitions.
Records are almost always redistributed in a key-based manner, but don’t have to be.
Records can be redistributed to fewer partitions, the same number of partitions, or more partitions.
Confidential & Proprietary
The “Wrong” Way
This serializes the computation.
Confidential & Proprietary
Repartitioning -- The Right Way
[Figure: the same graph shown in expanded view and in global view.]
Confidential & Proprietary
Repartitioning
Note: The departition component is almost always a Gather.
All-to-All Flow
Confidential & Proprietary
Key Repartition + Sort = Regroup
[Figure, in stages labeled Partition by Key, Gather, and Sort: keyed records in Partition 0, Partition 1, and Partition 2 are repartitioned by key, gathered into their new partitions, and then sorted, so that records with the same key end up grouped together.]
Confidential & Proprietary
Sort Does “Gathering”
Confidential & Proprietary
Which Components will Gather?
Many built-in components will gather. To find out if a specific component will gather:
• Select the component in the component organizer
• Either:
  – Look at the adjacent help
  – Look for “fan” next to Input Ports: in
  OR
  – Press the help button
  – Look for “fan-in” in the Ports section beside in
Confidential & Proprietary
Layout
•Layout determines the location of a resource.
•A layout is either serial or parallel.
•A serial layout specifies one node and one directory.
•A parallel layout specifies multiple nodes and multiple directories. It is permissible for the same node to be repeated.
Confidential & Proprietary
Layout
•The location of a Dataset is one or more places on one or more disks.
•The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown.
•Computing components propagate their layouts from neighbors, unless specifically given a layout by the user.
Confidential & Proprietary
Layout
[Figure: a graph whose input files and output file are all on Node X. Notice that all layouts are serial in this graph.]
Q: On which node do the processing components run?
A: On Node X.
Confidential & Proprietary
Layout Determines What Runs Where
[Figure: a graph whose datasets are located on Node W, Node X, Node Y, and Node Z.]
Q: On which node do the processing components run?
Confidential & Proprietary
Layout Determines What Runs Where
Serial: file on Node W
Parallel: 3-way multifile on Nodes X, Y, Z
Confidential & Proprietary
Layout Determines What Runs Where
[Figure: a graph that reads a serial file on Node W and writes a serial file on Node W.]
Q: Where do the Reformat(s) run?
Q: Serial or Parallel?
Confidential & Proprietary
Controlling Layout
• Propagate (default)
• Bind layout to that of another component
• Use layout of URL
• Construct layout manually
• Run on these hosts
Confidential & Proprietary
Multidirectory URL as a Layout
mfile://host1/u/jo/mfs
//host1/vol4/pA/
//host2/vol3/pB/
//host3/vol7/pC/
Layout specifies the locations of the partitions.
Each partition of a layout has:
• A host part (node to run on)
• A data part (directory for working storage)
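The host part and data part of each partition can be pulled apart with a few lines of Python. This is an illustrative sketch only: the `LayoutPartition` type and `parse_partition` function are invented for the example, and the `//host/directory` form is taken from the partition paths shown on the slide.

```python
from typing import List, NamedTuple


class LayoutPartition(NamedTuple):
    host: str        # node to run on
    directory: str   # directory for working storage


def parse_partition(url: str) -> LayoutPartition:
    """Split a '//host/dir/...' partition path into its host and data parts."""
    if not url.startswith("//"):
        raise ValueError("expected a path of the form //host/dir: " + url)
    host, _, path = url[2:].partition("/")
    return LayoutPartition(host=host, directory="/" + path)


layout: List[LayoutPartition] = [
    parse_partition(u)
    for u in ("//host1/vol4/pA/", "//host2/vol3/pB/", "//host3/vol7/pC/")
]

if __name__ == "__main__":
    print("degree of data parallelism:", len(layout))
    for part in layout:
        print(part.host, part.directory)
```

A serial layout is simply the one-element case: a single host and a single directory.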
Confidential & Proprietary
Reining in the Parallel Beast
• Applications built with Ab Initio Software can combine all forms of parallelism.
• Layouts control the number of partitions of a parallel computation; that is, the degree of data parallelism.
• Phases control the number of components running at any one time; that is, the degree of component and pipeline parallelism.
Confidential & Proprietary
Phases
Phase 0 Phase 1
Confidential & Proprietary
Phases
•Breaking an application into phases limits the contention for:
  •Main memory.
  •Processor(s).
•Breaking an application into phases costs:
  •Disk space.
Confidential & Proprietary
Checkpoints
•Since data is staged to disk between phases, one can arrange to use that data to “start from the middle” should something go wrong.
•Any phase break can be a checkpoint.
Confidential & Proprietary
The Phase Toolbar
[Figure: the Phase toolbar, with controls labeled Select Phase Number, View Phase, Set Phase, and a toggle between Phase (P) and Checkpoint After Phase (C).]
Confidential & Proprietary
Anatomy of a Running Job
What happens when you push the “Run” button?
• Your graph is translated into a script that can be executed in the Shell Development Environment.
• This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
• The script is invoked (via REXEC or TELNET) on the server.
• The script creates and runs a job that may run across many nodes.
• Monitoring information is sent back to the GDE client.
Confidential & Proprietary
Anatomy of a Running Job
•Host Process Creation
  •Pushing “Run” button generates script.
  •Script is transmitted to Host node.
  •Script is invoked, creating Host process.
Client Host Processing nodes
GDE
Host
Confidential & Proprietary
Anatomy of a Running Job
•Agent Process Creation
  •Host process spawns Agent processes.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
•Component Process Creation
  •Agent processes create Component processes on each processing node.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
• Component Execution
  • Component processes do their jobs.
  • Component processes communicate directly with datasets and each other to move data around.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
•Successful Component Termination
  •As each Component process finishes with its data, it exits with success status.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
• Agent Termination
  • When all of an Agent’s Component processes exit, the Agent informs the Host process that those components are finished.
  • The Agent process then exits.
Client Host Processing nodes
GDE
Host
Confidential & Proprietary
Anatomy of a Running Job
• Host Termination
  • When all Agents have exited, the Host process informs the GDE that the job is complete.
  • The Host process then exits.
Client Host Processing nodes
GDE
Host
Confidential & Proprietary
Anatomy of a Running Job
• Abnormal Component Termination
  • When an error occurs in a Component process, it exits with error status.
  • The Agent then informs the Host.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
• Abnormal Component Termination
•The Host tells each Agent to kill its Component processes.
Client Host Processing nodes
GDE
Host
Agent Agent
Confidential & Proprietary
Anatomy of a Running Job
• Agent Termination
  • When every Component process of an Agent has been killed, the Agent informs the Host process that those components are finished.
  • The Agent process then exits.
Client Host Processing nodes
GDE
Host
Confidential & Proprietary
Anatomy of a Running Job
• Host Termination
  • When all Agents have exited, the Host process informs the GDE that the job failed.
  • The Host process then exits.
Client Host Processing nodes
GDE
Host
Confidential & Proprietary
To View or Edit the Script
“Edit Script” button
Lines beginning with “mp” are Shell Development Environment directives
Confidential & Proprietary
Connecting the GDE to the Server
Hostname of server
User ID
Password
Confidential & Proprietary
Sample Applications
•Loading a Data Warehouse
•Extracting a Data Mart
•Data Cleansing
•Rehosting a Database
Confidential & Proprietary
Loading a Data Warehouse
Confidential & Proprietary
Extracting a Data Mart
Confidential & Proprietary
Data Cleansing
Confidential & Proprietary
Rehosting a Database
Confidential & Proprietary
Rehosting a Database - example run
Confidential & Proprietary
Texas Massachusetts
Confidential & Proprietary
Topics for Developers: Partitioners, Multistage Components, Lookup Files, and More
A Practical Introduction to Ab Initio Software
Confidential & Proprietary
Outline
Digging Deeper
•Partitioners Revisited
•Introduction to Multi-stage Transforms
•Online Examples
•Controlling Rejects via Limit/Ramp
•Lookup Tables
•Multifile Creation
Confidential & Proprietary
Partitioning Review
• For the various partitioning components:
  • Is it key-based? Does the problem require a key-based partition?
  • Performance: Are the partitions balanced or skewed?
[Figure label: Fan-out Flow]
Confidential & Proprietary
Partitioning: Performance
Balanced: Processors get neither too much nor too little.
Skewed: Some processors get too much, others too little.
[Figure: the amount of data in Partition 0 through Partition 3, shown once for the balanced case and once for the skewed case.]
Confidential & Proprietary
Sample Data to be Partitioned
Customers:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2
Record format:
record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
Confidential & Proprietary
Partition by Round-robin
Customers, Partition 0:
42John 02116 30
45Sue 02241 92
48Mary 02116 38
Customers, Partition 1:
43Mark 02114 9
46Rick 02116 23
49Jane 02241 2
Customers, Partition 2:
44Bob 02116 8
47Bill 02114 14
Confidential & Proprietary
Partition by Round-robin
• Not key-based.
• Results in very well balanced data, especially with block-size of 1.
• Useful for record-independent parallelism.
Confidential & Proprietary
Partition by Key
Partition on zipcode:
Customers, first partition:
43Mark 02114 9
45Sue 02241 92
47Bill 02114 14
49Jane 02241 2
Customers, second partition:
42John 02116 30
44Bob 02116 8
46Rick 02116 23
48Mary 02116 38
Confidential & Proprietary
Partition by Key often followed by a Sort
Sort on zipcode:
Customers, first partition:
43Mark 02114 9
47Bill 02114 14
45Sue 02241 92
49Jane 02241 2
Customers, second partition:
42John 02116 30
44Bob 02116 8
46Rick 02116 23
48Mary 02116 38
Rollup by zipcode:
Totals by Zipcode, first partition:
02114 23
02241 94
Totals by Zipcode, second partition:
02116 99
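The partition, sort, and rollup sequence on the Customers data can be traced end to end in Python. This is a sketch under stated assumptions: the zipcode modulo the partition count stands in for the component's hash, so the assignment of zipcodes to partitions differs from the slide, while the totals per zipcode are the same. The function name `parallel_totals` is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Customer = Tuple[int, str, str, int]  # (id, name, zipcode, amount)

customers: List[Customer] = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def zipcode(record: Customer) -> str:
    return record[2]


def parallel_totals(records: List[Customer],
                    n_partitions: int) -> List[List[Tuple[str, int]]]:
    """Partition by zipcode, then sort and total each partition independently."""
    # Step 1: partition by key (stand-in hash: zipcode modulo partition count).
    partitions: List[List[Customer]] = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[int(zipcode(record)) % n_partitions].append(record)
    results = []
    for part in partitions:
        # Step 2: sort on the same key within the partition.
        part.sort(key=zipcode)
        # Step 3: aggregate runs of equal keys.
        results.append([(z, sum(r[3] for r in group))
                        for z, group in groupby(part, key=zipcode)])
    return results


if __name__ == "__main__":
    for number, totals in enumerate(parallel_totals(customers, 2)):
        print("partition", number, totals)
```

Each partition can be sorted and totaled without looking at any other partition, because partitioning by key guarantees that no zipcode is split across partitions.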
Confidential & Proprietary
Partition by Key
•Key-based.
•Usually results in well balanced data.
•Useful for key-dependent parallelism.
Confidential & Proprietary
Partition by Expression
Expression: amount/33
Customers, Partition 0:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
46Rick 02116 23
47Bill 02114 14
49Jane 02241 2
Customers, Partition 1:
48Mary 02116 38
Customers, Partition 2:
45Sue 02241 92
Confidential & Proprietary
Partition by Expression
• Key-based, depending on the expression.
• Resulting balance very dependent on expression and on data.
• Various application-dependent uses.
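The amount/33 example can be reproduced in Python if the expression is read as integer division that yields the partition number directly. That reading is an assumption inferred from the partition contents shown on the slide, and `partition_by_expression` is a hypothetical helper name.

```python
from typing import Callable, List, Tuple

Customer = Tuple[int, int]  # (id, amount)


def partition_by_expression(records: List[Customer],
                            expression: Callable[[Customer], int],
                            n_partitions: int) -> List[List[Customer]]:
    """Send each record to the partition number the expression evaluates to."""
    partitions: List[List[Customer]] = [[] for _ in range(n_partitions)]
    for record in records:
        index = expression(record)
        if not 0 <= index < n_partitions:
            raise ValueError("expression produced partition %r" % (index,))
        partitions[index].append(record)
    return partitions


customers = [(42, 30), (43, 9), (44, 8), (45, 92),
             (46, 23), (47, 14), (48, 38), (49, 2)]

if __name__ == "__main__":
    parts = partition_by_expression(customers, lambda r: r[1] // 33, 3)
    for number, part in enumerate(parts):
        print(number, [cid for cid, _ in part])
```

Six of the eight records fall in partition 0, which shows how strongly the balance depends on both the expression and the data.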
Confidential & Proprietary
Partition by Range
With splitter values of 9 and 23:
Customers, Partition 0:
43Mark 02114 9
44Bob 02116 8
49Jane 02241 2
Customers, Partition 1:
46Rick 02116 23
47Bill 02114 14
Customers, Partition 2:
42John 02116 30
45Sue 02241 92
48Mary 02116 38
Confidential & Proprietary
Range+Sort: Global Ordering
Sort following a partition by range:
Customers, Partition 0:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
Customers, Partition 1:
47Bill 02114 14
46Rick 02116 23
Customers, Partition 2:
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Confidential & Proprietary
Partition by Range
•Key-based.
•Resulting balance dependent on set of splitters chosen.
•Useful for “binning” and global sorting.
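Range partitioning with splitters 9 and 23, followed by a sort in each partition, can be sketched in Python with the standard `bisect` module. Treating a key equal to a splitter as belonging to the lower partition is an assumption that matches the slide (amounts 9 and 23 sit in Partitions 0 and 1); `partition_by_range` is a hypothetical helper name.

```python
from bisect import bisect_left
from typing import Callable, List, Sequence, Tuple

Customer = Tuple[int, int]  # (id, amount)


def partition_by_range(records: List[Customer],
                       key: Callable[[Customer], int],
                       splitters: Sequence[int]) -> List[List[Customer]]:
    """Partition i receives keys above splitter i-1 and up to splitter i."""
    partitions: List[List[Customer]] = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        partitions[bisect_left(splitters, key(record))].append(record)
    return partitions


def amount(record: Customer) -> int:
    return record[1]


customers = [(42, 30), (43, 9), (44, 8), (45, 92),
             (46, 23), (47, 14), (48, 38), (49, 2)]

if __name__ == "__main__":
    parts = partition_by_range(customers, amount, splitters=[9, 23])
    # Sorting each partition independently gives a global ordering when the
    # partitions are read in sequence.
    globally_sorted = [r for part in parts for r in sorted(part, key=amount)]
    print(globally_sorted)
```

Reading the sorted partitions in order needs no comparison across partitions, since every key in one partition is no larger than any key in the next.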
Confidential & Proprietary
Partition with Load Balance
If middle node highly loaded:
Customers, first partition:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
49Jane 02241 2
Customers, middle partition:
45Sue 02241 92
Customers, last partition:
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
Confidential & Proprietary
Partition by Load Balance
• Not key-based.
• Results in skewed data distribution to complement skewed load.
• Useful for record-independent parallelism.
Confidential & Proprietary
Partition with Percentage
With percentages: 4, 20
Customers:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
Customers:
...
Customers:
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2
Callouts: “The next 16 records would go here” and “the next 76 records would go here.”
Confidential & Proprietary
Partition by Percentage
• Not key-based
• Results in usually skewed data distribution conforming to the provided percentages.
• Useful for record-independent parallelism.
Confidential & Proprietary
Broadcast (as a Partitioner)
Unlike all other partitioners, which write a record to ONE output flow, Broadcast writes each record to EVERY output flow.
Customers, as received on each output flow:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2
Confidential & Proprietary
Broadcast
•Not key-based
•Results in perfectly balanced partitions
•Useful for record-independent parallelism.
Confidential & Proprietary
Summary of Partitioning Methods
Method        Key-based?  Balancing?                    Uses
Key           Yes         Good                          Key-dependent parallelism
Expression    Yes         Depends on data & expression  Application specific
Range         Yes         Depends on splitters          Key-dependent parallelism, global sorting
Round-robin   No          Good                          Record-independent parallelism
Load Balance  No          Depends on load               Record-independent parallelism
Percentage    No          Depends on percentages given  Record-independent parallelism
Confidential & Proprietary
Multistage Transform Components
• These components take several sets of rules to tell them how data is to be transformed in several stages.
• Each set of rules (in the form of one transform function) determines how each stage of the transformation will proceed.
• Stages include: input selection, initialization, iteration, finalization, output selection, and more.
Confidential & Proprietary
Packages Hold Types and Functions
•Multistage transform components are driven by packages:
[Figure: a package, with callouts marking the Temporary type, Initialization stage, Iteration stage, Finalization stage, and a “Helper” function.]
Confidential & Proprietary
Rollup is a Multistage Component
• By default, Rollup comes up in “Wizard” mode.
• To access the full power of this component, switch to Package Mode once you are in the transform editor. (do View -> Package)
Confidential & Proprietary
Rollup
• Rollup performs a general aggregation of data.
• Rollup has these stages:
  • Key Change: key_change
  • Input Selection: input_select
  • Initialization: initialize (required)
  • Rollup: rollup (required)
  • Finalization: finalize (required)
  • Output Selection: output_select
Confidential & Proprietary
Data Aggregation
Input:
0345Smith Bristol 56
0212Spade London 8
0322Jones Compton 12
0492West London 23
0121Forth Bristol 7
0221Black New York 42
Output:
Bristol 63
Compton 12
London 31
New York 42
Confidential & Proprietary
Data Aggregation of Sorted/Grouped Input
Sorted/grouped input:
0345Smith Bristol 56
0121Forth Bristol 7
0322Jones Compton 12
0212Spade London 8
0492West London 23
0221Black New York 42
Output:
Bristol 63
Compton 12
London 31
New York 42
Confidential & Proprietary
Aggregation Calculation for each City
Input record format:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end
Output record format:
record
  string(8) city;
  decimal(4) sum;
end
Initialization: tmp = 0;
Calculation (loop): tmp = tmp + amount;
Result: sum = tmp;
Confidential & Proprietary
Data Aggregation in the Rollup Transform
temp:
record
  decimal(4) sum;
end
(Callout on the record name: use a descriptive name.)
in:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end
out:
record
  string(8) city;
  decimal(4) sum;
end
Initialization: temp.sum = 0;
Calculation (loop): temp.sum = temp.sum + in.amount;
Result: out.sum = temp.sum;
Confidential & Proprietary
The Temporary Variable
•Multistage transform components provide a temporary variable which may be used to carry information between stages.
•Multiple pieces of information may be conveyed from stage to stage by having multiple fields in the temporary type.
Confidential & Proprietary
Package Editor: creating temporary type...
Confidential & Proprietary
…and rollup code
Confidential & Proprietary
Text Representation of Rollup Aggregation
type temporary_type =
record
  decimal(4) sum;
end

temp::initialize(in) =
begin
  temp.sum :: 0;
end;

temp::rollup(temp, in) =
begin
  temp.sum :: temp.sum + in.amount;
end;

out::finalize(temp, in) =
begin
  out.city :: in.city;
  out.sum :: temp.sum;
end;
Confidential & Proprietary
A Look Inside the Rollup Component
• Initialize: do for first record in each group.
• Rollup: do for every record in each group.
• Finalize: do for last record in each group.
[Figure labels: temp, in, out]
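The way the three stages cooperate can be imitated with a small Python driver. This is a sketch of the control flow the slide describes (initialize on the first record of a group, rollup on every record, finalize on the last), assuming input that is already grouped by key; the driver `run_rollup` and the dictionary-based records are assumptions made for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def initialize(rec: Record) -> Record:
    return {"sum": 0}


def rollup(temp: Record, rec: Record) -> Record:
    return {"sum": temp["sum"] + rec["amount"]}


def finalize(temp: Record, rec: Record) -> Record:
    return {"city": rec["city"], "sum": temp["sum"]}


def run_rollup(records: List[Record], key: Callable[[Record], object]) -> List[Record]:
    """Drive initialize/rollup/finalize over input already grouped by key."""
    out: List[Record] = []
    temp = None
    last = None
    for rec in records:
        if last is None or key(rec) != key(last):
            if last is not None:
                out.append(finalize(temp, last))  # last record of previous group
            temp = initialize(rec)                # first record of a new group
        temp = rollup(temp, rec)                  # every record in the group
        last = rec
    if last is not None:
        out.append(finalize(temp, last))
    return out


grouped_input = [
    {"city": "Bristol", "amount": 56}, {"city": "Bristol", "amount": 7},
    {"city": "Compton", "amount": 12}, {"city": "London", "amount": 8},
    {"city": "London", "amount": 23}, {"city": "New York", "amount": 42},
]

if __name__ == "__main__":
    for row in run_rollup(grouped_input, key=lambda r: r["city"]):
        print(row)
```

The temporary record is the only state carried between stages, which is why adding fields to it is enough to carry several pieces of information through a group.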
Confidential & Proprietary
Normalization
Input:
H002 Smith 3 1994.03.23 Jane Bill Thom
H003 Jones 2 1993.02.12 Andy Elle
H004 Lee 1 1994.08.15 Lori
H008 Ruben 2 1993.10.22 Eric Anne
Output:
Jane Smith
Bill Smith
Thom Smith
Andy Jones
Elle Jones
Lori Lee
Eric Ruben
Anne Ruben
Confidential & Proprietary
Inside Normalize
For each input record:
len = length(input);
for index = 0 to len-1
  output = normalize(input, index);
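The length/normalize loop translates almost directly into Python. In this sketch the two functions mirror the names on the slide, while the dictionary field names (`last`, `count`, `names`) and the driver `run_normalize` are assumptions chosen to fit the sample data.

```python
from typing import Dict, List

Record = Dict[str, object]


def length(rec: Record) -> int:
    """How many output records this input record should produce."""
    return rec["count"]


def normalize(rec: Record, index: int) -> Record:
    """Build the output record for one position of the repeated field."""
    return {"first": rec["names"][index], "last": rec["last"]}


def run_normalize(records: List[Record]) -> List[Record]:
    out: List[Record] = []
    for rec in records:
        for index in range(length(rec)):   # index = 0 to len-1
            out.append(normalize(rec, index))
    return out


households = [
    {"last": "Smith", "count": 3, "names": ["Jane", "Bill", "Thom"]},
    {"last": "Jones", "count": 2, "names": ["Andy", "Elle"]},
    {"last": "Lee", "count": 1, "names": ["Lori"]},
    {"last": "Ruben", "count": 2, "names": ["Eric", "Anne"]},
]

if __name__ == "__main__":
    for row in run_normalize(households):
        print(row["first"], row["last"])
```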
Confidential & Proprietary
Online Example of Normalize
Open this example graph: Examples… DML… Transforms… Normalize
View input data, run the graph, and view the output data.
Examine the Normalize parameters.
Confidential & Proprietary
Denormalize
•Denormalize generates one output record for a group of input records.
•Denormalize has these stages:
  •Input Selection: input_select
  •Initialization: initialize
  •Initialization: initial_denormalization (required)
  •Rollup: rollup
  •Denormalize: denormalize (required)
  •Finalization: finalize
  •Output Selection: output_select
Confidential & Proprietary
Denormalization
Input (one record per person):
Jane Smith
Bill Smith
Thom Smith
Andy Jones
Elle Jones
Lori Lee
Eric Ruben
Anne Ruben
Output (one record per group):
Smith 3 Jane Bill Thom
Jones 2 Andy Elle
Lee 1 Lori
Ruben 2 Eric Anne
Confidential & Proprietary
Online Examples of Denormalize
Open either of these example graphs:
Examples… DML… Transforms… Denormalize
Examples… DML… Transforms… Denorm-rollup
View input data, run the graph, and view the output data.
Examine the Denormalize parameters.
Confidential & Proprietary
Online Examples of Transform Components
See: Help…Examples…DML…Transforms for a number of graphs that demonstrate transform component usage.
Confidential & Proprietary
Join
• Join performs a join of inputs. By default, the inputs to join must be sorted and an inner join is computed.
• Options:
• join-type: Inner, Outer or Explicit (other).
• dedupn: Call the transform function only once for any matching record on input n. Defaults to false.
• record-requiredn: Call transform function for all keys, even if there is not a matching record for input n. Defaults to true. Only used if join-type is Explicit.
Confidential & Proprietary
Inner Join:
An inner join produces an output record only when a given key is present on ALL inputs. If the key is duplicated on any input, each (duplicate) key is matched with the other inputs.
in0      in1      result
a,me     b,7      b,we,7
b,we     b,8      b,we,8
b,she    c,9      b,she,7
c,he              b,she,8
d,us              c,he,9
Confidential & Proprietary
Full Outer Join:
A full outer join produces an output record whether there is a match for a given key on an input or not. If the key is duplicated on any input, each (duplicate) key is matched with the other inputs. The user should provide default values.
in0      in1      result
a,hi     b,7      a,hi,999
b,lo     b,8      b,lo,7
c,bye    c,9      b,lo,8
         d,1      c,bye,9
                  d,XXX,1
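Both join slides can be checked with a short Python sketch. Note that this uses an in-memory dictionary join rather than the sorted-input approach the Join component defaults to; the function name `join`, its `how` parameter, and the default values `XXX` and `999` (taken from the slide's result column) are assumptions made for the example.

```python
from collections import defaultdict
from typing import List, Tuple

Pair = Tuple[str, object]  # (key, value)


def join(in0: List[Pair], in1: List[Pair], how: str = "inner",
         default0: object = "XXX", default1: object = 999) -> List[tuple]:
    """Match every in0 record with every in1 record that shares its key."""
    by_key0, by_key1 = defaultdict(list), defaultdict(list)
    for k, v in in0:
        by_key0[k].append(v)
    for k, v in in1:
        by_key1[k].append(v)
    if how == "inner":
        keys = sorted(set(by_key0) & set(by_key1))   # key must be on ALL inputs
    else:
        keys = sorted(set(by_key0) | set(by_key1))   # key on ANY input
    result = []
    for k in keys:
        # A missing side contributes a single default value (outer join only).
        for v0 in by_key0.get(k, [default0]):
            for v1 in by_key1.get(k, [default1]):
                result.append((k, v0, v1))
    return result


if __name__ == "__main__":
    print(join([("a", "me"), ("b", "we"), ("b", "she"), ("c", "he"), ("d", "us")],
               [("b", 7), ("b", 8), ("c", 9)]))
    print(join([("a", "hi"), ("b", "lo"), ("c", "bye")],
               [("b", 7), ("b", 8), ("c", 9), ("d", 1)], how="outer"))
```

Duplicate keys multiply: two `b` records on one input and two on the other give four output records, as in the inner join slide.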
Confidential & Proprietary
Joins can be arbitrarily complex in Ab Initio
The Join component is capable of combining its input in many ways. It is also capable of combining more than two inputs.
See the Component Reference or the Online Help for complete information about Join.
Confidential & Proprietary
Controlling Rejects: When First/Never Are Not Enough
•Sometimes it is desirable to exercise more control over when to abort a graph than is possible with “Never Abort” or “Abort on first reject”. The choice “Use limit/ramp” allows for other possibilities...
Instrumentation Parameters: Limit, Ramp
• Limit: Number of errors to tolerate.
• Ramp: Rate of errors to tolerate per input record, expressed as a fraction (for example, 0.01 is 1 error per 100 records).
Typical Limit and Ramp Settings
• Limit = 0, Ramp = 0.0: Abort on any error.
• Limit = 50, Ramp = 0.0: Abort after 50 errors.
• Limit = 100, Ramp = 0.01: Abort if more than 1 record in 100 causes an error, but only after processing 100 records.
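The slides do not state the exact rule that combines Limit and Ramp. The Python sketch below assumes a simple linear tolerance, `limit + ramp * records_processed`, which is an assumption made for illustration; consult the Component Reference for the rule the components actually apply. The function name `should_abort` is hypothetical.

```python
def should_abort(rejects, records_processed, limit, ramp):
    """Return True when the reject count exceeds the tolerance.

    ASSUMPTION: tolerance grows linearly with the number of records
    processed. This formula is not given in the course slides.
    """
    tolerance = limit + ramp * records_processed
    return rejects > tolerance

# Limit = 0, Ramp = 0.0: the first reject exceeds the tolerance.
first_reject_aborts = should_abort(1, 10, limit=0, ramp=0.0)

# Limit = 50, Ramp = 0.0: 50 rejects are tolerated, the 51st is not.
fifty_ok = not should_abort(50, 1000, limit=50, ramp=0.0)
fifty_one_aborts = should_abort(51, 1000, limit=50, ramp=0.0)
```

Under this assumption a non-zero Ramp lets the tolerated number of rejects grow with the size of the input, while Limit supplies a fixed allowance that applies from the first record.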
Lookup Files
• DML provides a facility for looking up records in a dataset based on a key:
lookup("file-name", key-expression)
• The data is read from a file into memory.
• The GDE provides a Lookup File component as a special dataset with no ports.
Using lookup instead of Join
Using Last-Visits as a lookup file
Configuring a Lookup File
1. Label used as name in lookup expression
2. Browse for pathname
3. Set record format
4. Set key
Using lookup in a Transform Function
• Transform function:
out :: lookup_info(in) =
begin
  out.id :: in.id;
  out.city :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;
Input 0 record format:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end
Output record format:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end
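The behaviour of the transform above can be modelled in Python with a dictionary standing in for the in-memory lookup file. This is a sketch of the idea only: `build_lookup` and the sample Last-Visits rows are hypothetical, and the fallback mirrors the two prioritized `out.dt` rules (use the looked-up date if there is a match, otherwise the default `1900/01/01`).

```python
def build_lookup(records, key_field):
    """Load records into memory, indexed by the lookup key."""
    return {rec[key_field]: rec for rec in records}

def lookup_info(in_rec, last_visits):
    """Model of the lookup_info transform.

    Priority 1: take dt from the matching Last-Visits record.
    Priority 2: fall back to the default date when no record matches.
    """
    match = last_visits.get(in_rec["id"])
    dt = match["dt"] if match is not None else "1900/01/01"
    return {
        "id": in_rec["id"],
        "city": in_rec["city"],
        "amount": in_rec["amount"],
        "dt": dt,
    }

# Hypothetical sample data, invented for illustration.
last_visits = build_lookup(
    [{"id": 1001, "dt": "2001/04/30"}, {"id": 1003, "dt": "2001/05/02"}],
    key_field="id",
)
seen = lookup_info({"id": 1001, "name": "Jo", "city": "Boston", "amount": 42}, last_visits)
unseen = lookup_info({"id": 1002, "name": "Al", "city": "Austin", "amount": 7}, last_visits)
```

Note that the `name` field of the input is dropped, as in the output record format, and that a record with no match still produces output because the second rule supplies a default.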
Multifile Commands
Roles of people in an Ab Initio Project
• Normally the SA for the project manages the multifile systems (with input from the team).
Suggested Directory Structures in a Project
• Ab Initio has a white paper and course modules on the significance of environment in a project.
Utilities for multifile structures
• The Co>Operating System reference guide describes the “m_” commands, some of which follow.
The m_mkfs Command
m_mkfs mfs-url dir-url1 dir-url2 ...
• Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...
$ m_mkfs //host1/u/jo/mfs3 //host1/vol4/dat \
    //host2/vol3/dat //host3/vol7/dat
$ m_mkfs my-mfs dir1 dir2 dir3
The m_mkdir Command
m_mkdir url
• Creates the named multidirectory. The url must refer to a pathname within an existing multifile system.
$ m_mkdir mfile:my-mfs/subdir
$ m_mkdir mfile://host2/tmp/temp-mfs/dir1
The m_ls command
m_ls [option...] url [url...]
• Lists information on the files or directories specified by the urls. The options, which follow the form of ls, control the information presented.
$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
Exercise C: Multifile Commands
1. Create a three-partition multifile system named mfs-3way.
2. Create two directories within mfs-3way named dir1 and dir2.
3. Use m_ls to list the contents of mfs-3way.
Exercise D: Using Multifiles
1. Use m_ls to examine other multidirectories and multifiles (in particular, mfs-2way).
2. Use ls to examine the control and data partitions of mfs-2way.
IDB Database
Setting Up
• Ensure the Accounts, Sites, and Transactions datasets from Figure-07 of the intro training class are available.
• Create the dbc config file for your database.
• Load the data from each of the datasets into the database. Remember to set the necessary radio buttons on the Access tab.
Database Config File
•Click Config File and New to create a config file for your DBMS and instance.
Database Config File
•A window with the available DBMS types will pop-up.
•Choose the one you want.
Database Config File
•The Database Configuration file will be brought into an editor for you to fill in the necessary information
Database Config File
dbms: oracle                     ## REQUIRED. Do not change the value of this tag from oracle.
db_version: 8.0.5                ## REQUIRED. Enter the Oracle version number.
db_home: c:/orant                ## REQUIRED. Enter the Oracle home directory.
db_name: NTORCL                  ## often just the SID
db_nodes: laptop-12              ## can be multiple
user: ${MY_USERNAME}             ## use environment variables
password: ${MY_PASSWORD}         ## so as not to hardcode
case: lower                      ## dml from dbms in lowercase
##column_delimiter:
generate_dml_with_nulls: false
fixed_size_dml: false
treat_blanks_as_null: true
##local_db_version:
## environment:
direct_parallel: false
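The `${MY_USERNAME}` and `${MY_PASSWORD}` entries keep credentials out of the config file by deferring to environment variables. The Python sketch below illustrates that substitution idea on a file of the same shape. It is not how the Co>Operating System reads the file; the parser `parse_dbc` and the sample values are hypothetical.

```python
import os
from string import Template

def parse_dbc(text, env=None):
    """Parse 'tag: value ## comment' lines and expand ${VAR} references.

    Hypothetical helper for illustration. A missing variable raises
    KeyError, so a forgotten credential fails loudly instead of
    silently becoming an empty string.
    """
    env = os.environ if env is None else env
    settings = {}
    for line in text.splitlines():
        line = line.split("##", 1)[0].strip()   # drop trailing or whole-line comments
        if not line or ":" not in line:
            continue
        tag, value = line.split(":", 1)          # split on the first colon only
        settings[tag.strip()] = Template(value.strip()).substitute(env)
    return settings

sample = """\
db_home: c:/orant        ## REQUIRED. Enter the Oracle home directory.
user: ${MY_USERNAME}     ## use environment variables
password: ${MY_PASSWORD} ## so as not to hardcode
##column_delimiter:
"""
settings = parse_dbc(sample, env={"MY_USERNAME": "jo", "MY_PASSWORD": "example-only"})
```

Splitting on the first colon only keeps a value such as `c:/orant` intact, and passing `env` explicitly makes the sketch easy to try without touching the real environment.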
setup-idb-training.mp
These reformats do nothing but are required for the NT Oracle version to carry the fields to the database tables.
Load Account Table Parameters: Access
• If truncating, the append and replace settings are irrelevant.
modify_accounts.mp
update_account.mp
Update Table Parameters
updateSqlFile & insertSqlFile
updateSqlFile:
update accounts set address = :address where acct_id = :acct_id
insertSqlFile:
insert into accounts values (:acct_id,:acct_name,:address)
Log file
laptop-12.abinitio.com|Thu May 03 10:21:45 2001|Gather_Logs.000||start|Start|
laptop-12.abinitio.com|Thu May 03 10:21:45 2001|Update_Table_Accounts.000|update|start||
laptop-12.abinitio.com|Thu May 03 10:21:46 2001|Update_Table_Accounts.000|update|sql|
Primary SQL supplied: update accounts set address =
:address where acct_id = :acct_id|
laptop-12.abinitio.com|Thu May 03 10:21:46 2001|Update_Table_Accounts.000|update|sql|
Secondary SQL supplied: insert into accounts values
(:acct_id,:acct_name,:address)|
laptop-12.abinitio.com|Thu May 03 10:21:48 2001|Update_Table_Accounts.000|update|finish|
10 records read
10 rows updated by SQL1
0 records sent to SQL2
0 rows updated by SQL2
0 records rejected|
laptop-12.abinitio.com|Thu May 03 10:21:48 2001|Gather_Logs.000||finish|End|
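The counters in the log ("rows updated by SQL1", "records sent to SQL2") suggest that a record goes to the secondary SQL only when the primary SQL updates no rows. The Python sketch below models that reading with the standard `sqlite3` module in place of Oracle. The control flow is an inference from the log, not a confirmed description of the Update Table component, and `update_or_insert` is a hypothetical name.

```python
import sqlite3

# The two statements from the updateSqlFile and insertSqlFile slide.
UPDATE_SQL = "update accounts set address = :address where acct_id = :acct_id"
INSERT_SQL = "insert into accounts values (:acct_id,:acct_name,:address)"

def update_or_insert(conn, records):
    """Try the update first; send the record to the insert if no row matched.

    ASSUMPTION: this update-then-insert flow is inferred from the log
    counters shown above.
    """
    stats = {"read": 0, "updated_sql1": 0, "sent_sql2": 0, "updated_sql2": 0}
    for rec in records:
        stats["read"] += 1
        cur = conn.execute(UPDATE_SQL, rec)
        if cur.rowcount > 0:
            stats["updated_sql1"] += cur.rowcount
        else:
            stats["sent_sql2"] += 1
            cur = conn.execute(INSERT_SQL, rec)
            stats["updated_sql2"] += cur.rowcount
    conn.commit()
    return stats

# In-memory table with one existing account (invented sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("create table accounts (acct_id integer primary key, acct_name text, address text)")
conn.execute("insert into accounts values (112346893, 'Johnson Paints', 'old address')")
stats = update_or_insert(conn, [
    {"acct_id": 112346893, "acct_name": "Johnson Paints", "address": "2303 Appian Way"},
    {"acct_id": 112342374, "acct_name": "Jackson Stone", "address": "419 Rockville Pkwy"},
])
```

In the run logged above all 10 records matched existing rows, which is why nothing was sent to SQL2; in this sketch the second record has no matching row and is inserted instead.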
check_updates.mp
Browsing the database
Selecting from available
Input Table Properties: Select Statement
Run SQL
delete_rows.sql
Log file (shortened for clarity)
SQL File to run: c:\data\training\data-for-training\delete_rows.sql|
SQL*Plus: Release 8.0.5.0.0 - Production on Thu May 3 12:12:8 2001|
(c) Copyright 1998 Oracle Corporation. All rights reserved.|
Connected to:|Oracle8 Enterprise Edition Release 8.0.5.0.0 - Production|
PL/SQL Release 8.0.5.0.0 - Production|
COUNT(*)|---------| 1000|
3 rows deleted.|
COUNT(*)|---------| 997|
Commit complete.|
Disconnected from Oracle8 Enterprise Edition Release 8.0.5.0.0 - Production|
PL/SQL Release 8.0.5.0.0 - Production|
Input Table (direct to output file = unload)
Input Table Parameters: Source
Serial unload: Select Statement
Parallel Unload: ABLOCAL(tablename)
Parallel Unload: ABLOCAL()
ablocal_expr
ablocal_expr:
if (this_partition() == 0) "acct_id <= 112347000"
else if (this_partition() == 1) "acct_id > 112347000"
else "1 = 2"
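The expression above gives each partition its own WHERE condition, so the partitions unload disjoint slices of the table and any extra partition unloads nothing (`1 = 2` is never true). The Python sketch below checks that property against an in-memory SQLite table. The `select` statement, the way the condition is spliced into it, and the sample rows are assumptions for illustration only.

```python
import sqlite3

def ablocal_expr(this_partition):
    """Python rendering of the ablocal_expr shown above."""
    if this_partition == 0:
        return "acct_id <= 112347000"
    elif this_partition == 1:
        return "acct_id > 112347000"
    return "1 = 2"   # any further partition selects no rows

# Sample acct_id values taken from the Accounts data shown in Appendix A.
conn = sqlite3.connect(":memory:")
conn.execute("create table accounts (acct_id integer)")
conn.executemany(
    "insert into accounts values (?)",
    [(112346893,), (112342374,), (112391676,), (100246677,)],
)

# Hypothetical per-partition unload query: each partition runs the same
# select with its own condition.
unloaded = {
    p: [row[0] for row in conn.execute(
        "select acct_id from accounts where " + ablocal_expr(p) + " order by acct_id")]
    for p in range(3)
}
```

Every row lands in exactly one partition, which is the property a parallel unload needs: no row is unloaded twice and none is missed.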
Testing and Validation: Techniques and Strategies
Testing and Validation
• Components
• DML Features
• Data Cleansing
• Generating Test Data
• Testing Strategies
Components for Testing and Validation
• In Component Organizer: Validate Category
  • Check Order
  • Compare Checksums
  • Compare Records
  • Compute Checksum
  • Generate Random Bytes
  • Generate Records
  • Validate Records
• In Other Categories:
  • Intermediate File (Datasets)
  • Trash (Miscellaneous)
  • Dedup Sorted (Transform)
Compare Records
Generate Records and Validate Records
DML: Validation Function Fields
Function fields with names that begin with “is_valid” are called to check validity of data.
record
  string(20) x;
  decimal(5) y;
  int is_valid_y() = (y > 0);
end
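A Python analogue may make the idea concrete: each validity check is a function of the record, and a record is accepted only if every check passes. This is a sketch of the concept, not the Validate Records component; `validate_records` and the sample rows are hypothetical.

```python
def is_valid_y(rec):
    """Counterpart of the DML function field: y must be positive."""
    return rec["y"] > 0

def validate_records(records, checks):
    """Split records into those passing every check and those failing any."""
    valid, rejects = [], []
    for rec in records:
        if all(check(rec) for check in checks):
            valid.append(rec)
        else:
            rejects.append(rec)
    return valid, rejects

# Invented sample records matching the format above.
valid, rejects = validate_records(
    [{"x": "ok", "y": 12}, {"x": "zero", "y": 0}, {"x": "negative", "y": -5}],
    checks=[is_valid_y],
)
```

Keeping the rule next to the record format, as the DML function field does, means every graph that uses the format shares one definition of a valid record.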
Data Cleansing with Reformat
out :: cleanse(in) =
begin
  out.x :: in.x;
  out.y :1: if (in.y < 0) -in.y;
  out.y :2: 99999;
end;
Generating Test Data
• Generate Records component has many options: sequential values, values of expressions, etc.
• Data can be produced “by hand” and reformatted in any way desired.
• Data that is to be joined can be produced by generating “wide” records and reformatting into different data sets.
• Real data can be sampled/selected to produce test data.
• Test data can be combined from multiple sources.
Using Intermediate Files to “Capture” Data
Testing Strategies
•Mix generated data with select special cases.
•Test serial form of application first.
•Test parallel form only after serial version passes.
•Use production data to flesh out test data.
•Use Intermediate Files to capture intermediate results.
Appendix A: Additional Exercises
A Practical Introduction to Ab Initio Software
The Problem
A phone company has accounts (customers) who have switches at a number of sites (locations). Each site can both send and receive (from-site and to-site). The switches at the sites record transactions that include the from-site, the to-site, the date, the time of day, and the duration of the call. The phone company wants to do some analysis on this data.
The Data (see Figure 07)
These exercises will make use of:
• Accounts dataset: acct.dat, acct.dml. Records for each account.
• Sites dataset: site.dat, site.dml. Records for each account’s site.
• Transactions dataset: trans.dat, trans.dml. Records for transactions between sites.
Open figure-07.mp where these datasets are defined.
The Scenario - High Level ER Diagram
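Accounts: acct_id, acct_name, address
Sites: site_id, acct_id, address
Transactions: from_site, to_site, dt, hhmmss, duration
Relationships: each account has many sites (1 : M); each site appears in many transactions, as from_site or as to_site (1 : N, 1 : M).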
Data format and sample: Accounts
record
  decimal(9) acct_id;
  string(15) acct_name;
  string(20) address;
  string(1) newline = "\n";
end
112346893Johnson Paints 2303 Appian Way
112342374Jackson Stone 419 Rockville Pkwy
112346225Kendall Drug One Main Street
112391676Stanfill Flower7286A Befug Ave
100246677Sihebosev Toys 59520 Lico St
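The record format describes fixed-width fields: 9 characters of acct_id, 15 of acct_name, 20 of address, then a newline. The Python sketch below slices records by those widths. The helper names are hypothetical, and because the sample above lost its column padding in extraction, the sketch rebuilds padded lines itself; that padding is an assumption.

```python
# Field layout taken from the Accounts record format: (name, width, converter).
ACCOUNT_FIELDS = [
    ("acct_id", 9, int),
    ("acct_name", 15, str.rstrip),
    ("address", 20, str.rstrip),
]

def parse_accounts(data):
    """Slice each newline-terminated record into fields by fixed width."""
    records = []
    for line in data.split("\n"):
        if not line:
            continue
        rec, pos = {}, 0
        for name, width, convert in ACCOUNT_FIELDS:
            rec[name] = convert(line[pos:pos + width])
            pos += width
        records.append(rec)
    return records

def make_line(acct_id, acct_name, address):
    """Build one padded record (padding is assumed, see lead-in)."""
    return f"{acct_id:09d}{acct_name:<15}{address:<20}\n"

sample = (
    make_line(112346893, "Johnson Paints", "2303 Appian Way")
    + make_line(112391676, "Stanfill Flower", "7286A Befug Ave")
)
accounts = parse_accounts(sample)
```

The second record shows why widths matter: "Stanfill Flower" fills all 15 characters, so the address follows with no separating space, and only the declared widths tell the two fields apart.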
Data format and sample: Sites
record
  decimal(8) site_id;
  decimal(9) acct_id;
  string(20) address;
  string(1) newline = "\n";
end
33213432112347574Fourteen Helime Blvd
3321365411234237442 Babcock Way
332132341123468938288 Main St
332139981123423741232 Center St
332102121123462255061 Pollard Rd
Data format and sample: Transactions
record
  decimal(8) from_site;
  decimal(8) to_site;
  date("YYYYMMDD") dt;
  decimal(6) hhmmss;
  decimal(5) duration;
  string(1) newline = "\n";
end
332136543321323419940428072929 236
332134323321399819940429082345 102
332102123321343219940430125202 2310
332136543321323419940403142811 39
Exercise 1: Early Analysis
Build a non-partitioned application that processes the sites dataset to produce a dataset with the number of sites for each account.
Things to Consider
• What fields do you need on the output? (Hint: fewer is better.)
• What datasets do you need to process? (Hint: fewer is much better.)
• What are the steps you need to take? (Hint: fewer is better.)
Exercise 2: Include the Account Name in the Information from the Last Exercise
• Modify the previous application to produce a dataset that includes the account name of every account in the accounts dataset, and the number of sites per account you just computed.
Exercise 3: Practice Going Parallel with Data
• Make your solution to Exercise 2 run with parallel data streams.
Exercise 4: Further Analysis
• Build a data parallel application that processes the transactions dataset to produce a dataset that contains, for each site:
  • The number of transactions made from the site.
  • The sum of the durations of all transactions made from the site.
Exercise 5: Yet More Analysis
• Modify the previous application to produce two serial datasets: one sorted by number of transactions (descending) and another sorted by duration of all transactions (descending).
Exercise 6: Marketing’s Real Question
• Which named accounts are our “best” accounts based on frequency and duration?
• Build a data parallel application that finds accounts that are both frequent (20 or more transactions) and have a long total duration (greater than 10000). Include the account name in your answer.
Exercise 7: Review/Revise for Efficiency
• Use the output select stage in the Package Editor of Rollup to do the filter specified in Exercise 6.
• Don’t forget to document that the selector is there.
Exercise 8: Further Revise for Efficiency
• Replace one of the JOIN components with a lookup table.
Hint: Make ACCOUNTS.dat a lookup table.
Exercise 9: How many within-account transactions?
• How many From_Site / To_Site combinations are within one account?
Appendix B: Setting up the Graphical Development Environment
A Practical Introduction to Ab Initio Software
Setting Up the GDE
From menu bar: Run... Settings...
Edit
Editing the Host Profile
Hostname of server
User ID
Password
Some environments may require other settings
Host Profile Settings:
Connection:
Host: host
Login: login
Password: password
Setting Up: Copying On-Line Materials
• From menu bar: Run… Execute Command…
• $AB_HOME/examples/intro-course/set-up-training