CIS

CHAPTER 8

Data Structures and CAATTs for Data Extraction

Fajilan, Jastine Mae F.
Jugan, Ma. Patricia Samantha G.
Masangkay, Princess Donalyn L.
Nicolas, Lorraine Nicole P.
Sapinoso, Chrisiel Faye A.
BSA 4-1

January 25, 2013

This chapter examines data structures and the use of CAATTs for data extraction and analysis.

Data Structures

Two fundamental concepts:

1. Organization - refers to the way records are physically arranged on the secondary storage device, either sequentially or randomly.

2. Access method - the technique used to locate records and to navigate through the database or file.

Flat-File Structure

End users in this environment own their data files rather than share them with other users. Data processing is thus performed by stand-alone applications rather than integrated systems. The flat-file approach is a single-view model that characterizes legacy systems.

Sequential Structure

A sequential file is simple and easy to process. The application starts at the beginning of the file and processes each record in sequence. It does not permit accessing a record directly, so files that require direct access operations need a different data structure. The following data structures address this need.

1. Indexed structure – in addition to the data file, a separate index contains the numeric value of the physical disk storage location for each record in the associated data file. Records in an indexed random file are dispersed throughout a disk without regard for their physical proximity to other related records.

2. Virtual storage access method (VSAM) structure - used for very large files that require routine batch processing and a moderate degree of individual record processing. To find a specific record location, it uses a number of indexes that describe in summarized form the contents of each cylinder. The greatest disadvantage is that it does not perform record insertion operations efficiently.

The VSAM structure has three physical components:
i. Indexes
ii. Prime data storage area
iii. Overflow area

3. Hashing structure - employs an algorithm that converts the primary key of a record directly into a storage address. The principal advantage is speed. The significant disadvantages are that the technique does not use storage space efficiently, and that different record keys may generate the same residual, which translates into the same address (a collision).

4. Pointer structure - records in this type of file are spread over the entire disk without concern for their physical proximity to other related records.

Types of pointers:
1. Physical address pointer - contains the actual disk storage location needed by the disk controller.
2. Relative address pointer - contains the relative position of a record in the file.
3. Logical key pointer - contains the primary key of the related record.
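The direct access structures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a list stands in for the disk, list positions stand in for physical addresses, and the record keys, the divisor 997, the record length, and the file start offset are all invented for the example.

```python
# Hypothetical data file: a list stands in for disk storage, and list
# positions stand in for physical addresses.
data_file = [
    {"key": 1876, "name": "Washer"},
    {"key": 1956, "name": "Gasket"},
    {"key": 2219, "name": "Flange"},
    {"key": 1421, "name": "Bracket"},
]

# Indexed structure: a separate lookup from primary key to storage location.
index = {record["key"]: position for position, record in enumerate(data_file)}

def indexed_read(key):
    """Find the location in the index, then read that one record directly."""
    position = index.get(key)
    return None if position is None else data_file[position]

# Hashing structure: the address is computed from the key itself.
PRIME = 997  # assumed divisor, chosen only for this example

def hash_address(primary_key):
    """Division-remainder hashing: the residual becomes the storage address."""
    return primary_key % PRIME

# Pointer structure: a relative address pointer must be converted into an
# actual address before the record can be read.
RECORD_LENGTH = 128  # bytes per fixed-length record (assumed)
FILE_START = 4096    # byte offset where the file begins (assumed)

def resolve_relative(relative_position):
    """Convert a relative pointer (1 = first record) into a byte address."""
    return FILE_START + (relative_position - 1) * RECORD_LENGTH

print(indexed_read(1421)["name"])                # Bracket
print(hash_address(15943), hash_address(16940))  # 988 988 (a collision)
print(resolve_relative(135))                     # 21248
```

The two keys that hash to 988 show the collision problem noted for the hashing structure: different record keys produce the same residual and therefore the same address.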

Hierarchical and Network Database Structures

Early hierarchical and network database models employed many of the preceding flat-file techniques as well as new proprietary database structures. A major difference between the two approaches is the degree of process integration and data sharing that can be achieved. Two-dimensional flat files exist as independent data structures that are not linked logically or physically to other files.

Relational Database Structure, Concepts and Terminology

Indexed sequential file structure - facilitates both direct access to individual records and batch processing of the entire file.

Inverted list – allows even more flexible access to data.

Relational Database Theory

E. F. Codd originally proposed the principles of the relational model in the late 1960s. The formal model has its foundations in relational algebra and set theory, which provide the theoretical basis for most of the data manipulation operations used. Accordingly, a system is relational if it:

1. Represents data in the form of two-dimensional tables
2. Supports the relational algebra functions of restrict, project, and join

(a) Restrict – extracts specified rows from a specified table. This operation creates a virtual table (one that does not physically exist) that is a subset of the original table.
(b) Project – extracts specified attributes (columns) from a table to create a virtual table.
(c) Join – builds a new physical table from two tables, consisting of all concatenated pairs of rows from each table.

Although restrict, project, and join are not the complete set of relational functions, they form a useful subset that satisfies most business information needs.
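A minimal sketch of the three functions, assuming each table is held as a Python list of dictionaries. The function names follow the terms above; the second customer and the extra invoices are made up for illustration, and the join shown pairs rows that agree on a shared key column, which is one common reading of "concatenated pairs of rows".

```python
def restrict(table, predicate):
    """Return the rows of the table for which predicate(row) is true."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Return only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in table]

def join(left, right, key):
    """Concatenate each pair of rows that agree on the key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Illustrative data; only customer 1765 and invoice 16459 come from the notes.
customers = [
    {"cust_no": 1765, "name": "ABC Associates"},
    {"cust_no": 1801, "name": "XYZ Supply"},
]
invoices = [
    {"invoice_no": 16459, "cust_no": 1765, "total": 586.50},
    {"invoice_no": 16460, "cust_no": 1801, "total": 120.00},
    {"invoice_no": 16461, "cust_no": 1765, "total": 42.50},
]

large = restrict(invoices, lambda row: row["total"] > 100)  # rows
numbers = project(large, ["invoice_no"])                    # columns
joined = join(customers, invoices, "cust_no")               # combined table

print(numbers)      # [{'invoice_no': 16459}, {'invoice_no': 16460}]
print(len(joined))  # 3
```

Restrict and project return new lists without altering the source tables, which mirrors the idea of a virtual table derived from the stored one.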

Figure 1 Indexed Sequential File

Relational Database Concepts

In this section, we review basic concepts, terminology, and techniques common to relational database systems.

Entity, Occurrence, and Attributes

Entity – anything about which the organization wishes to capture data
- May be physical (e.g., inventories, customers, or employees) or conceptual (e.g., sales to a customer, accounts receivable, or accounts payable)

Data model – the blueprint for ultimately creating the physical database

Entity relationship (ER) diagram – the graphical representation used to depict the model

Occurrence – used to describe the number of instances or records that pertain to a specific entity

Attributes – the data elements that define an entity
- The logical and relevant characteristics of an entity; they are unique to it

Association and Cardinality

Association – the labeled line connecting two entities in a data model
- Represented with a verb, such as ships, requests, or receives

Cardinality – the degree of association between two entities
- Describes the number of possible occurrences in one table that are associated with a single occurrence in a related table

Example 1 (1:1) shows that for every occurrence (record) in the Employee entity, there is a possibility of zero or one occurrence in the Company Car entity. When determining the cardinality values in an entity association, select a single occurrence (record) of one entity and answer the following question: What are the minimum and maximum number of records that may be associated with the single record that has been selected?

Figure 2 Data Model Using an Entity Relationship Diagram

Example 2 (1:1) illustrates a situation in which each record in one entity is always associated with one (and only one) record in the associated entity. Two short lines intersecting the connecting line at both ends depict this cardinality.

Example 3 (1:M) presents the relationship between the Customer and Sales Order entities. The crow’s foot symbol (which gives this form of notation its name) depicts the many cardinality at the Sales Order end of the association line.

Example 4 (1:M) represents a situation in which each specific item of inventory is supplied by one and only one vendor, and each vendor supplies one or many different inventory items to the company.

Example 5 (M:M) illustrates the many-to-many association.

Examples 4 and 5 demonstrate how cardinality reflects the business rules in place within an organization. The database designer must obtain a thorough understanding of how the client company and specified users conduct business to properly design the data model. If the data model is wrong, the resulting database tables will also be wrong.

Figure 3 Examples of Entity Associations

Alternative Cardinality Notations – an alternative method is to write the cardinality values on each end of the association line connecting the two entities. Some database designers explicitly show both the upper and lower cardinality values. Some choose a shorthand version that notes only the upper cardinality.

The Physical Database Tables

Physical database tables are constructed from the data model, with each entity in the model being transformed into a separate physical table. Across the top of each table are attributes forming columns. Intersecting the columns to form the rows of the table are tuples.

Tuple – corresponds approximately to a record in a flat-file system

Properly designed tables possess the following four characteristics:
1. The value of at least one attribute in each occurrence (row) must be unique. This attribute is the primary key. The values of the other (nonkey) attributes in the row need not be unique.
2. All attribute values in any column must be of the same class.
3. Each column in a given table must be uniquely named. However, different tables may contain columns with the same name.
4. Tables must conform to the rules of normalization. This means they must be free from structural dependencies, including repeating groups, partial dependencies, and transitive dependencies.

Linkages between Relational Tables

Logically related tables need to be physically connected to achieve the associations described in the data model. Using foreign keys accomplishes this. The nature of the association between the related tables determines the method used for assigning foreign keys.

Figure 1 Linkages between Relational Tables

User views

User view – the set of data that a particular user sees.

Anomalies, Structural Dependencies, and Data Normalization

This section deals with why database tables need to be normalized. In other words, why is it necessary for the organization’s database to form an elaborate network of normalized tables linked together? Why, instead, can we not simply consolidate the views of one user (or several) into a single common table from which all data needs may be met?

Database Anomalies

The answer to the above questions is that improperly normalized tables can cause DBMS processing problems that restrict, or even deny, users access to the information they need. Such tables exhibit negative operational symptoms called anomalies. One or more of these anomalies will exist in tables that are not normalized or are normalized at a low level, such as first normal form (1NF) or second normal form (2NF). To be free of anomalies, tables must be normalized to the third normal form (3NF) level.

Update Anomaly – results from data redundancy in an unnormalized table.

Insertion Anomaly – to demonstrate the effects of the insertion anomaly, assume that a new vendor has entered the marketplace. The organization does not yet purchase from the vendor but may wish to do so in the future. In the meantime, the organization wants to add the vendor to the database. This is not possible, however, because the primary key for the Inventory table is PART NUM. Because the vendor does not supply the organization with any inventory items, the supplier data cannot be added to the table.

Deletion Anomaly – involves the unintentional deletion of data from a table.
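The three anomalies can be reproduced with a small unnormalized table. The sketch below uses Python's built-in sqlite3 module; the table layout, part numbers, and vendor names are hypothetical and chosen only to trigger each symptom.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unnormalized design: vendor data is stored inside the inventory table,
# whose primary key is the part number.
conn.execute(
    """CREATE TABLE inventory (
           part_num    TEXT NOT NULL PRIMARY KEY,
           description TEXT,
           vendor_name TEXT,
           vendor_city TEXT)"""
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [("P1", "Bracket", "Ozment Supply", "Tulsa"),
     ("P2", "Gasket", "Ozment Supply", "Tulsa"),
     ("P3", "Flange", "Buell Co.", "Dallas")],
)

# Update anomaly: the vendor's city is repeated once per part, so changing
# one row leaves the other copy stale.
conn.execute("UPDATE inventory SET vendor_city = 'Austin' WHERE part_num = 'P1'")
cities = [row[0] for row in conn.execute(
    "SELECT DISTINCT vendor_city FROM inventory "
    "WHERE vendor_name = 'Ozment Supply' ORDER BY vendor_city")]

# Insertion anomaly: a vendor that supplies no parts has no part number,
# so there is no primary key value under which to store it.
try:
    conn.execute("INSERT INTO inventory VALUES (NULL, NULL, 'New Vendor', 'Reno')")
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# Deletion anomaly: removing a vendor's only part also removes the vendor.
conn.execute("DELETE FROM inventory WHERE part_num = 'P3'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM inventory WHERE vendor_name = 'Buell Co.'").fetchone()[0]

print(cities)           # ['Austin', 'Tulsa']
print(insert_rejected)  # True
print(remaining)        # 0
```

Moving the vendor data into its own table, keyed by a vendor number, would remove all three symptoms, which is the motivation for normalization.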

Designing Relational Databases

Six phases of database design (view modeling):
1. Identify entities
2. Construct a data model showing entity associations
3. Add primary keys and attributes to the model
4. Normalize the data model and add foreign keys
5. Construct the physical database
6. Prepare the user views

Identify Entities

- View modeling begins by identifying the primary entities of the business function. Entities are things about which the organization wishes to capture data.

- Entities are represented as nouns in a system description. To pass as valid entities, two conditions need to be met:
Condition 1. An entity must consist of two or more occurrences.
Condition 2. An entity must contribute at least one attribute that is not provided through other entities.

Construct a Data Model Showing Entity Associations

The next step in view modeling is to determine the associations between entities and document them with an ER diagram. Sometimes the association between entities is not apparent because different rules may apply in different organizations.

If the database is to function properly, its designers need to understand the organization’s business rules as well as the specific needs of individual users.

Add Primary Keys and Attributes to the Model

The analyst should select a primary key that logically defines the nonkey attributes and uniquely identifies each occurrence in the entity.

Every attribute in an entity should appear directly or indirectly (as a calculated value) in one or more user views. Entity attributes are, therefore, originally derived and modeled from user views. In other words, if stored data are not used in a document, report, or calculation that is reported in some way, then they serve no purpose and should not be part of the database.

Normalize the Data Model and Add Foreign Keys

The normalization issues that needed resolution are outlined as follows:
- Repeating group data in purchase order
- Repeating group data in receiving report
- Transitive dependencies

Construct the Physical Database

This is an involved step that must be carefully planned and executed and may take many months in a large installation. Programs will need to be written to transfer organization data currently stored in flat files or legacy databases to the new relational tables.

Data currently stored on paper documents may need to be entered into the database tables manually. Once this is done, the physical user views can be produced.

Prepare the User Views

A report program is used to make the view visually attractive and easy to use. Column headings can be added, fields summed, and averages calculated to produce a hard-copy or computer screen report that resembles the original user report.

The report program can suppress unnecessary data from the view, such as duplicated fields and the key values in the Inventory/vendor link table. These keys are necessary to build the view, but are not needed in the actual report.

Global View Integration

A modern company, however, would need hundreds or thousands of views and associated tables. To facilitate the task, modern Enterprise Resource Planning (ERP) systems come equipped with a core schema, normalized tables, and view templates.

View integration - combining the data needs of all users into a single schema or enterprise-wide view. This is a daunting undertaking when creating the entire database from scratch.

The database designer must produce a set of integrated tables that are free of the update, insert, and deletion anomalies and sufficiently rich to serve the needs of all users.

Embedded Audit Module

The objective of the embedded audit module (EAM), also known as continuous auditing, is to identify important transactions while they are being processed and extract copies of them in real time.

An EAM is a specially programmed module embedded in a host application to capture predetermined transaction types for subsequent analysis.

Disadvantages of EAMs

Operational efficiency - From the user’s point of view, EAMs decrease operational performance. The presence of an audit module within the host application may create significant overhead, especially when the amount of testing is extensive. One approach for relieving this burden from the system is to design modules that may be turned on and off by the auditor. Doing so will, of course, reduce the effectiveness of the EAM as an ongoing audit tool.

Verifying EAM integrity - The EAM approach may not be a viable audit technique in environments with a high level of program maintenance. When host applications undergo frequent changes, the EAMs embedded within the hosts will also require frequent modifications. Because the integrity of the EAM directly affects the quality of the audit process, auditors must evaluate the EAM’s integrity.
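To make the idea concrete, here is a rough Python sketch of an audit hook embedded in a host processing loop. The class, the threshold of 5,000, and the sales data are assumptions made for illustration and do not represent any particular product. The enabled flag corresponds to the on/off switch discussed under operational efficiency.

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddedAuditModule:
    """Captures copies of transactions that meet the auditor's criteria."""
    threshold: float               # hypothetical selection criterion
    enabled: bool = True           # switch the auditor can turn off
    audit_file: list = field(default_factory=list)

    def inspect(self, transaction):
        if self.enabled and transaction["amount"] >= self.threshold:
            # Store a copy so later changes in the host do not alter evidence.
            self.audit_file.append(dict(transaction))

def process_sales(transactions, eam):
    """Stand-in for the host application's normal processing loop."""
    posted = []
    for txn in transactions:
        eam.inspect(txn)           # audit hook runs as each item is processed
        posted.append(txn)
    return posted

eam = EmbeddedAuditModule(threshold=5000.0)
sales = [
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 7200.0},
    {"id": 3, "amount": 5000.0},
]
posted = process_sales(sales, eam)
captured_ids = [txn["id"] for txn in eam.audit_file]
print(captured_ids)  # [2, 3]
```

Every transaction passes through the hook, which is the source of the overhead described above. Setting enabled to False removes the capture step but also leaves a gap in audit coverage.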

Generalized Audit Software

Generalized audit software (GAS) is the most widely used CAATT for IS auditing. GAS allows auditors to access electronically coded data files and perform various operations on their contents. Some of the more common uses for GAS include:
- Footing and balancing entire files or selected data items
- Selecting and reporting detailed data contained in files
- Selecting stratified statistical samples from data files
- Formatting results of tests into reports
- Printing confirmations in either standardized or special wording
- Screening data and selectively including or excluding items
- Comparing multiple files and identifying any differences
- Recalculating data fields
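Two of these uses, footing and recalculating, can be sketched in plain Python. This is an illustration rather than the behavior of any specific GAS product; the field names are assumed, and the amounts are borrowed from the sales invoice example later in these notes.

```python
from decimal import Decimal

# Hypothetical extract of a file under audit.
inventory = [
    {"item": "R234", "qty": 2, "unit_cost": Decimal("42.50"), "extended": Decimal("85.00")},
    {"item": "M456", "qty": 1, "unit_cost": Decimal("16.50"), "extended": Decimal("16.50")},
    {"item": "W62", "qty": 1, "unit_cost": Decimal("485.00"), "extended": Decimal("485.00")},
]
control_total = Decimal("586.50")  # total reported by the client (assumed)

# Footing and balancing: total a field and compare it with the control total.
footed = sum((row["extended"] for row in inventory), Decimal("0"))
balanced = footed == control_total

# Recalculating: recompute a derived field and list rows that disagree.
exceptions = [
    row["item"]
    for row in inventory
    if row["qty"] * row["unit_cost"] != row["extended"]
]

print(footed, balanced)  # 586.50 True
print(exceptions)        # []
```

Decimal is used instead of float so that currency comparisons are exact.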

Using GAS to Access Simple Structures

The GAS selects the sample records and prepares a report containing the needed information.

Using GAS to Access Complex Structures

Gaining access to complex structures, such as a hashed file or other form of random file, may pose a problem for the auditor. Not all GAS products on the market may be capable of accessing every type of file structure. Most DBMSs have utility features that can be used to reformat complex structures into flat files suitable for this purpose.

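[Figure: Using GAS to access file structures. Simple structure labels: Production inventory file, GAS, Transaction list. Complex structure labels: Database, DBMS utility program, Flat file, GAS, Transaction file.]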

Audit Issues Pertaining to the Creation of Flat Files

The auditor must rely on computer services personnel to produce a flat file from the complex file structure. There is a risk that data integrity will be compromised by the procedure used to create the flat file.

ACL (Audit Command Language) Software

Data Definition

ACL uses the data definition feature to read source files. To create a data definition, the auditor needs to know both where the source file physically resides and its field structure layout. Small files can be imported via text files or spreadsheets. Very large files may need to be accessed directly from the mainframe computer. The data definition also allows the auditor to define important characteristics of the source file, including overall record length, the name given to each field, the type of data contained in each field, and the starting point and length of each field in the file.
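The elements of a data definition (record length, field names, field types, starting points, and lengths) can be illustrated by parsing a fixed-width record in Python. This is not ACL syntax; the layout, field names, and sample record are hypothetical.

```python
# Hypothetical layout: (field name, type, start position, length).
# Start positions are 1-based, as they usually appear in a record layout.
LAYOUT = [
    ("item_no", str, 1, 6),
    ("description", str, 7, 15),
    ("qty", int, 22, 5),
    ("unit_cost", float, 27, 8),
]
RECORD_LENGTH = 34  # overall record length implied by the layout above

def parse_record(record):
    """Slice one fixed-width record into typed fields using the layout."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(record)}")
    row = {}
    for name, convert, start, length in LAYOUT:
        raw = record[start - 1:start - 1 + length]
        row[name] = convert(raw.strip())
    return row

# Build a sample record that matches the layout exactly.
line = f"{'R234':<6}{'Bolt cutter':<15}{2:05d}{42.5:8.2f}"
parsed = parse_record(line)
print(parsed)
# {'item_no': 'R234', 'description': 'Bolt cutter', 'qty': 2, 'unit_cost': 42.5}
```

If the layout is wrong by even one position, every later field is misread, which is why the auditor needs an accurate field structure before any testing begins.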

Customizing a View

A view is simply a way of looking at data in a file. Auditors seldom need to use all the data contained in a file. ACL allows the auditor to customize the original view created during data definition to one that better meets his or her audit needs. The auditor can create and reformat new views without changing or deleting the data in the underlying file. Only the presentation of data is affected.

Filtering Data

ACL provides powerful options for filtering data that support various audit tests. Filters are expressions that search for records that meet the filter criteria. ACL’s expression builder allows the auditor to use logical operators such as AND, OR, NOT, and others to define and test conditions of any complexity and to process only those records that match specific conditions.
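A filter of this kind is simply a logical expression applied to every record. The sketch below shows the idea in Python rather than in ACL's own expression language; the records and the condition are invented for the example.

```python
# Hypothetical invoice records.
records = [
    {"invoice": 101, "amount": 1200.0, "region": "East", "approved": True},
    {"invoice": 102, "amount": 300.0, "region": "West", "approved": True},
    {"invoice": 103, "amount": 5000.0, "region": "West", "approved": False},
    {"invoice": 104, "amount": 900.0, "region": "East", "approved": True},
]

def matches(record):
    """Filter: (amount > 1000 AND region is East) OR NOT approved."""
    return (record["amount"] > 1000 and record["region"] == "East") or not record["approved"]

# Only records that satisfy the filter are passed on for further testing.
selected = [record["invoice"] for record in records if matches(record)]
print(selected)  # [101, 103]
```

The underlying list is unchanged; the filter only limits which records are processed.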

Stratifying Data

ACL’s stratifying feature allows the auditor to view the distribution of records that fall into specified strata. Data can be stratified on any numeric field such as sales price, unit cost, quantity sold, and so on. The data are summarized and classified by strata, which can be equal in size (called intervals) or vary in size (called free).
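Equal-size strata (intervals) can be sketched as follows. The function, the unit cost values, and the choice of four intervals between 0 and 100 are assumptions for illustration and are not ACL output.

```python
def stratify(values, minimum, maximum, intervals):
    """Count and total the values that fall in each equal-width stratum."""
    width = (maximum - minimum) / intervals
    strata = [
        {"low": minimum + i * width, "high": minimum + (i + 1) * width,
         "count": 0, "total": 0.0}
        for i in range(intervals)
    ]
    for value in values:
        if minimum <= value <= maximum:
            # The maximum value belongs to the last stratum.
            i = min(int((value - minimum) / width), intervals - 1)
            strata[i]["count"] += 1
            strata[i]["total"] += value
    return strata

unit_costs = [5.0, 12.5, 48.0, 51.0, 75.0, 99.0, 100.0]  # hypothetical field
result = stratify(unit_costs, 0, 100, 4)
counts = [s["count"] for s in result]
totals = [s["total"] for s in result]
print(counts)  # [2, 1, 1, 3]
print(totals)  # [17.5, 48.0, 51.0, 274.0]
```

Free strata would replace the computed width with a list of boundaries supplied by the auditor.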

Statistical Analysis

ACL offers many sampling methods for statistical analysis. Two of the most frequently used are record sampling and monetary unit sampling (MUS). Each method allows random and interval sampling. The choice of method will depend on the auditor’s strategy and the composition of the file being audited. For example, when the records in a file are fairly evenly distributed across strata, the auditor may want an unbiased sample and will thus choose the record sampling approach.
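The difference between the two methods under interval selection can be sketched in Python. The invoice amounts, interval sizes, and starting points below are hypothetical; in practice the starting point would be chosen at random, and this is not a reproduction of ACL's own sampling routines.

```python
def record_interval_sample(records, interval, start):
    """Select every interval-th record, beginning at index start."""
    return records[start::interval]

def monetary_unit_sample(records, interval, start):
    """Select the record that contains every interval-th monetary unit."""
    selected, cumulative, target = [], 0, start
    for record in records:
        cumulative += record["amount"]
        if cumulative >= target:
            selected.append(record)
            # Skip any further targets that fall inside this same record.
            while target <= cumulative:
                target += interval
    return selected

amounts = [100, 50, 4000, 75, 200, 25, 3000, 150]  # whole currency units
invoices = [{"id": i, "amount": a} for i, a in enumerate(amounts, start=1)]

by_record = [r["id"] for r in record_interval_sample(invoices, interval=4, start=1)]
by_value = [r["id"] for r in monetary_unit_sample(invoices, interval=2000, start=1000)]
print(by_record)  # [2, 6]
print(by_value)   # [3, 7]
```

Record sampling gives every record the same chance of selection, while MUS gives every monetary unit the same chance, so the two large invoices are the ones picked by value.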

Normalizing Tables in a Relational Database

The database anomalies are symptoms of structural problems within tables called dependencies. Specifically, these are known as repeating groups, partial dependencies, and transitive dependencies.

The normalization process involves systematically identifying and removing these dependencies from the tables under review. The figure below graphically illustrates the unnormalized table’s progression toward 3NF as each type of dependency is resolved. Tables in 3NF will be free of anomalies and will meet two conditions:
1. All nonkey attributes will be wholly and uniquely dependent on (defined by) the primary key.
2. None of the nonkey attributes will be dependent on (defined by) other nonkey attributes.

Steps in the Normalization Process:
1. Represent view with a single table
2. Remove repeating groups
3. Remove partial dependencies
4. Remove transitive dependencies

As illustrated in the figure, the process begins with a user view such as an output report, a source document or an input screen. At this point, the view is merely a pictorial representation of a set of data the user will eventually have when the project is completed. To demonstrate the normalization process, we will use the customer sales invoice.

[Figure: progression from User View to Unnormalized Table, Table in First Normal Form (1NF), Table in Second Normal Form (2NF), and Table in Third Normal Form (3NF)]

User View

SALES INVOICE

Invoice Number: 16459
Order Date: 09/22/2009
Shipped Date: 09/27/2009
Shipped Via: UPS

Customer Number: 1765
Customer Name: ABC Associates
Street Address: 132 Elm St.
City: Bethlehem
State: PA
Telephone Number: 610-555-6721

Prod Num  Description      Quantity  Unit Price  Ex. Price
R234      Bolt cutter      2         $42.50      $85.00
M456      Gear puller      1         $16.50      $16.50
W62       Electric welder  1         $485.00     $485.00

Total Due: $586.50

The next step is to represent the view as a single table that contains all of the attributes. Now that we have a base table to work from, the next few steps in the normalization process involve identifying and, if necessary, eliminating the structural dependencies that exist. If dependencies exist, correcting them will involve splitting the original single-table structure into two or more smaller and independent 3NF tables.

Unnormalized Table Supporting User View

InvoiceNo.  OrderDate   ShipdDate   ShipdVia  TotalDue  CustNo.  CustName   StreetAddress  City       State  TelNo.  ProdNo.  Description      Qty.  UnitPrice  ExPrice
16459       09/22/2009  09/27/2009  UPS       $586.50   1765     ABC Assoc  132 Elm St.    Bethlehem  PA     610     R234     Bolt cutter      2     $42.50     $85.00
16459       09/22/2009  09/27/2009  UPS       $586.50   1765     ABC Assoc  132 Elm St.    Bethlehem  PA     610     M456     Gear puller      1     $16.50     $16.50
16459       09/22/2009  09/27/2009  UPS       $586.50   1765     ABC Assoc  132 Elm St.    Bethlehem  PA     610     W62      Electric welder  1     $485.00    $485.00

Remove Repeating Group Data

The first step in correcting structural dependencies is to determine if the table under review contains repeating groups. Repeating group data is the existence of multiple values for a particular attribute in a specific record. To avoid such data redundancy, the repeating group data need to be removed from the table and placed in a separate table.

Resulting Tables after Removing Repeating Group Data

Single-Table Structure for Sales Invoice: InvoiceNo., OrderDate, ShipdDate, ShipdVia, CustNo., CustName, StreetAddress, City, State, TelNo., ProdNo.

Sales Invoice Table: InvoiceNo., OrderDate, ShipdDate, ShipdVia, CustNo., CustName, StreetAddress, City, State, TelNo.

Line Item Table: InvoiceNo., ProdNo.

Remove Partial Dependencies

A partial dependency occurs when one or more nonkey attributes are dependent on (defined by) only part of the primary key, rather than the whole key. This can occur only in tables that have a composite (two or more attribute) primary key.

Resulting Tables after Removing Partial Dependency

Line Item Table, before the split: InvoiceNo., ProdNo.

Line Item Table: InvoiceNo., ProdNo., Qty.

Inventory Table: ProdNo., Description, UnitPrice

Remove Transitive Dependencies

The final step in resolving structural dependencies is to remove transitive dependencies. A transitive dependency occurs in a table where nonkey attributes are dependent on another nonkey attribute and independent of the table’s primary key. We resolve this transitive dependency by splitting out the customer data and placing them in a new table called Customer.

Resulting Tables after Removing Transitive Dependency

Sales Invoice Table, before the split: InvoiceNo., OrderDate, ShipdDate, ShipdVia, CustNo., CustName, StreetAddress, City, State, TelNo.

Sales Invoice Table: InvoiceNo., OrderDate, ShipdDate, ShipdVia

Customer Table: CustNo., CustName, StreetAddress, City, State, TelNo.

Linking the Normalized Tables

At this point, the original single-table structure has been reduced to four normalized but independent tables. To work together, these tables need to be linked via foreign keys. This requires first determining the cardinality (degree of association) between the tables and then assigning the foreign keys.

Linkages Between Normalized Tables

Customer Table
CustNo.  CustName   StreetAddress  City       State  TelNo.
1765     ABC Assoc  132 Elm St.    Bethlehem  PA     610-555-6721

Sales Invoice Table
InvoiceNo.  OrderDate   ShipdDate   ShipdVia  CustNo.
16459       09/22/2009  09/27/2009  UPS       1765

Line Item Table
InvoiceNo.  ProdNo.  Qty.
16459       R234     2
16459       M456     1
16459       W62      1

Inventory Table
ProdNo.  Description      UnitPrice
R234     Bolt cutter      $42.50
M456     Gear puller      $16.50
W62      Electric welder  $485.00
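As a sketch of how the four linked tables work together, the following Python script builds them in an in-memory SQLite database and joins them to rebuild the invoice view. The column names are adapted from the tables above, and storing prices as REAL is a simplification made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign key links
conn.executescript("""
CREATE TABLE customer (
    cust_no INTEGER PRIMARY KEY, cust_name TEXT, street_address TEXT,
    city TEXT, state TEXT, tel_no TEXT);
CREATE TABLE sales_invoice (
    invoice_no INTEGER PRIMARY KEY, order_date TEXT, shipd_date TEXT,
    shipd_via TEXT,
    cust_no INTEGER REFERENCES customer(cust_no));   -- foreign key (1:M)
CREATE TABLE inventory (
    prod_no TEXT PRIMARY KEY, description TEXT, unit_price REAL);
CREATE TABLE line_item (
    invoice_no INTEGER REFERENCES sales_invoice(invoice_no),
    prod_no TEXT REFERENCES inventory(prod_no),
    qty INTEGER,
    PRIMARY KEY (invoice_no, prod_no));              -- composite key
INSERT INTO customer VALUES
    (1765, 'ABC Assoc', '132 Elm St.', 'Bethlehem', 'PA', '610-555-6721');
INSERT INTO sales_invoice VALUES
    (16459, '09/22/2009', '09/27/2009', 'UPS', 1765);
INSERT INTO inventory VALUES
    ('R234', 'Bolt cutter', 42.50), ('M456', 'Gear puller', 16.50),
    ('W62', 'Electric welder', 485.00);
INSERT INTO line_item VALUES
    (16459, 'R234', 2), (16459, 'M456', 1), (16459, 'W62', 1);
""")

# Rebuild the user view; extended price and total due are calculated,
# not stored.
rows = conn.execute("""
    SELECT c.cust_name, li.prod_no, inv.description, li.qty,
           inv.unit_price, li.qty * inv.unit_price AS ex_price
    FROM sales_invoice AS si
    JOIN customer  AS c   ON c.cust_no = si.cust_no
    JOIN line_item AS li  ON li.invoice_no = si.invoice_no
    JOIN inventory AS inv ON inv.prod_no = li.prod_no
    WHERE si.invoice_no = 16459
    ORDER BY ex_price
""").fetchall()
total_due = sum(row[5] for row in rows)

for row in rows:
    print(row)
print("Total due:", total_due)  # Total due: 586.5
```

The computed total matches the Total Due on the original sales invoice, which shows that no information was lost when the single table was split.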
