EIM Intro - Information Architectures -Doc


CRF-RDTE-TR-20091102-01

11/6/2009

Public Distribution | Michael Corsello

CORSELLO RESEARCH FOUNDATION

INFORMATION ARCHITECTURE BASICS

INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT



Abstract

Information Architectures are a key to effective information management within an enterprise.

Management of information is based upon the concepts of architecting the structures and practices for

managing the lifecycle of information and all stages of information handling and processing.





Table of Contents

Abstract
Introduction
    Concepts and Definitions
Information Perspectives
Management Strategies
    Project Orientation
        Benefits
        Costs
    Topic Orientation
        Benefits
        Costs
    Entity Orientation
        Benefits
        Costs
    Combining strategies
Implementation Strategies
    Client Application
    Client Server
    Three-Tier
    N-Tier
Capability Partitioning
    Repositories
        Enterprise Centric
        Application Centric
        Domain Centric
        Entity Centric
    Processing Engines
    User Presentations
Conclusions
Appendices
    References





Introduction

Information architecture is the set of practices and processes used to define the appropriate models and mechanisms for persisting and representing data within an enterprise. Put simply, it addresses how an organization structures its data for the most effective use and reuse throughout the information lifecycle.

Concepts and Definitions

Enterprise - In our context, an enterprise is the collection of organizations within a given business domain that operate together. This corresponds to business specific groups within multiple organizations that share common information. More specifically, it is about how information is structured and managed to facilitate the sharing of data between organizations and divisions within an organization.

Repository - In the context of information architecture, data is stored in repositories, which may be physically implemented as a database (such as Oracle or SQL Server) or using any other persistence mechanism. Modern relational database management systems (RDBMSes) are the most commonly used persistence mechanism, but they are far from optimal and should be evaluated as an option rather than being expected as the norm. A repository includes the persistent store itself and the core software required to interact with that store. In the case of an RDBMS, the repository may be equivalent to the database, as the term implies the software as well (Oracle, for instance, is a software application).

Application - An application is a piece of software that runs on a computer. An application may or may not have a user interface (UI), which may or may not be a graphical user interface (GUI). An RDBMS is an application that does not have a user interface; however, there are applications (such as Toad) that provide a user interface to an RDBMS.

Information Perspectives

The architecture for an enterprise will consist of multiple repositories, each of which contains a subset of the data for that enterprise. This data must be structured based upon some consistent means to facilitate discovery and use.

There are two primary considerations for evaluating the best strategies for enterprise data:

- Management strategy
- Implementation strategy

Prior to discussing each of these strategies, a high-level perspective on the nature of data is critical to selecting optimal strategies.

Data, when placed in context, becomes information. As information is evaluated and absorbed by a person, it becomes knowledge. The more effective the transition from data to knowledge, the greater the value of the data. The more contexts in which a single data element is used, the greater the value of that data. The key to achieving both is an effective model for the data.





A model is a logical representation of a phenomenon in the real world. A data model is a model for a data entity that represents a real-world phenomenon. A model comes in two parts:

- Conceptual model
- Instance model (realization)

The conceptual model is a model that is global to all instances of a class of phenomenon. In simple terms, a conceptual model is how we describe the notion of a thing. An instance model is the representation of one instance of an entity, based upon the conceptual model of the class that instance belongs to. In simple terms, a conceptual model might be the model of Car, whereas the instance model would be the model of My Car, which is described using the vocabulary of the Car model as a template.
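
The Car and My Car distinction can be sketched in code. This is only an illustration: the attribute names (make, model, year) and the sample values are assumptions chosen for the example and do not come from this document.

```python
from dataclasses import dataclass


@dataclass
class Car:
    """Conceptual model: the vocabulary shared by every car.

    The attributes below are hypothetical, picked only for illustration.
    """
    make: str
    model: str
    year: int


# Instance model: one particular car, described using the Car vocabulary.
my_car = Car(make="Honda", model="Civic", year=2008)
```

The class plays the role of the conceptual model, and each object created from it is an instance model that can only be described in the terms the class defines.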

Modeling is a key practice in the structuring, storing and management of information to ensure effective and efficient usage. If a poor model is used, then poor usability and poor return on investment in data will result. The act of modeling itself may result in either higher or lower costs for information technology solutions, but will result in greater value being obtained from the data managed.





Management Strategies

Data management and organization will be based upon some strategy for organizing the data being managed. The selection of an appropriate strategy is not trivial and is based upon the use of the data once persisted. If there are many uses for a data element that span business practices, it may be most efficient for human productivity to adopt multiple strategies and manage the transition of data between the repositories supported by each strategy.

Project Orientation

In a project oriented management strategy, all data is grouped, collected and managed by the project it

is associated with. In this strategy, each organizational project (such as a specific water quality study)

gets a distinct repository or partition (directory in a file system sense). The data generated and used

within that project is stored within the project repository based upon data models used for that project.

Projects may adopt centralized models, or define project specific models for any given data domain.

Benefits

This strategy is good for data collection and general field activity projects. There may be minimal

structure and little need for data manipulation. In this strategy people are responsible for maintaining

the flow of information within their project and must push data out to make it available to others.

People can be highly efficient as they have minimal constraints to slow down human efforts. Only data

that is used by a project needs to be transformed for use.

Costs

Data can become unusable or undiscoverable across projects. There is no central source for a given type

of data. Once data is acquired from another project, significant processing may be required to make it

usable. Data may be in multiple formats which may become unusable over time; each distinct format

would need to be located and transformed to maintain usability over time. Each data set a project needs will have to be transformed to the local model, and this cost is repeated for every project reusing the data. Modeling must be performed for each project to fill any gaps from existing models.





Topic Orientation

A topic oriented management strategy divides repositories by business topic (domain) area. In a topic

oriented strategy, each topic has a unified model that is used within that topic area. If multiple topics

use a common data entity (real world thing), each topic area may have a distinct model for that entity.

For example, a roads topic and a waterways navigation topic may both have a bridge model, but the

models may be entirely different and incompatible.
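
The bridge example can be made concrete with a small sketch. The two classes and all of their field names are hypothetical, invented here to show how two topic areas might describe the same real-world entity with no attributes in common.

```python
from dataclasses import dataclass, fields


@dataclass
class RoadBridge:
    """A bridge as a roads topic might model it (hypothetical fields)."""
    route_name: str
    lane_count: int
    load_limit_tons: float


@dataclass
class WaterwayBridge:
    """The same kind of entity as a waterways navigation topic might model it
    (hypothetical fields)."""
    waterway_name: str
    vertical_clearance_m: float
    horizontal_clearance_m: float


def shared_fields(model_a, model_b):
    """Return the attribute names that both models define."""
    return {f.name for f in fields(model_a)} & {f.name for f in fields(model_b)}


# An empty overlap means data cannot move between the topics without a
# purpose-built translation step.
overlap = shared_fields(RoadBridge, WaterwayBridge)
```

With these assumed fields the overlap is empty, so any exchange between the two topics would need an explicit mapping that someone has to design and maintain.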

Benefits

A topic oriented strategy is good for businesses with very few domains that do not interact with external

organizations. Within this strategy, all projects can interchange data freely as they are based upon

common models with shared repositories by topic. All data for an enterprise within a topic is in a usable

format. Each topic may create models that accepted tools can use directly. Integration data sets may be

created to ensure compatibility between domains by sharing models at overlap points.

Costs

Translation of data across topics may be costly. Integration of data across topics may be impossible if correspondence between the domains is not planned beforehand. Each topic's data may be incompatible with another topic's software tools. Data integration costs may be high, and tend to be continual. Modeling must focus on an entire business topic rather than on commonly used aspects of the topic.

Entity Orientation

Managing data by entities is the most beneficial and most complex form of management. All real-world

entities used within the enterprise must be identified and modeled separately. Each entity then

becomes an atomic data repository that can be shared across all projects and topics due to the common

model. In an entity oriented strategy, the primary goal is to adequately identify the entities and model





Implementation Strategies

Once a management strategy or set of strategies is chosen, an implementation strategy must be selected for each technology implementation within the enterprise.

Client Application

The most straightforward implementation strategy for a software solution is the client application or

thick client. In a client application, the data and processing are local to the software and all operations

occur on the user computer. A common example of this architecture is the traditional word-processing

application such as Microsoft Word or Corel WordPerfect.

The client application approach does not rely on interactions between computers for the software to

function and generally does not have to consider network communications in detail. The client

application works quite well and is the most battle tested form of implementation for software

applications. However, the client application does not provide mechanisms for information sharing and

exchange, nor does it provide for concurrent execution across machines. In general, the client

application excels for single user applications with independent data sets.

Whenever data is global in nature, meaning that multiple users or user communities need to share

and exchange data directly, the client application approach breaks down. Additionally, the client

application approach requires the greatest level of effort to ensure platform compatibility between user

machines (such as Microsoft Windows and Apple Mac OS versions).

Client Server

Moving to multi-user concurrent usage capabilities starts with the addition of a server-based component

to the software solution. The most basic form of this architecture is the client-server strategy. In a

client-server application, there are exactly two deployment components to the overall application, one

runs on the user computer (the client) and the other runs on a remote server. A common example of

this is the database enabled application. In the database enabled application, there is a database server

hosting the data repository. The client then runs an application that interacts with the database over a

network. Probably the most prevalent form of the client-server architecture is a basic web site. In a

client-server web site, the web server hosts all data and performs basic processing to serve user

requests. The web browser running on the client machine requests, retrieves, processes and displays

the content from the server. Through the use of client-side technologies such as JavaScript, Flash, or Silverlight, the client browser can perform final processing and user interaction handling.





The client-server architecture is simple, efficient and effective for basic content but is incomplete for

handling multiple sources of changing data such as in a relational database. Further, client-server

architectures do not scale well for intense processing or large user bases.

Three-Tier

To add the capability for data processing, a three-tier architecture may be used. The three-tier

architecture consists of a client application, a business processing server and a data storage tier. The

three-tier architecture is a common implementation strategy for basic business web applications. The

data tier is typically implemented as a relational database management system (RDBMS) such as Oracle, Microsoft SQL Server, or IBM DB2. The data tier is generally responsible for all data storage and retrieval, with some data processing logic to ensure data integrity. The business processing tier is the web server from the client-server model, with the most data intensive or computationally intensive processing occurring at this tier. The client tier is again the web-browser based content from the client-server model, where all user interaction processing and possibly some business logic processing occurs.

The three-tier architecture has been the workhorse of web applications since the mid-nineties. As the

concurrent user base increases, more web servers may be added to scale out the application. This

architecture also allows for portions of the logic to be moved between tiers (by developers) to increase

performance or reduce wait times. These tier changes allow for adjusting the support and maintenance costs of the applications by moving the processing workload between machines, and potentially to the user's browser at no computational cost to the owner of the site.
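
The division of responsibilities among the three tiers can be sketched in a single process. This is a simplified illustration, not a deployment: an in-memory SQLite database stands in for the RDBMS, plain functions stand in for the business server and the browser, and the table, column and function names are all assumptions made for the example.

```python
import sqlite3


# Data tier: storage and retrieval, with constraints guarding integrity.
def open_data_tier():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sample (site TEXT NOT NULL, ph REAL NOT NULL)")
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [("A", 7.0), ("A", 7.4), ("B", 6.5)],
    )
    return conn


def fetch_ph(conn, site):
    rows = conn.execute("SELECT ph FROM sample WHERE site = ?", (site,)).fetchall()
    return [row[0] for row in rows]


# Business processing tier: the computation lives here, not in the client.
def mean_ph(conn, site):
    values = fetch_ph(conn, site)
    if not values:
        raise ValueError(f"no samples for site {site!r}")
    return sum(values) / len(values)


# Client tier: formatting for display only.
def render(site, value):
    return f"Site {site}: mean pH {value:.2f}"


conn = open_data_tier()
page = render("A", mean_ph(conn, "A"))
```

Because each tier only calls the one beneath it, a function such as `mean_ph` could later be moved to a different machine without touching `render` or the table definition.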

N-Tier

An evolution of the three-tier architecture is the N-tier architecture, where any number (N) of tiers exists to support the application. Technically, the client-server and three-tier architectural strategies are specific cases of the general N-tier approach. Modern web based applications frequently use an N-tier approach, especially where service oriented architectures (SOA) are applied.

In an N-tier implementation, each tier provides some portion of the overall application capability. There may be multiple business processing servers that each provide a unique capability, such as statistical processing of data. Additionally, there may be multiple database servers, each of which contains a specific portion of the total data used in the application. This form of distributing the workload by partitioning specific capability areas is the basis for distributed computing, SOA and cloud computing.

The N-tier approach is overkill for simple applications, but is generally the best approach for all forms of multi-user applications. As businesses attempt to consolidate, the use of N-tier computing allows for high levels of reuse and horizontal integration of capabilities across applications.

Capability Partitioning

When implementing a software system, trade-offs are made to ensure the best performance, scalability and maintainability possible given reasonable cost constraints and the functional capabilities required. To provide any capability, there is a minimum cost and timeline to produce any workable solution. An effective solution will always be in excess of these minimums. Strategies for reuse, integration and partitioning are effective at minimizing realized costs by distributing the costs across implemented applications. The partitioning allows for resource sharing in any of several areas:

- Conceptual reuse, where ideas, designs and algorithms are applied to multiple projects
- Source reuse, where software source code is reused on multiple projects
- Library reuse, where compiled libraries of code are reused as-is on multiple projects
- Hardware reuse, where multiple applications are hosted on a single physical server
- Service reuse, where a software service is reused by multiple applications (such as in SOA)
- Data reuse, where a single authoritative data repository is reused by multiple applications

Each of the forms of reuse can reduce costs in some scenarios, but which reuse strategy is most cost-

effective is project and business domain specific. To ensure that any of these forms of reuse are

realistic, they have to be planned for in the initial stages of the projects producing the reusable product,

and in the projects reusing the product. Accidental reuse can happen, but often includes additional hidden costs for integration when not planned for. Planned reuse will increase initial development costs, but is likely to reduce total lifecycle costs for all parts reused. The greater the number of instances of reuse, the more effective the cost savings for the initial planning investment.

Trade-offs to provide capabilities at reduced costs generally involve partitioning strategies. Each of the primary computing areas for a software application may be subject to partitioning. These primary computing areas are:

- Repositories of data, where the full corpus of data an application will use may be partitioned into domain or entity specific repository models (see Entity Orientation section above).
- Processing engines or capabilities, where the computational portions of an application may be separated into reusable analytical components.
- User presentations or graphical user interfaces (GUIs), which can be partitioned away from an application to ensure there is no business logic associated with the display of information.

Effective choices of what should be partitioned and how to partition those items will greatly influence

performance, reusability and ultimately software lifecycle costs. It is important to realize that most of

the cost for a software application is incurred not during development, but during sustainment.

Repositories

All software is designed to process data in some form. The repository is a conceptual store from which

software will access and process data. There are several strategies for partitioning repositories across

an enterprise:


- Enterprise centric, where the entire enterprise centralizes all data into a single master repository.
- Application centric, where each application has a dedicated repository.
- Domain centric, where each business domain has a dedicated repository that all applications using that domain data must connect to.
- Entity centric, where each entity is modeled and a repository exists for that entity model. All applications using an entity are connected to that entity repository.

Enterprise Centric

The enterprise centric approach of centralizing all data into a master integrated repository is only

effective for small repositories with limited growth. The definition of small in this context is fluid as a

function of cost for providing a hardware infrastructure to support such a repository. As database

vendors become better at supporting single repositories of increasing size, this option becomes more

viable for more organizations. In general, this is not a recommended approach in most circumstances.

With a single repository, performance can improve due to the integrated and common location of all data. However, performance can also degrade due to the overall size of the repository and its increased intrinsic complexity. Also, failures have a higher likelihood of significant impacts with a single repository (the proverbial all eggs in a single basket). Scalability is an issue in the single repository due to current limitations in technology, although advances such as Oracle Real Application Clusters (RAC) may reduce the significance of these limitations. Currently, scalability still makes total centralization generally a sub-optimal choice.





Application Centric

In an application centric partitioning of repositories, each application gets a dedicated repository (often

a relational database). If multiple applications within the enterprise require access to the same data,

that data is maintained in both repositories via a synchronization mechanism. This solution provides the

greatest level of performance for a single application, but comes at the additional cost of data

duplication and issues of consistency for rapidly changing data (synchronization latency).
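
Synchronization latency can be illustrated with a toy sketch. Two dictionaries stand in for two application repositories that hold a copy of the same record; the repository names, the record key and the `synchronize` function are hypothetical and far simpler than any real synchronization mechanism.

```python
# Two application repositories, each holding its own copy of the same record.
billing_repo = {"cust-1": "old address"}
shipping_repo = dict(billing_repo)


def synchronize(source, target):
    """Copy every record from source into target (one-way, full refresh)."""
    target.update(source)


# One application changes the record in its own repository.
billing_repo["cust-1"] = "new address"

# Until the next synchronization run, the other application reads stale data.
stale = shipping_repo["cust-1"]

synchronize(billing_repo, shipping_repo)
fresh = shipping_repo["cust-1"]
```

The window between the update and the call to `synchronize` is the latency described above; the faster the data changes, the more often a reader lands inside that window.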

For small organizations with static data sets, the application centric approach may be quite effective. In

many cases, the application centric approach will be sub-optimal in all areas due to the effort involved in

establishing the data synchronization mechanisms and the cost of data duplication.

Unfortunately, application centric partitioning is the most natural form of partitioning due to the

isolation of repositories for each application. In this form, developers of a solution are able to focus on

the local problem alone. This may lead to lower development costs for an application at the cost of

poorer fit to the business capability required, as developers do not need to fully understand the

problem domain.

Domain Centric

A domain centric partitioning of repositories ensures that each defined business domain within the enterprise has a single master repository for all of that domain's data. This tends to have similar pros and cons to the application centric model, with the exception that all applications developed are required to conform to the existing domain repositories. In this manner, application developers are required to have a reasonable understanding of the business domains affected by the application.

In a domain centric model, each application may be required to communicate with more than one repository. This data integration across repositories may result in the need for additional join elements to be added to repositories to support this integration. Once completed, however, all future applications will be able to reuse the integrated data model.

The key limitation of a domain centric approach is again the potential for data duplication between domains and the need to reconcile the sameness of entities between domains. For example, if two business domains have data on people, and both domains have data on a John Smith, are they the same person or different people with the same name? Further, if the domains each model a person for their own repository (not using an entity oriented modeling approach from prior sections), the person models may be different and thus incompatible.

Entity Centric

Using an entity centric repository approach is the most data efficient approach, and the most costly to design. In the entity centric approach, each data entity has an isolated repository with identified linkages between repositories. For example, there may be a single people repository that contains records of all human beings known throughout the enterprise. This repository will contain people information for employees, customers, suppliers, contractors, etc. Notice, however, that the people repository only contains the information that describes the people notion of those individuals. For an employee, for example, the human resources (HR) portion of their data, including the very fact that they are an employee, is stored in a different repository, not the people repository. This is simultaneously an amazing benefit and a limitation. The modeling of entities drives the repository boundaries, and the integration of data across repositories happens within one of the repositories participating in the integration. For our HR example, there may be a staff repository that contains all HR relevant aspects of employees and their reporting chains, but it is the staff repository that contains the references to the people repository.
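
The people and staff example can be sketched as two separate stores linked by a reference. This is a minimal illustration in which dictionaries stand in for repositories; the field names, identifiers and sample records are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Record in the people repository: only what describes a person."""
    person_id: int
    name: str


@dataclass(frozen=True)
class StaffRecord:
    """Record in the staff repository: HR data plus references to people."""
    person_id: int                # reference into the people repository
    job_title: str
    manager_id: Optional[int]     # person_id of the manager, if any


# Each entity gets its own repository (plain dicts here, for illustration).
people_repo = {1: Person(1, "John Smith"), 2: Person(2, "Ann Lee")}
staff_repo = {1: StaffRecord(1, "Analyst", manager_id=2)}


def is_employee(person_id):
    # The fact of employment lives in the staff repository only.
    return person_id in staff_repo


def staff_view(person_id):
    # Integration happens on the staff side, which follows its reference.
    staff = staff_repo[person_id]
    person = people_repo[staff.person_id]
    return f"{person.name}, {staff.job_title}"
```

Note that `Person` carries no employment information at all, so the people repository can serve customers and suppliers unchanged, while the staff repository owns the link.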





The separation of data entities into isolated repositories improves security, as each type of data is

stored in isolation of all other data, and permits the use of entity specific services to interact with the

data elements. In general, entity based partitioning is the most effective form of partitioning when

done properly. Proper partitioning is characterized by identifying the proper granularity of what

distinguishes one entity type from another (which is a topic beyond the scope of this document).

Processing Engines

A large part of computationally intensive applications involves generic processing functions. The

separation of processing capabilities into reusable structures can yield great cost savings in multiple

areas including long-term supportability. The primary cost associated with these reusable structures is

designing them up front for reuse. The isolation of these computational units can take on different

forms:

- Reusable libraries, such as a statistical analysis library containing generic routines for statistical functions, are probably the most common form of reuse. These functions are generically reusable to any applications referencing the libraries and are often available from commercial vendors.
- Reusable frameworks provide a collection of common capabilities that may be reused across applications. Many of these frameworks are available commercially for specific purposes such as aspect oriented computing.
- Reusable services, such as data processing web services, are increasing in availability and form the basis of most SOA implementations. These services often use co-located data to increase performance.





In developing software for an enterprise, all three of the above partitioning strategies may be used

together for maximal effectiveness and operational longevity.

User Presentations

The user interface of an application is responsible for the presentation of data and controls to the

application user. This portion of an application is often called the presentation layer and ideally

contains no functional logic for the application. When properly designed and partitioned, the

application graphical user interface (GUI) is completely independent of the functional portion of the

application. Further, the GUI itself can be partitioned into components that can be reused across

applications.

There are two different aspects of the GUI that can be partitioned for an application:

- Partitioning of the GUI from the capability logic
- Partitioning of the GUI itself into GUI components

The most significant area of reuse comes from the first area of separating the GUI from all business

logic. This should be the default development pattern for application development, and has been

taught commonly in computer science courses for years. Unfortunately, this is often not followed, in an

effort to reduce development time and the need for planning the separation of the GUI from the logic.

When the GUI is separated from all other logic, both the processing logic and the data access code (which interacts with data repositories) can be reused across applications. Moreover, by separating the GUI from application logic, multiple distinct GUIs can be built for any given capability. At this level, an application can be constructed by placing a new GUI on top of existing logic and data access libraries. This can drastically reduce development time and standardizes the capabilities of new applications.
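
The idea of placing more than one presentation on top of the same logic can be sketched briefly. Text and JSON renderers stand in for two distinct GUIs here, and every function name is a hypothetical chosen for the example.

```python
import json


def summarize(values):
    """Capability logic: knows nothing about how results are displayed."""
    return {"count": len(values), "total": sum(values)}


def render_text(summary):
    """First presentation: a line of text for a console or report."""
    return f"{summary['count']} items, total {summary['total']}"


def render_json(summary):
    """Second presentation: JSON for a browser-based front end."""
    return json.dumps(summary, sort_keys=True)


summary = summarize([2, 3, 5])
text_view = render_text(summary)
json_view = render_json(summary)
```

Neither renderer computes anything, so adding a third presentation would require no change to `summarize`.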

Creating GUI components has the additional benefit of permitting all applications that use the component to have a common set of controls and, thereby, common workflows. Within an organization, if all applications are developed using common controls, users of those applications will need minimal training when moving between applications. Identifying the boundary for developing a GUI component is the difficult part, which requires planning throughout the design of the application. A properly defined GUI component will operate upon a well-defined entity in a standard, well-defined manner. This component will provide access to data, proper controls (such as buttons, lists and drop-downs) and consistent behavior. In addition, if the data repositories are well-defined for an entity (Entity Centric partitioning), the GUI control should match the repository boundary for better stability over time.

Conclusions

Architecting information solutions for an organization is a complex set of practices and trade-offs to maximize capabilities while minimizing cost. Given that information solutions take a great deal of time and care to construct, proper planning is required well in advance of need to ensure solutions are available by the time the need arises without wasted efforts.

Various strategies exist for planning information repositories, software implementations and user facing

applications. Planning for reuse of repositories and software back-end components and services is of

great importance. Stakeholders involved with information strategies need to understand the difference

between the data repositories containing data, back-end software processing data and the user

interfaces that present data and processing. The separation of these concepts in the minds of those

involved in planning can yield great results in long-term cost savings and capabilities realized.

Appendices

References
