8/9/2019 EIM Intro - Information Architectures -Doc
CRF-RDTE-TR-20091102-01
11/6/2009
Public Distribution| Michael Corsello
CORSELLO
RESEARCH
FOUNDATION
INFORMATION ARCHITECTURE BASICS
INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT
Corsello Research Foundation
Public Distribution CRF-RDTE-TR-20091102-01
Abstract
Information Architectures are a key to effective information management within an enterprise.
Management of information is based upon the concepts of architecting the structures and practices for
managing the lifecycle of information and all stages of information handling and processing.
Table of Contents
Abstract
Introduction
  Concepts and Definitions
Information Perspectives
Management Strategies
  Project Orientation
    Benefits
    Costs
  Topic Orientation
    Benefits
    Costs
  Entity Orientation
    Benefits
    Costs
  Combining strategies
Implementation Strategies
  Client Application
  Client Server
  Three-Tier
  N-Tier
Capability Partitioning
  Repositories
    Enterprise Centric
    Application Centric
    Domain Centric
    Entity Centric
  Processing Engines
  User Presentations
Conclusions
Appendices
  References
Introduction
Information architecture is the set of practices and processes used to define the appropriate models and
mechanisms for persisting and representing data within an enterprise. Put simply, it addresses how an
organization structures its data for the most effective use and reuse throughout the information lifecycle.
Concepts and Definitions
Enterprise - In our context, an enterprise is the collection of organizations within a given business
domain that operate together. This corresponds to business specific groups within multiple
organizations that share common information. More specifically, it is about how information is
structured and managed to facilitate the sharing of data between organizations and divisions within an
organization.
Repository - In the context of information architecture, data is stored in repositories, which may be
physically implemented as a database (such as Oracle or SQL Server), or using any other persistence
mechanism. Modern relational database management systems (RDBMSes) are the most commonly
used persistence mechanism, but they are far from optimal and should be evaluated as an option rather
than being expected as the norm. A repository includes the persistent store itself and the core software
required to interact with that store. In the case of an RDBMS, the repository may be equivalent to the
database, as the term implies the software as well (Oracle, for instance, is a software application).
Application - An application is a piece of software that runs on a computer. An application may or may
not have a user interface (UI), which may or may not be a graphical user interface (GUI). An RDBMS is
an application that does not have a user interface; however, there are applications (such as Toad) that
provide a user interface to an RDBMS.
Information Perspectives
The architecture for an enterprise will consist of multiple repositories, each of which contains a subset
of the data for that enterprise. This data must be structured based upon some consistent means to
facilitate discovery and use.
There are two primary considerations for evaluating the best strategies for enterprise data:
- Management strategy
- Implementation strategy
Prior to discussing each of these strategies, a high-level perspective on the nature of data is critical to
selecting optimal strategies.
Data, when placed in context, becomes information. As information is evaluated and absorbed by a
person, it becomes knowledge. The more effective the transition from data to knowledge, the
greater the value of the data. The more contexts a single data element is used in, the greater the value
of that data. The key to achieving both is an effective model for the data.
A model is a logical representation of a phenomenon in the real world. A data model is a model for a
data entity that represents a real-world phenomenon. A model comes in two parts:
- Conceptual model
- Instance model (realization)
The conceptual model is global to all instances of a class of phenomena. In simple terms, a conceptual
model is how we describe the notion of a thing. An instance model is the representation of one instance
of an entity, based upon the conceptual model of the class that the instance belongs to. For example, a
conceptual model might be the model of Car, whereas the instance model would be the model of My Car,
which is described using the vocabulary of the Car model as a template.
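The distinction between a conceptual model and an instance model can be sketched in a few lines of Python. This is only an illustration: the attribute names (make, model, year) and the values are assumptions, not part of this report.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Car:
    """Conceptual model: the vocabulary shared by every car (hypothetical fields)."""
    make: str
    model: str
    year: int


# Instance model: one real-world car described with the Car vocabulary.
my_car = Car(make="Honda", model="Civic", year=2008)

# The conceptual model supplies the attribute names; the instance supplies the values.
vocabulary = [f.name for f in fields(Car)]
description = {name: getattr(my_car, name) for name in vocabulary}
```

Here Car plays the role of the conceptual model and my_car is one realization of it; any other car would reuse the same vocabulary with different values.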
Modeling is a key practice in the structuring, storing and management of information to ensure effective
and efficient usage. If a poor model is used, then poor usability and a poor return on investment in data
will result. The act of modeling itself may result in either higher or lower costs for information technology
solutions, but will result in greater value being obtained from the data managed.
Management Strategies
Data management and organization will be based upon some strategy for organizing the data being
managed. The selection of an appropriate strategy is not trivial and is based upon the use of the data
once persisted. If there are many uses for a data element that span business practices, it may be most
efficient for human productivity to adopt multiple strategies and manage the transition of data between
the repositories supported by each strategy.
Project Orientation
In a project oriented management strategy, all data is grouped, collected and managed by the project it
is associated with. In this strategy, each organizational project (such as a specific water quality study)
gets a distinct repository or partition (directory in a file system sense). The data generated and used
within that project is stored within the project repository based upon data models used for that project.
Projects may adopt centralized models, or define project specific models for any given data domain.
Benefits
This strategy is good for data collection and general field activity projects. There may be minimal
structure and little need for data manipulation. In this strategy people are responsible for maintaining
the flow of information within their project and must push data out to make it available to others.
People can be highly efficient as they have minimal constraints to slow down human efforts. Only data
that is used by a project needs to be transformed for use.
Costs
Data can become unusable or undiscoverable across projects. There is no central source for a given type
of data. Once data is acquired from another project, significant processing may be required to make it
usable. Data may be in multiple formats which may become unusable over time; each distinct format
would need to be located and transformed to maintain usability over time. Each data set a project
needs will have to be transformed to the local model; this cost is repeated for every project reusing the
data. Modeling must be performed for each project to fill any gaps from existing models.
Topic Orientation
A topic oriented management strategy divides repositories by business topic (domain) area. In a topic
oriented strategy, each topic has a unified model that is used within that topic area. If multiple topics
use a common data entity (real world thing), each topic area may have a distinct model for that entity.
For example, a roads topic and a waterways navigation topic may both have a bridge model, but the
models may be entirely different and incompatible.
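The bridge example can be made concrete with a short sketch. The two classes and all of their fields are hypothetical; they only illustrate how two topics might describe the same real-world structure with almost no overlap.

```python
from dataclasses import dataclass


@dataclass
class RoadBridge:
    """A bridge as a roads topic might model it (hypothetical fields)."""
    name: str
    lane_count: int
    load_limit_tonnes: float


@dataclass
class WaterwayBridge:
    """The same structure as a waterways navigation topic might model it."""
    name: str
    vertical_clearance_m: float
    channel_width_m: float


road_view = RoadBridge("Mill Street Bridge", lane_count=2, load_limit_tonnes=40.0)
water_view = WaterwayBridge("Mill Street Bridge", vertical_clearance_m=6.5, channel_width_m=30.0)

# Only the attributes both models share can be exchanged without translation.
shared = set(vars(road_view)) & set(vars(water_view))
```

In this sketch only the name is shared, so any exchange of bridge data between the two topics would need an explicit translation step.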
Benefits
A topic oriented strategy is good for businesses with very few domains that do not interact with external
organizations. Within this strategy, all projects can interchange data freely as they are based upon
common models with shared repositories by topic. All data for an enterprise within a topic is in a usable
format. Each topic may create models that accepted tools can use directly. Integration data sets may be
created to ensure compatibility between domains by sharing models at overlap points.
Costs
Translation of data across topics may be costly. Integration of data across topics may be impossible if
correspondence between the domains is not planned beforehand. Each topic's data may be incompatible
with another topic's software tools. Data integration costs may be high, and tend to be continual. Modeling
must focus on an entire business topic rather than on commonly used aspects of the topic.
Entity Orientation
Managing data by entities is the most beneficial and most complex form of management. All real-world
entities used within the enterprise must be identified and modeled separately. Each entity then
becomes an atomic data repository that can be shared across all projects and topics due to the common
model. In an entity oriented strategy, the primary goal is to adequately identify the entities and model
Implementation Strategies
Once a management strategy or set of strategies is chosen, an implementation strategy must be
selected for each technology implementation within the enterprise.
Client Application
The most straightforward implementation strategy for a software solution is the client application or
thick client. In a client application, the data and processing are local to the software and all operations
occur on the user computer. A common example of this architecture is the traditional word-processing
application such as Microsoft Word or Corel WordPerfect.
The client application approach does not rely on interactions between computers for the software to
function and generally does not have to consider network communications in detail. The client
application works quite well and is the most battle tested form of implementation for software
applications. However, the client application does not provide mechanisms for information sharing and
exchange, nor does it provide for concurrent execution across machines. In general, the client
application excels for single user applications with independent data sets.
Whenever data is global in nature, meaning that multiple users or user communities need to share
and exchange data directly, the client application approach breaks down. Additionally, the client
application approach requires the greatest level of effort to ensure platform compatibility between user
machines (such as Microsoft Windows and Apple Mac OS versions).
Client Server
Moving to multi-user concurrent usage capabilities starts with the addition of a server-based component
to the software solution. The most basic form of this architecture is the client-server strategy. In a
client-server application, there are exactly two deployment components to the overall application, one
runs on the user computer (the client) and the other runs on a remote server. A common example of
this is the database enabled application. In the database enabled application, there is a database server
hosting the data repository. The client then runs an application that interacts with the database over a
network. Probably the most prevalent form of the client-server architecture is a basic web site. In a
client-server web site, the web server hosts all data and performs basic processing to serve user
requests. The web browser running on the client machine requests, retrieves, processes and displays
the content from the server. Through the use of client-side technologies such as JavaScript, Flash and
Silverlight, the client browser can perform final processing and user interaction handling.
The client-server architecture is simple, efficient and effective for basic content but is incomplete for
handling multiple sources of changing data such as in a relational database. Further, client-server
architectures do not scale well for intense processing or large user bases.
Three-Tier
To add the capability for data processing, a three-tier architecture may be used. The three-tier
architecture consists of a client application, a business processing server and a data storage tier. The
three-tier architecture is a common implementation strategy for basic business web applications. The
data tier is implemented as a relational database management system (RDBMS) such as Oracle,
Microsoft SQL Server or IBM DB2. The data tier is generally responsible for all data storage and
retrieval, with some data processing logic to ensure data integrity. The business processing tier is the
web server from the client-server model, with the most data intensive or computationally intensive
processing occurring at this tier. The client tier is again the web browser based content from the
client-server model, where all user interaction processing and possibly some business logic processing
occurs.
The three-tier architecture has been the workhorse of web applications since the mid-nineties. As the
concurrent user base increases, more web servers may be added to scale out the application. This
architecture also allows for portions of the logic to be moved between tiers (by developers) to increase
performance or reduce wait times. These tier changes allow for adjusting the support and maintenance
costs of the applications by moving the processing workload between machines, and potentially to the
user's browser at no computational cost to the owner of the site.
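The division of responsibilities among the three tiers can be sketched in a single Python file. This is a simplified, in-process illustration only: the table, the sample values and the function names are assumptions, and SQLite stands in for a server-based RDBMS.

```python
import sqlite3


def open_data_tier() -> sqlite3.Connection:
    """Data tier: storage and retrieval, with a constraint guarding data integrity."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples ("
        "site TEXT NOT NULL, "
        "ph REAL NOT NULL CHECK (ph BETWEEN 0 AND 14))"
    )
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [("A", 6.8), ("A", 7.2), ("B", 8.0)],
    )
    return conn


def mean_ph_by_site(conn: sqlite3.Connection) -> dict:
    """Business processing tier: the data intensive computation lives here."""
    rows = conn.execute("SELECT site, AVG(ph) FROM samples GROUP BY site ORDER BY site")
    return {site: round(avg, 2) for site, avg in rows}


def render(report: dict) -> str:
    """Client tier: presentation only, no knowledge of how the data is stored."""
    return "\n".join(f"{site}: mean pH {value}" for site, value in report.items())


conn = open_data_tier()
report = mean_ph_by_site(conn)
page = render(report)
```

Because each tier is reached only through its function, the averaging could be moved from the SQL query into Python (or the reverse) without touching the other tiers, which mirrors the movement of logic between tiers described above.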
N-Tier
An evolution of the three-tier architecture is the N-tier, where any number of tiers (N) exists to
support the application. Technically the client-server and three-tier architectural strategies are specific
subsets of the general N-tier approach. Modern web based applications frequently use an N-tier
approach, especially where service oriented architectures (SOA) are applied.
In an N-tier implementation, each tier provides some portion of the overall application capability. There
may be multiple business processing servers that each provide a unique capability such as statistical
processing of data. Additionally, there may be multiple database servers, each of which contains a
specific portion of the total data used in the application. This form of distributing the workload by
partitioning specific capability areas is the basis for distributed computing, SOA and cloud computing.
The N-tier approach is overkill for simple applications, but is generally the best approach for all forms of
multi-user applications. As businesses attempt to consolidate, the use of N-tier computing allows for
high levels of reuse and horizontal integration of capabilities across applications.
Capability Partitioning
When implementing a software system, trade-offs are made to ensure the best performance, scalability
and maintainability possible given reasonable cost constraints and functional capabilities required. To
provide any capability, there is a minimum cost and timeline to produce any workable solution. An
effective solution will always be in excess of these minimums. Strategies for reuse, integration and
partitioning are effective at minimizing realized costs by distributing the costs across implemented
applications. The partitioning allows for resource sharing in any of several areas:
- Conceptual reuse, where ideas, designs and algorithms are applied to multiple projects
- Source reuse, where software source code is reused on multiple projects
- Library reuse, where compiled libraries of code are reused as-is on multiple projects
- Hardware reuse, where multiple applications are hosted on a single physical server
- Service reuse, where a software service is reused by multiple applications (such as in SOA)
- Data reuse, where a single authoritative data repository is reused by multiple applications
Each of the forms of reuse can reduce costs in some scenarios, but which reuse strategy is most
cost-effective is project and business domain specific. To ensure that any of these forms of reuse is
realistic, it has to be planned for in the initial stages of the projects producing the reusable product,
and in the projects reusing the product. Accidental reuse can happen, but often includes additional
hidden costs for integration when not planned for. Planned reuse will increase initial development
costs, but is likely to reduce total lifecycle costs for all parts reused. The greater the number of instances
of reuse, the more effective the cost savings for the initial planning investment.
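The relationship between the up-front planning cost and the number of reuses can be illustrated with a small break-even calculation. The figures below are invented for illustration, and the model (a fixed saving per reuse) is a simplifying assumption.

```python
import math


def break_even_uses(extra_planning_cost: float, saving_per_reuse: float) -> int:
    """Smallest number of reuses whose combined savings cover the planning cost."""
    return math.ceil(extra_planning_cost / saving_per_reuse)


# Hypothetical figures: 30,000 of extra design effort, 8,000 saved on each reuse.
uses_needed = break_even_uses(extra_planning_cost=30_000, saving_per_reuse=8_000)
```

With these assumed figures the planning investment is recovered on the fourth reuse; every reuse after that is a net saving.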
Trade-offs to provide capabilities at reduced costs generally involve partitioning strategies. Each of the
primary computing areas for a software application may be subject to partitioning. These primary
computing areas are:
- Repositories of data, where the full corpus of data an application will use may be partitioned into
  domain or entity specific repository models (see the Entity Orientation section above).
- Processing engines or capabilities, where the computational portions of an application may be
  separated into reusable analytical components.
- User presentations or graphical user interfaces (GUIs), which can be partitioned away from an
  application to ensure there is no business logic associated with the display of information.
Effective choices of what should be partitioned and how to partition those items will greatly influence
performance, reusability and ultimately software lifecycle costs. It is important to realize that most of
the cost for a software application is incurred not during development, but during sustainment.
Repositories
All software is designed to process data in some form. The repository is a conceptual store from which
software will access and process data. There are several strategies for partitioning repositories across
an enterprise:
- Enterprise centric, where the entire enterprise centralizes all data into a single master repository.
- Application centric, where each application has a dedicated repository.
- Domain centric, where each business domain has a dedicated repository that all applications using
  that domain data must connect to.
- Entity centric, where each entity is modeled and a repository exists for that entity model. All
  applications using an entity are connected to that entity repository.
Enterprise Centric
The enterprise centric approach of centralizing all data into a master integrated repository is only
effective for small repositories with limited growth. The definition of small in this context is fluid as a
function of cost for providing a hardware infrastructure to support such a repository. As database
vendors become better at supporting single repositories of increasing size, this option becomes more
viable for more organizations. In general, this is not a recommended approach in most circumstances.
With a single repository, performance can improve because all data is integrated in a common location.
However, performance can also degrade due to the overall size of the repository and its increased
intrinsic complexity. Also, failures have a higher likelihood of significant impacts with a single repository
(the proverbial all eggs in a single basket). Scalability is an issue in the single repository due to current
limitations in technology, with Oracle Real Application Clusters (RAC) being an example of current
advances that may reduce the significance of these limitations. Currently, scalability still makes total
centralization generally a sub-optimal choice.
Application Centric
In an application centric partitioning of repositories, each application gets a dedicated repository (often
a relational database). If multiple applications within the enterprise require access to the same data,
that data is maintained in both repositories via a synchronization mechanism. This solution provides the
greatest level of performance for a single application, but comes at the additional cost of data
duplication and issues of consistency for rapidly changing data (synchronization latency).
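The consistency issue can be seen in a minimal sketch in which two applications each hold a copy of the same customer record. The repository names, the record layout and the synchronize function are hypothetical; plain dictionaries stand in for the two databases.

```python
# Each application owns its own copy of the shared customer data.
billing_repo = {"cust-1": {"email": "old@example.com"}}
support_repo = {"cust-1": {"email": "old@example.com"}}


def synchronize(source: dict, target: dict) -> list:
    """Copy records that differ from source to target; return the keys changed."""
    changed = [key for key, record in source.items() if target.get(key) != record]
    for key in changed:
        target[key] = dict(source[key])
    return changed


# The billing application updates its copy first.
billing_repo["cust-1"]["email"] = "new@example.com"

# Until the next synchronization run, the support application still sees stale data.
stale_before_sync = support_repo["cust-1"]["email"]

changed = synchronize(billing_repo, support_repo)
```

The window between the update and the synchronization run is the synchronization latency; the faster the data changes, the more often an application reads a stale copy.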
For small organizations with static data sets, the application centric approach may be quite effective. In
many cases, the application centric approach will be sub-optimal in all areas due to the effort involved in
establishing the data synchronization mechanisms and the cost of data duplication.
Unfortunately, application centric partitioning is the most natural form of partitioning due to the
isolation of repositories for each application. In this form, developers of a solution are able to focus on
the local problem alone. This may lead to lower development costs for an application at the cost of
poorer fit to the business capability required, as developers do not need to fully understand the
problem domain.
Domain Centric
A domain centric partitioning of repositories ensures that each defined business domain within the
enterprise has a single master repository for all of that domain's data. This tends to have similar pros
and cons to the application centric model with the exception that all applications developed are
required to conform to the existing domain repositories. In this manner, application developers are
required to have a reasonable understanding of the business domains affected by the application.
In a domain centric model, each application may be required to communicate with more than one
repository. This data integration across repositories may result in the need for additional join
elements to be added to repositories to support this integration. Once completed, however, all future
applications will be able to reuse the integrated data model.
The key limitation of a domain centric approach is again the potential for data duplication between
domains and the reconciliation of sameness of entities between domains. For example, if two
business domains have data on people, and both domains have data on a John Smith, are they the
same person or different people with the same name? Further, if the domains each model a person for
their own repository (not using the Entity Oriented modeling approach from prior sections), the person
models may be different and thus incompatible.
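The John Smith problem can be sketched as follows. The two domains, their field names and the records are hypothetical; the point is only that a name match between differently modeled repositories does not establish identity.

```python
# Two domains model "person" independently (hypothetical fields and data).
sales_people = [{"full_name": "John Smith", "phone": "555-0100"}]
clinic_people = [
    {"first": "John", "last": "Smith", "birth_date": "1970-01-01"},
    {"first": "John", "last": "Smith", "birth_date": "1985-06-15"},
]


def candidates_by_name(sales_record: dict, clinic_records: list) -> list:
    """Return clinic records with a matching name; a name alone cannot prove identity."""
    return [
        record
        for record in clinic_records
        if f"{record['first']} {record['last']}" == sales_record["full_name"]
    ]


matches = candidates_by_name(sales_people[0], clinic_people)
is_ambiguous = len(matches) != 1
```

The two models share no attribute other than the name, so the match cannot be narrowed further without extra information or a shared person model.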
Entity Centric
Using an entity centric repository approach is the most data efficient approach, and the most costly to
design. In the entity centric approach, each data entity has an isolated repository with identified linkages
between repositories. For example, there may be a single people repository that contains records of all
human beings known throughout the enterprise. This repository will contain people information for
employees, customers, suppliers, contractors, etc. Notice, however, that the people repository only
contains the information that describes the people notion of those individuals. For an employee, for
example, the human resources (HR) portion of their data, including the very fact that they are an
employee, is stored in a different repository, not the people repository. This is simultaneously an
amazing benefit and a limitation. The modeling of entities drives the repository boundaries, and the
integration of data across repositories happens within one of the repositories participating in the
integration. For our HR example, there may be a staff repository that contains all HR relevant aspects
of employees and their reporting chains, but it is the staff repository that contains the references to
the people repository.
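The people and staff example can be sketched with two small in-memory repositories. The class names, fields and records are assumptions made for illustration; dictionaries stand in for the actual repositories.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """People repository entry: only what describes the person as a person."""
    person_id: int
    name: str


@dataclass(frozen=True)
class StaffRecord:
    """Staff repository entry: HR facts plus references into the people repository."""
    person_id: int                # reference to a Person
    job_title: str
    reports_to: Optional[int]     # person_id of the manager, if any


people_repo = {
    1: Person(1, "Ana Ruiz"),
    2: Person(2, "John Smith"),
    3: Person(3, "Lee Wong"),     # a customer: known as a person, absent from staff
}
staff_repo = {
    1: StaffRecord(1, "Director", reports_to=None),
    2: StaffRecord(2, "Analyst", reports_to=1),
}


def manager_name(person_id: int) -> Optional[str]:
    """Follow the staff repository's reference back into the people repository."""
    record = staff_repo[person_id]
    if record.reports_to is None:
        return None
    return people_repo[record.reports_to].name
```

Note that the references run one way: the staff repository points at people, while the people repository carries no HR attributes and does not record who is an employee.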
The separation of data entities into isolated repositories improves security, as each type of data is
stored in isolation from all other data, and permits the use of entity specific services to interact with the
data elements. In general, entity based partitioning is the most effective form of partitioning when
done properly. Proper partitioning is characterized by identifying the proper granularity of what
distinguishes one entity type from another (which is a topic beyond the scope of this document).
Processing Engines
A large part of computationally intensive applications involves generic processing functions. The
separation of processing capabilities into reusable structures can yield great cost savings in multiple
areas including long-term supportability. The primary cost associated with these reusable structures is
designing them up front for reuse. The isolation of these computational units can take on different
forms:
- Reusable libraries, such as a statistical analysis library containing generic routines for statistical
  functions, are probably the most common form of reuse. These functions are generically reusable
  by any application referencing the libraries and are often available from commercial vendors.
- Reusable frameworks provide a collection of common capabilities that may be reused across
  applications. Many of these frameworks are available commercially for specific purposes such as
  aspect oriented computing.
- Reusable services, such as data processing web services, are increasing in availability and form the
  basis of most SOA implementations. These services often use co-located data to increase
  performance.
In developing software for an enterprise, all three of the above partitioning strategies may be used
together for maximal effectiveness and operational longevity.
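Library reuse, the first of these forms, can be sketched with Python's standard statistics module standing in for a statistical analysis library. The two report functions and the sample readings are hypothetical applications added for illustration.

```python
import statistics


def summarize(values: list) -> dict:
    """Generic processing routine with no knowledge of any one application."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}


def water_quality_report(ph_readings: list) -> dict:
    """One application reusing the shared routine."""
    return summarize(ph_readings)


def traffic_report(vehicles_per_hour: list) -> dict:
    """A second, unrelated application reusing the same routine unchanged."""
    return summarize(vehicles_per_hour)


traffic_summary = traffic_report([2, 4, 4, 4, 5, 5, 7, 9])
water_summary = water_quality_report([7.0, 7.0, 7.0])
```

The same summarize routine could later be exposed as a service without changing either application's logic, which is one way the three forms can be combined.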
User Presentations
The user interface of an application is responsible for the presentation of data and controls to the
application user. This portion of an application is often called the presentation layer and ideally
contains no functional logic for the application. When properly designed and partitioned, the
application graphical user interface (GUI) is completely independent of the functional portion of the
application. Further, the GUI itself can be partitioned into components that can be reused across
applications.
There are two different aspects of the GUI that can be partitioned for an application:
- Partitioning of the GUI from the capability logic
- Partitioning of the GUI itself into GUI components
The most significant area of reuse comes from the first area of separating the GUI from all business
logic. This should be the default development pattern for application development, and has been
taught commonly in computer science courses for years. Unfortunately, this is often not followed, in an
effort to reduce development time and the need for planning the separation of the GUI from the logic.
When the GUI is separated from all other logic, both the processing logic and the data access code (which
interacts with data repositories) can be reused across applications. Moreover, by separating the GUI
from application logic, multiple distinct GUIs can be built for any given capability. At this level, an
application can be constructed by placing a new GUI on top of existing logic and data access libraries.
This drastically reduces the development time and standardizes the capabilities of new applications.
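Separating presentation from logic can be sketched with one capability function and two interchangeable presentations. The function names, the record layout and the dates are assumptions for illustration; dates are ISO strings so that plain string comparison orders them correctly.

```python
import json


def overdue_items(items: list, today: str) -> list:
    """Capability logic: no knowledge of how the results will be displayed."""
    return [item for item in items if item["due"] < today]


def render_text(rows: list) -> str:
    """One presentation: plain text, for example for a console application."""
    return "\n".join(f"{row['title']} (due {row['due']})" for row in rows)


def render_json(rows: list) -> str:
    """A second presentation over the same logic, for example for a web client."""
    return json.dumps(rows, sort_keys=True)


items = [
    {"title": "Sample report", "due": "2009-10-30"},
    {"title": "Site survey", "due": "2009-12-01"},
]
rows = overdue_items(items, today="2009-11-06")
as_text = render_text(rows)
as_json = render_json(rows)
```

Adding a third presentation would mean writing one more render function; overdue_items and any data access code behind it would stay untouched.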
Creating GUI components has the additional benefit of allowing all applications that use the
component to have a common set of controls and, thereby, common workflows. Within an organization,
if all applications are developed using common controls, users of those applications will need minimal
training when moving between applications. Identifying the boundary for developing a GUI component is
the difficult part, which requires planning throughout the design of the application. A properly defined
GUI component will operate upon a well-defined entity in a standard, well-defined manner. This
component will provide access to data, proper controls (such as buttons, lists and drop-downs) and
consistent behavior. In addition, if the data repositories are well-defined for an entity (Entity Centric
partitioning), the GUI control should match the repository boundary for better stability over time.
Conclusions
Architecting information solutions for an organization is a complex set of practices and trade-offs to
maximize capabilities while minimizing cost. Given that information solutions take a great deal of time
and care to construct, proper planning is required well in advance of need to ensure solutions are
available by the time the need arises without wasted efforts.
Various strategies exist for planning information repositories, software implementations and user facing
applications. Planning for reuse of repositories and software back-end components and services is of
great importance. Stakeholders involved with information strategies need to understand the difference
between the data repositories containing data, back-end software processing data and the user
interfaces that present data and processing. The separation of these concepts in the minds of those
involved in planning can yield great results in long-term cost savings and capabilities realized.
Appendices
References