EIM Intro - Information Architectures -Doc

  • 8/9/2019 EIM Intro - Information Architectures -Doc

    CRF-RDTE-TR-20091102-01

    11/6/2009

    Public Distribution | Michael Corsello

    CORSELLO RESEARCH FOUNDATION

    INFORMATION ARCHITECTURE BASICS

    INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT

    Corsello Research Foundation

    Public Distribution CRF-RDTE-TR-20091102-01

    Abstract

    Information architectures are key to effective information management within an enterprise.
    Managing information means architecting the structures and practices that govern the information
    lifecycle and every stage of information handling and processing.

    Table of Contents

    Abstract
    Introduction
        Concepts and Definitions
    Information Perspectives
    Management Strategies
        Project Orientation
            Benefits
            Costs
        Topic Orientation
            Benefits
            Costs
        Entity Orientation
            Benefits
            Costs
        Combining strategies
    Implementation Strategies
        Client Application
        Client Server
        Three-Tier
        N-Tier
    Capability Partitioning
        Repositories
            Enterprise Centric
            Application Centric
            Domain Centric
            Entity Centric
        Processing Engines
        User Presentations
    Conclusions
    Appendices
        References

    Introduction

    Information architecture is the set of practices and processes used to define the appropriate
    models and mechanisms for persisting and representing data within an enterprise. Put simply, it
    addresses how an organization structures its data for the most effective use and reuse throughout
    the information lifecycle.

    Concepts and Definitions

    Enterprise - In our context, an enterprise is the collection of organizations within a given
    business domain that operate together. This corresponds to business-specific groups within
    multiple organizations that share common information. More specifically, the enterprise view
    concerns how information is structured and managed to facilitate the sharing of data between
    organizations and between divisions within an organization.

    Repository - In the context of information architecture, data is stored in repositories, which may
    be physically implemented as a database (such as Oracle or SQL Server) or with any other
    persistence mechanism. Modern relational database management systems (RDBMSes) are the most
    commonly used persistence mechanism, but they are far from optimal and should be evaluated as
    one option rather than assumed as the norm. A repository includes the persistent store itself
    and the core software required to interact with that store. In the case of an RDBMS, the
    repository may be equivalent to the database, since the database implies the software as well
    (Oracle, for instance, is a software application).

    Application - An application is a piece of software that runs on a computer. An application may or
    may not have a user interface (UI), which in turn may or may not be a graphical user interface
    (GUI). An RDBMS is an application that does not have a user interface; however, there are
    applications (such as Toad) that provide a user interface to an RDBMS.

    Information Perspectives

    The architecture for an enterprise will consist of multiple repositories, each of which contains a
    subset of the data for that enterprise. This data must be structured in a consistent way to
    facilitate discovery and use.

    There are two primary considerations for evaluating the best strategies for enterprise data:

    - Management strategy
    - Implementation strategy

    Before discussing each of these strategies, a high-level perspective on the nature of data is
    critical to selecting optimal strategies.

    Data, when placed in context, becomes information. As information is evaluated and absorbed by a
    person, it becomes knowledge. The more effective the transition from data to knowledge, the
    greater the value of the data. The more contexts in which a single data element is used, the
    greater the value of that data. The key to achieving both is an effective model for the data.

    A model is a logical representation of a phenomenon in the real world. A data model is a model of
    a data entity, which in turn represents a real-world phenomenon. A model comes in two parts:

    - Conceptual model
    - Instance model (realization)

    The conceptual model is global to all instances of a class of phenomena. In simple terms, a
    conceptual model is how we describe the notion of a thing. An instance model represents one
    instance of an entity, based upon the conceptual model of the class that the entity belongs to.
    For example, a conceptual model might be the model of "Car", whereas the instance model would be
    the model of "My Car", which is described using the vocabulary of the Car model as a template.
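
    The Car example can be sketched in code. This is only an illustration: the attribute names
    (make, model, year, color) and the values are assumptions chosen for the sketch, not part of any
    model defined in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    """Conceptual model: the vocabulary shared by every car (attributes are assumed)."""
    make: str
    model: str
    year: int
    color: str


# Instance model: one specific car, described with the Car vocabulary as a template.
my_car = Car(make="Honda", model="Civic", year=2005, color="blue")

print(my_car)
```

    The class plays the role of the conceptual model, and each object created from it is an
    instance model.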


    Modeling is a key practice in the structuring, storing and management of information to ensure
    effective and efficient usage. If a poor model is used, poor usability and a poor return on
    investment in data will result. The act of modeling may itself raise or lower the cost of
    information technology solutions, but it will result in greater value being obtained from the
    data managed.

    Management Strategies

    Data management will be based upon some strategy for organizing the data being managed. Selecting
    an appropriate strategy is not trivial and depends on how the data will be used once persisted. If
    a data element has many uses that span business practices, it may be most efficient for human
    productivity to adopt multiple strategies and manage the transition of data between the
    repositories supported by each strategy.

    Project Orientation

    In a project oriented management strategy, all data is grouped, collected and managed by the project it

    is associated with. In this strategy, each organizational project (such as a specific water quality study)

    gets a distinct repository or partition (directory in a file system sense). The data generated and used

    within that project is stored within the project repository based upon data models used for that project.

    Projects may adopt centralized models, or define project specific models for any given data domain.

    Benefits

    This strategy is good for data collection and general field activity projects. There may be minimal

    structure and little need for data manipulation. In this strategy people are responsible for maintaining

    the flow of information within their project and must push data out to make it available to others.

    People can be highly efficient as they have minimal constraints to slow down human efforts. Only data

    that is used by a project needs to be transformed for use.

    Costs

    Data can become unusable or undiscoverable across projects. There is no central source for a given
    type of data. Once data is acquired from another project, significant processing may be required
    to make it usable. Data may be held in multiple formats, some of which may become unusable over
    time; each distinct format would need to be located and transformed to maintain usability. Each
    data set a project needs will have to be transformed to the local model, and this cost is
    repeated for every project reusing the data. Modeling must be performed for each project to fill
    any gaps left by existing models.

    Topic Orientation

    A topic oriented management strategy divides repositories by business topic (domain) area. In a topic

    oriented strategy, each topic has a unified model that is used within that topic area. If multiple topics

    use a common data entity (real world thing), each topic area may have a distinct model for that entity.

    For example, a roads topic and a waterways navigation topic may both have a bridge model, but the

    models may be entirely different and incompatible.
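
    The bridge example can be made concrete with a small sketch. Both classes and all of their
    field names are hypothetical; they only illustrate how two topics can describe the same
    real-world structure with no common vocabulary.

```python
from dataclasses import dataclass


@dataclass
class RoadBridge:
    """Bridge as a hypothetical roads topic might model it."""
    route_name: str
    lane_count: int
    load_limit_tons: float


@dataclass
class WaterwayBridge:
    """The same structure as a hypothetical navigation topic might model it."""
    waterway_name: str
    vertical_clearance_m: float
    horizontal_clearance_m: float


road_view = RoadBridge("Route 9", lane_count=4, load_limit_tons=40.0)
water_view = WaterwayBridge("Mill River", vertical_clearance_m=12.5,
                            horizontal_clearance_m=30.0)

# The two models share no field names, so nothing links the two records.
shared_fields = set(vars(road_view)) & set(vars(water_view))
print(shared_fields)  # set()
```

    With no shared field, any exchange between the two topics needs an explicit translation or an
    agreed identifier for the bridge itself.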

    Benefits

    A topic oriented strategy is good for businesses with very few domains that do not interact with external

    organizations. Within this strategy, all projects can interchange data freely as they are based upon

    common models with shared repositories by topic. All data for an enterprise within a topic is in a usable

    format. Each topic may create models that accepted tools can use directly. Integration data sets may be

    created to ensure compatibility between domains by sharing models at overlap points.

    Costs

    Translation of data across topics may be costly. Integration of data across topics may be
    impossible if correspondence between the domains is not planned beforehand. Each topic's data may
    be incompatible with another topic's software tools. Data integration costs may be high, and tend
    to be continual. Modeling must focus on an entire business topic rather than on commonly used
    aspects of the topic.

    Entity Orientation

    Managing data by entities is the most beneficial and most complex form of management. All real-world

    entities used within the enterprise must be identified and modeled separately. Each entity then

    becomes an atomic data repository that can be shared across all projects and topics due to the common

    model. In an entity oriented strategy, the primary goal is to adequately identify the entities and model

    Implementation Strategies

    Once a management strategy or set of strategies is chosen, an implementation strategy must be
    selected for each technology implementation within the enterprise.

    Client Application

    The most straightforward implementation strategy for a software solution is the client application or

    thick client. In a client application, the data and processing are local to the software and all operations

    occur on the user computer. A common example of this architecture is the traditional word-processing

    application such as Microsoft Word or Corel WordPerfect.

    The client application approach does not rely on interactions between computers for the software to

    function and generally does not have to consider network communications in detail. The client

    application works quite well and is the most battle tested form of implementation for software

    applications. However, the client application does not provide mechanisms for information sharing and

    exchange, nor does it provide for concurrent execution across machines. In general, the client

    application excels for single user applications with independent data sets.

    Whenever data is global in nature, meaning that multiple users or user communities need to share

    and exchange data directly, the client application approach breaks down. Additionally, the client

    application approach requires the greatest level of effort to ensure platform compatibility between user

    machines (such as Microsoft Windows and Apple Mac OS versions).

    Client Server

    Moving to multi-user concurrent usage capabilities starts with the addition of a server-based component

    to the software solution. The most basic form of this architecture is the client-server strategy. In a

    client-server application, there are exactly two deployment components to the overall application, one

    runs on the user computer (the client) and the other runs on a remote server. A common example of

    this is the database enabled application. In the database enabled application, there is a database server

    hosting the data repository. The client then runs an application that interacts with the database over a

    network. Probably the most prevalent form of the client-server architecture is a basic web site. In a

    client-server web site, the web server hosts all data and performs basic processing to serve user

    requests. The web browser running on the client machine requests, retrieves, processes and displays

    the content from the server. Through the use of client-side technologies such as JavaScript, Flash,

    Silverlight, etc. the client browser can perform final processing and user interaction handling.

    The client-server architecture is simple, efficient and effective for basic content but is incomplete for

    handling multiple sources of changing data such as in a relational database. Further, client-server

    architectures do not scale well for intense processing or large user bases.

    Three-Tier

    To add the capability for data processing, a three-tier architecture may be used. The three-tier
    architecture consists of a client application, a business processing server and a data storage
    tier, and is a common implementation strategy for basic business web applications. The data tier
    is typically implemented as a relational database management system (RDBMS) such as Oracle,
    Microsoft SQL Server or IBM DB2. It is generally responsible for all data storage and retrieval,
    with some data processing logic to ensure data integrity. The business processing tier is the
    web server from the client-server model, and the most data-intensive or computationally
    intensive processing occurs at this tier. The client tier is again the web-browser-based content
    from the client-server model, where all user interaction processing and possibly some business
    logic processing occurs.
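
    The division of responsibilities can be sketched in a single process. This is a minimal
    illustration, not a deployment: an in-memory SQLite database stands in for the data tier, plain
    functions stand in for the other two tiers, and the table, column and function names are all
    assumptions made for the example.

```python
import sqlite3


# Data tier: storage, retrieval, and integrity constraints.
def open_data_tier() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reading (site TEXT NOT NULL, "
        "ph REAL NOT NULL CHECK (ph BETWEEN 0 AND 14))"
    )
    conn.executemany(
        "INSERT INTO reading (site, ph) VALUES (?, ?)",
        [("A", 6.8), ("A", 7.2), ("B", 8.1)],
    )
    return conn


# Business processing tier: computation over data supplied by the data tier.
def average_ph(conn: sqlite3.Connection, site: str) -> float:
    row = conn.execute(
        "SELECT AVG(ph) FROM reading WHERE site = ?", (site,)
    ).fetchone()
    if row[0] is None:
        raise KeyError(site)
    return row[0]


# Client tier: presentation only, with no knowledge of storage.
def render(site: str, value: float) -> str:
    return f"Site {site}: mean pH {value:.1f}"


conn = open_data_tier()
print(render("A", average_ph(conn, "A")))
```

    In a real three-tier system each function group would run on a separate machine and
    communicate over a network; the boundaries between them are what the sketch is meant to show.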

    The three-tier architecture has been the workhorse of web applications since the mid-1990s. As
    the concurrent user base increases, more web servers may be added to scale out the application.
    This architecture also allows developers to move portions of the logic between tiers to increase
    performance or reduce wait times. These tier changes allow for adjusting the support and
    maintenance costs of the application by moving the processing workload between machines, and
    potentially to the user's browser at no computational cost to the owner of the site.

    N-Tier

    An evolution of the three-tier architecture is the N-tier architecture, where N is any number of
    tiers that exist to support the application. Technically, the client-server and three-tier
    architectural strategies are specific

    subsets of the general N-tier approach. Modern web based applications frequently use an N-tier

    approach, especially where service oriented architectures (SOA) are applied.

    In an N-tier implementation, each tier provides some portion of the overall application
    capability. There may be multiple business processing servers that each provide a unique
    capability, such as statistical processing of data. Additionally, there may be multiple database
    servers, each of which contains a specific portion of the total data used in the application.
    This form of distributing the workload by partitioning specific capability areas is the basis for
    distributed computing, SOA and cloud computing.

    The N-tier approach is overkill for simple applications, but is generally the best approach for
    all forms of multi-user applications. As businesses attempt to consolidate, N-tier computing
    allows for high levels of reuse and horizontal integration of capabilities across applications.

    Capability Partitioning

    When implementing a software system, trade-offs are made to ensure the best performance,
    scalability and maintainability possible given reasonable cost constraints and the functional
    capabilities required. Any capability has a minimum cost and timeline below which no workable
    solution can be produced. An effective solution will always exceed these minimums.
    Strategies for reuse, integration and

    partitioning are effective at minimizing realized costs by distributing the costs across implemented

    applications. The partitioning allows for resource sharing in any of several areas:

    - Conceptual reuse, where ideas, designs and algorithms are applied to multiple projects
    - Source reuse, where software source code is reused on multiple projects
    - Library reuse, where compiled libraries of code are reused as-is on multiple projects
    - Hardware reuse, where multiple applications are hosted on a single physical server
    - Service reuse, where a software service is reused by multiple applications (such as in SOA)
    - Data reuse, where a single authoritative data repository is reused by multiple applications

    Each of these forms of reuse can reduce costs in some scenarios, but which reuse strategy is most
    cost-effective is specific to the project and business domain. For any of these forms of reuse to
    be realistic, it has to be planned for in the initial stages of the projects producing the
    reusable product, and in the projects reusing the product. Accidental reuse can happen, but
    often carries additional hidden integration costs because it was not planned for. Planned reuse
    will increase initial development costs, but is likely to reduce total lifecycle costs for all
    parts reused. The greater the number of instances of reuse, the greater the return on the
    initial planning investment.
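
    The trade-off between higher initial cost and lower lifecycle cost can be illustrated with
    simple arithmetic. The cost model and every figure below are hypothetical assumptions for the
    sketch: each project either builds the component itself, or one project builds it with a
    planning premium and every project pays a smaller integration cost.

```python
def cost_without_reuse(projects: int, build_cost: float) -> float:
    """Every project builds its own copy of the component."""
    return projects * build_cost


def cost_with_planned_reuse(projects: int, build_cost: float,
                            reuse_premium: float, integration_cost: float) -> float:
    """Build once with a planning premium, then integrate into each project."""
    return build_cost * (1 + reuse_premium) + projects * integration_cost


# Assumed figures: build cost 100, 50% premium for reuse planning, 10 per integration.
for n in (1, 2, 5):
    print(n, cost_without_reuse(n, 100.0),
          cost_with_planned_reuse(n, 100.0, 0.5, 10.0))
```

    Under these assumed figures, planned reuse costs more for a single project (160 against 100)
    and less from the second project onward (170 against 200, then 200 against 500).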


    Trade-offs to provide capabilities at reduced cost generally involve partitioning strategies. Each
    of the primary computing areas of a software application may be subject to partitioning. These
    primary computing areas are:

    - Repositories of data, where the full corpus of data an application will use may be partitioned
      into domain or entity specific repository models (see the Entity Orientation section above).
    - Processing engines or capabilities, where the computational portions of an application may be
      separated into reusable analytical components.
    - User presentations or graphical user interfaces (GUIs), which can be partitioned away from an
      application to ensure there is no business logic associated with the display of information.

    Effective choices of what should be partitioned and how to partition those items will greatly influence

    performance, reusability and ultimately software lifecycle costs. It is important to realize that most of

    the cost for a software application is incurred not during development, but during sustainment.

    Repositories

    All software is designed to process data in some form. The repository is a conceptual store from which

    software will access and process data. There are several strategies for partitioning repositories across

    an enterprise:

    - Enterprise centric, where the entire enterprise centralizes all data into a single master
      repository.
    - Application centric, where each application has a dedicated repository.
    - Domain centric, where each business domain has a dedicated repository that all applications
      using that domain data must connect to.
    - Entity centric, where each entity is modeled and a repository exists for that entity model.
      All applications using an entity are connected to that entity repository.

    Enterprise Centric

    The enterprise centric approach of centralizing all data into a master integrated repository is
    only effective for small repositories with limited growth. The definition of "small" in this
    context is fluid, as it is a function of the cost of providing a hardware infrastructure to
    support such a repository. As database vendors become better at supporting single repositories
    of increasing size, this option becomes viable for more organizations. In general, however, this
    is not a recommended approach in most circumstances.

    With a single repository, performance can improve because all data is integrated and held in a
    common location. However, performance can also degrade due to the overall size of the repository
    and its increased intrinsic complexity. Failures are also more likely to have significant
    impacts with a single repository (the proverbial "all eggs in a single basket"). Scalability is
    an issue for the single repository due to current limitations in technology; Oracle Real
    Application Clusters (RAC) is an example of current advances that may reduce the significance of
    these limitations. Currently, scalability still makes total centralization a generally
    sub-optimal choice.

    Application Centric

    In an application centric partitioning of repositories, each application gets a dedicated repository (often

    a relational database). If multiple applications within the enterprise require access to the same data,

    that data is maintained in both repositories via a synchronization mechanism. This solution provides the

    greatest level of performance for a single application, but comes at the additional cost of data

    duplication and issues of consistency for rapidly changing data (synchronization latency).
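
    Synchronization latency can be shown with two in-memory stand-ins for application
    repositories. The repository names, the record layout and the one-way full-refresh
    synchronization are assumptions made for this sketch; real synchronization mechanisms vary
    widely.

```python
# Two application-specific repositories holding copies of the same record.
billing_repo = {"cust-1": {"email": "old@example.com"}}
shipping_repo = {"cust-1": {"email": "old@example.com"}}


def synchronize(source: dict, target: dict) -> None:
    """Copy every record from source to target (one-way, full refresh)."""
    for key, record in source.items():
        target[key] = dict(record)


# An update lands in one repository only.
billing_repo["cust-1"]["email"] = "new@example.com"

# Until the next synchronization runs, the other application reads stale data.
stale = shipping_repo["cust-1"]["email"] != billing_repo["cust-1"]["email"]

synchronize(billing_repo, shipping_repo)
consistent = shipping_repo["cust-1"] == billing_repo["cust-1"]

print(stale, consistent)
```

    The window between the update and the synchronization run is the latency the text refers to;
    the faster the data changes, the more often readers fall inside that window.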

    For small organizations with static data sets, the application centric approach may be quite effective. In

    many cases, the application centric approach will be sub-optimal in all areas due to the effort involved in

    establishing the data synchronization mechanisms and the cost of data duplication.

    Unfortunately, application centric partitioning is the most natural form of partitioning due to the

    isolation of repositories for each application. In this form, developers of a solution are able to focus on

    the local problem alone. This may lead to lower development costs for an application at the cost of

    poorer fit to the business capability required, as developers do not need to fully understand the

    problem domain.

    Domain Centric

    A domain centric partitioning of repositories ensures that each defined business domain within
    the enterprise has a single master repository for all of that domain's data. This tends to have
    similar pros and cons to the application centric model, with the exception that all applications
    developed are required to conform to the existing domain repositories. In this manner,
    application developers are required to have a reasonable understanding of the business domains
    affected by the application.

    In a domain centric model, each application may be required to communicate with more than one

    repository. This data integration across repositories may result in the need for additional join

    elements to be added to repositories to support this integration. Once completed however, all future

    applications will be able to reuse the integrated data model.

    The key limitation of a domain centric approach is again the potential for data duplication
    between domains and the reconciliation of "sameness" of entities between domains. For example, if
    two business domains have data on people, and both domains have data on a "John Smith", are they
    the same person or different people with the same name? Further, if the domains each model
    "person" for their own repository (not using the Entity Oriented modeling approach from prior
    sections), the person models may be different and thus incompatible.
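
    The John Smith problem can be sketched as follows. The two record layouts and the person_id
    field are hypothetical; the point is only that a matching name yields a candidate match, while
    a definite answer needs an identifier both domains agree on.

```python
from typing import Optional

# Person records as two hypothetical domains might hold them.
sales_person = {"name": "John Smith", "phone": "555-0100"}
hr_person = {"full_name": "John Smith", "birth_year": 1970}


def same_name(sales_record: dict, hr_record: dict) -> bool:
    """Compare names across the two incompatible person models."""
    return sales_record["name"] == hr_record["full_name"]


def same_person(a: dict, b: dict) -> Optional[bool]:
    """Return True or False when both records carry a shared person_id, else None."""
    if "person_id" in a and "person_id" in b:
        return a["person_id"] == b["person_id"]
    return None  # Cannot be decided from these records alone.


print(same_name(sales_person, hr_person))    # True: only a candidate match
print(same_person(sales_person, hr_person))  # None: identity is undecidable
```

    Note that even the name comparison needed knowledge of both models, because each domain stores
    the name under a different field.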

    Entity Centric

    Using an entity centric repository approach is the most data-efficient approach, and the most
    costly to design. In the entity centric approach, each data entity has an isolated repository
    with identified linkages between repositories. For example, there may be a single "people"
    repository that contains records of all human beings known throughout the enterprise. This
    repository will contain people information for employees, customers, suppliers, contractors,
    etc. Notice, however, that the people repository only contains the information that describes
    the "people" notion of those individuals. For an employee, for example, the human resources (HR)
    portion of their data, including the very fact that they are an employee, is stored in a
    different repository, not the people repository. This is simultaneously a major benefit and a
    limitation. The modeling of entities drives the repository boundaries, and the integration of
    data across repositories happens within one of the repositories participating in the
    integration. For our HR example, there may be a "staff" repository that contains all HR-relevant
    aspects of employees and their reporting chains, but it is the staff repository that contains
    the references to the people repository.
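
    The people and staff example can be sketched with two dictionaries standing in for the
    repositories. The identifiers, field names and records are assumptions for illustration; the
    sketch shows the reference living in the staff repository and the people repository knowing
    nothing about employment.

```python
# People repository: only what describes a person as a person.
people_repo = {
    "p-1": {"name": "John Smith"},
    "p-2": {"name": "Ann Lee"},
    "p-3": {"name": "Raj Patel"},  # a customer, not an employee
}

# Staff repository: HR facts, holding references into the people repository.
staff_repo = {
    "s-10": {"person_id": "p-2", "title": "Director", "reports_to": None},
    "s-11": {"person_id": "p-1", "title": "Analyst", "reports_to": "s-10"},
}


def staff_view(staff_id: str) -> dict:
    """Integrate a staff record with its person record; the link lives in staff."""
    staff = staff_repo[staff_id]
    person = people_repo[staff["person_id"]]
    return {"name": person["name"], "title": staff["title"]}


def is_employee(person_id: str) -> bool:
    """Only the staff repository can answer this; the people repository cannot."""
    return any(s["person_id"] == person_id for s in staff_repo.values())


print(staff_view("s-11"))
print(is_employee("p-3"))
```

    Answering "is this person an employee?" requires a lookup in the staff repository, which is
    the limitation the text describes alongside the benefit of a single record per person.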

    The separation of data entities into isolated repositories improves security, as each type of data is

    stored in isolation of all other data, and permits the use of entity specific services to interact with the

    data elements. In general, entity based partitioning is the most effective form of partitioning when

    done properly. Proper partitioning is characterized by identifying the proper granularity of what

    distinguishes one entity type from another (which is a topic beyond the scope of this document).

    Processing Engines

    A large part of computationally intensive applications involves generic processing functions. The

    separation of processing capabilities into reusable structures can yield great cost savings in multiple

    areas including long-term supportability. The primary cost associated with these reusable structures is

    designing them up front for reuse. The isolation of these computational units can take on different

    forms:

    • Reusable libraries, such as a statistical analysis library containing generic routines for

      statistical functions, are probably the most common form of reuse. These functions can be

      reused by any application that references the libraries and are often available

      from commercial vendors.

    • Reusable frameworks provide a collection of common capabilities that may be reused

      across applications. Many of these frameworks are available commercially for specific

      purposes such as aspect-oriented computing.

    • Reusable services, such as data processing web services, are increasing in availability and

      form the basis of most SOA implementations. These services often use co-located data

      to increase performance.


    In developing software for an enterprise, all three of the above partitioning strategies may be used

    together for maximal effectiveness and operational longevity.
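
    As a rough illustration of two of these forms, the sketch below exposes one generic statistical routine both as a library function and behind a message-based interface that stands in for a web service. The names summarize and handle_request and the JSON request shape are assumptions made for this example; no real network endpoint or framework is shown.

```python
import json
import statistics


# Reusable library: a generic routine with no knowledge of any application.
def summarize(values):
    """Return the mean and population standard deviation of a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
    }


# Reusable service: the same routine behind a message-based interface.
# handle_request is a hypothetical stand-in for a web service endpoint.
def handle_request(request_body):
    """Accept a JSON request such as {"values": [1, 2, 3]} and return a JSON reply."""
    request = json.loads(request_body)
    return json.dumps(summarize(request["values"]))


# An application linking the library calls the function directly.
library_result = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# An application using the service exchanges messages and never imports the routine.
service_reply = json.loads(handle_request('{"values": [2, 4, 4, 4, 5, 5, 7, 9]}'))
```

    Both callers receive the same result from a single implementation, which is where the long-term supportability savings come from.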

    User Presentations

    The user interface of an application is responsible for the presentation of data and controls to the

    application user. This portion of an application is often called the presentation layer and ideally

    contains no functional logic for the application. When properly designed and partitioned, the

    application graphical user interface (GUI) is completely independent of the functional portion of the

    application. Further, the GUI itself can be partitioned into components that can be reused across

    applications.

    There are two different aspects of the GUI that can be partitioned for an application:

    • Partitioning of the GUI from the capability logic

    • Partitioning of the GUI itself into GUI components

    The most significant area of reuse comes from the first area of separating the GUI from all business

    logic. This should be the default development pattern for application development, and has been

    taught commonly in computer science courses for years. Unfortunately, this pattern is often not followed,

    in an effort to reduce development time and to avoid the planning that separating the GUI from the logic requires.

    When the GUI is separated from all other logic, both the processing logic and the data access code (which

    interacts with data repositories) can be reused across applications. Moreover, by separating the GUI


    from application logic, multiple distinct GUIs can be built for any given capability. At this level, an

    application can be constructed by placing a new GUI on top of existing logic and data access libraries.

    This drastically reduces the development time and standardizes the capabilities of new applications.
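
    A minimal sketch of this separation follows. The order-summary capability and the names OrderSummary, summarize_order, render_text and render_html are hypothetical and chosen only for illustration. The capability logic returns plain data and contains no presentation code, so two distinct presentations can be placed on top of it without changing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderSummary:
    """Result produced by the capability logic; holds no presentation detail."""
    item_count: int
    total: float


# Capability logic: no printing, no widgets, no markup.
def summarize_order(prices):
    return OrderSummary(item_count=len(prices), total=round(sum(prices), 2))


# Presentation 1: plain text, for example for a console tool.
def render_text(summary):
    return f"{summary.item_count} items, total {summary.total:.2f}"


# Presentation 2: an HTML fragment, for example for a web page.
def render_html(summary):
    return (
        f"<p><b>{summary.item_count}</b> items, "
        f"total <b>{summary.total:.2f}</b></p>"
    )


summary = summarize_order([10.0, 2.5])
text_view = render_text(summary)
html_view = render_html(summary)
```

    Adding a third presentation in this sketch would mean writing one more render function; summarize_order would not change.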

    Creating GUI components has the additional benefit of giving all applications that use the

    component a common set of controls and, thereby, common workflows. Within an organization,

    if all applications are developed using common controls, users of those applications will need minimal

    training moving between applications. Identifying the boundary for developing a GUI component is the

    difficult part, which requires planning throughout the design of the application. A properly defined GUI

    component will operate upon a well-defined entity in a standard, well-defined manner. This component

    will provide access to data, proper controls (such as buttons, lists and drop-downs) and consistent

    behavior. In addition, if the data repositories are well-defined for an entity (Entity Centric partitioning),

    the GUI control should match the repository boundary for better stability over time.
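
    One way to picture such a component is sketched below, assuming a hypothetical PersonCard component bound to a single Person entity and rendered as text for simplicity. The class names, the control labels and the two example screens are illustrative assumptions, not part of any particular GUI toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    person_id: int
    name: str
    email: str


class PersonCard:
    """Reusable presentation component for exactly one entity type: Person."""

    # The same controls appear wherever the component is used.
    controls = ("View", "Edit", "Contact")

    def render(self, person):
        buttons = " ".join(f"[{label}]" for label in self.controls)
        return f"{person.name} <{person.email}>  {buttons}"


# Two hypothetical applications embed the same component.
def hr_screen(card, person):
    return "HR / Staff directory\n" + card.render(person)


def sales_screen(card, person):
    return "Sales / Customer contacts\n" + card.render(person)


card = PersonCard()
ada = Person(1, "Ada", "ada@example.org")
hr_view = hr_screen(card, ada)
sales_view = sales_screen(card, ada)
```

    Because both screens delegate to the same component, a person is presented with identical controls in each application, which is what keeps retraining to a minimum.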

    Conclusions

    Architecting information solutions for an organization is a complex set of practices and trade-offs to

    maximize capabilities while minimizing cost. Given that information solutions take a great deal of time

    and care to construct, proper planning is required well in advance of need to ensure solutions are

    available by the time the need arises without wasted efforts.

    Various strategies exist for planning information repositories, software implementations and user facing

    applications. Planning for reuse of repositories and software back-end components and services is of

    great importance. Stakeholders involved with information strategies need to understand the difference

    between the data repositories containing data, back-end software processing data and the user

    interfaces that present data and processing. The separation of these concepts in the minds of those

    involved in planning can yield great results in long-term cost savings and capabilities realized.

    Appendices

    References