CRF-RDTE-TR-20100202-08
11/2/2009
Public Distribution | Michael Corsello
CORSELLO RESEARCH FOUNDATION
INFORMATION CONSIDERATIONS: MANAGEMENT OF ENTERPRISE DATA
Abstract

Information management is a complex topic involving not just information technology, but business processes and staffing as well. Enterprise Information Management includes every aspect of the creation, storage, use, disposal of, and accountability for every piece of information an organization comes in contact with.
Table of Contents

Abstract
Introduction
Audiences
Business Domains
Data Categories
    Raw Data
        Accuracy and Precision
        Faults and Blunders
        Data Capture
        Human Capture
        Sensor Capture
    Derived Data
Data Uses
Data Structure Types
    Unstructured (Documents)
    Semi-Structured
    Structured
    Structure Element Types
        Textual
        Tabular
        Graph Data
        Spatial
        Temporal
Data Formats
    File Formats
    Stability
    Longevity
        Data
        Storage
        Format
        Media
Data Volumes
    Separation
    Sharding
    Versioning
    Archival
    Evaporation
Conclusions
Appendices
    Acronym List
    References
Introduction

Organizational management of information is a long-standing issue that has not been solved by technology. As technology advances, organizations gain greater capabilities to generate data and greater capacities to store it. This information glut presents problems for every organization that creates or uses information. The storage of data is a multi-dimensional problem that must be considered from many aspects:
- Audiences, both intended and unintended
- Business domains, including legal and policy restrictions
- Data categories, such as sensor feeds, field sampling, and analytic results
- Data uses, which include activities such as data capture, analysis, and reporting
- Data structure types, such as document, tabular, raster, spatial, or temporal
- Data formats, including proprietary formats that may be restrictive in use
- Data volume over time, which will drive the infrastructure to store data
Each of these aspects alone provides only a partial set of demands and restrictions on the handling of data for the enterprise. Given that an organization is both an enterprise in its own right and part of larger enterprises comprising all other organizations in a similar business domain, it is imperative that the organization's information strategy meet both the short- and long-term demands of the enterprises with which the organization interacts.
Audiences

Information is useless without an audience. Even in automated processing and control systems, an audience receives the result of the information processing. The benefit to the audience is the purpose of the information. There are two primary categories of audience when dealing with information in any form: intended and unintended.
The intended audience is divided further into two categories: direct and indirect. The direct intended audience is the portion of the audience using an information store that the store was designed to directly support. This is the primary user base of the information. The indirect intended audience is the set of users (and domains) that were planned for during information store design. This audience will have varying levels of success in using the information and may or may not be an actual set of users (there may not be any people in this set who actually use the information). It is not uncommon for the indirect intended audience to amount to only a small portion of the user base, and it may be incorrectly targeted during the design phase.
The unintended audience is the set of information users for which the information store was not planned or designed. The unintended audience is likewise divided into two categories: direct and indirect.
Figure 1. Top-Level Categories of Users
The direct unintended audience is the portion of the audience for an information store that is within a business domain or organization that correlates well with an intended group. The direct unintended audience will generally meet with moderately good success in information usage, with the primary limitation being access to the information. Finally, the indirect unintended audience is the set of users that are not related to an expected domain. This form of user may include orthogonal business areas (such as a construction firm using business demographic data for site selection) and may gain significant benefits from information usage. In fact, it is entirely possible that this group of users may outnumber, and benefit more than, all other user groups. Since this group is not intended, they are the most difficult to support. These groups may use any form of software application and require data in specific formats. Unfortunately, data may be provided in incompatible formats, which greatly hinders reuse. While the use of open-source data standards (such as XML schemas) and web services helps cross-platform information exchange, it does not solve the problem of data formats for specific applications.
Business Domains

A business domain is a business functional area that has a related goal and uses similar information. A single organizational group may operate in multiple business domains and therefore have information stores that overlap those domains.
A business domain such as hydraulic engineering, geodetic survey, ecological monitoring or commercial fishing will produce and use information that overlaps with other domains. In fact, the specific domains mentioned here all overlap in significant ways. For example, all of the above domains utilize information about fixed monuments for relative locations, which are the heart of surveying. Likewise, all of these domains will have use for information about locations in general. Further, these groups will all use information on projects: all of the domains other than commercial fishing could have direct efforts on the construction of a lock, dam or channel, while the commercial fishing domain may be an unintended audience utilizing that data for planning trawler capacities and travel schedules for safe cargo passage. In this case, the commercial fishing domain is simply acting as a waterway cargo hauling domain.
Figure 4. Business Domains with Commonality
Figure 2. Types of Intended Users
Figure 3. Relative Magnitude of User Types
The ultimate consideration for business domains is information structure, use and access. Information structure, use and access are the themes that run through the remainder of this paper to describe the what, where, when, why and how of interacting with the information produced by an organization. One of the most important considerations with business domains will be the isolation of domains from one another in space and time. In general, each organization will operate at specific locations and produce and maintain information for a time period that is relevant to that organization. Policy or legal issues relating to the information being regulated (such as personally identifiable information (PII)) may drive organizational isolation. Also, each organization will isolate its own information technology infrastructure from the global internet for security purposes. Security and auditing considerations must be addressed when information must flow across these boundaries.
Data Categories

First, a separation between data and information must be established. Information is the collection of data, tools and a context that a human uses to derive knowledge. Data is the collection of stored values used as information when presented in context. Data may imply context based upon a storage model or management tool (such as a software application); however, data can be moved between models to alter or infer other contexts.
Each category of information that the organization will produce or manage will affect the information
modeling for that organization. Data categories include raw data sources such as sensors and field
collection and processed sources such as analysis results. A data category will influence the model used
to store that information and may greatly affect the theoretical rate of production for that data.
Raw Data

Raw data includes all categories of data that are collected directly as a result of observation of a physical phenomenon in the real world. Raw data is the original form of data from a source and must be considered the basis for all data sets derived from that source. Raw data is generally collected in a higher volume than derived data and has the greatest potential for reuse of any form of data.
Figure 5. Data Domains Shared Across Business Domains
Prior to detailing categories of raw data, it is important to appreciate sources of error and
incompatibility between data sets. Errors in data will fall into two primary areas: measurement error
and introduced error. Measurement error is in the form of accuracy and precision of the measurement
made. Introduced error is in the form of faults and blunders in data collection and entry generally
associated with human interpretation and capture.
Accuracy and Precision
Accuracy and precision are commonly confused in spite of their relative simplicity. Accuracy is the absolute measure of how close a measurement is to the actual value; it is often described as how close a dart is to the bull's-eye. Precision is the repeatability of a measure: the closeness of sequential measurements to each other, irrespective of the actual value being measured. In shooting terms, precision is measured as the size of the group for sequential shots. It is therefore quite possible to have low accuracy (missed the target by 12 inches) and high precision (95% of all shots were within 0.0001 inches of each other).
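The dartboard analogy above can be made concrete with a small sketch. Python is used purely for illustration, and the measurement values, plus the choice of mean offset and standard deviation as the two summary statistics, are assumptions rather than anything this paper prescribes:

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Summarize a series of repeated measurements.

    Accuracy here is the absolute offset of the mean from the true
    value; precision is the spread (standard deviation) of the
    measurements around their own mean, irrespective of the true value.
    """
    mean = statistics.fmean(measurements)
    accuracy_error = abs(mean - true_value)      # how far from the bull's-eye
    precision = statistics.stdev(measurements)   # how tight the group is
    return accuracy_error, precision

# Low accuracy, high precision: tightly grouped, but about 12 units off.
shots = [112.0, 112.1, 111.9, 112.0, 112.05]
err, spread = accuracy_and_precision(shots, true_value=100.0)
```

Here `err` is large (roughly 12) while `spread` is tiny, illustrating that the two measures vary independently.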
In general, sensors are calibrated for accuracy and have intrinsic precision properties that may not be
adjustable over the lifespan of the sensor. In many cases, the loss of precision in a sensor is the
indication that the sensor is at its end of effective life. Sensors with reduced accuracy may introduce a
systematic error, which can potentially be corrected by adding an error offset.
Accuracy and precision are of concern in both human and sensor measurements. Human
measurements tend to be more stochastic in terms of both accuracy and precision, but not necessarily
of less value.
Generic Accuracy

In general terms, accuracy should be recorded for all raw data collected, regardless of source or data type. These generic accuracy statements are metadata (data about data) attached to all data collected under a similar generic accuracy. The inclusion of an accuracy statement with all raw data sets will ensure that data users can determine the maximum level to which derived data could be considered accurate. No derived data set can be of greater accuracy than the least accurate source data used to derive the result. This has profound implications in analysis, where partial data sets must be integrated to form a complete picture from which analytical results are derived.
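The rule that a derived product can be no more accurate than its least accurate source can be expressed as a one-line bound. This is an illustrative sketch only; the metre-denominated error values in the example are hypothetical:

```python
def derived_accuracy_bound(source_errors_m):
    """Best accuracy claimable for a derived product, in metres.

    A derived data set can be no more accurate than its least accurate
    source, so the bound is the worst (largest) source error.
    """
    return max(source_errors_m)

# Three sources accurate to within 0.5 m, 2.0 m, and 10.0 m.
bound = derived_accuracy_bound([0.5, 2.0, 10.0])
```

Carrying this bound in the derived data set's metadata lets downstream users judge whether the product is fit for their purpose.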
Faults and Blunders
When a data value is not captured correctly, it may be due to a fault or a blunder. A fault is the capture of a correct measure or value in an incorrect manner, such as transposing latitude and longitude. A blunder is the capture of an incorrect value, either by incorrect measurement or by incorrect entry. Incorrect measurement is a gap between the physical environment and the data system (such as a reading from a GPS on a tripod that is not leveled properly). Incorrect entry is a gap between the source data and a destination data system (such as incorrectly entering a value into a database or notebook by transposing digits).
Faults and blunders are generally caused by humans and may exist anywhere in the data lifecycle.
Faults and blunders are random and not easily detected or corrected.
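While blunders are random, some faults have a detectable signature. As a hedged illustration (a plausible mitigation, not a method proposed by this paper), the transposed latitude/longitude fault mentioned above can often be flagged with a simple range check:

```python
def flag_possible_transposition(lat, lon):
    """Flag coordinate pairs that may have latitude and longitude swapped.

    Latitude must lie in [-90, 90] and longitude in [-180, 180]. A pair
    that is invalid as given but valid when swapped is a likely fault.
    """
    def valid(la, lo):
        return -90.0 <= la <= 90.0 and -180.0 <= lo <= 180.0

    if valid(lat, lon):
        return False            # plausible as entered
    return valid(lon, lat)      # True means swapping would make it plausible

# A Seattle-area point entered with the coordinates transposed.
suspicious = flag_possible_transposition(-122.3, 47.6)
```

Note the check only flags candidates: a transposed pair that happens to remain in range (e.g. both values under 90) passes undetected, which is exactly why faults are described here as not easily detected.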
Data Capture
Raw data may be captured by humans or by automated means such as sensors. When sensors are used, they may perform direct electronic capture or require a human to record and transfer the sensed data. Any data capture that requires a human to record the data is classed as human capture, equal to all other human-captured data.
Regardless of the mechanism of capture, raw data should be considered gospel data that must be retained for the longest period of time. If the raw data used in an analysis is lost, the analysis cannot be verified and its validity will ultimately be vulnerable to scrutiny. Unfortunately, the volume of raw data may make traditional long-term storage prohibitive. In this case, a mechanism for data persistence will be required that supports long-term retention over the lifetime of all analysis products.
Human Capture
Human data capture is the practice of a person measuring, recording and entering values into a data set. Human capture is still a common form of data generation and includes all business data entry systems such as online forms or general spreadsheets. Since a human collects the data and enters it into a system, the rate of generation is limited by the maximum number of people capturing and entering data and the absolute rate at which a single person can generate data. Human capture systems are generally low- to medium-volume capture sources.
Sensor Capture
Sensor systems can be direct capture or indirect capture. A direct capture system records all data
sensed directly to a persistent store such as a database. Direct capture systems often include sensors
such as:
- Fixed thermal sensors (weather stations)
- Closed-circuit cameras
- Supervisory Control and Data Acquisition (SCADA) systems
- Hardware self-monitoring
- Software logging / auditing
Indirect capture systems include all disconnected sensor systems that require periodic data transfers
such as field data loggers. Indirect capture systems include sensors such as:
- Hydrolab Datasondes
- Hardware diagnostic sensors (such as those in automobiles)
- GPS units
- Autonomous hardware devices (such as robots, transponder readers / loggers, etc.)
Sensor capture devices tend to provide reliable data if properly used and maintained. Due to the automated nature of these devices, they can have low to extremely high data capture rates. Indirect capture devices are limited in capture rate by their transfer requirements, but this can be overcome by exchanging devices.
Derived Data
All data products that are created from existing data are considered derived. Derived data will often be smaller in size and greater in value to a user than the source data used to produce it. Derived data may be produced dynamically on demand (such as complex queries over a relational data store) or statically produced and persisted (such as analysis results and formal reports). Whenever possible, metadata should be captured and maintained for all derived data, indicating its chain of production and all data sets used in the derivation of the product.
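One minimal way to record the chain of production described above is a small provenance record per product. This is a sketch only; the field names, product identifiers, and process labels are invented for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal chain-of-production metadata for a derived data set."""
    product_id: str
    process: str                                  # tool or analysis used
    source_ids: list = field(default_factory=list)  # inputs to the derivation
    produced_at: str = ""

def derive(product_id, process, sources):
    """Create a provenance record at the moment a product is derived."""
    return ProvenanceRecord(
        product_id=product_id,
        process=process,
        source_ids=list(sources),
        produced_at=datetime.now(timezone.utc).isoformat(),
    )

report = derive("flood-report-2010", "stage-frequency analysis",
                ["gauge-01-raw", "gauge-02-raw"])
```

Persisting such records alongside each derived product makes the dependence on raw sources explicit, which supports the earlier point that analyses become unverifiable if their raw inputs are lost.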
Data Uses

Each user will have a set of uses for a data set. The basic forms of data use include capture, view, analysis (derivation) and reporting (derivation). The initial use of a data set will be data capture or generation, which involves the processes and users that produce and record the raw data.
Data view (reading or retrieval) is the practice of physically viewing, querying, browsing or enumerating a data set. Viewing is the most common use of data. Data viewing is subject to availability, discovery and transfer constraints. The user context may dictate the presentation format of data and therefore may require a transformation of the data from its storage format.
Analysis is the act of deriving a data set from some other data set or sets. Analysis often requires data
to be in a specific structure or format that is amenable to the specific type of analysis to be performed.
Additionally, data volume may be a consideration that limits the ability to transform the data due to
storage limitations. Finally, analysis processes may require data transfer at specific rates or access
methodologies (such as random or sequential access) that drive the data storage environment (such as
server sizing). This tends to be the most performance critical use of data and may significantly affect
costs and planning.
Reporting is the act of deriving an output data set from an existing data set. Reporting generally adds
minimal new information (generally just aggregation functions such as averages or sums), but tends to
add significant value due to improved information context and comprehension. Reporting may be
automated (scheduled or triggered) or manual. Reporting may involve data transformation and
additional storage (such as in an online analytic processing (OLAP) system).
Figure 6. Relative size of user communities
All forms of data use will influence storage methodologies, environment scaling and access control. Integration of external application systems can be the most significant cost, in time and money, in architecting an effective information strategy. It is important to understand the scale of the user groups as well. In any information system, the largest proportion of users are viewers, while the smallest proportion are creators. Given this pyramid (as depicted above), it is clear that data creation is targeted at a small set of users with specialized needs. Due to the small size of this community, it is reasonable to invest greater funds per user to create robust data collection tools. If the smaller communities are not emphasized, the data available to all users will be of lesser quality. In general, it is a more effective use of resources to invest in collection and analysis tools, with general-purpose viewing tools created as needed.
Data Structure Types

All data will be structured in a manner that a computer can handle. These structures abstract the physical world to effect the information capability the organization requires. Data structure types fall into three primary levels: unstructured, semi-structured and structured.
Unstructured (Documents)
Unstructured data is most commonly documents. These data types store values that have embedded meaning but lack a structure for conveying that meaning. A document can be understood by reading it and may have many meanings depending upon the reader. Since there is no pre-defined structure to the content, a computer treats the content as a single atomic data item. The content within the document may be mined to acquire limited information (generally metadata) beyond what was entered by the creator. Unstructured data is of limited use within a data set, but provides immense use to human users. Unstructured data is primarily managed in an information store as a document repository flagged with metadata relating the document to other data.
Semi-Structured
Semi-structured data is data containing a limited structural definition. XML is a standard encoding for data that may be externally structured via an XML Schema. Semi-structured data may be fully conformant with a single structural definition, or conformant in part with multiple structural definitions. Semi-structured data is a particular consideration when implementing the use and consumption of externally provided data.
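Partial conformance can be checked mechanically. The following sketch uses Python's standard-library XML parser to report which expected elements a semi-structured document actually carries; the record format and tag names are hypothetical, and a production system would more likely validate against a full XML Schema:

```python
import xml.etree.ElementTree as ET

def conforms_in_part(xml_text, required_tags):
    """Report, per expected tag, whether the document carries at
    least one such element.

    Semi-structured data may satisfy only part of a structural
    definition; this check makes the degree of conformance explicit.
    """
    root = ET.fromstring(xml_text)
    return {tag: root.find(f".//{tag}") is not None for tag in required_tags}

doc = "<record><site>Gauge 12</site><reading>3.7</reading></record>"
result = conforms_in_part(doc, ["site", "reading", "units"])
```

Here the document conforms in part: it carries `site` and `reading` but lacks `units`, so a consumer can decide whether the missing element is fatal for its use.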
Structured
Structured data is all data stored under a pre-defined, well-known structure. Relational databases and object-oriented classes define structured data. Structured data is the most powerful aspect of data storage and modeling, as it conveys structure and implies meaning of and between data elements.
In most cases, an analysis will require specific data in a specific structure, but will not semantically comprehend the data used. Structured storage and transfer of data does not imply meaning or guarantee correctness. All data persisted within an organization repository should use structured storage unless otherwise required. Semi-structured and unstructured data may be stored within a structured repository.
Structure Element Types
Within a structured or semi-structured store, each data element will be of some structured type. Primary data types include: Boolean (true/false), integral numeric, real numeric, text, binary and abstract types. Abstract types are constructed as a defined collection of the other primary data types. All primary data types are supported in some form in every language, database and general information system.
Textual
A textual data element is any free-form text entity that is interpreted as a whole. In general, a document can be considered primarily a textual entity. Note that specific document encoding formats (such as the Microsoft Word format) may include far more than text and may depend upon a specific tool to read data in that format. General-purpose textual information (including HTML and XML files) is encoded in a standard form that is easily understood on any platform.
Tabular
Tabular data elements are any data structured as rows and columns of values. Object-oriented classes are also considered a tabular structure, in that a class is analogous to a table structure. The key concept of tabular data is a pre-defined structure for all instances of data conforming to that structure. Any data structure that is a two-dimensional array of values is a table. All digital images may be considered tabular data in their raw form; it is only their encoding (such as JPEG or TIFF) that makes them distinct elements. Further, a tabular data structure may store any other form of data within a single column of that structure.
Graph Data
Graph data elements are any data structured upon relationships between elements. This may be in the form of hierarchies (trees) or networks (graphs). A graph (G) is a set of nodes (N) and edges (E), indicated as G(N,E), where each edge (e) in the set of edges (E) is a directed pair of nodes (n1, n2), both in the set of nodes (N). Edges may be directed (source to destination) or bi-directional (considered undirected). A social network like Facebook is an undirected graph connecting people (nodes) who indicate they know each other (edges). A road or river network would also be an example of a graph.
Graph data may also be tabular, in that the nodes and edges may each be represented by a table. It is also possible for a set of nodes to be shared across multiple graphs, each with a distinct set of edges.
Graph data is conceptually simple, but potentially complex in practice, and may be used to support any number of complex analytical routines. In general, anything representing connectivity can be considered a graph. Most commercially available data management tools, including relational databases, have poor support for graph data. Further, most commercial tools that do support graph data expect the graph to reside entirely in memory, which is a significant limitation for large data sets. As an example, the US road network has over 30,000,000 edges (road segments) and an approximately equal number of nodes (intersections).
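The tabular representation of a graph described above can be sketched as two simple tables, with undirected queries scanning the edge table in both directions. This is an illustrative toy (the node names are invented), not a recommended storage design:

```python
# Nodes and edges each held as a simple table (list of rows). Edges are
# stored once as directed pairs; an undirected graph treats each pair
# as traversable in either direction.
nodes = ["plant", "lock", "dam", "channel"]
edges = [("plant", "dam"), ("dam", "channel"), ("lock", "channel")]

def neighbors(node, edges, undirected=True):
    """Return the nodes adjacent to `node` in the edge table."""
    out = [dst for src, dst in edges if src == node]
    if undirected:
        out += [src for src, dst in edges if dst == node]
    return out

adjacent = sorted(neighbors("channel", edges))
```

Because both tables fit a relational model, the same data could live in two database tables; the in-memory limitation noted above arises when traversal algorithms need the whole edge set resident at once.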
Spatial
Data regarding a location or graphical depiction (diagram or map) may be considered spatial data.
Spatial data generically consists of geometries (shapes) as an attribute (column) of a structured data set.
If the geometries represent physical locations on the Earth or another celestial body, those geometries
are considered geospatial. A geographic information system (GIS) is one form of repository for storing
geospatial data.
Coordinate Systems
Simple geometry data is based upon a standard Cartesian plane, such as a drawing program (e.g. Microsoft Visio or the open-source Dia) would support. If the geometries are geospatial, there may be a planetary coordinate system that allows for accurate representation on an ellipsoidal model of the planet. The coordinate systems of geospatial geometries are mathematically complex and distort aspects of the geometries in specific ways depending upon the coordinate system selected. As data is collected, it will exist in a single coordinate system.
As data is transformed into another coordinate system, a distortion will occur. Any analysis performed using data from different coordinate systems will contain a measure of error (or bias) based upon artifacts introduced by the distortion.
Spatial Accuracy, Precision and Scale
The accuracy and precision of spatial data has an added aspect of vertex density. A geometry, such as a
line, is composed of multiple points which are connected by a line. This chaining of points (vertices)
allows for the generation of complex lines (polylines or arcs) and polygons. When created (digitized) the
spacing between the vertices may be significant. Each vertex has accuracy and precision measures as
does the space between the vertices. For example, a line consisting of three points may be accurate and
precise to within one inch at the vertices, with no measure of accuracy or precision for the linear
segments connecting the vertices.
Spatial scale is the ratio of real-world distance to the depicted distance of a geometry. For geospatial
data, the stored coordinates are often exact within the precision of the data element (such as a 64-bit floating point number). In terms of scale, the defined geospatial scale is the scale at which the data was
collected for use. This scale is an indication of the nominal distances between vertices to depict
variation in location (such as the curves in a stream). As the scale becomes more precise (tends to 1/1)
there will be more vertices per unit distance to indicate variation. As the scale becomes less precise
(tends to 1/∞) there will be fewer vertices per unit distance. As a result, coarse-scale data will depict
less of the variation in a road or river as it meanders. Further, as the scale is coarser the accuracy of the
geometry will diverge from the physical world more between the vertices than at a finer scale. When
the spatial data is transformed to another coordinate system, this may cause great concern as the linear
segments between vertices cannot distort (curve) with the coordinate system unless interpolated
vertices are inserted. When vertices are added to introduce this curvature, the original vertex spacing is
lost and the data may appear finer-scale than the source data actually was. This is a serious issue that must be
addressed for all spatial data.
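The vertex-insertion issue can be sketched as follows; `densify` is a hypothetical helper that linearly interpolates extra vertices along a straight segment:

```python
import math

def densify(p1, p2, max_spacing):
    """Insert linearly interpolated vertices so that no gap between
    consecutive vertices exceeds max_spacing. The new vertices lie on
    the ORIGINAL straight segment -- they add no real information and
    can make the data look finer-scale than it actually is."""
    (x1, y1), (x2, y2) = p1, p2
    length = math.hypot(x2 - x1, y2 - y1)
    n = max(1, math.ceil(length / max_spacing))
    return [(x1 + (x2 - x1) * i / n, y1 + (y2 - y1) * i / n)
            for i in range(n + 1)]

# A 10-unit segment densified so vertices are no more than 2 units apart:
pts = densify((0.0, 0.0), (10.0, 0.0), 2.0)
print(pts)  # 6 vertices, but only the two endpoints were ever measured
```

After a coordinate transformation each interpolated vertex is shifted independently, producing the curved appearance described above while the provenance of the original two measured vertices is obscured.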
Temporal
Data describing the real world is always valid in time. Time will be considered a one-dimensional space
that is infinite in both directions. Like spatial data, time is an attribute of other data structures such as
tabular data. Temporal data can be categorized into three primary classes: archival, historical and true
temporal.
Archival
Archival is not truly a temporal aspect of the data itself, but of data handling. Archival is the practice
of removing old data from the primary operational data store to conserve space and improve
performance. Archival may be ad hoc or scheduled. Archived data is stored separately (even if only
logically) from operational data and requires additional processing to query. Archival data may be
coupled with temporal data for additional capability.
Historical
Historical data is a form of temporal data that persists data record changes over time. This is often
accomplished via a start date and end date for old records of a specific entity. This is commonly
coupled with auditing for transactional histories of data. The historical records may be queried to
provide a level of detail on how an entity has changed over time. In historical data, the temporal aspect
of the history is secondary as all data is operated upon in the present sense and historic information is
merely available for viewing. History data may be archived to free space.
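The start-date/end-date pattern can be sketched as follows (the entity, fields and dates are hypothetical):

```python
from datetime import date

# Hypothetical history rows for one entity: each row carries a start
# and end date; the open row (end=None) is the current version.
history = [
    {"bridge_id": 1, "status": "planned", "start": date(2008, 1, 1), "end": date(2009, 6, 1)},
    {"bridge_id": 1, "status": "built",   "start": date(2009, 6, 1), "end": None},
]

def update_status(history, new_status, when):
    """Close the open row and append a new current row --
    the usual history-table update pattern."""
    current = next(r for r in history if r["end"] is None)
    current["end"] = when
    history.append({"bridge_id": current["bridge_id"], "status": new_status,
                    "start": when, "end": None})

update_status(history, "under repair", date(2010, 3, 15))
current = next(r for r in history if r["end"] is None)
print(current["status"])  # under repair
```

All operations act on the single open row in the present sense; the closed rows remain available for inspection but are never the target of edits, which is what distinguishes historical from truly temporal data.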
Temporal
Truly temporal data also contains the notion of a start and end date, but adds the concept of time as a
first-class data element. A temporal data entity may be edited or created in the present, past or future.
For example, a bridge may be planned for construction in the future and entered into the system for
that future time. If the bridges are queried in the present, the future bridge will not appear. Likewise, a
bridge may be scheduled for demolition and therefore be indicated as demolished in the future. If the
system is queried in the future, the bridge will not appear. Finally, after the bridge is destroyed, it may
be rebuilt further in the future. Once rebuilt, the original bridge record may be resurrected as existing
once again (and considered as the same bridge). This allows for a fully temporally aware data system
that can be acted upon in time. Temporal data may be archived to free space and may be coupled with
spatial data for full four-dimensional space and time awareness.
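The bridge scenario above can be sketched as an "as of" query over records with validity intervals (the names and dates are illustrative):

```python
from datetime import date

# Hypothetical temporal store: each record is valid over [start, end).
# Future-dated records (a scheduled demolition, a planned rebuild) are
# first-class data, entered before the events occur.
bridges = [
    {"name": "River Rd bridge", "start": date(1980, 1, 1), "end": date(2012, 5, 1)},
    {"name": "River Rd bridge", "start": date(2015, 9, 1), "end": None},  # rebuild
]

def as_of(records, when):
    """Return the records valid at the queried point in time."""
    return [r for r in records
            if r["start"] <= when and (r["end"] is None or when < r["end"])]

print(len(as_of(bridges, date(2010, 1, 1))))  # 1 -- bridge exists
print(len(as_of(bridges, date(2013, 1, 1))))  # 0 -- demolished, not yet rebuilt
print(len(as_of(bridges, date(2020, 1, 1))))  # 1 -- rebuilt record active
```

Because the query is parameterized by time, the same store answers past, present and future questions without any special-case handling in the application.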
Data Formats
All data stored within a computer is represented as files. Even relational databases store their data in
files. An important aspect for data management is the format of the data, both in terms of data models,
and file formats. A data model is the logical decomposition of an entity into the properties that are
required to describe that entity. These data models are then persisted in some manner within files. For
example, the Microsoft Word application saves documents in a .doc or .docx file, each of which has
a specific encoding of the data contained within the file.
File Formats
Data file formats can be open or closed. An open format file is a well-known and well-documented
format such as a text file (regardless of content), an ESRI Shapefile or Autodesk dxf file. These formats
are well-defined and publicly published. Anyone can develop an application that can extract the
information from a file that conforms to these formats. A closed (proprietary) format is any file format
that is not well-known or well-documented. The Oracle .ora and MS Word .doc file formats are
examples of closed formats.
Stability
File formats will change over time as new versions are released. In general, as organizations embrace
change, file formats must change with them. Changes in file formats, whether open or closed, will affect all data
currently persisted. The more frequently a format is revised by the format definers, the less stable the
format is. As data is stored in a specific version of a format, that data will become more difficult to
maintain as the format is revised over time.
Longevity
Longevity is important to information strategies in four primary respects:
- Data longevity
- Storage longevity (management)
- Format longevity (stability)
- Media longevity
Data
Data longevity relates to the period of time that a given piece of information is relevant and valid. In
general, data is perpetual. For a given data element captured in time, it is a stable representation of
that element at that point in time, for all time. As time progresses, the data captured in the past
maintains its representation of that data element at the point in time it was captured. If any form of
temporal change determination is required, or for primary historical purposes, data must be
maintained in perpetuity. Only a small portion of raw information is truly transient in nature, whereas
many derived datasets are highly volatile in time.
When data longevity is considered, the applicable maintainable lifetime must be identified per data
element. For that time period, the data must be persisted and protected.
Storage
Storage longevity is the natural follow-up to data longevity, where the mechanisms for storage are
maintained. Storage longevity deals with the planning and maintenance of infrastructures and
applications for maintaining the data to be stored. Storage longevity includes hardware replacement
and archival plans for moving data to secondary stores.
An important consideration in storage longevity is failure mitigation and recovery. If any devices fail,
how will the data be restored to operational capability? Additionally, if third-party storage is used
(such as cloud storage), the storage longevity plan should include any considerations of inter-organization
practices and policies to ensure storage stability.
Format
Format longevity directly relates to the stability of selected storage formats. Further, format longevity
includes migration of data from one format to another as formats are upgraded, or as format selections
change (such as a migration from one RDBMS to another). Format longevity may be critical in terms of
affecting the chain of custody of the data. As a format is changed, it may alter the content (e.g. floating
point representations may differ, resulting in precision changes), or simply result in a copy that is no
longer the original, official content. This is a critical issue to the National Archives and Records
Administration (NARA).
Media
Media longevity deals with the effective lifespan of physical storage media. For example, the lifespan of
a compact disk (CD) is estimated at anywhere from 2-10 years or more
(http://www.archives.gov/records-mgmt/initiatives/temp-opmedia-faq.html). The actual lifespan of
physical media is most critical for offline or extended storage that is not used frequently. The active
data media will be replaced during storage maintenance, but media longevity is often unaddressed
except during long-term archival planning (again a critical NARA issue).
Data Volumes
The volume of data collected per unit time for a specific data entity will drive much of the planning for
data storage. In high-volume data collection activities, long-term storage is only practical using low-cost
offline media such as DVD, tape or Blu-Ray. The longevity of this media then becomes an issue,
therefore the media may need to be tested and replicated periodically to ensure stability of the content.
Even for mid to low density data collection, if the data is temporal with a high change rate (such as
product inventory), accumulation of records may result in an overall higher than anticipated volume.
Finally, aggregation of multiple data entities can also result in large data volumes. Scaling information
solutions to handle such volumes is a complex undertaking that is still in its infancy. Scaling may
include any of several approaches:
- Separation: keep each data entity type in an isolated store
- Sharding: keep natural subsets of a data entity type in an isolated store
- Versioning: keep only changes to data entities over time
- Archival: move older data offline to secondary stores
- Evaporation: delete out-of-date data automatically
Higher storage densities are also possible using existing technologies such as compression, efficient
encoding (e.g. JPEG vs BMP for images) and simple removal of redundancy.
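As a quick illustration of how much redundancy removal can buy, the sketch below uses Python's standard zlib module; the sample data is artificial and highly repetitive, so real-world ratios will be lower:

```python
import zlib

# Highly redundant data (a repeated record layout) compresses dramatically;
# the achievable ratio on real data depends entirely on its redundancy.
raw = b"site=42,temp=21.5,unit=C\n" * 10_000
packed = zlib.compress(raw, 9)  # maximum compression level

ratio = len(raw) / len(packed)
print(len(raw), len(packed))  # the compressed size is a small fraction of the raw size
```

Compression trades CPU time at read and write for storage density, so it pairs naturally with archival stores where access is infrequent.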
Separation
Data separation encompasses any means of separating data in a natural manner by keeping unrelated
information separate from other information. Service oriented architectures (SOA) and cloud
computing in general strongly advocate data separation.
Separation may make applications more complex as each type of data is stored in isolation from other
data. When it is necessary to query across data entity types, multiple data stores must be queried and
all result sets related based upon commonality. Further, there are issues with data editing anomalies
that must be addressed.
Sharding
The notion of sharding is related to that of separation, with the exception that a single data entity type
is separated into isolated stores based upon some property of the entity instances. For example, a
temperature store may be separated by location where each temperature monitoring site has its own
data store, but all use the same logical model. If temperature data is queried across all sites, then
multiple parallel queries must be executed and results unioned.
Sharding, like separation, may cause significant increases in complexity for applications consuming data
from a sharded data store due to the increased query integration complexity.
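The fan-out-and-union pattern that sharding imposes can be sketched as follows (the site names and readings are hypothetical):

```python
# Hypothetical shards: one store per monitoring site, identical schema.
shards = {
    "site_a": [{"ts": 1, "temp": 20.1}, {"ts": 2, "temp": 20.7}],
    "site_b": [{"ts": 1, "temp": 18.4}],
    "site_c": [{"ts": 2, "temp": 25.0}],
}

def query_all(shards, predicate):
    """Fan the query out to every shard and union the result sets --
    the extra integration step sharding imposes on applications."""
    results = []
    for site, rows in shards.items():
        results.extend({**row, "site": site} for row in rows if predicate(row))
    return results

hot = query_all(shards, lambda r: r["temp"] > 20.0)
print(len(hot))  # 3 readings above 20 degrees across all shards
```

In a production system the per-shard queries would typically run in parallel, but the union and any cross-shard sorting or aggregation still fall to the application layer.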
Versioning
The versioning of data is a form of temporal storage where only a single complete record is stored for a
data entity and all temporal changes are stored as change sets that only contain the information
elements that changed. In most circumstances, data formats drive the storage mechanisms used. Since
few data formats include the capability for versioning, there are few circumstances where versioning is
cost-effective to implement. In most cases, for temporal storage, entire copies of the data element are
stored for each temporal update made. This results in many copies of unchanged data being persisted
over time.
Versioning results in a performance cost in the retrieval of a specific change set. This performance cost
is a result of the integration of the changes (deltas) with the base record to arrive at the derived temporal
record. The base record may be optimized to be the most commonly retrieved temporal instance
(usually the most current record will be the completely stored base record, resulting in a reverse
temporal change set).
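A minimal sketch of delta-based versioning and the replay cost it incurs (the record fields are hypothetical):

```python
# Hypothetical delta store: a full base record plus change sets that
# hold only the fields that changed at each point in time.
base = {"part": "P-100", "qty": 500, "bin": "A1", "owner": "ops"}
deltas = [
    {"qty": 450},                  # version 1: stock drawn down
    {"bin": "B7"},                 # version 2: moved to a new bin
    {"qty": 0, "owner": "audit"},  # version 3: emptied and handed over
]

def materialize(base, deltas, version):
    """Rebuild the record at a version by replaying deltas onto the
    base -- the retrieval cost delta storage trades for space savings."""
    record = dict(base)
    for delta in deltas[:version]:
        record.update(delta)
    return record

print(materialize(base, deltas, 2))
# {'part': 'P-100', 'qty': 450, 'bin': 'B7', 'owner': 'ops'}
```

A reverse change set simply stores the most current record as the base and replays deltas backwards, so the common case (fetch the latest version) costs nothing.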
Archival
Archival is the removal of historical records from the primary store to a secondary, higher density but
lower cost, store. Archival may improve performance on the retained data by reducing the size of the
store that must be actively searched. Unfortunately, archival generally increases the complexity, cost and
time for querying into the archive.
Evaporation
Evaporation (or purging) is the removal (deletion) of old data from the system. Evaporation is
generally an automated process that expires old records based upon some criteria and deletes those
records from the system. This may or may not be in conjunction with archival, where old archives are
destroyed. Evaporation cannot be undone, so this must be thoroughly planned to ensure no useful data
is lost.
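An automated evaporation pass might be sketched as follows (the retention window and records are illustrative):

```python
from datetime import date, timedelta

def evaporate(records, keep_days, today):
    """Drop records older than the retention window. This is
    irreversible, so the cutoff criteria deserve careful review
    before the process is automated."""
    cutoff = today - timedelta(days=keep_days)
    return [r for r in records if r["created"] >= cutoff]

records = [
    {"id": 1, "created": date(2009, 1, 10)},
    {"id": 2, "created": date(2009, 10, 1)},
    {"id": 3, "created": date(2009, 11, 1)},
]
kept = evaporate(records, keep_days=90, today=date(2009, 11, 2))
print([r["id"] for r in kept])  # [2, 3] -- record 1 exceeded the 90-day window
```

In practice the purge would run as a scheduled job and would often archive expiring records first, so that deletion only ever removes data that already has an offline copy.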
Conclusions
Every organization must evaluate its information strategy and portfolio to ensure all information is
properly maintained for an effective lifecycle for all users, both intended and unintended. There are
many considerations for each form of data that comprises the organizational information corpus. Data
modeling is an activity which is critical to ensure the proper entities are captured in a repeatable,
standardized and maintainable manner.
Overall, data is the primary vehicle for transferring information, which is in turn used to derive knowledge.
Well-designed and well-defined data will provide better information and may create opportunities for new,
unplanned uses of the data without a need for new data models. These emergent capabilities may have
much greater societal impact than the original uses for which the data was modeled.
Appendices
Acronym List
Acronym  Description
DID Data Item Description
DoD Department of Defense
DSU Defense Systems Unit
GAO Government Accountability Office
NARA National Archives and Records Administration
References
Recommended