
Introduction to metadata management by bob panic


Data about Data – The complexities of Metadata

…a Whitepaper by Bob Panic, Solutions Architect – Business Applications, IT Infrastructure, Corporate Data Systems, Cloud Solutions, Security Architecture…

Web: www.bobpanic.com e: [email protected] m: +61(0) 424 102 603

Metadata Context - introduction by Bob Panic

The babble around metadata is growing like a howling wind in a small, locked room. It has become a particular topic of discussion in the media because governments are introducing data retention laws and “metadata legislation”.

Metadata has become a topic of conversation without anyone really understanding what metadata is and what it does (or does not) cover. This whitepaper is quite complex and covers a considerable spectrum of metadata elements. I have tried to compile a white paper that is broad and, hopefully, easier for the reader to understand.

At the end of the day discussions around data and metadata are technical in nature and you

cannot escape this in the context of this whitepaper. I always find when I present this topic to

groups in a workshop format, the message is a little better understood as the human element

comes into play and active feedback is encouraged.

Digital DNA – the data of metadata

I find that when discussing technical matters, a human interpretation is best. The best way for me to explain metadata, rightly or wrongly, in a way that can be easily understood is in the context of DNA.

DNA is a highly complex topic, yet most people have a good understanding of what DNA does

and how it relates to life and us as individuals, humans, animals, carbon based lifeforms if you will.

Metadata can be thought of as “Digital DNA”. It is the best way I can relate it to a context that is commonly understood. I find that even the summary “Data about Data” is a complex phrase for the average person to understand, and I hope that Digital DNA might be a better context.

And when you think about the “collection of metadata”, as most governments refer to the collection and storage of varied data sources, I think Digital DNA is an apt description for metadata.

What is Data?

“…any product of a digital technology can be referred to as data…”


The above is my personal statement. I think some will find fault with it, and I am sure most can summarise the concept better, but as the simplest statement I can make, I think it is pretty spot on.

Metadata – or Digital DNA – is the detailed “machine code” embedded within that digital

product, data, and can contain information on the who, what, where, why and how that

particular data was “born” and its particular travels through “life”.

…Metadata is as detailed, and as rich, as the digital system that produced the data was designed to make it…

You may need to read the above statement a few times to understand what Metadata can “do”

(or more accurately contain) and what it can “not do”.

Examples Please! …The Evolution of the digital Photo…

1. Imagine it is the early 2000s: not that long ago, but an eon in the life of a digital product. You have just unwrapped your new digital camera, the super awesome (and totally imagined) 1 megapixel ZoomTastic 2000, the very first digital camera of its kind! You look at your $2,000 investment and you know that this is the future; you may never need another camera again! Film is buried forever, the price of silver drops to an all-time low… You place your massive 1 MB memory card into the camera (it’s the new “film”, don’t you know) and you go to the most beautiful rose in your garden. You take what will be the first of many, many digital photos.

This digital photo is a product of your digital camera = DATA!

So now you take your digital photo back home to your computer and load it into a special bit of software, let’s call it “Photo House”. This software lets you view the most beautiful photo of a rose ever taken in a matter of seconds. No time waiting on processing or development of film.

This is the start of the instant photography revolution… and you are proud as punch…

But as you look at the photo on your computer screen, displayed with the help of your special “Photo House” editing software, you see bits of additional information you have never seen before. It is not displayed in the actual photo but “hidden”. It contains the time, the date, the f-stop, the ISO (sensitivity of the “film”) and the shutter speed: all little numbers and details that most people would not understand, but you can because you are a professional, and this critical data enhances your understanding of the digital photograph taken by this fancy new digital camera. This is = METADATA!

2. …and it’s now 2015. You are still a great professional photographer and you love to photograph everything. Your tool of choice is the fantastically imagined “20 Megapixel Pineapple YouSmartPhone 2015 ™”, fresh off the factory floor from East Bolivia.

So what has changed in the 15 years since you had your first digital camera?

a. This is your 15th such digital camera…

b. The picture is 20x bigger in size (bigger DATA file)

c. And the information stored within the file is just stunning: GPS co-ordinates, time, date, names and addresses, color modeling information, various topographical and location-based information, altitude, even the heartbeat of the photographer at the moment the shutter was pressed, all stored – more METADATA
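The split between DATA and METADATA in the two camera examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the metadata that the file system keeps about a file (size and modification time), not the camera settings embedded inside a photo, which would need an image library. The file name "rose.jpg" and the function names are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return (data, metadata): the bytes in the file and what the file system records about it."""
    with open(path, "rb") as handle:
        data = handle.read()  # the DATA: the content itself
    info = os.stat(path)      # the METADATA: facts about the content
    metadata = {
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }
    return data, metadata


def demo():
    """Write a stand-in 'photo' to a temporary folder and describe it."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "rose.jpg")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"pretend these bytes are pixels")
        return describe_file(path)


if __name__ == "__main__":
    data, metadata = demo()
    print(len(data), "bytes of data;", "metadata:", metadata)
```

Nothing in the metadata dictionary appears in the picture itself, yet it travels with the file wherever the file goes.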

Privacy vs. Security

The term metadata, like the now overused term “The Cloud”, is becoming a general buzzword, a piece of vernacular used to describe something complex without anyone really being able to understand what it means. This is the trap when technical terms become common topics of discussion in cafés and general conversation. Fear and mistrust set in when complex technologies are poorly understood by general society and bandied about by politicians who wrap phrases like “national security” and “legislation” into the same sentence as metadata.

Big Data or Big Brother?

Society fears a government that intrudes on our private lives, and when governments legislate the

capture and storage of our private information we wonder, what will they find?

The Government has my data…so what?

So what “secrets” would a government know about “me” from my metadata, you might ask? The basic answer: quite a bit. In fact, a considerable map of your life would emerge and could be plotted out from your “metadata”. But before you cry out in anger and disgust, remember that anyone can learn a lot about you just from your personal social media profile. So why the fuss?

The fear is that, armed with a highly detailed spectrum of information about an individual, governments would use the social sea that you inhabit - the social and private data you create (or that is created about you) - in a manner that curtails freedoms and democratic rights.

Possible? Yes. Probable? Unlikely.

Historically, data theft, identity theft and misrepresentation have been the domain of the criminal sector, a dark world where your stolen credit card details are used for nefarious means. The most likely scenario, however, is that a prepubescent tween, hidden in the basement of his parents’ home in some remote and dull outer suburban wasteland, bored and with little social life, has hacked a poorly secured gaming or smart TV network and stolen parts of your private life, happily charging soft porn Hustler subscriptions to your Myer One card without your knowledge or approval.

And this is where the average person gets scared. If the Russian mafia (or, more likely, a 13-year-old) can hack into my private details and run rampant, what would, or could, the government do?

As I write this, the debate on metadata collection is still heating up in Australia and the proposed legislation already has 30-odd amendments to it. One such amendment is that a journalist’s data is exempt from metadata retention laws.

Sounds great and noble, but how do we distinguish “my” metadata from a journalist’s metadata?


In fact who can claim to be a journalist in 2015?

I am a blogger, am I not considered a creator of copy and journalistic prose?

What about hate speech bloggers? They too could be considered journalists. Oh, sorry we are

talking about “professional/paid” journalists! Ok, that discounts the funds hate speech bloggers

get from the Ku Klux Klan to support hate speech web sites and online newspapers…and by the

way…fundamentalist groups don’t publish articles do they? Nor blog?

Let’s get real. At a given point in time, in the collection and storage of metadata, you can determine neither the origin nor the producer of the original data source. Have a long think about that last sentence.

So what that means is that significant smarts, either human or computational algorithms, would

need to be applied before data context is determined: who created it, who is the intended

recipient, what was the intended purpose…etc.

Before metadata can be interpreted it needs to be gathered en masse and stored en masse, and to the eyes of the average beholder it will look like a mess!

But even before all this collection and storage of “Big Data”, something fundamental needs to be clearly understood about all this metadata:

As mentioned earlier, metadata is only as good, and as detailed, as the “system” that was designed to create the data in the first place. A bit hard to understand, I know, so back to my digital camera example:

In 2000 the digital camera was designed to provide only a limited amount of information as part of its core function. This was a direct limitation of the technology at that time, of the specific manufacturer, and even of the budgetary restrictions on the specific model. If you were to grab a digital photo file from the year 2000 and look at its metadata, you might not be able to find specific information on location, time, or who took the photo. The technology of the time just could not create that much detail.

In 2015, by contrast, with the demands of consumers, fierce competition between brands and vast leaps in technology, digital cameras have morphed into smartphones, and these devices create a myriad of metadata for each and every digital photo taken. More digital photos are taken by iPhones than by any other digital photographic device or system, and the metadata contained within each photo is mind-boggling.

So what about 2020? This is an interesting flight of fancy for future gazing. We see glimpses of this 2020 future today in both social media sites like Facebook and our smartphones. One such technology/feature: facial recognition!

Try the following experiment. Grab a smart phone that is not more than 12 months old and take a

photo of a person standing in front of a busy street. The smartphone will instantly detect the face,

automatically focus on the face (separating it from other objects in the background/foreground)

and the result: you take a nice sharp photo of your loved one with your snazzy smart phone. Very

smart indeed.

I am hoping you can now see where I am going with this line of thinking. So you now look to post your lovely photo of your loved one on your Facebook or social media account. Can you see what happens? Facebook (or any other social media site for that matter) will automatically detect that the uploaded photo contains a face/person and it asks you if you want to create a tag and name the loved one or individual in your photo. And as the dutiful slave to technology you are, you freely oblige by identifying the person by name.

This is an example of how rich metadata gets created. So far no evil government intervention is required. The manufacturers create digital products that create digital outputs, and metadata is generated. As the human interface to this created data, we then add context and a value that goes beyond a simple, private photo taken for our personal enjoyment.

So by 2020 we would have created so much additional information about every single

photograph ever taken, that our smartphones will automatically identify and tag specific

individuals, no need for any human intervention. This could then be referred to as Super Rich Data.

By 2025 companies and governments could collect so much super rich “big data” that we could

potentially take a photo in the busy intersection of Times Square in New York, and each and every

face within that photo could be recognised and tagged, automatically. That is not big brother,

or evil governments spying on our every move (a nod to “Person of Interest” TV show here), this is

society generated rich metadata - just what the government needs to fight fundamentalist

insurgencies, hate speech and terrorism…

But Wait there is more…!

Metadata is not just the domain of digital photography. There is the internet, online advertising, Facebook and social media metadata, internet browsing history, social tagging, email trails and communications, mobile phone tracking and GPS-enabled devices, the Internet of Things, data storage, cloud technologies, service-oriented software, public data sharing and the endlessly growing list of software and digital hardware creating digital DNA: a sea of metadata breadcrumbs scattered throughout the virtual world that lead straight back to your front door.

But these

Now on to the technical complexities of metadata. The below has been compiled and edited to

provide a detailed glimpse into the complex nature of data and machine metadata.

Metadata Management

Metadata Management Life Cycle

The Metadata Management Life Cycle defines the various phases of the end-to-end metadata management process, from planning through maintenance to the retirement of metadata.


Governance and Planning

Governance and Planning involves initial planning, defining the objectives for metadata

management process, identification of owners and associated roles and responsibilities for each

of the stakeholders.

The ability to ingest and explore any data – including structured, semi-structured and unstructured data – is critical to getting the most out of corporate data and data warehouses. Given this usage, it is challenging to enforce a strict control and governance regime on the data being ingested into Data Warehouse environments, and hence governance of metadata is of relatively lesser significance in this context.


Metadata Content

Metadata content defines the types of metadata that need to be captured as part of the

metadata management process.

Metadata Capture Strategy

Metadata capture strategy defines the process and/or tools that need to be used for capturing

the required metadata. Strategy for metadata capture can include multiple tools/approaches

based on the type of data and feasibility constraints. The strategy outlines the guidelines for using

an appropriate tool or mechanism for identified use cases.

Type of Metadata: Definition / Description

Business Metadata: Business Metadata defines the data in the Warehouse in user friendly terms. Business Metadata captures ‘what’ data is stored in the Warehouse, ‘where’ the data is sourced from, ‘how’ the data is used and its relationship to other data in the Warehouse.

Technical Metadata: Technical Metadata defines the data, objects and processes in the Warehouse from a technical point of view. Technical Metadata captures system metadata – such as tables, data elements, indices, partitions in a relational database, files stored in the cluster, security classification for the data elements etc.

Operational Metadata: Operational Metadata (or sometimes also referred to as the Process Metadata) is the data about the processes in the Warehouse. Operational Metadata captures process schedules, frequency of batch processes, status summary and usage statistics for various processes etc.

Business Rules & Transformation Rules: Business Rules and Transformation Rules related metadata capture the rules applied on data elements during the data acquisition, data ingestion or data extraction and loading processes in the Data Warehouse. In some cases, this metadata can also be used to dynamically process and load the source data feeds into the Data Warehouse.

System Statistics: System Statistics related metadata captures data related to system resource utilisation for proactive monitoring and maintenance within a Data Warehouse environment.

Metadata for Downstream Process: Metadata for downstream processes captures the ‘Technical Metadata’ including mapping of data elements from the Warehouse to downstream processes or applications such as BI tools, analytical models or any other downstream applications.


Metadata Model and Integration

Metadata Modelling defines the data modelling strategy for the metadata repository. Metadata

Integration defines the approach for integration of various types of metadata including

integration from various metadata repositories, if applicable.

Metadata Visibility

Metadata Visibility defines the processes associated with enabling access to the metadata

elements, types of analyses and use-cases for usage of metadata by end-users.

Metadata Standards and Quality

Metadata Standards and Quality have, in the past, been of relatively lesser significance than the other phases in the context of Data Warehouse planning and commissioning. Metadata is created once and is only occasionally used by a limited set of users. Hence organisations typically do not invest in tracking or enhancing the quality of the metadata captured, either through an automated process or through a manual process. However, as the gathering and use of metadata grows and comes to be governed by state and federal laws, standards will need to be strengthened and full auditability will need to be proved to ensure the “sanctity” of core metadata repositories.

Maintenance and Retirement

Maintenance and Retirement defines the following aspects of the metadata management process:

Purging and archival of obsolete metadata (Operational Metadata, for example)

Restructuring and enhancements to the Metadata Model

Processes and Governance for ensuring the accuracy and timeliness of the metadata captured with on-going changes and project releases
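The purging and archival step can be pictured with a small Python sketch. It is a minimal illustration, assuming that each Operational Metadata entry carries a job end time and that a retention period in days has been agreed; the field name job_end_time, the function name and the 90 day figure in the usage note are all hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta, timezone


def split_for_retention(entries, retention_days, now):
    """Split operational metadata entries into (keep, archive) by age.

    Entries whose job_end_time is older than the retention window are
    returned in the archive list so they can be archived and then purged.
    """
    cutoff = now - timedelta(days=retention_days)
    keep = [entry for entry in entries if entry["job_end_time"] >= cutoff]
    archive = [entry for entry in entries if entry["job_end_time"] < cutoff]
    return keep, archive
```

Calling split_for_retention(entries, 90, datetime.now(timezone.utc)) would leave the last 90 days of job history in the repository and hand everything older to the archival process.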

Metadata Content - Detail

This section details the list of recommended metadata data elements that need to be captured

for various types of Metadata as part of the Metadata Management strategy for the environment.

Business Metadata

Following are the recommended data elements that need to be captured for the Business Metadata. The Conceptual Model and Logical Model information are also stored in the Business Metadata for ease of use and to support impact analysis for any business changes.


Metadata Data Elements                  Level
Source Feed Business Name               Source Feed
Source Feed Business Description        Source Feed
Source Feed Usage                       Source Feed
Source Feed Group Name                  Source Feed
External Data Source Indicator          Source Feed
Source Host Code Name                   Source Feed
Source Feed Business Owner / Contact    Source Feed
Source Feed Technical Contact           Source Feed
Source Column Business Name             Source Column
Source Column Business Description      Source Column
Target File Business Name               Target File
Target File Business Description        Target File
Target File Usage                       Target File
Subject Area                            Target File
Data Security Classification            Target File
Target Column Business Name             Target Column
Target Column Business Description      Target Column
Target Column Synonym(s)                Target Column

Technical Metadata

Following are the recommended Technical Metadata data elements that need to be captured for the ODS, Data Warehouse, Data Marts and Source Systems. These should be captured for all sources, targets and extracts provided.

Level                  Metadata Data Elements
Source Feed            Source Feed Name
Source Feed            Source Database Name
Source Feed            Source Table Technical Name
Source Feed            Source Data File Name
Source Feed            Source Feed Group Name
Source Feed            Source Host Type
Source Feed            Source System Code Name
Source Feed            Source Feed Format Type
Source Feed            Source File Layout Definition (XSD / JSON etc.)
Source Feed            Source Trigger File Name
Source Feed            Source Trigger File Type and Format
Source Feed            Source Encryption Method
Source Feed            Source Feed Profile Path
Source Feed            Source Feed Delivery Frequency
Source Feed            Exception Days for the Source Feed
Source Feed            Expected Delivery Time of the Source Feed
Source Feed            Expected Number of Records
Source Feed            Number of Columns (Source Feed)
Source Column          Source Column Technical Name
Source Column          Source Column Data Format
Source Column          Source Column Data Type
Source Column          Source Column Data Length
Source Column          Required / Optional (NULL) Indicator
Target File            Target File Name
Target File            Target File Format Type
Target File            Target File Layout Definition (XSD / JSON etc.)
Target File            HDFS Location (Directory Path)
Target File            Target Data Security (ARD Role)
Data Source            Ingestion Method / Extraction Method
Target File            Archive Location
Target File            Target Encryption Method
Target Object          Target Resource Size
Target File / Table    Update Frequency
Target File / Table    Update Type
Target Column          Target Column Technical Name
Target Column          Target Column Data Format
Target Column          Target Column Data Type
Target Column          Target Column Data Length
Target Column          Expression / Transformation (Source – Target)
Column                 Column Delimiter Used
Column                 System of Record / System of Reference
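To show why column-level Technical Metadata is worth capturing, the following Python sketch uses three of the elements listed above (Source Column Data Type, Source Column Data Length and the Required / Optional indicator) to check an incoming record. It is a minimal illustration only: the ColumnMetadata class, the validate_record function and the customer feed columns are hypothetical, and values are assumed to arrive as strings from a delimited feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    technical_name: str  # Source Column Technical Name
    data_type: type      # Source Column Data Type (int or str in this sketch)
    max_length: int      # Source Column Data Length
    required: bool       # Required / Optional (NULL) Indicator


def validate_record(record, columns):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for column in columns:
        value = record.get(column.technical_name)
        if value is None or value == "":
            if column.required:
                problems.append(f"{column.technical_name}: required value missing")
            continue
        if len(value) > column.max_length:
            problems.append(f"{column.technical_name}: longer than {column.max_length}")
        try:
            column.data_type(value)  # e.g. int("x1") raises ValueError
        except ValueError:
            problems.append(f"{column.technical_name}: not a valid {column.data_type.__name__}")
    return problems


# Hypothetical column metadata for an imagined customer feed.
CUSTOMER_FEED = [
    ColumnMetadata("cust_id", int, 10, True),
    ColumnMetadata("cust_name", str, 40, True),
    ColumnMetadata("postcode", str, 4, False),
]
```

Because the checks are driven entirely by the metadata, adding a column to the feed means adding one ColumnMetadata entry rather than changing the validation code.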

Operational Metadata

Following are the data elements recommended to be captured as part of the Operational Metadata. The Operational Metadata captured does not vary based on the source system or the type of the source data.

Operational Metadata data elements can be classified into 2 broad categories – Data Movement and Data Usage – for each of the source data types.

Following are the recommended Operational Metadata data elements that need to be captured:

Metadata Data Elements Structured Unstructured

Data Movement Metadata

Source Feed Delivery Time SLA

Source Feed Delivery Time (Actual)

Source Feed Exception Indicator

Source Feed Exception Details

Number of Records Received

Expected Number of Columns


Actual Number of Columns Received

Data Load Rule Name

Data Load Rule Threshold Type

Data Load Rule Failure Value

Data Load Rule Last Failure Date and Time

Business Date

Last Data Load Date and Time

Data As of Date

Job Name

Job Description

Job Location

Job Type (Batch / Real-Time etc.)

Job Execution Frequency

Job Execution Start Time

Job Execution End Time

Job Status

Job Completion Time SLA

Job Execution Exception Indicator

Job Execution Exception Type

Job Execution Exception Details

Number of Success Records

Number of Exception Records

Number of Rejected Records

Data Usage Metadata

Access Count

Last Access Date and Time

Last Access User / Process

Number of Queries / Extractions

Last Extraction Date and Time

Output Protocol (FTP, Tumbleweed etc.)


Business Rules and Transformation Rules

Following are the recommended Business Rules and Transformation Rules related metadata data elements that need to be captured:

Metadata Data Elements File Level Column Level

Rule Name

Rule Type

Rule Level Name

Rule Threshold Type

Alert Threshold Value

Abort Threshold Value

Rule Default Value

Trigger Field Name

Rule Filter Condition

Rule Parameter Name

Rule Parameter Value
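The rule elements above can be read as the inputs to a simple decision. The Python sketch below is one possible interpretation, not a description of any specific product: it assumes a Rule Threshold Type of either "PERCENT" or "COUNT", and it assumes that reaching the Alert Threshold Value raises a warning while reaching the Abort Threshold Value stops the load. The LoadRule class and evaluate function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadRule:
    rule_name: str
    threshold_type: str     # assumed values: "PERCENT" or "COUNT"
    alert_threshold: float  # Alert Threshold Value
    abort_threshold: float  # Abort Threshold Value


def evaluate(rule, rejected_records, total_records):
    """Return "OK", "ALERT" or "ABORT" for one data load."""
    if rule.threshold_type == "PERCENT":
        # Guard against an empty feed to avoid dividing by zero.
        observed = 100.0 * rejected_records / total_records if total_records else 0.0
    else:
        observed = float(rejected_records)
    if observed >= rule.abort_threshold:
        return "ABORT"
    if observed >= rule.alert_threshold:
        return "ALERT"
    return "OK"
```

With a rule such as LoadRule("reject_rate", "PERCENT", 1.0, 5.0), a load that rejects 20 of 1,000 records (2 percent) would raise an alert but continue, while one that rejects 60 would be aborted.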

System Statistics

Following are the recommended System Statistics that need to be captured. The metadata data elements listed are high-level statistics, each of which can comprise one or more detailed statistics. The detailed list of system statistics that can be captured depends on the Operating System, the monitoring tools used etc. The table below provides examples of detailed statistics for each category.

Metadata Data Elements: Examples

CPU Utilisation: CPU Utilisation of System Processes, CPU Utilisation of Applications / Users, CPU Idle Time etc.

Memory Utilisation: Total Physical Memory, Memory used for Swap, Memory Used for Caching etc.

Storage Utilisation: Total Space Available, Utilised Space

I/O Utilisation: Number of Transfers per Second, Data Reads (kB/s), Data Writes (kB/s), I/O Wait Time, Reads per Second, Writes per Second etc.
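As a small illustration of auto-captured System Statistics, the Python sketch below collects the two Storage Utilisation examples from the table (Total Space Available and Utilised Space) using the standard library call shutil.disk_usage. The function name and the dictionary keys are assumptions made for this example; CPU, memory and I/O statistics would normally come from the operating system or a monitoring tool and are not covered here.

```python
import shutil
from datetime import datetime, timezone


def capture_storage_statistics(path="."):
    """Capture storage utilisation for the file system that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent = round(100.0 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "total_space_bytes": usage.total,
        "utilised_space_bytes": usage.used,
        "utilised_percent": percent,
    }


if __name__ == "__main__":
    print(capture_storage_statistics("."))
```

Run on a schedule, a function like this would produce one timestamped record per run, which is the kind of history proactive monitoring relies on.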


Metadata Capture Strategy

In the context of a Data Warehouse, metadata is captured only in the production environment.

The approach or strategy for capturing the metadata for the Warehouse can be broadly classified into the following categories:

Metadata capture for structured data

Metadata capture for semi-structured / unstructured data sources

Metadata capture for downstream processes from Warehouse

The following table summarises the metadata capture strategy by type of Metadata

Metadata Type: Options

Business Metadata: Sourced from Commercial BI Metadata Repository; Manual Capture

Technical Metadata: Sourced from Commercial BI Metadata Repository; Auto-Capture (from system tables / repositories); Manual Capture

Operational Metadata: Published to Metadata Repository; Auto-Capture (from Application Repositories)

Business Rules & Transformation Rules: Custom Manual Capture (through the portal)

System Statistics: Auto-Capture

Metadata for Downstream Processes: Manual Capture

Business Metadata

Business metadata provides the data definition for each of the data elements processed and

loaded into the Warehouse. The metadata management process should provide a mechanism

for manual capture of Business Metadata during the design phase.

Following are the general guidelines for capturing the Business Metadata

For structured data sourced

o If the Business Metadata is available within the Source Metadata Repository, the required data elements should be sourced and loaded into the Data Warehouse Metadata Repository

o If the Business Metadata is not available within the Source Metadata Repository, the data owner responsible for the movement of the data from Source to Data Warehouse should provide the business metadata. The metadata can be captured manually using a customised template used for the Metadata Management process.

Data Stewards or Analysts responsible for capturing (creating) the business metadata should be able to upload the metadata through a self-service portal. This would enable authentication and authorisation for the users capturing or creating the metadata.

Alternatively, Data Stewards or Analysts can be provided with a UI on the portal for creating the business metadata that cannot be sourced programmatically.

For any other source data feeds and target objects (in all cases), business metadata should be captured using the manual capture process. When the data is captured through the manual process

o Metadata is certified, validated and released

The table below captures the details of metadata capture by layer for Business Metadata

Layer                 When            Metadata Capture Strategy    Responsible Party
Data Access Layer     Design Phase    Manual Capture               Business Analysts
Data Storage Layer    Design Phase    Manual Capture               Business Analysts

Technical Metadata

Technical metadata captures the details of how, what and where the data elements are stored

within the Data Warehouse environments. Given the multitude of options for modelling and storing

the various types of data in a Data Warehouse, the Technical Metadata captured varies based

on the type of data being sourced or ingested into the Data environment.

The table below captures the details of metadata capture by layer for Technical Metadata

Layer                     When                          Metadata Capture Strategy    Responsible Party
Data Access Layer         Design Phase                  Auto-Capture                 Data Stewards
                          Design Phase                  Manual Capture               Data Stewards
Data Landing Layer        Design Phase                  Auto-Capture                 Data Stewards
Data Integration Layer    Design Phase                  Manual Capture               Data Stewards
Data Storage Layer        Design / Development Phase    Auto-Capture                 Data Stewards
                          Design Phase                  Manual Capture               Data Stewards
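Auto-Capture of Technical Metadata from system tables can be sketched with SQLite, which ships with Python. This is only an illustration under stated assumptions: a real warehouse would query its own catalogue, and the customer table, the function names and the output keys are hypothetical. The sketch relies on SQLite's PRAGMA table_info, which returns one row per column of a table.

```python
import sqlite3


def capture_column_metadata(connection, table_name):
    """Read column-level technical metadata from SQLite's own catalogue.

    table_name is interpolated into the statement, so it must come from a
    trusted source such as the list of tables the capture job manages.
    """
    rows = connection.execute(f"PRAGMA table_info({table_name})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [
        {"column": name, "data_type": column_type, "required": bool(notnull)}
        for _cid, name, column_type, notnull, _default, _pk in rows
    ]


def demo():
    """Create a hypothetical table in memory and capture its metadata."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE customer ("
            "cust_id INTEGER NOT NULL, cust_name TEXT NOT NULL, postcode TEXT)"
        )
        return capture_column_metadata(connection, "customer")
    finally:
        connection.close()


if __name__ == "__main__":
    for column in demo():
        print(column)
```

Nobody typed this metadata in by hand; it was read back from the system that already holds it, which is what distinguishes Auto-Capture from Manual Capture in the table above.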

Operational Metadata

Operational Metadata captures data from the auditing and logging for data acquisition, data

transformation and loading processes, BI usage data, details around data integration job and

report execution times etc.


The approach and guidelines for capturing the Operational Metadata depend on the type of operational data being captured and can be broadly classified into the following categories

Operational Metadata for Data Movement

Operational Metadata for BI and Analytics

The Metadata Management process implemented should capture the Operational Metadata for data movement during the actual job execution. The metadata should be captured programmatically without any manual intervention. Operational Metadata for Data Usage, however, can be extracted on a periodic basis and can be scheduled.

Metadata Repository

An Operational Metadata repository should be created for the Data Warehouse

It is recommended to implement a metadata repository at least for Operational Metadata, irrespective of the Data Modelling strategy adopted

If an integrated Metadata Repository is implemented, the Operational Metadata can be part of the repository (subject area approach)

Guidelines

Following are the general guidelines for capturing Operational Metadata for Data Movement:

A common approach is used for capturing Operational Metadata for structured, semi-structured and unstructured data

Metadata capture should be event driven, and the required data elements should be published into the metadata repository as soon as the data movement process / cycle completes

Data Ingestion, Data Extraction and Data Load processes should have a mechanism to publish the required data elements into the Operational Metadata repository

o The data elements may be published using pre and post processing scripts for the batch processes

o Alternatively, a control script can continuously monitor the batch process and publish the required data elements into the Operational Metadata repository
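The pre and post processing approach described above could be sketched as follows. This is a minimal illustration only: the `job_run` table, its columns and the `capture_run` helper are assumptions made for the example (they are not defined in this whitepaper), and an in-memory SQLite database stands in for the Operational Metadata repository.

```python
import sqlite3
import time
from contextlib import contextmanager


def create_repository(conn):
    # Hypothetical table layout for the Operational Metadata repository.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_run (
               job_name TEXT, started_at REAL, finished_at REAL,
               status TEXT, rows_moved INTEGER, error TEXT)"""
    )


@contextmanager
def capture_run(conn, job_name):
    """Wrap a data movement step and publish its run record when it ends."""
    run = {"rows_moved": 0}          # the job fills this in while it runs
    started = time.time()            # pre-processing: note the start time
    status, error = "SUCCESS", None
    try:
        yield run
    except Exception as exc:
        status, error = "FAILED", str(exc)
        raise
    finally:
        # Post-processing: publish as soon as the process completes,
        # whether it succeeded or failed.
        conn.execute(
            "INSERT INTO job_run VALUES (?, ?, ?, ?, ?, ?)",
            (job_name, started, time.time(), status, run["rows_moved"], error),
        )
        conn.commit()


conn = sqlite3.connect(":memory:")
create_repository(conn)
with capture_run(conn, "load_customer") as run:
    run["rows_moved"] = 1250  # the real load step would go here
```

Because the record is written in the `finally` clause, no manual step is needed and failed runs are captured alongside successful ones.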

Following are the general guidelines for capturing Operational Metadata for BI and Analytics:

Operational Metadata for BI and analytics will be primarily sourced from the application repositories

Metadata capture can be batch oriented, with the ability to support intra-day batches

The table below captures the details of metadata capture by layer for Operational Metadata

Layer                   When                            Metadata Capture Strategy  Responsible Party
Data Integration Layer  Data Movement                   Auto-Capture
Data Storage Layer      Post Go-Live, on regular basis  Auto-Capture


Business Rules & Transformation Rules

Business Rules and Transformation Rules applied to the data sourced into the Data environment are always captured through a custom manual process. This section provides the general guidelines for capturing the Business Rules and / or Transformation Rules based on the type of data.

Structured Data

Business Rules and Transformation Rules should be captured as separate rules

Applicable Business Rules and Transformation Rules should be captured at both Source Table level and Source Column level

Linkage between the Business Rules and Transformation Rules should be established through the source object

Multiple rules may be associated with a given Source Table or Source Column

Rules may either be captured and stored in the metadata repository (database) or maintained as Excel files associated with the source object
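One way to picture these guidelines is a small in-memory structure in which the source object is the key that links the two kinds of rule. The class and field names below (`SourceObject`, `Rule`, `RuleRegistry`) are hypothetical and chosen for illustration; a real repository would hold the same relationships in database tables.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceObject:
    table: str
    column: Optional[str] = None  # None means a table-level object


@dataclass(frozen=True)
class Rule:
    rule_id: str
    kind: str  # "business" or "transformation", kept as separate rules
    text: str


class RuleRegistry:
    """Links business and transformation rules through the source object."""

    def __init__(self):
        self._rules = defaultdict(list)

    def attach(self, source, rule):
        # Multiple rules may be attached to the same table or column.
        self._rules[source].append(rule)

    def rules_for(self, source, kind=None):
        return [
            r for r in self._rules.get(source, [])
            if kind is None or r.kind == kind
        ]


registry = RuleRegistry()
customer_email = SourceObject("CUSTOMER", "EMAIL")
registry.attach(customer_email, Rule("BR-1", "business", "Email must be unique per customer"))
registry.attach(customer_email, Rule("TR-1", "transformation", "Lower-case and trim the value"))
registry.attach(SourceObject("CUSTOMER"), Rule("BR-2", "business", "Only active customers are loaded"))
```

Asking the registry for `customer_email` returns both the business rule and the transformation rule, which is the linkage through the source object that the guidelines call for.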

Semi-Structured / Unstructured Data

Business Rules and Transformation Rules should be captured as separate rules

Rules should be captured at source feed level

Multiple rules may be associated with a given source feed

It is recommended to capture the rules using Excel files associated with the source objects

o Business rules can be optional at field level

o Transformation rules applicable at field level may be captured in the Excel files

Metadata related to Business Rules and Transformation Rules is dependent on the Technical Metadata for the source data feeds or source data elements. To ensure the quality and accuracy of the metadata, it is recommended to capture the business rules and transformation rules metadata through a UI on the portal, with the following checks and balances:

Source data feeds and data elements should be pre-populated from the Technical Metadata available in the metadata repository

End-users should not be able to edit or modify the source data elements

The UI can have basic validations to ensure mandatory metadata elements are captured

The UI should also allow users to upload a file with the rules, either at source data feed level or at source data element level

Users should be able to edit (update or delete) any rules entered through the UI
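The first three checks above could be approximated on the server side along the following lines. This is a sketch under stated assumptions: the field names in `MANDATORY_FIELDS` and the shape of `technical_metadata` (feed name mapped to a set of element names) are invented for the example and are not prescribed by this whitepaper.

```python
# Hypothetical mandatory fields for a rule submitted through the portal UI.
MANDATORY_FIELDS = ("source_feed", "source_element", "rule_type", "rule_text")


def validate_submission(submission, technical_metadata):
    """Return a list of validation errors; an empty list means the rule is accepted."""
    errors = []
    for name in MANDATORY_FIELDS:
        if not str(submission.get(name) or "").strip():
            errors.append(f"missing mandatory field: {name}")

    # Source elements come from the Technical Metadata and are not editable,
    # so anything not already known to the repository is rejected.
    feed = submission.get("source_feed")
    element = submission.get("source_element")
    if feed and element and element not in technical_metadata.get(feed, set()):
        errors.append(f"unknown source element: {feed}.{element}")
    return errors


technical_metadata = {"crm_daily": {"customer_id", "email"}}
accepted = {
    "source_feed": "crm_daily",
    "source_element": "email",
    "rule_type": "transformation",
    "rule_text": "Lower-case the value",
}
rejected = dict(accepted, source_element="phone", rule_text="  ")
```

Here `accepted` passes with no errors, while `rejected` is flagged both for the blank rule text and for referencing an element that is not in the Technical Metadata.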

The table below captures the details of metadata capture by layer for Business Rules and Transformation Rules related Metadata

Layer                   When          Metadata Capture Strategy        Responsible Party
Data Integration Layer  Design Phase  Manual Capture (Custom Process)

System Statistics

System Statistics for the Warehouse environment should be captured using automated capture from the system logs, or through the use of system monitoring tools and utilities.


Following are the general guidelines for capturing System Statistics:

System statistics should always be captured using an automated process

Key utilisation statistics such as CPU or memory utilisation should be tracked continuously

Utilisation statistics for other resources such as storage may be captured on a periodic basis
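The split between continuously tracked and periodically captured statistics can be illustrated with a small sampler that gives each metric its own interval. The `StatsSampler` class and the intervals chosen below are assumptions made for the example; in practice a system monitoring tool would normally do this job.

```python
import os
import shutil
import time


def storage_used_pct(path="."):
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def cpu_load_1min():
    # os.getloadavg is not available on every platform; report None there.
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


class StatsSampler:
    """Collects each registered metric no more often than its interval."""

    def __init__(self):
        self._metrics = {}  # name -> [collector, interval_seconds, last_run]

    def register(self, name, collector, interval_s):
        self._metrics[name] = [collector, interval_s, None]

    def sample(self, now=None):
        now = time.time() if now is None else now
        readings = {}
        for name, entry in self._metrics.items():
            collector, interval_s, last_run = entry
            if last_run is None or now - last_run >= interval_s:
                readings[name] = collector()
                entry[2] = now
        return readings


sampler = StatsSampler()
sampler.register("cpu_load_1min", cpu_load_1min, interval_s=5)           # near-continuous
sampler.register("storage_used_pct", storage_used_pct, interval_s=3600)  # periodic
first = sampler.sample(now=0.0)
second = sampler.sample(now=10.0)
```

The first call collects both metrics; ten seconds later only the CPU figure is due again, while storage waits for its hourly slot.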

The table below captures the details of metadata capture by layer for System Statistics

Layer                   When                            Metadata Capture Strategy  Responsible Party
Data Landing Layer      Post Go-Live, on regular basis  Auto-capture               System Administrators
Data Integration Layer  Post Go-Live, on regular basis  Auto-capture               System Administrators
Data Storage Layer      Post Go-Live, on regular basis  Auto-capture               System Administrators

Metadata for Downstream Processes

Metadata for the downstream processes comprises business metadata for the target objects and technical metadata for the target objects, including the lineage from the warehouse / Hadoop to the downstream data repositories (data marts, Hive, HBase etc.), BI tools or analytical models. This metadata is required to enable complete lineage analysis from the source systems to the target applications.

Following are the general guidelines for capturing the metadata for downstream processes:

Business Analysts or the data stewards responsible for moving the data from the Data Warehouse to the downstream applications should be primarily responsible for capturing the Business Metadata elements

Technical SMEs / technical points-of-contact for the downstream applications should be primarily responsible for capturing the Technical Metadata, including the lineage metadata

Any business rules and transformation rules applied should be captured at both Entity and Attribute level


The table below captures the details of metadata capture by layer for Metadata for Downstream Processes

Layer               When          Metadata Capture Strategy  Responsible Party
Data Storage Layer  Design Phase  Manual Capture             Business Analysts, Data Analysts, Data Stewards


Metadata Modeling and Integration

Metadata modelling defines the approach or data modelling strategy for the metadata repository. This section describes various options for metadata modelling and provides a comparative analysis of the options.

Metadata Refresh

Metadata Refresh defines the process and frequency for capturing and updating the metadata on an on-going basis. The process and frequency of Metadata refresh vary based on the type of Metadata and the environment for which Metadata is being captured and refreshed.

The table below provides a consolidated view of the Metadata refresh strategy for each of the environments

Type of Metadata    Description

Business Metadata
o Metadata is “created”
o Initial Metadata is captured during the Design Phase
o Metadata needs to be updated whenever there is a change to source data feeds or target structures, enforced as part of the code release process

Technical Metadata
o Metadata is “created”
o Metadata that needs to be captured manually is created during the Design Phase
o Metadata captured using an automated process is initially created during the Development Phase and certified before code release
o Metadata needs to be updated whenever there is a change to source data feeds or target structures, enforced as part of the code release process

Operational Metadata
o Data Movement related Operational Metadata is captured using an event-driven approach, but on an ad-hoc basis
o Data Usage related Operational Metadata can be captured on a need basis (optional)

Business Rules and Transformation Rules
o Rules related Metadata should be “created”
o Initial metadata should be created after the Technical Metadata is sourced into the repository
o Metadata should be updated on a continuous basis, as and when there is a need for change, using the custom manual approach defined

System Statistics
o Captured using an automated process on a need basis
o Needs to be captured and maintained on a regular basis only if required (for a usage-based charge-back mechanism, for example)

Metadata for Downstream Processes / Applications
o For any downstream applications designed, metadata should be “created” in the environment
o Metadata should be captured during the Design Phase

Metadata Visibility

Visibility of, or access to, the Metadata captured for the Data Warehouse should be enabled only through a standard intranet portal. The portal should provide the following functionality:

Provide a layer of abstraction for the metadata capture, integration and storage aspects

Ability to authenticate users accessing the portal

o It is assumed that there is no need for user authorisation (data security)

Ability to search the metadata captured, using any of the use-cases identified

o Provide a layer of abstraction between the User Interface and the underlying data elements on which the search operation is performed. For example, a basic search on the UI for a table name could search the table technical name, table business name, table business description and the source data file name.

o Provide the ability to perform advanced search using a combination of search criteria. For example, search for a given table name within a subject area for a given Market.

o Pagination of the search results for better readability

o Ability to sort the search results on predefined criteria, including search relevance (this use case may need further discussion and elaboration)

o Ability to export the search results to Excel for offline analysis

Ability to establish data lineage for data entities and elements within the Data Warehouse

o Should support bi-directional lineage analysis

o Completeness and quality of data lineage information will depend on the accuracy and completeness of the metadata captured, whether through the automated process or through the manual capture process

Ability to generate and view standard operational reports
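The basic search, advanced search, sorting and pagination behaviour listed above can be sketched as follows. The catalogue field names and the relevance measure (the number of fields in which the term appears) are assumptions for illustration only; the whitepaper itself leaves the relevance use case open for further discussion.

```python
# Fields a "table name" search fans out to, hidden behind one search box.
SEARCH_FIELDS = ("technical_name", "business_name", "business_description", "source_file")


def basic_search(catalog, term, page=1, page_size=10):
    """Search all SEARCH_FIELDS, sort by relevance, return one page of results."""
    term = term.lower()
    scored = []
    for entry in catalog:
        # Assumed relevance: how many of the searched fields contain the term.
        hits = sum(term in entry.get(f, "").lower() for f in SEARCH_FIELDS)
        if hits:
            scored.append((hits, entry))
    scored.sort(key=lambda pair: (-pair[0], pair[1]["technical_name"]))
    start = (page - 1) * page_size
    return [entry for _, entry in scored[start:start + page_size]]


def advanced_search(catalog, term, **criteria):
    """Combine the text search with exact-match filters such as subject area or market."""
    filtered = [e for e in catalog if all(e.get(k) == v for k, v in criteria.items())]
    return basic_search(filtered, term, page_size=len(filtered) or 1)


catalog = [
    {"technical_name": "CUST_DIM", "business_name": "Customer",
     "business_description": "Customer master", "source_file": "customer.csv",
     "subject_area": "Sales", "market": "AU"},
    {"technical_name": "ORD_FACT", "business_name": "Orders",
     "business_description": "Orders placed by a customer", "source_file": "orders.csv",
     "subject_area": "Sales", "market": "AU"},
    {"technical_name": "CUST_ADDR", "business_name": "Customer Address",
     "business_description": "Postal addresses", "source_file": "addr.csv",
     "subject_area": "Sales", "market": "NZ"},
]
results = basic_search(catalog, "customer")
```

The user types a single term and never needs to know which underlying columns are searched, which is the layer of abstraction between the UI and the data elements described above.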

Following are the general guidelines with respect to Metadata Visibility:

End users of metadata (data analysts, for example) should never be provided direct access to the metadata repository, whether database tables or the Excel files within the Data Warehouse

Only system administrators and technical SMEs for the Data Warehouse may have direct access to the metadata repository, including the physical storage

Access to metadata environments should be enabled through separate user interfaces (separate portals, sub-sites etc.)

User Groups and Associated Usage

This section captures the details of the target user groups who would need access to the portal and their associated usage of the portal in each of the environments.


Metadata Analysis & Usage

The Metadata Repository portal supports the following types of analysis and usage of the metadata captured.

Lineage Analysis

Lineage analysis is one of the key requirements for the proposed Metadata Management solution. The metadata captured should support the following types of lineage analysis:

For structured data sources extracted from Source, the metadata in the Metadata repository should support bi-directional lineage analysis from the tables in Source / Warehouse to the Data Warehouse or any downstream applications of the Data Warehouse

o The metadata should support lineage analysis at table and column level

o For each of the tables / files from Source, the System of Record information for the original source feed may be made available as additional information. However, the lineage from the original source data feed to the Source files / tables will be out of scope for lineage analysis

o The completeness of lineage metadata will depend on the process implemented for capturing the metadata for downstream processes / applications

For semi-structured or unstructured data sources, the metadata captured should support lineage analysis as follows:

o Bi-directional lineage analysis at object level (web files, video files etc.)

o For data sources like IVR, where each transaction can potentially contain an audio file, lineage analysis should capture the linkage of audio files to the transaction and the source feed

o For structure metadata captured as part of unstructured data sources, the metadata should support lineage analysis at column (data element) level
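Bi-directional lineage of the kind described above is commonly modelled as a directed graph that can be walked in either direction. The sketch below assumes column-level nodes named as `layer.table.column` strings; the naming scheme and the `LineageGraph` class are illustrative and not part of any specific tool.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Stores source-to-target edges and answers lineage queries both ways."""

    def __init__(self):
        self._down = defaultdict(set)  # source -> targets
        self._up = defaultdict(set)    # target -> sources

    def add_edge(self, source, target):
        self._down[source].add(target)
        self._up[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal collecting every reachable node.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen

    def downstream(self, node):
        """Impact analysis: everything fed by this element."""
        return self._walk(node, self._down)

    def upstream(self, node):
        """Provenance: everything this element is derived from."""
        return self._walk(node, self._up)


lineage = LineageGraph()
lineage.add_edge("source.customer.email", "warehouse.cust_dim.email")
lineage.add_edge("warehouse.cust_dim.email", "mart.campaign.contact_email")
lineage.add_edge("warehouse.cust_dim.email", "report.churn.email")
```

Table-level lineage can be derived from the same edges by dropping the column part of each node name, and object-level lineage for files (for example an IVR audio file linked to its transaction) fits the same structure with file identifiers as nodes.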

Data Usage Analysis

Data usage analysis primarily provides the ability to track what data within the Warehouse is being used, the frequency of usage and the access log of end-users accessing the data. It helps in identifying how frequently data elements are accessed, improving the data modelling and restructuring the data to provide easier and quicker access to end-users.

Data usage analysis requires the Data Usage related operational metadata to be captured as part of the metadata management process. Some of this operational metadata for structured data can be captured through automated processes, either from the system logs or from system tables. However, for semi-structured or unstructured data, capturing operational metadata may require some level of tracking at the operating system level, and is subject to feasibility, specific use case requirements and the decision to implement tracking of user activity at such a detailed level.

BI Usage Analysis

Operational Metadata required for supporting BI usage analysis will be primarily sourced from the application metadata repositories. BI usage analysis helps to understand user behaviour on BI tools and applications, thus identifying potential opportunities for redesign and / or optimisation.

Following are some examples of analyses typically performed on BI Usage:

Number of users executing reports on a daily / weekly basis

Average number of reports executed on a daily / weekly basis

Number of times a report is run in the last x days
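The three example analyses above could be computed from a simple execution log, as sketched below. The log format (user, report, date) is an assumption for illustration, and the daily average is taken over days that had at least one execution, which is only one possible definition.

```python
from collections import defaultdict
from datetime import date, timedelta


def users_per_day(executions):
    """Number of distinct users executing reports on each day."""
    users = defaultdict(set)
    for user, _report, day in executions:
        users[day].add(user)
    return {day: len(names) for day, names in users.items()}


def average_reports_per_day(executions):
    """Average executions per day, over days with at least one execution."""
    days = {day for _user, _report, day in executions}
    return len(executions) / len(days) if days else 0.0


def runs_in_last_days(executions, report, x, today):
    """Number of times a report was run in the last x days, including today."""
    cutoff = today - timedelta(days=x)
    return sum(
        1 for _user, name, day in executions
        if name == report and cutoff < day <= today
    )


# Hypothetical execution log extracted from a BI tool's repository.
executions = [
    ("ann", "sales", date(2024, 1, 1)),
    ("bob", "sales", date(2024, 1, 1)),
    ("ann", "stock", date(2024, 1, 2)),
    ("ann", "sales", date(2024, 1, 3)),
]
daily_users = users_per_day(executions)
```

Weekly figures follow the same pattern by grouping on the ISO week of each date instead of the date itself.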

Audit Analysis

Audit analysis requires Operational Metadata to be captured for the data integration and load processes. Audit analysis primarily helps to understand the effectiveness of the data movement and data loading processes, and helps to identify potential opportunities for redesign and / or optimisation.

Examples of audit analysis reports are as follows:

Average execution times for batch processes, by subject area

Long running jobs at potential risk of missing data loading SLAs (for proactive tuning)

Jobs exceeding the average execution times on a daily / weekly basis

Average number of errors or exceptions on a periodic basis

Frequently occurring errors or exceptions by Source Feed or Subject Area
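Two of the reports above, average execution time by subject area and jobs exceeding their average, can be derived from job run records as in the sketch below. The record layout and the `factor` threshold are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def average_by(runs, key):
    """Average execution time in seconds, grouped by the given key."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run["seconds"])
    return {name: mean(values) for name, values in groups.items()}


def runs_exceeding_average(runs, factor=1.0):
    """Runs that took longer than factor times their own job's average."""
    averages = average_by(runs, "job")
    return [r for r in runs if r["seconds"] > factor * averages[r["job"]]]


# Hypothetical run records taken from the Operational Metadata repository.
runs = [
    {"job": "load_customer", "subject_area": "Sales", "seconds": 100},
    {"job": "load_customer", "subject_area": "Sales", "seconds": 140},
    {"job": "load_customer", "subject_area": "Sales", "seconds": 360},
    {"job": "load_gl", "subject_area": "Finance", "seconds": 50},
]
by_subject_area = average_by(runs, "subject_area")
slow_runs = runs_exceeding_average(runs, factor=1.5)
```

Comparing each run against its own job's history, rather than a global average, avoids flagging jobs that are simply large by design.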

Metadata Maintenance and Retirement

The Metadata Maintenance and Retirement process will be closely related to, and dependent on, the Governance and Planning for Metadata. For the Warehouse, the Metadata Maintenance and Retirement strategy needs to cater to the differences in target audience, data movement strategy and data retention strategy for each of these environments.

Following are the general guidelines for Metadata Maintenance and Retirement:

Metadata will be captured only for the ‘Shared’ Area

No metadata will be captured or maintained for user specific directories (‘Private’ Area)

Capture of, and updates to, any metadata captured using a manual or custom process need to be enforced as part of the code release checklist, and the metadata should be up to date at any given time

Technical metadata captured using an automated process should also be maintained completely and accurately for all objects

The following metadata captured using an automated process may be refreshed on a need basis:

o Operational Metadata

o System Statistics

When data is purged, all metadata associated with that data / data objects should also be purged from the metadata repository
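Where the metadata repository is a relational database, the purge guideline above can be enforced by the schema itself. The sketch below uses SQLite with a cascading foreign key; the table and column names are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE data_object (
        object_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE object_metadata (
        object_id INTEGER NOT NULL
            REFERENCES data_object(object_id) ON DELETE CASCADE,
        kind      TEXT NOT NULL,   -- e.g. business, technical, operational
        detail    TEXT
    );
    """
)


def purge_object(conn, name):
    """Remove a purged data object; its metadata rows are removed with it."""
    conn.execute("DELETE FROM data_object WHERE name = ?", (name,))
    conn.commit()


conn.execute("INSERT INTO data_object (object_id, name) VALUES (1, 'CUST_DIM')")
conn.execute("INSERT INTO data_object (object_id, name) VALUES (2, 'ORD_FACT')")
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?)",
    [(1, "business", "Customer master"),
     (1, "technical", "Parquet, daily load"),
     (2, "business", "Order facts")],
)
conn.commit()
purge_object(conn, "CUST_DIM")
```

After the purge, only the metadata for `ORD_FACT` remains. Metadata kept in Excel files falls outside this mechanism and would need a separate clean-up step.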