Deploying a Governed Data Lake

Alex Gorelik

Founder and CEO, Waterline Data

Everyone needs data to make better decisions

A data lake

http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml

“Size and low cost”

“Fidelity: Hadoop data lakes preserve data in its original form”

“Ease of accessibility: Accessibility is easy in the data lake”

“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”

“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”

Data warehouse vs. data lake

Data Warehouse

• Production system

• Well-defined usage

• Well-defined schema

• Clean, trusted data

• Heavy IT reliance

– Less technical analysts

– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards

Data Lake

• Non-production system

• Future, experimental usage

• No schema (schema on read)

• Raw data, frictionless ingestion

• Self-service

– More technical analysts

– IT manages the cluster and ingestion, but no IT involvement when working with data
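The "schema on read" bullet above can be illustrated with a minimal sketch. It assumes a small comma-delimited file; the sample data, the `read_with_schema` helper, and the field names are hypothetical and only show the idea that raw data is stored untouched and each reader binds its own schema later.

```python
import csv
import io

# Raw data lands in the lake untouched; no schema is enforced at write time.
RAW = "1,Alice,34.5\n2,Bob,not_a_number\n"

def read_with_schema(raw_text, schema):
    """Apply (name, caster) pairs to each raw row at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # value does not fit this reader's schema
        rows.append(row)
    return rows

# Two consumers bind different schemas to the same raw text.
typed = read_with_schema(RAW, [("id", int), ("name", str), ("score", float)])
as_text = read_with_schema(RAW, [("id", str), ("name", str), ("score", str)])
```

The malformed score is not rejected at ingestion; it only surfaces as a missing value for the reader that asks for a float.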

Hadoop as the platform for a scalable data lake infrastructure

✔ Hadoop

• Lots of data (Volume): cost-effective storage and scalable processing

• Flexibility to handle all kinds of data (Variety)

• Will be around for a long time: modularity to ensure future-proofing

Is Hadoop enough?

Big Data Architect

Hadoop

We have Hadoop, now

10-20 nodes

Big Data Architect

Hadoop

How do I get the business to start using it?

Data Scientist/Business Analyst

10-20 nodes

Big Data Architect

Hadoop

How do I get the business to start using it?

Analyst

How do I find and understand data easily to do big data analytics?

Self-service

10-20 nodes

Big Data Architect

Hadoop

Analyst

No security and governance

10-20 nodes

Risk/Data Governance Executive

How do I ensure compliance with regulations and data policies?

Sensitive data?

Big Data Architect

Hadoop

How do I scale?

Analysts

100s/1000s of nodes

Manual process to catalog the lake can’t

• Lots of data (Volume): cost-effective storage and scalable processing

• Flexibility to handle all kinds of data (Variety)

• Will be around for a long time: modularity to ensure future-proofing

• Self-service to help users find, understand and use the data

• Governance to protect sensitive data, document lineage and assess quality

The platform for a scalable data lake infrastructure

✔ Hadoop

X Hadoop

Waterline Data Inventory broadens Hadoop adoption through governed self-service

Big Data Architect

Hadoop

Analyst

100s/1000s of nodes

Risk/Data Governance Executive

Self-service

Security and governance

Massive scale

3-phase approach to a governed data lake

Organize the lake

Inventory the lake

Open up the lake

Organize the lake into zones

Organize the lake

Establish access control per zone

• Business Analysts • Data Scientists

• Data Scientists • Data Engineers

• Data Stewards

Sensitive

Landing

Gold

Work
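Per-zone access control could be sketched as follows. The zone names and roles come from the slide, but the pairing of roles to zones, the `/lake/<zone>/...` path layout, and the helper names are assumptions made for illustration; the slide does not state them unambiguously.

```python
# Hypothetical zone -> roles mapping. The pairing is an assumption, and the
# slide names no role for the Landing zone at all.
ZONE_ROLES = {
    "landing": {"data_engineer"},
    "sensitive": {"data_steward"},
    "work": {"data_scientist", "data_engineer"},
    "gold": {"business_analyst", "data_scientist"},
}

def zone_of(path):
    """Return the zone for a path laid out as /lake/<zone>/..., or None."""
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "lake" and parts[1] in ZONE_ROLES:
        return parts[1]
    return None

def can_read(role, path):
    """Allow access only when the path is in a known zone open to the role."""
    zone = zone_of(path)
    return zone is not None and role in ZONE_ROLES[zone]
```

In a real cluster this policy would be enforced by the platform's own permission mechanisms rather than in application code; the sketch only shows the shape of a zone-based rule.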

Organize the lake

The governed data lake

Data Scientist/Business Analyst

Data Steward

Big Data Architect

HDFS Hive

Waterline Data Inventory

Find/understand

Govern

Governed data layer

Governance

Inventory

Self-Service

Metadata Curation

Self-Service Catalog/Provisioning

Big Data Architect

Find/understand

Governed data layer

Data Scientist/Business Analyst

The governed data lake

Data Steward

HDFS Hive

Govern

Inventory

Inventory the lake

Profile and discover the content of files and Hive tables

Inventory

Parse multiple content types

Create catalog automatically

Discover lineage automatically

Self-Service Catalog/Provisioning

Big Data Architect

Find/understand

Governed data layer

The governed data lake

Data Steward

HDFS Hive

Govern

Inventory

Govern the lake

Governance

• Inspect files and perform tag curation

• Identify sensitive data

• Assess data quality

• Discover data lineage

• Manage glossary

Navigate Lineage of Files in Hadoop

Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs

Automated Data Profiling Helps with Quality Assessment

Infographic shows contents at a glance:

• Different types of data in the same field

• Number of missing values

Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
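The per-type profile described above might be computed along these lines. This is a sketch, not the product's profiler: the two-type classification (number vs. string), the definition of selectivity as unique values divided by value count, and the function names are all assumptions.

```python
import statistics

def classify(value):
    """Very rough type detection: missing, number, or string."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "string"

def profile_field(values):
    """Profile one field: missing count plus a separate profile per data type."""
    by_type = {}
    for v in values:
        by_type.setdefault(classify(v), []).append(v)
    report = {"missing": len(by_type.pop("missing", []))}
    for kind, vals in by_type.items():
        stats = {
            "count": len(vals),
            "cardinality": len(set(vals)),              # number of unique values
            "selectivity": len(set(vals)) / len(vals),  # uniqueness ratio
        }
        if kind == "number":
            nums = [float(v) for v in vals]
            stats["mean"] = statistics.mean(nums)
            stats["stdev"] = statistics.pstdev(nums)    # population std deviation
        report[kind] = stats
    return report

report = profile_field(["2", "4", "", "abc", "abc"])
```

A field that yields both a "number" and a "string" profile is the "different types of data in the same field" case the infographic is meant to expose.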

Data Preview and Visualization Helps Understand the Data

Visualization helps understand the shape and distribution of data

Most frequent values for each field

Discover Sensitive Data

Screen shot

Find all fields that may have SSN
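One simple way to flag fields that may hold SSNs is to match sampled values against a pattern. The sketch below assumes the common NNN-NN-NNNN layout only and an arbitrary 80% threshold; both, along with the function names, are illustrative assumptions and not how Waterline Data Inventory is documented to work.

```python
import re

# Matches the NNN-NN-NNNN layout only; real detection would need more
# formats and validation of the number ranges.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")

def ssn_match_ratio(values):
    """Fraction of non-empty values that look like an SSN."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_RE.fullmatch(v))
    return hits / len(non_empty)

def fields_that_may_have_ssn(table, threshold=0.8):
    """table maps field name -> list of sample values."""
    return sorted(
        name for name, vals in table.items()
        if ssn_match_ratio(vals) >= threshold
    )

sample = {
    "col1": ["123-45-6789", "987-65-4321", ""],
    "col2": ["a", "b"],
    "col3": ["123-45-6789", "x"],
}
flagged = fields_that_may_have_ssn(sample)
```

Fields flagged this way are candidates only, which is why the next step is for a steward to accept or reject each tag.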

Curate Discovered Sensitive Data Fields

Curate the field and accept or reject the tag

Manage Glossary

Import or create a business glossary

Manage tags

View and search history

Screenshot of history tab

Another screenshot of searching history (made up)

Data Inventory keeps track of all user tagging, schema changes, lineage changes in Audit History

Data Steward

Govern

Big Data Architect

Governed data layer

Open up the data lake

HDFS Hive

Waterline Data Inventory

Inventory

Governance

Self-Service

Find/understand

Explore catalog and provision data securely

Open up the lake

Find and Understand

Automatically propagate user-defined tags (crowdsource ontology)

Discover meaning of fields and tag automatically

Multi-faceted drill down

Automated facet creation based on metadata

Business metadata-based search

Annotate fields, files and folders with tags

• Analysts can tag fields and files with meaningful business tags

• Type-ahead shows existing available tags that match the typed string

• Users can choose one or create a new tag

• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
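The period-based hierarchy in the last bullet can be sketched in a few lines. The `add_tag` helper and the dictionary-of-sets glossary are hypothetical, and splitting on the first period is an assumption; the slide does not say how names with several periods are treated.

```python
def add_tag(glossary, tag_name):
    """Register a tag; 'Category.Tag' is split on the first period."""
    category, sep, tag = tag_name.partition(".")
    if not sep:
        # No period: file the tag under the empty (top-level) category.
        category, tag = "", tag_name
    glossary.setdefault(category, set()).add(tag)
    return category, tag

glossary = {}
first = add_tag(glossary, "Restaurant.Name")
add_tag(glossary, "Restaurant.City")
flat = add_tag(glossary, "Priority")
```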

Based on a single field in one file tagged as Restaurant.Name, Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.

User-assigned tags are solid blue

Automatically suggested tags are faded blue with confidence level

Delimited files don’t have field names

Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
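The idea of learning from a manually tagged field and suggesting the same tag elsewhere can be illustrated with a value-overlap heuristic. This is not Waterline Data Inventory's discovery algorithm, which the slides do not describe; the Jaccard similarity measure, the 0.5 threshold, and all names below are assumptions chosen to keep the sketch small.

```python
def jaccard(a, b):
    """Overlap of two value collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def suggest_tags(tagged, untagged, min_confidence=0.5):
    """tagged: tag -> values from a user-tagged field.
    untagged: field -> values. Returns (field, tag, confidence) tuples."""
    suggestions = []
    for field, values in untagged.items():
        for tag, examples in tagged.items():
            confidence = jaccard(examples, values)
            if confidence >= min_confidence:
                suggestions.append((field, tag, round(confidence, 2)))
    # Highest-confidence suggestions first.
    return sorted(suggestions, key=lambda s: -s[2])

tagged = {"Restaurant.Name": ["Chez Panisse", "Nobu", "Zuni Cafe"]}
untagged = {
    "col_3": ["Nobu", "Zuni Cafe", "Chez Panisse", "Spago"],
    "col_7": ["94110", "10001"],
}
suggestions = suggest_tags(tagged, untagged)
```

Because delimited files carry no field names, a content-based signal of this kind is the only evidence available; the confidence value corresponds to what the interface shows next to a suggested tag.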

Create Hive tables

Screen shot of file with “Generate Hive Table” option selected- Replace Hive with Drill

Generate Hive Tables
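Generating a Hive table for a delimited file amounts to emitting a CREATE EXTERNAL TABLE statement over the file's directory. The sketch below builds such a statement as a string; the `hive_ddl` helper, the table name, column list, and path are hypothetical, and the function does not escape its inputs.

```python
def hive_ddl(table, columns, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for a delimited-file directory.

    columns is a list of (name, hive_type) pairs. Inputs are not escaped,
    so this is suitable for trusted values only.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )

ddl = hive_ddl(
    "restaurants",
    [("name", "STRING"), ("zip", "STRING"), ("rating", "DOUBLE")],
    "/lake/gold/restaurants",
)
```

An external table leaves the underlying files in place, so dropping the table removes only the Hive metadata and not the data in the lake.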

Company overview

• Headquartered in Mountain View, CA

• Funded in 2013 by Menlo Ventures and Sigma West

• Management Team:

Alex Gorelik, Founder, CEO

Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.

Oliver Claude, Marketing

VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS

Jason Chen, Engineering

VP Teradata, Acta, Sybase. USC PhD

Ravi Ramachandran,

CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)

WATERLINE DATA NAMED COOL VENDOR

Gartner, Cool Vendors in Information Governance and MDM, 2015

Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White

Visit our exhibit in the ballroom to get more information
