Deploying a Governed Data Lake

Alex Gorelik, Founder and CEO, Waterline Data


Everyone needs data to make better decisions


A data lake

http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml

“Size and low cost”

“Fidelity: Hadoop data lakes preserve data in its original form”

“Ease of accessibility: Accessibility is easy in the data lake”

“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”

“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”


Data warehouse vs. data lake

Data Warehouse

• Production system

• Well-defined usage

• Well-defined schema

• Clean, trusted data

• Heavy IT reliance
  – Less technical analysts
  – Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards

Data Lake

• Non-production system

• Future, experimental usage

• No schema (schema on read)

• Raw data, frictionless ingestion

• Self-service
  – More technical analysts
  – IT manages the cluster and ingestion, but no IT involvement when working with data


Hadoop as the platform for a scalable data lake infrastructure

• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)

• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)

• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)


Is Hadoop enough?

Big Data Architect: “We have Hadoop, now what?”

Hadoop: 10-20 nodes


Big Data Architect: “How do I get the business to start using it?”

Data Scientist/Business Analyst

Hadoop: 10-20 nodes


Big Data Architect: “How do I get the business to start using it?”

Analyst: “How do I find and understand data easily to do big data analytics?”

Self-service

Hadoop: 10-20 nodes


Big Data Architect

Analyst

Risk/Data Governance Executive: “How do I ensure compliance with regulations and data policies? Sensitive data?”

No security and governance

Hadoop: 10-20 nodes


Big Data Architect: “How do I scale?”

Analysts

Manual process to catalog the lake can’t scale

Hadoop: 100s/1000s of nodes


The platform for a scalable data lake infrastructure

• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)

• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)

• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)

• Self-service to help users find, understand, and use the data (X Hadoop)

• Governance to protect sensitive data, document lineage, and assess quality (X Hadoop)


Waterline Data Inventory broadens Hadoop adoption through governed self-service

Big Data Architect: Massive scale

Analyst: Self-service

Risk/Data Governance Executive: Security and governance

Hadoop: 100s/1000s of nodes


3-phase approach to a governed data lake

Organize the lake

Inventory the lake

Open up the lake


Organize the lake into zones


Establish access control per zone

Zones: Sensitive, Landing, Gold, Work

• Business Analysts, Data Scientists

• Data Scientists, Data Engineers

• Data Scientists, Data Engineers

• Data Stewards
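
One way to picture per-zone access control is a lookup from each zone to the roles allowed to read it. The sketch below is illustrative only: the pairing of zones with roles, the role names, and the `can_access` helper are assumptions made for the example, not the configuration shown on the slide. In a real cluster this would more likely be enforced with HDFS permissions or a policy tool than with application code.

```python
# Hypothetical zone-to-role mapping. The slide names the zones and the user
# groups, but this particular pairing is an assumption for illustration.
ZONE_ROLES = {
    "landing": {"data_scientist", "data_engineer"},
    "work": {"data_scientist", "data_engineer"},
    "gold": {"business_analyst", "data_scientist"},
    "sensitive": {"data_steward"},
}


def can_access(role: str, zone: str) -> bool:
    """Return True if the role is allowed to read data in the zone.

    Unknown zones grant access to nobody.
    """
    return role in ZONE_ROLES.get(zone, set())


if __name__ == "__main__":
    for role, zone in [("business_analyst", "gold"),
                       ("business_analyst", "sensitive")]:
        print(role, zone, can_access(role, zone))
```

Denying access by default for unknown zones keeps a newly added zone closed until someone explicitly assigns roles to it.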


The governed data lake

[Diagram: the Data Scientist/Business Analyst (find/understand), Data Steward (govern), and Big Data Architect (governed data layer) work through Waterline Data Inventory, which provides Self-Service, Governance, and Inventory on top of HDFS and Hive.]


Inventory the lake

Profile and discover the content of files and Hive tables

[Diagram: the governed data lake. The Data Scientist/Business Analyst (find/understand), Data Steward (govern), and Big Data Architect sit above the governed data layer, whose layers are labeled Self-Service Catalog/Provisioning, Metadata Curation, and Inventory, over HDFS and Hive.]


Inventory

Parse multiple content types

Create catalog automatically

Discover lineage automatically


Govern the lake

Governance

• Inspect files and perform tag curation

• Identify sensitive data

• Assess data quality

• Discover data lineage

• Manage glossary


Navigate Lineage of Files in Hadoop

Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs


Automated Data Profiling Helps with Quality Assessment

Infographic shows contents at a glance:

• Different types of data in the same field

• Number of missing values

Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
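
The measures named above can be sketched in a few lines. This is a minimal illustration, not Waterline Data Inventory's profiler: the function name `profile_field`, the two-way number/string classification, and the definition of selectivity as unique values divided by non-missing values are all assumptions made for the example.

```python
import statistics


def profile_field(values):
    """Profile one field: missing count plus a separate profile per data type.

    `values` is a list of raw strings (or None) read from a single field.
    """
    missing = sum(1 for v in values if v is None or v == "")
    present = [v for v in values if v is not None and v != ""]

    def kind(value):
        # Classify a raw value as "number" if it parses as a float.
        try:
            float(value)
            return "number"
        except (TypeError, ValueError):
            return "string"

    by_type = {}
    for v in present:
        by_type.setdefault(kind(v), []).append(v)

    profiles = {}
    for type_name, vals in by_type.items():
        cardinality = len(set(vals))  # distinct raw values
        profile = {
            "count": len(vals),
            "cardinality": cardinality,
            "selectivity": cardinality / len(vals),  # assumed definition
        }
        if type_name == "number":
            nums = [float(v) for v in vals]
            profile["mean"] = statistics.mean(nums)
            profile["std_dev"] = statistics.pstdev(nums)  # population std dev
        profiles[type_name] = profile

    return {"missing": missing, "types": profiles}
```

Calling `profile_field(["1", "2", "3", "", None, "abc", "abc"])` reports two missing values and two profiles, one for numbers and one for strings, which is the "different types of data in the same field" case from the slide.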


Data Preview and Visualization Helps Understand the Data

Visualization helps understand the shape and distribution of data

Most frequent values for each field


Discover Sensitive Data


Find all fields that may have SSN
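
A simple way to flag candidate SSN fields is to test what fraction of a field's values match the usual NNN-NN-NNNN layout. The sketch below assumes that approach; the function names, the 0.8 threshold, and the dictionary-of-lists table shape are hypothetical, and the product's actual discovery method is not described in this deck. A format match only suggests a field may hold SSNs, which is why a curation step follows.

```python
import re

# Matches the common NNN-NN-NNNN layout only. This is a format check,
# not validation that the number was ever issued.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def ssn_match_ratio(values):
    """Return the fraction of non-empty values that look like an SSN."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return 0.0
    hits = sum(1 for v in present if SSN_PATTERN.fullmatch(v))
    return hits / len(present)


def fields_that_may_have_ssn(table, threshold=0.8):
    """Return names of fields whose values mostly match the SSN pattern.

    `table` maps a field name to a list of string values. The threshold
    is an arbitrary illustrative default.
    """
    return [name for name, values in table.items()
            if ssn_match_ratio(values) >= threshold]
```

For example, a table with an `id` field holding `123-45-6789` style values and a `phone` field holding `650-555-0100` style values would return only `id`.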


Curate Discovered Sensitive Data Fields

Curate the field and accept or reject the tag


Manage Glossary

Import or create a business glossary

Manage tags


View and search history


Data Inventory tracks all user tagging, schema changes, and lineage changes in Audit History


Open up the lake

Explore catalog and provision data securely


Find and Understand

Automatically propagate user-defined tags (crowdsource ontology)

Discover meaning of fields and tag automatically

Multi-faceted drill down

Automated facet creation based on metadata

Business metadata-based search


Annotate fields, files and folders with tags

• Analysts can tag fields and files with meaningful business tags

• Type-ahead shows existing available tags that match the typed string

• Users can choose one or create a new tag

• A period in a tag name automatically creates a tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
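
The period convention can be sketched as a small parsing step. The helper names `split_tag` and `build_hierarchy` are hypothetical, and treating everything before the last period as the category is an assumption; the deck only shows the single-period case.

```python
def split_tag(tag_name):
    """Split 'Category.Tag' into (category, tag).

    The category is None when the name contains no period. With several
    periods, everything before the last one is treated as the category
    (an assumption; only the single-period case appears in the deck).
    """
    category, separator, tag = tag_name.rpartition(".")
    return (category if separator else None, tag)


def build_hierarchy(tag_names):
    """Group tags by category, preserving the order in which they appear."""
    hierarchy = {}
    for name in tag_names:
        category, tag = split_tag(name)
        hierarchy.setdefault(category, []).append(tag)
    return hierarchy
```

With this sketch, `split_tag("Restaurant.Name")` yields the category `Restaurant` and the tag `Name`, matching the example on the slide.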


Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.

User-assigned tags are solid blue

Automatically suggested tags are faded blue with confidence level

Delimited files don’t have field names

Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields


Create Hive tables


Generate Hive Tables
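
To give a sense of what generating a Hive table over a cataloged file involves, the sketch below builds a `CREATE EXTERNAL TABLE` statement from a list of field names and types. It is not the product's output: the function name `hive_create_table`, the example table, and the path are hypothetical, and the inputs are assumed to be trusted, since no escaping is performed.

```python
def hive_create_table(table, columns, location, delimiter=","):
    """Build HiveQL DDL for an external table over a delimited text file.

    `columns` is a list of (field_name, hive_type) pairs, for example the
    field names supplied by tags and the types found by profiling.
    Inputs are assumed to be trusted; no escaping is performed.
    """
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"{column_lines}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table name, columns, and HDFS path.
    print(hive_create_table(
        "restaurants",
        [("name", "STRING"), ("rating", "DOUBLE")],
        "/data/landing/restaurants",
    ))
```

An external table leaves the underlying file in place, which fits the idea of keeping raw data in the lake and applying structure on read.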


https://youtu.be/jQ-i83BCtKI


Company overview

• Headquartered in Mountain View, CA

• Funded in 2013 by Menlo Ventures and Sigma West

• Management Team:
  – Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
  – Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
  – Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.
  – Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)

WATERLINE DATA NAMED COOL VENDOR

Gartner, Cool Vendors in Information Governance and MDM, 2015

Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White

Visit our exhibit in the ballroom to get more information
