Deploying a Governed Data Lake Alex Gorelik Founder and CEO, Waterline Data

Page 1: Deploying a Governed Data Lake

Deploying a Governed Data Lake

Alex Gorelik

Founder and CEO, Waterline Data

Page 2: Deploying a Governed Data Lake

Everyone needs data to make better decisions

Page 3: Deploying a Governed Data Lake

A data lake

http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml

“Size and low cost”

“Fidelity: Hadoop data lakes preserve data in its original form”

“Ease of accessibility: Accessibility is easy in the data lake”

“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”

“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”

Page 4: Deploying a Governed Data Lake

Data warehouse vs. data lake

Data Warehouse

• Production system

• Well-defined usage

• Well-defined schema

• Clean, trusted data

• Heavy IT reliance

– Less technical analysts

– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards

Data Lake

• Non-production system

• Future, experimental usage

• No schema (schema on read)

• Raw data, frictionless ingestion

• Self-service

– More technical analysts

– IT manages the cluster and ingestion, but no IT involvement when working with data

Page 5: Deploying a Governed Data Lake

Hadoop as the platform for a scalable data lake infrastructure

• Lots of data (Volume): cost-effective storage and scalable processing [✔ Hadoop]

• Flexibility to handle all kinds of data (Variety) [✔ Hadoop]

• Will be around for a long time: modularity to ensure future-proofing [✔ Hadoop]

Page 6: Deploying a Governed Data Lake

Is Hadoop enough?

Big Data Architect

Hadoop

We have Hadoop, now what?

10-20 nodes

Page 7: Deploying a Governed Data Lake

Big Data Architect

Hadoop

How do I get the business to start using it?

Data Scientist/Business Analyst

10-20 nodes

Page 8: Deploying a Governed Data Lake

Big Data Architect

Hadoop

How do I get the business to start using it?

Data Scientist/Business Analyst

How do I find and understand data easily to do big data analytics?

Self-service

10-20 nodes

Page 9: Deploying a Governed Data Lake

Big Data Architect

Hadoop

Data Scientist/Business Analyst

No security and governance

10-20 nodes

Risk/Data Governance Executive

How do I ensure compliance with regulations and data policies?

Sensitive data?

Page 10: Deploying a Governed Data Lake

Big Data Architect

Hadoop

How do I scale?

Data Scientist/Business Analysts

100s/1000s of nodes

Manual process to catalog the lake can’t scale

Page 11: Deploying a Governed Data Lake

The platform for a scalable data lake infrastructure

• Lots of data (Volume): cost-effective storage and scalable processing [✔ Hadoop]

• Flexibility to handle all kinds of data (Variety) [✔ Hadoop]

• Will be around for a long time: modularity to ensure future-proofing [✔ Hadoop]

• Self-service to help users find, understand and use the data [X Hadoop]

• Governance to protect sensitive data, document lineage and assess quality [X Hadoop]

Page 12: Deploying a Governed Data Lake

Waterline Data Inventory broadens Hadoop adoption through governed self-service

Big Data Architect

Hadoop

Data Scientist/Business Analyst

100s/1000s of nodes

Risk/Data Governance Executive

Self-service

Security and governance

Massive scale

Page 13: Deploying a Governed Data Lake

3-phase approach to a governed data lake

Organize the lake

Inventory the lake

Open up the lake

Page 14: Deploying a Governed Data Lake

Organize the lake into zones

Organize the lake

Page 15: Deploying a Governed Data Lake

Establish access control per zone

• Business Analysts

• Data Scientists

• Data Scientists

• Data Engineers

• Data Scientists

• Data Engineers

• Data Stewards

Sensitive

Landing

Gold

Work

Organize the lake
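One way to make per-zone access control concrete is a small lookup from zone to permitted roles. The sketch below is illustrative only: the role identifiers, the `ZONE_ROLES` table, and in particular which roles are paired with which zone are assumptions, not a mapping stated on the slide.

```python
# Hypothetical zone-to-role table. The pairing of roles to zones is an
# assumption for illustration; adjust it to your own policy.
ZONE_ROLES = {
    "landing": {"data_engineer", "data_scientist"},
    "work": {"data_engineer", "data_scientist"},
    "gold": {"business_analyst", "data_scientist"},
    "sensitive": {"data_steward"},
}


def can_access(role: str, zone: str) -> bool:
    """Return True if the role is allowed into the zone; unknown zones deny."""
    return role in ZONE_ROLES.get(zone, set())


if __name__ == "__main__":
    print(can_access("data_steward", "sensitive"))
    print(can_access("business_analyst", "sensitive"))
```

In a real cluster the same table would typically be enforced through directory permissions or a policy engine rather than application code.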

Page 16: Deploying a Governed Data Lake

The governed data lake

Data Scientist/Business Analyst Data Steward Big Data Architect

HDFS Hive

Waterline Data Inventory

Find/understand Govern Governed data layer

Governance

Inventory

Self-Service

Page 17: Deploying a Governed Data Lake

Metadata Curation

Self-Service Catalog/Provisioning

Big Data Architect

Find/understand

Governed data layer

Data Scientist/Business Analyst

The governed data lake

Data Steward

HDFS Hive

Waterline Data Inventory

Govern

Inventory

Inventory the lake

Profile and discover the content of files and Hive tables

Page 18: Deploying a Governed Data Lake

Inventory

Parse multiple content types

Create catalog automatically

Discover lineage automatically

Page 19: Deploying a Governed Data Lake

Self-Service Catalog/Provisioning

Big Data Architect

Find/understand

Governed data layer

Data Scientist/Business Analyst

The governed data lake

Data Steward

HDFS Hive

Waterline Data Inventory

Govern

Inventory

Govern the lake

Governance

• Inspect files and perform tag curation

• Identify sensitive data

• Assess data quality

• Discover data lineage

• Manage glossary

Page 20: Deploying a Governed Data Lake

Navigate Lineage of Files in Hadoop

Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs

Page 21: Deploying a Governed Data Lake

Automated Data Profiling Helps with Quality Assessment

Infographic shows contents at a glance:

• Different types of data in the same field

• Number of missing values

Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
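The measures named above can be sketched in a few lines. This is a minimal illustration, not the product's profiler: it assumes field values arrive as strings, distinguishes only two types (number vs. string), and uses the population standard deviation. The function names are hypothetical.

```python
from collections import defaultdict
import statistics


def infer_type(raw: str) -> str:
    """Classify a value as 'number' if float() accepts it, else 'string'."""
    try:
        float(raw)
        return "number"
    except ValueError:
        return "string"


def profile_field(values):
    """Count missing values and build one profile per detected data type."""
    missing = 0
    by_type = defaultdict(list)
    for v in values:
        if v is None or v.strip() == "":
            missing += 1
        else:
            by_type[infer_type(v.strip())].append(v.strip())

    profiles = {}
    for type_name, vals in by_type.items():
        unique = len(set(vals))
        profile = {
            "count": len(vals),
            "cardinality": unique,               # number of unique values
            "selectivity": unique / len(vals),   # uniqueness ratio
        }
        if type_name == "number":
            nums = [float(x) for x in vals]
            profile["mean"] = statistics.mean(nums)
            profile["stdev"] = statistics.pstdev(nums)
        profiles[type_name] = profile
    return {"missing": missing, "profiles": profiles}


if __name__ == "__main__":
    print(profile_field(["1", "2", "2", "", None, "abc"]))
```

A field that yields more than one entry under `profiles` is exactly the "different types of data in the same field" case.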

Page 22: Deploying a Governed Data Lake

Data Preview and Visualization Helps Understand the Data

Visualization helps understand the shape and distribution of data

Most frequent values for each field

Page 23: Deploying a Governed Data Lake

Discover Sensitive Data

Screen shot

Find all fields that may have SSN
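A simple content-based check for SSN-like fields might look like the sketch below. It is an assumption-laden illustration, not the product's detection logic: it only recognizes the dashed `NNN-NN-NNNN` layout, and the 0.8 threshold and function names are hypothetical.

```python
import re

# Dashed SSN layout only; undashed or masked forms are not covered here.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def ssn_match_ratio(values):
    """Fraction of non-empty values that fully match the SSN layout."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_PATTERN.fullmatch(v))
    return hits / len(non_empty)


def fields_that_may_have_ssn(table, threshold=0.8):
    """Return field names whose match ratio reaches the (assumed) threshold."""
    return [
        name for name, values in table.items()
        if ssn_match_ratio(values) >= threshold
    ]


if __name__ == "__main__":
    sample = {
        "id": ["1", "2"],
        "col_7": ["123-45-6789", "987-65-4321", ""],
        "phone": ["555-123-4567"],
    }
    print(fields_that_may_have_ssn(sample))
```

Because the check looks at values rather than field names, it also works on delimited files that carry no header row.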

Page 24: Deploying a Governed Data Lake

Curate Discovered Sensitive Data Fields

Curate the field and accept or reject the tag

Page 25: Deploying a Governed Data Lake

Manage Glossary

Import or create a business glossary

Manage tags

Page 26: Deploying a Governed Data Lake

View and search history

Screenshot of history tab

Another screenshot of searching history (made up)

Data Inventory keeps track of all user tagging, schema changes, lineage changes in Audit History

Page 27: Deploying a Governed Data Lake

Data Steward

Govern

Big Data Architect

Governed data layer

Open up the data lake

HDFS Hive

Waterline Data Inventory

Inventory

Governance

Self-Service

Find/understand

Data Scientist/Business Analyst

Explore catalog and provision data securely

Open up the lake

Page 28: Deploying a Governed Data Lake

Find and Understand

Automatically propagate user-defined tags (crowdsource ontology)

Discover meaning of fields and tag automatically

Multi-faceted drill down

Automated facet creation based on metadata

Business metadata-based search

Page 29: Deploying a Governed Data Lake

Annotate fields, files and folders with tags

• Analysts can tag fields and files with meaningful business tags

• Type-ahead shows existing available tags that match the typed string

• Users can choose one or create a new tag

• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
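The period convention can be illustrated with a short parser. This is a sketch under one assumption the slide does not settle: when a name contains several periods, everything before the last period is treated as the category. The function names are hypothetical.

```python
def split_tag(name):
    """Split 'Category.Tag' into (category, tag); no period means no category."""
    category, _, tag = name.rpartition(".")
    return (category or None, tag)


def build_hierarchy(tag_names):
    """Group tags under their category, preserving first-seen order."""
    hierarchy = {}
    for name in tag_names:
        category, tag = split_tag(name)
        hierarchy.setdefault(category, []).append(tag)
    return hierarchy


if __name__ == "__main__":
    print(split_tag("Restaurant.Name"))
    print(build_hierarchy(["Restaurant.Name", "Restaurant.Phone", "City"]))
```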

Page 30: Deploying a Governed Data Lake

Based on a single field in one file tagged as Restaurant.Name, Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.

User-assigned tags are solid blue

Automatically suggested tags are faded blue with confidence level

Delimited files don’t have field names

Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
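The idea of suggesting tags for similar fields, each with a confidence level, can be sketched with a plain value-overlap score. This is not Waterline's algorithm, which the slides do not describe; Jaccard similarity, the 0.5 cutoff, and all names below are stand-ins chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_tags(tagged, untagged, min_confidence=0.5):
    """For each untagged field, suggest the best-matching tag with its score.

    tagged:   {tag name: sample values from a field an analyst tagged}
    untagged: {field name: values from a field with no tag yet}
    """
    suggestions = []
    for field, values in untagged.items():
        best_tag, best_score = None, 0.0
        for tag, reference_values in tagged.items():
            score = jaccard(values, reference_values)
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= min_confidence:
            suggestions.append((field, best_tag, best_score))
    return suggestions


if __name__ == "__main__":
    tagged = {"Restaurant.Name": ["Chez Panisse", "Nopa", "Zuni Cafe"]}
    untagged = {
        "col_3": ["Nopa", "Zuni Cafe", "Chez Panisse", "Delfina"],
        "col_4": ["94110", "94103"],
    }
    print(suggest_tags(tagged, untagged))
```

The score plays the role of the confidence level shown next to a suggested tag; a steward can then accept or reject each suggestion.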

Page 31: Deploying a Governed Data Lake

Create Hive tables

Screen shot of file with “Generate Hive Table” option selected- Replace Hive with Drill

Generate Hive Tables
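Generating a Hive table over a delimited file comes down to emitting a `CREATE EXTERNAL TABLE` statement. The sketch below builds that HiveQL text in Python; the function name, the column list, and the HDFS path are hypothetical, and it assumes a single printable delimiter character and trusted identifiers.

```python
def hive_ddl(table, columns, location, delimiter=","):
    """Build HiveQL for an external table over a delimited text file.

    columns is a list of (name, hive_type) pairs, e.g. ("zip", "INT").
    """
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table, columns, and path, for illustration only.
    print(hive_ddl(
        "restaurants",
        [("name", "STRING"), ("zip", "INT")],
        "/lake/gold/restaurants",
    ))
```

An external table leaves the underlying file in place, so the raw data stays untouched while analysts query it through Hive.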

Page 33: Deploying a Governed Data Lake

Company overview

• Headquartered in Mountain View, CA

• Funded in 2013 by Menlo Ventures and Sigma West

• Management Team:

– Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.

– Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.

– Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.

– Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)

WATERLINE DATA NAMED COOL VENDOR

Gartner, Cool Vendors in Information Governance and MDM, 2015

Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White

Page 34: Deploying a Governed Data Lake

Visit our exhibit in the ballroom to get more information