Deploying a Governed Data Lake
Alex Gorelik
Founder and CEO, Waterline Data
Everyone needs data to make better decisions
A data lake
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
“Size and low cost”
“Fidelity: Hadoop data lakes preserve data in its original form”
“Ease of accessibility: Accessibility is easy in the data lake”
“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
Data warehouse vs. data lake
Data Warehouse
• Production system
• Well-defined usage
• Well-defined schema
• Clean, trusted data
• Heavy IT reliance
– Less technical analysts
– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
Data Lake
• Non-production system
• Future, experimental usage
• No schema (schema on read)
• Raw data, frictionless ingestion
• Self-service
– More technical analysts
– IT manages the cluster and ingestion, but no IT involvement when working with data
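The "schema on read" idea in the Data Lake column can be illustrated with a small sketch: the raw text is stored untouched, and column names and types are applied only when someone reads it. The function name `read_with_schema` and the sample data are hypothetical, chosen for illustration only.

```python
import csv
import io

def read_with_schema(raw_text, schema):
    """Parse raw delimited text, applying column names and types at read time.

    schema: list of (column name, cast function) pairs, in column order.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, record)})
    return rows

# The same raw bytes can be read under different schemas for different tasks.
raw = "1,Nopa,4.5\n2,Zuni Cafe,4.0\n"
typed = read_with_schema(raw, [("id", int), ("name", str), ("rating", float)])
as_text = read_with_schema(raw, [("id", str), ("name", str), ("rating", str)])
```

Nothing about the stored data changes between the two reads; only the interpretation does, which is what lets ingestion stay frictionless.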
Hadoop as the platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
Is Hadoop enough?
Big Data Architect
Hadoop
We have Hadoop, now what?
10-20 nodes
Big Data Architect
Hadoop
How do I get the business to start using it?
Data Scientist/Business Analyst
10-20 nodes
Big Data Architect
Hadoop
How do I get the business to start using it?
Data Scientist/Business Analyst
How do I find and understand data easily to do big data analytics?
Self-service
10-20 nodes
Big Data Architect
Hadoop
Data Scientist/Business Analyst
No security and governance
10-20 nodes
Risk/Data Governance Executive
How do I ensure compliance with regulations and data policies?
Sensitive data?
Big Data Architect
Hadoop
How do I scale?
Data Scientist/Business Analysts
100s/1000s of nodes
Manual process to catalog the lake can’t scale
The platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
• Self-service to help users find, understand and use the data (X Hadoop)
• Governance to protect sensitive data, document lineage and assess quality (X Hadoop)
Waterline Data Inventory broadens Hadoop adoption through governed self-service
Big Data Architect
Hadoop
Data Scientist/Business Analyst
100s/1000s of nodes
Risk/Data Governance Executive
Self-service
Security and governance
Massive scale
3-phase approach to a governed data lake
Organize the lake
Inventory the lake
Open up the lake
Organize the lake into zones
Organize the lake
Establish access control per zone
• Business Analysts • Data Scientists
• Data Scientists • Data Engineers
• Data Scientists • Data Engineers
• Data Stewards
Zones: Sensitive, Landing, Gold, Work
Organize the lake
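Access control per zone can be sketched as a simple lookup from zone to permitted roles. This is an illustration only: the pairing of roles to zones below is an assumption (the slide's layout does not survive in this text), and `ZONE_ROLES` and `can_access` are hypothetical names. On a real cluster the policy would be enforced by the platform, for example with HDFS permissions.

```python
# Hypothetical zone-to-role mapping; the actual pairing on the slide is not recoverable.
ZONE_ROLES = {
    "Landing": {"Data Scientist", "Data Engineer"},
    "Work": {"Data Scientist", "Data Engineer"},
    "Gold": {"Business Analyst", "Data Scientist"},
    "Sensitive": {"Data Steward"},
}

def can_access(role: str, zone: str) -> bool:
    """Return True if the role is allowed to work with data in the zone."""
    # Unknown zones grant no access at all.
    return role in ZONE_ROLES.get(zone, set())
```

Keeping the policy at the zone level means a new file inherits its access rules from where it lands, rather than needing a rule of its own.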
The governed data lake
Data Scientist/Business Analyst, Data Steward, Big Data Architect
HDFS, Hive
Waterline Data Inventory
Find/understand, Govern, Governed data layer
Governance
Inventory
Self-Service
[Diagram repeated: the governed data lake (Waterline Data Inventory over HDFS and Hive), with layers labeled Inventory, Metadata Curation and Self-Service Catalog/Provisioning]
Inventory the lake
Profile and discover the content of files and Hive tables
Inventory
Parse multiple content types
Create catalog automatically
Discover lineage automatically
Govern the lake
Governance
• Inspect files and perform tag curation
• Identify sensitive data
• Assess data quality
• Discover data lineage
• Manage glossary
Navigate Lineage of Files in Hadoop
Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
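Navigating lineage amounts to walking a graph of "derived from" links. The sketch below assumes lineage is already available as a plain mapping from each derived file to its sources; how Waterline Data Inventory discovers or stores lineage is not described here, and the function name and paths are hypothetical.

```python
from collections import deque

def upstream(lineage, target):
    """Return every file the target was derived from, nearest first (breadth-first).

    lineage: {derived file: [source files]}
    """
    seen, order = set(), []
    queue = deque(lineage.get(target, []))
    while queue:
        path = queue.popleft()
        if path in seen:
            continue  # already visited through another branch
        seen.add(path)
        order.append(path)
        queue.extend(lineage.get(path, []))
    return order

# Hypothetical example: a Gold file derived from a Work file built from two Landing files.
lineage = {
    "/gold/sales.csv": ["/work/sales_clean.csv"],
    "/work/sales_clean.csv": ["/landing/sales_raw.csv", "/landing/stores.csv"],
}
sources = upstream(lineage, "/gold/sales.csv")
```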
Automated Data Profiling Helps with Quality Assessment
Infographic shows contents at a glance:
• Different types of data in the same field
• Number of missing values
Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
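The measures named above can be computed with a few lines of code. This is a simplified sketch, not the product's profiler: it treats empty strings and `None` as missing, defines selectivity as distinct values divided by non-missing values (an assumption), and reports numeric statistics only for values that parse as numbers. The name `profile_field` is hypothetical.

```python
import statistics

def profile_field(values):
    """Profile one field: missing count, cardinality, selectivity, numeric stats."""
    present = [v for v in values if v not in (None, "")]
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            pass  # non-numeric value; counted as a string below
    unique = set(present)
    profile = {
        "missing": len(values) - len(present),
        "cardinality": len(unique),
        # selectivity: share of non-missing values that are distinct
        "selectivity": len(unique) / len(present) if present else 0.0,
        "numeric_count": len(numbers),
        "string_count": len(present) - len(numbers),
    }
    if numbers:
        profile["mean"] = statistics.mean(numbers)
        profile["stdev"] = statistics.pstdev(numbers)  # population standard deviation
    return profile
```

A field whose `numeric_count` and `string_count` are both non-zero is an instance of "different types of data in the same field".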
Data Preview and Visualization Helps Understand the Data
Visualization helps understand the shape and distribution of data
Most frequent values for each field
Discover Sensitive Data
Screen shot
Find all fields that may have SSN
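One simple way to flag fields that may hold SSNs is a pattern check over each field's values. The sketch below is a format check only (dashed 3-2-4 digits) with a hypothetical 80% threshold; it is not the detection method the product uses, and a match is a candidate for curation rather than a confirmed finding.

```python
import re

# US SSN written as 3-2-4 digits with dashes, e.g. 123-45-6789 (format check only).
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def ssn_match_ratio(values):
    """Fraction of non-empty values that look like a dashed SSN."""
    present = [str(v).strip() for v in values if v not in (None, "")]
    if not present:
        return 0.0
    hits = sum(1 for v in present if SSN_PATTERN.match(v))
    return hits / len(present)

def fields_that_may_have_ssn(table, threshold=0.8):
    """Return names of fields whose values mostly match the SSN pattern.

    table: {field name: list of values}
    """
    return [name for name, values in table.items()
            if ssn_match_ratio(values) >= threshold]
```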
Curate Discovered Sensitive Data Fields
Curate the field and accept or reject the tag
Manage Glossary
Import or create a business glossary
Manage tags
View and search history
Screenshot of history tab
Another screenshot of searching history (made up)
Data Inventory keeps track of all user tagging, schema changes and lineage changes in Audit History
Open up the data lake
Explore catalog and provision data securely
Open up the lake
Find and Understand
Automatically propagate user-defined tags (crowdsource ontology)
Discover meaning of fields and tag automatically
Multi-faceted drill down
Automated facet creation based on metadata
Business metadata-based search
Annotate fields, files and folders with tags
• Analysts can tag fields and files with meaningful business tags
• Type-ahead shows existing available tags that match the typed string
• Users can choose one or create a new tag
• A period in a tag name automatically creates a tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
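The period convention can be sketched as a split on the dot, with each leading part treated as a category. The nested-dictionary glossary and the names `parse_tag` and `add_tag` are assumptions made for illustration.

```python
def parse_tag(tag_name: str):
    """Split a dotted tag name into (category path, tag)."""
    parts = [p.strip() for p in tag_name.split(".") if p.strip()]
    if not parts:
        raise ValueError("empty tag name")
    return parts[:-1], parts[-1]

def add_tag(glossary: dict, tag_name: str) -> None:
    """Insert the tag into a nested dict, creating categories as needed."""
    categories, tag = parse_tag(tag_name)
    node = glossary
    for category in categories:
        node = node.setdefault(category, {})
    node.setdefault(tag, {})

glossary = {}
add_tag(glossary, "Restaurant.Name")
add_tag(glossary, "Restaurant.City")
```

After the two calls, `glossary` holds one category, "Restaurant", with the tags "Name" and "City" beneath it.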
Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.
User-assigned tags are solid blue
Automatically suggested tags are faded blue with confidence level
Delimited files don’t have field names
Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
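The idea of learning from a hand-tagged field and suggesting the same tag elsewhere can be illustrated with value overlap. The sketch below uses Jaccard similarity between value sets as the confidence score; this is a stand-in technique chosen for brevity, not the discovery engine's actual method, and `suggest_tags` and its threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def suggest_tags(tagged, untagged, min_confidence=0.5):
    """Suggest tags for untagged fields based on value overlap.

    tagged:   {tag: values of a field a user tagged by hand}
    untagged: {field name: values}
    Returns {field name: (tag, confidence)} for the best match above the threshold.
    """
    suggestions = {}
    for field, values in untagged.items():
        best_tag, best_score = None, 0.0
        for tag, examples in tagged.items():
            score = jaccard(examples, values)
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= min_confidence:
            suggestions[field] = (best_tag, round(best_score, 2))
    return suggestions
```

Because it compares values rather than column headers, an approach like this still works on delimited files that have no field names.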
Create Hive tables
Screen shot of file with “Generate Hive Table” option selected- Replace Hive with Drill
Generate Hive Tables
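Generating a Hive table over a file in the lake comes down to emitting a `CREATE EXTERNAL TABLE` statement that points at the file's location. The sketch below builds such a statement for a delimited text file; the function name, column list and path are hypothetical, and it does not represent how the product generates its tables.

```python
def hive_external_table_ddl(table, columns, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for a delimited file.

    columns: list of (name, hive_type) pairs, e.g. [("name", "STRING")].
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )

# Hypothetical table over a folder in the landing zone.
ddl = hive_external_table_ddl(
    "restaurants",
    [("name", "STRING"), ("zip", "INT")],
    "/data/landing/restaurants",
)
```

An external table leaves the underlying files in place, so dropping the table later does not delete the data.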
Company overview
• Headquartered in Mountain View, CA
• Funded in 2013 by Menlo Ventures and Sigma West
• Management Team:
Alex Gorelik, Founder, CEO
Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
Oliver Claude, Marketing
VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
Jason Chen, Engineering
VP Teradata, Acta, Sybase. USC PhD CS.
Ravi Ramachandran, Sales
CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)
WATERLINE DATA NAMED COOL VENDOR
Gartner, Cool Vendors in Information Governance and MDM, 2015
Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White
Visit our exhibit in the ballroom to get more information