This document discusses deploying a governed data lake using Hadoop and Waterline Data Inventory. It begins by outlining the benefits of a data lake and the differences between data lakes and data warehouses. It then discusses using Hadoop as the platform for the data lake and the challenges it raises around governance, scale, and usability. The document proposes a three-phase approach that uses Waterline Data Inventory to organize, inventory, and open up the data lake. It provides screenshots and descriptions of Waterline's key capabilities, such as metadata discovery, data profiling, sensitive data identification, governance tools, and a self-service catalog. It also includes an overview of Waterline Data as a company.
3. 3
A data lake
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
“Size and low cost”
“Fidelity: Hadoop data lakes preserve data in its original form”
“Ease of accessibility: Accessibility is easy in the data lake”
“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
4. 4
Data warehouse vs. data lake
Data Warehouse
• Production system
• Well-defined usage
• Well-defined schema
• Clean, trusted data
• Heavy IT reliance
– Less technical analysts
– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
Data Lake
• Non-production system
• Future, experimental usage
• No schema (schema on read)
• Raw data, frictionless ingestion
• Self-service
– More technical analysts
– IT manages the cluster and ingestion, but no IT involvement when working with data
5. 5
Hadoop as the platform for a scalable data lake infrastructure
• [✔ Hadoop] Lots of data (Volume): cost-effective storage and scalable processing
• [✔ Hadoop] Flexibility to handle all kinds of data (Variety)
• [✔ Hadoop] Will be around for a long time: modularity to ensure future-proofing
8. 8
Big Data Architect: How do I get the business to start using it?
Data Scientist/Business Analyst: How do I find and understand data easily to do big data analytics?
Hadoop, 10-20 nodes
Self-service
9. 9
Big Data Architect
Data Scientist/Business Analyst
Hadoop, 10-20 nodes
No security and governance
Risk/Data Governance Executive: How do I ensure compliance with regulations and data policies? Sensitive data?
10. 10
Big Data Architect: How do I scale?
Data Scientist/Business Analysts
Hadoop, 100s/1000s of nodes
Manual process to catalog the lake can’t scale
11. 11
The platform for a scalable data lake infrastructure
• [✔ Hadoop] Lots of data (Volume): cost-effective storage and scalable processing
• [✔ Hadoop] Flexibility to handle all kinds of data (Variety)
• [✔ Hadoop] Will be around for a long time: modularity to ensure future-proofing
• [X Hadoop] Self-service to help users find, understand, and use the data
• [X Hadoop] Governance to protect sensitive data, document lineage, and assess quality
12. 12
Waterline Data Inventory broadens Hadoop adoption through governed self-service
Big Data Architect: Massive scale
Data Scientist/Business Analyst: Self-service
Risk/Data Governance Executive: Security and governance
Hadoop, 100s/1000s of nodes
13. 13
3-phase approach to a governed data lake
• Organize the lake
• Inventory the lake
• Open up the lake
15. 15
Establish access control per zone
• Business Analysts
• Data Scientists
• Data Scientists
• Data Engineers
• Data Scientists
• Data Engineers
• Data Stewards
Zones: Sensitive, Landing, Gold, Work
Organize the lake
16. 16
The governed data lake
[Diagram: Data Scientist/Business Analyst, Data Steward, and Big Data Architect work through Waterline Data Inventory, a governed data layer on top of HDFS and Hive. Diagram labels: Find/understand, Govern, Inventory, Governance, Self-Service.]
17. 17
Inventory the lake
Profile and discover the content of files and Hive tables
[Diagram: The governed data lake. Data Scientist/Business Analyst, Data Steward, and Big Data Architect work through Waterline Data Inventory, a governed data layer on top of HDFS and Hive. Diagram labels: Metadata Curation, Self-Service Catalog/Provisioning, Find/understand, Govern, Inventory.]
19. 19
Govern the lake
Governance
• Inspect files and perform tag curation
• Identify sensitive data
• Assess data quality
• Discover data lineage
• Manage glossary
[Diagram: The governed data lake. Data Scientist/Business Analyst, Data Steward, and Big Data Architect work through Waterline Data Inventory, a governed data layer on top of HDFS and Hive. Diagram labels: Self-Service Catalog/Provisioning, Find/understand, Govern, Inventory.]
20. 20
Navigate Lineage of Files in Hadoop
Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
21. 21
Automated Data Profiling Helps with Quality Assessment
Infographic shows contents at a glance:
• Different types of data in the same field
• Number of missing values
Separate profiles for each data type, including number of unique values (cardinality), uniqueness (selectivity), and type-specific measures like mean and standard deviation for numbers
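The profile measures listed above can be illustrated with a minimal Python sketch. This is not Waterline's implementation: the function name `profile_field`, the treatment of empty strings as missing, and the definition of selectivity as distinct values divided by non-missing values are all assumptions made for the example.

```python
import statistics

def profile_field(values):
    """Profile one field: missing count, type mix, cardinality, selectivity, numeric stats."""
    # Assumption: None and blank strings count as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    missing = len(values) - len(present)

    # Split present values by whether they parse as numbers,
    # which exposes "different types of data in the same field".
    numbers, strings = [], []
    for v in present:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            strings.append(str(v))

    distinct = set(map(str, present))
    profile = {
        "rows": len(values),
        "missing": missing,
        "cardinality": len(distinct),
        # Assumption: selectivity = distinct values / non-missing values.
        "selectivity": len(distinct) / len(present) if present else 0.0,
        "type_counts": {"number": len(numbers), "string": len(strings)},
    }
    if numbers:
        # Type-specific measures for the numeric values only.
        profile["mean"] = statistics.mean(numbers)
        profile["stdev"] = statistics.pstdev(numbers)
    return profile

sample = ["10", "20", "20", "", "n/a", None, "30"]
result = profile_field(sample)
print(result)
```

In this sample the field mixes four numeric values with one string and has two missing entries, which is the kind of mismatch a profile is meant to surface.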
22. 22
Data Preview and Visualization Helps Understand the Data
Visualization helps understand the shape and distribution of data
Most frequent values for each field
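As a rough illustration of "most frequent values for each field", the sketch below counts values with Python's `collections.Counter`. The helper name `most_frequent`, the sample data, and the choice to skip empty values are assumptions, not a description of the product.

```python
from collections import Counter

def most_frequent(values, k=3):
    """Return the k most common non-empty values with their counts."""
    counts = Counter(v for v in values if v not in (None, ""))
    return counts.most_common(k)

# Hypothetical field contents.
cities = ["Austin", "Boston", "Austin", "", "Denver", "Boston", "Austin"]
top = most_frequent(cities, k=2)
print(top)
```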
26. 26
View and search history
Screenshot of history tab
Another screenshot of searching
history (made up)
Data Inventory keeps track of all user tagging, schema changes, and lineage changes in Audit History
27. 27
Open up the data lake
Explore catalog and provision data securely
[Diagram: Data Scientist/Business Analyst, Data Steward, and Big Data Architect work through Waterline Data Inventory, a governed data layer on top of HDFS and Hive. Diagram labels: Find/understand, Govern, Inventory, Governance, Self-Service.]
28. 28
Find and Understand
Automatically propagate user-defined tags (crowdsource ontology)
Discover meaning of fields and tag automatically
Multi-faceted drill down
Automated facet creation based on metadata
Business metadata-based search
29. 29
Annotate fields, files and folders with tags
• Analysts can tag fields and files with meaningful business tags
• Type-ahead shows existing available tags that match the typed string
• Users can choose one or create a new tag
• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
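The tagging behavior described above can be sketched in a few lines of Python. The function names `split_tag` and `type_ahead` are hypothetical, and two details are assumptions: only the first period is treated as the category separator, and type-ahead is modeled as a case-insensitive substring match.

```python
def split_tag(name):
    """Split 'Category.Tag' at the first period into (category, tag)."""
    category, sep, tag = name.partition(".")
    if not sep:
        # No period: the tag has no category.
        return None, name
    return category, tag

def type_ahead(typed, existing_tags):
    """Return existing tags that contain the typed string, ignoring case."""
    needle = typed.lower()
    return sorted(t for t in existing_tags if needle in t.lower())

existing = ["Restaurant.Name", "Restaurant.City", "Customer.Name"]
print(split_tag("Restaurant.Name"))
print(type_ahead("rest", existing))
```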
30. 30
Based on a single field in one file tagged as Restaurant.Name, Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.
User-assigned tags are solid blue
Automatically suggested tags are faded blue with confidence level
Delimited files don’t have field names
Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
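One simple way to picture tag suggestion with a confidence level is value overlap between a tagged field and untagged fields. The sketch below uses Jaccard similarity as a stand-in technique; the slides do not say how the discovery engine actually works, and the names `jaccard`, `suggest_tags`, the 0.5 threshold, and the sample values are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def suggest_tags(tagged_fields, untagged_fields, threshold=0.5):
    """Suggest the best-matching tag for each untagged field, with a confidence score."""
    suggestions = []
    for field_name, values in untagged_fields.items():
        best_tag, best_score = None, 0.0
        for tag, tagged_values in tagged_fields.items():
            score = jaccard(values, tagged_values)
            if score > best_score:
                best_tag, best_score = tag, score
        # Only suggest when the overlap clears the (assumed) threshold.
        if best_tag is not None and best_score >= threshold:
            suggestions.append((field_name, best_tag, round(best_score, 2)))
    return suggestions

# One field tagged by an analyst; the untagged fields come from a delimited
# file without field names, so they are identified by position.
tagged = {"Restaurant.Name": ["Blue Door", "Green Fork", "Red Table", "Gold Spoon"]}
untagged = {
    "col_3": ["Green Fork", "Red Table", "Gold Spoon", "Silver Plate"],
    "col_7": ["94110", "94103", "94117"],
}
suggestions = suggest_tags(tagged, untagged)
print(suggestions)
```

Here `col_3` shares three of five distinct values with the tagged field, so it receives a suggestion, while `col_7` shares none and is left untagged.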
31. 31
Create Hive tables
Screen shot of file with “Generate Hive Table” option selected
- Replace Hive with Drill
Generate Hive Tables
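To make the idea of generating a Hive table from a delimited file concrete, here is a minimal Python sketch that builds a `CREATE EXTERNAL TABLE` statement. It is not the product's output: the function name `hive_ddl`, the table name, the column list, and the HDFS path are hypothetical, and the column names and types are assumed to come from earlier tagging and profiling.

```python
def hive_ddl(table, columns, location, delimiter=","):
    """Build a Hive CREATE EXTERNAL TABLE statement for a delimited text file."""
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )

# Hypothetical columns and path for illustration only.
ddl = hive_ddl(
    "restaurants",
    [("name", "STRING"), ("city", "STRING"), ("rating", "DOUBLE")],
    "/data/landing/restaurants",
)
print(ddl)
```

An external table leaves the underlying file in place, which fits a lake where raw data is preserved in its original form.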
33. 33
Company overview
• Headquartered in Mountain View, CA
• Funded in 2013 by Menlo Ventures and Sigma West
• Management Team:
– Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
– Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
– Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.
– Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish).
WATERLINE DATA NAMED COOL VENDOR
Gartner, Cool Vendors in Information Governance and MDM, 2015
Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White