Deploying a Governed Data Lake

2
Everyone needs data to make better decisions

3
A data lake
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
“Size and low cost”
“Fidelity: Hadoop data lakes preserve data in its original form”
“Ease of accessibility: Accessibility is easy in the data lake”
“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”

4
Data warehouse vs. data lake
Data Warehouse
• Production system
• Well-defined usage
• Well-defined schema
• Clean, trusted data
• Heavy IT reliance
– Less technical analysts
– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
Data Lake
• Non-production system
• Future, experimental usage
• No schema (schema on read)
• Raw data, frictionless ingestion
• Self-service
– More technical analysts
– IT manages the cluster and ingestion, but no IT involvement when working with data
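
To make "schema on read" concrete, here is a minimal Python sketch. It is only an illustration: the sample data, the field names and the `read_with_schema` helper are assumptions and do not come from the slides. The raw text is stored once, untouched, and each reader applies its own schema when it reads.

```python
import csv
import io

# Raw data as it landed in the lake; it is never rewritten (hypothetical sample).
RAW = "id,amount,city\n1,12.50,Austin\n2,n/a,Boston\n"

def read_with_schema(raw_text, schema):
    """Apply a schema (field name -> cast function) at read time."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (ValueError, KeyError, TypeError):
                row[field] = None  # value does not fit this reader's schema
        rows.append(row)
    return rows

# Two consumers read the same raw file with different schemas.
billing_view = read_with_schema(RAW, {"id": int, "amount": float})
geo_view = read_with_schema(RAW, {"id": int, "city": str})
```

A warehouse would reject or cleanse the `n/a` value at load time. Here the decision is deferred to each reader.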

5
Hadoop as the platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)

6
Is Hadoop enough?
Big Data Architect
Hadoop
We have
Hadoop, now
what?
10-20 nodes

7
Big Data Architect
Hadoop
How do I get
the business to
start using it?
Data Scientist/Business
Analyst
10-20 nodes

8
Big Data Architect
Hadoop
How do I get
the business to
start using it?
Analyst
How do I find
and understand
data easily to
do big data
analytics?
Self-service
10-20 nodes

9
Big Data Architect
Hadoop
Analyst
No security and
governance
10-20 nodes
Risk/Data Governance Executive
How do I ensure compliance with regulations and data policies? Sensitive data?

10
Big Data Architect
Hadoop
How do I
scale?
Analysts
100s/1000s of nodes
Manual process to catalog the lake can’t scale

11
The platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
• Self-service to help users find, understand and use the data (X Hadoop)
• Governance to protect sensitive data, document lineage and assess quality (X Hadoop)

12
Waterline Data Inventory broadens Hadoop
adoption through governed self-service
Big Data Architect
Hadoop
Data
Scientist/Business
Analyst
100s/1000s of nodes
Risk/Data
Governance
Executive
Self-service
Security and governance
Massive scale

13
3-phase approach to a governed data lake
Organize
the lake
Inventory
the lake
Open up
the lake

14
Organize the lake into zones
Organize
the lake

15
Establish access control per zone
• Business Analysts
• Data Scientists
• Data Scientists
• Data Engineers
• Data Scientists
• Data Engineers
• Data Stewards
Zones: Sensitive, Landing, Gold, Work
Organize
the lake
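
A per-zone access check could look like the following minimal Python sketch. The role-to-zone mapping, the path layout (zone as the first path component) and the `can_read` helper are all hypothetical. The slide does not say which roles belong to which zone, so the mapping below is an assumption for illustration only.

```python
# Hypothetical mapping of zones to the roles allowed to read them.
ZONE_READERS = {
    "landing": {"data_engineer"},
    "work": {"data_scientist", "data_engineer"},
    "gold": {"business_analyst", "data_scientist"},
    "sensitive": {"data_steward"},
}

def can_read(role, path):
    """Return True if the role may read a path such as /gold/sales/2015.csv."""
    zone = path.strip("/").split("/")[0]
    # Unknown zones are denied by default.
    return role in ZONE_READERS.get(zone, set())
```

In a real cluster this policy would typically be enforced by the platform's own permissions rather than by application code. The sketch only shows the shape of the rule.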

16
The governed data lake
Data Scientist/Business Analyst Data Steward Big Data Architect
HDFS Hive
Waterline Data Inventory
Find/understand Govern
Governed
data layer
Governance
Inventory
Self-Service

17
Metadata Curation
Self-Service Catalog/Provisioning
Big Data Architect
Find/understand
Governed
data layer
Data Scientist/Business Analyst
Data Steward
HDFS Hive
Govern
Inventory
Inventory
the lake
Profile and discover
the content of files
and Hive tables

18
Inventory
Parse multiple
content types
Create catalog
automatically
Discover lineage
automatically

19
Self-Service Catalog/Provisioning
Big Data Architect
Find/understand
Governed
data layer
Data Steward
HDFS Hive
Govern
Inventory
Govern
the lake
Governance
• Inspect files and perform
tag curation
• Identify sensitive data
• Assess data quality
• Discover data lineage
• Manage glossary

20
Navigate Lineage of Files in Hadoop
Clickable, navigable
lineage discovered using
file content or imported
from other tools through
REST APIs

21
Automated Data Profiling Helps with Quality
Assessment
Infographic shows
contents at a glance:
• Different types of data in
the same field
• Number of missing
values
Separate profiles for each data
type including number of unique
values (cardinality), uniqueness
(selectivity) and type-specific
measures like mean and
standard deviation for numbers
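
The measures named above can be sketched in a few lines of Python. This is an illustrative sketch, not the product's profiler: the `profile` function, the list of missing-value markers and the two-type split (number vs. string) are assumptions.

```python
import statistics

# Markers treated as missing values (an assumption for this sketch).
MISSING = {"", "null", "NULL", "n/a", "N/A"}

def infer_type(value):
    """Classify a raw string as 'number' or 'string'."""
    try:
        float(value)
        return "number"
    except ValueError:
        return "string"

def profile(values):
    """Profile one field: missing count plus a separate profile per data type."""
    present = [v for v in values if v is not None and v.strip() not in MISSING]
    by_type = {}
    for v in present:
        by_type.setdefault(infer_type(v), []).append(v)
    report = {"total": len(values), "missing": len(values) - len(present), "types": {}}
    for type_name, vals in by_type.items():
        cardinality = len(set(vals))  # number of unique values
        stats = {
            "count": len(vals),
            "cardinality": cardinality,
            "selectivity": cardinality / len(vals),  # uniqueness ratio
        }
        if type_name == "number":
            nums = [float(v) for v in vals]
            stats["mean"] = statistics.mean(nums)
            stats["stdev"] = statistics.stdev(nums) if len(nums) > 1 else 0.0
        report["types"][type_name] = stats
    return report

report = profile(["10", "20", "20", "abc", "", None])
```

A field whose report contains more than one entry under `types` is the "different types of data in the same field" case from the slide.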

22
Data Preview and Visualization Helps
Understand the Data
Visualization helps
understand the shape
and distribution of data
Most frequent values for
each field

23
Discover Sensitive Data
Find all fields that
may have SSN
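
One simple way to flag fields that may hold SSNs is to test values against a format pattern and report fields where most values match. The Python sketch below assumes the dashed `NNN-NN-NNNN` format and a hypothetical 80% threshold. It is not the product's detection logic, and a pattern match alone only suggests that a field may be sensitive.

```python
import re

# Dashed SSN format only; other layouts are out of scope for this sketch.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def ssn_match_ratio(values):
    """Fraction of non-empty values that look like an SSN."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_PATTERN.fullmatch(v))
    return hits / len(non_empty)

def suggest_ssn_fields(columns, threshold=0.8):
    """Return names of fields whose values mostly match the SSN pattern."""
    return sorted(
        name for name, values in columns.items()
        if ssn_match_ratio(values) >= threshold
    )

columns = {
    "col_1": ["123-45-6789", "987-65-4320", ""],
    "col_2": ["Austin", "Boston"],
    "col_3": ["123-45-6789", "n/a", "x", "y"],
}
candidates = suggest_ssn_fields(columns)
```

Flagged fields are candidates only; a data steward would still accept or reject each suggestion.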

24
Curate Discovered Sensitive Data Fields
Curate the field and
accept or reject the tag

25
Manage Glossary
Import or create a
business glossary
Manage tags

26
View and search history
[Screenshots: history tab; searching history]
Data Inventory keeps
track of all user tagging,
schema changes, lineage
changes in Audit History

27
Data Steward
Govern
Big Data Architect
Governed
data layer
Open up the data lake
HDFS Hive
Inventory
Governance
Self-Service
Find/understand
Explore catalog
and provision
data securely
Open up
the lake

28
Find and Understand
Automatically propagate user-
defined tags (crowdsource ontology)
Discover meaning of fields and
tag automatically
Multi-faceted
drill down
Automated facet creation
based on metadata
Business metadata-based search

29
Annotate fields, files and folders with tags
• Analysts can tag fields and files
with meaningful business tags
• Type-ahead shows existing
available tags that match the
typed string
• Users can choose one or create
a new tag
• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
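
The period-based hierarchy and the type-ahead lookup can be sketched as follows. This is a minimal Python illustration: the function names and the nested-dictionary representation of the tag tree are assumptions, not the product's data model.

```python
def parse_tag(tag_name):
    """Split 'Restaurant.Name' into (['Restaurant'], 'Name')."""
    parts = [p.strip() for p in tag_name.split(".") if p.strip()]
    if not parts:
        raise ValueError("empty tag name")
    return parts[:-1], parts[-1]

def add_tag(tree, tag_name):
    """Insert a tag into a nested dict, creating categories as needed."""
    categories, tag = parse_tag(tag_name)
    node = tree
    for category in categories:
        node = node.setdefault(category, {})
    node.setdefault(tag, {})
    return tree

def type_ahead(existing_tags, typed):
    """Return existing tags that contain the typed string, case-insensitively."""
    typed = typed.lower()
    return sorted(t for t in existing_tags if typed in t.lower())

tree = add_tag({}, "Restaurant.Name")
add_tag(tree, "Restaurant.City")
matches = type_ahead(["Restaurant.Name", "Restaurant.City", "Customer.Name"], "name")
```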

30
Based on a single field in one file tagged as
Restaurant.Name, Waterline Data Inventory
discovery engine found 25 additional instances of
Restaurant Name automatically.
User-assigned tags are
solid blue
Automatically suggested
tags are faded blue with
confidence level
Delimited files
don’t have
field names
Waterline Data Inventory learns from analysts who manually
tag fields and automatically finds and tags similar fields
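
The slides do not describe how the discovery engine matches fields, so the following Python sketch swaps in a deliberately simple stand-in technique: Jaccard overlap between the value sets of a manually tagged field and each untagged field, with the overlap reported as a confidence level. The function names, the sample values and the 0.5 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def suggest_tags(tagged_fields, untagged_fields, min_confidence=0.5):
    """Suggest the best-matching tag for each untagged field.

    tagged_fields maps tag name -> values seen in a manually tagged field.
    untagged_fields maps field name -> values in that field.
    Returns (field, tag, confidence) tuples.
    """
    suggestions = []
    for field, values in untagged_fields.items():
        best_tag, best_score = None, 0.0
        for tag, sample in tagged_fields.items():
            score = jaccard(values, sample)
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= min_confidence:
            suggestions.append((field, best_tag, round(best_score, 2)))
    return suggestions

tagged = {"Restaurant.Name": ["Chez Panisse", "Nopa", "Zuni Cafe", "Delfina"]}
untagged = {
    "file2.col_3": ["Nopa", "Zuni Cafe", "Delfina", "Flour + Water"],
    "file2.col_4": ["94110", "94102"],
}
suggestions = suggest_tags(tagged, untagged)
```

Because matching works on field content rather than field names, the same idea applies to delimited files that have no header row.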

31
Create Hive tables
Screen shot of file with “Generate Hive Table” option selected
- Replace Hive with Drill
Generate Hive
Tables
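
As a rough illustration of what generating a Hive table from a file can involve, the Python sketch below infers column types from a delimited sample and emits a `CREATE EXTERNAL TABLE` statement. The helper names, the three-type inference (BIGINT, DOUBLE, STRING), the table name and the location are assumptions. The sketch also assumes the file has a header row.

```python
import csv
import io

def infer_hive_type(values):
    """Pick the narrowest of BIGINT, DOUBLE, STRING that fits all values."""
    if not values:
        return "STRING"
    for cast, hive_type in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            for v in values:
                cast(v)
            return hive_type
        except ValueError:
            continue
    return "STRING"

def generate_hive_ddl(table, location, sample_text, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement from a delimited sample."""
    rows = list(csv.reader(io.StringIO(sample_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        values = [r[i] for r in data if i < len(r) and r[i] != ""]
        columns.append(f"  `{name}` {infer_hive_type(values)}")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        + ",\n".join(columns)
        + "\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'\n"
        # Skip the header line so it is not read as a data row.
        'TBLPROPERTIES ("skip.header.line.count"="1")'
    )

sample = "name,seats,rating\nNopa,80,4.5\nDelfina,60,4\n"
ddl = generate_hive_ddl("restaurants", "/gold/restaurants", sample)
```

An external table leaves the file where it is in the lake, so dropping the table does not delete the data.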

33
Company overview
• Headquartered in Mountain View, CA
• Funded in 2013 by Menlo Ventures and Sigma West
• Management Team:
Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.
Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish).
WATERLINE DATA NAMED COOL VENDOR
Gartner, Cool Vendors in Information Governance
and MDM, 2015
Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill
O'Kane, Andrew White

Visit our exhibit in the ballroom to get
more information
