How to Build a Successful Data Lake
Alex Gorelik
Waterline Data
Founder and CEO
Data Lakes Power Data-Driven Decision Making
Maximize Business Value With a Data Lake
How Do You Democratize the Data Lake to Maximize Business Value?
[Figure: Data Swamp, Data Puddle, DW Offloading, and Data Lake plotted by Business Value (No Value to Enterprise Impact) against Data Democratization (Tight Control to “Governed” Self-Service)]
Data Swamps
Raw data
Can’t find or use
data
Can’t allow access
without protecting
sensitive data
Data Warehouse Offloading: Cost Savings
I prefer a data warehouse--it’s more predictable
It takes IT 3 months of data architecture and ETL work to add new data to the data lake
I can’t get the original data

Low variety of data and low adoption
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sand box)
Strong technical skill set requirement
Data Puddles: Limited Scope and Value
What Makes a Successful Data Lake?
Right Platform + Right Data + Right Interface
Right Platform:
• Volume: massively scalable
• Variety: schema on read
• Future-proof: modular, so the same data can be used by
many different projects and technologies
• Platform cost: extremely attractive cost structure
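To make "schema on read" concrete, here is a minimal Python sketch. The events, field names, and the `read_with_schema` helper are all hypothetical and only illustrate the idea: the raw records are stored as they arrived, and each project applies its own schema when it reads them.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON object per line).
RAW_EVENTS = """\
{"user": "u1", "amount": "19.99", "ts": "2016-06-28T10:00:00", "channel": "web"}
{"user": "u2", "amount": "5.00", "ts": "2016-06-28T10:05:00"}
"""

def read_with_schema(raw_text, schema):
    """Apply a schema at read time: pick fields and cast them, leaving the raw text untouched."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields missing from a raw record become None instead of failing the read.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Two projects read the same raw data through different schemas.
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "channel": str}

billing_rows = read_with_schema(RAW_EVENTS, billing_schema)
marketing_rows = read_with_schema(RAW_EVENTS, marketing_schema)
```

Because no schema is fixed when the data is written, a third project could later read the `ts` field without any change to what is stored.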
Right Data Challenges
Most Data is Lost, So it Can’t Be Analyzed Later
Only a small portion of data in enterprises today
is saved in data warehouses
Data Exhaust
Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be
needed later
• Save as much data as possible now
to analyze later
• Save raw data, so it can be treated
correctly for each use case
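A small Python sketch of why raw data matters, using hypothetical sensor readings: one use case wants missing readings ignored, while another treats the missing readings themselves as the signal. If the gaps had been filled or dropped at ingestion, the second use case would be impossible.

```python
# Hypothetical raw sensor readings; None marks a reading that never arrived.
raw_readings = [21.5, None, 22.0, None, 23.5]

def average_temperature(readings):
    """Use case 1: ignore missing readings when averaging."""
    present = [r for r in readings if r is not None]
    return sum(present) / len(present)

def outage_count(readings):
    """Use case 2: the missing readings themselves are the signal."""
    return sum(1 for r in readings if r is None)

avg = average_temperature(raw_readings)
outages = outage_count(raw_readings)
```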
• Departments hoard and protect
their data and do not share it with
the rest of the enterprise
• Frictionless ingestion does not
depend on data owners
Right Data Challenges: Data Silos and Data Hoarding
Right Interface: Key to Broad Adoption
• Data marketplace for
data self-service
• Providing data at the
right level of expertise
Providing Data at the Right Level of Expertise
Data scientists: raw data
Business analysts: clean, trusted, prepared data
Roadmap to Data Lake Success
Organize the lake
Set up for self-service
Open the lake to the users
Organize the Data Lake into Zones
Organize the lake
Multi-modal IT – Different Governance Levels for Different Zones
Raw or Landing
Sensitive
Gold or Curated
Work
Data Stewards
Data Scientists
Data Engineers
Data Scientists, Business Analysts
• Minimal governance
• Make sure there is no
sensitive data
• Minimal governance
• Make sure there is no
sensitive data
• Heavy governance
• Trusted, curated data
• Lineage, data quality
• Heavy governance
• Restricted access
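The zone idea can be sketched as a small policy table in Python. The mapping of zones to governance levels below is an assumption inferred from the slide labels (minimal governance for the raw and work zones, heavy governance for the gold and sensitive zones), and every name in the code is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZonePolicy:
    governance: str          # "minimal" or "heavy"
    restricted_access: bool  # True when access must be explicitly granted

# Assumed mapping of zones to governance levels, inferred from the slide labels.
ZONE_POLICIES = {
    "raw": ZonePolicy(governance="minimal", restricted_access=False),
    "work": ZonePolicy(governance="minimal", restricted_access=False),
    "gold": ZonePolicy(governance="heavy", restricted_access=False),
    "sensitive": ZonePolicy(governance="heavy", restricted_access=True),
}

def landing_zone(has_sensitive_fields: bool) -> str:
    """Route incoming data so minimally governed zones never receive sensitive data."""
    return "sensitive" if has_sensitive_fields else "raw"

def needs_access_grant(zone: str) -> bool:
    """Report whether reading from a zone requires an explicit grant."""
    return ZONE_POLICIES[zone].restricted_access
```

Routing sensitive data at ingestion is one way to satisfy "make sure there is no sensitive data" in the lightly governed zones; other designs, such as masking in place, are equally plausible.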
Business Analyst Self-Service Workflow
Find and Understand → Provision → Prep → Analyze
Set up for self-service
Finding, understanding and governing data in
a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust
anything?” – Telco Executive
Botond Horvath / Shutterstock.com
DATA SCIENTIST / BUSINESS ANALYST: Need data to use with self-service tools but can’t explore everything manually to find and understand data
DATA STEWARD: Can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
BIG DATA ARCHITECT: Can’t catalog all the data manually and keep up with data provisioning
Instead, Imagine Shopping on Amazon.com
Catalog
Find, Understand And
Collaborate
Provision
Waterline Data is like Amazon for Data in Hadoop
Finding and Understanding Data
• Crowdsource metadata and automate
creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data
sets
• Establish trust
• Curated annotated data sets
• Lineage
• Data quality
• Governance
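As a rough illustration of automated discovery, the Python sketch below suggests tags for cryptically named fields by checking how many of their values match a pattern. This is not Waterline Data's actual method; the patterns, the threshold, and the names `suggest_tags` and `build_catalog` are assumptions made for the example.

```python
import re

# Hypothetical tag rules: a tag name and a pattern its values should match.
TAG_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def suggest_tags(values, threshold=0.8):
    """Suggest tags for a field when enough of its non-empty values match a pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in TAG_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if matches / len(non_empty) >= threshold:
            tags.append(tag)
    return tags

def build_catalog(dataset):
    """Map each field name to its suggested tags; a steward would then curate these."""
    return {field: suggest_tags(values) for field, values in dataset.items()}

# Field names like "col_17" carry no meaning, so the values are profiled instead.
dataset = {
    "col_17": ["ann@example.com", "bob@example.org", ""],
    "col_18": ["94043", "10001-1234", "30301"],
    "notes": ["call back", "n/a"],
}
catalog = build_catalog(dataset)
```

The suggested tags are only candidates. Crowdsourced corrections from analysts and stewards are what turn them into trusted metadata.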
Find and
Understand
Accessing and Provisioning Data
You cannot give all access to all users
You must protect PII data and sensitive business information
Provision
Agile/Self-service
approach
Create a metadata-only catalog
When users request access,
data is de-identified and
provisioned
Top down approach
Find and de-identify all
sensitive data
Provide access to every user for
every dataset as needed
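The agile, self-service approach above can be sketched in Python as follows. The catalog holds only metadata (field names and tags), and fields tagged as PII are masked when a user requests the data. The catalog layout, the `pii` tag, and the truncated-hash masking are assumptions for illustration; a real deployment would need a stronger de-identification technique.

```python
import hashlib

# Hypothetical metadata-only catalog: field names and tags, but no data values.
CATALOG = {
    "customers": {"fields": {"name": ["pii"], "email": ["pii"], "state": []}},
}

# Hypothetical data store that users cannot read directly.
DATA = {
    "customers": [
        {"name": "Ann Lee", "email": "ann@example.com", "state": "CA"},
        {"name": "Bob Ray", "email": "bob@example.org", "state": "NY"},
    ],
}

def mask(value):
    """Replace a sensitive value with a short, stable token (illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def provision(dataset_name):
    """De-identify fields tagged 'pii' at request time and return a copy for the user."""
    fields = CATALOG[dataset_name]["fields"]
    provisioned = []
    for row in DATA[dataset_name]:
        provisioned.append({
            name: mask(value) if "pii" in fields[name] else value
            for name, value in row.items()
        })
    return provisioned

rows = provision("customers")
```

Only the datasets that someone actually requests get de-identified, which is the cost advantage over the top-down approach of finding and masking all sensitive data up front.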
Provide a Self-Service Interface to Find,
Understand, and Provision Data
Prepare data for analytics Prep
Clean data
Remove or fix bad data, fill in
missing values, convert to
common units of measure
Shape data
Combine (join, concatenate)
Resolve entities (create a single
customer record from multiple
records or sources)
Transform (aggregate, bucketize,
filter, convert codes to names, etc.)
Blend data
Harmonize data from multiple
sources to a common schema
or model
Tooling
Many great dedicated data
wrangling tools on the horizon
Some capabilities in BI and data
visualization tools
SQL and scripting languages for
the more technical analysts
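For analysts who go the scripting route, a few of the clean and shape steps above might look like this Python sketch. The order records, the unit table, and the code-to-name lookup are hypothetical.

```python
# Hypothetical order records from two sources with different units and missing values.
orders = [
    {"customer": "c1", "weight": 2.0, "unit": "kg"},
    {"customer": "c1", "weight": 1000.0, "unit": "g"},
    {"customer": "c2", "weight": None, "unit": "kg"},
    {"customer": "c2", "weight": 4.0, "unit": "kg"},
]
CODE_TO_NAME = {"c1": "Acme", "c2": "Globex"}
GRAMS_PER_UNIT = {"kg": 1000.0, "g": 1.0}

def clean(rows):
    """Convert weights to grams and fill missing weights with the mean of known ones."""
    known = [r["weight"] * GRAMS_PER_UNIT[r["unit"]] for r in rows if r["weight"] is not None]
    mean = sum(known) / len(known)
    cleaned = []
    for r in rows:
        grams = r["weight"] * GRAMS_PER_UNIT[r["unit"]] if r["weight"] is not None else mean
        cleaned.append({"customer": r["customer"], "weight_g": grams})
    return cleaned

def shape(rows):
    """Aggregate weight per customer and convert customer codes to names."""
    totals = {}
    for r in rows:
        name = CODE_TO_NAME[r["customer"]]
        totals[name] = totals.get(name, 0.0) + r["weight_g"]
    return totals

totals = shape(clean(orders))
```

Filling missing values with the mean is only one possible treatment; the right choice depends on the use case, which is another reason to keep the raw data.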
Data Analysis
• Many wonderful self-
service BI and data
visualization tools
• Mature space with many
established and
innovative vendors
Magic Quadrant for Business Intelligence and Analytics Platforms
04 February 2016 | ID:G00275847
Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
Analyze
Unlock the Value of the Data Lake with the
Waterline Data Smart Data Catalog
Time To Value
Tribal Knowledge Sharing
Trust
Waterline Data Is The Only Smart Data
Catalog For The Data Lake
“Use an INFORMATION
CATALOG TO MAXIMIZE
BUSINESS VALUE From
Information Assets”
“automatically identify, profile,
and metatag files in HDFS and
make them available for
analysis and exploration”
“tapped into an important and
underserved opportunity”
“comprehensive big data
governance and discovery
platform”
“opens the data to a
wider variety of people”
“fills a critical gap in big data
exploratory analytics by
automating the tagging and
cataloging of data”
Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
“Opening up a data lake for self-service analytics requires a
data catalog that’s smart enough to automatically catalog every
field of data so business analysts can maximize time to value” --
Jerry Megaro, Global Head Of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means
in context is vital to making a data lake initiative successful and
not just another data quagmire – the catalog plays a critical
component in this” -- Global Head of Data Governance, Risk,
and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data
lake to business people” -- Paolo Arvati, Big Data, CSI-
Piemonte
We Run Natively On Hadoop And Integrate
With Existing Tools
Workflow of Enabling Self-Service
Analytics With Hortonworks
Hortonworks Atlas And Ranger
Data Prep
Analytics & Visualization
Smart Data Discovery: Profiling, Sensitive Data & Data Lineage Discovery, Automated Tagging
Data Stewardship: Curate Tags
Self-Service Data Catalog: Find, Collaborate And Take Action
Metadata, Tags, Data Lineage
Metadata, Tags, Roles & Access Control
Roles & Access Control
A Successful Data Lake
Right Platform + Right Data + Right Interface
Come to Booth 303 to see a demo
and talk to us about your data lake
Come to the Atlas session at 4:00 PM on
Thursday in room 210C
Waterline Data
The Smart Data Catalog Company

Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 

Recently uploaded (20)

Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Build a Successful Data Lake with a Smart Data Catalog

  • 1. How to Build a Successful Data Lake Alex Gorelik Waterline Data Founder and CEO
  • 2. Data Lakes Power Data-Driven Decision Making
  • 3. Maximize Business Value With a Data Lake: How Do You Democratize the Data Lake to Maximize Business Value? [Chart labels: Data Swamp, Data Puddle, DW Offloading, Data Lake; Business Value, from No Value to Enterprise Impact; Data Democratization, from Tight Control to “Governed” Self-Service]
  • 4. Data Swamps Raw data Can’t find or use data Can’t allow access without protecting sensitive data
  • 5. Data Warehouse Offloading: Cost Savings • “I prefer a data warehouse; it’s more predictable” • “It takes IT 3 months of data architecture and ETL work to add new data to the data lake” • “I can’t get the original data”
  • 6. Data Puddles: Limited Scope and Value • Low variety of data and low adoption: focused use case (e.g., fraud detection); fully automated programs (e.g., ETL offloading); small user community (e.g., data science sandbox) • Strong technical skill set requirement
  • 7. What Makes a Successful Data Lake? Right Platform + Right Data + Right Interface
  • 8. Right Platform: • Volume: massively scalable • Variety: schema on read • Future-proof: modular, so the same data can be used by many different projects and technologies • Platform cost: extremely attractive cost structure
  • 9. Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later • Only a small portion of data in enterprises today is saved in data warehouses • Data exhaust
  • 10. Right Data: Save Raw Data Now to Analyze Later • Don’t know now what data will be needed later • Save as much data as possible now to analyze later
  • 11. Right Data: Save Raw Data Now to Analyze Later • Don’t know now what data will be needed later • Save as much data as possible now to analyze later • Save raw data, so it can be treated correctly for each use case
  • 12. Right Data Challenges: Data Silos and Data Hoarding • Departments hoard and protect their data and do not share it with the rest of the enterprise • Frictionless ingestion does not depend on data owners
  • 13. Right Interface: Key to Broad Adoption • Data marketplace for data self-service • Providing data at the right level of expertise
  • 14. Providing Data at the Right Level of Expertise • Data scientists: raw data • Business analysts: clean, trusted, prepared data
  • 15. Roadmap to Data Lake Success: 1) Organize the lake 2) Set up for self-service 3) Open the lake to the users
  • 16. Organize the Data Lake into Zones Organize the lake
  • 17. Multi-modal IT: Different Governance Levels for Different Zones • Zones: Raw or Landing; Sensitive; Gold or Curated; Work • Roles: Data Stewards; Data Scientists; Data Engineers; Data Scientists, Business Analysts • Minimal governance (listed twice): make sure there is no sensitive data • Heavy governance: trusted, curated data; lineage, data quality • Heavy governance: restricted access
  • 18. Business Analyst Self-Service Workflow: Find and Understand, Provision, Prep, Analyze (Set up for self-service)
  • 19. Finding, understanding and governing data in a data lake is like shopping at a flea market “We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive
  • 20. • Data scientist / business analyst: need data to use with self-service tools but can’t explore everything manually to find and understand data • Data steward: can’t govern and trust data (unknown metadata, data quality, PII, data lineage) • Big data architect: can’t catalog all the data manually and keep up with data provisioning (Image credit: Botond Horvath / Shutterstock.com)
  • 21. Instead, Imagine Shopping on Amazon.com: Catalog • Find, Understand, and Collaborate • Provision
  • 22. Waterline Data Is Like Amazon for Data in Hadoop: Catalog • Find, Understand, and Collaborate • Provision
  • 23. Finding and Understanding Data • Crowdsource metadata and automate creation of a catalog • Institutionalize tribal data knowledge • Automate discovery to cover all data sets • Establish trust • Curated annotated data sets • Lineage • Data quality • Governance Find and Understand
  • 24. Accessing and Provisioning Data (Provision) • You cannot give all access to all users • You must protect PII data and sensitive business information • Top-down approach: find and de-identify all sensitive data, then provide access to every user for every dataset as needed • Agile/self-service approach: create a metadata-only catalog; when users request access, data is de-identified and provisioned
  • 25. Provide a Self-Service Interface to Find, Understand, and Provision Data
  • 26. Prepare Data for Analytics (Prep) • Clean data: remove or fix bad data, fill in missing values, convert to common units of measure • Shape data: combine (join, concatenate); resolve entities (create a single customer record from multiple records or sources); transform (aggregate, bucketize, filter, convert codes to names, etc.) • Blend data: harmonize data from multiple sources to a common schema or model • Tooling: many great dedicated data wrangling tools on the horizon; some capabilities in BI and data visualization tools; SQL and scripting languages for the more technical analysts
  • 27. Data Analysis (Analyze) • Many wonderful self-service BI and data visualization tools • Mature space with many established and innovative vendors • Source: Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016, ID: G00275847, Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
  • 28. Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog Time To Value Tribal Knowledge Sharing Trust
  • 29. Waterline Data Is The Only Smart Data Catalog For The Data Lake “Use an INFORMATION CATALOG TO MAXIMIZE BUSINESS VALUE From Information Assets” “automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration” “tapped into an important and underserved opportunity” “comprehensive big data governance and discovery platform” “opens the data to a wider variety of people” “fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
  • 30. Current Customers: Healthcare, Insurance, Life Sciences, Aerospace, Automotive, Banking, Government, Marketing • “Opening up a data lake for self-service analytics requires a data catalog that’s smart enough to automatically catalog every field of data so business analysts can maximize time to value” (Jerry Megaro, Global Head of Data Analytics, Merck KGaA) • “Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this” (Global Head of Data Governance, Risk, and Standard, International Multi-Line Insurer) • “A governed yet agile data catalog is key to open up the data lake to business people” (Paolo Arvati, Big Data, CSI-Piemonte)
  • 31. We Run Natively On Hadoop And Integrate With Existing Tools
  • 32. Workflow of Enabling Self-Service Analytics With Hortonworks • Smart Data Discovery: profiling, sensitive data and data lineage discovery, automated tagging • Data Stewardship: curate tags • Self-Service Data Catalog: find, collaborate, and take action • Hortonworks Atlas and Ranger: metadata, tags, data lineage; metadata, tags, roles and access control; roles and access control • Data Prep • Analytics & Visualization
  • 33. A Successful Data Lake: Right Platform + Right Data + Right Interface
  • 34. Come to Booth 303 to see a demo and talk to us about your data lake. Come to the Atlas session at 4:00 PM on Thursday in room 210C.
  • 35. Waterline Data The Smart Data Catalog Company
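
The agile, self-service provisioning approach on slide 24 (a metadata-only catalog, with data de-identified when a user requests access) can be sketched roughly as follows. This is an illustration only, not Waterline Data's implementation: the `CatalogEntry` class, the `provision` and `de_identify` functions, the column names, and the use of a salted SHA-256 hash as the de-identification step are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata-only catalog record: it describes a data set
    but holds none of the underlying rows."""
    name: str
    columns: list
    sensitive_columns: set = field(default_factory=set)


def de_identify(value, salt="demo-salt"):
    """Replace a sensitive value with a truncated salted hash.
    Hashing is an assumed technique; masking or tokenization are alternatives."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def provision(entry, rows):
    """Return a copy of the rows with sensitive columns de-identified.
    Intended to run only when a user requests access to the data set."""
    return [
        {
            key: de_identify(val) if key in entry.sensitive_columns else val
            for key, val in row.items()
        }
        for row in rows
    ]


# Hypothetical data set and its catalog entry.
entry = CatalogEntry(
    name="customers",
    columns=["customer_id", "email", "region"],
    sensitive_columns={"email"},
)
raw_rows = [
    {"customer_id": 1, "email": "a@example.com", "region": "EMEA"},
    {"customer_id": 2, "email": "b@example.com", "region": "APAC"},
]

# The raw rows stay untouched; the user receives the de-identified copy.
provisioned = provision(entry, raw_rows)
```

In this sketch nothing is de-identified up front, which mirrors the slide's contrast with the top-down approach: the work is done per request rather than for every data set in the lake.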

Editor's Notes

  1. End-user tools only provide the last mile to leverage data, but in and of themselves they don’t know where the right data is. The right data has to be found, quickly and securely.
  2. The opposite of a flea market is Amazon. It gives the consumer self-service, but it functions as a managed application.
  3. Like Amazon, we offer a solution that catalogs the data assets, provides a front-end to find, understand, and share, and provides a way to take action and quickly open the data in any end-user tool to wrangle, visualize, or analyze the data.
  4. A data lake provides one place where any data can be saved and used by business analysts and data scientists to mash up data in new ways to answer new business questions. Waterline Data enables you to open up the data lake to business analysts and data scientists so they can do data prep, analytics, or modeling. Our product delivers value along 3 dimensions (i.e., the 3 T’s). • Time to value: we catalog every field of data for the entire data lake and we provide an interface to quickly find, understand, and take action on the data (e.g., you can provision or open the data in Trifacta). The end result is faster time to uncover value. • Tribal knowledge sharing: we don’t just discover what the data means, but we also empower subject matter experts to augment the data catalog with additional tags and comments to capture additional information, such as the intended use of the data, to help accelerate future projects. • Trust: we facilitate data governance by tagging data based on approved business glossaries and data stewardship curation, as well as by providing secure self-service access to the data based on roles and visibility rules.
  5. Waterline Data has been acknowledged as filling an important gap in opening up data lakes for self-service data preparation and analytics. The need for a data catalog has been recognized as key to enabling a data democracy and self-service by the business. For instance Gartner just released a paper on how CDOs can leverage an information catalog to get more business value from data assets. We are the only company that can build a data catalog automatically, and at scale, for a data lake.
  6. We have customers in production across many industries. They realize value by being able to catalog all the data quickly and make it easily available to the business to do self-service data preparation and analytics. They also get value from the fact that the data catalog supports agile data governance, by enabling data stewards to quickly curate tags, and by providing several levels of access control based on the data governance policies (e.g., access to sensitive data is protected). (If they ask: data lakes range from smaller 5-node clusters to over 100 nodes, so our product can be used right away even when the lake is small, and grow to a large lake.)
  7. Our product runs natively on the major platforms like AWS, Cloudera, Hortonworks, MapR, and Pivotal. We are also in the process of certifying on IIP. We integrate with existing data management tools: • We can import and export data lineage and tag information with Atlas and Navigator. • We support access control policies and integrate with LDAP, Ranger, and Sentry. • We can import existing business glossaries from Collibra, Informatica, or IBM (note this is done through our API, so we should be able to import from any business glossary). • We can integrate with ETL tools to import metadata. • We integrate with end-user tools through an open framework (we provide the ability to generate Hive tables automatically, as well as the ability to open the data directly in end-user tools).
  8. Waterline Data accelerates the creation of the data catalog at big data scale: • We parse, profile, and discover sensitive data and data lineage, and automatically tag fields based on an integrated business glossary and tagging rules. • We empower data stewards to quickly curate tags. • We empower business analysts and data scientists to quickly find the right data they need and take immediate action with the data by being able to open it with the desired end-user tool.
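
To make the idea in note 8 concrete (profile each field, then suggest tags from tagging rules for a data steward to curate), here is a minimal sketch. It is not Waterline Data's algorithm: the rule set, the 80% match threshold, the tag names, and the functions `profile_field` and `suggest_tags` are all hypothetical, and simple regular expressions stand in for whatever discovery logic a real product uses.

```python
import re

# Hypothetical tagging rules: tag name -> pattern a field's values should match.
TAGGING_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}
# Assumption: these tags mark a field as sensitive.
SENSITIVE_TAGS = {"email", "us_ssn"}


def profile_field(values):
    """Compute a few basic profile statistics for one field."""
    non_null = [v for v in values if v not in (None, "")]
    null_fraction = 1 - len(non_null) / len(values) if values else 0.0
    return {
        "count": len(values),
        "null_fraction": null_fraction,
        "distinct": len(set(non_null)),
    }


def suggest_tags(values, threshold=0.8):
    """Suggest every tag whose rule matches at least `threshold` of the
    non-null values. Suggestions are meant to be curated by a steward."""
    non_null = [str(v) for v in values if v not in (None, "")]
    if not non_null:
        return []
    tags = []
    for tag, pattern in TAGGING_RULES.items():
        matches = sum(1 for v in non_null if pattern.match(v))
        if matches / len(non_null) >= threshold:
            tags.append(tag)
    return tags


# Hypothetical sample of two fields from one file in the lake.
sample = {
    "contact": ["a@example.com", "b@example.org", None],
    "postal": ["94040", "10001-1234", "30301"],
}

catalog = {}
for name, values in sample.items():
    tags = suggest_tags(values)
    catalog[name] = {
        "profile": profile_field(values),
        "tags": tags,
        "sensitive": any(t in SENSITIVE_TAGS for t in tags),
    }
```

Matching on a fraction of values rather than all of them is a deliberate choice in this sketch: real fields usually contain some dirty or missing entries, so an all-or-nothing rule would miss most of them.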