DATA LAKE ARCHITECTURE
Monojit Basu, Founder & Director
TechYugadi IT Solutions & Consulting
OSI DAYS 2016, BANGALORE
Data Never Sleeps
 Every minute
 Facebook users share 216,302 photos
 Dropbox users upload 833,333 new files
 YouTube users share 400 hours of new video
 Twitter users send 350,000 tweets
 A Boeing 737 Aircraft in flight generates 40 TB of data
EDW vs Data Lake
 Data Lake is built on the premise that every drop of
data is valuable
 It's a place for capturing and exploring huge
volumes of raw data that a business generates
 Explorers are diverse: business analysts, data
scientists, …
 even business managers (using self-service)
 Goals of exploration may be loosely defined
EDW vs Data Lake
 EDW stores filtered and processed data
 For pre-meditated usage scenarios
 Traditionally structured in the form of ‘cubes’
 Analogy
 Difference between a college library (focused on
curriculum) and the US Library of Congress
EDW vs Data Lake
 Schema-on-Read (Data Lake)
 Schema-on-Write (EDW)
[Figure: Data Lake. Sources: XML, JSON, CSV, PDF, Trading Partner REST API, Invoicing, Orders DB. Consumers, each via its own Read / Extract path: CRM Analytics, SCM Analytics, Reco Engine.]
[Figure: Enterprise Data Warehouse. The same sources are loaded through ETL into the warehouse, which serves Sales, Operations and Marketing.]
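The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumed names: ORDER_SCHEMA, write_with_schema and read_with_schema are hypothetical, and plain lists stand in for the warehouse and the lake.

```python
import json

# Hypothetical warehouse schema: field name -> conversion function.
ORDER_SCHEMA = {"order_id": int, "amount": float}

def write_with_schema(raw_line, table):
    """Schema-on-write: convert to the fixed schema before storing.
    Fields outside the schema are lost at load time."""
    record = json.loads(raw_line)
    row = {name: cast(record[name]) for name, cast in ORDER_SCHEMA.items()}
    table.append(row)

def read_with_schema(raw_lines, fields):
    """Schema-on-read: raw lines are stored untouched; each reader projects
    the fields it needs and skips records that lack them."""
    for line in raw_lines:
        record = json.loads(line)
        if all(f in record for f in fields):
            yield {f: record[f] for f in fields}

raw = [
    '{"order_id": 1, "amount": "19.99", "coupon": "DIWALI"}',
    '{"order_id": 2, "amount": "5.00"}',
]

warehouse = []
for line in raw:
    write_with_schema(line, warehouse)  # "coupon" is dropped here

lake = list(raw)  # the lake keeps every raw record as it arrived
coupons = list(read_with_schema(lake, ["order_id", "coupon"]))
```

A question nobody planned for (which orders used a coupon?) can still be answered from the lake, while the warehouse rows no longer carry that field.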
Why Think of Data Lake
 Business Drivers
 Diverse sources of data: transactions, interactions, human
and machine-generated
 Routine analysis not enough – deeper insights lead to
differentiation
 Agile and Adaptive Business Models
 Technology Drivers
 Fast, cheap and scalable storage (e.g. HDFS)
 Diverse data-processing engines (e.g. NoSQL)
 Infinitely elastic processing power (cluster of commodity
servers)
Application Domains
 Healthcare
 IoT
 E-Governance
 Insurance
What Features Should It Support
 Scalable Storage Layer
 3 V’s of Data Inflow (volume, velocity, variety)
 Data Discovery
 Data Governance
 Pluggable and Extensible Analytics
 Elastic Processing Power
 Multi-stakeholder and Multi-tenant Access
Building It On Top Of Hadoop
 Data Lake doesn’t have to be Hadoop
 But Hadoop has proven its prowess on planet-scale
data, in terms of:
 Data Volumes
 Elastic Data Processing Power
 Probably the idea of a Data Lake was inspired by
Hadoop
 Naturally most often a Data Lake Architecture is
built around Hadoop
Storage Capacity: Metrics
 Normally HDFS scales even with one NameNode
 Unless you have hundreds of petabytes of data
 But you need to monitor the usage pattern
 Are you creating too many small files (what’s the
average number of blocks per file)?
 How much RAM would you need for the NameNode? (a
high value could mean larger GC pauses)
 Internal Load (heartbeats and block reports) vs
External Get and Create Requests
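The small-files question can be made concrete with a rough estimate. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MB block size; both figures are assumptions to be checked against your own cluster, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150            # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024    # assumed HDFS block size in bytes

def avg_blocks_per_file(file_sizes, block_size=BLOCK_SIZE):
    """Average number of blocks per file (ceiling division per file)."""
    blocks = [-(-size // block_size) for size in file_sizes]
    return sum(blocks) / len(blocks)

def namenode_heap_estimate(num_files, blocks_per_file):
    """Rough heap in bytes: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

MIB = 1024 * 1024
GIB = 1024 * MIB

# The same 1 TiB of data, stored as 1 MiB files versus 1 GiB files.
small_files = [MIB] * (1024 * 1024)
large_files = [GIB] * 1024

heap_small = namenode_heap_estimate(len(small_files), avg_blocks_per_file(small_files))
heap_large = namenode_heap_estimate(len(large_files), avg_blocks_per_file(large_files))
```

Under these assumptions the small-file layout needs roughly 300 MiB of heap against a little over 1 MiB for the large-file layout, which is why the blocks-per-file average is worth monitoring.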
Storage Capacity: HDFS Federation
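 From a single NameNode to NameNode Federation
[Figure: Single NameNode. One NameNode handles Get / Create requests from the MR client as well as the internal load from DataNode1 … DataNodeN.]
[Figure: NameNode Federation. NameNode1 and NameNode2 manage Block Pool1 and Block Pool2, stored on the shared DataNode1 … DataNodeN.]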
Storage Capacity: Availability
 NameNode Federation does not ensure HA
 Even if you don’t go for Federation, configuring high
availability is recommended
 Essentially set up a Standby NameNode
 Active NameNode shares state with the Standby
 Using a shared Journal Manager (Quorum Journal Manager), or
 Simply using an NFS-mounted shared directory
 Synchronization frequency is configurable
Compute Capacity
 Hadoop 1.0 supported only one type of job (Map-Reduce)
 MR jobs were scheduled by a ‘JobTracker’ process
 Hadoop 2.0 offers a Resource Manager (YARN)
 It is intended to replace JobTracker and raise the
Hadoop cluster size limit from 3,000 to 10,000 nodes
 But more important: YARN supports different types of
Jobs including MR to run on Hadoop
 Hence Data Lake should preferably be built on YARN
Compute Capacity: YARN
 YARN ARCHITECTURE
[Figure: The MR client and the Spark client submit jobs to the Resource Manager. Each node runs a Node Manager: Node 1 hosts the MR App Master and a Spark task, Node 2 hosts the Spark App Master and an MR task.]
Data Inflow
 The goal is to build a pipeline into Hadoop-native
data stores
 HDFS, mandatorily
 Hive and HBase, preferably
 Considering the variety of data formats that a Data
Lake must accommodate:
 A general purpose Data Integration Tool must be chosen
 For example, Pentaho Data Integration (PDI)
Data Inflow
 Pipelines specialized for specific data formats may
also be plugged in
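[Figure: A flat file input connector (.txt) and a web service input connector (.json) feed an HDFS output connector that writes to HDFS. Sqoop (DB) and Flume (log) are shown as specialized pipelines into HDFS.]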
Data Inflow: Streaming Data
 Streaming Data may be processed in two ways
 Simply store in the Data Lake for future analysis
 Interesting tweets for building a sentiment analysis model
 Store and Forward to a Real-time Analytics Engine
 Even as real-time processing occurs, the source data in
raw format may be useful in future
 To build / update machine learning models, for example
in fraud analytics
[Figure: Streaming data reaches HDFS along two paths: Store, and Store & Forward.]
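The store-and-forward path can be sketched as follows. This is a minimal illustration, not a production pipeline: a local JSON-lines file stands in for the raw zone in HDFS, and flag_large is a hypothetical real-time rule.

```python
import json
import os
import tempfile

def store(record, raw_path):
    """Append the record, unmodified, to the raw zone
    (a local file standing in for HDFS)."""
    with open(raw_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def store_and_forward(stream, raw_path, realtime_handler):
    """Persist every record first, then hand it to the real-time engine."""
    for record in stream:
        store(record, raw_path)
        realtime_handler(record)

# Hypothetical real-time rule: flag unusually large transactions.
alerts = []

def flag_large(txn):
    if txn["amount"] > 10_000:
        alerts.append(txn["id"])

transactions = [{"id": "t1", "amount": 250}, {"id": "t2", "amount": 50_000}]
raw_path = os.path.join(tempfile.mkdtemp(), "transactions.jsonl")
store_and_forward(transactions, raw_path, flag_large)

# The raw records remain available later, e.g. for retraining a fraud model.
with open(raw_path, encoding="utf-8") as f:
    stored = [json.loads(line) for line in f]
```

Storing before forwarding means the raw record survives even if the real-time rule changes or fails later.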
Data Analytics
 A Data Lake built on HDFS will most likely use a
Hadoop cluster to analyze data
 Sometimes the result of the analysis may be stored
back into HDFS (or possibly Hive / HBase)
 But Data Visualization and Reporting / Dashboards
may work only on structured data cubes
 Hence on the Analytics side, a Data Lake may need
outflow paths from HDFS into structured data stores
Plugging In Data Analytics Engine
 Jaspersoft Reporting with HDFS
[Figure: Analyzed data in HDFS is read by Jaspersoft ETL through an HDFS input connector and loaded into an OLAP cube, which feeds the Jaspersoft reporting engine.]
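An outflow path of this kind can be sketched without any particular vendor tool. In the example below an in-memory SQLite table stands in for the structured store that a reporting engine would query; the record layout and the sales_summary table are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical analyzed output, as it might be read back from HDFS.
analyzed = [
    {"region": "south", "month": "2016-10", "revenue": 120.0},
    {"region": "south", "month": "2016-10", "revenue": 80.0},
    {"region": "north", "month": "2016-10", "revenue": 50.0},
]

def aggregate(records):
    """Roll records up to one total per (region, month) cell."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["region"], r["month"])] += r["revenue"]
    return totals

def load_summary(conn, totals):
    """Write the aggregated cells into a structured table for reporting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary ("
        "region TEXT, month TEXT, revenue REAL, "
        "PRIMARY KEY (region, month))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_summary VALUES (?, ?, ?)",
        [(region, month, rev) for (region, month), rev in totals.items()],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_summary(conn, aggregate(analyzed))
rows = conn.execute(
    "SELECT region, revenue FROM sales_summary ORDER BY region"
).fetchall()
```

A dashboard would then query sales_summary rather than the raw files in the lake.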
Data Governance
 Data Lake does not conform to a schema
 Data Governance makes it possible to make sense
of the data
 To both analysts and administrators
 Data Governance is a fairly open-ended subject
 Vendors offer different techniques to solve each
governance use case
 But common patterns are emerging across the landscape
Data Governance: Analyst Use Cases
 To search and retrieve ‘relevant’ data for analysis
 Common Techniques
 Metadata Management
 Data tagging
 Text Search
 Data Classification
 Metadata can include technical as well as business
information (linked to a Business Glossary)
 Data tags are often created by users collaboratively
Data Governance: Admin Use Cases
 Lineage: track data flow from source to end applications
 Data Life-cycle Management: retain, replicate and archive based on usage
 Auditing: track access and usage information for compliance
Automated Metadata Generation
 As data is ingested, suitable attributes are extracted
and stored into a metadata repository
 Data type (XML, PDF, text, etc)
 Data size
 Creation and Last Access time, etc
 Even data tags can be inserted at the time of ingest
 Unconditionally, e.g. ‘sales’
 Conditionally, e.g. ‘holiday_sales’
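Ingest-time metadata extraction and tagging might look like the sketch below. It assumes a local file system, a plain list as the metadata repository, and a hypothetical rule that tags files whose name contains "holiday". Modification time is recorded because creation time is not portably available from os.stat.

```python
import os
import tempfile

def extract_metadata(path, default_tags=(), tag_rules=()):
    """Collect basic attributes of an ingested file and attach tags.
    default_tags are applied unconditionally; each (tag, predicate)
    pair in tag_rules is applied only when the predicate holds."""
    info = os.stat(path)
    meta = {
        "path": path,
        "type": os.path.splitext(path)[1].lstrip(".").lower() or "unknown",
        "size_bytes": info.st_size,
        "modified": info.st_mtime,
        "last_access": info.st_atime,
        "tags": list(default_tags),
    }
    for tag, predicate in tag_rules:
        if predicate(meta):
            meta["tags"].append(tag)
    return meta

metadata_repository = []  # stand-in for a real metadata store

# Create a small sample file to ingest.
path = os.path.join(tempfile.mkdtemp(), "holiday_orders.CSV")
with open(path, "wb") as f:
    f.write(b"order_id,amount\n1,19.99\n")

# Hypothetical conditional rule based on the file name.
rules = [("holiday_sales", lambda m: "holiday" in os.path.basename(m["path"]))]
metadata_repository.append(
    extract_metadata(path, default_tags=["sales"], tag_rules=rules)
)
```

Analysts can later search the repository by type, size or tag instead of opening the raw files.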
Apache Atlas For Data Governance
Source: http://atlas.incubator.apache.org/Architecture.html
Data Access And Security
 HDFS is typically secured using
 Kerberos for authentication, and
 Unix-style file permissions for authorization
 In a large data repository with diverse stakeholders
you may need more control
 If so, a couple of products may be considered for
augmenting Data Security:
 Apache Knox
 Apache Ranger
Data Access And Security
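[Figure: Knox provides perimeter security in front of the cluster. On HDFS (Node 1 … Node N), Kerberos handles authentication, Unix-style (rwx) permissions handle authorization, and Ranger adds federated access control.]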
Why Use Ranger
 Supports Federated Access Control
 Can fall back on default HDFS file permissions
 Manages Access Control over several Hadoop-
based components, like Hive, Storm, etc.
 Advanced fine-grained access control, like
 Deny policies for user or group
 Tag-based access control, where a collection of
resources share a common access tag
 For example, a few columns in a Hive table and
certain files in HDFS could share a tag: ‘internal_audit’
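The idea of tag-based policies with deny rules can be modelled in plain Python. This is a toy model, not Ranger's API: the Policy class, the resource names and the rule that an explicit deny wins over an allow are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    tag: str
    group: str
    allow: bool  # False means an explicit deny

# Hypothetical tag assignments: a Hive column and an HDFS file share a tag.
RESOURCE_TAGS = {
    "hive://finance.ledger/salary": {"internal_audit"},
    "hdfs:///audit/2016/report.pdf": {"internal_audit"},
    "hdfs:///public/catalog.csv": set(),
}

POLICIES = [
    Policy("internal_audit", "auditors", allow=True),
    Policy("internal_audit", "contractors", allow=False),
]

def is_allowed(user_groups, resource, fallback=False):
    """Evaluate tag policies for a user; an explicit deny wins.
    If no tag policy matches, defer to the fallback decision
    (for example, the result of the HDFS permission check)."""
    tags = RESOURCE_TAGS.get(resource, set())
    matching = [p for p in POLICIES if p.tag in tags and p.group in user_groups]
    if any(not p.allow for p in matching):
        return False
    if any(p.allow for p in matching):
        return True
    return fallback

auditor_ok = is_allowed({"auditors"}, "hive://finance.ledger/salary")
contractor_ok = is_allowed({"auditors", "contractors"}, "hdfs:///audit/2016/report.pdf")
public_ok = is_allowed({"analysts"}, "hdfs:///public/catalog.csv", fallback=True)
```

One policy on the tag covers both the Hive column and the HDFS file, so nothing has to be configured per resource.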
Steps To Build A Data Lake
 Set up a scalable data storage layer
 Set up a Compute Cluster capable of running a
diverse mix of Jobs
 Create data flow pipeline(s) for batch jobs
 Create data flow pipeline(s) for streaming data
Steps To Build A Data Lake
 Plug in one or more Analytics Engine(s)
 Set up mechanisms for efficient data discovery
and data governance
 Implement Data Access Controls
 Design a Monitoring Infrastructure for Jobs and
Resources (not covered today)
Building A Data Lake: Starting Points
 Set up a scalable data storage layer: HDFS
 Set up a Compute Cluster capable of running a
diverse mix of Jobs: YARN
 Create data flow pipeline(s) for batch jobs:
Pentaho HDFS Connector
 Create data flow pipeline(s) for streaming data:
Pentaho Messaging Connector
Steps To Build A Data Lake
 Plug in one or more Analytics Engine(s): Pentaho
Reporting and Spark MLlib
 Set up mechanisms for efficient data discovery
and data governance: Apache Atlas
 Implement Data Access Controls: Apache Ranger
 Design a Monitoring Infrastructure for Jobs and
Resources: Apache Ambari
Taking The Plunge
 Do you need to plan for and build a Data Lake?
 Ask yourself: what fraction of your data are you
analyzing today?
 What value might the unused data offer?
 For marketing campaigns
 For product lifecycle management
 For regulatory compliance, and so on …
 Engage your stakeholders from different LoBs
 Is decision making being hampered by lack of data?
Taking The Plunge
 Start small: There is a learning curve
 Storing data is not enough – maintaining and
stewarding the data is all-important
 Design for extensibility and pluggability
 Minimize vendor lock-in
 Be open to change as you scale your infrastructure
monojit@techyugadi.com

More Related Content

What's hot

Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Data Lake: A simple introduction
Data Lake: A simple introductionData Lake: A simple introduction
Data Lake: A simple introductionIBM Analytics
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data PipelineManish Kumar
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InSnapLogic
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 

What's hot (20)

Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Data Lake: A simple introduction
Data Lake: A simple introductionData Lake: A simple introduction
Data Lake: A simple introduction
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Big Data
Big DataBig Data
Big Data
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 

Viewers also liked

Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lakeCapgemini
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centrejatin batra
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...RSD
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Taming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model FrameworkTaming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model FrameworkRamkumar Ravichandran
 
Industrial internet big data uk market study
Industrial internet big data uk market studyIndustrial internet big data uk market study
Industrial internet big data uk market studySari Ojala
 
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Eugene Yan Ziyou
 
The concept of Datalake with Hadoop
The concept of Datalake with HadoopThe concept of Datalake with Hadoop
The concept of Datalake with HadoopAvkash Chauhan
 
Social network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communitySocial network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communityEugene Yan Ziyou
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos MilovanovicInstitute of Contemporary Sciences
 
AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)Eugene Yan Ziyou
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
 

Viewers also liked (20)

Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lake
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
 
search engines
search enginessearch engines
search engines
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
 
R language
R languageR language
R language
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Taming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model FrameworkTaming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model Framework
 
Industrial internet big data uk market study
Industrial internet big data uk market studyIndustrial internet big data uk market study
Industrial internet big data uk market study
 
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...
 
The concept of Datalake with Hadoop
The concept of Datalake with HadoopThe concept of Datalake with Hadoop
The concept of Datalake with Hadoop
 
Social network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communitySocial network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG community
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
 
AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 

Similar to Datalake Architecture

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Cloud computing major project
Cloud computing major projectCloud computing major project
Cloud computing major projectayk115
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Hadoop Big data Solution Provider
Hadoop Big data Solution ProviderHadoop Big data Solution Provider
Hadoop Big data Solution ProviderAgileiss
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalDataWorks Summit
 
Data ingestion
Data ingestionData ingestion
Data ingestionnitheeshe2
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 

Similar to Datalake Architecture (20)

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Cloud computing major project
Cloud computing major projectCloud computing major project
Cloud computing major project
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Hadoop Big data Solution Provider
Hadoop Big data Solution ProviderHadoop Big data Solution Provider
Hadoop Big data Solution Provider
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Data ingestion
Data ingestionData ingestion
Data ingestion
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 

Datalake Architecture

  • 1. DATA LAKE ARCHITECTURE
      Monojit Basu, Founder & Director, TechYugadi IT Solutions & Consulting
      OSI DAYS 2016, BANGALORE
  • 2. Data Never Sleeps
      • Every minute
          • Facebook users share 216,302 photos
          • Dropbox users upload 833,333 new files
          • YouTube users share 400 hours of new video
          • Twitter users send 350,000 tweets
      • A Boeing 737 aircraft in flight generates 40 TB of data
  • 3. EDW vs Data Lake
      • Data Lake is built on the premise that every drop of data is valuable
      • It's a place for capturing and exploring huge volumes of raw data that a business generates
      • Explorers are diverse: business analysts, data scientists, …
          • even business managers (using self-service)
      • Goals of exploration may be loosely defined
  • 4. EDW vs Data Lake
      • EDW stores filtered and processed data
          • For premeditated usage scenarios
          • Traditionally structured in the form of ‘cubes’
      • Analogy
          • Difference between a college library (focused on curriculum) and the US Library of Congress
  • 5. EDW vs Data Lake
      • Schema-on-Read (Data Lake): sources (XML, JSON, CSV, PDF, trading partner REST API, invoicing, orders DB) are stored in the lake; consumers (CRM analytics, SCM analytics, reco engine) each read / extract what they need
      • Schema-on-Write (Enterprise Data Warehouse): the same sources pass through ETL into structured stores for Sales, Operations and Marketing
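The schema-on-read / schema-on-write contrast on slide 5 can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen only for illustration: the sample CSV, the function names `schema_on_write` and `schema_on_read`, and the idea of passing a per-reader `projection` of column casts. It is not how any particular warehouse or Hadoop tool implements either approach.

```python
import csv
import io

# Hypothetical raw feed; the second row has a malformed amount.
RAW_ORDERS_CSV = "order_id,amount,region\n1,19.99,south\n2,not-a-number,north\n"


def schema_on_write(raw_csv):
    """EDW style: enforce one schema at load time and reject rows that do not fit."""
    loaded, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            loaded.append({
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "region": row["region"],
            })
        except ValueError:
            rejected.append(row)
    return loaded, rejected


def schema_on_read(raw_csv, projection):
    """Data lake style: raw text is kept untouched; each reader applies its own casts."""
    for row in csv.DictReader(io.StringIO(raw_csv)):
        yield {name: cast(row[name]) for name, cast in projection.items()}
```

With schema-on-write, the malformed row never reaches the warehouse. With schema-on-read, a consumer that only needs `order_id` and `region` can still use both rows, because the raw data was kept and the `amount` column is never parsed.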
  • 6. Why Think of Data Lake
      • Business Drivers
          • Diverse sources of data: transactions, interactions, human and machine-generated
          • Routine analysis is not enough: deeper insights lead to differentiation
          • Agile and Adaptive Business Models
      • Technology Drivers
          • Fast, cheap and scalable storage (e.g. HDFS)
          • Diverse data-processing engines (e.g. NoSQL)
          • Infinitely elastic processing power (cluster of commodity servers)
  • 7. Application Domains
      • Healthcare
      • IoT
      • E-Governance
      • Insurance
  • 8. What Features Should It Support
      • Scalable Storage Layer
      • 3 V’s of Data Inflow
      • Data Discovery
      • Data Governance
      • Pluggable and Extensible Analytics
      • Elastic Processing Power
      • Multi-stakeholder and Multi-tenant Access
  • 9. Building It On Top Of Hadoop
      • Data Lake doesn’t have to be Hadoop
      • But Hadoop has proven its prowess on planet-scale data, in terms of:
          • Data Volumes
          • Elastic Data Processing Power
      • The idea of a Data Lake was probably inspired by Hadoop
      • Naturally, a Data Lake Architecture is most often built around Hadoop
  • 10. Storage Capacity: Metrics
      • Normally HDFS scales even with one NameNode
          • Unless you have hundreds of petabytes of data
      • But you need to monitor the usage pattern
          • Are you creating too many small files (what’s the average number of blocks per file)?
          • How much RAM would you need for the NameNode? (a high value could mean longer GC pauses)
          • Internal load (heartbeats and block reports) vs external Get and Create requests
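The two sizing questions on slide 10 (blocks per file and NameNode RAM) can be roughed out with a back-of-the-envelope calculation. The sketch below is an assumption-laden estimate, not an official formula: it assumes roughly 150 bytes of NameNode heap per namespace object (a commonly quoted rule of thumb), a 128 MiB block size, and it ignores directories and replication. The function name `namenode_estimate` is hypothetical.

```python
import math

# Assumed rule of thumb, not a measured value: ~150 bytes of heap per file or block.
BYTES_PER_NAMESPACE_OBJECT = 150


def namenode_estimate(num_files, avg_file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Estimate blocks per file and NameNode heap for a given file population."""
    # Every non-empty file occupies at least one block, however small it is.
    blocks_per_file = max(1, math.ceil(avg_file_size_bytes / block_size_bytes))
    num_blocks = num_files * blocks_per_file
    objects = num_files + num_blocks  # directories are ignored in this sketch
    heap_bytes = objects * BYTES_PER_NAMESPACE_OBJECT
    return {
        "blocks_per_file": blocks_per_file,
        "objects": objects,
        "heap_gib": heap_bytes / 2**30,
    }
```

Under these assumptions, the same volume of data costs far more NameNode memory when it arrives as many small files than as a few large ones, which is why the small-files pattern is worth monitoring.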
  • 11. Storage Capacity: HDFS Federation
      • Single NameNode: MR clients send Get / Create requests to one NameNode, which also handles the internal load from Data Node 1 … Data Node N
      • NameNode Federation: NameNode1 and NameNode2 each manage their own block pool (Block Pool1, Block Pool2) over the shared Data Node 1 … Data Node N
  • 12. Storage Capacity: Availability
      • NameNode Federation does not ensure HA
      • Even if you don’t go for Federation, configuring high availability is recommended
      • Essentially set up a Standby NameNode
      • Active NameNode shares state with the Standby
          • Using a shared Journal Manager, or
          • Simply using an NFS-mounted shared file directory
      • Synchronization frequency is configurable
  • 13. Compute Capacity
      • Hadoop 1.0 supported one type of Job (Map-Reduce)
          • MR jobs were scheduled by a ‘JobTracker’ process
      • Hadoop 2.0 offers a Resource Manager (YARN)
          • It is intended to replace JobTracker and raise the Hadoop cluster size limit from 3000 to 10000 nodes
          • But more important: YARN allows different types of Jobs, including MR, to run on Hadoop
      • Hence a Data Lake should preferably be built on YARN
  • 14. Compute Capacity: YARN
      • YARN architecture (diagram): MR and Spark clients submit jobs to the Resource Manager; Node 1 runs a Node Manager with an MR App Master and a Spark task; Node 2 runs a Node Manager with a Spark App Master and an MR task
  • 15. Data Inflow
      • The goal is to build a pipeline into Hadoop-native data stores
          • HDFS, mandatorily
          • Hive and HBase, preferably
      • Considering the variety of data formats that a Data Lake must accommodate:
          • A general-purpose Data Integration Tool must be chosen
          • For example, Pentaho Data Integration (PDI)
  • 16. Data Inflow
      • Pipelines specialized for specific data formats may also be plugged in
      • Diagram: flat file input connector (.txt) and web service input connector (.json) feed an HDFS output connector into HDFS; Sqoop (DB) and Flume (log) also load into HDFS
  • 17. Data Inflow: Streaming Data
      • Streaming Data may be processed in two ways
          • Simply store in the Data Lake for future analysis
              • For example, interesting tweets for building a sentiment analysis model
          • Store and Forward to a Real-time Analytics Engine
              • Even as real-time processing occurs, the source data in raw format may be useful in future
              • To build / update machine learning models, for example in fraud analytics
      • Diagram: STORE into HDFS; STORE & FORWARD
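The two streaming modes on slide 17 differ only in whether a real-time consumer is attached. A minimal sketch, assuming the raw store and the real-time engine are each represented by a plain callable (in practice these would be an HDFS writer and a stream-processing client; the name `ingest_stream` and both callables are hypothetical):

```python
def ingest_stream(events, store, forward=None):
    """Persist every raw event; optionally hand it on to a real-time consumer.

    store   -- callable that writes the raw event to the data lake
    forward -- optional callable for the real-time analytics engine;
               None means plain "store" mode
    Returns the number of events ingested.
    """
    count = 0
    for event in events:
        store(event)  # the raw event is always kept for future analysis
        if forward is not None:
            forward(event)  # store-and-forward mode
        count += 1
    return count
```

Storing before forwarding reflects the point made on the slide: the raw data stays available for building or updating models later, regardless of what the real-time engine does with it.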
  • 18. Data Analytics
      • A Data Lake built on HDFS will most likely use a Hadoop cluster to analyze data
      • Sometimes the result of the analysis may be stored back into HDFS (or possibly Hive / HBase)
      • But Data Visualization and Reporting / Dashboards may work only on structured data cubes
      • Hence on the Analytics side, a Data Lake may need outflow paths from HDFS into structured data stores
  • 19. Plugging In Data Analytics Engine
      • Jaspersoft Reporting with HDFS
      • Diagram: analyzed data in HDFS is read by Jaspersoft ETL (HDFS input connector) into an OLAP cube, which feeds the Jaspersoft reporting engine
  • 20. Data Governance
      • Data Lake does not conform to a schema
      • Data Governance makes it possible to make sense of the data
          • To both analysts and administrators
      • Data Governance is a fairly open-ended subject
          • Vendors offer different techniques to solve each governance use case
          • But common patterns are emerging across the landscape
  • 21. Data Governance: Analyst Use Cases
      • To search and retrieve ‘relevant’ data for analysis
      • Common Techniques
          • Metadata Management
          • Data tagging
          • Text Search
          • Data Classification
      • Metadata can include technical as well as business information (linked to a Business Glossary)
      • Data tags are often created by users collaboratively
  • 22. Data Governance: Admin Use Cases
      • Lineage: track data flow from source to end applications
      • Data Life-cycle Management: retain, replicate and archive based on usage
      • Auditing: track access and usage information for compliance
  • 23. Automated Metadata Generation
      • As data is ingested, suitable attributes are extracted and stored into a metadata repository
          • Data type (XML, PDF, text, etc.)
          • Data size
          • Creation and Last Access time, etc.
      • Even data tags can be inserted at the time of ingest
          • Unconditionally, e.g. ‘sales’
          • Conditionally, e.g. ‘holiday_sales’
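The ingest-time metadata extraction on slide 23 can be sketched for a local file as follows. This is an illustration under stated assumptions, not the behaviour of Apache Atlas or any ingestion tool: the function name `extract_metadata`, the dictionary layout and the predicate-based tagging rules are all hypothetical, the data type is guessed from the file extension only, and modification time stands in for creation time, which many file systems do not expose portably.

```python
from pathlib import Path


def extract_metadata(path, unconditional_tags=(), conditional_tags=None):
    """Collect basic technical metadata and tags for one ingested file.

    unconditional_tags -- tags applied to every file, e.g. ("sales",)
    conditional_tags   -- mapping of tag -> predicate(meta) deciding whether it applies
    """
    p = Path(path)
    stat = p.stat()
    meta = {
        "name": p.name,
        # Extension-based guess only; real pipelines would inspect the content.
        "data_type": p.suffix.lstrip(".").lower() or "unknown",
        "size_bytes": stat.st_size,
        "modified_time": stat.st_mtime,  # stand-in for creation time
        "last_access_time": stat.st_atime,
    }
    tags = list(unconditional_tags)
    for tag, predicate in (conditional_tags or {}).items():
        if predicate(meta):
            tags.append(tag)
    meta["tags"] = tags
    return meta
```

A rule such as `{"holiday_sales": lambda m: "_12" in m["name"]}` would tag only files whose names mention December, while `("sales",)` is attached to every file that passes through the pipeline.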
  • 24. Apache Atlas For Data Governance
      • Architecture diagram. Source: http://atlas.incubator.apache.org/Architecture.html
  • 25. Data Access And Security
      • HDFS is typically secured using
          • Kerberos for authentication, and
          • Unix-style file permissions for authorization
      • In a large data repository with diverse stakeholders you may need more control
      • If so, a couple of products may be considered for augmenting Data Security:
          • Apache Knox
          • Apache Ranger
  • 26. Data Access And Security
      • Diagram: Knox provides perimeter security around the HDFS cluster (Node 1 … Node N); Kerberos handles authentication; authorization uses file permissions (rwx); Ranger adds federated access control
  • 27. Why Use Ranger
      • Supports Federated Access Control
          • Can fall back upon default HDFS file permissions
      • Manages Access Control over several Hadoop-based components, like Hive, Storm, etc.
      • Advanced fine-grained access control, like
          • Deny policies for user or group
          • Tag-based access control, where a collection of resources share a common access tag
              • For example, a few columns in a Hive table and certain files in HDFS could share a tag: ‘internal_audit’
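The combination of tag-based policies, deny policies and fall-back to file permissions on slide 27 can be modelled in a few lines. This is a conceptual sketch only and does not use or reproduce Apache Ranger's API or its exact evaluation order: the `Policy` class, the `is_allowed` function, and the simplifying rule that any matching deny overrides any allow are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tag: str            # access tag shared by a collection of resources
    users: frozenset    # users the policy applies to
    access: str         # e.g. "read" or "write"
    deny: bool = False  # True makes this a deny policy


def is_allowed(user, resource_tags, access, policies, fallback=lambda: False):
    """Decide access from tag policies, falling back when no policy matches.

    fallback -- callable standing in for the default HDFS permission check
    """
    matching = [
        p for p in policies
        if p.tag in resource_tags and user in p.users and p.access == access
    ]
    if any(p.deny for p in matching):
        return False  # assumed rule: a matching deny always wins
    if matching:
        return True   # at least one allow policy and no deny
    return fallback()  # no tag policy applies: defer to file permissions
```

Because policies attach to the tag rather than to individual resources, a Hive column and an HDFS file that both carry `internal_audit` are governed by the same small set of rules.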
  • 28. Steps To Build A Data Lake
      • Set up a scalable data storage layer
      • Set up a Compute Cluster capable of running a diverse mix of Jobs
      • Create data flow pipeline(s) for batch jobs
      • Create data flow pipeline(s) for streaming data
  • 29. Steps To Build A Data Lake
      • Plug in one or more Analytics Engine(s)
      • Set up mechanisms for efficient data discovery and data governance
      • Implement Data Access Controls
      • Design a Monitoring Infrastructure for Jobs and Resources (not covered today)
  • 30. Building A Data Lake: Starting Points
      • Set up a scalable data storage layer: HDFS
      • Set up a Compute Cluster capable of running a diverse mix of Jobs: YARN
      • Create data flow pipeline(s) for batch jobs: Pentaho HDFS Connector
      • Create data flow pipeline(s) for streaming data: Pentaho Messaging Connector
  • 31. Steps To Build A Data Lake
      • Plug in one or more Analytics Engine(s): Pentaho Reporting and Spark MLlib
      • Set up mechanisms for efficient data discovery and data governance: Apache Atlas
      • Implement Data Access Controls: Apache Ranger
      • Design a Monitoring Infrastructure for Jobs and Resources: Apache Ambari
  • 32. Taking The Plunge
      • Do you need to plan for and build a Data Lake?
      • Ask yourself: what fraction of your data are you analyzing today?
      • What value might the unused data offer?
          • For marketing campaigns
          • For product lifecycle management
          • For regulatory compliance, and so on …
      • Engage your stakeholders from different LoBs
          • Is decision making being hampered by lack of data?
  • 33. Taking The Plunge
      • Start small: there is a learning curve
      • Storing data is not enough: maintaining and stewarding the data is all-important
      • Design for extensibility and pluggability
      • Minimize vendor lock-in
      • Be open to change as you scale your infrastructure