Datalake Architecture

DATA LAKE ARCHITECTURE
Monojit Basu, Founder & Director
TechYugadi IT Solutions & Consulting
OSI DAYS 2016, BANGALORE

Data Never Sleeps
 Every minute
 Facebook users share 216,302 photos
 Dropbox users upload 833,333 new files
 YouTube users share 400 hours of new video
 Twitter users send 350,000 tweets
 A Boeing 737 Aircraft in flight generates 40 TB of data

EDW vs Data Lake
 Data Lake is built on the premise that every drop of
data is valuable
 It's a place for capturing and exploring huge
volumes of raw data that a business generates
 Explorers are diverse: business analysts, data
scientists, …
 even business managers (using self-service)
 Goals of exploration may be loosely defined

EDW vs Data Lake
 EDW stores filtered and processed data
 For predetermined usage scenarios
 Traditionally structured in the form of ‘cubes’
 Analogy
 Difference between a college library (focused on
curriculum) and the US Library of Congress

EDW vs Data Lake
 Data Lake: Schema-on-Read
 EDW: Schema-on-Write
[Diagram, Data Lake: sources (XML, JSON, CSV, PDF, trading partner REST API, invoicing, orders DB) land in the lake as-is; CRM analytics, SCM analytics and the recommendation engine each read / extract what they need]
[Diagram, Enterprise Data Warehouse: the same sources pass through ETL into the warehouse, which serves sales, operations and marketing]
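
The contrast between the two approaches can be illustrated with a minimal Python sketch. The records, field names and helper functions below are hypothetical and only stand in for real sources and stores: schema-on-write shapes (and may reject) data at load time, while schema-on-read keeps every raw record and applies a projection only when a consumer reads it.

```python
import json

# Hypothetical raw records, as they might arrive from different sources.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.99", "channel": "web"}',
    '{"order_id": 2, "amount": "5.00", "coupon": "FESTIVE"}',
    '{"order_id": "n/a", "note": "malformed upstream record"}',
]

def schema_on_write(lines):
    """EDW style: validate and shape each record before it is stored."""
    table = []
    for line in lines:
        rec = json.loads(line)
        try:
            table.append({"order_id": int(rec["order_id"]),
                          "amount": float(rec["amount"])})
        except (KeyError, ValueError):
            continue  # records that do not fit the schema are lost at load time
    return table

def schema_on_read(lines, fields):
    """Data Lake style: raw lines are kept; fields are projected at read time."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

warehouse = schema_on_write(RAW_LINES)   # only the columns chosen up front
lake = list(RAW_LINES)                   # every record, stored untouched

# A question nobody planned for (coupon usage) can still be answered from the lake.
coupons = [r for r in schema_on_read(lake, ["order_id", "coupon"]) if r["coupon"]]
```

Here the warehouse table has dropped both the malformed record and the coupon field, while the lake can still answer the coupon question later.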

Why Think of Data Lake
 Business Drivers
 Diverse sources of data: transactions, interactions, human
and machine-generated
 Routine analysis not enough – deeper insights lead to
differentiation
 Agile and Adaptive Business Models
 Technology Drivers
 Fast, cheap and scalable storage (e.g. HDFS)
 Diverse data-processing engines (e.g. NoSQL)
 Infinitely elastic processing power (cluster of commodity
servers)

Application Domains
 Healthcare
 IoT
 E-Governance
 Insurance

What Features Should It Support
 Scalable Storage Layer
 3 V’s of Data Inflow (volume, velocity, variety)
 Data Discovery
 Data Governance
 Pluggable and Extensible Analytics
 Elastic Processing Power
 Multi-stakeholder and Multi-tenant Access

Building It On Top Of Hadoop
 Data Lake doesn’t have to be Hadoop
 But Hadoop has proven its prowess on planet-scale
data, in terms of:
 Data Volumes
 Elastic Data Processing Power
 Probably the idea of a Data Lake was inspired by
Hadoop
 Naturally, a Data Lake Architecture is most often
built around Hadoop

Storage Capacity: Metrics
 Normally HDFS scales even with one NameNode
 Unless you have hundreds of petabytes of data
 But you need to monitor the usage pattern
 Are you creating too many small files (what’s the
average number of blocks per file)?
 How much RAM would you need for the NameNode? (a
high value could mean longer GC pauses)
 Internal Load (heartbeats and block reports) vs
External Get and Create Requests
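
A rough way to reason about these metrics is sketched below in Python. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory or block); that figure and the cluster numbers are assumptions for illustration, not measurements, and real sizing should be checked against the actual cluster.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object
# (each file, directory and block counts as one object).
BYTES_PER_OBJECT = 150

def avg_blocks_per_file(total_blocks, total_files):
    """A value close to 1 suggests many files smaller than one block."""
    return total_blocks / total_files

def namenode_heap_gib(total_files, total_dirs, total_blocks,
                      bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the heap needed to hold the namespace in memory, in GiB."""
    objects = total_files + total_dirs + total_blocks
    return objects * bytes_per_object / 2**30

# Hypothetical cluster: 100 million files, 5 million directories, 120 million blocks.
files, dirs, blocks = 100_000_000, 5_000_000, 120_000_000
ratio = avg_blocks_per_file(blocks, files)
heap = namenode_heap_gib(files, dirs, blocks)
print(f"avg blocks per file: {ratio:.2f}, estimated heap: {heap:.1f} GiB")
```

With these hypothetical numbers the estimate is roughly 31 GiB, and the low blocks-per-file ratio points to a small-files problem rather than a raw-volume problem.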

Storage Capacity: HDFS Federation
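[Diagram, single NameNode: one NameNode serves Get / Create requests from MR clients and the internal load (heartbeats, block reports) from DataNode1 … DataNodeN]
[Diagram, NameNode Federation: NameNode1 and NameNode2 each manage their own block pool (Block Pool1, Block Pool2), stored across the shared DataNode1 … DataNodeN]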

Storage Capacity: Availability
 NameNode Federation does not ensure HA
 Even if you don’t go for Federation, configuring high
availability is recommended
 Essentially set up a Standby NameNode
 Active NameNode shares state with the Standby
 Using the Quorum Journal Manager (a shared set of JournalNodes), or
 Simply using an NFS-mounted shared directory
 Synchronization frequency is configurable
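
As an illustration, the core `hdfs-site.xml` properties for a Standby NameNode setup with the Quorum Journal Manager could be generated as in the Python sketch below. The property names are the standard HDFS HA ones; the nameservice ID (`mycluster`), the NameNode IDs and all host names are hypothetical, and a working setup needs further settings (fencing, failover) that are left out here.

```python
import xml.etree.ElementTree as ET

def ha_properties(nameservice, namenodes, journal_nodes):
    """Build the basic HA properties.

    namenodes:     {"nn1": "host:port", "nn2": "host:port"}  (hypothetical hosts)
    journal_nodes: ["host:port", ...] for the JournalNode quorum
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        # Active and Standby share edit logs through the JournalNode quorum.
        "dfs.namenode.shared.edits.dir":
            "qjournal://" + ";".join(journal_nodes) + "/" + nameservice,
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props

def to_hdfs_site_xml(props):
    """Render the properties in hdfs-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

props = ha_properties(
    "mycluster",
    {"nn1": "nn1.example.com:8020", "nn2": "nn2.example.com:8020"},
    ["jn1.example.com:8485", "jn2.example.com:8485", "jn3.example.com:8485"],
)
xml_text = to_hdfs_site_xml(props)
```

For the NFS option, `dfs.namenode.shared.edits.dir` would instead point to a `file://` URI on the shared mount.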

Compute Capacity
 Hadoop 1.0 supported only one type of job (MapReduce)
 MR jobs were scheduled by a ‘JobTracker’ process
 Hadoop 2.0 offers a Resource Manager (YARN)
 It is intended to replace JobTracker and raise the
Hadoop cluster size limit from about 3,000 to 10,000 nodes
 But more important: YARN supports different types of
Jobs including MR to run on Hadoop
 Hence Data Lake should preferably be built on YARN

Compute Capacity: YARN
 YARN Architecture
[Diagram: the MR client and the Spark client submit jobs to the Resource Manager. The Node Manager on Node 1 hosts the MR App Master and a Spark task; the Node Manager on Node 2 hosts the Spark App Master and an MR task]

Data Inflow
 The goal is to build a pipeline into Hadoop-native
data stores
 HDFS, mandatorily
 Hive and HBase, preferably
 Considering the variety of data formats that a Data
Lake must accommodate:
 A general purpose Data Integration Tool must be chosen
 For example, Pentaho Data Integration (PDI)

Data Inflow
 Pipelines specialized for specific data formats may
also be plugged in
[Diagram: a flat file input connector (.txt) and a web service input connector (.json) feed an HDFS output connector that writes to HDFS; Sqoop (DB) and Flume (log) also load into HDFS]

Data Inflow: Streaming Data
 Streaming Data may be processed in two ways
 Simply store in the Data Lake for future analysis
 Interesting tweets for building a sentiment analysis model
 Store and Forward to a Real-time Analytics Engine
 Even as real-time processing occurs, the source data in
raw format may be useful in future
 To build / update machine learning models, for example
in fraud analytics
[Diagram: two paths into HDFS: Store, and Store & Forward]
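
The two paths can be sketched in a few lines of Python. This is a toy model under stated assumptions: a local JSON-lines file stands in for HDFS, a plain function stands in for the real-time analytics engine, and the class name, event fields and fraud threshold are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

class StoreAndForward:
    """Persist every raw event; optionally hand it to a real-time consumer."""

    def __init__(self, raw_dir, forward=None):
        self.raw_file = Path(raw_dir) / "events.jsonl"  # stand-in for an HDFS path
        self.forward = forward                          # None means "store only"

    def ingest(self, event):
        # Store first, so the raw record survives even if the real-time path fails.
        with self.raw_file.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        if self.forward is not None:
            self.forward(event)

flagged = []

def fraud_check(event):
    """Hypothetical real-time rule: flag unusually large transactions."""
    if event["amount"] > 10_000:
        flagged.append(event["txn_id"])

pipeline = StoreAndForward(tempfile.mkdtemp(), forward=fraud_check)
for e in [{"txn_id": "t1", "amount": 250}, {"txn_id": "t2", "amount": 50_000}]:
    pipeline.ingest(e)

# The raw events remain available for building or updating models later.
stored = pipeline.raw_file.read_text(encoding="utf-8").splitlines()
```

Passing `forward=None` gives the plain "store" path; the raw file is written in both cases.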

Data Analytics
 A Data Lake built on HDFS will most likely use a
Hadoop cluster to analyze data
 Sometimes the result of the analysis may be stored
back into HDFS (or possibly Hive / HBase)
 But Data Visualization and Reporting / Dashboards
may work only on structured data cubes
 Hence on the Analytics side, a Data Lake may need
outflow paths from HDFS into structured data stores

Plugging In Data Analytics Engine
 Jaspersoft Reporting with HDFS
[Diagram: analyzed data in HDFS → Jaspersoft ETL (HDFS input connector) → OLAP cube → Jaspersoft reporting engine]

Data Governance
 Data Lake does not conform to a schema
 Data Governance makes it possible to make sense
of the data
 To both analysts and administrators
 Data Governance is a fairly open-ended subject
 Vendors offer different techniques to solve each
governance use case
 But common patterns are emerging across the landscape

Data Governance: Analyst Use Cases
 To search and retrieve ‘relevant’ data for analysis
 Common Techniques
 Metadata Management
 Data tagging
 Text Search
 Data Classification
 Metadata can include technical as well as business
information (linked to a Business Glossary)
 Data tags are often created by users collaboratively

Data Governance: Admin Use Cases
 Lineage: track data flow from source to end applications
 Data Life-cycle Management: retain, replicate and archive based on usage
 Auditing: track access and usage information for compliance

Automated Metadata Generation
 As data is ingested, suitable attributes are extracted
and stored into a metadata repository
 Data type (XML, PDF, text, etc)
 Data size
 Creation and Last Access time, etc
 Even data tags can be inserted at the time of ingest
 Unconditionally, e.g. ‘sales’
 Conditionally, e.g. ‘holiday_sales’
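
A minimal Python sketch of ingest-time metadata extraction follows. It is an illustration only: a dictionary stands in for the metadata repository, the suffix-to-type table and the tagging rule are hypothetical, and the file modification time stands in for creation time, which is not portable across file systems.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

TYPE_BY_SUFFIX = {".xml": "XML", ".pdf": "PDF", ".txt": "text",
                  ".json": "JSON", ".csv": "CSV"}

def conditional_tags(path):
    """Hypothetical rule: tag holiday data based on the file name."""
    return ["holiday_sales"] if "holiday" in path.name.lower() else []

def extract_metadata(path, base_tags=("sales",)):
    """Collect technical attributes and tags for one ingested file."""
    path = Path(path)
    st = path.stat()

    def iso(ts):
        return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

    return {
        "name": path.name,
        "data_type": TYPE_BY_SUFFIX.get(path.suffix.lower(), "unknown"),
        "size_bytes": st.st_size,
        "modified": iso(st.st_mtime),      # stand-in for creation time
        "last_access": iso(st.st_atime),
        # Unconditional tags first, then any tags added by rules.
        "tags": list(base_tags) + conditional_tags(path),
    }

# Simulate an ingest of one file and record its metadata.
sample = Path(tempfile.mkdtemp()) / "holiday_orders.csv"
sample.write_bytes(b"id,amount\n1,10\n")

repository = {}                            # stand-in for a metadata repository
meta = extract_metadata(sample)
repository[meta["name"]] = meta
```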

Apache Atlas For Data Governance
Source: http://atlas.incubator.apache.org/Architecture.html

Data Access And Security
 Natively, HDFS can be secured using
 Kerberos for authentication (not enabled by default), and
 Unix-style file permissions for authorization
 In a large data repository with diverse stakeholders
you may need more control
 If so, a couple of products may be considered for
augmenting Data Security:
 Apache Knox
 Apache Ranger

Data Access And Security
[Diagram: Knox provides perimeter security in front of the cluster (Node 1 … Node N); within HDFS, Kerberos handles authentication and file permissions (rwx) handle authorization; Ranger adds federated access control]

Why Use Ranger
 Supports Federated Access Control
 Can fall back on default HDFS file permissions
 Manages Access Control over several Hadoop-based
components, like Hive, Storm, etc.
 Advanced fine-grained access control, like
 Deny policies for user or group
 Tag-based access control, where a collection of
resources share a common access tag
 For example, a few columns in a Hive table and
certain files in HDFS could share a tag: ‘internal_audit’
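
The idea of tag-based access control with deny policies can be modelled in a few lines of Python. This is a toy model, not Ranger's API or policy format: the resource names, groups and the `Policy` class are hypothetical, and it assumes that a matching deny policy takes precedence over an allow policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    tag: str            # access tag the policy applies to
    groups: frozenset   # groups the policy applies to
    allow: bool         # False marks a deny policy

# Resources from different components share one tag.
RESOURCE_TAGS = {
    "hive:finance.ledger.salary": {"internal_audit"},
    "hdfs:/data/audit/2016.csv": {"internal_audit"},
    "hdfs:/data/public/catalog.json": set(),
}

POLICIES = [
    Policy("internal_audit", frozenset({"auditors"}), allow=True),
    Policy("internal_audit", frozenset({"contractors"}), allow=False),
]

def is_allowed(user_groups, resource, fallback=False):
    """Evaluate tag policies; `fallback` stands in for plain file permissions."""
    tags = RESOURCE_TAGS.get(resource, set())
    matching = [p for p in POLICIES
                if p.tag in tags and p.groups & set(user_groups)]
    if any(not p.allow for p in matching):
        return False        # a matching deny policy wins
    if any(p.allow for p in matching):
        return True
    return fallback         # no tag policy applies

auditor_hive = is_allowed({"auditors"}, "hive:finance.ledger.salary")
auditor_hdfs = is_allowed({"auditors"}, "hdfs:/data/audit/2016.csv")
contractor = is_allowed({"auditors", "contractors"}, "hdfs:/data/audit/2016.csv")
public = is_allowed({"analysts"}, "hdfs:/data/public/catalog.json", fallback=True)
```

One policy on the `internal_audit` tag covers both the Hive column and the HDFS file, which is the point of sharing a tag across components.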

Steps To Build A Data Lake
 Set up a scalable data storage layer
 Set up a Compute Cluster capable of running a
diverse mix of Jobs
 Create data flow pipeline(s) for batch jobs
 Create data flow pipeline(s) for streaming data

 Plug in one or more Analytics Engine(s)
 Set up mechanisms for efficient data discovery
and data governance
 Implement Data Access Controls
 Design a Monitoring Infrastructure for Jobs and
Resources (not covered today)

Building A Data Lake: Starting Points
 Set up a scalable data storage layer: HDFS
 Set up a Compute Cluster capable of running a
diverse mix of Jobs: YARN
 Create data flow pipeline(s) for batch jobs:
Pentaho HDFS Connector
 Create data flow pipeline(s) for streaming data:
Pentaho Messaging Connector

 Plug in one or more Analytics Engine(s): Pentaho
Reporting and Spark MLlib
 Set up mechanisms for efficient data discovery
and data governance: Apache Atlas
 Implement Data Access Controls: Apache Ranger
 Design a Monitoring Infrastructure for Jobs and
Resources: Apache Ambari

Taking The Plunge
 Do you need to plan for and build a Data Lake?
 Ask yourself: what fraction of your data are you
analyzing today?
 What value might the unused data offer?
 For marketing campaigns
 For product lifecycle management
 For regulatory compliance, and so on …
 Engage your stakeholders from different LoBs
 Is decision making being hampered by lack of data?

Taking The Plunge
 Start small: There is a learning curve
 Storing data is not enough – maintaining and
stewarding the data is all-important
 Design for extensibility and pluggability
 Minimize vendor lock-in
 Be open to change as you scale your infrastructure
