BUILDING MODERN DATA LAKES
Minio, Spark and Unified Data Architecture work in unison
By Ravi Shankar, October 2018
FIRST ORDER LOGIC
First-order logic, also known as first-order predicate calculus or predicate logic, is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science.
Married("Harry", "Sally", "12-Dec-1995").
IsMotherOf("Sally", "Peter").
IsFatherOf("Harry", "Peter").
The Relational Model says that this is how you think about and represent all the data in your database
There exists one or more X such that the marriage happened in 1995
THE DATA MODELS.
A subject-oriented, integrated, time-variant and non-volatile collection of data
Integrating data marts into a dimensional model for consumption
PROBLEM STATEMENT.
Earlier New Digitalization Initiatives !!!
1. Change everything
2. Keep as is. Add new relations
3. Move to:
CTO GETS HADOOP IN.
1. Scale-out architecture
2. Shared nothing
3. Compute + storage together
4. Google-like!!
ALL WENT SMOOTHLY UNTIL...
A zip file containing one million JPEG files was sent from a third-party vendor. We wrote a MapReduce program to process it.
The file is 8 GB, split into 128 MB blocks (about 63 blocks); with 3x replication the total footprint is about 26 GB.
We executed the application. What might have happened?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
SPLITTABILITY IMPORTANCE.
So, performance is not guaranteed in all scenarios with existing distributed technologies.
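The HadoopRDD link above hints at the answer: formats that are not splittable (a zip or gzip archive, for example) cannot be divided into multiple input splits, so the whole file is handed to a single task. A quick way to observe this in Spark (file paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-check").getOrCreate()
sc = spark.sparkContext

# gzip is not a splittable format: the whole archive lands in one partition,
# so a single task must decompress and process all of it
compressed = sc.textFile("hdfs:///data/events.json.gz")   # hypothetical path
print(compressed.getNumPartitions())                       # -> 1

# the same data stored uncompressed is divided into one partition per
# input split (roughly one per 128 MB HDFS block)
plain = sc.textFile("hdfs:///data/events.json")            # hypothetical path
print(plain.getNumPartitions())                            # -> many
```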
THE SUBSEQUENT MONTHS.
1. We copied data from Netezza to Hive
2. We created reports from Tableau with the Hive ODBC driver
3. We created a copy of Hive data in HBase
4. We have HDP, but Cloudera supports Impala
5. MapReduce is slow
6. All data is not in one place
7. Maybe some more tools are needed
8. We need a unified data architecture solution
9. Rebalancing took an entire week
10. Important file types are not splittable
11. Three copies take too much space
12. The cost of maintenance is high
13. We may need to go to the cloud
14. SLAs are not met
15. Too much operational work
WHAT MAKES AN ORGANIZATION FAMOUS?
The CTO wants AI, but AI is different from IA!!
AI: autonomous systems which REPLACE the human cognitive thought process
IA (intelligence augmentation): autonomous systems which SUPPORT the human cognitive thought process
[Diagram: Input + Algorithm → Output?  versus  Input + Output → Algorithm?]
Both need machine learning and deep learning; these are means to do AI or IA.
OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT MAY TAKE HOURS/DAYS/MONTHS
[Diagram: a neural network with inputs, hidden layer(s), and an output layer]
FILE SYSTEMS.
• The problem is the file system. Traditional block-based file systems use
lookup tables to store file locations. They break each file up into small
blocks, generally 4k in size, and store the byte offset of each block in a
large table.
• This is fine for small volumes, but when you attempt to scale to the
petabyte range, these lookup tables become extremely large. It’s like
a database. The more rows you insert, the slower your queries
run. Eventually your performance degrades to the point where your
file system becomes unusable.
• When this happens, users are forced to split their data sets up into
multiple LUNs to maintain an acceptable level of performance. This
adds complexity and makes these systems difficult to manage.
BLOCK BASED STORAGE SYSTEMS.
• To solve this problem, some organizations are deploying scale-out file
systems, like HDFS. This fixes the scalability problem, but keeping these
systems up and running is a labor-intensive process.
• Scale-out file systems are complex and require constant
maintenance. In addition, most of them rely on replication to protect
your data. The standard configuration is triple-replication, where you
store 3 copies of every file.
• This requires an extra 200% of raw disk capacity for
overhead! Everyone thinks that they’re saving money by using
commodity drives, but by the time you store three full copies of your
data set, the cost savings disappears. When we’re talking about
petabyte-scale applications, this is an expensive approach.
SOLUTION TO STORAGE.
• Object stores achieve their scalability by decoupling file management
from the low-level block management. Each disk is formatted with a
standard local file system, like ext4. Then a set of object storage
services is layered on top of it, combining everything into a single,
unified volume.
• Files are stored as “objects” in the object store rather than files on a
file system. By offloading the low-level block management onto the
local file systems, the object store only has to keep track of the high-
level details.
• This layer of separation keeps the file lookup tables at a manageable
size, allowing you to scale to hundreds of petabytes without
experiencing degraded performance.
SOLUTION TO STORAGE.
• To maximize usable space, object stores use a technique called
Erasure Coding to protect your data. You can think of it as the next
generation of RAID.
• In an erasure coded volume, files are divided into shards, with each
shard being placed on a different disk. Additional shards are added,
containing error correction information, which provide protection from
data corruption and disk failures. Only a subset of the shards is
required to retrieve each file, which means it can survive multiple disk
failures without the risk of data loss.
• Erasure coded volumes can survive more disk failures than RAID and
typically provide more than double the usable capacity of triple
replication, making them the ideal choice for petabyte-scale storage.
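A rough illustration of that capacity claim, with assumed shard counts of k = 10 data and m = 4 parity (illustrative values, not necessarily Minio's defaults):

```latex
\[
\text{usable fraction}_{\text{EC}} = \frac{k}{k+m} = \frac{10}{14} \approx 71\%,
\qquad
\text{usable fraction}_{3\times\ \text{replication}} = \frac{1}{3} \approx 33\%,
\qquad
\frac{71\%}{33\%} \approx 2.1\times
\]
```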
MINIO - ERASURE CODING.
• EC is based on a technique called Forward Error Correction (FEC),
developed more than 50 years ago (1940s, Richard Hamming). It was
originally used for controlling errors in data transmission over noisy or
unreliable telecommunication channels. Reed-Solomon codes are a kind of EC,
used widely in CDs/DVDs, Blu-ray, satellite communication, etc.
• A message of k symbols can be transformed into a longer code word of
n symbols such that the original message can be recovered from a subset of
the n symbols. If n = k + 1, there is a special case called a parity check.
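A toy sketch of that n = k + 1 parity-check special case (shard contents are made up; real Minio uses Reed-Solomon with multiple parity shards): the parity shard is the bitwise XOR of the data shards, so any single lost shard can be rebuilt from the survivors.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# k = 3 data shards of equal length, n = k + 1 with one parity shard
data_shards = [b"AAAA", b"BBBB", b"CCCC"]
parity = reduce(xor_bytes, data_shards)

# lose any one data shard; rebuild it by XOR-ing the parity with the survivors
lost_index = 1
survivors = [s for i, s in enumerate(data_shards) if i != lost_index]
recovered = reduce(xor_bytes, survivors, parity)
assert recovered == data_shards[lost_index]
```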
TOP 5 : COST.
• https://amzn.to/2Q7AWGo
• S3: 23 USD per TB per month (12.5 USD per TB for cold access).
• HDFS: Using d2.8xl instance types ($5.52/hr with a 71% discount, 48 TB
HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1 TB of
data; the arithmetic is worked out below. (Note that with reserved instances,
it is possible to achieve a lower price on the d2 family.)
• S3 is 5X cheaper than HDFS.
• S3's human cost is virtually zero, whereas it usually takes a team of
Hadoop engineers or vendor support to maintain HDFS. Once we
factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with
comparable capacity.
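Unpacking the HDFS figure: the discounted hourly rate, the hours in a month, the raw terabytes per instance, 3x replication, and a roughly 70% utilization ceiling combine as

```latex
\[
\frac{\overbrace{5.52 \times 0.29}^{\text{discounted \$/hr}} \times \overbrace{24 \times 30}^{\text{hours/month}}}{48\ \text{TB}}
\times \underbrace{3}_{\text{replication}} \div \underbrace{0.7}_{\text{utilization}}
\approx 24.0 \times 3 \div 0.7 \approx \$103\ \text{per TB per month.}
\]
```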
TOP 5 : ELASTICITY.
• From Databricks:
• S3 offers 99.999999999% durability and 99.99% availability. Note that this is
higher than the vast majority of organizations' in-house services.
• The majority of Hadoop clusters have availability lower than 99.9%, i.e. at
least 9 hours of downtime per year.
• With cross-AZ replication that automatically replicates across different
data centers, S3's availability and durability are far superior to HDFS's.
• Hortonworks: Data Plane Services in 2019!
TOP 5 : PERFORMANCE.
• When using HDFS and getting perfect data locality, it is possible to get
~3 GB/s per node of local read throughput on some of the instance types (e.g.
i2.8xl, roughly 90 MB/s per core). Spark DBIO, the Databricks cloud I/O
optimization module, provides optimized connectors to S3 and can sustain
~600 MB/s read throughput on i2.8xl (roughly 20 MB/s per core).
• That is to say, on a per-node basis, HDFS can yield 6X higher read
throughput than S3. Thus, given that S3 is 10x cheaper than HDFS,
we find that S3 is almost 2x better than HDFS on performance
per dollar.
TOP 5 :TRANSACTIONS.
• hadoop fs -mkdir -p sample/a/b/c/
• Now you put the file into a/b/c
• Buckets... not directories
• In a Minio server instance, a single RESTful PUT request will create an
object “a/b/c/data.txt” in “mybucket” without having to create
“a/b/c” in advance (see the sketch below)
• This happens because object stores support hierarchical naming and
operations without the need for directories.
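A minimal sketch of that single PUT against a local Minio server, using the standard S3 client (the endpoint, credentials, and bucket name are assumptions, and the bucket is presumed to already exist):

```python
import boto3

# Minio speaks the S3 API, so the standard S3 client works against it
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # assumed local Minio endpoint
    aws_access_key_id="minio-access-key",    # placeholder credentials
    aws_secret_access_key="minio-secret-key",
)

# One PUT creates the object; no "a/", "a/b/" or "a/b/c/" directories are needed,
# because "a/b/c/" is just part of the object's key in a flat namespace.
s3.put_object(Bucket="mybucket", Key="a/b/c/data.txt", Body=b"hello object store")
```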
TOP 5 :TRANSACTIONS.
• Data movement is very interesting...
• What happens if a Spark write (saveAsTextFile) fails for a partition?
• Rename is atomic in HDFS; it is the most critical part of the Hadoop write flow.
• Minio (or any object store) does not provide an atomic rename. In
fact, rename should be avoided in object storage altogether, since it
consists of two separate operations: copy and delete.
• A normal COPY is mapped to a RESTful PUT or COPY request and triggers
internal data movement between storage nodes. The subsequent delete maps
to a RESTful DELETE request, but usually relies on the bucket listing
operation to identify which data must be deleted. This makes a rename highly
inefficient in object stores, and the lack of atomicity may leave data in a
corrupted state.
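A sketch of what a “rename” actually costs on an object store (bucket and keys are hypothetical): there is no rename call, only a server-side copy followed by a delete, and a failure between the two steps leaves either both keys or a half-moved result behind.

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",     # assumed local Minio endpoint
    aws_access_key_id="minio-access-key",     # placeholder credentials
    aws_secret_access_key="minio-secret-key",
)

bucket = "mybucket"
src, dst = "staging/part-00000", "final/part-00000"   # hypothetical keys

# "rename" = copy + delete; neither step is atomic with the other
s3.copy_object(Bucket=bucket, Key=dst, CopySource={"Bucket": bucket, "Key": src})
s3.delete_object(Bucket=bucket, Key=src)
```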
TOP 5 :TRANSACTIONS: PERFORMANCE.
• The Hadoop FileOutputCommitter has two algorithm versions: version 1, which
moves staged task output files to their final locations at the end of the job,
and version 2, which moves files as individual tasks complete.
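The version is selected through a Hadoop configuration key; a minimal sketch of choosing version 2 from Spark (the spark.hadoop.* prefix forwards the setting to the underlying Hadoop configuration):

```python
from pyspark.sql import SparkSession

# version 2 commits task output as each task finishes, avoiding the single
# large job-level move that version 1 performs at the end of the job
spark = (
    SparkSession.builder
    .appName("committer-v2")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```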
TOP 5 : DATA INTEGRITY - ELEGANT SOLUTION FROM SPARK.
• Version 2.1 : https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
SO WHAT WILL IT LOOK LIKE?
COMPARISON.
FEATURE          | MINIO           | HDFS                  | MINIO VS HDFS
Cost/TB/month    | $               | $$                    | 10x
Availability     | 99.99%          | 99.9%                 | 10x
Durability       | 99.999999999%   | 99.9999% (estimated)  | 10x
Writes           | DBIO            | Yes                   | Comparable
Elasticity       | Yes             | No                    | Minio is elastic
MINIO.
• High-performance distributed object storage server
• Simple, efficient, lightweight, and no learning curve
DEMO TIME
1) MINIO INTEROPERABILITY WITH HADOOP – PUTTING AND GETTING DATA
2) MINIO INTEROPERABILITY WITH HIVE
3) MINIO WITH UNIFIED DATA ARCHITECTURE – PRESTO
4) MINIO WITH SPARK - FILES
5) MINIO WITH SPARK – OBJECTS
6) MINIO WITH SEARCH
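As a reference for the Spark demos above, a minimal sketch of pointing Spark's S3A connector at a Minio server (the endpoint, credentials, and bucket names are assumptions, and the hadoop-aws package must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-demo")
    # S3A connector aimed at a local Minio server (endpoint/credentials assumed)
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# read and write Minio buckets exactly as if they were file system paths
df = spark.read.json("s3a://mybucket/raw/events/")            # hypothetical bucket/prefix
df.write.mode("overwrite").parquet("s3a://mybucket/curated/events/")
```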
SUMMARY: WHAT YOU GAIN FROM THIS ARCHITECTURE
THANK YOU!
Refer to:
https://blog.minio.io/modern-data-lake-with-minio-part-1-716a49499533
https://blog.minio.io/modern-data-lake-with-minio-part-2-f24fb5f82424
https://www.minio.io/
Apache Spark
Presto
QUESTIONS?