SlideShare a Scribd company logo
1 of 23
Download to read offline
BUILDING A CEPH-POWERED 
DATA LAKE (OR) DATA GRID 
Paul Evans 
principal architect 
daystrom technology group 
paul@daystrom.com 
ceph days 
san jose 2014
Why build a data grid (or data lake) ? 
…because we have a data FLOOD in process
indeed, we love data… 
( we never seem to throw 
any of it out ) 
we’re good at generating 
more and more, but… 
too 
FAST 
too many 
VARIANTS 
too 
MUCH
IS THE ANSWER TO ALL OF THIS…. 
“ WE NEED LESS DATA! ” 
are you crazy? 
we live to store things! 
we just need better tools… 
(and more storage)
DATA 
AUTOMATION 
Workflow Automation 
Data Lake 
Data Grid STACK 
Wildly-Scalable Storage
DATA LAKE 
“a storage repository that holds a vast amount of raw data in its native 
format until it is needed”
DATA LAKE - ORIGINS 
First use credited to James Dixon, CTO at Pentaho, circa 2010 
“If you think of a datamart 
as a store of bottled water 
– cleansed and packaged 
and structured for easy 
consumption – the data 
lake is a large body of water 
in a more natural state…” 
“The contents of the data 
lake stream in from a 
source to fill the lake, and 
various users of the lake 
can come to examine, 
dive in, or take samples.”
DATA LAKE - EXPLAINED 
While a hierarchical data warehouse stores 
data in files or folders, a data lake uses a flat 
architecture to store data. Each data element 
in a lake is assigned a unique identifier and 
tagged with a set of extended metadata tags. 
When a business question arises, the data 
lake can be queried for relevant data, and that 
smaller set of data can then be analyzed to 
help answer the question.
DATA LAKE - WHY ??? 
?
DATA LAKE CHARACTER 
Unwashed Data: schema-on-read from RAW source 
Flexible Processing: batch, interactive, online, search 
MetaData Dependent: tag it or lose it 
Common Access: hdfs-centric toolset 
…in other words: this is not a glass-house Data Mart
A REFERENCE ‘LAKE’ ARCHITECTURE 
GOVERNENCE DATA ACCESS SECURITY OPERATIONS 
INTEGRATION 
DATA MANAGEMENT
A CEPHALOPOD IN THE LAKE? 
If this is import… Use this… 
Hadoop-native 
HDFS 
Locality-aware 
HDFS 
Distributed Name Svc 
Ceph 
Native Erasure Coding 
Ceph 
20% Faster * 
Ceph 
* on Terasort benchmark over IB, Mar 2014
(LAKE) DREDGERS 
technology group
DATA GRID 
“the unifying layer to how content and data are stored, protected, located 
and accessed”
DATA GRID - ORIGINS 
The need for data grids was first recognized by the scientific 
community concerning climate modeling, where exchanging PB-size 
data sets became commonplace. Recently, large-scale 
instruments such as the Large Hadron Collider (LHC) at CERN 
are driving grid innovation.
DATA GRID - EXPLAINED 
Data Grids present consistent access 
controls, governance, and metadata 
extensions to diverse storage media 
using a common, global interface for 
access and transport. 
Additionally, they offer a ‘micro-service’ architecture 
for the creation of standard tasks & policies, which 
are enforced by a distributed “grid control-plane.”
DATA GRID - WHY ???
DATA GRID - ATTRIBUTES 
Data Virtualization: common presentation of all content 
Universe-size Namespace: for files, objects & metadata 
Automation of Data Operations: distributed, scalable 
Policy Mgmt/Reporting: data valuation & action triggers
CEPH MEETS GRID 
implemented: 
Direct 
CephFS & RBD Ceph libRADOS Remote 
Cloud 
Cold Storage 
Archive 
DATA GRID unified namespace 
HiSpeed 
Tier 
Link 
LIBRADOS Ceph + 
LIBRADOS Ceph + 
RBD
GRID IRON ALL-STARS 
(Dan Bedard: danb@renci.org) 
technology group
TIME 2 SUMMARIZE… 
We are in the midst of a Data Explosion 
We need robust, expandable, yet simple solutions to store data 
We also need effective, de-centralized ways to care for the data
the SMART approach 
DATA 
AUTOMATION 
STACK 
Workflow Automation 
Ceph 
Wildly-Scalable Storage 
+ 
Data Lake 
Data Grid
thank you! 
san jose ceph days 
Paul Evans 
principal architect 
paul@daystrom.com 
technology group

More Related Content

What's hot

HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationSameer Tiwari
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Beyond Hadoop and MapReduce
Beyond Hadoop and MapReduceBeyond Hadoop and MapReduce
Beyond Hadoop and MapReduceAlexander Alten
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Pillars of Heterogeneous HDFS Storage
Pillars of Heterogeneous HDFS StoragePillars of Heterogeneous HDFS Storage
Pillars of Heterogeneous HDFS StoragePete Kisich
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature DataWorks Summit
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Apache ignite as in-memory computing platform
Apache ignite as in-memory computing platformApache ignite as in-memory computing platform
Apache ignite as in-memory computing platformSurinder Mehra
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesAlluxio, Inc.
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetAppAshwin Pawar
 

What's hot (20)

HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animation
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Beyond Hadoop and MapReduce
Beyond Hadoop and MapReduceBeyond Hadoop and MapReduce
Beyond Hadoop and MapReduce
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Pillars of Heterogeneous HDFS Storage
Pillars of Heterogeneous HDFS StoragePillars of Heterogeneous HDFS Storage
Pillars of Heterogeneous HDFS Storage
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Apache ignite as in-memory computing platform
Apache ignite as in-memory computing platformApache ignite as in-memory computing platform
Apache ignite as in-memory computing platform
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 

Similar to Ceph Days 2014 Paul Evans Slide Deck

Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos MilovanovicInstitute of Contemporary Sciences
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataMarco Garcia
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lakejeetendra mandal
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 

Similar to Ceph Days 2014 Paul Evans Slide Deck (20)

Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Data lakes
Data lakesData lakes
Data lakes
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lake
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 

Recently uploaded

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Ceph Days 2014 Paul Evans Slide Deck

  • 1. BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID Paul Evans principal architect daystrom technology group paul@daystrom.com ceph days san jose 2014
  • 2. Why build a data grid (or data lake) ? …because we have a data FLOOD in process
  • 3. indeed, we love data… ( we never seem to throw any of it out ) we’re good at generating more and more, but… too FAST too many VARIANTS too MUCH
  • 4. IS THE ANSWER TO ALL OF THIS…. “ WE NEED LESS DATA! ” are you crazy? we live to store things! we just need better tools… (and more storage)
  • 5. DATA AUTOMATION Workflow Automation Data Lake Data Grid STACK Wildly-Scalable Storage
  • 6. DATA LAKE “a storage repository that holds a vast amount of raw data in its native format until it is needed”
  • 7. DATA LAKE - ORIGINS First use credited to James Dixon, CTO at Pentaho, circa 2010 “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…” “The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
  • 8. DATA LAKE - EXPLAINED While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
  • 9. DATA LAKE - WHY ??? ?
  • 10. DATA LAKE CHARACTER Unwashed Data: schema-on-read from RAW source Flexible Processing: batch, interactive, online, search MetaData Dependent: tag it or lose it Common Access: hdfs-centric toolset …in other words: this is not a glass-house Data Mart
  • 11. A REFERENCE ‘LAKE’ ARCHITECTURE GOVERNENCE DATA ACCESS SECURITY OPERATIONS INTEGRATION DATA MANAGEMENT
  • 12. A CEPHALOPOD IN THE LAKE? If this is import… Use this… Hadoop-native HDFS Locality-aware HDFS Distributed Name Svc Ceph Native Erasure Coding Ceph 20% Faster * Ceph * on Terasort benchmark over IB, Mar 2014
  • 14. DATA GRID “the unifying layer to how content and data are stored, protected, located and accessed”
  • 15. DATA GRID - ORIGINS The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.
  • 16. DATA GRID - EXPLAINED Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport. Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
  • 17. DATA GRID - WHY ???
  • 18. DATA GRID - ATTRIBUTES Data Virtualization: common presentation of all content Universe-size Namespace: for files, objects & metadata Automation of Data Operations: distributed, scalable Policy Mgmt/Reporting: data valuation & action triggers
  • 19. CEPH MEETS GRID implemented: Direct CephFS & RBD Ceph libRADOS Remote Cloud Cold Storage Archive DATA GRID unified namespace HiSpeed Tier Link LIBRADOS Ceph + LIBRADOS Ceph + RBD
  • 20. GRID IRON ALL-STARS (Dan Bedard: danb@renci.org) technology group
  • 21. TIME 2 SUMMARIZE… We are in the midst of a Data Explosion We need robust, expandable, yet simple solutions to store data We also need effective, de-centralized ways to care for the data
  • 22. the SMART approach DATA AUTOMATION STACK Workflow Automation Ceph Wildly-Scalable Storage + Data Lake Data Grid
  • 23. thank you! san jose ceph days Paul Evans principal architect paul@daystrom.com technology group