6. DATA LAKE
“a storage repository that holds a vast amount of raw data in its native
format until it is needed”
7. DATA LAKE - ORIGINS
First use credited to James Dixon, CTO at Pentaho, circa 2010
“If you think of a datamart as a store of bottled water – cleansed and
packaged and structured for easy consumption – the data lake is a large
body of water in a more natural state…”
“The contents of the data lake stream in from a source to fill the lake,
and various users of the lake can come to examine, dive in, or take
samples.”
8. DATA LAKE - EXPLAINED
While a hierarchical data warehouse stores
data in files or folders, a data lake uses a flat
architecture to store data. Each data element
in a lake is assigned a unique identifier and
tagged with a set of extended metadata tags.
When a business question arises, the data
lake can be queried for relevant data, and that
smaller set of data can then be analyzed to
help answer the question.
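As a rough illustration of the flat layout described above, the sketch below keeps every element in one keyspace, gives it a unique identifier, and finds data by metadata tags rather than by folder path. The names `FlatLake`, `LakeObject`, and the tag keys are hypothetical and chosen only for this example; a real lake would sit on a distributed store, not an in-memory dict.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element: an opaque payload plus descriptive tags."""
    payload: bytes
    tags: dict
    # Unique identifier assigned when the element enters the lake.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Hypothetical in-memory lake: a single flat keyspace, no folders."""

    def __init__(self):
        self._objects = {}  # object_id -> LakeObject

    def put(self, payload, **tags):
        """Store raw bytes as-is and return the generated identifier."""
        obj = LakeObject(payload=payload, tags=tags)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def query(self, **wanted):
        """Return every object whose tags match all requested key/value pairs."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]


lake = FlatLake()
lake.put(b"2014-03-01,store-7,19.99", source="pos", region="west", year=2014)
lake.put(b"2014-03-01,store-9,5.00", source="pos", region="east", year=2014)
lake.put(b'{"user": 42, "page": "/home"}', source="weblog", region="west", year=2014)

# A business question ("west-region sales?") becomes a tag query that
# narrows the lake to a smaller set for later analysis.
west_sales = lake.query(source="pos", region="west")
```

The query returns only the subset worth analyzing; the raw payloads themselves are never reshaped on the way in.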
10. DATA LAKE CHARACTER
Unwashed Data: schema-on-read from RAW source
Flexible Processing: batch, interactive, online, search
MetaData Dependent: tag it or lose it
Common Access: HDFS-centric toolset
…in other words: this is not a glass-house Data Mart
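Schema-on-read can be sketched in a few lines: the raw text is stored untouched, and column names and types are applied only when someone reads it. The sample records, both schemas, and the helper `read_with_schema` are assumptions made up for this illustration, not part of any particular toolset.

```python
import csv
import io

# Raw records land in the lake untouched; no schema is enforced on write.
RAW_LINES = "2014-03-01,store-7,19.99\n2014-03-02,store-7,5.01\n"

# Two hypothetical read-time schemas over the same raw bytes.
SALES_SCHEMA = [("day", str), ("store", str), ("amount", float)]
STORE_ONLY_SCHEMA = [("day", str), ("store", str)]


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, applying column names and types only at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a narrower schema simply
        # ignores trailing columns it does not care about.
        row = {name: cast(value) for (name, cast), value in zip(schema, fields)}
        rows.append(row)
    return rows


sales = read_with_schema(RAW_LINES, SALES_SCHEMA)
total = sum(row["amount"] for row in sales)
stores = {row["store"] for row in read_with_schema(RAW_LINES, STORE_ONLY_SCHEMA)}
```

Two consumers read the same raw data with different schemas, which is the practical difference from a mart that fixes one schema at load time.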
11. A REFERENCE ‘LAKE’ ARCHITECTURE
GOVERNANCE | DATA ACCESS | SECURITY | OPERATIONS
INTEGRATION
DATA MANAGEMENT
12. A CEPHALOPOD IN THE LAKE?
If this is import…        Use this…
Hadoop-native             HDFS
Locality-aware            HDFS
Distributed Name Svc      Ceph
Native Erasure Coding     Ceph
20% Faster *              Ceph
* on Terasort benchmark over IB, Mar 2014
14. DATA GRID
“the unifying layer to how content and data are stored, protected, located
and accessed”
15. DATA GRID - ORIGINS
The need for data grids was first recognized by the scientific
community concerning climate modeling, where exchanging PB-size
data sets became commonplace. Recently, large-scale
instruments such as the Large Hadron Collider (LHC) at CERN
are driving grid innovation.
16. DATA GRID - EXPLAINED
Data Grids present consistent access
controls, governance, and metadata
extensions to diverse storage media
using a common, global interface for
access and transport.
Additionally, they offer a ‘micro-service’ architecture
for the creation of standard tasks & policies, which
are enforced by a distributed “grid control-plane.”
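One way to picture policies enforced by a control plane is as small condition/action pairs evaluated against catalog metadata. The sketch below is a single-process stand-in under that assumption; `Policy`, `enforce`, `move_to_archive`, the catalog fields, and the 365-day threshold are all hypothetical and do not describe any specific grid product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical grid policy: when `condition` holds, run `action`."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]


def enforce(policies, catalog):
    """Apply every policy to every catalog entry; return a log of what fired."""
    fired = []
    for entry in catalog:
        for policy in policies:
            if policy.condition(entry):
                policy.action(entry)
                fired.append((policy.name, entry["path"]))
    return fired


def move_to_archive(entry):
    # Stand-in for a real data movement task; here we only retag the entry.
    entry["tier"] = "archive"


catalog = [
    {"path": "/grid/climate/run-001.nc", "tier": "hispeed", "days_idle": 400},
    {"path": "/grid/climate/run-002.nc", "tier": "hispeed", "days_idle": 3},
]

policies = [
    Policy(
        name="archive-idle-data",
        condition=lambda e: e["tier"] == "hispeed" and e["days_idle"] > 365,
        action=move_to_archive,
    ),
]

log = enforce(policies, catalog)
```

In a distributed setting each storage node would run the same evaluation loop over its own slice of the catalog; the returned log is the kind of record a reporting layer could consume.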
18. DATA GRID - ATTRIBUTES
Data Virtualization: common presentation of all content
Universe-size Namespace: for files, objects & metadata
Automation of Data Operations: distributed, scalable
Policy Mgmt/Reporting: data valuation & action triggers
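The data virtualization and single-namespace attributes can be illustrated with a resolver that maps one logical path space onto several storage backends. This is a minimal sketch; `GridNamespace`, the mount prefixes, and the backend labels are hypothetical names picked for the example.

```python
class GridNamespace:
    """Hypothetical resolver from one logical namespace to many backends."""

    def __init__(self):
        self._mounts = {}  # logical prefix -> backend label

    def mount(self, prefix, backend):
        """Attach a backend under a logical path prefix."""
        self._mounts[prefix.rstrip("/")] = backend

    def resolve(self, logical_path):
        """Return (backend, path relative to the mount) via longest-prefix match."""
        best = None
        for prefix in self._mounts:
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no backend mounted for {logical_path}")
        return self._mounts[best], logical_path[len(best):].lstrip("/")


ns = GridNamespace()
ns.mount("/grid", "ceph-librados")
ns.mount("/grid/archive", "cold-storage")

# Clients see one tree; the resolver decides which backend serves each path.
hot = ns.resolve("/grid/projects/lhc/run-17.dat")
cold = ns.resolve("/grid/archive/2012/run-03.dat")
```

Longest-prefix matching lets a more specific mount (the archive) override a general one without clients changing the paths they use.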
19. CEPH MEETS GRID
[Diagram, labeled "implemented:": DATA GRID unified namespace. Labels: HiSpeed Tier, Direct, Link, Remote, Cloud, Cold Storage, Archive; CephFS & RBD; Ceph libRADOS; Ceph + LIBRADOS (twice); RBD.]
21. TIME 2 SUMMARIZE…
We are in the midst of a Data Explosion
We need robust, expandable, yet simple solutions to store data
We also need effective, decentralized ways to care for the data
22. the SMART approach
DATA AUTOMATION STACK
Workflow Automation
Ceph
Wildly-Scalable Storage
+
Data Lake
Data Grid
23. thank you!
san jose ceph days
Paul Evans
principal architect
paul@daystrom.com
technology group