This document discusses considerations for building an enterprise data lake. It begins by introducing the presenters and stating that the session will not focus on SQL. It then discusses how the traditional "crab" model of data delivery does not scale and how organizations have shifted to industrialized data publishing. The rest of the document discusses important aspects of data lake architecture, including how different types of data like sensor data require new approaches. It emphasizes that the data lake requires a distributed service architecture rather than a monolithic structure. It also stresses that the data lake consists of three core subsystems for acquisition, management, and access, and that these depend on underlying platform services.
2. Building The Enterprise Data Lake
Today’s Presenters
Mark Madsen, Industry Analyst, Third Nature, @markmadsen
Craig Stewart, Sr. Dir. Product Management, SnapLogic, @01Badger
Erin Curtis, Sr. Dir. Product Marketing, SnapLogic, @erncrts
3. Building the Enterprise Data Lake
Considerations before you jump in
December, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen1
7. Events and sensors are a relatively new data source
Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.
10. These sorts of things slow user requests down
Conclusion: any methodology built on the premise that you must know and model all the data first is untenable.
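One way to picture the alternative is to store records exactly as they arrive and let each reader apply only the structure it needs at read time. The slide does not name a technique; the sketch below assumes this "schema-on-read" approach, and the sample records, field names and `read_with_schema` helper are hypothetical.

```python
import json

# Raw records stored as they arrived; nobody modeled them up front.
raw_lines = [
    '{"device": "pump-1", "ts": 1, "pressure": 3.2}',
    '{"device": "pump-1", "ts": 2, "pressure": 3.4, "vibration": 0.7}',
    '{"device": "door-9", "ts": 3, "open": true}',
]


def read_with_schema(lines, fields):
    """Yield each record projected onto `fields`, skipping records that lack any of them."""
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            yield {name: record[name] for name in fields}


# Two readers impose two different structures on the same stored data.
pressure_view = list(read_with_schema(raw_lines, ["device", "ts", "pressure"]))
vibration_view = list(read_with_schema(raw_lines, ["device", "vibration"]))
```

A new field such as `vibration` can start arriving without any change to the stored data or to the readers that do not use it.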
19. Schema
In the DW world both data and processing are bounded
No consideration for feedback loops and change
Processing only happens here
Carefully controlled access here
Nobody here creates new information
Sources few and well understood
Complex DI is controlled by IT
Schemas are few and designed
Tools are authorized, few in number and kind
One way flow
This is a monolithic, layered architecture
22. Data lake subsystems / components
The acquisition component allows any data to be collected at any latency.
The management component allows some data to be standardized and integrated.
The access component provides access at any latency and via any means an application chooses.
Processing can be done to any data at any time from any area.
Data Acquisition: Collect & Store
Incremental | Batch | One-time copy | Real time
Data Lake Platform Services
Data Management: Process & Integrate
Data Access: Deliver & Use
Data storage
In reality, you are building three systems, not one. Avoid the monolith.
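The split into three systems can be sketched as three independent components that share only the platform layer underneath them. This is a minimal illustration, not the presenters' design: the class names, the in-memory "raw" and "managed" areas, and the sample records are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class PlatformServices:
    """Shared data storage; an in-memory stand-in for the platform layer."""
    areas: Dict[str, List[Record]] = field(
        default_factory=lambda: {"raw": [], "managed": []}
    )


class Acquisition:
    """Collect & store: keeps any record as it arrives, with no upfront model."""

    def __init__(self, platform: PlatformServices) -> None:
        self.platform = platform

    def collect(self, record: Record) -> None:
        self.platform.areas["raw"].append(dict(record))


class Management:
    """Process & integrate: standardizes only the records a rule applies to."""

    def __init__(self, platform: PlatformServices) -> None:
        self.platform = platform

    def standardize(
        self,
        applies: Callable[[Record], bool],
        transform: Callable[[Record], Record],
    ) -> int:
        standardized = [
            transform(r) for r in self.platform.areas["raw"] if applies(r)
        ]
        self.platform.areas["managed"].extend(standardized)
        return len(standardized)


class Access:
    """Deliver & use: reads from any area, raw or managed."""

    def __init__(self, platform: PlatformServices) -> None:
        self.platform = platform

    def read(
        self, area: str, wanted: Callable[[Record], bool] = lambda r: True
    ) -> List[Record]:
        return [r for r in self.platform.areas[area] if wanted(r)]


platform = PlatformServices()
acquisition = Acquisition(platform)
management = Management(platform)
access = Access(platform)

# Any data is collected; only some of it is standardized.
acquisition.collect({"sensor": "t1", "temp_f": 212.0})
acquisition.collect({"event": "login", "user": "ann"})
standardized_count = management.standardize(
    applies=lambda r: "temp_f" in r,
    transform=lambda r: {
        "sensor": r["sensor"],
        "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1),
    },
)
```

Because no component calls another directly, the login event stays readable through `access.read("raw")` even though no management rule ever touched it.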
37. About Third Nature
Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you're at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.
39. Anything: apps | APIs | things | data
Anytime: batch | streaming | real-time
Anywhere: on premises | in the cloud
SnapLogic helps enterprises connect data and applications faster
40. Modern Architecture: Hybrid and Elastic
Streams: No data is stored/cached
Secure: 100% standards-based
Elastic: Scales out & handles data and app integration use cases
Metadata
Data
Databases
On Prem Apps
Big Data
Cloud Apps and Data
Cloud-Based Designer, Manager, Dashboard
Cloudplex
Groundplex
Hadooplex
Sparkplex
Firewall
41. Data Lake
Data Acquisition: Collect and integrate data from multiple sources
Data Management: Add information and improve data
Data Access: Organize and prepare data for visualization
Ingest | Prepare | Deliver
Batch | Streaming | Real-time
Schedule and manage: Oozie, Ambari
Kafka, Sqoop, Flume
Spark, Python, Scala, Java, R, Pig
Hive
Impala, HiveSQL, SparkSQL
HDFS, AWS S3, MS Azure Blob
On Prem Apps and Data
• ERP
• CRM
• RDBMS
Cloud Apps and Data
• CRM
• HCM
• Social
IoT Data
• Sensors
• Wearables
• Devices
Lakeshore Data Mart
• MS Azure
• AWS Redshift
• …
BI / Analytics
• Tableau
• MS PowerBI / Azure
• AWS QuickSight
42. The Modern Data Lake Powered by SnapLogic
Data Acquisition
Collect and integrate data from multiple sources: SnapLogic pipelines with standard mode execution
Data Management
Sort, Aggregate, Join, Merge, Transform: SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines
Data Access
Organize and prepare data for visualization: SnapLogic pipelines with standard mode execution
Ingest | Prepare | Deliver
Batch | Streaming | Real-time
Schedule and manage: SnapLogic
SnapLogic Pipelines
On Prem Apps and Data
• ERP
• CRM
• RDBMS
Cloud Apps and Data
• CRM
• HCM
• Social
IoT Data
• Sensors
• Wearables
• Devices
Lakeshore Data Mart
• MS Azure
• AWS Redshift
• …
BI / Analytics
• Tableau
• MS PowerBI / Azure
• AWS QuickSight
43. Thank You
Watch SnapLogic in action: video/snaplogic.com
Contact us: info@snaplogic.com
Follow us on Twitter: @SnapLogic