Building an Enterprise Data Lake: Considerations Before You Jump In

Building the Enterprise Data Lake: Important Considerations Before You Jump In
December 8, 2015

Today’s Presenters
Mark Madsen, Industry Analyst, Third Nature, @markmadsen
Craig Stewart, Sr. Dir. Product Management, SnapLogic, @01Badger
Erin Curtis, Sr. Dir. Product Marketing, SnapLogic, @erncrts

Building the Enterprise Data Lake
Considerations before you jump in
December, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen1

What This Session Isn’t
SQL... SQL! SQL? SQL

The craft model of information delivery does not scale

© Third Nature, Inc.

So we shifted to data publishing

Industrialized data delivery for self-service access.

Events and sensors are a relatively new data source

Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.

There’s lots of other new data involved

You can store this data in an RDBMS, but…

These sorts of things slow user requests down

Conclusion: any methodology built on the premise that you must know and model all the data first is untenable

Analytics embiggens data volume problems

Many of the processing problems are O(n²) or worse, so moderate data can be a problem for scale-up platforms
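
To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes a hypothetical all-pairs workload (for example, naive record matching where every record is compared with every other); the function name and the row counts are illustrative and do not come from the talk.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative row counts (assumed, not measured).
for rows in (1_000, 100_000, 10_000_000):
    print(f"{rows:>12,} rows -> {pairwise_comparisons(rows):,} comparisons")
```

Growing the input 100-fold grows the work roughly 10,000-fold, which is why data that looks moderate in size can still overwhelm a single scale-up machine.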

Old market says: There’s nothing wrong with what you have, just keep buying new products from us

The emerging big data market has an answer…

The data lake

Views of the lake

Is the business vs supports the business?

Application vs infrastructure?

The naïve idea of a data lake leads to predictable results

You can’t install Hadoop and hope it solves all the problems

Big data no 2

The answer isn’t just technology, it’s architecture

Schema
In the DW world both data and processing are bounded

No consideration for feedback loops and change
Processing only happens here
Carefully controlled access here
Nobody here creates new information
Sources few and well understood
Complex DI is controlled by IT
Schemas are few and designed
Tools are authorized, few in number and kind
One way flow

This is a monolithic, layered architecture

In the big data world flow is unbounded and continuous

Feedback loops allowed
End-of-analysis dataset may be start of a BI dataset
Continuous data integration and delivery
Files are back as both input and storage
Minimal barrier of / control on collection
Areas of provisioned data
Any shape in, rectangles out

This needs a distributed service architecture
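
The "any shape in, rectangles out" idea can be sketched as follows. This is a minimal illustration, assuming records arrive as JSON lines; the helper names `flatten` and `to_rectangle`, the dotted-key convention, and the sample events are all hypothetical and not part of the talk.

```python
import json
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rectangle(raw_lines: list[str], columns: list[str]) -> list[tuple]:
    """Project arbitrarily shaped JSON records onto a fixed set of columns.

    Missing fields become None, so every row has the same width.
    """
    rows = []
    for line in raw_lines:
        flat = flatten(json.loads(line))
        rows.append(tuple(flat.get(col) for col in columns))
    return rows


# Hypothetical events with different shapes.
events = [
    '{"device": {"id": "a1"}, "temp": 21.5}',
    '{"device": {"id": "b2", "fw": "1.2"}, "humidity": 40}',
]
table = to_rectangle(events, ["device.id", "temp", "humidity"])
```

The raw lines are stored untouched; the column list is chosen at read time, so a different consumer can cut a different rectangle from the same input.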

Deconstructing data environments

There are three things happening in a data warehouse:
▪  Data acquisition
▪  Data management
▪  Data delivery

Isolate them from one another, allow read-write use, and you are on the path.

Data Warehouse

Data lake subsystems / components

The acquisition component allows any data to be collected at any latency. The management component allows some data to be standardized and integrated. The access component provides access at any latency and via any means an application chooses. Processing can be done to any data at any time from any area.

Data Acquisition: Collect & Store (Incremental, Batch, One-time copy, Real time)
Data Management: Process & Integrate
Data Access: Deliver & Use
Data Lake Platform Services
Data storage

In reality, you are building three systems, not one. Avoid the monolith.
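
One way to picture the three systems is as separate components that share only a storage service. The sketch below is an assumption-laden toy: the class names, the in-memory `Storage`, and the area names "raw" and "standardized" are hypothetical stand-ins, not an implementation described in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Storage:
    """Shared storage that the three subsystems read from and write to."""
    areas: dict[str, list[Any]] = field(default_factory=dict)

    def append(self, area: str, item: Any) -> None:
        self.areas.setdefault(area, []).append(item)

    def read(self, area: str) -> list[Any]:
        return list(self.areas.get(area, []))


class Acquisition:
    """Collect & store: lands data as-is, with no modeling step."""
    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def collect(self, item: Any) -> None:
        self.storage.append("raw", item)


class Management:
    """Process & integrate: standardizes some of the raw data."""
    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def standardize(self, transform: Callable[[Any], Any]) -> None:
        for item in self.storage.read("raw"):
            self.storage.append("standardized", transform(item))


class Access:
    """Deliver & use: reads from whichever area the consumer chooses."""
    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def query(self, area: str) -> list[Any]:
        return self.storage.read(area)


storage = Storage()
Acquisition(storage).collect(" Sensor-7 ")
Management(storage).standardize(lambda s: s.strip().lower())
access = Access(storage)
```

Because the components know only the storage interface, each can be replaced or scaled without touching the other two, and consumers can still read the untouched raw area.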

Data lake functions depend on platform services

Base Platform Services
Data Movement, Metadata, Data Persistence
Workflow Management
Processing Engines, Dataflow Services
Data Curation
Data Access Services

Data Acquisition: Collect & Store
Data Management: Process & Integrate
Data Access: Deliver & Use

Platform services needed

DATA ARCHITECTURE

We’re so focused on the light switch that we’re not talking about the light
Decouple the Data Architecture

The core of the data lake isn’t a database or HDFS, it’s the data architecture that the tools implement.

We need a data architecture that is not limiting:
▪  Deals with change easily and at scale
▪  Does not enforce requirements and models up front
▪  Does not limit the format or structure of data
▪  Assumes the range of data latencies in and out, from streaming to one-time bulk
Food supply chain: an analogy for data

Multiple contexts of use, differing quality levels

You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.

Data architecture is required by the services, and vice versa

Raw data in an immutable storage area
Standardized or enhanced data
Common or usage-specific data
Transient data

Data Acquisition: Collect & Store
Platform Services
Data Access: Deliver & Use
Data Management: Process & Integrate

The data areas map (mostly) to functional areas of the lake

Collection can’t be limited by database scale and latency. Immutability, persistence and concurrency are required.

Collect: Incremental, Batch, One-time copy, Real time
Manage & Integrate
Process, Deliver, Use
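
The immutability requirement for collected data can be illustrated with a write-once, content-addressed store. This is only a sketch under stated assumptions: the class `ImmutableRawStore` is hypothetical, and an in-memory dict stands in for durable storage such as HDFS or an object store.

```python
import hashlib


class ImmutableRawStore:
    """Write-once store: objects are keyed by the SHA-256 of their content."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, payload: bytes) -> str:
        """Store payload and return its key; re-putting the same bytes is a no-op."""
        key = hashlib.sha256(payload).hexdigest()
        self._objects.setdefault(key, payload)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ImmutableRawStore()
key = store.put(b'{"sensor": 7, "temp": 21.5}')
same_key = store.put(b'{"sensor": 7, "temp": 21.5}')
```

There is no update or delete method, so the original is always recoverable, and because the key is derived from the content, two writers landing the same bytes arrive at the same key instead of conflicting.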

Stages, not layers

Some tools require specific repositories or models. Others can reach in to get what they need. Do not enforce a single access point or model.

The geography has been redefined

The box IT created:
• not any data, rigidly typed data
• not any form, tabular rows and columns of typed data
• not any latency, persist what the DB can keep up with
• not any process, only queries

The digital world was diminished to only what’s inside the box until we forgot the box was there.

Layered data architecture

The DW assumed a single flat model of data, DB in the center.

The data lake enables new ways to organize data:
▪  Raw – straight from the source
▪  Enhanced – cleaned, standardized
▪  Integrated – modeled, augmented, ~semi-persistent
▪  Derived – analytic output, pattern based sets, ephemeral

Implies a new technology architecture and data modeling approaches.
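
The four ways of organizing data listed above could be made explicit in a storage layout. The sketch below is one possible convention, not something the talk prescribes: the `/lake` root, the `dataset_path` helper, and the example source and dataset names are all assumptions.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    RAW = "raw"                # straight from the source
    ENHANCED = "enhanced"      # cleaned, standardized
    INTEGRATED = "integrated"  # modeled, augmented, semi-persistent
    DERIVED = "derived"        # analytic output, ephemeral


def dataset_path(zone: Zone, source: str, dataset: str) -> PurePosixPath:
    """Build a storage path that keeps each zone in its own top-level area."""
    return PurePosixPath("/lake") / zone.value / source / dataset


raw = dataset_path(Zone.RAW, "crm", "accounts")
derived = dataset_path(Zone.DERIVED, "churn_model", "scores")
```

Keeping the zone in the path makes it straightforward to apply different retention or access rules per area, for example never deleting under the raw area while expiring derived sets.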

The data lake enables evolutionary design for data

Evolutionary design is required because data needs change. You need a system not for stability (we have that in the DW) but for evolution and change: the data lake.

Data Acquisition: Collect & Store (Incremental, Batch, One-time copy, Real time)
Data Management: Process & Integrate
Data Access: Deliver & Use
Data Lake Platform Services
Data storage

You can’t build this all at once. You need to grow it over time.

Away from “one throat to choke”, back to best of breed

Tight coupling leads to efficient reuse and standardization, and to slow changes.

In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.

Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.

Questions?
“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand

CC Image Attributions

Thanks to the people who supplied the creative commons licensed images used in this presentation:
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721

About the Presenter

Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

About Third Nature

Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.

Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.

Modern Architecture: Hybrid and Elastic
Streams: No data is stored/cached
Secure: 100% standards-based
Elastic: Scales out & handles data and app integration use cases
Metadata
Data
Databases
On Prem Apps
Big Data
Cloud Apps and Data
Cloud-Based Designer, Manager, Dashboard
Cloudplex
Groundplex
Hadooplex
Sparkplex
Firewall

Data Lake
Data Acquisition, Data Management, Data Access
Ingest, Prepare, Deliver
Batch, Streaming, Real-time
Collect and integrate data from multiple sources
Add information and improve data
Organize and prepare data for visualization
On Prem Apps and Data
•  ERP
•  CRM
•  RDBMS
Cloud Apps and Data
•  CRM
•  HCM
•  Social
IoT Data
•  Sensors
•  Wearables
•  Devices
Lakeshore Data Mart
•  MS Azure
•  AWS Redshift
•  …
BI / Analytics
•  Tableau
•  MS PowerBI / Azure
•  AWS QuickSight
Kafka, Sqoop, Flume
Spark, Python, Scala, Java, R, Pig
Hive, Impala, HiveSQL, SparkSQL
HDFS, AWS S3, MS Azure Blob
Schedule and manage: Oozie, Ambari

The Modern Data Lake
Powered by SnapLogic
Data Acquisition, Data Management, Data Access
Ingest, Prepare, Deliver
Batch, Streaming, Real-time
Collect and integrate data from multiple sources: SnapLogic pipelines with standard mode execution
Sort, Aggregate, Join, Merge, Transform: SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines
Organize and prepare data for visualization: SnapLogic pipelines with standard mode execution
On Prem Apps and Data
•  ERP
•  CRM
•  RDBMS
Cloud Apps and Data
•  CRM
•  HCM
•  Social
IoT Data
•  Sensors
•  Wearables
•  Devices
Lakeshore Data Mart
•  MS Azure
•  AWS Redshift
•  …
BI / Analytics
•  Tableau
•  MS PowerBI / Azure
•  AWS QuickSight
Schedule and manage: SnapLogic
SnapLogic Pipelines

Thank You
Watch SnapLogic in action:
video/snaplogic.com

Contact us:
info@snaplogic.com

Follow us on Twitter:
@SnapLogic
