New times, new hype. Buzzwords like big data and Hadoop have been replaced by AI and machine learning. But it is not technology, old or new, nor machine learning, that separates the companies that get value from data from the companies that struggle.
When big data was at its peak, several young, technology-intensive companies absorbed it successfully: they acquired large Hadoop clusters, learned to master their data, and created valuable products with machine learning. At traditional companies, however, big data has had limited impact, and the list of protracted, expensive data lake and Hadoop projects is long.
The key to implementing successful projects that transform data into business value is to democratise data - making it accessible and easy to use within an organisation.
2. www.scling.com
Big data adoption
● 2003-2007: Only Google
● 2007-2014: Hadoop era (Europe). Highly technical companies succeed and disrupt.
● 2015-2019: Enterprise adoption (Europe). Big data gone from the Gartner hype cycle. “New normal”.
● 2019: Many enterprises in production, but big data and machine learning ROI still confined to high-tech.
4. www.scling.com
Efficiency gap, latency

Slow paths to production:
● Scandinavian retail (pycon.se, 2019): “We just took a machine learning pipeline in production after 8 months. Great success!”
● Scandinavian telecom (NDSML Summit 2019): “Document similarity pipeline finally in production. Estimated 3 months, took 8 months.”
● Dutch bank (Dataworks Summit 2018): 2016: data platform approval. 2018: pipeline in production.

Fast paths to production:
● Bonnier News (Riga DevOpsDays 2018): platform + 1st pipeline in production. Seven weeks, 1 person.
● Scandinavian retail, 2018: platform + 1st pipeline in production. Three weeks, 4 persons. 20 pipelines in 8 months.
● Spotify DataOps transform, 2013: new pipeline: < 1 day. Mend pipeline: < 1 hour.
5. www.scling.com
Efficiency gap, data cost & value
● Data processing produces datasets
● Each dataset has business value
○ Financial, sales, forecasting reports
○ A/B test, auto completion, insights
○ Recommendations, fraud
● Proxy metric: datasets / day
○ Small-to-medium traditional companies: < 10
○ Bank, telecom, media: 10-1000
Spotify: 2016: 20 000 datasets / day. 2017: 100 B events collected / day.
Google: 2016: 1 600 000 000 datasets / day.
6. www.scling.com
Data efficiency key factors
Data democratisation
● Making data available, usable, accessible

DataOps
● Short path from idea to production
● Cross-functional teams
○ Data engineering, domain experts, product, (data science)
○ Aligned with value, not function
● Low cost of failure
○ Machine and human failure
○ Risks ok → move fast
● Engineered operations
15. www.scling.com
Data pipelines at a glance

[Diagram: data lake as cold store; transformations produce immutable, shareable datasets; mutation happens at the edges]
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
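The workflow properties above (immutable, shared data; quick error recovery) can be sketched in a few lines. This is an illustrative sketch, not from the talk: a local directory stands in for the data lake, each run writes into a date partition, and the write is atomic, so a failed or repeated run is recovered by simply rerunning it.

```python
# Sketch: immutable, date-partitioned dataset writes.
# Assumption: a local filesystem directory plays the role of the data lake.
import json
import os
import tempfile


def write_partition(lake_root: str, dataset: str, date: str, records: list) -> str:
    """Write records into an immutable partition: <lake>/<dataset>/date=<date>/."""
    part_dir = os.path.join(lake_root, dataset, f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    # Write to a temp file, then rename: readers never observe partial output,
    # and a rerun atomically replaces the partition's file.
    fd, tmp_path = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    final_path = os.path.join(part_dir, "part-00000.json")
    os.replace(tmp_path, final_path)  # atomic rename on POSIX
    return final_path


def read_partition(lake_root: str, dataset: str, date: str) -> list:
    """Read back all records in a partition, in file order."""
    part_dir = os.path.join(lake_root, dataset, f"date={date}")
    records = []
    for name in sorted(os.listdir(part_dir)):
        if name.endswith(".json"):
            with open(os.path.join(part_dir, name)) as f:
                records.extend(json.loads(line) for line in f)
    return records
```

Because partitions are only ever replaced wholesale, downstream consumers can safely share the data, and a bad run is fixed by rerunning the same partition.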
16. www.scling.com
Late Hadoop adoption

Enterprise adopters: “Can you please implement mutability, transactions, SQL, etc.? We would like to keep our workflows.”
Vendors: “Anything, as long as you are buying.”
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
17. www.scling.com
Complex business logic - MDM @ Spotify ~2014

One team of five engineers built and ran:
● 10 pipelines like this
● Pipeline dev environment
● Pipeline continuous deployment infrastructure
20. www.scling.com
Data value = data + domain expertise + data practices

Disrupt? Adapt? Collaborate? (https://xkcd.com/1831/, + 1000s of failures...)

[Diagram: data-value-as-a-service — client data + domain expertise, combined with practices from data leaders, in a data lake with stream storage]
27. www.scling.com
Factors of democratisation

Siloed → Shared:
● Distributed storage → Homogeneous storage
● Need-to-know basis → Documentation read+write access
● Closed code ownership → Code read+write access
● Local rituals → Coordinated data governance
● Tribal knowledge → Common glossary, semantics
● Lay-on-hands deployment → Common DataOps procedures
● Unclear data origin → Common data provenance
(Axis: organic → coordinated)
28. www.scling.com
An e-shopping tale

What happens:
1. Log in, search for product X
○ X + 100s of accessories, random order
2. Find X in product catalog
○ No link to web shop
3. Put in cart, delivery?
○ Asks for address, customer club number
4. …

What the customer expects:
1. Log in, search for product X
○ Popular items first
2. Find X in product catalog
○ Take me to shop
3. Put in cart, delivery?
○ I am logged in
4. …

Full story: “Avoid artificial stupidity” blog post
29. www.scling.com
Document a clean architecture
● Include minimal governance, security, privacy

[Diagram: data lake as cold store; transformations produce immutable, shareable datasets; mutation at the edges]
30. www.scling.com
A lean start
● Align team with use case
○ Zero budget
● Ingest only necessary data
● Key technical component: workflow orchestrator (Luigi / Airflow)
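The core idea of a workflow orchestrator such as Luigi or Airflow is that tasks declare their requirements and their outputs, and the runner executes only tasks whose output does not yet exist. The toy sketch below illustrates that pattern without any dependency; it is not Luigi's actual API, and the `Ingest` and `Report` tasks are hypothetical.

```python
# Toy sketch of the orchestrator pattern behind Luigi/Airflow (not their real API):
# tasks declare requires() and output(); the runner walks the dependency graph
# and runs only tasks whose output is missing, which makes reruns cheap and safe.
class Task:
    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self):
        raise NotImplementedError  # identifier of the dataset this task produces

    def complete(self, done):
        return self.output() in done  # "target exists" check

    def run(self, done):
        raise NotImplementedError


def build(task, done, log):
    """Depth-first: satisfy requirements first, then run the task if needed."""
    for dep in task.requires():
        build(dep, done, log)
    if not task.complete(done):
        task.run(done)
        done.add(task.output())
        log.append(task.output())


class Ingest(Task):
    def output(self):
        return "raw/events"

    def run(self, done):
        pass  # e.g. copy source data into the lake


class Report(Task):
    def requires(self):
        return [Ingest()]

    def output(self):
        return "reports/daily"

    def run(self, done):
        pass  # e.g. aggregate raw/events into a daily report
```

Requesting the final task pulls in everything upstream; requesting it again is a no-op because all outputs exist, which is what makes "mend pipeline: < 1 hour" feasible.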
31. www.scling.com
An MVP is minimal

In scope:
● One use case
● One DB source
● Minimal privacy - limiting access
● Security

Out of scope:
● Data scalability
● High availability
● Durability
● Most privacy
● Self service
● Data quality
● Automation
● Clusters
● Auditability
● Scalable BI
● Fill lake
● Real-time
● Lineage
32. www.scling.com
Journey towards data value
● Remove complexity wherever possible
○ Unfamiliar tools may be less complex
● Pay attention to human and social factors

“Five dysfunctions of a data engineering team” - Jesse Anderson
● Only database admins
● Set up for failure
● No one understands schema
● No veterans
● Too ambitious
“Avoiding big data antipatterns” - Alex Holmes
● Big data tech for small data
● Point-to-point data integration
● Single tool for the job
● Excess volume or precision
● Lack of security