Many companies have data with great potential. There are many ways to go wrong with Big Data projects, however, and the difference between a successful and a failed project can be huge, both in cost and in return on investment. In this talk, we will describe the most common pitfalls and how to avoid them. You will learn to:
- Recognise the risk factors in your organisation that may cause a data project to fail.
- Identify the most common and costly causes of project failure.
- Avoid or mitigate project problems in order to ensure return on investment in a lean manner.
1. 10 ways to stumble with big data
2017-09-14
Lars Albertsson
www.mapflat.com
2. Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
3. Data-centric systems, 1st generation
● The monolith
○ All data in one place
○ Analytics + online serving from a single database

[Diagram: Presentation, Logic, and Storage layers, all backed by a single DB]
4. Data-centric systems, 2nd generation
● Collect aggregated data from multiple online systems to a data warehouse
● Aggregate to OLAP cubes
● Analytics focused
[Diagram: multiple services and a web application feed a data warehouse, which produces daily aggregates]
5. 3rd generation - event oriented
[Diagram: events land in cluster storage (the data lake); ETL jobs form pipelines that turn datasets into analytics, AI features, and data-driven product development]
7. 1 - Spending-driven development
● Large spending before value delivery
● Vendors want you to make this mistake
Warning signs:
○ No workflow orchestration tool
○ Driven by infrastructure department
○ Project named “data lake” or “data platform”
○ High trust in vendor
8. 2 - Premature scaling
● You don’t have big data!
● Max cloud instance memory: 2TB
● Does your data
○ fit?
○ grow faster than Moore’s law?
● Scaling out only when needed
● Big data → lean data
○ Time-efficient data handling
○ Democratised data
○ Complex business logic
○ Human fault tolerance
○ Data agility
Warning signs:
○ Funky databases
○ In-memory technology
○ Daily work requires a cluster
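The “does your data fit?” question is plain arithmetic, worth doing before buying a cluster. A back-of-envelope sketch — the 2 TB ceiling is the figure quoted on the slide; the dataset size is an illustrative assumption:

```python
# Back-of-envelope: does the working set fit on one big cloud instance?
# 2 TB is the max instance memory from the slide; other numbers are
# illustrative assumptions.
MAX_INSTANCE_RAM = 2 * 1024**4  # 2 TiB in bytes

def fits_on_one_machine(dataset_bytes: int) -> bool:
    """True if the dataset fits in memory on a single large instance,
    i.e. scaling out to a cluster is premature."""
    return dataset_bytes <= MAX_INSTANCE_RAM

# A hypothetical 500 GiB dataset: no cluster needed yet.
print(fits_on_one_machine(500 * 1024**3))  # True
```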
9. 3 - The data waterfall
● Handovers add latency
● Low product agility
Warning signs:
○ High time to delivery
○ Unclear use cases
○ Many teams from source to end
○ No workflow orchestration tool
[Diagram: mono-functional teams, a waterfall of handovers]
10. Right turn: Feature-driven teams & infrastructure
● Cross-functional teams own a specific feature
● Path from source data to end user service
● Start out with workflow orchestration
● Self-service infrastructure added lazily
● Postpone clusters & investments
● End-to-end proof of concepts
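“Start out with workflow orchestration” boils down to making every pipeline step idempotent and safely re-runnable. A minimal sketch of that recovery idea in plain Python — file names and logic are illustrative, and a real deployment would use a tool such as Luigi or Airflow rather than hand-rolling this:

```python
import json
import tempfile
from pathlib import Path

def run_task(output: Path, build):
    """Run `build` only if `output` is missing, so a rerun after a crash
    skips completed steps -- the recovery idea behind orchestration tools."""
    if output.exists():
        return  # already done; safe to re-run the whole pipeline
    tmp = output.with_suffix(".tmp")
    tmp.write_text(json.dumps(build()))
    tmp.rename(output)  # atomic publish: readers never see partial output

# Hypothetical two-step pipeline: ingest counts, then aggregate them.
with tempfile.TemporaryDirectory() as d:
    counts, daily = Path(d, "counts.json"), Path(d, "daily.json")
    run_task(counts, lambda: {"clicks": 42})
    run_task(daily, lambda: {"total": json.loads(counts.read_text())["clicks"]})
    print(json.loads(daily.read_text()))  # {'total': 42}
```

Because each step publishes its output atomically and skips work that is already done, restarting the whole pipeline after a failure is always safe.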
11. 4 - Lake of trash
[Diagram: the team that owns data exports it to the lake; the team needing data imports it from the lake]
Warning signs:
○ Excessive time spent cleaning
○ Data feature teams access production data
○ Data quality & semantics issues
12. 5 - Random walk
● Many iterative steps without a
target vision
● Works fine for months.
Pain then increases gradually.
● Difficult to be GDPR compliant.
Warning signs:
○ Autonomous / microservice culture
○ Little technology governance
○ No plan for schemas, deployment, privacy
○ Wide changes difficult
13. 6 - Distinct crawl
● Batch data pipelines are forgiving
○ Workflow orchestration tool for recovery
● Many practices are cargo cult rituals
○ Release management
○ In situ testing
○ Performance testing
● Start minimal & quick
○ Developer integration tests
○ Continuous deployment pipeline
● Add process iff pain
Warning signs:
○ Enterprise culture
○ Heavy practice governance
○ Standard rituals applied
○ Late first delivery
14. 7 - Data loss by design
Warning signs:
○ Processing during data ingestion
○ Unclear source of truth
○ Mutable master data
Remedies:
○ Store every event
○ Immutable data
○ Reproducible execution
○ Large recovery buffers
Benefits:
○ Human error tolerance
○ Component error tolerance
○ Rapid iteration speed
○ Eliminate manual precautions
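“Store every event, immutable data” is essentially event sourcing: append raw events forever and derive mutable views by replay. A minimal sketch — the event shapes and the in-memory log are illustrative assumptions; in practice the log would be files on cluster storage or a Kafka topic:

```python
import json
from collections import Counter

# Append-only event log; the immutable source of truth. Never updated in place.
log = []

def record(event: dict):
    """Ingest by appending the raw event -- no processing at ingestion time,
    so nothing is lost if downstream logic turns out to be wrong."""
    log.append(json.dumps(event))

def page_views():
    """Derive a view by replaying the log. Buggy derivation logic can be
    fixed and re-run over all history: human fault tolerance by design."""
    views = Counter()
    for line in log:
        e = json.loads(line)
        if e["type"] == "page_view":
            views[e["page"]] += 1
    return views

record({"type": "page_view", "page": "/home"})
record({"type": "page_view", "page": "/home"})
record({"type": "signup", "user": "alice"})
print(page_views())  # Counter({'/home': 2})
```

Note that the signup event is retained even though this particular view ignores it; a future view can still use it, which is the point of storing every event.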
15. 8 - AI first
● You can climb, not jump
● PoCs are possible
Credits: “The data science hierarchy of needs”,
Monica Rogati
[Pyramid, top to bottom: AI; deep learning; A/B testing, machine learning; analytics; segments, curation, anomaly detection; data infrastructure, pipelines; instrumentation, data collection. Axis labels: value, effort]
16. 9 - Technical bankruptcy
● Data pipeline == software product
● Apply common best practices
○ Quality tools & processes
○ Automated (integration) testing
○ CI/CD
○ Refactoring
● Avoid tools that steer you away
○ Local execution?
○ Difficult testing?
○ Mocks required?
● Strong software engineers
needed
○ Rotate if necessary
Warning signs:
○ Heterogeneous environment
○ Weak release process
○ Few code quality tools
○ Excessive time on operations
17. 10 - Team trinity unbalance
● Team sport
● Mutual respect & learning
● Be driven by
○ user value
● Balance with
○ innovation
○ engineering
[Diagram: triangle of data engineer, data scientist, and product owner; imbalance labels: increasing tech debt, little innovation, low business value]
18. 11 - Miss the train
● Big data + AI is not optional
● Cf. the Internet, smartphones, …
● Product development speed impact is significant
○ Data-driven evaluation
○ Forgiving environment - move fast without breaking things
○ Democratised access to data