Many companies have data with great potential. There are many ways to go wrong with Big Data projects, however, and the difference between a successful and a failed project can be huge, both in cost and in return on investment. In this talk, we will describe the most common pitfalls and how to avoid them. You will learn to:
- Recognise the risk factors in your organisation that may cause a data project to fail.
- Identify the most common and costly causes of project failure.
- Avoid or mitigate project problems in order to ensure return on investment in a lean manner.
1. 10 ways to stumble with big data
2017-09-14
Lars Albertsson
www.mapflat.com
2. Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
3. Data-centric systems, 1st generation
● The monolith
○ All data in one place
○ Analytics + online serving from a single database

[Diagram: Presentation, Logic, and Storage layers, all backed by a single DB]
4. Data-centric systems, 2nd generation
● Collect aggregated data from multiple online systems to a data warehouse
● Aggregate to OLAP cubes
● Analytics focused
[Diagram: multiple services and a web application feed a data warehouse, which produces daily aggregates]
5. 3rd generation - event oriented
[Diagram: events land in cluster storage (the data lake); ETL jobs form pipelines that turn datasets into analytics, AI features, and data-driven product development]
7. 1 - Spending-driven development
● Large spending before value delivery
● Vendors want you to make this mistake
Warning signs:
○ No workflow orchestration tool
○ Driven by infrastructure department
○ Project named “data lake” or “data platform”
○ High trust in vendor
8. 2 - Premature scaling
● You don’t have big data!
● Max cloud instance memory: 2TB
● Does your data
○ fit?
○ grow faster than Moore’s law?
● Scaling out only when needed
● Big data → lean data
○ Time-efficient data handling
○ Democratised data
○ Complex business logic
○ Human fault tolerance
○ Data agility
Warning signs:
○ Funky databases
○ In-memory technology
○ Daily work requires a cluster
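The “does your data fit?” question is plain arithmetic, worth doing before buying a cluster. A back-of-envelope sketch — the 2 TB ceiling is the figure quoted on the slide; the dataset size is an illustrative assumption:

```python
# Back-of-envelope: does the working set fit on one big cloud instance?
# 2 TB is the max instance memory from the slide; other numbers are
# illustrative assumptions.
MAX_INSTANCE_RAM = 2 * 1024**4  # 2 TiB in bytes

def fits_on_one_machine(dataset_bytes: int) -> bool:
    """True if the dataset fits in memory on a single large instance,
    i.e. scaling out to a cluster is premature."""
    return dataset_bytes <= MAX_INSTANCE_RAM

# A hypothetical 500 GiB dataset: no cluster needed yet.
print(fits_on_one_machine(500 * 1024**3))  # True
```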
9. 3 - The data waterfall
● Handovers add latency
● Low product agility
Warning signs:
○ High time to delivery
○ Unclear use cases
○ Many teams from source to end
○ No workflow orchestration tool
[Diagram: mono-functional teams, a waterfall of handovers]
10. Right turn: Feature-driven teams & infrastructure
● Cross-functional teams own a specific feature
● Path from source data to end user service
● Start out with workflow orchestration
● Self-service infrastructure added lazily
● Postpone clusters & investments
● End-to-end proof of concepts
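“Start out with workflow orchestration” boils down to making every pipeline step idempotent and safely re-runnable. A minimal sketch of that recovery idea in plain Python — file names and logic are illustrative, and a real deployment would use a tool such as Luigi or Airflow rather than hand-rolling this:

```python
import json
import tempfile
from pathlib import Path

def run_task(output: Path, build):
    """Run `build` only if `output` is missing, so a rerun after a crash
    skips completed steps -- the recovery idea behind orchestration tools."""
    if output.exists():
        return  # already done; safe to re-run the whole pipeline
    tmp = output.with_suffix(".tmp")
    tmp.write_text(json.dumps(build()))
    tmp.rename(output)  # atomic publish: readers never see partial output

# Hypothetical two-step pipeline: ingest counts, then aggregate them.
with tempfile.TemporaryDirectory() as d:
    counts, daily = Path(d, "counts.json"), Path(d, "daily.json")
    run_task(counts, lambda: {"clicks": 42})
    run_task(daily, lambda: {"total": json.loads(counts.read_text())["clicks"]})
    print(json.loads(daily.read_text()))  # {'total': 42}
```

Because each step publishes its output atomically and skips work that is already done, restarting the whole pipeline after a failure is always safe.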
11. 4 - Lake of trash
[Diagram: the team that owns data exports it to the lake; the team needing data imports it from the lake]
Warning signs:
○ Excessive time spent cleaning
○ Data feature teams access production data
○ Data quality & semantics issues
12. 5 - Random walk
● Many iterative steps without a
target vision
● Works fine for months.
Pain then increases gradually.
● Difficult to be GDPR compliant.
Warning signs:
○ Autonomous / microservice culture
○ Little technology governance
○ No plan for schemas, deployment, privacy
○ Wide changes difficult
13. 6 - Distinct crawl
● Batch data pipelines are forgiving
○ Workflow orchestration tool for recovery
● Many practices are cargo cult rituals
○ Release management
○ In situ testing
○ Performance testing
● Start minimal & quick
○ Developer integration tests
○ Continuous deployment pipeline
● Add process iff pain
Warning signs:
○ Enterprise culture
○ Heavy practice governance
○ Standard rituals applied
○ Late first delivery
14. 7 - Data loss by design
Warning signs:
○ Processing during data ingestion
○ Unclear source of truth
○ Mutable master data
Remedies:
○ Store every event
○ Immutable data
○ Reproducible execution
○ Large recovery buffers
Benefits:
○ Human error tolerance
○ Component error tolerance
○ Rapid iteration speed
○ Eliminate manual precautions
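“Store every event, immutable data” is essentially event sourcing: append raw events forever and derive mutable views by replay. A minimal sketch — the event shapes and the in-memory log are illustrative assumptions; in practice the log would be files on cluster storage or a Kafka topic:

```python
import json
from collections import Counter

# Append-only event log; the immutable source of truth. Never updated in place.
log = []

def record(event: dict):
    """Ingest by appending the raw event -- no processing at ingestion time,
    so nothing is lost if downstream logic turns out to be wrong."""
    log.append(json.dumps(event))

def page_views():
    """Derive a view by replaying the log. Buggy derivation logic can be
    fixed and re-run over all history: human fault tolerance by design."""
    views = Counter()
    for line in log:
        e = json.loads(line)
        if e["type"] == "page_view":
            views[e["page"]] += 1
    return views

record({"type": "page_view", "page": "/home"})
record({"type": "page_view", "page": "/home"})
record({"type": "signup", "user": "alice"})
print(page_views())  # Counter({'/home': 2})
```

Note that the signup event is retained even though this particular view ignores it; a future view can still use it, which is the point of storing every event.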
15. 8 - AI first
● You can climb, not jump
● PoCs are possible
Credits: “The data science hierarchy of needs”,
Monica Rogati
[Pyramid, top to bottom: AI; deep learning; A/B testing, machine learning; analytics; segments, curation, anomaly detection; data infrastructure, pipelines; instrumentation, data collection. Axis labels: value, effort]
16. 9 - Technical bankruptcy
● Data pipeline == software product
● Apply common best practices
○ Quality tools & processes
○ Automated (integration) testing
○ CI/CD
○ Refactoring
● Avoid tools that steer you away
○ Local execution?
○ Difficult testing?
○ Mocks required?
● Strong software engineers
needed
○ Rotate if necessary
Warning signs:
○ Heterogeneous environment
○ Weak release process
○ Few code quality tools
○ Excessive time on operations
17. 10 - Team trinity unbalance
● Team sport
● Mutual respect & learning
● Be driven by
○ user value
● Balance with
○ innovation
○ engineering
[Diagram: triangle of data engineer, data scientist, and product owner; imbalance labels: increasing tech debt, little innovation, low business value]
18. 11 - Miss the train
● Big data + AI is not optional
● Cf. the Internet, smartphones, …
● Product development speed impact is significant
○ Data-driven evaluation
○ Forgiving environment - move fast without breaking things
○ Democratised access to data