DataOps is the transformation of data processing from a craft with manual processes to an automated data factory. Lean principles, which have proven successful in manufacturing, are equally applicable to data factories. We will describe how lean principles can be applied in practice for successful data processing.
1. www.scling.com
DataOps - Lean principles and practices
Data 2030 Summit, 2021-02-11
Lars Albertsson, Founder, Scling
Ask not what, but how
Ideas << execution
DataOps is the "how" of data & ML
2013: Transform @ Spotify
2014: "DataOps" term first seen
2018: Conference talk rejected
2019: Most watched recording @ Data Innovation Summit
2021: DataOps day @ Data 2030 Summit
Enabling innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
"Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
"I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have."
- Daniel Ek, CEO
IT craft to factory
● Security → DevSecOps
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code
From craft to process
● Multiple time windows
● Assess ingress data quality
● Repair broken data from complementary source
● Forecast based on history, multiple parameter settings
● Assess outcome data quality
● Assess forecast success, adapt parameters
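The steps above could be automated roughly as follows. This is a minimal sketch, not the pipeline from the talk: the function names, the moving-average forecast, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean

def completeness(series):
    """Share of values that are present (not None), between 0.0 and 1.0."""
    return sum(v is not None for v in series) / len(series)

def repair(primary, complementary):
    """Fill gaps in the primary source with values from a complementary source."""
    return [p if p is not None else c for p, c in zip(primary, complementary)]

def forecast(history, window):
    """Naive forecast (an assumption): mean of the last `window` observations."""
    return mean(history[-window:])

def pick_window(history, windows):
    """Backtest each window on the latest observation and keep the most accurate."""
    past, actual = history[:-1], history[-1]
    return min(windows, key=lambda w: abs(forecast(past, w) - actual))

# Hypothetical sample data.
primary = [10, None, 12, 13]        # ingress data with one missing value
complementary = [10, 11, 12, 14]    # second source covering the same period

if completeness(primary) < 1.0:     # assess ingress data quality
    primary = repair(primary, complementary)

window = pick_window(primary, windows=[1, 2, 3])  # try multiple parameter settings
prediction = forecast(primary, window)
```

Each manual step becomes a function that can be tested and scheduled, which is the point of moving from craft to process.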
Towards sustainable production ML
● Multiple models, parameters, features
● Assess ingress data quality
● Repair broken data from complementary source
● Choose model and parameters based on performance and input data
● Benchmark models
● Try multiple models, measure, A/B test
The Toyota Way
Selected lean principles:
● Long-term over short-term
● The right process will produce the right results
● Eliminate waste (muda)
● Continuous improvement (kaizen)
● Use pull systems to avoid unnecessary production
● Quality takes precedence (jidoka)
○ Stop to fix problems
● Standardised tasks and processes
● Reliable technology that serves people and process
● Develop your people
● Decisions slowly by consensus
● Relentless reflection (hansei), organisational learning
Common waste species
● Cognitive waste
● Technology waste
● Delivery waste
● Operational waste
● Product waste
Companies are generally good at handling some waste forms, and blind to others.
Your blindness is your potential.
Cognitive waste
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ …
○ Float of seconds since epoch, as string.
WTF?!?
● my-kafka-topic-name, your_topic_name
● Definition of an order:
○ Abandoned cart?
○ Payment refused?
○ Returned goods?
○ Free promotion?
● Data entity source of truth
○ MySQL, Kafka, data lake?
● Code and documentation sprawl
○ Repositories & branches
○ Wikis
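One way to contain the time format sprawl listed above is a single shared conversion into timezone-aware UTC at ingress. The sketch below is an assumption about how that might look: the format labels and function names are hypothetical, and only standard library calls are used. "Millis since epoch, user local time" is left out on purpose, since it cannot be converted without also knowing the user's offset.

```python
from datetime import datetime, timedelta, timezone

def _from_iso(text):
    """ISO 8601; a value without an offset is assumed to be UTC."""
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def _from_epoch(value, units_per_second):
    """Integer count since epoch in the given unit, UTC. Uses integer maths
    to avoid float rounding on nanosecond values."""
    seconds, remainder = divmod(int(value), units_per_second)
    micros = remainder * 1_000_000 // units_per_second
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=micros)

# Hypothetical format labels; each maps to one parser.
PARSERS = {
    "iso": _from_iso,
    "epoch_millis": lambda v: _from_epoch(v, 1_000),
    "epoch_nanos": lambda v: _from_epoch(v, 1_000_000_000),
    "epoch_seconds_string": lambda v: datetime.fromtimestamp(float(v), tz=timezone.utc),
}

def to_utc(value, fmt):
    """Convert a timestamp in a known source format to an aware UTC datetime."""
    return PARSERS[fmt](value)

# All of these describe the same instant.
same_instant = [
    to_utc("2021-02-11T09:30:00", "iso"),
    to_utc("2021-02-11T10:30:00+01:00", "iso"),
    to_utc(1613035800000, "epoch_millis"),
    to_utc("1613035800.0", "epoch_seconds_string"),
]
```

Downstream code then handles one representation only, and adding a new source format means adding one entry to the table.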
What causes cognitive waste?
● We are autonomous!
○ Teams can choose technology, format, process, ...
● Cognitive debt
○ Short-term over long-term
○ Decisions without consensus
● Recognition and rewards
○ "You have made a similar independent pipeline, great work!"
Avoiding cognitive waste
● Reusing semantic definitions
● Reusing code & technical definitions
○ Code transparency & sharing
○ Standardised technology
○ Document decisions & consensus process
● Read-only sharing not enough
○ Must be empowered to
■ change for reuse
■ improve quality
■ delete unused
○ Low risk - what will I break downstream?
○ Standardised, end-to-end QA processes
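The question "what will I break downstream?" can be answered mechanically once dataset dependencies are recorded. The sketch below assumes lineage is available as a plain mapping from each dataset to its direct consumers; the dataset names are hypothetical.

```python
from collections import deque

def downstream(lineage, start):
    """Return every dataset that directly or transitively consumes `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Hypothetical lineage: dataset -> datasets built directly from it.
lineage = {
    "orders_raw": ["orders_clean"],
    "orders_clean": ["revenue_report", "fraud_features"],
    "fraud_features": ["fraud_model"],
}

affected = downstream(lineage, "orders_clean")
```

With the affected set known, a change to a shared component can be tested end to end against exactly those consumers before it is merged.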
Delivery waste - code inventory
● Code not yet fully utilised
● Code on its way to production
○ In a notebook
○ Waiting for approval
○ Waiting for release
○ Internally released, waiting for dependants to upgrade
● Tests not fully used
○ Tests that cover code (shared component), but are not yet executed
Eliminating delivery waste
● Friction from code to production
○ Positive engineering: research, writing code, tests, docs, refactor, improve
○ All else is negative
● You are limited by your assumptions
○ State of practice is far from state of art
But the test suite takes 3 hours.
We have this checklist.
Security must approve.
X must be released before Y.
That is another team's job.
We don't have access.
We must test in staging first.
We haven't performance tested yet.
Operational waste
● Friction in operational manoeuvres
○ Fear of mistakes
○ Application-specific tooling
● Cost of incidents
○ Time to recovery
○ Impact of incident
○ Frequency of incidents
Separating offline and online
[Diagram: Orders, Replication / Backup, Raw, Fraud model, Fraud service]
Online, on both sides: prudent procedures
Between online and offline: careful handover
Offline: lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Data processing tradeoff
[Diagram: Job and Stream components placed along the Offline, Nearline, Online range]
● Online: many nines uptime (99.99.. %), data speed
● Offline: a couple of sevens, innovation speed
Product waste
● Work not driven by use case
● Unrealised data potential due to friction
○ Unawareness of data
○ Difficulty to use data
● Collaboration and communication
○ Connection
○ Overhead
Data democratisation - making data accessible and usable
Form teams aligned to value flows.
24. www.scling.com
Continuous improvement & learning
● Products, not projects
○ Owned, never done, always improving
● To production early
○ Minimal fear
○ Measure and monitor to learn
● Fail & iterate
○ No blame, no penalties
● Communication across organisation essential
○ Data source team - data processing team - stakeholders
Data product quality assurance
● Product quality = f(code, data)
○ Cannot do full QA on code only
○ Only real data is production data
● Test in production
○ Quick QA cycle = quick production deployment
○ Measure, monitor, validate
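Because product quality depends on data as well as code, part of the QA has to run against the datasets a pipeline actually produces. A minimal sketch of "measure, monitor, validate" follows; the metric choice, thresholds, and field name are assumptions for illustration, not a prescribed set.

```python
def measure(records, field):
    """Compute simple quality metrics for one field of an output dataset."""
    rows = len(records)
    nulls = sum(1 for r in records if r.get(field) is None)
    # An empty dataset is treated as fully missing.
    return {"rows": rows, "null_fraction": nulls / rows if rows else 1.0}

def validate(metrics, min_rows, max_null_fraction):
    """Return a list of violations; an empty list means the dataset passes."""
    violations = []
    if metrics["rows"] < min_rows:
        violations.append(f"too few rows: {metrics['rows']} < {min_rows}")
    if metrics["null_fraction"] > max_null_fraction:
        violations.append(
            f"null fraction {metrics['null_fraction']:.2f} > {max_null_fraction}"
        )
    return violations

# Hypothetical pipeline output.
records = [{"amount": 10}, {"amount": None}, {"amount": 5}, {"amount": 7}]
metrics = measure(records, "amount")   # measure; emit to monitoring
problems = validate(metrics, min_rows=3, max_null_fraction=0.1)  # validate
```

The same metrics can be charted over time for monitoring, while the violations list can stop a dataset from being published downstream.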
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration