Data pipelines
from zero to solid
Lars Albertsson
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Independent data engineering consultant
Presentation goals
● Overview of data pipelines for analytics / data products
● Target audience: Big data starters
○ Seen wordcount, need the stuff around
● Overview of necessary components & wiring
● Base recipe
○ In vicinity of state-of-practice
○ Baseline for comparing design proposals
● Subjective best practices - not single truth
● Technology suggestions, (alternatives)
Presentation non-goals
● Stream processing
○ High complexity in practice
○ Batch processing yields > 90% of value
● Technology enumeration or (fair) comparison
● Writing data processing code
○ Already covered en masse
Data product anatomy
Cluster storage
Unified log
Ingress ETL Egress
Computer program anatomy
Input data
Process Output
Execution path
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, lint tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is the norm... :-(
Solved by mixing strong software engineers with data
engineers/scientists. Mutual respect is crucial.
Cluster storage
(NFS, S3, Google CS, C*)
Event collection
Reliable, simple,
write available
Bus with history
(Google Pub/Sub)
Immediate handoff to append-only replicated log.
Once in the log, events eventually arrive in storage.
Unified log
Immutable events, append-only,
source of truth
Event registration
Unified log
Events are safe
from here
Replicated bus
with history
Asynchronous fire-and-forget handoff for unimportant data.
Synchronous, replicated, with ack for important data
Cluster storage
(NFS, S3, Google CS, C*)
Event transportation
Bus-to-bus WAN mirror
expect delays
Log has long history (months+) => robustness end to end.
Avoid risk of processing & decoration. Except timestamps.
Cluster storage
Event arrival
Bundle incoming events into datasets
● Sealed quickly, thereafter immutable
● Bucket on arrival / wall-clock time
● Predictable bucketing, e.g. hour
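A minimal sketch of arrival-time bucketing, assuming events reach the collector as (arrival time, payload) pairs. The function name `bucket_by_arrival_hour` and the key format are illustrative choices, not something the slides prescribe.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_arrival_hour(events):
    """Group (arrival_time, payload) pairs into hourly buckets.

    The key comes from the wall-clock arrival time, not from any timestamp
    inside the payload, so a late event lands in the bucket of the hour it
    arrived in and an already sealed bucket is never reopened.
    """
    buckets = defaultdict(list)
    for arrival_time, payload in events:
        key = arrival_time.strftime("%Y-%m-%dT%H")
        buckets[key].append(payload)
    return dict(buckets)


events = [
    (datetime(2016, 2, 6, 13, 5, tzinfo=timezone.utc), {"user": "a"}),
    (datetime(2016, 2, 6, 13, 59, tzinfo=timezone.utc), {"user": "b"}),
    (datetime(2016, 2, 6, 14, 0, tzinfo=timezone.utc), {"user": "c"}),
]
hourly = bucket_by_arrival_hour(events)
```

Each key in `hourly` corresponds to one dataset that can be sealed once the hour has passed.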
Database state collection
Cluster storage
(NFS, S3, Google CS, C*)
Source of truth sometimes in database.
Snapshot to cluster storage.
Easy on surface...
Anti-pattern: Send the oliphants!
● Sqoop (dump with MapReduce) production DB
● MapReduce from production API
Hadoop / Spark == internal DDoS service
Cluster storage
(NFS, S3, Google CS, C*)
Our preciousss
Deterministic slaves
Restore backup to offline slave
+ Standard procedure
- Serial or resource consuming
Cluster storage
(NFS, S3, Google CS, C*)
Using snapshots
● join(event, snapshot) => always time mismatch
● Usually acceptable
● Some behaviour difficult to catch with snapshots
○ E.g. user creates, then deletes account
Event sourcing
● Every change to unified log == source of truth
● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
● Allows view & join at any point in time
Application services still need DB for current state lookup
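The snapshot formula above can be sketched as a fold over events. This is an illustration under assumptions: the event shape (`type`, `user_id`, `fields`) is hypothetical, and a real log would carry more metadata.

```python
def apply_events(snapshot, events):
    """Return snapshot(t + 1) from snapshot(t) and the events in between."""
    state = dict(snapshot)  # copy, so the old snapshot stays immutable
    for event in events:
        if event["type"] == "delete":
            state.pop(event["user_id"], None)
        else:  # "create" or "update"
            state[event["user_id"]] = event["fields"]
    return state


snapshot_t = {"u1": {"country": "SE"}}
events_t = [
    {"type": "create", "user_id": "u2", "fields": {"country": "NO"}},
    {"type": "delete", "user_id": "u2"},
    {"type": "update", "user_id": "u1", "fields": {"country": "DK"}},
]
snapshot_t1 = apply_events(snapshot_t, events_t)
```

User `u2` is created and deleted between the two snapshots. Neither snapshot shows it, but the events do, which is the behaviour that plain snapshots fail to catch.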
Event sourcing, synced database
A. Service interface generates events
and DB transactions
B. Generate stream from commit log
Postgres, MySQL -> Kafka
C. Build DB with stream processing
DB snapshot lessons learnt
● Put fences between online and offline components
○ The latter can kill the former
● Team that owns a database/service must own exporting data to offline
○ Protect online stability
○ Affects choice of DB technology
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch processing perspective
● Kept as long as permitted
● Technically homogeneous
Cluster storage
Data lake
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ E.g. MobileAdImpressions(hour="2016-02-06T13")
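One way to picture the class / instance split, as a sketch only: a frozen dataclass plays the dataset class, and its parameters pin down one immutable instance. The `path` method and the `/lake` root are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MobileAdImpressions:
    """Dataset class; each instance is fixed by its parameters."""

    hour: str  # e.g. "2016-02-06T13"

    def path(self, root="/lake"):
        # Hypothetical layout: <root>/<dataset class>/hour=<hour>
        return f"{root}/{type(self).__name__}/hour={self.hour}"


instance = MobileAdImpressions(hour="2016-02-06T13")
location = instance.path()
```

Two instances with the same parameters compare equal, so the parameters alone identify the data.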
Representation - data lake & pipes
● Directory with multiple files
○ Parallel processing
○ Sealed with _SUCCESS (Hadoop convention)
○ Bundled schema format
■ JSON lines, Avro, Parquet
○ Avoid old, inadequate formats
○ RPC formats lack bundled schema
■ Protobuf, Thrift
Directory datasets
● Some tools, e.g. Spark, understand Hive naming conventions
Instance parameters, Hive convention
Seal
Partitions
Privacy
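A sketch of a directory dataset with Hive-style `key=value` path segments, part files, and a `_SUCCESS` seal written last. The dataset name `pageviews`, the helper names, and the local temporary directory are assumptions for illustration; a real lake would sit on cluster storage.

```python
import json
import tempfile
from pathlib import Path


def write_dataset(root, dataset, params, records, num_parts=2):
    """Write records as JSON lines under a Hive-style directory, then seal it."""
    partition = "/".join(f"{k}={v}" for k, v in params.items())
    directory = Path(root) / dataset / partition
    directory.mkdir(parents=True)
    for part in range(num_parts):
        lines = [json.dumps(r) for r in records[part::num_parts]]
        (directory / f"part-{part:05d}.json").write_text("\n".join(lines))
    (directory / "_SUCCESS").touch()  # the seal is written last
    return directory


def is_sealed(directory):
    """Readers should only consume directories that carry the seal."""
    return (Path(directory) / "_SUCCESS").exists()


root = tempfile.mkdtemp()
out = write_dataset(root, "pageviews", {"year": 2016, "month": 2, "day": 6},
                    [{"user": "a"}, {"user": "b"}, {"user": "c"}])
```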
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write full DB table / CF, switch service to use it, never change it.
● There is always a schema
○ Plan your evolution
● New field, same semantic == compatible change
● Incompatible schema change => new dataset class
● Schema on read - assumptions in code
○ Dynamic typing
○ Quick schema changes possible
● Schema on write - enumerated fields
○ Static typing & code generation possible
○ Changes must propagate down pipeline code
Schema on read or write?
Change agility important here
Production stability important here
Batch processing
Gradual refinement
1. Wash
- time shuffle, dedup, ...
2. Decorate
- geo, demographic, ...
3. Domain model
- similarity, clusters, ...
4. Application model
- Recommendations, ...
Data lake
Artifact of business value
E.g. service index
Batch job code
● Components should scale up
○ Spark, (Scalding, Crunch)
● And scale down
○ More important!
○ Component should support local mode
■ Integration tests
■ Small jobs - less risk, easier debugging
Language choice
● People and community thing, not a technical thing
● Need for simple & quick experiments
○ Java - too much ceremony and boilerplate
● Stable and static enough for production
○ Python/R - too dynamic
● Scala connects both worlds
○ Current home of data innovation
● Beware of complexity - keep it sane and simple
○ Avoid spaceships: <|*|> |@| <**>
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors
● No side-effects
● Ideally: atomic, deterministic, idempotent
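A small sketch of a job written as a function from input datasets to an output dataset. The dedup logic, file layout, and names are illustrative assumptions; the point is only how determinism, atomicity, and idempotency can show up in code.

```python
import json
import os
import tempfile


def dedup_job(input_paths, output_path):
    """Job == function([input datasets]): [output dataset].

    Deterministic: output depends only on the inputs and is sorted.
    Atomic: written to a temporary file, then renamed into place.
    Idempotent: rerunning overwrites the output with identical content.
    """
    seen = {}
    for path in input_paths:
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                seen.setdefault(event["id"], event)  # keep first occurrence
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        for event_id in sorted(seen):
            f.write(json.dumps(seen[event_id], sort_keys=True) + "\n")
    os.replace(tmp_path, output_path)


workdir = tempfile.mkdtemp()
in_a = os.path.join(workdir, "a.json")
in_b = os.path.join(workdir, "b.json")
out = os.path.join(workdir, "out.json")
with open(in_a, "w") as f:
    f.write('{"id": 2, "v": "x"}\n{"id": 1, "v": "y"}\n')
with open(in_b, "w") as f:
    f.write('{"id": 2, "v": "x"}\n')

dedup_job([in_a, in_b], out)
with open(out) as f:
    first = f.read()
dedup_job([in_a, in_b], out)  # rerun with the same inputs
with open(out) as f:
    second = f.read()
```

Invocation, scheduling, and the choice of paths stay outside the function, which is what keeps it easy to test.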
Batch job
● Pipeline equivalent of Command pattern
● Parameterised
○ Higher order, cf. dataset class & instance
○ Job instance == job class + parameters
○ Inputs & outputs are dataset classes
● Instances are ideally executed when input appears
○ Not on cron schedule
Batch job class & instance
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism
● Backfill missing / failed
● Eventual correctness
Cluster storage
Data lake
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
● Includes ingress & egress
Luigi, (Airflow, Pinball)
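The "build tool" rule can be sketched in a few lines. This is not how Luigi is implemented; `Task`, `build`, and the dataset names are hypothetical, and resource availability is left out.

```python
class Task:
    """Hypothetical minimal task: a dataset name and its dependencies."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)


def build(task, completed, external, log):
    """Run `task` and its missing dependencies, depth first.

    `completed` holds the names of datasets that already exist.
    `external` holds names that only ingress can produce.
    Returns True if the task's output exists afterwards.
    """
    if task.name in completed:
        return True  # output present: nothing to do
    if task.name in external:
        return False  # input has not arrived yet
    # Build every dependency, even if one is missing, so backfill progresses.
    if not all([build(dep, completed, external, log) for dep in task.requires]):
        return False
    log.append(task.name)  # stand-in for actually running the job
    completed.add(task.name)
    return True


actions = Task("Actions/13")
client_actions = Task("ClientActions/13", [actions])
sessions = Task("ClientSessions/13", [client_actions])

completed, external, log = set(), {"Actions/13"}, []
first = build(sessions, completed, external, log)   # input not there yet
completed.add("Actions/13")                         # ingress delivers the hour
second = build(sessions, completed, external, log)  # now both jobs run
third = build(sessions, completed, external, log)   # output exists: no-op
```

Calling `build` repeatedly is safe, which is what makes a redundant schedule plus backfill workable.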
ClientSessions A/B tests
DSL DAG example (Luigi)
from datetime import timedelta
from luigi import DateHourParameter
from luigi.contrib.hdfs import HdfsTarget
from luigi.contrib.spark import SparkSubmitTask

class ClientActions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 12)] + ...  # rest cut off in the slide

class ClientSessions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]

class SessionsABResults(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]
    def output(self):
        return HdfsTarget("hdfs://production/red/ab_sessions/v1/" + ...)  # path suffix cut off in the slide
Time shuffle,
user decorate
Form sessions
A/B compare
A/B session
● Expressive, embedded DSL - a must for ingress, egress
○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
Dataset instance
Job (aka Task) classes
Egress datasets
● Serving
○ Precomputed user query answers
○ Denormalised
○ Cassandra, (many)
● Export & Analytics
○ SQL (single node / Hive, Presto, ..)
○ Workbenches (Zeppelin)
○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
○ Prepare to redirect pipelines
Test strategy considerations
● Developer productivity is the primary value of test
● Test at stable interface
○ Minimal maintenance
○ No barrier to refactorings
● Focus: single job + end to end
○ Jobs & pipelines are pure functions - easy to test
● Component, unit - only if necessary
○ Avoid dependency injection ceremony
Testing single job
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run in local mode 3. Verify output
f() p()
● (Tool-specific frameworks, e.g. for Spark?)
○ Usable, but rarely cover I/O - home of many bugs.
○ Tied to processing technology
Don’t commit -
expensive to maintain.
Generate / verify with
Runs well in
CI / from IDE
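The three steps (generate input, run in local mode, verify output) can be sketched as follows. The slide uses a Scalatest harness; this is a Python stand-in under assumptions, with a hypothetical job `count_per_user` and made-up helper names.

```python
import json
import tempfile
from pathlib import Path


def count_per_user(input_dir, output_dir):
    """Job under test (illustrative): count events per user."""
    counts = {}
    for part in sorted(Path(input_dir).glob("part-*")):
        for line in part.read_text().splitlines():
            user = json.loads(line)["user"]
            counts[user] = counts.get(user, 0) + 1
    out = Path(output_dir)
    out.mkdir(parents=True)
    (out / "part-00000").write_text(
        "".join(json.dumps({"user": u, "count": c}) + "\n"
                for u, c in sorted(counts.items())))
    (out / "_SUCCESS").touch()


def generate_input(input_dir):
    # 1. Generate input with code instead of committing fixture files.
    Path(input_dir).mkdir(parents=True)
    events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
    (Path(input_dir) / "part-00000").write_text(
        "".join(json.dumps(e) + "\n" for e in events))


def read_output(output_dir):
    part = Path(output_dir) / "part-00000"
    return [json.loads(line) for line in part.read_text().splitlines()]


def test_count_per_user():
    with tempfile.TemporaryDirectory() as tmp:
        generate_input(f"{tmp}/test_input")
        # 2. Run the job locally against real files, so I/O is covered.
        count_per_user(f"{tmp}/test_input", f"{tmp}/test_output")
        # 3. Verify the output through the same file interface.
        result = read_output(f"{tmp}/test_output")
        assert {"user": "a", "count": 2} in result
        assert {"user": "b", "count": 1} in result
        return result


result = test_count_per_user()
```

Because the test only touches input and output files, the job's internals can be refactored without changing the test.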
Testing pipelines - two options
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run custom multi-job
3. Verify output
f() p()
Test job with sequence of jobs
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job
Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with Kafka, egress DBs
repo Luigi DSL, jars, config
> pip install my-pipe-7.tar.gz
Redundant cron schedule, higher frequency + backfill (Luigi range tools)
* 10 * * * bin/my_pipe_daily --backfill 14
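What a backfill window might compute, as a hedged sketch: the dates in the last N days whose output is still missing. How `bin/my_pipe_daily` or the Luigi range tools actually interpret the window is not shown in the slides; `dates_to_build` is a hypothetical helper, and the assumption here is that the window covers the N days before today.

```python
from datetime import date, timedelta


def dates_to_build(today, backfill_days, existing):
    """Return dates in the backfill window with missing output, oldest first."""
    window = [today - timedelta(days=n) for n in range(backfill_days, 0, -1)]
    return [d for d in window if d not in existing]


existing = {date(2016, 2, 4)}  # output already built for this day
todo = dates_to_build(date(2016, 2, 6), 3, existing)
```

Since only missing outputs are scheduled, running the same command many times per day is harmless.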
All that a pipeline needs, installed atomically
Continuous deployment
● Poll and pull latest on worker nodes
○ virtualenv package/version
■ No need to sync environment & versions
○ Cron package/latest/bin/*
■ Old versions run pipelines to completion, then exit
repo Luigi DSL, jars, config
HDFS hdfs://pipelines/
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz
* 10 * * * my_pipe/7/bin/*
Start lean: assess needs
Your data & your jobs:
A. Fit in one machine, and will continue to do so
B. Fit in one machine, but grow faster than Moore’s law
C. Do not fit in one machine
● Most datasets / jobs: A
○ Even at large companies with millions of users
● cost(C) >> cost(A)
● Running A jobs on C infrastructure is expensive
Lean MVP
● Start simple, lean, end-to-end
○ No parallel cluster computations necessary?
○ Custom jobs or local Spark/Scalding/Crunch
● Shrink data
○ Downsample
○ Approximate algorithms (e.g. Count-min sketch)
● Get workflows running
○ Serial jobs on one/few machines
○ Simple job control (Luigi only / simple work queue)
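As an example of shrinking data with an approximate algorithm, here is a minimal Count-min sketch. The width, depth, and the SHA-256 based hashing are illustrative choices, not tuned values.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best bound.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for user in ["a"] * 100 + ["b"] * 3:
    sketch.add(user)
estimate_a = sketch.estimate("a")
```

Memory use is fixed by `width * depth` regardless of how many distinct items are counted, which is what lets a job stay on one machine.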
Scale carefully
● Get end-to-end workflows in production for evaluation
○ Improvements driven by business value, not tech
● Keep focus small
○ Business value
○ Privacy needs attention early
● Keep iterations swift
○ Integration test end-to-end
○ Efficient code/test/deploy cycle
● Parallelise jobs only when forced
Protecting privacy in practice
● Removing old personally identifiable information (PII)
● Right to be forgotten
● Access control to PII data
● Audit of access and processing
● PII content definition is application-specific
● PII handling subject to business priorities
○ But you should have a plan from day one
Data lake Derived
Data retention
● Remove old, promote derived datasets to lake
Cluster storage
Data lake Derived
Cluster storage
PII removal
● Must rebuild downstream datasets regularly
○ In order for PII to be washed in x days
Split out PII,
wash on user
Key on PII => difficult to wash
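A sketch of why splitting out PII helps, under assumptions: events are keyed on an opaque user id, PII lives in a separate table, and derived datasets are rebuilt from both. The field names and the choice to drop a washed user's rows entirely are illustrative.

```python
def decorate(events, pii_by_user):
    """Rebuild a derived dataset by joining events with the current PII table."""
    return [
        {**event, "email": pii_by_user[event["user_id"]]["email"]}
        for event in events
        if event["user_id"] in pii_by_user
    ]


events = [  # keyed on an opaque id, contains no PII itself
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u2", "page": "/cart"},
]
pii_by_user = {
    "u1": {"email": "a@example.com"},
    "u2": {"email": "b@example.com"},
}

before = decorate(events, pii_by_user)
del pii_by_user["u2"]                  # wash: user u2 asked to be forgotten
after = decorate(events, pii_by_user)  # the next regular rebuild omits u2
```

If the events were keyed on the email address itself, washing would mean rewriting every raw dataset instead of deleting one row.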
Simple PII audit
● Classify PII level
○ Name, address, messages, ...
○ IP, city, ...
○ Total # page views, …
● Tag datasets and jobs in code
● Manual access through gateway tool
○ Verify permission, log
○ Dedicated machines only
● Log batch jobs
○ Deploy with CD only, log hg/git commit hash
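Tagging datasets in code and routing manual access through a gateway could look roughly like this. The level names follow the red/yellow/green classification from the bonus slides; the dataset names, `open_dataset`, and the clearance table are hypothetical.

```python
from enum import Enum


class PiiLevel(Enum):
    RED = "red"        # name, address, messages
    YELLOW = "yellow"  # IP, city
    GREEN = "green"    # aggregates such as total page views


DATASET_LEVELS = {  # hypothetical tags, kept in code next to the jobs
    "UserProfiles": PiiLevel.RED,
    "PageViewsByCity": PiiLevel.YELLOW,
    "DailyPageViewTotals": PiiLevel.GREEN,
}


def open_dataset(user, dataset, clearances, audit_log):
    """Gateway: check permission for the dataset's level and log the attempt."""
    level = DATASET_LEVELS[dataset]
    allowed = level in clearances.get(user, set())
    audit_log.append((user, dataset, level.value, allowed))
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    return f"handle:{dataset}"  # stand-in for a real reader


clearances = {"alice": {PiiLevel.GREEN, PiiLevel.YELLOW}}
audit_log = []
handle = open_dataset("alice", "PageViewsByCity", clearances, audit_log)
try:
    open_dataset("alice", "UserProfiles", clearances, audit_log)
except PermissionError:
    denied = True
```

Denied attempts are logged as well, so the audit trail covers both access and refusals.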
Parting words + sales plug
Keep things simple; batch, homogeneity & little state
Focus on developer code, test, debug cycle - end to end
Harmony with technical ecosystems
Little technology overlap with yesterday - follow leaders
Plan early: Privacy, retention, audit, schema evolution
Please give feedback --
I help companies plan and build these things
Bonus slides
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in
Cloud or not?
● Afterthought add-on for big data components
○ E.g. Kerberos support
○ Always trailing - difficult to choose global paradigm
● Container security simpler
○ Easy with cloud
○ Immature with on-premise solutions?
Data pipelines example
Views with
Sales with
Views with
Raw Derived
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Data pipelines team organisation
Conway’s law
“Organizations which design systems ... are
constrained to produce designs which are
copies of the communication structures of
these organizations.”
Better organise to match desired design, then.
Personae - important characteristics
- Technology updated
- Holistic: productivity, privacy
- Identify and facilitate governance
Backend developer
- Simplicity oriented
- Engineering practices obsessed
- Adapt to data world
Product owner
- Trace business value to
upstream design
- Find most ROI through difficult
- Explain what and why
- Facilitate process to determine how
- Enable, enable, enable
- Always increase automation
- Enable, don’t control
Data scientist
- Capable programmer
- Product oriented
Protect production servers
Cluster storage
(NFS, S3, Google CS, C*)
DB offline
+ Online service is safe
- Replication may be out of sync
- Cluster storage may be write unavailable
=> Delayed, inaccurate snapshot
Deterministic slaves
+ Standard procedure
- Serial or resource consuming
+ Deterministic
- Ad-hoc solution
- Serial => not scalable
controlled replay
PII privacy control
● Simplify with coarse classification (red/yellow/green)
○ Datasets, potentially fields
○ Separate production areas
● Log batch jobs
○ Code checksum -> commit id -> source code
○ Tag job class with classification
■ Aids PII consideration in code review
■ Enables ad-hoc verification
● Audit manual access
● Wrap all functionality in gateway tool
○ Log datasets, output, code used
○ Disallow download to laptop
○ Wrapper tool happens to be great for enabling data scientists, too - shields them from operations.

More Related Content

What's hot

Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfMichael Kogan
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksDatabricks
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasDataWorks Summit/Hadoop Summit
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks

What's hot (20)

Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
Spark SQL
Spark SQLSpark SQL
Spark SQL
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Big Data
Big DataBig Data
Big Data

Viewers also liked

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)Mathieu Bastian
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs

Viewers also liked (9)

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak

Similar to Data pipelines from zero to solid

Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxXinliShang1
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataDatabricks
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...Facultad de Informática UCM
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson

Similar to Data pipelines from zero to solid (20)

Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify

Data pipelines from zero to solid

  • 1. Data pipelines from zero to solid
    Lars Albertsson
  • 2. Who’s talking?
    Swedish Institute of Computer Science (test tools)
    Sun Microsystems (very large machines)
    Google (Hangouts, productivity)
    Recorded Future (NLP startup)
    Cinnober Financial Tech. (trading systems)
    Spotify (data processing & modelling)
    Schibsted (data processing & modelling)
    Independent data engineering consultant
  • 3. Presentation goals
    ● Overview of data pipelines for analytics / data products
    ● Target audience: Big data starters
      ○ Seen wordcount, need the stuff around
    ● Overview of necessary components & wiring
    ● Base recipe
      ○ In vicinity of state-of-practice
      ○ Baseline for comparing design proposals
    ● Subjective best practices - not single truth
    ● Technology suggestions, (alternatives)
  • 4. Presentation non-goals
    ● Stream processing
      ○ High complexity in practice
      ○ Batch processing yields > 90% of value
    ● Technology enumeration or (fair) comparison
    ● Writing data processing code
      ○ Already covered en masse
  • 5. Data product anatomy
    [Diagram. Stages: Ingress, ETL, Egress. Labels: Service, DB, Unified log, Cluster storage, Data lake, Dataset, Job, Pipeline, Export, Business intelligence]
  • 6. Computer program anatomy
    [Diagram. Stages: Input data, Process, Output data. Labels: File, HID, RAM, Variable, Function, Execution path, Lookup structure, File, Window]
  • 7. Data pipeline = yet another program
    Don’t veer from best practices
    ● Regression testing
    ● Design: Separation of concerns, modularity, etc.
    ● Process: CI/CD, code review, lint tools
    ● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
    In data engineering, slipping is the norm... :-(
    Solved by mixing strong software engineers with data engineers/scientists. Mutual respect is crucial.
  • 8. Event collection
    [Diagram labels: Service; Unreliable; Reliable, simple, write available; Bus with history: Kafka (Kinesis, Google Pub/Sub); (Secor, Camus); Cluster storage: HDFS (NFS, S3, Google CS, C*)]
    Immediate handoff to append-only replicated log.
    Once in the log, events eventually arrive in storage.
    Unified log: immutable events, append-only, source of truth
  • 9. Event registration
    [Diagram labels: Service (unimportant); Service (important); Unified log; Replicated bus with history; Events are safe from here]
    Asynchronous fire-and-forget handoff for unimportant data.
    Synchronous, replicated, with ack for important data.
  • 10. Event transportation
    [Diagram labels: Bus-to-bus WAN mirror, expect delays; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
    Log has long history (months+) => robustness end to end.
    Avoid risk of processing & decoration. Except timestamps.
  • 11. Event arrival
    Bundle incoming events into datasets
    ● Sealed quickly, thereafter immutable
    ● Bucket on arrival / wall-clock time
    ● Predictable bucketing, e.g. hour
    (Secor, Camus)
    Cluster storage:
      clicks/2016/02/08/14
      clicks/2016/02/08/15
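A minimal sketch of the hourly bucketing described on this slide, assuming events are bucketed on UTC arrival time. The function name `arrival_bucket` is hypothetical; only the `clicks/2016/02/08/14` layout comes from the slide.

```python
from datetime import datetime, timezone


def arrival_bucket(topic: str, arrival_time: datetime) -> str:
    """Return the hourly dataset path for an event, keyed on arrival time."""
    # Bucket on wall-clock arrival time (normalised to UTC here, an assumption),
    # not on the event's own timestamp, so the bucket for an hour can be
    # sealed soon after that hour has passed.
    utc = arrival_time.astimezone(timezone.utc)
    return f"{topic}/{utc:%Y/%m/%d/%H}"


arrived = datetime(2016, 2, 8, 14, 37, tzinfo=timezone.utc)
print(arrival_bucket("clicks", arrived))  # clicks/2016/02/08/14
```

Late events land in the bucket of the hour they arrive in, which keeps sealed buckets immutable; reordering by event time is left to a later washing job.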
  • 12. Database state collection
    [Diagram labels: Service, DB; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
    Source of truth sometimes in database.
    Snapshot to cluster storage.
    Easy on surface... ?
  • 13. Anti-pattern: Send the oliphants!
    ● Sqoop (dump with MapReduce) production DB
    ● MapReduce from production API
    Hadoop / Spark == internal DDoS service
    [Diagram labels: Service, DB ("Our preciousss"); Cluster storage: HDFS (NFS, S3, Google CS, C*)]
  • 14. Deterministic slaves
    Restore backup to offline slave
    + Standard procedure
    - Serial or resource consuming
    [Diagram labels: Service, DB, backup, Restore, DB, snapshot; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
  • 15. Using snapshots
    ● join(event, snapshot) => always time mismatch
    ● Usually acceptable
    ● Some behaviour difficult to catch with snapshots
      ○ E.g. user creates, then deletes account
    [Diagram labels: DB, DB’, join?]
  • 16. Event sourcing
    ● Every change to unified log == source of truth
    ● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
    ● Allows view & join at any point in time
    Application services still need DB for current state lookup
    [Diagram labels: DB, DB’]
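The recurrence snapshot(t + 1) = sum(snapshot(t), events(t, t+1)) can be sketched as a fold over events. The event shape below, a `(user_id, state)` pair with `None` meaning deletion, is an assumption made for illustration and is not from the talk.

```python
from typing import Dict, Iterable, Optional, Tuple

Snapshot = Dict[str, dict]
# Hypothetical event shape: (user_id, new_state); None means the user was deleted.
Event = Tuple[str, Optional[dict]]


def next_snapshot(snapshot: Snapshot, events: Iterable[Event]) -> Snapshot:
    """Apply the events between t and t+1, in order, to snapshot(t)."""
    result = dict(snapshot)  # never mutate the input dataset
    for user_id, state in events:
        if state is None:
            result.pop(user_id, None)
        else:
            result[user_id] = state
    return result


snapshot_t = {"alice": {"city": "Bath"}}
events = [
    ("bob", {"city": "Bristol"}),  # bob creates an account...
    ("bob", None),                 # ...and deletes it before the next snapshot
    ("alice", {"city": "Leeds"}),
]
print(next_snapshot(snapshot_t, events))  # {'alice': {'city': 'Leeds'}}
```

The create-then-delete case is invisible when comparing the two snapshots, but it is still present in the event log, which is why the log rather than the snapshot is treated as the source of truth.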
  • 17. Event sourcing, synced database
    A. Service interface generates events and DB transactions
    B. Generate stream from commit log
       Postgres, MySQL -> Kafka
    C. Build DB with stream processing
    [Diagram labels: API, API, API]
  • 18. DB snapshot lessons learnt
    ● Put fences between online and offline components
      ○ The latter can kill the former
    ● Team that owns a database/service must own exporting data to offline
      ○ Protect online stability
      ○ Affects choice of DB technology
  • 19. The data lake
    Unified log + snapshots
    ● Immutable datasets
    ● Raw, unprocessed
    ● Source of truth from batch processing perspective
    ● Kept as long as permitted
    ● Technically homogeneous
    [Diagram labels: Cluster storage, Data lake]
  • 20. Datasets
    ● Pipeline equivalent of objects
    ● Dataset class == homogeneous records, open-ended
      ○ Compatible schema
      ○ E.g. MobileAdImpressions
    ● Dataset instance = dataset class + parameters
      ○ Immutable
      ○ E.g. MobileAdImpressions(hour="2016-02-06T13")
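One way to picture the class / instance distinction is a frozen dataclass: the class names the kind of records, and the parameters pick out one immutable instance. This is an illustrative sketch only; the `path` layout is hypothetical and not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MobileAdImpressions:
    """Dataset class; an instance is the class plus its parameters."""

    hour: str  # e.g. "2016-02-06T13"

    def path(self) -> str:
        # Hypothetical storage layout, for illustration only.
        return f"mobile_ad_impressions/hour={self.hour}"


a = MobileAdImpressions(hour="2016-02-06T13")
b = MobileAdImpressions(hour="2016-02-06T13")
print(a == b)    # True: same class and parameters name the same dataset instance
print(a.path())  # mobile_ad_impressions/hour=2016-02-06T13
```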
  • 21. Representation - data lake & pipes
    ● Directory with multiple files
      ○ Parallel processing
      ○ Sealed with _SUCCESS (Hadoop convention)
      ○ Bundled schema format
        ■ JSON lines, Avro, Parquet
      ○ Avoid old, inadequate formats
        ■ CSV, XML
      ○ RPC formats lack bundled schema
        ■ Protobuf, Thrift
  • 22. Directory datasets
    hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
        _SUCCESS
        part-00000.json
        part-00001.json
    Path components: privacy level (red), dataset class (pageviews), schema version (v1), instance parameters in Hive convention (country=se/year=2015/month=11/day=4).
    Directory contents: seal (_SUCCESS), partitions (part-*.json).
    ● Some tools, e.g. Spark, understand Hive name conventions
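A small sketch of this layout, using the local filesystem as a stand-in for HDFS. The helper names (`dataset_path`, `is_sealed`, `seal`) are hypothetical; the path structure and the `_SUCCESS` marker follow the slide.

```python
import os
import tempfile
from datetime import date


def dataset_path(root: str, privacy: str, name: str, version: int,
                 country: str, day: date) -> str:
    """Build a dataset instance directory using Hive-style key=value names."""
    # Values are not zero-padded, matching the example path on the slide.
    return (f"{root}/{privacy}/{name}/v{version}/"
            f"country={country}/year={day.year}/month={day.month}/day={day.day}")


def is_sealed(path: str) -> bool:
    """A dataset is complete only once the _SUCCESS marker exists."""
    return os.path.exists(os.path.join(path, "_SUCCESS"))


def seal(path: str) -> None:
    """Write the empty marker file last, after all partitions are in place."""
    open(os.path.join(path, "_SUCCESS"), "w").close()


with tempfile.TemporaryDirectory() as root:
    path = dataset_path(root, "red", "pageviews", 1, "se", date(2015, 11, 4))
    os.makedirs(path)
    print(is_sealed(path))  # False: consumers should not read it yet
    seal(path)
    print(is_sealed(path))  # True
```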
  • 23. Ingress / egress representation
    Larger variation:
    ● Single file
    ● Relational database table
    ● Cassandra column family, other NoSQL
    ● BI tool storage
    ● BigQuery, Redshift, ...
    Egress datasets are also atomic and immutable.
    E.g. write full DB table / CF, switch service to use it, never change it.
  • 24. Schemas
    ● There is always a schema
      ○ Plan your evolution
    ● New field, same semantic == compatible change
    ● Incompatible schema change => new dataset class
    ● Schema on read - assumptions in code
      ○ Dynamic typing
      ○ Quick schema changes possible
    ● Schema on write - enumerated fields
      ○ Static typing & code generation possible
      ○ Changes must propagate down pipeline code
  • 25. Schema on read or write?
    [Diagram labels: DB, DB, DB, Service, Service, Export, Business intelligence; "Change agility important here"; "Production stability important here"]
  • 26. Batch processing
    Gradual refinement
    1. Wash - time shuffle, dedup, ...
    2. Decorate - geo, demographic, ...
    3. Domain model - similarity, clusters, ...
    4. Application model - Recommendations, ...
    [Diagram labels: Data lake; Job; Pipeline; Artifact of business value, e.g. service index]
  • 27. Batch job code
    ● Components should scale up
      ○ Spark, (Scalding, Crunch)
    ● And scale down
      ○ More important!
      ○ Component should support local mode
        ■ Integration tests
        ■ Small jobs - less risk, easier debugging
  • 28. Language choice
    ● People and community thing, not a technical thing
    ● Need for simple & quick experiments
      ○ Java - too much ceremony and boilerplate
    ● Stable and static enough for production
      ○ Python/R - too dynamic
    ● Scala connects both worlds
      ○ Current home of data innovation
    ● Beware of complexity - keep it sane and simple
      ○ Avoid spaceships: <|*|> |@| <**>
  • 29. Batch job
    Job == function([input datasets]): [output datasets]
    ● No orthogonal concerns
      ○ Invocation
      ○ Scheduling
      ○ Input / output location
    ● Testable
    ● No other input factors
    ● No side-effects
    ● Ideally: atomic, deterministic, idempotent
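To make the "job == function" idea concrete, here is a sketch of a washing job written as a pure function over in-memory datasets. The job name and the `id` field are hypothetical; where the datasets live and when the job runs are deliberately left outside the function.

```python
from typing import Iterable, List


def dedup_job(inputs: List[Iterable[dict]]) -> List[List[dict]]:
    """Wash step: drop repeated events (same id) across all input datasets."""
    seen = set()
    output = []
    for dataset in inputs:
        for record in dataset:
            if record["id"] not in seen:
                seen.add(record["id"])
                output.append(record)
    # Sort so that the output does not depend on input ordering (determinism).
    return [sorted(output, key=lambda r: r["id"])]


first = dedup_job([[{"id": 2}, {"id": 1}], [{"id": 1}]])
print(first)                      # [[{'id': 1}, {'id': 2}]]
print(dedup_job(first) == first)  # True: running it again changes nothing
```

Because the function reads nothing but its arguments and writes nothing but its return value, rerunning it after a failure or for a backfill is safe.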
  • 30. Batch job class & instance
    ● Pipeline equivalent of Command pattern
    ● Parameterised
      ○ Higher order, c.f. dataset class & instance
      ○ Job instance == job class + parameters
      ○ Inputs & outputs are dataset classes
    ● Instances are ideally executed when input appears
      ○ Not on cron schedule
  • 31. Pipelines
    ● Things will break
      ○ Input will be missing
      ○ Jobs will fail
      ○ Jobs will have bugs
    ● Datasets must be rebuilt
    ● Determinism, idempotency
    ● Backfill missing / failed
    ● Eventual correctness
    [Diagram labels: Cluster storage; Data lake: pristine, immutable datasets; Intermediate; Derived: regenerable]
  • 32. Workflow manager
    ● Dataset “build tool”
    ● Run job instance when
      ○ input is available
      ○ output missing
      ○ resources are available
    ● Backfill for previous failures
    ● DSL describes DAG
    ● Includes ingress & egress
    Luigi, (Airflow, Pinball)
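The scheduling rule above (run when input is available and output is missing, and backfill earlier gaps) can be sketched in a few lines of plain Python. This is not Luigi's API; the function names and the callback-based existence checks are assumptions for illustration, and resource availability is left out.

```python
from datetime import datetime, timedelta
from typing import Callable, List

Exists = Callable[[datetime], bool]


def should_run(hour: datetime, input_exists: Exists, output_exists: Exists) -> bool:
    """Build-tool rule: run only if the input is there and the output is not."""
    return input_exists(hour) and not output_exists(hour)


def backfill(latest: datetime, lookback_hours: int,
             input_exists: Exists, output_exists: Exists) -> List[datetime]:
    """Job instances to run now, including gaps left by earlier failures."""
    window = [latest - timedelta(hours=h) for h in range(lookback_hours)]
    return sorted(h for h in window if should_run(h, input_exists, output_exists))


inputs = {datetime(2016, 2, 8, h) for h in (13, 14, 15)}  # 12:00 input missing
outputs = {datetime(2016, 2, 8, 14)}                      # 13:00 job failed earlier
todo = backfill(datetime(2016, 2, 8, 15), 4, inputs.__contains__, outputs.__contains__)
print([t.hour for t in todo])  # [13, 15]
```

Running this check repeatedly is harmless: once an output exists, the instance drops out of the list, which is what makes a redundant schedule safe.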
  • 33. DSL DAG example (Luigi)
    [Diagram: Actions, UserDB -> Time shuffle, user decorate -> ClientActions -> Form sessions -> ClientSessions; ClientSessions, A/B tests -> A/B compare -> A/B session evaluation. Annotations: Dataset instance; Job (aka Task) classes]

        from datetime import timedelta

        from luigi import DateHourParameter
        from luigi.contrib.hdfs import HdfsTarget
        from luigi.contrib.spark import SparkSubmitTask


        class ClientActions(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                # The UserDB arguments are truncated in the source.
                return [Actions(hour=self.hour - timedelta(hours=h))
                        for h in range(0, 12)] + [UserDB(...)]
            ...


        class ClientSessions(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                return [ClientActions(hour=self.hour - timedelta(hours=h))
                        for h in range(0, 3)]
            ...


        class SessionsABResults(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]

            def output(self):
                return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                                  "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
            ...

    ● Expressive, embedded DSL - a must for ingress, egress
      ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
  • 34. Egress datasets
    ● Serving
      ○ Precomputed user query answers
      ○ Denormalised
      ○ Cassandra, (many)
    ● Export & Analytics
      ○ SQL (single node / Hive, Presto, ..)
      ○ Workbenches (Zeppelin)
      ○ (Elasticsearch, proprietary OLAP)
    ● BI / analytics tool needs change frequently
      ○ Prepare to redirect pipelines
  • 35. Test strategy considerations
    ● Developer productivity is the primary value of test automation
    ● Test at stable interface
      ○ Minimal maintenance
      ○ No barrier to refactorings
    ● Focus: single job + end to end
      ○ Jobs & pipelines are pure functions - easy to test
    ● Component, unit - only if necessary
      ○ Avoid dependency injection ceremony
  • 36. Testing single job
    [Diagram: Standard Scalatest harness; file://test_input/ -> Job -> file://test_output/]
    1. Generate input, f()
    2. Run in local mode
    3. Verify output, p()
    Don’t commit - expensive to maintain. Generate / verify with code.
    Runs well in CI / from IDE
    ● (Tool-specific frameworks, e.g. for Spark?)
      ○ Usable, but rarely cover I/O - home of many bugs.
      ○ Tied to processing technology
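The three steps can be sketched with Python's standard library in place of the Scalatest harness the slide mentions. The job under test, `count_by_country_job`, is a hypothetical stand-in; the point is that the test goes through real files, so the I/O path is covered too.

```python
import json
import os
import tempfile


def count_by_country_job(input_dir: str, output_dir: str) -> None:
    """Hypothetical job under test: count page views per country."""
    counts = {}
    for name in sorted(os.listdir(input_dir)):
        if not name.startswith("part-"):
            continue
        with open(os.path.join(input_dir, name)) as f:
            for line in f:
                country = json.loads(line)["country"]
                counts[country] = counts.get(country, 0) + 1
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "part-00000.json"), "w") as f:
        for country in sorted(counts):
            f.write(json.dumps({"country": country, "views": counts[country]}) + "\n")
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()  # seal last


def run_test() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = os.path.join(tmp, "test_input")
        output_dir = os.path.join(tmp, "test_output")
        os.makedirs(input_dir)
        # 1. Generate input with code instead of committing fixture files.
        with open(os.path.join(input_dir, "part-00000.json"), "w") as f:
            for country in ["se", "uk", "se"]:
                f.write(json.dumps({"country": country}) + "\n")
        # 2. Run the job locally, through its file interface.
        count_by_country_job(input_dir, output_dir)
        # 3. Verify the output with code.
        assert os.path.exists(os.path.join(output_dir, "_SUCCESS"))
        with open(os.path.join(output_dir, "part-00000.json")) as f:
            return [json.loads(line) for line in f]


print(run_test())  # [{'country': 'se', 'views': 2}, {'country': 'uk', 'views': 1}]
```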
  • 37. Testing pipelines - two options
    A: Standard Scalatest harness; file://test_input/ -> test job with sequence of jobs -> file://test_output/
      1. Generate input, f()
      2. Run custom multi-job
      3. Verify output, p()
      + Runs in CI
      + Runs in IDE
      + Quick setup
      - Multi-job maintenance
    B: Customised workflow manager setup, with f() and p()
      + Tests workflow logic
      + More authentic
      - Workflow mgr setup for testability
      - Difficult to debug
      - Dataset handling with Python
    ● Both can be extended with Kafka, egress DBs
  • 38. Deployment
    [Diagram labels: Hg/git repo (Luigi DSL, jars, config); my-pipe-7.tar.gz; HDFS; Luigi daemon; Workers]
    All that a pipeline needs, installed atomically
    > pip install my-pipe-7.tar.gz
    Redundant cron schedule, higher frequency + backfill (Luigi range tools)
    * 10 * * * bin/my_pipe_daily --backfill 14
  • 39. Continuous deployment
    ● Poll and pull latest on worker nodes
      ○ virtualenv package/version
        ■ No need to sync environment & versions
      ○ Cron package/latest/bin/*
        ■ Old versions run pipelines to completion, then exit
    [Diagram labels: Hg/git repo (Luigi DSL, jars, config); my-pipe-7.tar.gz; HDFS, hdfs://pipelines/; Worker]
    > virtualenv my_pipe/7
    > pip install my-pipe-7.tar.gz
    * 10 * * * my_pipe/7/bin/*
  • 40. Start lean: assess needs
    Your data & your jobs:
    A. Fit in one machine, and will continue to do so
    B. Fit in one machine, but grow faster than Moore’s law
    C. Do not fit in one machine
    ● Most datasets / jobs: A
      ○ Even at large companies with millions of users
    ● cost(C) >> cost(A)
    ● Running A jobs on C infrastructure is expensive
  • 41. Lean MVP
    ● Start simple, lean, end-to-end
      ○ No parallel cluster computations necessary?
      ○ Custom jobs or local Spark/Scalding/Crunch
    ● Shrink data
      ○ Downsample
      ○ Approximate algorithms (e.g. Count-min sketch)
    ● Get workflows running
      ○ Serial jobs on one/few machines
      ○ Simple job control (Luigi only / simple work queue)
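As an illustration of the approximate algorithms mentioned above, here is a compact Count-min sketch. It is a teaching sketch under simple assumptions (SHA-256 based row hashes, fixed `width` and `depth` defaults chosen arbitrarily), not a tuned implementation.

```python
import hashlib


class CountMinSketch:
    """Approximate counter in fixed memory; estimates never undercount."""

    def __init__(self, width: int = 1000, depth: int = 5) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum across rows is
        # the tightest upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for url in ["http://site_a/", "http://site_a/", "http://site_b/"]:
    sketch.add(url)
print(sketch.estimate("http://site_a/") >= 2)  # True
```

Memory use is `width * depth` counters regardless of how many distinct items are seen, which is what lets a job of class C shrink towards class A.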
  • 42. Scale carefully
    ● Get end-to-end workflows in production for evaluation
      ○ Improvements driven by business value, not tech
    ● Keep focus small
      ○ Business value
      ○ Privacy needs attention early
    ● Keep iterations swift
      ○ Integration test end-to-end
      ○ Efficient code/test/deploy cycle
    ● Parallelise jobs only when forced
  • 43. Protecting privacy in practice
    ● Removing old personally identifiable information (PII)
    ● Right to be forgotten
    ● Access control to PII data
    ● Audit of access and processing
    ● PII content definition is application-specific
    ● PII handling subject to business priorities
      ○ But you should have a plan from day one
  • 44. Data retention
    ● Remove old, promote derived datasets to lake
    [Diagram, before and after: Cluster storage, Data lake, Derived]
  • 45. PII removal
    ● Must rebuild downstream datasets regularly
      ○ In order for PII to be washed in x days
    Key on PII => difficult to wash
      Inputs:
        bobwhite,http://site_a/,2015-01-03T
        bobwhite,http://site_b/,2015-01-03T
        joeblack,http://site_c/,2015-01-03T

        bobwhite,Bath,uk
        joeblack,Bristol,uk
      Joined:
        bobwhite,http://site_a/,2015-01-03T,Bath,uk
        bobwhite,http://site_b/,2015-01-03T,Bath,uk
        joeblack,http://site_c/,2015-01-03T,Bristol,uk
    Split out PII, wash on user deletion
      Inputs:
        34ac,http://site_a/,2015-01-03T
        34ac,http://site_b/,2015-01-03T
        56bd,http://site_c/,2015-01-03T

        34ac,Bath,uk
        56bd,Bristol,uk

        34ac,bobwhite
        56bd,null
      Joined:
        bobwhite,http://site_a/,2015-01-03T,Bath,uk
        bobwhite,http://site_b/,2015-01-03T,Bath,uk
        null,http://site_c/,2015-01-03T,Bristol,uk
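The "split out PII" layout can be sketched as a small mapping table: every other dataset keys on a random pseudonym, and deleting a user only touches the mapping. The class and method names are hypothetical, and random UUID keys are an assumption (the slide shows short keys such as `34ac`).

```python
import uuid
from typing import Dict, Optional


class PseudonymTable:
    """The only dataset that holds user names; everything else keys on the pseudonym."""

    def __init__(self) -> None:
        self.user_by_key: Dict[str, Optional[str]] = {}
        self._key_by_user: Dict[str, str] = {}

    def key_for(self, user: str) -> str:
        if user not in self._key_by_user:
            # Random, not a hash of the name: a hash could be recomputed
            # from the name and would therefore not be forgotten.
            key = uuid.uuid4().hex
            self._key_by_user[user] = key
            self.user_by_key[key] = user
        return self._key_by_user[user]

    def forget(self, user: str) -> None:
        """Wash on user deletion: keep the key, null out the identity."""
        key = self._key_by_user.pop(user, None)
        if key is not None:
            self.user_by_key[key] = None


table = PseudonymTable()
views = [(table.key_for("bobwhite"), "http://site_a/"),
         (table.key_for("joeblack"), "http://site_c/")]
table.forget("joeblack")
# Rebuilding a downstream join now yields null for the deleted user.
print([(table.user_by_key[key], url) for key, url in views])
# [('bobwhite', 'http://site_a/'), (None, 'http://site_c/')]
```

Already-built downstream datasets still contain the old name until they are rebuilt, which is why the slide asks for regular rebuilds.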
  • 46. Simple PII audit
    ● Classify PII level
      ○ Name, address, messages, ...
      ○ IP, city, ...
      ○ Total # page views, …
    ● Tag datasets and jobs in code
    ● Manual access through gateway tool
      ○ Verify permission, log
      ○ Dedicated machines only
    ● Log batch jobs
      ○ Deploy with CD only, log hg/git commit hash
  • 47. Parting words + sales plug
    Keep things simple; batch, homogeneity & little state
    Focus on developer code, test, debug cycle - end to end
    Harmony with technical ecosystems
    Little technology overlap with yesterday - follow leaders
    Plan early: Privacy, retention, audit, schema evolution
    Please give feedback -- I help companies plan and build these things
  • 49. Cloud or not?
    + Operations
    + Security
    + Responsive scaling
    - Development workflows
    - Privacy
    - Vendor lock-in
  • 50. Security?
    ● Afterthought add-on for big data components
      ○ E.g. Kerberos support
      ○ Always trailing - difficult to choose global paradigm
    ● Container security simpler
      ○ Easy with cloud
      ○ Immature with on-premise solutions?
  • 51. Data pipelines example
    [Diagram. Raw: Users, Page views, Sales. Derived: Sales reports, Views with demographics, Sales with demographics, Conversion analytics]
  • 52. Data pipelines team organisation
    Form teams that are driven by business cases & need
    Forward-oriented -> filters implicitly applied
    Beware of: duplication, tech chaos/autonomy, privacy loss
  • 53. Conway’s law
    “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
    Better organise to match desired design, then.
  • 54. Personae - important characteristics
    Architect
      - Technology updated
      - Holistic: productivity, privacy
      - Identify and facilitate governance
    Backend developer
      - Simplicity oriented
      - Engineering practices obsessed
      - Adapt to data world
    Product owner
      - Trace business value to upstream design
      - Find most ROI through difficult questions
    Manager
      - Explain what and why
      - Facilitate process to determine how
      - Enable, enable, enable
    Devops
      - Always increase automation
      - Enable, don’t control
    Data scientist
      - Capable programmer
      - Product oriented
  • 55. Protect production servers
    [Diagram labels: Service, DB, offline slave; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
    + Online service is safe
    - Replication may be out of sync
    - Cluster storage may be write unavailable
    => Delayed, inaccurate snapshot
  • 56. Deterministic slaves
    Restore from backup [Diagram labels: Service, DB, backup, Restore, DB, snapshot]
      + Standard procedure
      - Serial or resource consuming
    Incremental, controlled replay of commit log [Diagram labels: Service, DB, commit log, DB]
      + Deterministic
      - Ad-hoc solution
      - Serial => not scalable
  • 57. PII privacy control
    ● Simplify with coarse classification (red/yellow/green)
      ○ Datasets, potentially fields
      ○ Separate production areas
    ● Log batch jobs
      ○ Code checksum -> commit id -> source code
      ○ Tag job class with classification
        ■ Aids PII consideration in code review
        ■ Enables ad-hoc verification
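Tagging job classes with a classification might look like the sketch below. The decorator, the `leaks` check and its rule (an output should not be classified lower than its most sensitive input) are assumptions for illustration; mapping the example fields to red/yellow/green is likewise a guess. A job that deliberately anonymises its input would break the rule on purpose, so a hit is a prompt for review rather than an error.

```python
from enum import IntEnum
from typing import Iterable


class Privacy(IntEnum):
    GREEN = 0   # assumed: aggregates, e.g. total number of page views
    YELLOW = 1  # assumed: e.g. IP, city
    RED = 2     # assumed: e.g. name, address, messages


def classified(level: Privacy):
    """Class decorator that tags a job class with its output classification."""
    def tag(job_class: type) -> type:
        job_class.privacy = level
        return job_class
    return tag


def leaks(job_class: type, input_levels: Iterable[Privacy]) -> bool:
    """Flag jobs whose output is classified lower than their most sensitive input."""
    return job_class.privacy < max(input_levels)


@classified(Privacy.YELLOW)
class ViewsWithDemographics:  # hypothetical job class
    pass


print(leaks(ViewsWithDemographics, [Privacy.GREEN, Privacy.YELLOW]))  # False
print(leaks(ViewsWithDemographics, [Privacy.RED]))                    # True
```

Because the tag sits next to the job code, a change of classification shows up in the diff during code review.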
  • 58. Audit
    ● Audit manual access
    ● Wrap all functionality in gateway tool
      ○ Log datasets, output, code used
      ○ Disallow download to laptop
      ○ Wrapper tool happens to be great for enabling data scientists, too - shields them from operations.