www.mimeria.com
Eventually, time will kill your data pipeline
Berlin Buzzwords, 2019-06-17
Lars Albertsson
Mimeria
Out of joint
“Time is out of joint. O cursed spite,
That ever I was born to set it right.”
- Hamlet, prince of Denmark
Out of joint
[Diagram: free users using the premium service]
Time will kill us all
● Time handling causes data processing problems
● Observed issues
● Principles, patterns, anti-patterns
Goals:
● Awareness, recognition
● Tools from my toolbox
Data categories, time angle
● Facts
○ Events, observations
○ Time stamped by clock(s)
○ Continuous stream
● State
○ Snapshot view of system state
○ Lookup in live system. Dumped to data lake.
○ Regular intervals (daily)
● Claims
○ Statement about the past
○ Time window scope
Time scopes
● Registration time
● Event time
● Ingest time
● Processing time
● Domain time - scope of claim. “These are suspected fraudulent users for March 2019.”
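The time scopes above can be made concrete as fields on an event record (a sketch; the schema and field names are hypothetical):

```python
# Sketch (hypothetical schema): one event carries several time scopes.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClickEvent:
    event_time: datetime         # when it happened, per the (client) clock
    registration_time: datetime  # when the service received it
    ingest_time: datetime        # when it landed in the data lake
    # Processing time is not stored: it is a property of each pipeline run.
```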
Clocks
● Computer clocks measure elapsed time
● Good clocks, bad clocks, wrong clocks
● Controlled backend - good clocks, UTC
● Legacy systems - good clocks. Time zone?
● Clients - bad clocks. Time zone?
Calendars, naive
Maps time to astronomical and social domains
Naive calendar
● 365 days + leap years
● 12 months, weird number of days
● 7 day weeks
● 24 hours
● 60 minutes
● 60 seconds
● 24 time zones, one hour apart
Naive calendar properties
t + 1.day == t + 86400.seconds
(y + 45.years).asInt == y.asInt + 45
(date(y, 1, 1) + 364.days).month == 12
(t + 60.minutes).hour == (t.hour + 1) % 24
datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
date(y, 12, 29) + 1.day == date(y, 12, 30)
(date(y, 2, 28) + 2.days).month == 3
Can you find the counterexamples?
Calendar reality
t + 1.day == t + 86400.seconds
(y + 45.years).asInt == y.asInt + 45
(date(y, 1, 1) + 364.days).month == 12
(t + 60.minutes).hour == (t.hour + 1) % 24
datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
date(y, 12, 29) + 1.day == date(y, 12, 30)
(date(y, 2, 28) + 2.days).month == 3
Counterexamples:
● Leap seconds
● Daylight savings time
● Year zero
● Julian / Gregorian calendar switch (Sweden: 1712-02-30)
● Samoa crosses the date line. Again.
● Time zones: span 26 hours, 15-minute granularity
Small problems in practice
These calendar quirks are small problems in practice, except DST:
● Analytics quality
● Changes with political decision
● Might affect technical systems
● (Affects health!)
Typical loading dock ingress
● File arrives every hour
● Ingest job copies to lake, applies data platform conventions
● Source system determines format, naming, and timestamps, incl. time zone
● Spring: halted ingest. Autumn: data loss.
class SalesArrived(ExternalTask):
    """Receive a file with sales transactions every hour."""
    hour = DateHourParameter()

    def matches(self):
        return [u for u in GCSClient().listdir('gs://ingress/sales/')
                if re.match(rf'{self.hour:%Y%m%d%H}.*\.json', u)]

    def output(self):
        return GCSTarget(self.matches()[0])

@requires(SalesArrived)
class SalesIngest(Task):
    def output(self):
        return GCSFlagTarget('gs://lake/sales/'
                             f'{self.hour:year=%Y/month=%m/day=%d/hour=%H}/')

    def run(self):
        _, base = self.input().path.rsplit('/', 1)
        dst = f'{self.output().path}{base}'
        self.output().fs.copy(self.input().path, dst)
        self.output().fs.put_string(
            '', f'{self.output().path}{self.output().flag}')
crontab:
*/10 * * * * luigi RangeHourly --of SalesIngest
Offline / online
● Online: processing time ~= event time
● Offline: processing time != event time
Batch job - functional principles
Job == function([input datasets]): [output datasets]
● Ideally: atomic, deterministic, idempotent
● No external factors → deterministic
○ No (mutable) database queries
○ No service lookup
○ Don't read wall clock
○ No random numbers
● Known, bounded input data
● No orthogonal concerns & input factors
○ Invocation
○ Scheduling
○ Input / output location
● No side-effects
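A minimal sketch of these principles (all names and the file layout are hypothetical): the job takes its input datasets and time scope as parameters, never reads the wall clock, and publishes its output atomically:

```python
# Sketch: a batch job as a pure function of its inputs and parameters.
import json
from pathlib import Path

def sales_hourly_job(input_paths: list[Path], output_path: Path, hour: str) -> None:
    """Deterministic: same inputs + parameters -> byte-identical output."""
    events = []
    for p in sorted(input_paths):                 # stable input order
        for line in p.read_text().splitlines():
            events.append(json.loads(line))
    # The hour is a parameter -- never datetime.now().
    selected = sorted((e for e in events if e["event_time"].startswith(hour)),
                      key=lambda e: e["event_time"])
    tmp = output_path.with_suffix(".tmp")         # write-then-rename
    tmp.write_text("\n".join(json.dumps(e, sort_keys=True) for e in selected))
    tmp.rename(output_path)                       # atomic publish
```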
Database dumping
● Simple approach: daily full table snapshot to cluster storage dataset.
● Easy on the surface...
Anti-pattern: Death by elephant
● Sqoop (dump with MapReduce) production DB
● MapReduce from production API
● Hadoop / Spark == internal DDoS service
Pattern: Offline replica
● Restore backup snapshot to offline replica
● Dump from replica
Using snapshots
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
○ Counterexample in the other direction: free users using the premium service
Window misalign
● Time mismatch in both directions
Event sourcing
● Every change to unified log == source of truth
● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
● Allows view & join at any point in time
○ But more complex
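A sketch of the snapshot formula above, with a hypothetical event shape (latest subscription plan per user):

```python
# Sketch: snapshot(t + 1) = sum(snapshot(t), events(t, t + 1)).
def apply_event(state: dict, event: dict) -> dict:
    """state + event -> state'."""
    return {**state, event["user"]: event["plan"]}

def snapshot_at(events, t):
    """View of state at any point in time, derived from events alone."""
    state = {}
    for e in sorted(events, key=lambda e: e["time"]):
        if e["time"] <= t:
            state = apply_event(state, e)
    return state

events = [{"time": 1, "user": "u1", "plan": "free"},
          {"time": 5, "user": "u1", "plan": "premium"}]
snapshot_at(events, 1)   # {'u1': 'free'}
snapshot_at(events, 5)   # {'u1': 'premium'}
```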
Easier to express with streaming?
state + event → state'
● State is a view of sum of all events
○ Join with the sum table
○ Beautiful!
○ Not simple in practice
● Mixing event time and processing time
○ Every join is a race condition
Stream window operations
[Diagram: sliding windows, tumbling windows, window join]
Window over what?
● Number of events
● Processing time
● Registration time
● Event time
Tradeoffs:
● Less relevant for domain; deterministic; predictable semantics; predictable resources
● More relevant for domain; deterministic; unpredictable semantics; unpredictable resources
● Less relevant for domain; indeterministic; unpredictable semantics; predictable resources
Event ingest
Batch is easier?
● Yes
● But, some pitfalls:
○ How to divide events into datasets?
○ When to start processing?
Anti-pattern: Bucket by registration time
Ancient data collection:
● Events in log files, partitioned hourly
● Copy each hour of every host to lake
● Are all hosts represented?
Anti-pattern: Reprocess
● Start processing optimistically
● Reprocess after x% new data has arrived
Supporting reprocessing
● Start processing optimistically
● Reprocess after x% new data has arrived
Solution space:
● Apache Beam / Google → stream processing ops
● Data versioning
● Provenance
● Requires good (lacking) tooling
Event collection
[Diagram: time scopes from registration time through event, ingest, and processing time]
Pattern: Ingest time bucketing
● Bundle incoming events into datasets
○ Bucket on ingest / wall-clock time
○ Predictable bucketing, e.g. hour:
clicks/2016/02/08/14
clicks/2016/02/08/15
● Sealed quickly at end of hour
● Mark dataset as complete
○ E.g. _SUCCESS flag
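A minimal sketch of the bucketing rule (helper name hypothetical): the dataset path depends only on ingest wall-clock time, so it is predictable and can be sealed as soon as the hour ends:

```python
# Sketch: bucket events by ingest wall-clock time, hourly.
from datetime import datetime, timezone

def ingest_bucket(topic: str, ingest_time: datetime) -> str:
    """Dataset path for an event ingested at ingest_time (UTC)."""
    return f"{topic}/{ingest_time:%Y/%m/%d/%H}"

ingest_bucket("clicks", datetime(2016, 2, 8, 14, 30, tzinfo=timezone.utc))
# -> 'clicks/2016/02/08/14'
```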
When data is late
Incompleteness recovery

Spark job (Scala), counting events that arrive after the delay window:

val orderLateCounter = longAccumulator("order-event-late")
val hourPaths = conf.order.split(",")
val order = hourPaths
  .map(spark.read.avro(_))
  .reduce((a, b) => a.union(b))
val orderThisHour = order
  .map({ cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  })
  .filter(cl => cl.eventTime.hour == config.hour)

Orchestration (Luigi):

class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()
    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderShuffleJob'

    def requires(self):
        # Note: This delays processing by N hours.
        return [Order(hour=hour) for hour in
                [self.hour + timedelta(hours=h) for h in
                 range(self.delay_hours)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        return ["--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order", ",".join([i.path for i in self.input()]),
                "--output", self.output().path]
Fast data, complete data

Three variants of the same dataset, at delays 0, 4, and 12 hours:

class OrderShuffleAll(WrapperTask):
    hour = DateHourParameter()

    def requires(self):
        return [OrderShuffle(hour=self.hour, delay_hours=d)
                for d in [0, 4, 12]]

class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        return OrderShuffle(hour=self.hour, delay_hours=0)

class FinancialReport(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=12)
                for h in range(24)]
Lambda in batch

class ReportBase(SparkSubmitTask):
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ReportJob'

    def requires(self):
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=self.delay)
                for h in range(24)]

class PreliminaryReport(ReportBase):
    delay = 0

class FinalReport(ReportBase):
    delay = 12

    def requires(self):
        return super().requires() + [SuspectedFraud(self.date)]
Anti-pattern: Reprocessing
Human-delayed data: “These are my compensation claims for last January.”
● Want: accurate reporting
● Want: current report status
Dual time scopes
● What we knew about month X at date Y
● Deterministic
○ Can be backfilled, audited
● May require sparse dependency trees

class ClaimsReport(SparkSubmitTask):
    domain_month = MonthParameter()
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ClaimsReportJob'

    def requires(self):
        return [Claims(
                    domain_month=self.domain_month,
                    date=d)
                for d in date_range(
                    self.domain_month + relativedelta(months=1),
                    self.date)]
Recursive dependencies
● Use yesterday’s dataset + some delta
○ Starting point necessary
● Often convenient, but operational risk
○ Slow backfills
● Mitigation: recursive jumps
○ Depend on previous month + all previous days
in this month
class StockAggregateJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        yesterday = self.date - timedelta(days=1)
        previous = StockAggregateJob(date=yesterday)
        return [StockUpdate(date=self.date), previous]
Recursive dependency strides
● Mitigation: recursive strides
● y/m/1 depends on y/m-1/1
● Others depend on y/m/1 + all previous
days in this month
class StockAggregateStrideJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        first_in_month = self.date.replace(day=1)
        base = (first_in_month - relativedelta(months=1)
                if self.date.day == 1 else first_in_month)
        return ([StockAggregateStrideJob(date=base)] +
                [StockUpdate(date=d) for d in
                 rrule(freq=DAILY, dtstart=base, until=self.date)])
Business logic with history
Example: Forming sessions
● Valuable for
○ User insights
○ Product insights
○ A/B testing
○ ...
What is a session?
● Sequence of clicks at most 5 minutes
apart?
● Maximum length 3 hours?
● In order to emit sessions in one hour,
which hours of clicks are needed?
What is a session?
● Sequence of clicks at most 5 minutes
apart.
● Maximum length 3 hours.
Examples, window = 5am - 9am:
● Clicks at [6:00, 6:01, 6:03, 6:45, 6:47]?
○ Two sessions, 3 and 2 minutes long.
● [5:00, 5:01, 5:03]?
○ One session, 3 minutes?
○ [(4:57), 5:00, 5:01, 5:03]?
○ [(2:01), (every 3 minutes), (4:57), 5:00,
5:01, 5:03]?
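The definition above (5-minute gap, 3-hour cap) can be sketched as a batch function over sorted click timestamps (function and names hypothetical):

```python
# Sketch: form sessions from clicks at most 5 minutes apart, capped at 3 hours.
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)
MAX_LENGTH = timedelta(hours=3)

def sessions(click_times):
    """Sorted click timestamps -> list of (first_click, last_click) sessions."""
    out, start, prev = [], None, None
    for t in click_times:
        if start is not None and (t - prev > GAP or t - start > MAX_LENGTH):
            out.append((start, prev))   # gap or cap exceeded: close the session
            start = None
        if start is None:
            start = t
        prev = t
    if start is not None:
        out.append((start, prev))
    return out

clicks = [datetime(2019, 6, 17, 6, m) for m in (0, 1, 3, 45, 47)]
sessions(clicks)   # two sessions: (6:00, 6:03) and (6:45, 6:47)
```

Note that this only works if the input window already contains all clicks that could extend a session, which is exactly the question the slide poses.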
Occurrences with unbounded time spans
● E.g. sessions, behavioural patterns
● You often need a broader time range than expected
● You may need data from the future
● You may need infinite history
○ Recursive strides?
○ Introduce static limits, e.g. cut all sessions at midnight?
○ Emit counters to monitor assumptions.
Secrets of valuable data engineering
1. Avoid complexity and distributed systems
○ Your data fits on one machine.
2. Pick the slowest data integration you can live with
○ Batch >> streaming >> service.
○ Slow data → easy operations → high innovation speed
3. Functional architectural principles
○ Pipelines
○ Immutability
○ Reproducibility
○ Idempotency
4. Master workflow orchestration
Team concurrency
● Pipelines
○ Parallel development without risk
● Immutability
○ Data reusability without risk
● Reproducibility, idempotency
○ Preserving immutability
○ Reduces operational risk
○ Stable experiments
● Workflow orchestration
○ Data handover between systems & teams
○ Reduces operational overhead
Data science reproducibility crisis!
The real value of big data
[Diagram: stream storage + data lake → data democratised]
Resources, credits
Time libraries:
● Java: Joda-Time, java.time
● Scala: chronoscala
● Python: dateutil, pendulum
Presentation on operational tradeoffs:
https://www.slideshare.net/lallea/data-ops-in-practice

Thank you,
● Konstantinos Chaidos, Spotify
● Lena Sundin, independent

The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 

More from Lars Albertsson (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 

Recently uploaded

ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 

Recently uploaded (20)

ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 

Eventually, time will kill your data pipeline

  • 12. Calendars, naive
    Maps time to astronomical and social domains. Naive calendar:
    ● 365 days + leap years
    ● 12 months, weird number of days
    ● 7 day weeks
    ● 24 hours
    ● 60 minutes
    ● 60 seconds
    ● 24 hour time zones
  • 13. Naive calendar properties
    t + 1.day == t + 86400.seconds
    (y + 45.years).asInt == y.asInt + 45
    (date(y, 1, 1) + 364.days).month == 12
    (t + 60.minutes).hour == (t.hour + 1) % 24
    datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
    datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
    date(y, 12, 29) + 1.day == date(y, 12, 30)
    (date(y, 2, 28) + 2.days).month == 3
    Can you find the counter examples?
  • 14. Calendar reality
    Each of the naive properties above has counterexamples:
    ● Leap seconds
    ● Year zero
    ● Julian / Gregorian calendar switch (Sweden: 1712-02-30)
    ● Daylight savings time
    ● Samoa crosses the date line. Again.
    ● Time zones: span 26 hours, 15 minute granularity
  • 15. Small problems in practice
    Leap seconds, year zero, the Julian / Gregorian calendar switch (Sweden:
    1712-02-30), Samoa crossing the date line, and time zone spans are small
    problems in practice. Except DST:
    ● Analytics quality
    ● Changes with political decision
    ● Might affect technical systems
    ● (Affects health!)
  • 16. Typical loading dock ingress
    ● File arrives every hour
    ● Ingest job copies to lake, applies data platform conventions
    ● Source system determines format, naming, and timestamps

    class SalesArrived(ExternalTask):
        """Receive a file with sales transactions every hour."""
        hour = DateHourParameter()

        def matches(self):
            return [u for u in GCSClient().listdir('gs://ingress/sales/')
                    if re.match(rf'{self.hour:%Y%m%d%H}.*\.json', u)]

        def output(self):
            return GCSTarget(self.matches()[0])

    @requires(SalesArrived)
    class SalesIngest(Task):
        def output(self):
            return GCSFlagTarget(
                'gs://lake/sales/'
                f'{self.hour:year=%Y/month=%m/day=%d/hour=%H}/')

        def run(self):
            _, base = self.input().path.rsplit('/', 1)
            dst = f'{self.output().path}{base}'
            self.output().fs.copy(self.input().path, dst)
            self.output().fs.put_string(
                '', f'{self.output().path}{self.output().flag}')

    crontab: */10 * * * * luigi RangeHourly --of SalesIngest
  • 17. Typical loading dock ingress, revisited
    ● File arrives every hour
    ● Ingest job copies to lake, applies data platform conventions
    ● Source system determines format, naming, and timestamps, incl. zone
    ● Spring: halted ingest. Autumn: data loss.
  • 18. Offline / online
    Offline: Processing time != event time
    Online: Processing time ~= event time
  • 19. Batch job - functional principles
    Job == function([input datasets]): [output datasets]
    ● Ideally: atomic, deterministic, idempotent
    ● No external factors → deterministic
      ○ No (mutable) database queries
      ○ No service lookup
      ○ Don't read wall clock
      ○ No random numbers
    ● Known, bounded input data
    ● No orthogonal concerns & input factors
      ○ Invocation
      ○ Scheduling
      ○ Input / output location
    ● No side-effects
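The functional principles above can be sketched as a job that is a pure function of its input dataset: no wall clock, no service lookups, no randomness. A minimal sketch; the dataset shape and names are illustrative, not from the deck:

```python
import json

def sales_per_user(order_lines):
    """Pure batch job: the output is a deterministic function of the
    input dataset alone - no wall clock, no lookups, no randomness."""
    totals = {}
    for line in order_lines:          # known, bounded input
        order = json.loads(line)
        user = order["user_id"]
        totals[user] = totals.get(user, 0) + order["amount"]
    # Same input always yields the same output, so the job is safe to
    # re-run and its output can be overwritten idempotently.
    return dict(sorted(totals.items()))

orders = ['{"user_id": "a", "amount": 3}',
          '{"user_id": "b", "amount": 2}',
          '{"user_id": "a", "amount": 5}']
print(sales_per_user(orders))  # {'a': 8, 'b': 2}
```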
  • 20. Database dumping
    ● Simple approach: Daily full table snapshot to cluster storage dataset.
    ● Easy on surface...
  • 21. Anti-pattern: Death by elephant
    ● Sqoop (dump with MapReduce) production DB
    ● MapReduce from production API
    Hadoop / Spark == internal DDoS service
  • 22. Pattern: Offline replica
    ● Restore backup to offline replica
    ● Dump from replica
  • 23. Using snapshots
    ● join(event, snapshot) → always time mismatch
    ● Usually acceptable
      ○ In one direction
  • 24. Using snapshots, continued
    In the wrong direction, the mismatch shows: free users using premium
    service.
  • 25. Window misalign
    ● Time mismatch in both directions
  • 26. Event sourcing
    ● Every change to unified log == source of truth
    ● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
    ● Allows view & join at any point in time
      ○ But more complex
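The snapshot recurrence above is a fold over the event log. A minimal sketch with a hypothetical upgrade/downgrade event shape (not from the deck):

```python
def apply_event(state, event):
    """One step of snapshot(t+1) = sum(snapshot(t), events(t, t+1))."""
    state = dict(state)
    if event["type"] == "upgrade":
        state[event["user_id"]] = "premium"
    elif event["type"] == "downgrade":
        state[event["user_id"]] = "free"
    return state

def snapshot_at(events, t):
    """Rebuild the state view at any point in time t from the log."""
    state = {}
    for e in events:              # events sorted by time
        if e["time"] > t:
            break
        state = apply_event(state, e)
    return state

log = [{"time": 1, "type": "upgrade", "user_id": "a"},
       {"time": 5, "type": "downgrade", "user_id": "a"}]
print(snapshot_at(log, 3))  # {'a': 'premium'}
print(snapshot_at(log, 9))  # {'a': 'free'}
```

Because the log is the source of truth, a join can be made against the state as it was at any point in time, which a daily snapshot dump cannot offer.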
  • 27. Easier to express with streaming?
    state + event → state'
    ● State is a view of sum of all events
      ○ Join with the sum table
      ○ Beautiful!
      ○ Not simple in practice
    ● Mixing event time and processing time
      ○ Every join is a race condition
  • 28. Stream window operations
    ● Sliding windows
    ● Tumbling windows
    ● Window join
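A tumbling window simply buckets each event into non-overlapping, fixed-width intervals by timestamp. A sketch of event-time assignment, not tied to any specific streaming API:

```python
def tumbling_window(event_time, width):
    """Assign an event to the start of its tumbling window."""
    return event_time - (event_time % width)

def window_counts(events, width):
    """Count events per non-overlapping (tumbling) event-time window."""
    counts = {}
    for t in events:
        w = tumbling_window(t, width)
        counts[w] = counts.get(w, 0) + 1
    return counts

# Events at epoch seconds 3, 42, 61, 119, 130; 60-second windows:
print(window_counts([3, 42, 61, 119, 130], 60))  # {0: 2, 60: 2, 120: 1}
```

A sliding window differs only in that one event lands in several overlapping windows; a window join buckets two streams the same way and joins per bucket.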
  • 29. Window over what?
    ● Number of events
    ● Processing time
    ● Registration time
    ● Event time
    Trade-offs per choice: relevance for the domain, determinism,
    predictability of semantics, and predictability of resource usage.
  • 30. Batch is easier?
    ● Yes
    ● But, some pitfalls
    Event ingest: How to divide events to datasets? When to start processing?
  • 31. Anti-pattern: Bucket by registration time
    Ancient data collection:
    ● Events in log files, partitioned hourly
    ● Copy each hour of every host to lake
    All hosts represented?
  • 32. Anti-pattern: Reprocess
    ● Start processing optimistically
    ● Reprocess after x% new data has arrived
  • 33. Supporting reprocessing
    ● Start processing optimistically
    ● Reprocess after x% new data has arrived
    Solution space:
    ● Apache Beam / Google → stream processing ops
    ● Data versioning
    ● Provenance
    Requires good (lacking) tooling
  • 35. Pattern: Ingest time bucketing
    ● Bundle incoming events into datasets
      ○ Bucket on ingest / wall-clock time
      ○ Predictable bucketing, e.g. hour
        (clicks/2016/02/08/14, clicks/2016/02/08/15)
    ● Sealed quickly at end of hour
    ● Mark dataset as complete
      ○ E.g. _SUCCESS flag
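A sketch of the bucket-and-seal step, using the local filesystem in place of a data lake; paths and names are illustrative:

```python
import json
import os
import tempfile

def ingest(events, lake_root, ingest_hour):
    """Write events into an ingest-time bucket, then seal it."""
    bucket = os.path.join(lake_root, "clicks", ingest_hour)
    os.makedirs(bucket, exist_ok=True)
    with open(os.path.join(bucket, "part-00000.json"), "w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")
    # Sealing: the empty _SUCCESS flag tells downstream jobs that the
    # bucket is complete and will never change.
    open(os.path.join(bucket, "_SUCCESS"), "w").close()
    return bucket

lake = tempfile.mkdtemp()
b = ingest([{"user": "a"}], lake, "2016/02/08/14")
print(os.path.exists(os.path.join(b, "_SUCCESS")))  # True
```

Downstream jobs only read buckets that carry the flag, so they never observe a half-written hour.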
  • 37. Incompleteness recovery

    val orderLateCounter = longAccumulator("order-event-late")
    val hourPaths = conf.order.split(",")
    val order = hourPaths
      .map(spark.read.avro(_))
      .reduce((a, b) => a.union(b))
    val orderThisHour = order
      .map({ cl =>
        // Count the events that came after the delay window
        if (cl.eventTime.hour + config.delayHours < config.hour) {
          orderLateCounter.add(1)
        }
        cl
      })
      .filter(cl => cl.eventTime.hour == config.hour)

    class OrderShuffle(SparkSubmitTask):
        hour = DateHourParameter()
        delay_hours = IntParameter()
        jar = 'orderpipeline.jar'
        entry_class = 'com.example.shop.OrderShuffleJob'

        def requires(self):
            # Note: This delays processing by N hours.
            return [Order(hour=hour) for hour in
                    [self.hour + timedelta(hours=h)
                     for h in range(self.delay_hours)]]

        def output(self):
            return HdfsTarget("/prod/red/order/v1/"
                              f"delay={self.delay_hours}/"
                              f"{self.hour:%Y/%m/%d/%H}/")

        def app_options(self):
            return [
                "--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order", ",".join([i.path for i in self.input()]),
                "--output", self.output().path]
  • 38. Fast data, complete data

    class OrderShuffleAll(WrapperTask):
        hour = DateHourParameter()

        def requires(self):
            return [OrderShuffle(hour=self.hour, delay_hours=d)
                    for d in [0, 4, 12]]

    class OrderDashboard(mysql.CopyToTable):
        hour = DateHourParameter()

        def requires(self):
            return OrderShuffle(hour=self.hour, delay_hours=0)

    class FinancialReport(SparkSubmitTask):
        date = DateParameter()

        def requires(self):
            return [OrderShuffle(
                hour=datetime.combine(self.date, time(hour=h)),
                delay_hours=12) for h in range(24)]
  • 39. Lambda in batch

    class ReportBase(SparkSubmitTask):
        date = DateParameter()
        jar = 'reportpipeline.jar'
        entry_class = 'com.example.shop.ReportJob'

        def requires(self):
            return [OrderShuffle(
                hour=datetime.combine(self.date, time(hour=h)),
                delay_hours=self.delay) for h in range(24)]

    class PreliminaryReport(ReportBase):
        delay = 0

    class FinalReport(ReportBase):
        delay = 12

        def requires(self):
            return super().requires() + [SuspectedFraud(self.date)]
  • 40. Anti-pattern: Reprocessing
    Human-delayed data: “These are my compensation claims for last January.”
    ● Want: accurate reporting
    ● Want: current report status
  • 41. Dual time scopes
    ● What we knew about month X at date Y
    ● Deterministic
      ○ Can be backfilled, audited
    ● May require sparse dependency trees

    class ClaimsReport(SparkSubmitTask):
        domain_month = MonthParameter()
        date = DateParameter()
        jar = 'reportpipeline.jar'
        entry_class = 'com.example.shop.ClaimsReportJob'

        def requires(self):
            return [Claims(
                domain_month=self.domain_month,
                date=d) for d in date_range(
                    self.domain_month + relativedelta(months=1),
                    self.date)]
  • 42. Recursive dependencies
    ● Use yesterday’s dataset + some delta
      ○ Starting point necessary
    ● Often convenient, but operational risk
      ○ Slow backfills
    ● Mitigation: recursive jumps
      ○ Depend on previous month + all previous days in this month

    class StockAggregateJob(SparkSubmitTask):
        date = DateParameter()
        jar = 'stockpipeline.jar'
        entry_class = 'com.example.shop.StockAggregate'

        def requires(self):
            yesterday = self.date - timedelta(days=1)
            previous = StockAggregateJob(date=yesterday)
            return [StockUpdate(date=self.date), previous]

  • 43. Recursive dependency strides
    ● Mitigation: recursive strides
    ● y/m/1 depends on y/m-1/1
    ● Others depend on y/m/1 + all previous days in this month

    class StockAggregateStrideJob(SparkSubmitTask):
        date = DateParameter()
        jar = 'stockpipeline.jar'
        entry_class = 'com.example.shop.StockAggregate'

        def requires(self):
            first_in_month = self.date.replace(day=1)
            base = (first_in_month - relativedelta(months=1)
                    if self.date.day == 1 else first_in_month)
            return ([StockAggregateStrideJob(date=base)] +
                    [StockUpdate(date=d)
                     for d in rrule(freq=DAILY, dtstart=base,
                                    until=self.date)])
  • 44. Business logic with history
    Example: Forming sessions
    ● Valuable for
      ○ User insights
      ○ Product insights
      ○ A/B testing
      ○ ...
  • 45. What is a session?
    ● Sequence of clicks at most 5 minutes apart?
    ● In order to emit sessions in one hour, which hours of clicks are needed?
  • 46. What is a session?
    ● Sequence of clicks at most 5 minutes apart?
    ● Maximum length 3 hours?
    ● In order to emit sessions in one hour, which hours of clicks are needed?
  • 47. What is a session?
    ● Sequence of clicks at most 5 minutes apart.
    ● Maximum length 3 hours.
    Examples, window = 5am - 9am:
    ● Clicks at [6:00, 6:01, 6:03, 6:45, 6:47]?
      ○ Two sessions, 3 and 2 minutes long.
  • 48. What is a session? More examples:
    ● [5:00, 5:01, 5:03]?
      ○ One session, 3 minutes?
  • 49. What is a session? More examples:
    ● [5:00, 5:01, 5:03]?
      ○ One session, 3 minutes?
      ○ [(4:57), 5:00, 5:01, 5:03]?
      ○ [(2:01), (every 3 minutes), (4:57), 5:00, 5:01, 5:03]?
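The 5-minute-gap rule can be sketched as a simple sessionizer over a sorted click stream (timestamps in minutes since midnight). Note what it ignores: the 3-hour cap and, crucially, clicks just outside the processing window — exactly the ambiguity the examples above illustrate:

```python
def sessionize(click_times, max_gap=5):
    """Split a sorted list of click timestamps (in minutes) into
    sessions: a new session starts when the gap exceeds max_gap."""
    sessions = []
    for t in click_times:
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)    # continue current session
        else:
            sessions.append([t])      # gap too large: new session
    return sessions

# Clicks at 6:00, 6:01, 6:03, 6:45, 6:47, as minutes since 00:00:
clicks = [360, 361, 363, 405, 407]
print(sessionize(clicks))  # [[360, 361, 363], [405, 407]]
```

The output matches the slide's answer: two sessions, 3 and 2 minutes long. Feed it [5:00, 5:01, 5:03] from a 5am-9am window and the answer silently depends on whether a 4:57 click exists outside the window.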
  • 50. Occurrences with unbounded time spans
    ● E.g. sessions, behavioural patterns
    ● You often need a broader time range than expected
    ● You may need data from the future
    ● You may need infinite history
      ○ Recursive strides?
      ○ Introduce static limits, e.g. cut all sessions at midnight?
      ○ Emit counters to monitor assumptions.
  • 51. Secrets of valuable data engineering
    1. Avoid complexity and distributed systems
       ○ Your data fits in one machine.
    2. Pick the slowest data integration you can live with
       ○ Batch >> streaming >> service.
       ○ Slow data → easy operations → high innovation speed
    3. Functional architectural principles
       ○ Pipelines
       ○ Immutability
       ○ Reproducibility
       ○ Idempotency
    4. Master workflow orchestration
  • 52. Team concurrency
    ● Pipelines
      ○ Parallel development without risk
    ● Immutability
      ○ Data reusability without risk
    ● Reproducibility, idempotency
      ○ Preserving immutability
      ○ Reduces operational risk
      ○ Stable experiments (data science reproducibility crisis!)
    ● Workflow orchestration
      ○ Data handover between systems & teams
      ○ Reduces operational overhead
  • 53. The real value of big data
    Stream storage + data lake → data democratised
  • 54. Resources, credits
    Time libraries:
    ● Java: java.time (rather than Joda-Time)
    ● Scala: chronoscala
    ● Python: dateutil, pendulum
    Presentation on operational tradeoffs:
    https://www.slideshare.net/lallea/data-ops-in-practice
    Thank you,
    ● Konstantinos Chaidos, Spotify
    ● Lena Sundin, independent