Do you think you can write a system that gets data from sensors across the world, does real-time analytics, and displays the data on a dashboard, all in under 100 lines of code? Would you like to add some monitoring and autoscaling too? And what about serverless? In this talk I'll show you the technologies GCP offers to build such a system reliably and at scale.
4. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
An easy big data problem
5. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
Truth is they will not be in one file, but they will be streamed live from
different sensors…
An easy big data and streaming problem
6. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
Truth is they will not be in one file, but they will be streamed live from
different sensors… In different parts of the world
A not so easy streaming data problem
7. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
Truth is they will not be in one file, but they will be streamed live from
different sensors… In different parts of the world
Some sensors might send a few events per hour, some a few thousand per
second…
An autoscaling streaming problem
8. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
Truth is they will not be in one file, but they will be streamed live from
different sensors… In different parts of the world
Some sensors might send a few events per hour, some a few thousand per
second… We want not just the total average of all the points, but the moving
average every 30 seconds, for every sensor. And the hourly, daily, and
monthly averages.
A hard streaming analytics problem
9. Calculate the average of several numbers. By the way, they might be MANY
numbers. They will probably not fit in memory. They might not even fit in one
file or on a single hard drive.
Truth is they will not be in one file, but they will be streamed live from
different sensors… In different parts of the world
Some sensors might send a few events per hour, some a few thousand per
second… We want not just the total average of all the points, but the moving
average every 30 seconds, for every sensor. And the hourly, daily, and
monthly averages.
Sometimes the sensors will have connectivity issues and will not send their
data until later, but of course I want the calculations to still be correct.
A real life analytics problem
10. All of the above, plus monitoring, alerts, self-healing, a way to query the data
efficiently, and a pretty dashboard on top
What your client/boss will expect
14. What we need
The components of a streaming data pipeline:
Data acquisition
Data validation
Transformation / aggregation
Visualization
Storage / analytics
Monitoring and alerts for all of the components
16. Google Cloud Pub/Sub
Google Cloud Pub/Sub brings the scalability, flexibility, and reliability of
enterprise message-oriented middleware to the cloud. By providing
many-to-many, asynchronous messaging that decouples senders and
receivers, it allows for secure and highly available communication between
independently written applications.
Google Cloud Pub/Sub delivers low-latency, durable messaging that helps
developers quickly integrate systems hosted on the Google Cloud Platform
and externally.
Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics
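As a taste of the API, here is a minimal sketch of publishing a single sensor reading with the Pub/Sub Java client library; the project name, topic name, and message payload are placeholder assumptions, not the talk's demo code.

```java
// A minimal sketch: publishing one sensor reading to Cloud Pub/Sub.
// Project, topic, and payload below are placeholders.
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class SensorPublisher {
  public static void main(String[] args) throws Exception {
    TopicName topic = TopicName.of("my-project", "sensor-events");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("{\"sensorId\":\"s-42\",\"speed\":88.5}"))
          .build();
      publisher.publish(message).get(); // wait for the server to ack the message
    } finally {
      publisher.shutdown();
    }
  }
}
```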
17. The Spotify proof of concept
Currently our production load peaks at around 700K events per second. To account for the future
growth and possible disaster recovery scenarios, we settled on a test load of 2M events per second.
To make it extra hard for Pub/Sub, we wanted to publish this amount of traffic from a single data
center, so that all the requests were hitting the Pub/Sub machines in the same zone. We made the
assumption that Google plans zones as independent failure domains and that each zone can handle
equal amounts of traffic.
In theory, if we’re able to push 2M messages to a single zone, we should be able to push
number_of_zones * 2M messages across all zones.
Our hope was that the system would be able to handle this traffic on both the producing and
consuming side for a long time without the service degrading.
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
18. The Spotify proof of concept
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
They pushed 2 million events per second (to two
topics) from 29 servers, non-stop, for five days.
“We did not observe any lost messages whatsoever
during the test period.”
19. The no-operations advantage
https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
Event Delivery System In Cloud
We’re actively working on bringing the new system to production. The preliminary
numbers we obtained from running the new system in the experimental phase look very
promising. The worst end-to-end latency observed with the new system is four times
lower than the end-to-end latency of the old system.
But boosting performance isn’t the only thing we want to get from the new system. Our
bet is that by using cloud-managed products we will have a much lower operational
overhead. That in turn means we will have much more time to make Spotify’s
products better.
21. Pub/Sub now works with Cloud IoT Core
Device Manager
The device manager allows individual devices to be configured and managed securely in a
coarse-grained way; management can be done through a console or programmatically. The device
manager establishes the identity of a device, and provides the mechanism for authenticating a
device when connecting. It also maintains a logical configuration of each device and can be used to
remotely control the device from the cloud.
Protocol Bridge
The protocol bridge provides connection endpoints for protocols with automatic load balancing for
all device connections. The protocol bridge has native support for secure connection over MQTT, an
industry-standard IoT protocol. The protocol bridge publishes all device telemetry to Cloud
Pub/Sub, which can then be consumed by downstream analytic systems.
Which is very cool if you are into Arduino, Raspberry Pi, Android, or embedded systems.
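To make the protocol bridge concrete, here is a hedged sketch of a device publishing telemetry over MQTT, loosely following the documented bridge conventions (endpoint mqtt.googleapis.com:8883, a JWT in the password field, fixed client-id and topic formats). The Paho and jjwt libraries, and all project/registry/device/key names, are assumptions for illustration.

```java
// A hedged sketch: device-side telemetry through the Cloud IoT Core MQTT bridge.
// All names (project, region, registry, device, key file) are placeholders.
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.time.Instant;
import java.util.Date;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

public class DeviceTelemetry {
  public static void main(String[] args) throws Exception {
    String project = "my-project", region = "europe-west1",
           registry = "my-registry", device = "sensor-42";

    // The bridge authenticates each connection with a JWT signed by the device key
    byte[] keyBytes = Files.readAllBytes(Paths.get("rsa_private_pkcs8.der"));
    PrivateKey key = KeyFactory.getInstance("RSA")
        .generatePrivate(new PKCS8EncodedKeySpec(keyBytes));
    String jwt = Jwts.builder()
        .setIssuedAt(Date.from(Instant.now()))
        .setExpiration(Date.from(Instant.now().plusSeconds(3600)))
        .setAudience(project)
        .signWith(SignatureAlgorithm.RS256, key)
        .compact();

    // Client id and telemetry topic formats are fixed by the protocol bridge
    String clientId = String.format(
        "projects/%s/locations/%s/registries/%s/devices/%s",
        project, region, registry, device);
    MqttClient client = new MqttClient("ssl://mqtt.googleapis.com:8883", clientId);
    MqttConnectOptions options = new MqttConnectOptions();
    options.setUserName("unused");          // the bridge ignores the username
    options.setPassword(jwt.toCharArray()); // the JWT goes in the password field
    client.connect(options);

    // Everything published here lands in the registry's Pub/Sub topic
    client.publish(String.format("/devices/%s/events", device),
        "{\"speed\":88.5}".getBytes(), 1, false);
    client.disconnect();
  }
}
```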
23. BigQuery
A database where you can send as much (or as little) data as you want, either
batch or streaming, and run any SQL you want, no matter how big your data
is.
Even if you have petabytes of data.
Even if you want to join data from different projects or from public data
sources.
Even if you want to query external data on Spreadsheets or Cloud Storage.
Even if you want to create your own User Defined Functions in JavaScript.
24. BigQuery also...
… is serverless and zero configuration. You never have to worry about
memory, CPU, network, or disk. You send your data, you send your queries,
you get results.
Behind the scenes, BigQuery will use up to 2,000 CPUs in parallel for your
queries, and a huge amount of networked storage. But you don’t care.
You pay for how much data you send and how much data you query. If you
are not using the database, you are not paying anything. But it’s always
available.
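To make that concrete, here is a minimal sketch of running one of the aggregate queries with the BigQuery Java client; the table name and column layout (sensor_id, speed, ts) are assumptions for illustration.

```java
// A minimal sketch: hourly average speed per sensor with the BigQuery Java client.
// The dataset/table and columns below are placeholders.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class HourlyAverages {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    String sql =
        "SELECT sensor_id, TIMESTAMP_TRUNC(ts, HOUR) AS hour, AVG(speed) AS avg_speed "
      + "FROM `my-project.sensors.readings` "
      + "GROUP BY sensor_id, hour "
      + "ORDER BY hour";
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    result.iterateAll().forEach(row ->
        System.out.printf("%s %s %.2f%n",
            row.get("sensor_id").getStringValue(),
            row.get("hour").getStringValue(),
            row.get("avg_speed").getDoubleValue()));
  }
}
```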
25. Hope you are not easily impressed
How long would it take to read 4 terabytes from a hard drive at 100 MB/s?
And to filter 100 billion data points, applying a regular expression to each?
And to move 278 GB across a 1 Gbps network?
26. Hope you are not easily impressed
How long would it take to read 4 terabytes from a hard drive at 100 MB/s?
About 11 hours
And to filter 100 billion data points, applying a regular expression to each?
About 27 hours
And to move 278 GB across a 1 Gbps network?
About 40 minutes
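These are just back-of-envelope divisions; a quick sketch to reproduce them (decimal units, ignoring all overheads; the regex figure assumes roughly one microsecond per match, which is my assumption chosen to match the slide).

```java
// Back-of-envelope checks for the numbers above (decimal units, no overheads).
public class BackOfEnvelope {
  public static void main(String[] args) {
    double readHours = 4e12 / 100e6 / 3600;    // 4 TB at 100 MB/s      -> ~11.1 h
    double regexHours = 100e9 * 1e-6 / 3600;   // 100e9 matches at ~1 us -> ~27.8 h
    double moveMinutes = 278e9 * 8 / 1e9 / 60; // 278 GB over 1 Gbps    -> ~37 min
    System.out.printf("read: %.1f h, regex: %.1f h, move: %.0f min%n",
        readHours, regexHours, moveMinutes);
  }
}
```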
38. Apache BEAM: An advanced unified programming model
Apache Beam is an open source, unified model for
defining both batch and streaming data-parallel
processing pipelines. Using one of the open source
Beam SDKs, you build a program that defines the
pipeline. The pipeline is then executed by one of
Beam’s supported distributed processing back-ends,
which include Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow.
Beam is particularly useful for Embarrassingly Parallel
data processing tasks, in which the problem can be
decomposed into many smaller bundles of data that
can be processed independently and in parallel. You
can also use Beam for Extract, Transform, and Load
(ETL) tasks and pure data integration. These tasks are
useful for moving data between different storage
media and data sources, transforming data into a
more desirable format, or loading data onto a new
system.
42. Averages with BEAM: Overview
The demo code has four parts (sketched below):
Boilerplate and configuration
Writing the output to BigQuery
The code that actually processes and aggregates the data
Starting the pipeline
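Here is a minimal sketch of that skeleton, assuming the Beam Java SDK; the topic, table, and schema are placeholders, and the actual demo code lives in the repository linked at the end.

```java
// A hedged sketch of the pipeline skeleton: read from Pub/Sub, process,
// write to BigQuery. Topic and table names below are placeholders.
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class AveragesPipeline {
  public static void main(String[] args) {
    // Boilerplate and configuration: runner and project come from the command line
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("raw").setType("STRING")));

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/sensor-events"))
     // The processing and aggregation steps go here (see the next slide)
     .apply("ToRow", MapElements.into(TypeDescriptor.of(TableRow.class))
            .via((String line) -> new TableRow().set("raw", line)))
     // Writing the output to BigQuery
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:sensors.readings")
            .withSchema(schema));

    // Start the pipeline; with an unbounded source it runs until cancelled
    p.run();
  }
}
```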
45. Averages with BEAM: The processing itself
Transform/Filter: we are just parsing a line of text into multiple fields
Aggregate: we are outputting the mean speed of the last minute, per sensor, every 30 seconds (see the sketch below)
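A minimal sketch of those two steps, assuming input lines shaped like "timestamp,sensorId,speed" (the field layout is my assumption, not necessarily the demo's):

```java
// A hedged sketch of the processing: parse, window, and average per sensor.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class Processing {
  static PCollection<KV<String, Double>> meanSpeeds(PCollection<String> lines) {
    return lines
        // Transform/Filter: parse a line of text into (sensorId, speed)
        .apply("Parse", ParDo.of(new DoFn<String, KV<String, Double>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] f = c.element().split(",");
            c.output(KV.of(f[1], Double.parseDouble(f[2])));
          }
        }))
        // Aggregate: the mean of the last minute, emitted every 30 seconds
        .apply("Window", Window.<KV<String, Double>>into(
            SlidingWindows.of(Duration.standardMinutes(1))
                          .every(Duration.standardSeconds(30))))
        .apply("MeanPerSensor", Mean.<String, Double>perKey());
  }
}
```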
46. Google Cloud Dataflow: BEAM with no-operations
Google developed BEAM internally as a closed-source product. Then they
realised it would make sense to open-source it, and they donated it to the
Apache community.
Anyone can use BEAM completely for free and choose the runner on which to
execute their pipelines.
Google Cloud Dataflow is a BEAM runner that executes your pipelines with no
operations work: logging, monitoring, autoscaling, shuffling, and dynamic
work rebalancing come built in.
It’s like BEAM, but as a managed service.
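As a rough illustration of "BEAM, but managed", pointing a pipeline at Dataflow is mostly a matter of options; a minimal sketch, with placeholder project, region, and bucket names:

```java
// A hedged sketch: running the same Beam pipeline on the Dataflow runner.
// Project, region, and bucket below are placeholders.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnDataflow {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);    // managed execution instead of local
    options.setProject("my-project");
    options.setRegion("europe-west1");
    options.setTempLocation("gs://my-bucket/tmp");
    options.setStreaming(true);                 // unbounded Pub/Sub source

    Pipeline p = Pipeline.create(options);
    // ... same transforms as in the previous sketches ...
    p.run();
  }
}
```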
71. CHEERS!
I’m happy to answer any questions you may have at lunchtime or during the
coffee breaks.
Or ping me at @supercoco9 on Twitter. You’ve got 280 characters now.
Demo source code available at:
https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/streaming/process
Javier Ramirez
72. End-to-end streaming analytics on
Google Cloud Platform
From event capture to dashboard to monitoring
Javier Ramirez
@supercoco9
73. Template Design Credits
Backgrounds: created by Free Google Slides Templates. The original template for this presentation was provided by, and is the property of, Free Google Slides Templates (http://freegoogleslidestemplates.com).
Shapes & Icons: vectorial shapes were created by Free Google Slides Templates and downloaded from pexels.com and unsplash.com. Icons are part of Google® Material Icons and 1001freedownloads.com.
Fonts: Dosis and Open Sans, taken from Google Fonts (https://www.google.com/fonts/).
Color palette: #93c47dff, #0097a7ff, #78909cff, #eeeeeeff, #f7b600ff, #00ce00e3, #de445eff, #000000ff.