Teaser: give developers a new way to understand advanced analytics and choose the right cloud architecture.
The new buzzword is #serverless: many great services now help us abstract away the complexity of managing servers. In this session we will see how serverless helps with large data analytics backends.
We will see how to architect for the cloud and add components to an existing project that move it toward a #serverless architecture: one that ingests our streaming data and runs advanced analytics on petabytes of data using BigQuery on Google Cloud Platform, all next to an existing stack and without being forced to reengineer our app.
BigQuery enables super-fast SQL/JavaScript queries against petabytes of data using the processing power of Google’s infrastructure. We will cover its core features, the SQL 2011 standard, streaming inserts, User Defined Functions written in JavaScript (including referencing external JS libraries), and several use cases for the everyday backend developer: funnel analytics, email heatmaps, custom data processing, building dashboards, extracting data using JS functions, and emitting rows based on business logic.
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
1. Creating #serverless data analytics
system on GCP using BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
March 2018 - Tirgu Mures, Romania
2. ● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow (120k reputation)
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery/Redis and database engine expert
● Active in mentoring and IT community
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10
Creating #serverless data analytics system on GCP using BigQuery @martonkodok
About me
4. Google Cloud Platform (GCP)
Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics, Data Studio, Cloud Dataprep
Storage & Databases: Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API, Cloud Video Intelligence API, Advanced Solutions Lab
Compute: Compute Engine, App Engine, Kubernetes Engine, GPU, Cloud Functions, Container-Optimized OS
Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, Key Management Service, BeyondCorp, Data Loss Prevention API, Identity-Aware Proxy, Security Key Enforcement
Internet of Things: Cloud IoT Core
Transfer Appliance
5. Google Cloud Platform (GCP)
Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Container Registry, Google Plug-in for Eclipse, Cloud Test Lab, Container Builder
Networking: Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud DNS, Cloud Network, Cloud External IP Addresses, Cloud Firewall Rules, Cloud Routes, Cloud VPN, Cloud Router, Dedicated Interconnect
Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
7. Meet Serverless
serverless data center depicted
8. Event-driven serverless compute platform
[Diagram labels: Event Source, Cloud Services, Changes in data state, Business logic events, Integrations, Multiple Platforms, Event Router, Gateway, HTTPS, Pub/Sub, Cloud Functions, Data Warehouse, Streaming, Business Value, Application, Task, Analysis]
9. Serverless is about maximizing elasticity, cost
savings, and agility of cloud computing.
@martonkodok
10. Crafting a solution for building a high-performance,
petabyte-scale, serverless data analytics and
reporting system on Google Cloud Platform
Goal today
11. Legacy Reporting System
[Architecture diagram: App; Cloud Load Balancing; NGINX on Compute Engine (10GB PD); Database Service (Master/Slave) on three Compute Engine instances (10GB PD each); Scheduled Tasks and Batch Processing on Compute Engine (multiple instances); Report & Share, Business Analysis]
12. Serverless Reporting System
[Architecture diagram: App; Cloud Load Balancing; NGINX on Compute Engine (10GB PD); Database Service (Master/Slave) on three Compute Engine instances (10GB PD each); Scheduled Tasks and Batch Processing on Compute Engine (multiple instances); Report & Share, Business Analysis; added alongside: BigQuery and Data Studio, with their own Report & Share, Business Analysis]
14. Analytics-as-a-Service - Data Warehouse in the Cloud
Scales into Petabytes on Managed Google Infrastructure (US or EU zone)
SQL 2011 + JavaScript UDF (User Defined Functions)
Familiar DB Structure (table, views, struct, nested, JSON)
Integrates with Google Sheets + Cloud Storage + Pub/Sub connectors
Decent pricing (queries: $5/TB, storage: $20/TB, cold storage: $10/TB) *Mar 2018
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
What is BigQuery?
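To make the pricing line concrete, here is a minimal Python sketch that turns the March 2018 list prices quoted on this slide into a rough monthly estimate. The function name and the example workload are hypothetical, and the storage prices are assumed to be per month; real bills depend on current pricing, free tiers, and how many bytes each query actually scans.

```python
# Prices as quoted on the slide (March 2018); treat them as illustrative only.
QUERY_USD_PER_TB = 5.0          # per TB scanned by queries
STORAGE_USD_PER_TB = 20.0       # active storage, assumed per TB per month
COLD_STORAGE_USD_PER_TB = 10.0  # long-term ("cold") storage, assumed per TB per month


def estimate_monthly_cost(scanned_tb, active_tb, cold_tb=0.0):
    """Return a rough monthly cost in USD for a hypothetical workload."""
    query_cost = scanned_tb * QUERY_USD_PER_TB
    storage_cost = (active_tb * STORAGE_USD_PER_TB
                    + cold_tb * COLD_STORAGE_USD_PER_TB)
    return query_cost + storage_cost


if __name__ == "__main__":
    # Hypothetical workload: 12 TB scanned, 2 TB active and 8 TB cold storage.
    print(estimate_monthly_cost(scanned_tb=12, active_tb=2, cold_tb=8))
```

Because queries are billed by data scanned, the columnar layout matters: selecting only the columns you need lowers `scanned_tb` directly.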
15. Columnar storage (max 10 000 columns in table)
Large files for loading: 5TB (CSV or JSON)
UDF in JavaScript or SQL
Rich SQL 2011: JSON, IP, Math, RegExp, Geocode, Window functions
Modern data types: Record, Nested, Struct, Array
Append-only tables preferred (DML syntax available)
Day column partitioned tables (SELECT * FROM t WHERE day = '2018-01-01')
BigQuery: Convenience of SQL
16. Architecting for The Cloud
[Diagram labels: Frontend, Platform Services, On-Premises Servers, Event Sourcing, Metrics / Logs / Streaming, Pipelines, ETL Engine, BigQuery]
17. “Our project generates many/big files.
How can I seamlessly ingest them?”
18. Serverless file ingest
[Diagram labels: On-Premises Servers, Application, Event Sourcing, Frontend, Platform Services, Metrics / Logs / Streaming, Cloud Storage, Cloud Functions (Triggered Code), BigQuery]
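The file-ingest flow on this slide (a file lands in Cloud Storage, a Cloud Function fires, the data is loaded into BigQuery) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the code from the talk: the entry-point name `on_file_uploaded`, the table `my_dataset.events`, and the `.json` naming convention are all hypothetical, and the files are assumed to be newline-delimited JSON. The BigQuery client is imported lazily so the rest of the sketch runs without it.

```python
def gcs_uri(event):
    """Build the gs:// URI from a Cloud Storage trigger payload.

    Cloud Storage events carry the bucket in "bucket" and the object
    path in "name".
    """
    return "gs://{}/{}".format(event["bucket"], event["name"])


def load_with_bigquery(uri, table_id):
    """Start a BigQuery load job that appends one file to a table."""
    # Imported lazily: requires the google-cloud-bigquery package.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    return client.load_table_from_uri(uri, table_id, job_config=job_config)


def on_file_uploaded(event, context=None, table_id="my_dataset.events",
                     load=load_with_bigquery):
    """Entry point, called once per uploaded object."""
    uri = gcs_uri(event)
    # Skip anything that is not a data file (hypothetical naming convention).
    if not uri.endswith(".json"):
        return None
    return load(uri, table_id)
```

Passing the loader in as a parameter keeps the trigger logic testable without credentials; in a deployed function the default loader would be used.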
19. “Data needs to be processed in
multiple services.
How can we pipe to multiple places?”
20. Architecting for The Cloud
[Diagram labels: On-Premises Servers, Event Sourcing, Frontend, Platform Services, Metrics / Logs / Streaming, Stream, Batch, Cloud Storage, Cloud Dataflow (Process), BigQuery, Cloud SQL, Analyze, Data Studio, Third-Party Tools]
21. “We have our app outside of GCP.
How can we use the benefits of BigQuery?”
22. Data Pipeline Integration at REEA.net
[Diagram labels: On-Premises Servers, Application Servers, Servers, Standard Devices, HTTPS, Frontend, Platform Services, Event Sourcing, Metrics / Logs / Streaming, Pipelines, FluentD, Analytics Backend, BigQuery, Cloud Storage (archive; Load, Export, Replay), Database, SQL, Development Team, Data Analysts, Report & Share, Business Analysis, Tools (Tableau, QlikView, Data Studio, Internal Dashboard)]
23. The following slides will present a sample Fluentd configuration to:
1. Transform a record
2. Copy event to multiple outputs
3. Store event data in a file (for backup/log purposes)
4. Stream to BigQuery (for immediate analysis)
24. <filter frontend.user.*>
  @type record_transformer
</filter>
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
  </store>
  <store>
    @type bigquery
  </store>
  …
</match>
The filter plugin mutates incoming data. Add/modify/delete event data, transform attributes without a code deploy.
The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great to ship the same event to multiple subsystems.
The BigQuery output plugin streams the event on the fly to the BigQuery warehouse. No need to write integration code. Data is available immediately for querying.
Whenever needed, other output plugins can be wired in: Kafka, Google Cloud Storage output plugin.
25. record_transformer copy file BigQuery
<filter frontend.user.*>
  @type record_transformer
  enable_ruby
  remove_keys host
  <record>
    bq {"insert_id":"${uid}","host":"${host}",
        "created":"${time.to_i}"}
    avg ${record["total"] / record["count"]}
  </record>
</filter>
Syntax: Ruby, easy to use.
Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in a clear log/analytics DB,
- renaming without a code deploy.
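For readers who do not use Fluentd, the effect of the record_transformer filter on a single event can be approximated in plain Python. This is a sketch of the behavior, not the plugin's implementation: the function name `transform` is hypothetical, `uid`, `host`, `total`, and `count` are assumed to be fields of the incoming record, and the exact value types Fluentd emits depend on plugin settings such as type casting.

```python
def transform(record, time_epoch):
    """Approximate the filter on one event (a plain dict)."""
    out = dict(record)

    # Equivalent of the `bq` line: a nested object with string values.
    out["bq"] = {
        "insert_id": str(record["uid"]),
        "host": str(record["host"]),
        "created": str(int(time_epoch)),
    }

    # Equivalent of the `avg` line. Ruby's `/` floors for two integers,
    # so `//` matches it there; otherwise use true division.
    total, count = record["total"], record["count"]
    if isinstance(total, int) and isinstance(count, int):
        out["avg"] = total // count
    else:
        out["avg"] = total / count

    # Equivalent of `remove_keys host`: drop the top-level field.
    out.pop("host", None)
    return out


if __name__ == "__main__":
    event = {"uid": "u1", "host": "web1", "total": 10, "count": 4}
    print(transform(event, 1519862400))
```

Note that the host value survives inside `bq` even though the top-level `host` key is removed.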
26. record_transformer copy file BigQuery
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
    <template>
      path /tank/storage/${tag}.*.log
      time_slice_format %Y%m%d
    </template>
  </store>
</match>
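The two ideas in this configuration, fanning one event out to several outputs and writing files named by tag and day, can be sketched in Python as follows. This is a simplified illustration, not Fluentd's code: the function names are hypothetical, the sketch uses UTC for the day slice, and it ignores the extra chunk suffixes a real Fluentd file output adds to file names.

```python
from datetime import datetime, timezone


def file_path_for(tag, time_epoch,
                  template="/tank/storage/${tag}.*.log",
                  time_slice_format="%Y%m%d"):
    """Expand ${tag} and replace '*' with the event's time slice."""
    moment = datetime.fromtimestamp(time_epoch, tz=timezone.utc)
    time_slice = moment.strftime(time_slice_format)
    return template.replace("${tag}", tag).replace("*", time_slice)


def copy_event(tag, time_epoch, record, outputs):
    """Hand the same event to every output, like the <store> list."""
    for output in outputs:
        # Each output gets its own copy so one cannot mutate another's data.
        output(tag, time_epoch, dict(record))


if __name__ == "__main__":
    def to_file(tag, ts, rec):
        print("file:", file_path_for(tag, ts), rec)

    def to_bigquery(tag, ts, rec):
        print("bigquery:", rec)

    copy_event("frontend.user.login", 1519862400, {"uid": "u1"},
               [to_file, to_bigquery])
```

Keeping a file copy next to the BigQuery stream gives a local backup of the raw events.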
27. record_transformer copy file BigQuery
<match frontend.user.*>
  @type bigquery
  method insert
  auth_method json_key
  json_key /etc/td-agent/keys/key-31da042be48c.json
  time_field timestamp
  time_slice_format %Y%m%d
  table user$%{time_slice}
  ignore_unknown_values
  schema_path /etc/td-agent/schema/user_login.json
</match>
Connector uses:
- JSON key auth file
- JSON table schema
Pro features:
- streaming to Partitioned tables
- ignore unknown values
(not reflected in schema)
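The two "pro features" can be illustrated with a short Python sketch. It shows how a day-partition decorator such as `user$20180301` is derived from an event timestamp, and what ignoring unknown values means for a row. The helper names and the sample schema are hypothetical, the sketch uses UTC, and in the real setup the filtering of unknown fields is handled by the connector and BigQuery rather than by code like this.

```python
from datetime import datetime, timezone


def partition_table(base_table, time_epoch):
    """Return the table name with a day-partition decorator."""
    moment = datetime.fromtimestamp(time_epoch, tz=timezone.utc)
    return "{}${}".format(base_table, moment.strftime("%Y%m%d"))


def schema_field_names(schema):
    """Collect top-level field names from a BigQuery-style JSON schema."""
    return {field["name"] for field in schema}


def drop_unknown_fields(record, known_fields):
    """Keep only the keys that appear in the table schema."""
    return {key: value for key, value in record.items() if key in known_fields}


if __name__ == "__main__":
    # Hypothetical schema in the list-of-fields JSON layout.
    schema = [
        {"name": "uid", "type": "STRING"},
        {"name": "timestamp", "type": "TIMESTAMP"},
    ]
    row = {"uid": "u1", "timestamp": 1519862400, "debug": "not in schema"}
    print(partition_table("user", row["timestamp"]))
    print(drop_unknown_fields(row, schema_field_names(schema)))
```

Streaming into the partition that matches the event day keeps later queries that filter on a single day cheap, since only that partition is scanned.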
28. ● On data that is difficult to process/analyze using traditional databases
● Not a replacement for traditional DBs, but it complements the system
● Major strength is handling large datasets
● Applying JavaScript UDFs on columnar storage to resolve complex tasks
(e.g. JS for natural language processing)
● On streams (forms, IoT, Kafka)
● On exploring unstructured data
Where to use BigQuery?
29. ➢ Optimize product pages
➢ Email engagement
➢ Funnel Analysis
Achievements - goal reached by measuring everything
31. Funnel analysis: Time on upsell pages
32. Example HITS chain:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
Attribute credit to first article visited on purchase
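The attribution rule on this slide (credit the first article visited when a session ends in a purchase) can be expressed in a few lines of Python. The talk runs this kind of analysis in BigQuery; plain Python is used here only to show the logic. The function name and the page-name prefixes used to recognize articles and completed purchases are assumptions based on the example chain.

```python
def first_article_credit(hits,
                         is_article=lambda page: page.startswith("article"),
                         is_purchase=lambda page: page.startswith("thankyoupage")):
    """Return the first article seen before a purchase, or None.

    `hits` is the ordered list of page names in one visitor session.
    """
    # No purchase in the chain means there is nothing to attribute.
    if not any(is_purchase(page) for page in hits):
        return None
    for page in hits:
        if is_purchase(page):
            break  # only articles visited before the purchase count
        if is_article(page):
            return page
    return None


if __name__ == "__main__":
    chain = ["article1", "page2", "page3", "page4",
             "orderpage1", "thankyoupage1"]
    print(first_article_credit(chain))
```

The two predicates are parameters so the same function can be reused with whatever page-naming scheme a site actually has.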
33. ● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content
A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics - engaged users etc...
Achievements Continued
34. ● SQL language to run BigData queries
● run raw ad-hoc queries (by analysts, sales, or devs)
● no more throwing away, expiring, or aggregating old data
● no provisioning/deploy
● no running out of resources
● no more focus on large-scale execution plans
● no need to re-implement tricky concepts
(time windows / join streams)
Our benefits
35. ● No manual sharding
● No capacity guessing
● No idle resources
● No maintenance windows
● No manual scaling
● No file mgmt
BigQuery: Serverless Data Warehouse
36. ● No servers to provision or manage
● Abstract away the complexity
● Scales with usage (always ready for viral spikes or #BlackFriday)
● Availability and fault tolerance built in
● No orchestration in code
● Never pay for idle
● Cost savings (PS: we don’t have the same security budget as GCP or AWS)
● Decoupled: APIs as contracts
● Monitored: Metrics and logging are a universal right
● Think concurrent, stateless, queue, stream based.
Serverless means
37. Easily Build Custom Reports and Dashboards
38. Thank you.
Slides available on:
slideshare.net/martonkodok
Reea.net - Integrated web solutions driven by
creativity to deliver projects.