Introduction to Snowplow - an
open source event analytics
platform
Big Data & Data Science – Israel
Agenda today
1. Introduction to Snowplow
2. Current Snowplow design and architecture
3. Agile event analytics with Snowplow & Looker
4. Evolution of Snowplow
5. Questions
Many thanks for organizing to:
Introduction to Snowplow
Snowplow is an open-source web and event analytics platform,
first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at
OpenX, the open-source ad technology business
in 2008
• After leaving OpenX, Alex and Yali set up
Keplar, a niche digital product and analytics
consultancy
• We released Snowplow as a skunkworks
prototype at the start of 2012:
github.com/snowplow/snowplow
• We started working full time on Snowplow in
summer 2013
At Keplar, we grew frustrated by significant limitations in
traditional web analytics programs
Data collection
• Sample-based (e.g. Google Analytics)
• Limited set of events, e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)
Data processing
• Data is processed ‘once’
• No validation
• No opportunity to reprocess, e.g. following an update to business rules
• Data is aggregated prematurely
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
Data access
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets
And we saw the potential of new “big data” technologies and
services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your
granular, event-level data, so you can perform any analysis
Amazon EMR, Amazon S3, CloudFront, Amazon Redshift
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your
specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your
data
Data pipeline -> Data warehouse -> Analyse your data in any analysis tool
Early on, we made a crucial decision: Snowplow should be
composed of a set of loosely coupled subsystems
1. Trackers: generate event data from any environment. Launched with: JavaScript tracker
2. Collectors: log raw events from trackers. Launched with: CloudFront collector
3. Enrich: validate and enrich raw events. Launched with: HiveQL + Java UDF-based enrichment
4. Storage: store enriched events ready for analysis. Launched with: Amazon S3
5. Analytics: analyze enriched events. Launched with: HiveQL recipes
A, B, C, D = standardised data protocols between the subsystems
These turned out to be critical to allowing us
to evolve the above stack
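As a rough illustration of why standardised protocols between subsystems matter, here is a minimal sketch in which two collectors can be swapped freely because both emit the same raw-event shape. Every name here (jsTracker, the collector functions, the payload fields, the in-memory storage) is a hypothetical stand-in, not Snowplow's actual code or protocol definition.

```javascript
// Hypothetical sketch: each stage depends only on the shape of the data it
// receives, never on the stage that produced it.

// Tracker: emits an event as a querystring-style payload (assumed format).
const jsTracker = () => "e=pv&url=http%3A%2F%2Fexample.com%2F&uid=u1";

// Two interchangeable collectors that both emit the same raw-event shape.
const cloudFrontCollector = (payload) => ({ collector: "cloudfront", payload });
const clojureCollector = (payload) => ({ collector: "clojure", payload });

// Enrichment: reads only the agreed raw-event shape.
const enrich = (raw) => {
  const fields = Object.fromEntries(new URLSearchParams(raw.payload));
  return { event: fields.e, pageUrl: fields.url, userId: fields.uid };
};

// Storage: an in-memory stand-in for S3 or a database.
const makeStorage = () => {
  const rows = [];
  return { store: (event) => rows.push(event), rows };
};

// Wire a pipeline from any collector that honours the protocol.
const runPipeline = (collector, storage) =>
  storage.store(enrich(collector(jsTracker())));

const storageA = makeStorage();
const storageB = makeStorage();
runPipeline(cloudFrontCollector, storageA);
runPipeline(clojureCollector, storageB);

// Downstream stages see identical enriched events whichever collector ran.
console.log(JSON.stringify(storageA.rows) === JSON.stringify(storageB.rows));
```

Swapping the collector changes nothing downstream, which is the property the protocols are meant to protect.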
Our initial skunkworks version of Snowplow – it was basic but it
worked, and we started getting traction
Snowplow data pipeline v1 (spring 2012):
Website / webapp with JavaScript event tracker -> CloudFront-based pixel collector -> HiveQL + Java UDF “ETL” -> Amazon S3
What did people start using it for?
• Warehousing their web event data
• Agile aka ad hoc analytics
To enable…
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn detection
• RTB fraud
• Product recommendations
Current Snowplow design
and architecture
Our protocol-first, loosely-coupled approach made it possible to
start swapping out existing components…
Snowplow data pipeline v2 (spring 2013):
Website / webapp with JavaScript event tracker -> CloudFront-based event collector or Clojure-based event collector -> Scalding-based enrichment (swapped in for the HiveQL + Java UDF “ETL”) -> Amazon S3 and Amazon Redshift / PostgreSQL
• Clojure-based event collector: allows Snowplow users to set a third-party cookie with a user ID. Important for ad networks, widget companies, multi-domain retailers
• Amazon Redshift / PostgreSQL: because Snowplow users wanted a much faster query loop than HiveQL/MapReduce
• Scalding-based enrichment: we wanted a robust, feature-rich framework for managing validations, enrichments etc.
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building
data processing pipelines on Hadoop:
Stack, from bottom to top:
• Hadoop DFS
• Hadoop MapReduce
• Cascading, Hive, Pig
• On top of Cascading: Java, Scalding, Cascalog, PyCascading, cascading.jruby
We chose Cascading because we liked their “plumbing”
abstraction over vanilla MapReduce
Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• We believe that data pipelines should be as strongly
typed as possible – all the other DSLs/APIs on top of
Cascading encourage dynamic typing:
• Define the inputs and outputs of each of your data processing
steps in an unambiguous way
• Catch errors as soon as possible – and report them in a strongly
typed way too
Our “enrichment process” (formerly known as ETL) actually does
two things: validation and enrichment
• Our validation model looks like this:
• Under the covers, we use a lot of monadic Scala (Scalaz) code
Raw events -> Enrichment Manager -> “Good” enriched events, or “Bad” raw events + reasons why they are bad
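The good/bad split can be sketched in a few lines. This is not the Scalaz code the slide refers to, just a plain JavaScript illustration of the same idea; the two validators and the event fields (event, timestamp) are hypothetical.

```javascript
// Hypothetical validators: each returns an error message, or null if the event passes.
const validators = [
  (e) => (typeof e.event === "string" && e.event !== "" ? null : "missing event type"),
  (e) => (Number.isFinite(Date.parse(e.timestamp)) ? null : "unparseable timestamp"),
];

// Collect every failure reason rather than stopping at the first one.
function validate(rawEvent) {
  const reasons = validators
    .map((check) => check(rawEvent))
    .filter((reason) => reason !== null);
  return reasons.length === 0
    ? { ok: true, event: rawEvent }
    : { ok: false, raw: rawEvent, reasons };
}

// Split a batch into "good" events and "bad" raw events plus their reasons.
function processBatch(rawEvents) {
  const good = [];
  const bad = [];
  for (const raw of rawEvents) {
    const result = validate(raw);
    if (result.ok) good.push({ ...result.event, enriched: true });
    else bad.push({ raw: result.raw, reasons: result.reasons });
  }
  return { good, bad };
}

const { good, bad } = processBatch([
  { event: "page_view", timestamp: "2014-02-09T10:00:00Z" },
  { event: "" }, // empty event type and no timestamp at all
]);
console.log(good.length, bad.length, bad[0].reasons);
```

Keeping the bad events together with their reasons, instead of dropping them, is what makes it possible to fix and reprocess them later.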
Adding the enrichments that web analysts expect = very
important to Snowplow uptake
• Web analysts are used to a very specific set of enrichments from Google
Analytics, SiteCatalyst etc.
• These enrichments have evolved over the past 15-20 years and are very domain
specific:
• Page querystring -> marketing campaign information (utm_ fields)
• Referer data -> search engine name, country, keywords
• IP address -> geographical location
• Useragent -> browser, OS, computer information
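Two of these enrichments are easy to sketch with the standard URL API. The function names and the one-entry search engine table below are hypothetical; a real referer lookup would rely on a maintained database of search engines.

```javascript
// Extract marketing campaign fields (the utm_ parameters) from a page URL.
function campaignFromUrl(pageUrl) {
  const params = new URL(pageUrl).searchParams;
  return {
    source: params.get("utm_source"),
    medium: params.get("utm_medium"),
    campaign: params.get("utm_campaign"),
    term: params.get("utm_term"),
    content: params.get("utm_content"),
  };
}

// Toy lookup table: hostname -> search engine name and its keyword parameter.
const SEARCH_ENGINES = new Map([
  ["www.google.com", { name: "Google", keywordParam: "q" }],
]);

// Derive the search engine name and keywords from a referer URL, if known.
function searchFromReferer(refererUrl) {
  const url = new URL(refererUrl);
  const engine = SEARCH_ENGINES.get(url.hostname);
  if (!engine) return null;
  return { engine: engine.name, keywords: url.searchParams.get(engine.keywordParam) };
}

const campaign = campaignFromUrl(
  "http://shop.example.com/dresses?utm_source=newsletter&utm_medium=email&utm_campaign=spring"
);
const search = searchFromReferer("http://www.google.com/search?q=red+dress");
console.log(campaign, search);
```

Parameters that are absent come back as null, so downstream code can tell "no campaign" apart from an empty value.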
We aim to make our validation and enrichment process as
modular as possible
Enrichment
Manager
Not yet
integrated
• This encourages testability and re-use – also it widens the number of
contributors vs this functionality being embedded in Snowplow
• The Enrichment Manager uses external libraries (hosted in a Snowplow
repository) which can be used in non-Snowplow projects:
Agile event analytics with
Snowplow and Looker
Just last week we announced our official partnership with
Looker
• Looker is a BI visualization and data modelling startup with some cool features:
1. Slice and dice any combination of dimension and metrics
2. Quickly and easily define dimensions and metrics that are specific to your business
using Looker's light-weight metadata model
3. Drill-up and drill-down to visitor-level and event-level data
4. Dashboards are a starting point for more involved analysis
5. Access your data from any application: Looker as a general purpose data server
Demo – first let’s look at some enriched Snowplow events in
Redshift
Demo – now let’s see how that translates into Looker
Evolution of Snowplow
There are three big aspects to Snowplow’s roadmap
1. Make Snowplow work as well for non-web (e.g. mobile, IoT) environments as
the web
2. Make Snowplow work as well with unstructured events as it does with
structured events (aka page views, ecommerce transactions etc)
3. Move Snowplow away from an S3-based data pipeline to a unified log
(Kinesis/Kafka)-based data pipeline
Snowplow is developing into an event analytics platform (not
just a web analytics platform)
Data warehouse
Collect event data
from any connected
device
So far we have open-sourced a few different trackers – with
more planned
• JavaScript Tracker – the original
• No-JS aka pixel tracker
• Lua Tracker – for games
• Arduino Tracker – for the Internet of Things
• Python Tracker – releasing this week
As we get further away from the web, we need to start
supporting unstructured events
• By unstructured events, we mean events represented as JSONs with arbitrary
name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!)
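// Assumes the Snowplow JavaScript tracker's asynchronous queue;
// create it here in case the tracker has not loaded yet
var _snaq = _snaq || [];
_snaq.push(['trackUnstructEvent', 'Viewed Product',
  {
    product_id: 'ASO01043',
    category: 'Dresses',
    brand: 'ACME',
    returning: true,
    price: 49.95,
    sizes: ['xs', 's', 'l', 'xl', 'xxl'],
    // JavaScript months are zero-based, so this is 7 April 2013
    available_since$dt: new Date(2013,3,7)
  }
]);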
Supporting structured and unstructured events is a difficult
problem
• Almost all of our competitors fall on one or other side of the structured-
unstructured divide:
Structured events (page views etc) Unstructured events (JSONs)
We want to bridge that divide, making it so that
Snowplow comes with structured events “out of the
box”, but is extensible with unstructured events
This is super-important to enable businesses to construct their
own high-value bespoke analytics
• What is the impact of different ad campaigns and creative on the way users
behave, subsequently? What is the return on that ad spend?
• How do visitors use social channels (Facebook / Twitter) to interact around video
content? How can we predict which content will “go viral”?
• How do updates to our product change the “stickiness” of our service? ARPU?
Does that vary by customer segment?
To achieve this, we are prototyping a new approach using JSON
Schema, Thrift/Avro and a shredding library
• We are planning to replace the existing flow with a JSON Schema-driven
approach:
Raw events in JSON format -> Enrichment Manager -> Enriched events in Thrift or Avro format -> Shredder -> Enriched events in TSV ready for loading into db
JSON Schema defining events is used at each stage:
1. Define structure (raw events)
2. Validate events (Enrichment Manager)
3. Define structure (enriched events)
4. Drive shredding (Shredder)
5. Define structure (TSV output)
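To make the schema-driven flow more concrete, here is a minimal sketch in which one schema both validates an event and drives its conversion to a TSV row. It is not a JSON Schema implementation: only "required" and per-property "type" are checked, and the schema, the function names and the flattening rule are all assumptions made for illustration.

```javascript
// A schema in JSON Schema style; only "required" and property "type" are used here.
const viewedProductSchema = {
  type: "object",
  required: ["product_id", "price"],
  properties: {
    product_id: { type: "string" },
    category: { type: "string" },
    price: { type: "number" },
    sizes: { type: "array" },
  },
};

// JSON type name of a value ("array" and "null" need special handling in JavaScript).
function jsonType(value) {
  if (Array.isArray(value)) return "array";
  if (value === null) return "null";
  return typeof value;
}

// Return a list of error messages; an empty list means the event is valid.
function validate(schema, event) {
  const errors = [];
  for (const key of schema.required) {
    if (!(key in event)) errors.push(`missing required property: ${key}`);
  }
  for (const [key, rule] of Object.entries(schema.properties)) {
    if (key in event && jsonType(event[key]) !== rule.type) {
      errors.push(`${key} should be ${rule.type}`);
    }
  }
  return errors;
}

// "Shred" an event into one tab-separated row, with columns ordered by the schema.
function shred(schema, event) {
  return Object.keys(schema.properties)
    .map((key) => {
      const value = event[key];
      if (value === undefined) return "";
      return Array.isArray(value) ? value.join(",") : String(value);
    })
    .join("\t");
}

const event = { product_id: "ASO01043", category: "Dresses", price: 49.95, sizes: ["xs", "s"] };
const errors = validate(viewedProductSchema, event);
const row = errors.length === 0 ? shred(viewedProductSchema, event) : null;
console.log(errors, row);
```

Because the column order comes from the schema rather than from the event, every row for a given event type has the same layout, which is what a database load needs.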
JSON Schema just gives us a way of representing structure – we
are also evolving a grammar to represent events
Event grammar: Subject, Verb, Direct Object, Indirect Object, Prepositional Object, Event Context
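One way to picture this grammar is as a record with one field per grammatical role. The sketch below is purely illustrative: the field names, the defaults and the sentence rendering are assumptions, not a published Snowplow format.

```javascript
// Hypothetical event shape following the grammar: one field per grammatical role.
function makeEvent({
  subject,
  verb,
  directObject,
  indirectObject = null,
  prepositionalObject = null,
  context = {},
}) {
  return { subject, verb, directObject, indirectObject, prepositionalObject, context };
}

// Render the event as a rough English sentence, skipping any roles that are absent.
function toSentence(event) {
  const parts = [event.subject, event.verb, event.directObject];
  if (event.indirectObject) parts.push("to", event.indirectObject);
  if (event.prepositionalObject) parts.push(event.prepositionalObject);
  return parts.join(" ");
}

const sent = makeEvent({
  subject: "user-123",
  verb: "sent",
  directObject: "video-456",
  indirectObject: "user-789",
  prepositionalObject: "on Twitter",
  context: { device: "mobile" },
});
console.log(toSentence(sent)); // user-123 sent video-456 to user-789 on Twitter
```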
In parallel, we plan to evolve Snowplow from an event analytics
platform into a “digital nervous system” for data driven
companies
• The event data fed into Snowplow is written into a “Unified Log”
• This becomes the “single source of truth”, upstream from the data warehouse
• The same source of truth is used for real-time data processing as for analytics, e.g.
• Product recommendations
• Ad targeting
• Real-time website personalisation
• Systems monitoring
Snowplow will drive data-driven processes as well as offline analytics
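The core idea of a unified log, an append-only sequence that several consumers read independently, can be sketched in memory. The class and consumer names below are hypothetical; Kinesis and Kafka add partitioning, durability and retention on top of this basic idea.

```javascript
// Minimal in-memory unified log: append-only records, one read offset per consumer.
class UnifiedLog {
  constructor() {
    this.records = []; // records are only ever appended, never updated or deleted
    this.offsets = new Map(); // consumer name -> index of the next record to read
  }

  // Append a record and return its position in the log.
  append(record) {
    this.records.push(record);
    return this.records.length - 1;
  }

  // Return every record this consumer has not yet seen, then advance its offset.
  poll(consumer) {
    const from = this.offsets.get(consumer) || 0;
    const batch = this.records.slice(from);
    this.offsets.set(consumer, this.records.length);
    return batch;
  }
}

const log = new UnifiedLog();
log.append({ type: "page_view", user: "u1" });
log.append({ type: "add_to_basket", user: "u1" });

// A low-latency consumer reads as events arrive...
const firstBatch = log.poll("recommender");
log.append({ type: "purchase", user: "u1" });
const secondBatch = log.poll("recommender");

// ...while a slower consumer later reads the full history from the same log.
const fullHistory = log.poll("warehouse-loader");
console.log(firstBatch.length, secondBatch.length, fullHistory.length);
```

Each consumer keeps its own position, so real-time processing and warehouse loading share one source of truth without interfering with each other.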
[Diagram: unified log architecture. Narrow data silos (search, e-comm, CRM, email marketing, ERP, CMS), spread across SaaS vendor #1, SaaS vendor #2 and a cloud vendor / own data center, each with some low latency local loops, feed a unified log via streaming APIs / web hooks. The unified log provides wide data coverage as an eventstream and is archived to Hadoop for full data history. Uses shown along an axis from high latency to low latency: ad hoc analytics, management reporting, systems monitoring, product rec’s, fraud detection, churn prevention, APIs.]
Some background on unified log based architectures
We are part way through our Kinesis support, with additional
components being released soon
Snowplow Trackers -> Scala Stream Collector -> Raw event stream -> Enrich Kinesis app -> Enriched event stream (plus a Bad raw events stream)
Sinks: S3 sink Kinesis app -> S3; Redshift sink Kinesis app -> Redshift
• The parts in grey are still
under development – we
are working with
Snowplow community
members on these
collaboratively
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To have a meeting, coffee or beer tomorrow (Monday) –
@alexcrdean or alex@snowplowanalytics.com

Ensure the security of your HCL environment by applying the Zero Trust princi...
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael Hawkins
 

Introduction to Snowplow - Big Data & Data Science Israel

  • 1. Introduction to Snowplow - an open source event analytics platform Big Data & Data Science – Israel
  • 2. Agenda today 1. Introduction to Snowplow 2. Current Snowplow design and architecture 3. Agile event analytics with Snowplow & Looker 4. Evolution of Snowplow 5. Questions Many thanks for organizing to:
  • 4. Snowplow is an open-source web and event analytics platform, first version released in early 2012 • Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business in 2008 • After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy • We released Snowplow as a skunkworks prototype at start of 2012: github.com/snowplow/snowplow • We started working full time on Snowplow in summer 2013
  • 5. At Keplar, we grew frustrated by significant limitations in traditional web analytics programs. Data collection: • Sample-based (e.g. Google Analytics) • Limited set of events, e.g. page views, goals, transactions • Limited set of ways of describing events (custom dim 1, custom dim 2…). Data processing: • Data is processed ‘once’ • No validation • No opportunity to reprocess, e.g. following an update to business rules • Data is aggregated prematurely. Data access: • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics) • Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst) • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst) • As a result, data is siloed: hard to join with other data sets
  • 6. And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner. These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift
  • 7. We wanted to take a fresh approach to web analytics • Your own web event data -> in your own data warehouse • Your own event data model • Slice / dice and mine the data in highly bespoke ways to answer your specific business questions • Plug in the broadest possible set of analysis tools to drive value from your data (Diagram labels: Data pipeline, Data warehouse, Analyse your data in any analysis tool)
  • 8. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems, connected by standardised data protocols (labelled A to D on the slide): 1. Trackers: generate event data from any environment (launched with: JavaScript tracker) 2. Collectors: log raw events from trackers (launched with: CloudFront collector) 3. Enrich: validate and enrich raw events (launched with: HiveQL + Java UDF-based enrichment) 4. Storage: store enriched events ready for analysis (launched with: Amazon S3) 5. Analytics: analyze enriched events (launched with: HiveQL recipes). The standardised protocols turned out to be critical to allowing us to evolve the above stack
  • 9. Our initial skunkworks version of Snowplow: it was basic but it worked, and we started getting traction. Snowplow data pipeline v1 (spring 2012): Website / webapp -> JavaScript event tracker -> CloudFront-based pixel collector -> HiveQL + Java UDF “ETL” -> Amazon S3
  • 10. What did people start using it for? • Warehousing their web event data • Agile aka ad hoc analytics. To enable… • Marketing attribution modelling • Customer lifetime value calculations • Customer churn detection • RTB fraud • Product recommendations
  • 12. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Snowplow data pipeline v2 (spring 2013), components shown: Website / webapp; JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; HiveQL + Java UDF “ETL”; Scalding-based enrichment; Amazon S3; Amazon Redshift / PostgreSQL
  • 13. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Snowplow data pipeline v2 (spring 2013), components shown: Website / webapp; JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; HiveQL + Java UDF “ETL”; Scalding-based enrichment; Amazon S3; Amazon Redshift / PostgreSQL. Why we made each swap: • Clojure-based collector: allows Snowplow users to set a third-party cookie with a user ID; important for ad networks, widget companies, multi-domain retailers • Amazon Redshift / PostgreSQL: because Snowplow users wanted a much faster query loop than HiveQL/MapReduce • Scalding-based enrichment: we wanted a robust, feature-rich framework for managing validations, enrichments etc
  • 14. What is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. Stack shown, bottom to top: Hadoop DFS; Hadoop MapReduce; on top of MapReduce: Cascading, Hive, Pig; on top of Cascading: Java, Scalding, Cascalog, PyCascading, cascading.jruby
  • 15. We chose Cascading because we liked their “plumbing” abstraction over vanilla MapReduce
  • 16. Why did we choose Scalding instead of one of the other Cascading DSLs/APIs? • Lots of internal experience with Scala: we could hit the ground running (we had only very basic awareness of Clojure when we started the project) • Scalding is created and supported by Twitter, who use it throughout their organization, so we knew it was a safe long-term bet • We believe that data pipelines should be as strongly typed as possible, whereas all the other DSLs/APIs on top of Cascading encourage dynamic typing. Strong typing lets us: • Define the inputs and outputs of each data processing step in an unambiguous way • Catch errors as soon as possible, and report them in a strongly typed way too
  • 17. Our “enrichment process” (formerly known as ETL) actually does two things: validation and enrichment • Our validation model looks like this: raw events -> Enrichment Manager -> either “good” enriched events, or “bad” raw events + reasons why they are bad • Under the covers, we use a lot of monadic Scala (Scalaz) code
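
The validation model on this slide can be sketched in a few lines of Python. This is only an illustration of the "good events, or bad events plus reasons" shape, not Snowplow's actual Scala/Scalaz code: the field names (`event`, `tstamp`), the checks and the `validated` flag are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Good:
    """An event that passed every check, with enrichments applied."""
    event: dict


@dataclass
class Bad:
    """The untouched raw event plus every reason it was rejected."""
    raw: dict
    reasons: list


def require_field(name):
    """Build a check that fails when a field is absent (hypothetical rule)."""
    def check(raw):
        return None if name in raw else f"missing field: {name}"
    return check


def tstamp_is_int(raw):
    """Fail when a timestamp is present but not an integer (hypothetical rule)."""
    ts = raw.get("tstamp")
    if ts is None or isinstance(ts, int):
        return None
    return "tstamp is not an integer"


CHECKS = [require_field("event"), require_field("tstamp"), tstamp_is_int]


def process(raw):
    """Run all checks, accumulating failures instead of stopping at the first."""
    reasons = [r for r in (check(raw) for check in CHECKS) if r is not None]
    if reasons:
        return Bad(raw, reasons)
    # Placeholder for real enrichment: just mark the event as validated.
    return Good({**raw, "validated": True})
```

Collecting every failure reason, rather than raising on the first one, keeps the bad events debuggable and reprocessable, which is the point of emitting them alongside the good ones.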
  • 18. Adding the enrichments that web analysts expect = very important to Snowplow uptake • Web analysts are used to a very specific set of enrichments from Google Analytics, SiteCatalyst etc • These enrichments have evolved over the past 15-20 years and are very domain specific: • Page querystring -> marketing campaign information (utm_ fields) • Referer data -> search engine name, country, keywords • IP address -> geographical location • Useragent -> browser, OS, computer information
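
Two of these enrichments can be sketched with Python's standard library. The sketch below is illustrative, not Snowplow's implementation: the function names are hypothetical, and the one-entry `SEARCH_ENGINES` table is an assumption standing in for a full referer database.

```python
from urllib.parse import urlparse, parse_qs

UTM_FIELDS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")

# Illustrative lookup only: referer host -> (engine name, keyword parameter).
SEARCH_ENGINES = {"www.google.com": ("Google", "q")}


def campaign_from_url(page_url):
    """Page querystring -> marketing campaign information (utm_ fields)."""
    query = parse_qs(urlparse(page_url).query)
    return {name: query[name][0] for name in UTM_FIELDS if name in query}


def search_from_referer(referer):
    """Referer -> search engine name and keywords, or None if not a known engine."""
    parts = urlparse(referer)
    engine = SEARCH_ENGINES.get(parts.netloc)
    if engine is None:
        return None
    name, param = engine
    keywords = parse_qs(parts.query).get(param, [""])[0]
    return {"engine": name, "keywords": keywords}
```

For example, `campaign_from_url("http://example.com/shop?utm_source=newsletter&utm_medium=email&id=7")` keeps only the two `utm_` fields and ignores `id`.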
  • 19. We aim to make our validation and enrichment process as modular as possible • This encourages testability and re-use; it also widens the pool of contributors compared with embedding this functionality in Snowplow • The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects (Diagram: the Enrichment Manager and its external libraries, one marked “Not yet integrated”)
  • 20. Agile event analytics with Snowplow and Looker
  • 21. Just last week we announced our official partnership with Looker • Looker is a BI visualization and data modelling startup with some cool features: 1. Slice and dice any combination of dimensions and metrics 2. Quickly and easily define dimensions and metrics that are specific to your business using Looker's light-weight metadata model 3. Drill up and drill down to visitor-level and event-level data 4. Dashboards are a starting point for more involved analysis 5. Access your data from any application: Looker as a general purpose data server
  • 22. Demo – first let’s look at some enriched Snowplow events in Redshift
  • 23. Demo – now let’s see how that translates into Looker
  • 25. There are three big aspects to Snowplow’s roadmap 1. Make Snowplow work as well for non-web (e.g. mobile, IoT) environments as it does for the web 2. Make Snowplow work as well with unstructured events as it does with structured events (page views, ecommerce transactions etc) 3. Move Snowplow away from an S3-based data pipeline to a unified log (Kinesis/Kafka)-based data pipeline
  • 26. Snowplow is developing into an event analytics platform (not just a web analytics platform) (Diagram: collect event data from any connected device into the data warehouse)
  • 27. So far we have open-sourced a few different trackers, with more planned • JavaScript Tracker: the original • No-JS aka pixel tracker • Lua Tracker: for games • Arduino Tracker: for the Internet of Things • Python Tracker: releasing this week
  • 28. As we get further away from the web, we need to start supporting unstructured events • By unstructured events, we mean events represented as JSONs with arbitrary name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!) _snaq.push(['trackUnstructEvent', 'Viewed Product', { product_id: 'ASO01043', category: 'Dresses', brand: 'ACME', returning: true, price: 49.95, sizes: ['xs', 's', 'l', 'xl', 'xxl'], available_since$dt: new Date(2013,3,7) } ]);
  • 29. Supporting structured and unstructured events is a difficult problem • Almost all of our competitors fall on one or other side of the structured-unstructured divide: structured events (page views etc) on one side, unstructured events (JSONs) on the other
  • 30. We want to bridge that divide, making it so that Snowplow comes with structured events (page views etc) “out of the box”, but is extensible with unstructured events (JSONs)
  • 31. This is super-important to enable businesses to construct their own high-value bespoke analytics • What is the impact of different ad campaigns and creative on the way users behave, subsequently? What is the return on that ad spend? • How do visitors use social channels (Facebook / Twitter) to interact around video content? How can we predict which content will “go viral”? • How do updates to our product change the “stickiness” of our service? ARPU? Does that vary by customer segment?
  • 32. To achieve this, we are prototyping a new approach using JSON Schema, Thrift/Avro and a shredding library • We are planning to replace the existing flow with a JSON Schema-driven approach: raw events in JSON format -> Enrichment Manager -> enriched events in Thrift or Avro format -> Shredder -> enriched events in TSV ready for loading into db. The JSON Schema defining events is used to: 1. Define the structure of raw events 2. Validate events 3. Define the structure of enriched events 4. Drive shredding 5. Define the structure of the TSV output
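
A rough Python sketch of how one schema could drive both validation and shredding. Everything here is an assumption for illustration: the schema reuses the product example from slide 28, `validate` hand-rolls only a tiny subset of JSON Schema (`required` plus three primitive `type`s) rather than using a real JSON Schema library, and `shred` is a hypothetical flattening step, not the shredding library mentioned on the slide.

```python
# A schema in JSON Schema style; property names borrowed from the slide 28 example.
SCHEMA = {
    "type": "object",
    "properties": {
        "product_id": {"type": "string"},
        "category": {"type": "string"},
        "price": {"type": "number"},
        "returning": {"type": "boolean"},
    },
    "required": ["product_id", "price"],
}


def matches(value, type_name):
    """Check a value against a primitive JSON type name (subset only)."""
    if type_name == "number":
        # bool is a subclass of int in Python, but true/false are not JSON numbers.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, {"string": str, "boolean": bool}[type_name])


def validate(event, schema):
    """Return a list of problems; an empty list means the event is valid."""
    errors = [f"missing required property: {name}"
              for name in schema.get("required", []) if name not in event]
    for name, spec in schema["properties"].items():
        if name in event and not matches(event[name], spec["type"]):
            errors.append(f"{name}: expected {spec['type']}")
    return errors


def shred(event, schema):
    """Flatten an event to one TSV row, with columns in schema property order."""
    columns = list(schema["properties"])
    cells = ["" if event.get(col) is None else str(event[col]) for col in columns]
    return "\t".join(cells)
```

Because the column order comes from the schema rather than from each event, every row for a given event type lines up with the same table definition.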
  • 33. JSON Schema just gives us a way of representing structure; we are also evolving a grammar to represent events. Grammar elements shown: Subject, Verb, Direct Object, Indirect Object, Prepositional Object, Event Context
  • 34. In parallel, we plan to evolve Snowplow from an event analytics platform into a “digital nervous system” for data driven companies • The event data fed into Snowplow is written into a “Unified Log” • This becomes the “single source of truth”, upstream from the data warehouse • The same source of truth is used for real-time data processing as for analytics, e.g. • Product recommendations • Ad targeting • Real-time website personalisation • Systems monitoring. Snowplow will drive data-driven processes as well as offline analytics
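
The unified log idea can be illustrated with a toy in-memory version in Python. This is a sketch of the pattern only, not of Kinesis or Kafka: the `UnifiedLog` and `Consumer` classes and the two example consumers are hypothetical names invented for the illustration.

```python
class UnifiedLog:
    """Append-only, ordered record of events: the single source of truth."""

    def __init__(self):
        self._records = []

    def append(self, event):
        """Add an event to the end of the log and return its offset."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset):
        """Return every record from the given offset onwards."""
        return self._records[offset:]


class Consumer:
    """Reads the log at its own pace, tracking its own offset."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self):
        """Process everything not yet seen; return how many records that was."""
        batch = self.log.read(self.offset)
        for event in batch:
            self.handler(event)
        self.offset += len(batch)
        return len(batch)


log = UnifiedLog()
realtime_seen, warehouse_seen = [], []
realtime = Consumer(log, realtime_seen.append)    # e.g. product recommendations
warehouse = Consumer(log, warehouse_seen.append)  # e.g. batch load for analytics

log.append({"event": "page_view", "page": "/dresses"})
realtime.poll()                  # low latency: polls after every event
log.append({"event": "add_to_basket", "product_id": "ASO01043"})
realtime.poll()
loaded = warehouse.poll()        # high latency: catches up in one batch
```

Both consumers end up with the same events in the same order; they differ only in how often they poll, which is what lets one log serve real-time processing and offline analytics.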
  • 35. Some background on unified log based architectures. Diagram: narrow data silos (Search, E-comm, ERP and CMS silos in the cloud vendor / own data center, with some low latency local loops; CRM at SaaS vendor #1; email marketing at SaaS vendor #2) feed a unified log via streaming APIs / web hooks. The unified log provides wide data coverage. Low latency: the eventstream drives systems monitoring, product rec’s, fraud detection, churn prevention and APIs. High latency: archiving to Hadoop provides the full data history for ad hoc analytics and management reporting
  • 36. We are part way through our Kinesis support, with additional components being released soon. Pipeline shown: Snowplow Trackers -> Scala Stream Collector -> raw event stream -> Enrich Kinesis app -> enriched event stream and bad raw events stream; S3 sink Kinesis app -> S3; Redshift sink Kinesis app -> Redshift • The parts in grey are still under development; we are working with Snowplow community members on these collaboratively
  • 37. Questions? http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To have a meeting, coffee or beer tomorrow (Monday) – @alexcrdean or alex@snowplowanalytics.com
  • 38. Useful for answering questions… Snowplow data pipeline v2 (spring 2013), components shown: Website / webapp; JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; HiveQL + Java UDF “ETL”; Scalding-based enrichment; Amazon S3; Amazon Redshift / PostgreSQL
  • 39. Useful for answering questions… Snowplow's loosely coupled subsystems, connected by standardised data protocols (labelled A to D on the slide): 1. Trackers: generate event data from any environment (launched with: JavaScript tracker) 2. Collectors: log raw events from trackers (launched with: CloudFront collector) 3. Enrich: validate and enrich raw events (launched with: HiveQL + Java UDF-based enrichment) 4. Storage: store enriched events ready for analysis (launched with: Amazon S3) 5. Analytics: analyze enriched events (launched with: HiveQL recipes). The standardised protocols turned out to be critical to allowing us to evolve the above stack