Using big data tools to analyse web
analytics data

Why use big data tools to analyse web analytics data?
How would you use big data tools to analyse web
analytics data (with Snowplow and Qubole)?
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time (seasonality, progress in the customer journey)
• How behaviour drives value

• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and lifetime value

• It tells you how customers and prospective customers engage with your different
marketing campaigns and how that drives subsequent behaviour

Web analytics data should be essential to driving customer
development, product development and marketing decisions
Deriving value from web analytics data often involves very
bespoke analytics
• The web is a rich and varied space! E.g.
  • Bank
  • Newspaper
  • Social network
  • Analytics application
  • Government organisation (e.g. tax office)
  • Retailer
  • Marketplace

• For each type of business you’d expect different:
  • Types of events, with different types of associated data
  • Ecosystem of customers / partners with different types of relationships
  • Product development cycle (and approach to product development)
  • Types of business questions / priorities to inform how the data is analysed
Web analytics tools are good at delivering the standard reports
that are common across different business types…
• Where does your traffic come from? E.g.
  • Sessions by marketing campaign / referrer
  • Sessions by landing page

• Understanding events common across business types (page views, transactions, ‘goals’), e.g.
  • Page views per session
  • Page views per web page
  • Conversion rate by traffic source
  • Transaction value by traffic source

• Capturing contextual data common to people browsing the web
  • Timestamps
  • Referrer data
  • Web page data (e.g. page title, URL)
  • Browser data (e.g. type, plugins, language)
  • Operating system (e.g. type, timezone)
  • Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
…but not at enabling the high-value bespoke analytics
• What is the impact of different ad campaigns and creative on the way users
subsequently behave? What is the return on that ad spend?

• How do visitors use social channels (Facebook / Twitter) to interact around video
content? How can we predict which content will “go viral”?

• How do updates to our product change the “stickiness” of our service? ARPU?
Does that vary by customer segment?
That is because there are significant limitations in the way
traditional web analytics programmes handle:
Data collection
• Sample-based (e.g. Google Analytics)
• Limited set of events, e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)

Data processing
• Data is processed ‘once’
• No validation
• No opportunity to reprocess, e.g. following an update to business rules
• Data is aggregated prematurely

Data access
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets
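To make the "aggregated prematurely" point concrete, here is a toy sketch in Python. It is not taken from any of the tools named above; the rows, the field names (`source`, `device`, `landing_page`) and the `pivot` helper are all made up for illustration. It only shows that when event-level rows are kept, the dimensions to group by can be chosen at query time rather than fixed when the data is processed.

```python
from collections import Counter

# Made-up event-level rows; the field names are illustrative assumptions.
EVENTS = [
    {"source": "search", "device": "mobile", "landing_page": "/home"},
    {"source": "search", "device": "desktop", "landing_page": "/pricing"},
    {"source": "social", "device": "mobile", "landing_page": "/home"},
    {"source": "social", "device": "mobile", "landing_page": "/blog"},
]


def pivot(events, *dimensions):
    """Count events for any combination of dimensions, chosen at query time."""
    return Counter(tuple(event[d] for d in dimensions) for event in events)


# A pre-aggregated report fixes its dimensions when the data is processed.
sessions_by_source = pivot(EVENTS, "source")

# With the raw rows still available, a new combination needs no new tracking.
sessions_by_source_and_device = pivot(EVENTS, "source", "device")
```

If only `sessions_by_source` had been stored, the source-by-device breakdown could not be recovered from it.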
We built Snowplow to address those limitations and enable high
value, bespoke analytics on web event data

Data pipeline

Big data store

Snowplow is a data pipeline:
• Captures data from the website via JavaScript tags
• Validates, cleans, and enriches the incoming data (using Hadoop)
• Loads the cleaned / enriched data into a big data store for analysis, e.g. S3, where it can be analysed using big data tools, e.g. Qubole
Understanding the technology that powers the Snowplow data
pipeline
The Snowplow data pipeline consists of five loosely coupled modules:

Trackers generate event data
• JavaScript tracker for collecting data client-side
• No-JS / pixel tracker (e.g. for email marketing)
• Server-side trackers (e.g. Lua tracker). Python / Ruby / Java / Scala on the roadmap
• Mobile trackers (iOS, Android on the roadmap…)
• Internet of things (e.g. Arduino tracker)
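As a rough illustration of how a pixel tracker can work in general, the sketch below encodes an event as query parameters on a 1x1 image URL, which an email client requests when the message is opened. This is not the Snowplow tracker protocol: the collector address, the parameter names (`event`, `page`, `campaign`) and both helper functions are hypothetical.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical collector endpoint, for illustration only.
COLLECTOR_PIXEL = "https://collector.example.com/pixel.gif"


def pixel_url(event_type, page, campaign=None):
    """Build an image URL that carries the event as query parameters."""
    params = {"event": event_type, "page": page}
    if campaign is not None:
        params["campaign"] = campaign
    return f"{COLLECTOR_PIXEL}?{urlencode(params)}"


def img_tag(url):
    """Wrap the URL in a 1x1 image tag, escaping it for use in HTML."""
    return f'<img src="{escape(url, quote=True)}" width="1" height="1" alt="">'


url = pixel_url("email_open", "spring sale", campaign="april-newsletter")
tag = img_tag(url)
```

No JavaScript runs on the client here; the request for the image is itself the event.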

Collectors receive data and write it to a queue for processing
• CloudFront collector writes data to S3
• Clojure collector sets a 3rd-party cookie and writes to S3
• Scala RT collector sets a 3rd-party cookie and writes to S3 AND Kinesis

Enrichment validates and enriches the data
• Validation, e.g. checking that the expected fields are set for each event type
• Enrichments, e.g. categorising referrers (search / social), inferring location from IP
• Hadoop-based enrichment module (easy reprocessing of data)
• Kinesis-based enrichment module (real time processing) in development
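The two ideas in this step, validation and enrichment, can be sketched in a few lines of Python. This is a minimal illustration and not Snowplow's enrichment code (which the slides describe as Hadoop-based). The event type names, the required-field rules, the `referrer_url` and `refr_category` field names and the referrer lookup table are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical rules: which fields must be set for each event type.
REQUIRED_FIELDS = {
    "page_view": ["page_urlpath"],
    "transaction": ["tr_total"],
}

# Hypothetical lookup from referrer host to a traffic category.
REFERRER_CATEGORIES = {
    "www.google.com": "search",
    "www.bing.com": "search",
    "www.facebook.com": "social",
    "twitter.com": "social",
}


def validate(event):
    """Return a list of problems; an empty list means the event is valid."""
    event_type = event.get("event")
    if event_type not in REQUIRED_FIELDS:
        return [f"unknown event type: {event_type!r}"]
    return [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS[event_type]
        if event.get(field) in (None, "")
    ]


def enrich(event):
    """Return a copy of the event with a referrer category added."""
    enriched = dict(event)
    host = urlparse(event.get("referrer_url") or "").netloc.lower()
    if not host:
        enriched["refr_category"] = "direct"
    else:
        enriched["refr_category"] = REFERRER_CATEGORIES.get(host, "unknown")
    return enriched


def process(raw_events):
    """Split raw events into enriched good rows and rejected bad rows."""
    good, bad = [], []
    for event in raw_events:
        problems = validate(event)
        if problems:
            bad.append({"event": event, "errors": problems})
        else:
            good.append(enrich(event))
    return good, bad
```

Because the raw events are kept separate from the output, the same `process` step could be re-run after the rules change, which is the reprocessing benefit the slide points to.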

Storage – make data available for analysis
• Store data in Amazon S3 for processing using big data tools e.g. Qubole
• Also supports storage in Amazon Redshift / PostgreSQL for analysis using
traditional BI tools
So what does Snowplow data look like?
• A single table
• One line of data per event
• Fat table: 98 different fields (and counting)…
| Type of field | Example field(s) | Description |
|---|---|---|
| User ID | domain_userid, network_userid | Fields that identify the user doing the browsing: 1st and 3rd party cookie IDs, browser fingerprints, IP address, and a separate field that can be set to a custom value |
| Web page | page_urlpath | Fields that describe the web page the event occurred on, including document size, URL, title |
| Traffic source | mkt_source, refr_source | Fields that indicate the source of traffic. Snowplow includes fields that can be set via utm parameters and others based on the referrer |
| Event (rather than context) | event, se_action, tr_total | Fields that relate to a specific event (e.g. transaction total) |
| User tech setup | br_type, os_name, dvce_type, br_viewheight | Fields that describe the user’s browser / OS / device setup |
| … | … | … |
How do you analyse Snowplow data with Qubole?
• Common approach: use Hive on Qubole (could also use Pig or other Hadoop-based jobs)
• Create the events table (incl. recovering partitions)
• Write highly bespoke queries directly against the complete events table
DEMO!
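The demo itself is not captured in these slides. As a rough stand-in, here is a sketch of the kind of bespoke query that can be written directly against a single events table. It uses Python's built-in `sqlite3` module in place of Hive on Qubole, so the SQL dialect and the table setup are assumptions, and there is no partition recovery step. The column names come from the example fields listed earlier; the rows, the event type values and the `revenue_by_source` helper are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a Hive table over the Snowplow events in S3.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        domain_userid TEXT,  -- 1st party cookie ID
        mkt_source    TEXT,  -- traffic source set via utm parameters
        event         TEXT,  -- event type (values below are made up)
        page_urlpath  TEXT,
        tr_total      REAL   -- transaction total, NULL for other events
    )
    """
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "newsletter", "page_view", "/home", None),
        ("u1", "newsletter", "transaction", "/checkout", 40.0),
        ("u2", "newsletter", "page_view", "/home", None),
        ("u3", "adwords", "page_view", "/pricing", None),
        ("u3", "adwords", "transaction", "/checkout", 15.0),
        ("u3", "adwords", "transaction", "/checkout", 5.0),
    ],
)


def revenue_by_source(connection):
    """Visitors and transaction value per marketing source, one row each."""
    query = """
        SELECT mkt_source,
               COUNT(DISTINCT domain_userid) AS visitors,
               COALESCE(SUM(tr_total), 0)    AS revenue
        FROM events
        GROUP BY mkt_source
        ORDER BY mkt_source
    """
    return connection.execute(query).fetchall()


for source, visitors, revenue in revenue_by_source(conn):
    print(source, visitors, revenue)
```

Because every row is a single event, the same table can answer a different question by changing only the `SELECT` and `GROUP BY` clauses.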
Performing more sophisticated analysis
• Unfortunately there isn’t time in this webinar to do a deeper demo…
• …however, there are resources available, in particular, the Snowplow Analytics
Cookbook - http://snowplowanalytics.com/analytics/index.html
