Snowplow: scalable open source web and
event analytics platform, built on AWS
Using EMR, Redshift, Cloudfront and Elastic Beanstalk to build a
scalable, log-everything, query-everything data infrastructure
What is Snowplow?
• Web analytics platform
  • Javascript tags -> event-level data delivered in your own Amazon Redshift or PostgreSQL database, for analysis in R, Excel, Tableau
• Open source -> run on your own AWS account
  • Own your own data
  • Join with 3rd-party data sets (PPC, Facebook, CRM)
  • Analyse with any tool you want
• Architected to scale
  • Ad networks track 100Ms of events (impressions) per day
• General-purpose event analytics platform -> Universal Event Analytics
  • Log-everything infrastructure works for web data and other event data sets
Why we built Snowplow
• Traditional web analytics tools are very limited
  • Siloed -> hard to integrate
  • Reports built for publishers and retailers in the 1990s
• Impressed by how easy AWS makes it to collect, manage and process massive data sets
  • More on this in a second…
• Impressed by the new generation of agile BI tools
  • Tableau, Excel, R…
• Commoditise and standardise event data capture (esp. data structure) -> enable innovation in the use of that data
  • Lots of tech companies have built a similar stack to handle data internally
  • Makes sense for everyone to standardise around an open source product
Snowplow's (loosely coupled) technical architecture
1. Trackers -> 2. Collectors -> 3. Enrich -> 4. Storage -> 5. Analytics
Standardised data protocols sit between each stage:
1. Trackers: generate event data (e.g. the Javascript tracker)
2. Collectors: receive data from trackers and log it to S3
3. Enrich: clean and enrich the raw data (e.g. geo-IP lookup, sessionization, referrer parsing)
4. Storage: store the data in a format suitable for analysis
The Snowplow technology stack: trackers
Available today:
• Javascript tracker
• Pixel (No-JS) tracker
• Arduino tracker
• Lua tracker
Trackers on the roadmap:
• Java
• Python
• Ruby
• Android
• iOS…
The Snowplow technology stack: collectors
Cloudfront collector:
• Tracker: GET request to a pixel hosted on Cloudfront
• Event data appended to the GET request as a query string
• Cloudfront logging -> data automatically logged to S3
• Scalable – the Cloudfront CDN is built to handle an enormous volume and velocity of requests
Clojure collector on Elastic Beanstalk:
• Enables tracking users across domains, by setting a 3rd-party cookie server side
• Runs on Tomcat: the Tomcat log format is customized to match the Cloudfront log file format
• Elastic Beanstalk supports rotation of Tomcat logs into S3
• Scalable: Elastic Beanstalk makes it easy to handle spikes in request volumes
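To make the Cloudfront approach concrete: a tracked event is just a GET for the pixel, with the event data URL-encoded into the query string. The hostname and parameter names below are illustrative only, not Snowplow's actual wire protocol:

```
GET http://d1234abcd.cloudfront.net/i?e=pv&page=Homepage&uid=a1b2c3
```

Cloudfront serves the 1x1 pixel and, with logging enabled, writes the full request line (query string included) to S3 – no collector server code runs at all.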
The Snowplow technology stack: data enrichment
Scalding Enrichment on EMR:
• Enrichment process runs 1-4x per day
• Consolidates log files from the collector; cleans, enriches, and writes them back to storage (S3)
• Enrichments include referrer parsing, geo-IP lookups, server-side sessionization
• Process written in Scalding: a Scala API for Cascading
• Cascading: a high-level library for Hadoop, especially well suited to building robust data pipelines (ETL) that e.g. push bad data into separate sinks from validated data
• Powered by EMR: a cluster is fired up to perform the enrichment step, then shut down
Hadoop and EMR are excellent for data enrichment
• For many users, the volume of data processed in each run is not large enough to necessitate a big data solution…
• … but building the process on Hadoop / EMR means it is easy to rerun the entire historical Snowplow data set through Enrichment, e.g.:
  • When a new enrichment becomes available
  • When the company wants to apply a new definition of a key variable in their Snowplow data set (e.g. a new definition for sessionization, or a new definition for user cohorts), i.e. a change in business logic
• Reprocessing the entire data set isn't just possible -> it's easy (as easy as processing only new data) and fast (just fire up a larger cluster)
• This is game-changing in web analytics, where reprocessing data has never been possible
Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR
• Scalaz is a functional programming library for Scala – it has a Validation data type which lets us accumulate errors as we process our raw Snowplow rows
• Scalding + Scalaz let us write ETL in a very expressive way
• ValidatedMaybeCanonicalOutput contains either a valid Snowplow event, or a list of validation failures (Strings) which were encountered trying to parse the raw Snowplow log row
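The original code screenshot is not reproduced here; the following is a minimal sketch of the error-accumulation pattern described, assuming Scalaz 7's ValidationNel. The case class and parser helpers are hypothetical stand-ins, not Snowplow's actual CanonicalOutput:

```scala
import scalaz._
import Scalaz._

object EnrichSketch {
  // Either a non-empty list of error Strings, or a parsed value
  type Validated[A] = ValidationNel[String, A]

  // Hypothetical, heavily trimmed-down event shape
  case class CanonicalOutput(pageUrl: String, userId: Int)

  def parseUrl(raw: String): Validated[String] =
    if (raw.startsWith("http")) raw.successNel
    else s"Field [page_url]: [$raw] is not a valid URL".failureNel

  def parseUserId(raw: String): Validated[Int] =
    try raw.toInt.successNel
    catch { case _: NumberFormatException =>
      s"Field [uid]: cannot convert [$raw] to Int".failureNel }

  // |@| is Scalaz's applicative builder: it runs BOTH parsers and
  // accumulates every failure, rather than stopping at the first one
  def parseRow(url: String, uid: String): Validated[CanonicalOutput] =
    (parseUrl(url) |@| parseUserId(uid)) { CanonicalOutput(_, _) }
}
```

Calling `parseRow("foo", "abc")` yields a Failure carrying both error messages at once – exactly the behaviour that lets a bad row report everything wrong with it in a single pass.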
Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR (continued)
• Scalding + Scalaz let us route our bad raw rows into a "bad bucket" in S3, along with all of the validation errors which were encountered for that row
• The flatfile is one JSON object per line
• In the future we could add an aggregation job to process these "bad bucket" files and report on the number of errors encountered and the most common validation failures
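As an illustration of the bad-bucket format (the field names here are hypothetical, not Snowplow's actual schema), one line of the flatfile, pretty-printed, might look like:

```json
{
  "line": "2013-03-25 02:13:48 LHR5 ...",
  "errors": [
    "Field [uid]: cannot convert [abc] to Int",
    "Field [page_url]: [foo] is not a valid URL"
  ]
}
```

Keeping the raw line alongside its full error list means a bad row can be inspected, fixed and replayed without going back to the collector logs.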
The Snowplow technology stack: storage and analytics
• S3
• Redshift
• Postgres (coming soon)
Loading Redshift from an EMR job is relatively straightforward, with some gotchas to be aware of
• Load Redshift from S3, not DynamoDB – the costs of loading from DynamoDB only make sense if you need the data in DynamoDB anyway
• Your EMR job can either write directly to S3 (slow), or write to local HDFS and then S3DistCp to S3 (faster)
• For Scalding, our Redshift table target is a POJO assembled using scala.reflect.BeanProperty – with fields declared in the same order as in Redshift
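A sketch of that pattern, with illustrative field names (the real table target has far more columns): @BeanProperty generates the getter/setter pairs a POJO-style sink expects, and the declaration order must mirror the Redshift column order exactly:

```scala
import scala.reflect.BeanProperty

// Illustrative table target: fields must be declared in exactly the
// same order as the columns in the Redshift CREATE TABLE statement
class SnowplowEventBean {
  @BeanProperty var appId: String = _
  @BeanProperty var collectorTstamp: String = _
  @BeanProperty var event: String = _
  @BeanProperty var pageUrl: String = _
}
```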
Make sure to escape tabs, newlines etc. in your strings
• Once we have Snowplow events in CanonicalOutput form, we simply unpack them into tuple fields for writing
• Remember you are loading tab-separated, newline-terminated values into Redshift, so make sure to escape all tabs, newlines and other special characters in your strings
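A minimal sketch of the kind of escaping meant here – a hypothetical helper, not Snowplow's actual code – which neutralises the characters that would corrupt a TSV row before it is written:

```scala
// Hypothetical helper: strip out characters that would corrupt a
// tab-separated, newline-terminated row
def tsvSafe(s: String): String =
  s.replaceAll("\\t", "    ")    // a literal tab would shift every following field
   .replaceAll("\\r?\\n", " ")   // a newline would split the row in two
```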
You need to handle field length too
• You can either handle string length proactively in your code, or add TRUNCATECOLUMNS to your Redshift COPY command
• Currently we proactively truncate
• BUT this code is not unicode-aware (Redshift varchar field lengths are in terms of bytes, not characters) and rather fragile – we will likely switch to using TRUNCATECOLUMNS
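The proactive truncation described above might look like the following sketch, which also illustrates exactly the weakness called out: it counts characters, while Redshift varchar(n) limits are byte counts, so multi-byte UTF-8 strings can still overflow the column:

```scala
// Character-based truncation: fragile for multi-byte strings, since a
// Redshift varchar(n) limit is measured in bytes, not characters
def truncate(s: String, maxChars: Int): String =
  if (s == null) null else s.take(maxChars)
```

Adding TRUNCATECOLUMNS to the COPY command pushes this concern into Redshift itself, which truncates correctly at the byte level.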
Then use STL_LOAD_ERRORS, Excel and MAXERROR to help debug load errors
• If you do get load errors, check STL_LOAD_ERRORS in Redshift – it gives you all the information you need to fix the load error
• If the error is non-obvious, pull your POJO, Redshift table definition and bad row (from STL_LOAD_ERRORS) into Excel to compare them side by side
• COPY … MAXERROR X is your friend – it lets you see more than just the first load error
TSV text files are great for feeding Redshift, but be careful of using them as your "master data store"
• Some limitations of using tab-separated flat files to store your data:
  • Inefficient for storage/querying – versus e.g. binary files
  • Schemaless – no way of knowing the structure without visually eyeballing the data
  • Fragile – problems with field length, tabs, newlines, control characters etc.
  • Inexpressive – no support for things like union data types; rows can only be 65kb wide (you can insert fatter rows into Redshift, but cannot query them)
  • Brittle – adding a new field to Redshift means the old files don't load; you need to re-run the EMR job over all of your archived input data to re-generate them
• All of this means we will be moving to a more robust Snowplow event storage format on disk (Avro), and simply generating TSV files from those Avro events as needed to feed Redshift (or Postgres or Amazon RDS or …)
• Recommendation: write a new Hadoop job step to take your existing outputs from EMR and convert them into Redshift-friendly TSVs; don't start hacking on your existing data flow
Any questions?
Learn more
• https://github.com/snowplow/snowplow
• http://snowplowanalytics.com/
• @snowplowdata
Snowplow presentation to HUG UK

  • 1. Snowplow: scalable open source web and event analytics platform, built on AWS Using EMR, Redshift, Cloudfront and Elastic Beanstalk to build a scalable, log-everything, query-everything data infrastructure
  • 2. What is Snowplow? ā€¢ Web analytics platform ā€¢ Javascript tags -> event-level data delivered in your own Amazon Redshift or PostgreSQL database, for analysis in R, Excel, Tableau ā€¢ Open source -> run on your own AWS account ā€¢ Own your own data ā€¢ Join with 3rd party data sets (PPC, Facebook, CRM) ā€¢ Analyse with any tool you want ā€¢ Architected to scale ā€¢ Ad networks track 100Ms of events (impressions) per day ā€¢ General purpose event analytics platform -> Universal Event Analytics ā€¢ Log-everything infrastructure works for web data and other event data sets
  • 3. Why we built Snowplow ā€¢ Traditional web analytics tools are very limited ā€¢ Siloed -> hard to integrate ā€¢ Reports built for publishers and retailers in the 1990s ā€¢ Impressed by how easy AWS makes it to collect, manage and process massive data sets ā€¢ More on this in a secondā€¦ ā€¢ Impressed by new generation of agile BI tools ā€¢ Tableau, Excel, Rā€¦ ā€¢ Commoditise and standardise event data capture (esp. data structure) -> enable innovation in the use of that data ā€¢ Lots of tech companies have built a similar stack to handle data internally ā€¢ Makes sense for everyone to standardise around an open source product
  • 4. Snowplowā€™s (loosely coupled) technical architecture 1. Trackers 2. Collectors 3. Enrich 4. Storage 5. AnalyticsB C D A D Standardised data protocols Generate event data (e.g. Javascript tracker) Receive data from trackers and log it to S3 Clean and enrich raw data (e.g. geoIP lookup, session ization, referrer parsing) Store data in format suitable to enable analysis
  • 5. The Snowplow technology stack: trackers 1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics Javascript tracker Pixel (No-JS) tracker Arduino tracker Lua tracker Trackers on the roadmap: ā€¢ Java ā€¢ Python ā€¢ Ruby ā€¢ Android ā€¢ iOSā€¦
  • 6. The Snowplow technology stack: collectors 1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics Cloudfront collector Clojure collector on Elastic Beanstalk ā€¢ Tracker: GET request to pixel hosted on Cloudfront ā€¢ Event data appended to the GET request as a query string ā€¢ Cloudfront logging -> data automatically logged to S3 ā€¢ Scalable ā€“ Cloudfront CDN built to handle enormous volume and velocity of requests ā€¢ Enable tracking users across domains, by setting a 3rd party cookie server side ā€¢ Clojure collector runs on Tomcat: customize format of Tomcat logs to match Cloudfront log file format ā€¢ Elastic Beanstalk supports rotation of Tomcat logs into S3 ā€¢ Scalable: Elastic Beanstalk makes it easy to handle spikes in request volumes
• 7. The Snowplow technology stack: data enrichment
  • Scalding Enrichment on EMR:
    • Enrichment process run 1–4x per day
    • Consolidates log files from the collector, cleans them up, enriches them, and writes back to storage (S3)
    • Enrichments incl. referrer parsing, geo-IP lookups, server-side sessionization
    • Process written in Scalding: a Scala API for Cascading
    • Cascading: a high-level library for Hadoop, especially well suited to building robust data pipelines (ETL) that e.g. route bad data into separate sinks from validated data
    • Powered by EMR: a cluster is fired up to perform the enrichment step, then shut down
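To make the shape of an enrichment step concrete, here is a toy version of referrer parsing in plain Scala: classify the referrer URL into a traffic medium. The real Snowplow referrer-parsing logic is far richer; the host lists and medium names below are purely illustrative.

```scala
import java.net.URI

// Toy enrichment: map a raw referrer URL to a coarse traffic medium.
def parseReferrerMedium(referrer: String): String = {
  val host = Option(new URI(referrer).getHost).getOrElse("")
  if (host.contains("google.") || host.contains("bing.")) "search"
  else if (host.contains("facebook.") || host.contains("twitter.")) "social"
  else if (host.isEmpty) "direct"
  else "referral"
}
```

In the actual pipeline, pure functions like this are wired into a Scalding flow so they run over every row on the EMR cluster.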
• 8. Hadoop and EMR are excellent for data enrichment
  • For many users, the volume of data processed with each run is not large enough to necessitate a big data solution…
  • … but building the process on Hadoop / EMR means it is easy to rerun the entire historical Snowplow data set through enrichment, e.g.:
    • When a new enrichment becomes available
    • When the company wants to apply a new definition of a key variable in their Snowplow data set (e.g. a new definition of sessionization, or of user cohorts), i.e. a change in business logic
  • Reprocessing the entire data set isn't just possible -> it's easy (as easy as processing new data) and fast (just fire up a larger cluster)
  • This is game-changing in web analytics, where reprocessing data has never been possible
• 9. Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR
  • Scalaz is a functional programming library for Scala – it has a Validation data type which lets us accumulate errors as we process our raw Snowplow rows
  • Scalding + Scalaz let us write ETL in a very expressive way
  • In the slide's code example, ValidatedMaybeCanonicalOutput contains either a valid Snowplow event, or a list of validation failures (Strings) which were encountered trying to parse the raw Snowplow log row
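The key property of Scalaz's Validation is that it accumulates failures instead of short-circuiting on the first one. This dependency-free Scala sketch mimics that behaviour with `Either` for two fields of a raw row; the field names, validation rules and event-type codes are illustrative, not Snowplow's actual schema:

```scala
// Stand-in for Snowplow's CanonicalOutput; fields are illustrative.
case class CanonicalOutput(timestamp: String, eventType: String)

def validateTimestamp(s: String): Either[List[String], String] =
  if (s.matches("""\d{4}-\d{2}-\d{2}.*""")) Right(s)
  else Left(List(s"Bad timestamp: $s"))

def validateEventType(s: String): Either[List[String], String] =
  if (Set("pv", "ev", "tr").contains(s)) Right(s)
  else Left(List(s"Unknown event type: $s"))

// Like ValidationNel: either a valid event, or ALL the errors found.
def toValidatedOutput(ts: String, et: String): Either[List[String], CanonicalOutput] =
  (validateTimestamp(ts), validateEventType(et)) match {
    case (Right(t), Right(e)) => Right(CanonicalOutput(t, e))
    case (a, b) =>
      // Accumulate every failure rather than stopping at the first
      Left(a.swap.getOrElse(Nil) ++ b.swap.getOrElse(Nil))
  }
```

With Scalaz this accumulation is expressed applicatively (`|@|` over `ValidationNel`), but the end result is the same: one row in, either an event or a complete list of what was wrong with it.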
• 10. Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR (continued)
  • Scalding + Scalaz let us route our bad raw rows into a "bad bucket" in S3, along with all of the validation errors which were encountered for that row
  • (The slide's example is pretty-printed – in fact the flatfile is one JSON object per line)
  • In the future we could add an aggregation job to process these "bad bucket" files and report on the number of errors encountered and the most common validation failures
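A "bad bucket" line pairs the raw log row with the errors found for it, one JSON object per line. A hand-rolled sketch (field names are illustrative; a real pipeline would use a JSON library rather than manual escaping):

```scala
// Render one bad row as a single-line JSON object: the raw log line plus
// every validation error encountered for it.
def toBadRowJson(rawLine: String, errors: List[String]): String = {
  def esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")
  val errsJson = errors.map(e => "\"" + esc(e) + "\"").mkString(",")
  s"""{"line":"${esc(rawLine)}","errors":[$errsJson]}"""
}
```

Because each object sits on its own line, the files stay splittable for a later Hadoop aggregation job over the bad bucket.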
• 11. The Snowplow technology stack: storage and analytics
  • Storage: S3, Redshift, Postgres (coming soon)
• 12. Loading Redshift from an EMR job is relatively straightforward, with some gotchas to be aware of
  • Load Redshift from S3, not DynamoDB – the costs of loading from DynamoDB only make sense if you need the data in DynamoDB anyway
  • Your EMR job can either write directly to S3 (slow), or write to local HDFS and then S3DistCp to S3 (faster)
  • For Scalding, our Redshift table target is a POJO assembled using scala.reflect.BeanProperty – with fields declared in the same order as in Redshift
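A sketch of what such a bean-style table target looks like: `@BeanProperty` generates the JavaBean getters/setters that Cascading schemes expect, and the fields must be declared in exactly the Redshift column order. Field names here are illustrative, and note the annotation lives at `scala.reflect.BeanProperty` in the Scala 2.9/2.10 era the deck describes, but at `scala.beans.BeanProperty` in newer Scala:

```scala
import scala.beans.BeanProperty

// Bean whose field order mirrors the Redshift table's column order;
// @BeanProperty generates getAppId/setAppId etc. for Java interop.
class SnowplowEventBean {
  @BeanProperty var appId: String = _
  @BeanProperty var collectorTstamp: String = _
  @BeanProperty var event: String = _
  // ... remaining fields, in exactly the Redshift column order
}
```

If the field order drifts from the table definition, the COPY will load values into the wrong columns (or fail), which is one source of the load errors discussed below.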
• 13. Make sure to escape tabs, newlines etc. in your strings
  • Once we have Snowplow events in CanonicalOutput form, we simply unpack them into tuple fields for writing
  • Remember you are loading tab-separated, newline-terminated values into Redshift, so make sure to escape all tabs, newlines and other special characters in your strings
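One way to make free-text fields safe for a tab-separated, newline-terminated load: backslash-escape the characters that would break the row/field structure (and the backslash itself, first). The exact escape set is a sketch; it should match what your Redshift COPY's ESCAPE handling expects:

```scala
// Escape characters that would corrupt a TSV row: backslash first,
// then tab, newline and carriage return.
def tsvEscape(s: String): String =
  s.replace("\\", "\\\\")
   .replace("\t", "\\t")
   .replace("\n", "\\n")
   .replace("\r", "\\r")
```

Escaping the backslash before the others matters: doing it last would double-escape the sequences just produced.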
• 14. You need to handle field length too
  • You can either handle string length proactively in your code, or add TRUNCATECOLUMNS to your Redshift COPY command
  • Currently we proactively truncate
  • BUT this code is not Unicode-aware (Redshift varchar field lengths are in terms of bytes, not characters) and rather fragile – we will likely switch to using TRUNCATECOLUMNS
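Because Redshift varchar lengths count bytes, a Unicode-aware truncation has to measure the UTF-8 encoding and avoid cutting a multi-byte character in half. A sketch of a byte-aware version (simple but O(n) in the worst case, and it does not special-case surrogate pairs):

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Truncate a string so its UTF-8 encoding fits in maxBytes, dropping
// whole characters from the end rather than slicing mid-character.
def truncateUtf8(s: String, maxBytes: Int): String = {
  var t = s
  while (t.getBytes(UTF_8).length > maxBytes) t = t.dropRight(1)
  t
}
```

TRUNCATECOLUMNS pushes exactly this concern into Redshift itself, which is why the deck recommends switching to it.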
• 15. Then use STL_LOAD_ERRORS, Excel and MAXERROR to help debug load errors
  • If you do get load errors, then check STL_LOAD_ERRORS in Redshift – it gives you all the information you need to fix the load error
  • If the error is non-obvious, pull your POJO, Redshift table definition and bad row (from STL_LOAD_ERRORS) into Excel to compare
  • COPY … MAXERROR X is your friend – it lets you see more than just the first load error
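The debugging workflow above boils down to two SQL fragments. Table/bucket names, the credentials placeholder and the MAXERROR threshold below are illustrative:

```sql
-- Inspect the most recent load failures: which file, line, column and why.
SELECT starttime, filename, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;

-- MAXERROR lets the COPY continue past the first bad row, so one run
-- surfaces many errors at once instead of one per attempt.
COPY atomic.events FROM 's3://my-bucket/enriched/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER '\t' MAXERROR 100;
```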
• 16. TSV text files are great for feeding Redshift, but be careful of using them as your "master data store"
  • Some limitations of using tab-separated flat files to store your data:
    • Inefficient for storage/querying – versus e.g. binary formats
    • Schemaless – no way of knowing the structure without visually eyeballing the file
    • Fragile – problems with field length, tabs, newlines, control characters etc.
    • Inexpressive – no support for things like union data types; rows can only be 65kb wide (you can insert fatter rows into Redshift, but cannot query them)
    • Brittle – adding a new field to Redshift means the old files don't load; you need to re-run the EMR job over all of your archived input data to re-generate them
  • All of this means we will be moving to a more robust Snowplow event storage format on disk (Avro), and simply generating TSV files from those Avro events as needed to feed Redshift (or Postgres or Amazon RDS or …)
  • Recommendation: write a new Hadoop job step to take your existing outputs from EMR and convert them into Redshift-friendly TSVs; don't start hacking on your existing data flow
• 17. Any questions?
  • Learn more:
    • https://github.com/snowplow/snowplow
    • http://snowplowanalytics.com/
    • @snowplowdata