Snowplow Analytics and Looker at Oyster.com
1. SNOWPLOW AND LOOKER AT OYSTER.COM
SNOWPLOW MEETUP NYC – MARCH 30, 2016
BEN HOYT, DEVON POHL
2. WHAT IS OYSTER.COM?
• “The Hotel Tell-All”
• Authentic hotel reviews and photos
• We visit every hotel in person
• 1000 hotels per month
• 7M high-res photos
• 100k 360° panoramas
3. (SOME OF) OUR TECH STACK
• Python to run our backend: web, scripting, photo processing, ETL
• PostgreSQL for all content data (e.g. hotels, metadata for 12M images)
• Amazon S3 for image storage, EC2 spot instances for photo processing
• Amazon Redshift for analytics and reporting data
• Looker for reporting and visualizations
• Snowplow for analytics tracking and analytics ETL
4. GOOGLE ANALYTICS V. SNOWPLOW
Google Analytics
• Good for web, but little control and flexibility
• Hard to get data out of (your data!)
• Crazy pricing model ($0 for free tier, or $150,000/y for premium)
• Can only do web analytics, not other business reporting
Snowplow
• Free and open source, with great support and paid tiers
• Puts data into a standard, easily-queryable database (Redshift)
• Focuses on tracking and analytics ETL and does that part well
5. WHY & HOW WE SWITCHED (1 YEAR AGO)
• We were considering Looker for reporting and visualization
• Looker rep: “majority of our customers use Snowplow to collect their data”
• We dug into Snowplow and liked what we saw
• Initially the design felt like a bit of overkill, but it’s definitely built to scale
• We implemented the tracking and pipeline, and haven’t looked back
6. OUR CONTEXT SCHEMA
• We use one “custom fields” schema to rule them all
• Simple, one table, one SQL join gives us all our custom fields
{
"self": {"name": "custom_fields", "vendor": "com.oyster", "version": "1-0-9"},
"properties": {
"page_type": {"type": "string"},
"page_subtype": {"type": "string"},
"template_type": {"type": "string", "enum": ["desktop", "mobile"]},
"hotel_id": {"$ref": "#/definitions/positiveInteger32"},
"account_id": {"$ref": "#/definitions/positiveInteger32"},
"ab_cell": {"type": "integer", "minimum": 1, "maximum": 20},
"checkin_date": {"type": "string", "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"},
...
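The "one table, one SQL join" point can be sketched roughly as follows. This is a minimal illustration that uses Python's built-in sqlite3 as a stand-in for Redshift. The table names `events` and `custom_fields`, the join key `root_id = event_id`, and the sample rows are all assumptions made for the example, not the actual Oyster tables.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT PRIMARY KEY, event TEXT, page_url TEXT);
    CREATE TABLE custom_fields (
        root_id TEXT,            -- assumed to point back at events.event_id
        page_type TEXT,
        template_type TEXT,
        hotel_id INTEGER,
        ab_cell INTEGER
    );
    INSERT INTO events VALUES
        ('e1', 'page_view', '/hotels/a'),
        ('e2', 'page_view', '/search');
    INSERT INTO custom_fields VALUES
        ('e1', 'hotel', 'desktop', 101, 3),
        ('e2', 'search', 'mobile', NULL, 7);
""")

def events_with_custom_fields(conn):
    """Return every event with its custom fields attached via a single join."""
    return conn.execute("""
        SELECT e.event_id, e.event, c.page_type, c.template_type,
               c.hotel_id, c.ab_cell
        FROM events AS e
        LEFT JOIN custom_fields AS c ON c.root_id = e.event_id
        ORDER BY e.event_id
    """).fetchall()

rows = events_with_custom_fields(conn)
for row in rows:
    print(row)
```

Because every custom field lives in one context table, a query never needs more than this one extra join, whichever fields it uses.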
7. OUR DATASET
• A large, though not a massive, dataset
• Redshift cluster: 6 dc1.large SSD nodes, ~1TB storage
• 640 million rows in our events table
• We add 1.5 million event rows per day
• We copy (a subset of) our PostgreSQL content database into Redshift nightly
• Enables business reporting and advanced content-based queries
10. REPORTING
• Snowplow and content data are merged to provide insights into:
• Product
• A/B testing
• Funnel mapping
• Marketing
• SEO monitoring
• Ad Campaigns
• Operations
• Workflow Optimization
• ROI Modeling
• Business Trends
• Traffic
• Revenue
11. VISIT TABLE
• Event data is large and granular – often hard to digest
• Most valuable pre-processing we do is building the visit table
• Built incrementally by a Python ETL that runs on Redshift
• This is key to most of our reporting infrastructure
• Combines event data and custom fields data
• This visit table:
• Is granular at the user and user-session-ID level
• Includes counts of a variety of event types
• Includes all information associated with first event of a visit
• A/B testing cells
• Referral information
• Etc.
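The shape of such a visit table can be sketched as below. This is only an illustration under stated assumptions, not the actual Oyster ETL: it uses Python's built-in sqlite3 in place of Redshift, and the column names (`user_id`, `session_idx`, `tstamp`, `referrer`, `ab_cell`) and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        user_id TEXT, session_idx INTEGER, tstamp TEXT,
        event TEXT, referrer TEXT, ab_cell INTEGER
    );
    INSERT INTO events VALUES
        ('u1', 1, '2016-03-01 10:00:00', 'page_view', 'google.com', 3),
        ('u1', 1, '2016-03-01 10:01:00', 'page_view', NULL, 3),
        ('u1', 1, '2016-03-01 10:02:00', 'click', NULL, 3),
        ('u2', 1, '2016-03-01 11:00:00', 'page_view', 'bing.com', 7);
""")

# One row per (user, session): per-type event counts plus the referrer and
# A/B cell taken from the first event of the visit.
VISIT_SQL = """
    SELECT c.user_id, c.session_idx, c.page_views, c.clicks,
           f.referrer, f.ab_cell
    FROM (
        SELECT user_id, session_idx,
               MIN(tstamp) AS first_tstamp,
               SUM(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) AS page_views,
               SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks
        FROM events
        GROUP BY user_id, session_idx
    ) AS c
    JOIN events AS f
      ON f.user_id = c.user_id
     AND f.session_idx = c.session_idx
     AND f.tstamp = c.first_tstamp
    ORDER BY c.user_id, c.session_idx
"""

visits = conn.execute(VISIT_SQL).fetchall()
for visit in visits:
    print(visit)
```

The sketch assumes no two events in a visit share the earliest timestamp. Real data would need a tie-breaker so that each visit still yields exactly one row.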
12. LOOKER
• Looker is our core data exploration and reporting tool
• Web-based YAML + visualization wrapper on Redshift
• Enables self-serve reporting and exploration for non-technical business owners
• Used for other pre-processing via persistent derived tables (PDTs)
• PDTs are query-defined temporary tables built and managed by Looker
• Good for small-to-medium size pre-processing
• Applications include de-duping and revenue attribution
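To make the de-duping application concrete, here is a rough sketch of the idea behind a query-defined table: a single SELECT defines the table, and a helper drops and rebuilds it on demand. This is not LookML and not how Looker manages PDTs internally. It uses Python's built-in sqlite3, and the table names, the `rebuild_derived_table` helper, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, tstamp TEXT, event TEXT);
    INSERT INTO events VALUES
        ('e1', '2016-03-01 10:00:00', 'page_view'),
        ('e1', '2016-03-01 10:00:00', 'page_view'),   -- duplicate delivery
        ('e2', '2016-03-01 10:05:00', 'click');
""")

# The defining query: keep one physical row per event_id.
# rowid is SQLite-specific and is used here only to pick a single copy.
DEDUPE_SQL = """
    SELECT e.event_id, e.tstamp, e.event
    FROM events AS e
    JOIN (
        SELECT MIN(rowid) AS keep_rowid
        FROM events
        GROUP BY event_id
    ) AS k ON e.rowid = k.keep_rowid
"""

def rebuild_derived_table(conn, name, select_sql):
    """Drop and recreate a table whose contents are defined by one query."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")

rebuild_derived_table(conn, "events_deduped", DEDUPE_SQL)
deduped = conn.execute(
    "SELECT event_id, event FROM events_deduped ORDER BY event_id"
).fetchall()
print(deduped)
```

Changing the pre-processing then means editing the defining query and rebuilding, with no separate pipeline to maintain. Redshift has no `rowid`, so an equivalent query there would typically rank rows with a window function such as `ROW_NUMBER()`.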
Hi – I’m Devon. I work with Oyster.com and Jetsetter sites at Tripadvisor. Just after I joined, Oyster implemented Snowplow and I’ve spent much of the last year using the platform to build reporting and analytics.
We use Snowplow data for a wide variety of things. With Snowplow we learn about:
Business health – we have dashboards and daily updates on traffic, revenue and other metrics. We use Looker for most of this, which I’ll talk about more in a moment
Product – We’ve learned about our site and users through A/B testing and site traffic mapping
Marketing – Snowplow allows us to monitor and analyze marketing efforts, including SEO
Operations – We monitor how existing assets are performing and model how prospective assets would likely perform to prioritize work for ops and editorial teams
A clear and intuitive context schema is key to marrying Snowplow to other data sources – thanks Ben!
The event-level tables can be difficult to use in their raw form. For most reporting and analytics we use a derivative of the events table, our visit table.
This is currently incrementally updated nightly with a straightforward ETL. It includes session-level event counts and first-event referral information.
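One common way to make a nightly update incremental is a high-water mark: remember the latest event timestamp already processed and rebuild only the visits that received newer events. The sketch below illustrates that general pattern and is not a description of the actual Oyster ETL. It uses Python's built-in sqlite3, and the table layout and the `nightly_update` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, session_idx INTEGER, tstamp TEXT);
    CREATE TABLE visits (
        user_id TEXT, session_idx INTEGER,
        events INTEGER, last_tstamp TEXT,
        PRIMARY KEY (user_id, session_idx)
    );
""")

def nightly_update(conn):
    """Rebuild only the visits that have events newer than the high-water mark."""
    (high_water,) = conn.execute(
        "SELECT COALESCE(MAX(last_tstamp), '') FROM visits"
    ).fetchone()
    touched = conn.execute(
        "SELECT DISTINCT user_id, session_idx FROM events WHERE tstamp > ?",
        (high_water,),
    ).fetchall()
    for user_id, session_idx in touched:
        count, last = conn.execute(
            "SELECT COUNT(*), MAX(tstamp) FROM events "
            "WHERE user_id = ? AND session_idx = ?",
            (user_id, session_idx),
        ).fetchone()
        # Replace the whole visit row so a visit spanning two runs stays correct.
        conn.execute(
            "INSERT OR REPLACE INTO visits VALUES (?, ?, ?, ?)",
            (user_id, session_idx, count, last),
        )
    conn.commit()
    return len(touched)

# Night 1: two events in one visit.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", 1, "2016-03-01 23:50:00"),
    ("u1", 1, "2016-03-01 23:55:00"),
])
first_run = nightly_update(conn)

# Night 2: the same visit continues past midnight and a new visit appears.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", 1, "2016-03-02 00:05:00"),
    ("u2", 1, "2016-03-02 09:00:00"),
])
second_run = nightly_update(conn)

visit_rows = conn.execute(
    "SELECT user_id, session_idx, events FROM visits ORDER BY user_id"
).fetchall()
print(first_run, second_run, visit_rows)
```

The sketch relies on ISO-formatted timestamp strings sorting chronologically, and on events arriving in timestamp order. Late-arriving events would need a lookback window rather than a strict high-water mark.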
We also do marketing campaign and email-related pre-processing on the event table.
Looker is our core data exploration and reporting tool. I have a couple of screenshots of dashboards and exploration in Looker on the following slides. We use Looker for:
Automated email reporting
Saved dashboards and individual reports
Data exploration – even for non-technical business owners
Small-to-medium-sized pre-processing jobs, such as de-duping and revenue attribution. This is done through persistent derived tables, which are query-defined temporary tables built and managed by Looker. Because Looker manages them, pre-processed tables can be modified on the fly, which reduces the cost of developing and maintaining pre-processing infrastructure.