Geetu Ambwani, Principal Data Scientist, Huffington Post at MLconf NYC - 4/15/16
1. Data Science in the Newsroom
Geetu Ambwani
Principal Data Scientist
geetu.ambwani@huffingtonpost.com
2. What is the Huffington Post?
Founded: May 2005
Ranking among digital-only news websites: 1
Cross-platform monthly unique visitors: over 187 million
Number of articles per day: over 500
Number of international editions: 15
Bloggers: over 100,000
3. News Industry - Trends
HuffPost has consistently been an innovator in the digital publishing space.
Massive Blogging Network:
More than 100K bloggers across the globe
4. News Industry - Trends
Google Site Rank
5. News Industry - Trends
Biggest Social publisher
10. Content Creation: How Can Data Help?
● Tools to help surface and discover trends across different parts of the web
● Content Enhancement with multimedia based on semantic matching (images, slideshows, videos)
● Optimizing headlines/images (RobinHood Platform)
12. Content Consumption: How Can Data Help?
Know Your Audience
● User Cohorts:
○ Social Traffic versus FrontPage Clickers consume different content
○ Desktop Vs Mobile consumption
● Recommendations/Personalization
● Can we use data to inform product design and interface?
○ Rearrange share buttons based on traffic origin (Facebook vs Pinterest)
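As a minimal sketch of the share-button idea above: the function below moves the button matching the visitor's traffic origin to the front. The referrer mapping, the default order, and the name `order_share_buttons` are all hypothetical, chosen only for illustration; the slides do not describe how HuffPost implemented this.

```python
# Hypothetical mapping from referrer host to the share button shown first.
REFERRER_TO_BUTTON = {
    "facebook.com": "facebook",
    "pinterest.com": "pinterest",
    "twitter.com": "twitter",
}

# Hypothetical default order used when the origin is unknown.
DEFAULT_ORDER = ["facebook", "twitter", "pinterest", "email"]


def order_share_buttons(referrer_host, default_order=DEFAULT_ORDER):
    """Return share buttons with the one matching the visitor's origin first."""
    host = referrer_host.lower()
    # Strip common subdomain prefixes such as "www." or "m." before lookup.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    preferred = REFERRER_TO_BUTTON.get(host)
    if preferred is None:
        # Unknown origin: keep the default order unchanged.
        return list(default_order)
    return [preferred] + [b for b in default_order if b != preferred]


print(order_share_buttons("www.pinterest.com"))
```

A visitor arriving from Pinterest would see the Pinterest button first, while an unrecognized referrer falls back to the default order.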
14. Content Distribution: Can Data Help?
● People’s attention is increasingly concentrated on social streams
○ More traffic to publishers from social than any other way
● Are distributed platforms the new home page?
○ Facebook Instant Articles, Apple News, Snapchat Discover, Google AMP
○ Messenger Bots
● You need to be where your audience is:
○ Identify the content mix that is maximally engaging on an external platform
○ Can we use data to seed these distribution networks? (Facebook HuffPost Pages, Snapchat Discover)
15. Content Distribution: Can Data Help?
● HuffPost produces 1000 articles a day - which of these do we promote?
● Article page views (PVs) follow a very skewed distribution
○ Only 1% of our articles get more than 100k PVs
● Content performs differently on different networks.
● Can we predict in advance which articles will get traction, so that we can:
■ Optimally seed multiple distribution channels (Facebook HuffPost Pages, Snapchat Discover)
■ Target them for premium/high-value ads to maximize revenue
■ Populate recommendation widgets
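The skew described above can be quantified with two simple measures: the share of articles above a page-view threshold, and the share of total traffic captured by the top articles. The sketch below is illustrative only; the function names are hypothetical and the sample counts are made up, not HuffPost data.

```python
def share_above(page_views, threshold):
    """Fraction of articles whose page views exceed the threshold."""
    if not page_views:
        return 0.0
    return sum(1 for pv in page_views if pv > threshold) / len(page_views)


def traffic_share_of_top(page_views, top_fraction):
    """Fraction of total page views captured by the top `top_fraction` of articles."""
    ranked = sorted(page_views, reverse=True)
    # Always include at least one article in the "top" group.
    k = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Made-up page-view counts for 100 articles, for illustration only.
sample_pvs = [150_000] + [20_000] * 9 + [500] * 90

print(share_above(sample_pvs, 100_000))
print(traffic_share_of_top(sample_pvs, 0.01))
```

On a distribution like this, a small fraction of articles accounts for a large fraction of traffic, which is why choosing what to promote matters.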
16. Content Distribution: Can Data Help?
Challenges
● Histogram of traffic distribution - highly skewed.
● The very act of promoting something causes a bump in traffic.
● Data normalization - how long do we want to wait before predicting?
● Very imbalanced data set
Our Approach
● Random Forest classifier.
● Multiple success criteria
● Historical examples of (+) and (-) articles. Downsampling.
● Different normalization thresholds
● Feature engineering: traffic growth ratios; initial organic social traffic per minute; distinct referrers
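A rough sketch of two of the steps listed above, feature engineering and downsampling, using only the Python standard library. The input layout (time-ordered traffic snapshots), the feature names, and both function names are assumptions made for illustration; the slides do not specify the exact definitions used.

```python
import random


def early_features(snapshots, referrers):
    """Build features from traffic observed shortly after publication.

    snapshots: list of (minutes_since_publish, cumulative_pvs,
               cumulative_organic_social_pvs) tuples, ordered by time.
    referrers: iterable of referrer hostnames seen in that window.
    """
    _, first_pvs, _ = snapshots[0]
    last_min, last_pvs, last_social = snapshots[-1]
    return {
        # Growth ratio: page views at the end of the window vs. the start.
        "pv_growth_ratio": last_pvs / max(first_pvs, 1),
        # Organic social page views per minute since publication.
        "social_pvs_per_min": last_social / max(last_min, 1),
        # Number of distinct referrers sending traffic.
        "distinct_referrers": len(set(referrers)),
    }


def downsample(rows, labels, seed=0):
    """Randomly drop larger-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical article: 100 PVs at minute 5, 400 PVs (90 organic social) at minute 30.
feats = early_features([(5, 100, 20), (30, 400, 90)],
                       ["facebook.com", "t.co", "facebook.com"])
print(feats)
```

The balanced rows could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. Varying the length of the observation window is one way to explore the different normalization thresholds mentioned above.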
17. Slackbot for the social promotion team
● 20% lift in PVs per predicted article
19. Conclusion
A data-driven newsroom today means:
● More than just keeping track of clicks and shares
● Using predictive analytics to drive product and content placement
Machine learning will be a key driver for success with the advent of distributed content.