SlideShare a Scribd company logo
1 of 75
Download to read offline
Scaling Quality On Quora
Using Machine Learning
Nikhil Garg
@nikhilgarg28
@Quora @QconSF 11/7/16
● Introducing specific product problems we need to solve to stay high-quality
● Describing our formulation and approach to these problems.
● Identifying common themes of ML problems in the quality domain.
● Sharing high level lessons that we have learnt over time.
Goals Of The Talk
● At Quora since 2012
● Currently leading two engineering teams:
○ ML Platform
○ Content Quality
● Interested in the intersection of distributed
systems, machine learning and human behavior
A bit about me...
@nikhilgarg28
To Grow And Share World’s Knowledge
Over 100 million monthly uniques
Millions of questions & answers
In hundreds of thousands of topics
Supported by 80 engineers
ML @ Quora
ML’s Importance For Quora
● ML is not just something we do on the side, it is mission critical for us.
● It’s one of the most important core competencies for us.
Data: Billions of relationships
Users
Answers
Questions
Topics Votes
Follow
Ask
Write
Cast
Have
Contain
Get
Comments
Get
Follow
Write
Have Have
Data: Billions of words in high quality corpus
● Questions
● Answers
● Comments
● Topic biographies
● ...
Data: Interaction History
● Highly engaged users => long history of activity e.g search queries, upvotes etc.
● Ever-green content => long history of users engaging with the content in search, feed etc.
● Answer ranking
● Feed ranking
● Search ranking
● User recommendations
● Topic recommendations
● Duplicate questions
● Email Digest
● Request Answers
● Trending now
● Topic expertise prediction
● Spam, abuse detection
● ….
ML Applications At Quora
● Logistic Regression
● Elastic Nets
● Random Forests
● Gradient Boosted Decision Trees
● Matrix Factorization
● (Deep) Neural Networks
● LambdaMart
● Clustering
● Random walk based methods
● Word Embeddings
● LDA
● ...
ML Algorithms At Quora
What We Care About
Relevance
Quality Demand
Is content high quality?
Is user an expert in the topic?
Do lots of people want to get answers
to this question?
What is the search intent of the user?
Would user be interested in reading answer?
Would user be able to answer the question?
What We Care About
Relevance
Quality Demand
Is content high quality?
Is user an expert in the topic?
Do lots of people want to get answers
to this question?
What is the search intent of the user?
Would user be interested in reading answer?
Would user be able to answer the question?
1. Duplicate Question Detection
2. Answer Ranking
3. Topic Expertise Detection
4. Moderation
1. Duplicate Question Detection
2. Answer Ranking
3. Topic Expertise Detection
4. Moderation
● Energy of people who can answer the question gets divided
● No single question page becomes the best resource for that question
● People looking for answers have to search and read many question pages
● Bad experience if the “same” question shows up in feed again and again
● Search engines can not rank any one page very highly
Why Duplicate Questions Are Bad
Duplicate Question Detection
● Need to detect duplicates even before
question reaches the system
● When user adds a question, we search
ALL our questions to check for duplicates.
● Latency: tens of milliseconds.
● ML algorithm aside, this is also a crazy
hard engineering problems.
Duplicate Question Detection
Detect if a new question is duplicate of an
existing question.
Problem Statement
What is the Sun’s temperature?
How hot does the Sun get?
What is the average temperature of Sun?
What is the temperature of Sun’s surface and that of Sun’s core?
What is the hottest object in our solar system? How hot is it?
What is the temperature, pressure and density of Sun?
What is the temperature of a yellow star like our Sun?
● Syntax
● Semantics
● Generality
● High precision
● High recall
Algorithmic Challenges
Recent Work
Problem Formulation
Binary classification on pairs of questions
Training Data Sources
Hand labeled data, Semi-supervised approaches to bootstrap data, Random negative sampling,
User browse/search behavior, Language model on standard datasets, ...
Models
Logistic Regression, Random Forests, GBDT, Deep Neural Networks, Ensembles…
Features
Word embeddings, conventional IR features, usage based features, …
Our Approach
● Judgements are hairy for even humans.
● Can’t optimize some user action directly.
● Training data is scarce -- need to fuse multiple data sources together.
Duplicate Questions: Problem Properties
1. Duplicate Question Detection
2. Answer Ranking
3. Topic Expertise Detection
4. Moderation
Given a question, how do you rank answers by quality?
Rank answers to a question by their “quality”
Problem Statement
A simple function of upvotes and downvotes,
with some precomputed author priors.
Previous Approach
● Popular answers != factually correct
● Joke answers get disproportionately many upvotes
● Expert answers ranked lower than answers by popular writers
● Rich get richer
● Poor ranking for new answers
● ...
Great Baseline, but...
● Upvote means different things to different people e.g funny, correct, useful.
● Doesn’t always correspond to quality
● ...what is quality?
Why do all these problems exist?
Defining High Quality Answers
● Answers the question
● Is factually correct
● Is clear and easy to read
● Supported with rationale
● Demonstrates credibility
● ...
Answer Ranking: Formulation
● Item-wise regression on answers.
● Also tried item-wise multi-class
classification on score buckets
● Item-wise enable comparing answers
across different questions.
● Can also discover “Quora Gold” and
“really bad” answers
Answer 1
Answer 2
Answer 3
Question
0.9
0.8
0.5
● R2
● Weighted R2 with different weights for
different parts of the quality spectrum
● NDCG
● Kendall’s Tau
● ...
Answer Ranking: Evaluation
● Hand labeled data
● Language model on standard datasets
● Explicit quality survey shown to users
● Implicit data from usage
● Semi-supervised approaches for label propagation
● Surrogate learning (e.g predicting if “topic experts” will
upvote the answer)
● ...
Answer Ranking: Training Data
We tested 100+ features, the final model uses ~50 features after feature pruning
● User features -- e.g “Is the author an expert in the topic?”
● Answer text features -- e.g “What is the syntactic complexity of the text?”
● Question/Answer features -- e.g “Is the answer answering the question?”
● Voter features -- e.g “Is voter an expert in the topic?”
● Metadata features -- e.g “How many answers did the question have when the answer was
written?”
Answer Ranking: Features
Models
● Logistic Regression
● Random Forests
● Gradient Boosted Decision Trees
● Recurrent Neural Networks
● Ensembles
● …
GBDTs provide a good balance between accuracy,
complexity, training time, prediction time and ease of
deploying in production.
Answer Ranking: Models
Answer Ranking: Interpretability
Our Approach
● Latency: tens of milliseconds
● Computing 100 features each for 100 answers, even at 10us per feature computation,
can take 100ms.
● Need to parallelize computation, and also cache feature values/scores.
● Caching → need to support real-time cache dirties/updates.
Answer Ranking: Productionizing
● Trick -- don’t recompute scores if the feature
doesn’t flip any ‘decision branch’.
Answer Ranking: Productionizing
Answer Quality: Problem Properties
● Need to start with defining what we want the model to learn.
● Feature engineering and interpretability are important.
● Class imbalance for classification problems.
● Training data is scarce -- need to fuse multiple data sources together.
1. Duplicate Question Detection
2. Answer Ranking
3. Topic Expertise Detection
4. Moderation
● Important signal to all other quality
systems.
● Can make content more trustworthy.
● Helps retaining and engaging experts
Topic Expertise Matters For Quality
Topic Experts
Relevant
Credentials
Predict topic expertise level of users.
Problem Statement
Deducing Expertise From Topic Biography
“Researcher at MSR since 2005”
“ML Engineer at Quora”
“Taken undergraduate courses”
Degree of Expertise In Topic “Machine Learning”
“Invented AdaBoost”“Learning machine programming”
Problem Formulation
Multi-class classification on text of biography, classes being discrete buckets on the expertise spectrum.
Experts are sparse → class imbalance.
Training Data Sources
Hand labeled data, Data from other quality measures, Label propagation, Users can “report” bios...
Our Approach
Models
Logistic Regression, Random Forests, Gradient Boosted Decision Trees, ...
Regularization is very important
Features
Ngrams, Named Entities, Cosine similarity between topic name and biography text, ...
Our Approach
● Andrew Ng is an ML expert → his ML
answers must be good.
● What about answers upvoted by him?
● What about answers written by
someone whose own answers were
upvoted by him?
● …
Deducing Expertise From Voting
Topic Expertise Propagation Graph
● Trust in expertise is transitive:
A → B, and B → C ⇒ A → C
● So trust in expertise propagates
through the network
● We can mine the graph to discover
topic expertise using graph
algorithms like PageRank
● Unsupervised learning!
Ensembles To Combine Submodels
● Models trained on heterogeneous data
● Models trained using supervised,
semi-supervised and unsupervised
approaches
● Low correlation between different models
● Can combine them using ensembles
Topic Expertise: Problem Properties
● Class imbalance for classification problems.
● Unsupervised learning is powerful.
● Ensembles can help combine learners trained on data from different sources.
1. Duplicate Question Detection
2. Answer Ranking
3. Topic Expertise Detection
4. Moderation
Moderation
● Any user-generated-content product has lots of
moderation challenges.
● Content -- spam, hate-speech, porn, plagiarism etc.
● Account -- fake names, impersonation, sockpuppets
● Quora specific policies -- e.g. answers making fun of
questions, insincere questions.
Moderation: ML Challenges
● Super nuanced judgements, too hard for even
trained humans.
● Noisy labeled data
● Too hard a problem for a computer.
Moderation: ML Challenges
● Difficulty in learning due to severe class
imbalance.
● Metrics can be deceiving -- useless
models at 99 % accuracy
● Scarce labeled data, class imbalance
makes it worse
Moderation: ML Challenges
● High precision needed.
● Usually get extremely low recall at desired precision
levels.
● Want to detect problems users see bad content, so
can’t rely on any user interactions.
Moderation: Tricks and Approaches
● Start with looking at the right metrics.
● Can use standard metrics like F1, AUC.
● Or fine tune metrics based on your
application needs.
Moderation: Tricks and Approaches
● Oversampling minority class, undersampling
majority class
● Random sampling to generate negative
examples
● Higher cost of mistakes on the minority
class.
Moderation: Tricks and Approaches
● Successive iterations of: hand labeling →
model to reduce the space → hand labeling …
● Scarce data, high feature dimensionality →
simple models with regularization often work
very well
● Using low precision classifiers for making
human review efficient
Moderation: Problem Properties
● Severe class imbalance.
● Need to look at the “right” evaluation metrics.
● Sampling techniques can be very effective.
Summary
What do all quality problems have in common?
● Start with defining what we want the model to do.
● Product intuition is important for feature engineering.
● Interpretability of models is important for iteration.
Machine Learning + Product Understanding
● Good and cheap training data is unavailable/costly.
● Often need to combine data from different sources.
● Unsupervised/semi-supervised learning are very useful.
● Ensembles are your friends.
Training Data Scarcity
Dealing With Class Imbalance
● Look at the right metrics.
● Re-sampling techniques can be very effective.
● Can incorporate “data collection” cost into the algorithm.
● More specific semi-supervised learning approaches.
● Combining Quality, Relevance and Demand together.
● Avoiding (and sometimes creating) feedback loops.
● Engineering challenges behind these ML problems.
Topics For Some Other Day
Nikhil Garg
@nikhilgarg28
Thank You! Questions?
Standard Disclaimer: Quora Is Hiring :)

More Related Content

What's hot

Presentation of Domain Specific Question Answering System Using N-gram Approach.
Presentation of Domain Specific Question Answering System Using N-gram Approach.Presentation of Domain Specific Question Answering System Using N-gram Approach.
Presentation of Domain Specific Question Answering System Using N-gram Approach.Tasnim Ara Islam
 
Open domain Question Answering System - Research project in NLP
Open domain  Question Answering System - Research project in NLPOpen domain  Question Answering System - Research project in NLP
Open domain Question Answering System - Research project in NLPGVS Chaitanya
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk Vijay Ganti
 
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Avkash Chauhan
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskSaurabh Saxena
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overviewananth
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question AnsweringSujit Pal
 
Instant Question Answering System
Instant Question Answering SystemInstant Question Answering System
Instant Question Answering SystemDhwaj Raj
 
Efficient Top-N Recommendation by Linear Regression
Efficient Top-N Recommendation by Linear RegressionEfficient Top-N Recommendation by Linear Regression
Efficient Top-N Recommendation by Linear RegressionMark Levy
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLPVijay Ganti
 
Memory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringMemory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringAkram El-Korashy
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningBigDataCloud
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupYves Peirsman
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Recsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and DeepakRecsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and DeepakDeepak Agarwal
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?NAVER Engineering
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project PresentationAryak Sengupta
 

What's hot (20)

Presentation of Domain Specific Question Answering System Using N-gram Approach.
Presentation of Domain Specific Question Answering System Using N-gram Approach.Presentation of Domain Specific Question Answering System Using N-gram Approach.
Presentation of Domain Specific Question Answering System Using N-gram Approach.
 
Open domain Question Answering System - Research project in NLP
Open domain  Question Answering System - Research project in NLPOpen domain  Question Answering System - Research project in NLP
Open domain Question Answering System - Research project in NLP
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overview
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
 
Instant Question Answering System
Instant Question Answering SystemInstant Question Answering System
Instant Question Answering System
 
Efficient Top-N Recommendation by Linear Regression
Efficient Top-N Recommendation by Linear RegressionEfficient Top-N Recommendation by Linear Regression
Efficient Top-N Recommendation by Linear Regression
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
Persian setiment analysis
Persian setiment analysisPersian setiment analysis
Persian setiment analysis
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 
Memory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringMemory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question Answering
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Recsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and DeepakRecsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and Deepak
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
ISEC-2021-Presentation-Saikat-Mondal
ISEC-2021-Presentation-Saikat-MondalISEC-2021-Presentation-Saikat-Mondal
ISEC-2021-Presentation-Saikat-Mondal
 

Similar to Scaling Quality on Quora Using Machine Learning

Machine Learning at Quora (2/26/2016)
Machine Learning at Quora (2/26/2016)Machine Learning at Quora (2/26/2016)
Machine Learning at Quora (2/26/2016)Nikhil Dandekar
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...Sri Ambati
 
Maintaining high quality user generated content through machine learning
Maintaining high quality user generated content through machine learningMaintaining high quality user generated content through machine learning
Maintaining high quality user generated content through machine learningNikhil Dandekar
 
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...Quora
 
Search, Discovery and Questions at Quora
Search, Discovery and Questions at QuoraSearch, Discovery and Questions at Quora
Search, Discovery and Questions at QuoraNikhil Dandekar
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldXavier Amatriain
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleXavier Amatriain
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsBIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsXavier Amatriain
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning SystemsXavier Amatriain
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In IndustryXavier Amatriain
 
Webinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningWebinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningLucidworks
 
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptx
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptxMachine Learning Interviews_ Lessons from Both Sides - FSDL.pptx
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptxAbhinavSagar21
 
Machine Learning to Grow the World's Knowledge
Machine Learning to Grow  the World's KnowledgeMachine Learning to Grow  the World's Knowledge
Machine Learning to Grow the World's KnowledgeXavier Amatriain
 
Open source ml systems that need to be built
Open source ml systems that need to be builtOpen source ml systems that need to be built
Open source ml systems that need to be builtNikhil Garg
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionDeepak Agarwal
 
MLConf Seattle 2015 - ML@Quora
MLConf Seattle 2015 - ML@QuoraMLConf Seattle 2015 - ML@Quora
MLConf Seattle 2015 - ML@QuoraXavier Amatriain
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15MLconf
 

Similar to Scaling Quality on Quora Using Machine Learning (20)

Machine Learning at Quora (2/26/2016)
Machine Learning at Quora (2/26/2016)Machine Learning at Quora (2/26/2016)
Machine Learning at Quora (2/26/2016)
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
 
Maintaining high quality user generated content through machine learning
Maintaining high quality user generated content through machine learningMaintaining high quality user generated content through machine learning
Maintaining high quality user generated content through machine learning
 
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...
Quora ML Workshop: Maintaining High Quality User-Generated Content through Ma...
 
Search, Discovery and Questions at Quora
Search, Discovery and Questions at QuoraSearch, Discovery and Questions at Quora
Search, Discovery and Questions at Quora
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning World
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora Example
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsBIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In Industry
 
Webinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningWebinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep Learning
 
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptx
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptxMachine Learning Interviews_ Lessons from Both Sides - FSDL.pptx
Machine Learning Interviews_ Lessons from Both Sides - FSDL.pptx
 
Machine Learning to Grow the World's Knowledge
Machine Learning to Grow  the World's KnowledgeMachine Learning to Grow  the World's Knowledge
Machine Learning to Grow the World's Knowledge
 
Open source ml systems that need to be built
Open source ml systems that need to be builtOpen source ml systems that need to be built
Open source ml systems that need to be built
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversion
 
Role of Data Science in eCommerce
Role of Data Science in eCommerceRole of Data Science in eCommerce
Role of Data Science in eCommerce
 
MLConf Seattle 2015 - ML@Quora
MLConf Seattle 2015 - ML@QuoraMLConf Seattle 2015 - ML@Quora
MLConf Seattle 2015 - ML@Quora
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SEA - 5/01/15
 

More from Vo Viet Anh

DropDeck - Token Sale
DropDeck - Token SaleDropDeck - Token Sale
DropDeck - Token SaleVo Viet Anh
 
Brexit simulation back in 2013
Brexit simulation back in 2013Brexit simulation back in 2013
Brexit simulation back in 2013Vo Viet Anh
 
FeedLoop for Forbes
FeedLoop for ForbesFeedLoop for Forbes
FeedLoop for ForbesVo Viet Anh
 
Sleep Peeper (eHub ver 1) - business plan 2008
Sleep Peeper (eHub ver 1) - business plan 2008Sleep Peeper (eHub ver 1) - business plan 2008
Sleep Peeper (eHub ver 1) - business plan 2008Vo Viet Anh
 
eHub - paper 2008
eHub - paper 2008eHub - paper 2008
eHub - paper 2008Vo Viet Anh
 
PropMap Pitch deck (3 mins)
PropMap Pitch deck (3 mins)PropMap Pitch deck (3 mins)
PropMap Pitch deck (3 mins)Vo Viet Anh
 
PropMap Pitch Deck
PropMap Pitch DeckPropMap Pitch Deck
PropMap Pitch DeckVo Viet Anh
 
Pichutz (Singularity Pyramid) concept April 28
Pichutz (Singularity Pyramid) concept April 28Pichutz (Singularity Pyramid) concept April 28
Pichutz (Singularity Pyramid) concept April 28Vo Viet Anh
 
XPRIZE Think Tank Meeting April 26
XPRIZE Think Tank Meeting April 26XPRIZE Think Tank Meeting April 26
XPRIZE Think Tank Meeting April 26Vo Viet Anh
 
Singularity Pyramid overview Dec 21
Singularity Pyramid overview Dec 21Singularity Pyramid overview Dec 21
Singularity Pyramid overview Dec 21Vo Viet Anh
 
Singularity Pyramid
Singularity PyramidSingularity Pyramid
Singularity PyramidVo Viet Anh
 
Impacts of MLM, Amway in particular
Impacts of MLM, Amway in particularImpacts of MLM, Amway in particular
Impacts of MLM, Amway in particularVo Viet Anh
 
Capital Flow based Emission Control Mechanism (CAFECOM)
Capital Flow based Emission Control Mechanism (CAFECOM)Capital Flow based Emission Control Mechanism (CAFECOM)
Capital Flow based Emission Control Mechanism (CAFECOM)Vo Viet Anh
 
eHub ver. 2 presented at Rework Summit, Sweden, June 2010
eHub ver. 2 presented at Rework Summit, Sweden, June 2010eHub ver. 2 presented at Rework Summit, Sweden, June 2010
eHub ver. 2 presented at Rework Summit, Sweden, June 2010Vo Viet Anh
 
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...Vo Viet Anh
 
Norwegian funding proposal by YOUNGO
Norwegian funding proposal by YOUNGONorwegian funding proposal by YOUNGO
Norwegian funding proposal by YOUNGOVo Viet Anh
 
eHub paper presentation (closed for reworking)
eHub paper presentation (closed for reworking)eHub paper presentation (closed for reworking)
eHub paper presentation (closed for reworking)Vo Viet Anh
 

More from Vo Viet Anh (17)

DropDeck - Token Sale
DropDeck - Token SaleDropDeck - Token Sale
DropDeck - Token Sale
 
Brexit simulation back in 2013
Brexit simulation back in 2013Brexit simulation back in 2013
Brexit simulation back in 2013
 
FeedLoop for Forbes
FeedLoop for ForbesFeedLoop for Forbes
FeedLoop for Forbes
 
Sleep Peeper (eHub ver 1) - business plan 2008
Sleep Peeper (eHub ver 1) - business plan 2008Sleep Peeper (eHub ver 1) - business plan 2008
Sleep Peeper (eHub ver 1) - business plan 2008
 
eHub - paper 2008
eHub - paper 2008eHub - paper 2008
eHub - paper 2008
 
PropMap Pitch deck (3 mins)
PropMap Pitch deck (3 mins)PropMap Pitch deck (3 mins)
PropMap Pitch deck (3 mins)
 
PropMap Pitch Deck
PropMap Pitch DeckPropMap Pitch Deck
PropMap Pitch Deck
 
Pichutz (Singularity Pyramid) concept April 28
Pichutz (Singularity Pyramid) concept April 28Pichutz (Singularity Pyramid) concept April 28
Pichutz (Singularity Pyramid) concept April 28
 
XPRIZE Think Tank Meeting April 26
XPRIZE Think Tank Meeting April 26XPRIZE Think Tank Meeting April 26
XPRIZE Think Tank Meeting April 26
 
Singularity Pyramid overview Dec 21
Singularity Pyramid overview Dec 21Singularity Pyramid overview Dec 21
Singularity Pyramid overview Dec 21
 
Singularity Pyramid
Singularity PyramidSingularity Pyramid
Singularity Pyramid
 
Impacts of MLM, Amway in particular
Impacts of MLM, Amway in particularImpacts of MLM, Amway in particular
Impacts of MLM, Amway in particular
 
Capital Flow based Emission Control Mechanism (CAFECOM)
Capital Flow based Emission Control Mechanism (CAFECOM)Capital Flow based Emission Control Mechanism (CAFECOM)
Capital Flow based Emission Control Mechanism (CAFECOM)
 
eHub ver. 2 presented at Rework Summit, Sweden, June 2010
eHub ver. 2 presented at Rework Summit, Sweden, June 2010eHub ver. 2 presented at Rework Summit, Sweden, June 2010
eHub ver. 2 presented at Rework Summit, Sweden, June 2010
 
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...
Call to Arms: Inventing a "capital-flow-based emission control mechanism" Pro...
 
Norwegian funding proposal by YOUNGO
Norwegian funding proposal by YOUNGONorwegian funding proposal by YOUNGO
Norwegian funding proposal by YOUNGO
 
eHub paper presentation (closed for reworking)
eHub paper presentation (closed for reworking)eHub paper presentation (closed for reworking)
eHub paper presentation (closed for reworking)
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Scaling Quality on Quora Using Machine Learning

  • 1. Scaling Quality On Quora Using Machine Learning Nikhil Garg @nikhilgarg28 @Quora @QconSF 11/7/16
  • 2. ● Introducing specific product problems we need to solve to stay high-quality ● Describing our formulation and approach to these problems. ● Identifying common themes of ML problems in the quality domain. ● Sharing high level lessons that we have learnt over time. Goals Of The Talk
  • 3. ● At Quora since 2012 ● Currently leading two engineering teams: ○ ML Platform ○ Content Quality ● Interested in the intersection of distributed systems, machine learning and human behavior A bit about me... @nikhilgarg28
  • 4.
  • 5. To Grow And Share World’s Knowledge
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Over 100 million monthly uniques Millions of questions & answers In hundreds of thousands of topics Supported by 80 engineers
  • 13. ML’s Importance For Quora ● ML is not just something we do on the side, it is mission critical for us. ● It’s one of the most important core competencies for us.
  • 14. Data: Billions of relationships Users Answers Questions Topics Votes Follow Ask Write Cast Have Contain Get Comments Get Follow Write Have Have
  • 15. Data: Billions of words in high quality corpus ● Questions ● Answers ● Comments ● Topic biographies ● ...
  • 16. Data: Interaction History ● Highly engaged users => long history of activity e.g search queries, upvotes etc. ● Ever-green content => long history of users engaging with the content in search, feed etc.
  • 17. ● Answer ranking ● Feed ranking ● Search ranking ● User recommendations ● Topic recommendations ● Duplicate questions ● Email Digest ● Request Answers ● Trending now ● Topic expertise prediction ● Spam, abuse detection ● …. ML Applications At Quora
  • 18. ● Logistic Regression ● Elastic Nets ● Random Forests ● Gradient Boosted Decision Trees ● Matrix Factorization ● (Deep) Neural Networks ● LambdaMart ● Clustering ● Random walk based methods ● Word Embeddings ● LDA ● ... ML Algorithms At Quora
  • 19. What We Care About Relevance Quality Demand Is content high quality? Is user an expert in the topic? Do lots of people want to get answers to this question? What is the search intent of the user? Would user be interested in reading answer? Would user be able to answer the question?
  • 20. What We Care About Relevance Quality Demand Is content high quality? Is user an expert in the topic? Do lots of people want to get answers to this question? What is the search intent of the user? Would user be interested in reading answer? Would user be able to answer the question?
  • 21. 1. Duplicate Question Detection 2. Answer Ranking 3. Topic Expertise Detection 4. Moderation
  • 22. 1. Duplicate Question Detection 2. Answer Ranking 3. Topic Expertise Detection 4. Moderation
  • 23. ● Energy of people who can answer the question gets divided ● No single question page becomes the best resource for that question ● People looking for answers have to search and read many question pages ● Bad experience if the “same” question shows up in feed again and again ● Search engines can not rank any one page very highly Why Duplicate Questions Are Bad
  • 25. ● Need to detect duplicates even before question reaches the system ● When user adds a question, we search ALL our questions to check for duplicates. ● Latency: tens of milliseconds. ● ML algorithm aside, this is also a crazy hard engineering problems. Duplicate Question Detection
  • 26. Detect if a new question is duplicate of an existing question. Problem Statement
  • 27. What is the Sun’s temperature? How hot does the Sun get? What is the average temperature of Sun? What is the temperature of Sun’s surface and that of Sun’s core? What is the hottest object in our solar system? How hot is it? What is the temperature, pressure and density of Sun? What is the temperature of a yellow star like our Sun? ● Syntax ● Semantics ● Generality ● High precision ● High recall Algorithmic Challenges
  • 29. Problem Formulation Binary classification on pairs of questions Training Data Sources Hand labeled data, Semi-supervised approaches to bootstrap data, Random negative sampling, User browse/search behavior, Language model on standard datasets, ... Models Logistic Regression, Random Forests, GBDT, Deep Neural Networks, Ensembles… Features Word embeddings, conventional IR features, usage based features, … Our Approach
  • 30. ● Judgements are hairy for even humans. ● Can’t optimize some user action directly. ● Training data is scarce -- need to fuse multiple data sources together. Duplicate Questions: Problem Properties
  • 31. 1. Duplicate Question Detection 2. Answer Ranking 3. Topic Expertise Detection 4. Moderation
  • 32. Given a question, how do you rank answers by quality?
  • 33. Rank answers to a question by their “quality” Problem Statement
  • 34.
  • 35. A simple function of upvotes and downvotes, with some precomputed author priors. Previous Approach
  • 36. ● Popular answers != factually correct ● Joke answers get disproportionately many upvotes ● Expert answers ranked lower than answers by popular writers ● Rich get richer ● Poor ranking for new answers ● ... Great Baseline, but...
  • 37. ● Upvote means different things to different people e.g funny, correct, useful. ● Doesn’t always correspond to quality ● ...what is quality? Why do all these problems exist?
  • 38. Defining High Quality Answers ● Answers the question ● Is factually correct ● Is clear and easy to read ● Supported with rationale ● Demonstrates credibility ● ...
  • 39. Answer Ranking: Formulation ● Item-wise regression on answers. ● Also tried item-wise multi-class classification on score buckets ● Item-wise enable comparing answers across different questions. ● Can also discover “Quora Gold” and “really bad” answers Answer 1 Answer 2 Answer 3 Question 0.9 0.8 0.5
  • 40. ● R2 ● Weighted R2 with different weights for different parts of the quality spectrum ● NDCG ● Kendall’s Tau ● ... Answer Ranking: Evaluation
  • 41. ● Hand labeled data ● Language model on standard datasets ● Explicit quality survey shown to users ● Implicit data from usage ● Semi-supervised approaches for label propagation ● Surrogate learning (e.g predicting if “topic experts” will upvote the answer) ● ... Answer Ranking: Training Data
  • 42. We tested 100+ features, the final model uses ~50 features after feature pruning ● User features -- e.g “Is the author an expert in the topic?” ● Answer text features -- e.g “What is the syntactic complexity of the text?” ● Question/Answer features -- e.g “Is the answer answering the question?” ● Voter features -- e.g “Is voter an expert in the topic?” ● Metadata features -- e.g “How many answers did the question have when the answer was written?” Answer Ranking: Features
  • 43. Models ● Logistic Regression ● Random Forests ● Gradient Boosted Decision Trees ● Recurrent Neural Networks ● Ensembles ● … GBDTs provide a good balance between accuracy, complexity, training time, prediction time and ease of deploying in production. Answer Ranking: Models
  • 46. ● Latency: tens of milliseconds ● Computing 100 features each for 100 answers, even at 10us per feature computation, can take 100ms. ● Need to parallelize computation, and also cache feature values/scores. ● Caching → need to support real-time cache dirties/updates. Answer Ranking: Productionizing
  • 47. ● Trick -- don’t recompute scores if the feature doesn’t flip any ‘decision branch’. Answer Ranking: Productionizing
  • 48. Answer Quality: Problem Properties ● Need to start with defining what we want the model to learn. ● Feature engineering and interpretability are important. ● Class imbalance for classification problems. ● Training data is scarce -- need to fuse multiple data sources together.
  • 49. 1. Duplicate Question Detection 2. Answer Ranking 3. Topic Expertise Detection 4. Moderation
  • 50.
  • 51. ● Important signal to all other quality systems. ● Can make content more trustworthy. ● Helps retaining and engaging experts Topic Expertise Matters For Quality
  • 53. Predict topic expertise level of users. Problem Statement
  • 54. Deducing Expertise From Topic Biography “Researcher at MSR since 2005” “ML Engineer at Quora” “Taken undergraduate courses” Degree of Expertise In Topic “Machine Learning” “Invented AdaBoost”“Learning machine programming”
  • 55. Problem Formulation Multi-class classification on text of biography, classes being discrete buckets on the expertise spectrum. Experts are sparse → class imbalance. Training Data Sources Hand labeled data, Data from other quality measures, Label propagation, Users can “report” bios... Our Approach
  • 56. Models Logistic Regression, Random Forests, Gradient Boosted Decision Trees, ... Regularization is very important Features Ngrams, Named Entities, Cosine similarity between topic name and biography text, ... Our Approach
  • 57. ● Andrew Ng is an ML expert → his ML answers must be good. ● What about answers upvoted by him? ● What about answers written by someone whose own answers were upvoted by him? ● … Deducing Expertise From Voting
  • 58. Topic Expertise Propagation Graph ● Trust in expertise is transitive: A → B, and B → C ⇒ A → C ● So trust in expertise propagates through the network ● We can mine the graph to discover topic expertise using graph algorithms like PageRank ● Unsupervised learning!
  • 59. Ensembles To Combine Submodels ● Models trained on heterogeneous data ● Models trained using supervised, semi-supervised and unsupervised approaches ● Low correlation between different models ● Can combine them using ensembles
  • 60. Topic Expertise: Problem Properties ● Class imbalance for classification problems. ● Unsupervised learning is powerful. ● Ensembles can help combine learners trained on data from different sources.
  • 61. 1. Duplicate Question Detection 2. Answer Ranking 3. Topic Expertise Detection 4. Moderation
  • 62. Moderation ● Any user-generated-content product has lots of moderation challenges. ● Content -- spam, hate-speech, porn, plagiarism etc. ● Account -- fake names, impersonation, sockpuppets ● Quora specific policies -- e.g. answers making fun of questions, insincere questions.
  • 63. Moderation: ML Challenges ● Super nuanced judgements, too hard for even trained humans. ● Noisy labeled data ● Too hard a problem for a computer.
  • 64. Moderation: ML Challenges ● Difficulty in learning due to severe class imbalance. ● Metrics can be deceiving -- useless models at 99 % accuracy ● Scarce labeled data, class imbalance makes it worse
  • 65. Moderation: ML Challenges ● High precision needed. ● Usually get extremely low recall at desired precision levels. ● Want to detect problems users see bad content, so can’t rely on any user interactions.
  • 66. Moderation: Tricks and Approaches ● Start with looking at the right metrics. ● Can use standard metrics like F1, AUC. ● Or fine tune metrics based on your application needs.
  • 67. Moderation: Tricks and Approaches ● Oversampling minority class, undersampling majority class ● Random sampling to generate negative examples ● Higher cost of mistakes on the minority class.
  • 68. Moderation: Tricks and Approaches ● Successive iterations of: hand labeling → model to reduce the space → hand labeling … ● Scarce data, high feature dimensionality → simple models with regularization often work very well ● Using low precision classifiers for making human review efficient
  • 69. Moderation: Problem Properties ● Severe class imbalance. ● Need to look at the “right” evaluation metrics. ● Sampling techniques can be very effective.
  • 70. Summary What do all quality problems have in common?
  • 71. ● Start with defining what we want the model to do. ● Product intuition is important for feature engineering. ● Interpretability of models is important for iteration. Machine Learning + Product Understanding
  • 72. ● Good and cheap training data is unavailable/costly. ● Often need to combine data from different sources. ● Unsupervised/semi-supervised learning are very useful. ● Ensembles are your friends. Training Data Scarcity
  • 73. Dealing With Class Imbalance ● Look at the right metrics. ● Re-sampling techniques can be very effective. ● Can incorporate “data collection” cost into the algorithm.
  • 74. ● More specific semi-supervised learning approaches. ● Combining Quality, Relevance and Demand together. ● Avoiding (and sometimes creating) feedback loops. ● Engineering challenges behind these ML problems. Topics For Some Other Day
  • 75. Nikhil Garg @nikhilgarg28 Thank You! Questions? Standard Disclaimer: Quora Is Hiring :)