The recently launched LinkedIn Salary product has been designed with the goal of providing compensation insights to the world's professionals and thereby helping them optimize their earning potential. We describe the overall design and architecture of the statistical modeling system underlying this product. We focus on the unique data mining challenges while designing and implementing the system, and describe the modeling components such as Bayesian hierarchical smoothing that help to compute and present robust compensation insights to users. We report on extensive evaluation with nearly one year of de-identified compensation data collected from over one million LinkedIn users, thereby demonstrating the efficacy of the statistical models. We also highlight the lessons learned through the deployment of our system at LinkedIn.
Presented at ACM International Conference on Information and Knowledge Management (ACM CIKM), 2017.
Recipient of Best Case Studies Paper Award at ACM CIKM, 2017.
Corresponding paper: Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary, ACM CIKM, 2017 (available at https://arxiv.org/abs/1703.09845).
Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary
1. Computing Robust Compensation Insights via LinkedIn Salary
Krishnaram Kenthapadi
AI @ LinkedIn
(Joint work with Stuart Ambler, Liang Zhang, Deepak Agarwal)
5. Current Reach (November 2017)
▪ A few million responses out of several million members targeted
– Targeted via emails since early 2016
▪ Countries: US, CA, UK, DE
▪ Insights available for a large fraction of US monthly
active users
6. Data Privacy Challenges
▪ Minimize the risk of inferring any one individual’s compensation data
▪ Protection against data breach
– No single point of failure
Achieved by a combination of techniques: encryption, access control, de-identification, aggregation, thresholding
K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
7. De-identification Example

Original submission (all attributes):
| Title | Region | Company | Industry | Years of exp | Degree | Field of Study | Skills | $$ |
| User Exp Designer | SF Bay Area | Google | Internet | 12 | BS | Interactive Media | UX, Graphics, ... | 100K |

Sliced into cohort-level entries along limited attribute combinations:
| Title | Region | $$ |
| User Exp Designer | SF Bay Area | 100K |
| User Exp Designer | SF Bay Area | 115K |
| ... | ... | ... |

| Title | Region | Industry | $$ |
| User Exp Designer | SF Bay Area | Internet | 100K |

| Title | Region | Years of exp | $$ |
| User Exp Designer | SF Bay Area | 10+ | 100K |

| Title | Region | Company | Years of exp | $$ |
| User Exp Designer | SF Bay Area | Google | 10+ | 100K |

#data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
Note: Original submission stored as encrypted objects.
8. Modeling Challenges
▪ Evaluation
▪ Modeling on de-identified data
▪ Robustness and stability
▪ Outlier detection
9. Problem Statement
▪ How do we compute robust, reliable compensation insights based on de-identified compensation data, while addressing the product requirements such as coverage?
12. Coverage vs. Data Quality / Data Privacy Tradeoff
▪ The threshold (min # data points for returning insights) controls the tradeoff: a higher threshold gives better data quality and data privacy; a lower threshold gives better coverage
▪ Can we achieve them simultaneously?
13. Bayesian Smoothing
▪ Large sample size cohorts (>= 20): Report empirical percentiles
▪ Small sample size cohorts (< 20):
– Empirical percentiles unreliable & unstable
– Idea:
▪ Exploit hierarchical structure
▪ “Borrow strength” from the ancestral cohort that has enough data and best fit
– Bayesian hierarchical smoothing
▪ “Combine” cohort estimates with actual observed entries
▪ Greater weighting for observed data as #observed entries increases
▪ Cohorts with no data:
– Build regression models to predict the salary insights
14. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Cohort hierarchy diagram: ancestors of the cohort are obtained by dropping attributes, e.g., (UX Designer, SF Bay Area, Internet Industry), (UX Designer, Internet Industry, 10+ yrs), (UX Designer, SF Bay Area, 10+ yrs), (SF Bay Area, Internet Industry, 10+ yrs), then (UX Designer, SF Bay Area), ..., down to single-attribute cohorts (UX Designer), (SF Bay Area), (Internet Industry), (10+ yrs), and finally “All data”. Which ancestor should the cohort borrow strength from?]
15. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Same cohort hierarchy diagram, with the available ancestors that have enough data highlighted.]
16. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Same cohort hierarchy diagram, with the best ancestor highlighted.]
Best ancestor: the ancestor that results in max (log) likelihood for the observed entries; its distribution serves as the prior distribution. Compute the posterior distribution based on the observed entries.
17. Bayesian Smoothing
[Assume: the compensation data follows a log-normal distribution]
– Use Gaussian-gamma distribution as the conjugate prior
Steps
– Apply logarithmic transformation to all data entries
– Find the “best” ancestral cohort
– Obtain prior log-normal distribution from this cohort
– Compute posterior log-normal distribution based on observed entries for the
cohort of interest
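The steps above can be sketched in code. This is a minimal illustration under the slide's assumptions (log-normal salaries, Normal-Gamma conjugate prior), not the production system; the prior pseudo-counts `kappa0` and `alpha0` stand in for the tuned smoothing parameters mentioned in the notes.

```python
import math
from statistics import NormalDist

def smoothed_percentiles(entries, prior_mu, prior_var,
                         kappa0=5.0, alpha0=3.0, beta0=None,
                         quantiles=(0.10, 0.50, 0.90)):
    """Bayesian smoothing sketch: combine the prior (mean/variance of
    log-salary from the best ancestral cohort) with the cohort's observed
    entries via a Normal-Gamma conjugate update, then report percentiles
    of the resulting log-normal distribution."""
    logs = [math.log(x) for x in entries]          # log transformation
    n = len(logs)
    xbar = sum(logs) / n
    ss = sum((y - xbar) ** 2 for y in logs)        # within-cohort scatter
    if beta0 is None:
        beta0 = alpha0 * prior_var                 # so prior variance ~ prior_var
    # Standard Normal-Gamma posterior updates
    kappa_n = kappa0 + n
    mu_n = (kappa0 * prior_mu + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - prior_mu) ** 2 / (2.0 * kappa_n)
    sigma = math.sqrt(beta_n / alpha_n)            # plug-in posterior std dev
    nd = NormalDist(mu_n, sigma)
    return {q: math.exp(nd.inv_cdf(q)) for q in quantiles}
```

Note how the posterior mean `mu_n` is a weighted average of the prior mean and the observed mean, with the observed data getting more weight as the number of entries grows, matching the "greater weighting for observed data" bullet.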
19. Regression Model for Title-Region Cohorts
▪ Smoothing motivation:
– Only 30% of title-region cohorts have 30+ data points
– Title-only or region-only parent cohorts not ideal
▪ Regression model used to obtain the prior distribution
– instead of falling back to the parent cohort
▪ Inference motivation:
– Infer salary insights for title-region cohorts with no data
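As an illustration of this idea, the sketch below fits a simple additive decomposition of log(salary) into title and region effects. This is a stand-in for the paper's regression model (which is richer), but it shows how such a model can produce a prior estimate for a title-region cohort, including one with no data.

```python
import math
from collections import defaultdict

def fit_additive_log_model(rows):
    """rows: iterable of (title, region, salary).
    Fits log(salary) ~ mu + title_effect + region_effect by simple
    mean decomposition, and returns a predict(title, region) function
    usable as a prior for sparse or empty title-region cohorts."""
    logs = [(t, r, math.log(s)) for t, r, s in rows]
    mu = sum(y for _, _, y in logs) / len(logs)    # global mean of log-salary
    t_acc, r_acc = defaultdict(list), defaultdict(list)
    for t, _, y in logs:
        t_acc[t].append(y - mu)
    t_eff = {t: sum(v) / len(v) for t, v in t_acc.items()}
    for t, r, y in logs:                           # region effect on residuals
        r_acc[r].append(y - mu - t_eff[t])
    r_eff = {r: sum(v) / len(v) for r, v in r_acc.items()}
    def predict(title, region):
        # Unseen titles/regions fall back to the global mean (zero effect)
        return math.exp(mu + t_eff.get(title, 0.0) + r_eff.get(region, 0.0))
    return predict
```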
20. Offline Evaluation of Smoothing
▪ Observed entries => training (90%) + test (10%)
▪ Goodness-of-fit analysis using log-likelihood of test data
▪ Quantile coverage test for statistical consistency (non-parametric)
– Fraction of the test data that lies between 10th and 90th percentiles of
predicted insights
– Ideally: 80%
– Cohorts with >=5 samples: Smoothing (83%), Empirical (71%)
– Cohorts with 3-4 samples: Smoothing (86%), Empirical (39%)
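The quantile coverage test can be sketched as follows (an illustrative reconstruction; the exact percentile convention used in the evaluation may differ):

```python
def quantile_coverage(train, test, lo_q=0.10, hi_q=0.90):
    """Fraction of hold-out entries that fall between the lo_q and hi_q
    percentiles estimated from the training split. For a well-calibrated
    model this should be close to hi_q - lo_q (ideally 80%)."""
    s = sorted(train)
    def pct(q):
        # Nearest-rank empirical percentile (one of several conventions)
        idx = min(len(s) - 1, max(0, int(round(q * (len(s) - 1)))))
        return s[idx]
    lo, hi = pct(lo_q), pct(hi_q)
    inside = sum(1 for x in test if lo <= x <= hi)
    return inside / len(test)
```

The same function can be run with percentiles taken from the smoothed distribution instead of `pct` to reproduce the smoothed-vs-empirical comparison above.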
23. Mapping BLS OES Dataset to LinkedIn Taxonomy
▪ BLS occupations coarser
– 805 Standard Occupational Classification (SOC) codes vs. 25K LinkedIn standardized titles
▪ Title mapping:
– BLS SOC code --> O*Net alternate titles --> LinkedIn standardized titles
▪ BLS regions finer
▪ Region mapping:
– BLS regions --> Zip codes --> LI regions
▪ Coverage for 6.5K standardized titles (and nearly all
(~285) US region codes), 1.5M <titleId, region code> pairs
24. Outlier Detection: Box and Whisker Method
▪ Cold-start: mapped BLS data
▪ Then, with member-submitted compensation entries
▪ Floored at federal minimum wage
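A minimal sketch of the box-and-whisker (Tukey fences) method, with an optional lower bound such as the annualized federal minimum wage. The fence multiplier `k` and the quartile interpolation scheme are illustrative assumptions, not taken from the paper.

```python
def iqr_fences(values, k=1.5, floor=None):
    """Tukey box-and-whisker fences: entries outside [lo, hi] are treated
    as outliers. `floor` (e.g., annualized minimum wage) bounds the lower
    fence so plausible low salaries are not over-pruned."""
    s = sorted(values)
    n = len(s)
    def quartile(q):
        # Linear interpolation between closest ranks
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return s[i] if i + 1 >= n else s[i] + frac * (s[i + 1] - s[i])
    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    if floor is not None:
        lo = max(lo, floor)
    return lo, hi

def prune_outliers(values, k=1.5, floor=None):
    lo, hi = iqr_fences(values, k, floor)
    return [v for v in values if lo <= v <= hi]
```

During cold-start, the fences would be computed from the mapped BLS data for the cohort rather than from the (still sparse) member submissions.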
25. Deployment Challenges & Lessons Learned
▪ Extensible APIs to support evolving product needs
▪ Lack of good public “ground truth” datasets
▪ Coverage vs. robustness tradeoffs via simulations
▪ Choosing smoothing threshold
26. Summary & Reflections
▪ LinkedIn Salary: a new internet application
– Robust, reliable compensation insights via statistical
modeling techniques
▪ Bayesian Hierarchical Smoothing
▪ Outlier Detection
– Empirical evaluation & deployment lessons
▪ Future Directions
– Career marketplace efficiency using compensation insights
– Detecting inconsistencies in the insights across cohorts
▪ Position transition graphs, salaries from job postings, …
– Addressing sample selection bias & response bias
27. Thanks & Pointers
▪ Related paper: K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System
for Secure Collection and Presentation of Structured Compensation Insights to Job
Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
▪ Team:
Careers Engineering
Ahsan Chudhary
Alan Yang
Alex Navasardyan
Brandyn Bennett
Hrishikesh S
Jim Tao
Juan Pablo Lomeli Diaz
Lu Zheng
Patrick Schutz
Ricky Yan
Stephanie Chou
Joseph Florencio
Santosh Kumar Kancha
Anthony Duerr
Data Relevance Engineering
Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul Jain,
Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
Product Managers: Ryan Sandler, Keren Baruch
UED: Julie Kuang
Marketing: Phil Bunge
Business Operations: Prateek Janardhan
BA: Fiona Li
Testing: Bharath Shetty
ProdOps/VOM: Sunil Mahadeshwar
Security: Cory Scott, Tushar Dalvi, and team
linkedin.com/salary
Editor's Notes
Corresponding paper: K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal, Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary, 2017
(arxiv.org/abs/1703.09845)
Why LinkedIn Salary:
Compensation a key factor when choosing a new job opportunity
But, not easily available (asymmetry between job seekers and job providers)
Goal: help job seekers explore compensation along different dimensions, make more informed career decisions / optimize their earning potential
Compensation data can also help improve other LinkedIn product/services such as job recommendations
Other social benefits:
Better understand the monetary dimensions of the economic graph
Greater transparency / address pay inequality
Greater efficiency in the labor marketplace (reduce asymmetry of knowledge)
Encourage workers to learn skills needed for obtaining well paying jobs (narrow the skills gap)
Example link: https://www.linkedin.com/salary/explorer?countryCode=us&regionCode=84&titleId=3114
In the publicly launched LinkedIn Salary product, users can explore compensation insights by searching for different titles and regions. For a given title and location, we present the quantiles (10th and 90th percentiles, median) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how the pay varies based on factors such as region, experience, education, company size, and industry, and which locations, industries, or companies pay the most.
We started reaching out to members during early 2016.
The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn users. We designed a give-to-get model based data collection process as follows. First, cohorts (such as User Experience Designers in San Francisco Bay Area) with a sufficient number of LinkedIn users are selected. Within each cohort, emails are sent to a random subset of users, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding users with the compensation insights, and also reach out to the remaining users in those cohorts, promising corresponding insights immediately upon submission of their compensation data.
Data collection process (at a high level):
Select (title, region) cohorts with enough members
Wave 1: Emails sent to a random subset in each cohort, requesting members to submit their salary (with promise of insights once there is enough data)
Wave 2: Once there is enough data, get back to the responding members with insights, and also reach out to the remaining members, promising immediate insights
Considering the sensitive nature of compensation data and the desire for preserving privacy of users, we designed our system such that there is protection against data breach, and any one individual's compensation data cannot be inferred by observing the outputs of the system.
Encryption:
Member attributes and compensation data encrypted separately
De-identification:
Slice data points along limited number of attributes towards de-identification
Aggregation:
Sliced data grouped before processing, subject to minimum threshold
Note: for illustration purposes, we have hidden / abstracted several details. In particular, the member attribute fields and the compensation data are stored in encrypted form (with different set of keys).
Once we have enough entries to meet the threshold, we put the sliced data in a queue so that it can be associated with compensation data.
Modeling Challenges
Evaluation: In contrast to several other user-facing products such as movie and job recommendations, we face unique evaluation and data quality challenges. Users themselves may not have a good perception of the true compensation range, and hence it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., BLS OES dataset), mapping such datasets to LinkedIn's taxonomy is inevitably noisy.
Modeling on de-identified data: Due to the privacy requirements, the salary modeling system has access only to cohort level data containing de-identified compensation submissions (e.g., salaries for UX Designers in San Francisco Bay Area), limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes such as title, country, region, company, and years of experience, and contains aggregated compensation entries obtained from individuals having the same values of those attributes. Within a cohort, each individual entry consists of values for different compensation types such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without associated user name, id, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited since we have access only to the de-identified data, and cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.
Robustness and Stability: While some cohorts may each have a large sample size, a large number of cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we need to ensure that the compensation insights are robust and stable even when there is data sparsity. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. A related challenge is whether we can reliably infer the insights for cohorts with no data at all.
Outlier Detection: As the quality of the insights depends on the quality of submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes/misunderstandings during submission, or intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when this problem was more challenging, and there may not be sufficient data across say, related cohorts.
Our problem can thus be stated as follows: How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we design the system taking into account the unique privacy and security challenges, while addressing the product requirements?
How do we design the salary modeling system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products?
How do we compute robust, reliable compensation insights based on de-identified compensation data, while addressing the product requirements such as coverage?
We describe the overall design and architecture of the salary modeling system deployed as part of the recently launched LinkedIn Salary product. Our system consists of an online component that uses a service oriented architecture for retrieving compensation insights corresponding to the query from the user facing product, and an offline component for processing de-identified compensation data and generating compensation insights.
LinkedIn Salary Platform: The REST Server provides compensation insights on request by instances of REST Client. The REST API allows retrieval of individual insights, or lists of them. For each cohort, an insight includes, when data is available, the quantiles (10th and 90th percentiles, median), and histograms for base salary, bonus, and other compensation types. For robustness of the insights in the face of small numbers of submissions and changes as data is collected, we report quantiles such as 10th and 90th percentiles and median, rather than absolute range and mean.
LinkedIn Salary Use Case: For an eligible user, compensation insights are obtained via a REST Client from the REST Server implementing the Salary Platform REST API. These are then presented as part of LinkedIn Salary product. Based on the product and business needs, the eligibility can be defined in terms of criteria such as whether the user has submitted his/her compensation data within the last one year (give-to-get model), or whether the user has a premium membership.
Our Salary Platform has four service APIs to give the information needed for LinkedIn Salary insight page: (1) a “criteria” finder to obtain the core compensation insight for a cohort, (2) a “facets” finder to provide information on cohorts with insights for filters such as industry and years of experience, (3) a “relatedInsights” finder to obtain compensation insights for related titles, regions, etc., and (4) a “topInsights” finder to obtain compensation insights by top paying industries, companies, etc. These finders were carefully designed to be extensible as the product needs evolve over time. For instance, although we had originally designed the “criteria” finder to provide insights during the compensation collection stage, we were able to reuse and extend it for LinkedIn Salary and other use cases.
Offline System for Computing Compensation Insights: The insights are generated using an offline workflow (discussed in depth here), that consumes the De-identified Submissions data (corresponding to cohorts having at least a minimum number of entries) on HDFS, and then pushes the results to the Insights and Lists Voldemort key-value stores for use by the REST Server. The offline workflow includes the modeling components such as outlier detection and Bayesian hierarchical smoothing, which we describe next.
Bootstrapping problem when we started data collection: How do we ensure quality of compensation insights when very little data is available? Afterwards, how to address reliability for cohorts with very little or no data at all?
Idea: use statistical modeling techniques to address the modeling challenges, and to borrow strengths from related cohorts
There is an inherent trade-off between the quality of compensation insights (the higher the threshold for # samples, the better) and the product coverage (the lower the threshold for # samples, the better). In other words, there exists a tension between the validity of our summary statistics (which argues for requiring many data points before we're willing to display compensation insights for a cohort) and coverage (having a lot of cohorts to show).
We addressed this issue by performing statistical data smoothing for cohorts with small sample size. For cohorts with large sample size (e.g., >= 20), we report the empirical percentiles. For cohorts with small sample size, we apply Bayesian hierarchical smoothing, by combining cohort estimates with the actual observed entries. For example, for a (title, region) cohort, the intuition is that the estimates could be obtained by making use of data for the underlying title and the underlying region cohorts. Further, there will be greater weighting to the data based on observed entries, as the number of observed entries increases.
More specifically, we consider two approaches for the estimation model: (1) a log-linear model to estimate log(salary) for a given cohort as a function of title, region, industry, etc, (2) a hierarchical model to determine the best ancestral cohort for a given cohort to determine the prior mean & variance. Then, we obtain smoothed estimates of percentiles by using this prior mean and variance.
Idea: “borrow strength” from y’s ancestral cohorts that have enough data to estimate a distribution (e.g. 20+ points). The list of candidate parent cohorts can be picked as the set of all cohorts that contain y as a subset, and with sample size meeting the threshold (e.g., 20). Denote this set as P.
Since the size of P is very likely greater than 1, and mathematically it is much easier to start with a single ancestor (extending this approach to multiple parents can be considered later), the first step is to pick the “best” ancestral cohort out of P.
Algorithm
Find “best” ancestral cohort and view it as the prior distribution
Best ancestor: ancestor that results in max (log) likelihood for the observed entries in the cohort of interest
Compute posterior distribution based on observed entries
More details:
The compensation data overall is log-normally distributed.
Hence, we apply logarithmic transformation, and then use Gauss-Gamma conjugate prior distribution
There are two key smoothing parameters, which are tuned for different segments by performing a grid search and computing the likelihood of observing the hold-out data (described under smoothing validation)
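The best-ancestor selection described above can be sketched as follows: each candidate ancestor's fitted log-normal distribution is scored by the log-likelihood it assigns to the cohort's observed entries, and the maximizer is chosen.

```python
import math

def log_normal_loglik(entries, mu, sigma):
    """Log-likelihood of salary entries under log-normal(mu, sigma),
    where mu and sigma are the mean and std dev of log(salary)."""
    ll = 0.0
    for x in entries:
        z = (math.log(x) - mu) / sigma
        ll += -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z
    return ll

def best_ancestor(cohort_entries, ancestors):
    """ancestors: {name: (mu, sigma)} fitted on the log-salaries of each
    candidate ancestral cohort in P (those meeting the sample threshold).
    Returns the name of the ancestor maximizing the likelihood of the
    cohort's observed entries."""
    return max(ancestors,
               key=lambda a: log_normal_loglik(cohort_entries, *ancestors[a]))
```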
24 models corresponding to three countries and eight compensation types.
Evaluated using cross-validation, and also by computing the coefficient of determination, R2
Smoothed distribution results in greater likelihood for the observed data from the hold-out set, compared to using just the empirical distribution
We evaluated statistical smoothing using a “goodness-of-fit” analysis and a quantile coverage test.
With the former analysis, we observed that combining ancestor and cohort data with statistical smoothing results in better goodness-of-fit for the observed data in the hold-out set, compared to using ancestor or cohort alone. An intuitive explanation is that statistical smoothing provides better stability and robustness of insights. Inferring a distribution based on just the observed entries in the training set for a cohort and using it to fit the corresponding hold-out set is not as robust as using the smoothed distribution to fit the hold-out set, especially when the cohort contains very few entries.
In the latter test, we measured what fraction of a hold-out set lies between two quantiles say, 10th and 90th percentiles, computed based on the training set (1) empirically without smoothing, and (2) after applying smoothing. Ideally, this fraction should equal 80%. We observed that the fractions computed using smoothed percentiles are significantly better than those computed using empirical percentiles. This fraction was 85% with smoothing (close to the ideal of 80%), but only 54% with empirical approach. We observed similar results for various segments such as cohorts containing a company, an industry, or a degree.
These two tests together establish that statistical smoothing leads to significantly better and robust compensation insights. By employing statistical smoothing, we were able to reduce the threshold used for displaying compensation insights in LinkedIn Salary product, thereby achieving significant increase in product coverage, while simultaneously preserving the quality of the insights.
Bootstrapping problem when we started data collection: How do we ensure quality of compensation insights when very little data is available?
Idea: make use of reliable, publicly available compensation datasets
BLS OES dataset
Initial thought: why not use as prior distribution?
Accurate mapping is a challenging task, due to different taxonomies
805 BLS special occupation codes (SOC) vs 25K standardized titles
Please note that there is noise in the various mappings (especially BLS SOC --> O*Net Alternate titles --> v2 standardized titles), and hence the resulting data values are more suited for computing the range for outlier detection (since this has sufficient room for accommodating the noise), than for presenting this data directly to users.
We next present the challenges encountered and the lessons learned through the production deployment of both our salary modeling offline workflow and REST service as part of the overall LinkedIn Salary system for more than a year, initially during the compensation collection and later as part of the publicly launched product. We performed several iterative deployments of the offline workflow and the REST service, always maintaining a sandbox version (for testing) and a production version of our system. The service was deployed to multiple servers across different data centers to ensure better fault-tolerance and latency.
One of the greatest challenges has been the lack of good public “ground truth” datasets. In the United States, the Internal Revenue Service has fairly complete salary data, but it is not public. The Bureau of Labor Statistics makes available aggregate data, but that brings with it the challenge of mapping to LinkedIn taxonomy. Some state governments, for example California, make available government employee salaries, but government jobs differ from private-sector jobs.
We ran simulations, sampling smaller cohorts from larger ones, to get an idea of how much variation in the reported quantiles might be expected if we decreased the sample size threshold. This analysis helped us understand the tradeoffs between increasing coverage and ensuring robust insights, and guided the product team's decisions on whether to reduce the threshold. In fact, this analysis helped motivate the need for applying Bayesian smoothing, after which we were able to further reduce the threshold while retaining robustness of insights. The choice of smoothing threshold (h = 20) was determined by similar tradeoffs. While applying smoothing is desirable for even larger cohorts from the perspective of robustness, a practical limitation is that the smoothed histograms have to be computed based on a parametrized (log-normal) distribution, resulting in all smoothed histograms having identical shape (truncated log-normal distribution from 10th to 90th percentiles). As the empirical histogram was considered more valuable from the product perspective, we decided not to display any histogram for smoothed cohorts, and chose a relatively low threshold.
We are excited about the immense potential for LinkedIn Salary to help job seekers, and the related research possibilities to better understand and improve the efficiency of career marketplace.
We are also currently pursuing several directions to extend this work. For instance, we have recently extended this work to infer insights for cohorts with no data.
Broadly, we would like to improve quality of insights and product coverage via better data collection and processing, including improvement of the statistical smoothing methodology, better estimation of variance in the regression models, creation of intermediary levels such as company clusters between companies and industries, and improved outlier detection.
Another direction is to use other datasets (e.g., position transition graphs, salaries extracted from job postings) to detect/correct inconsistencies in the insights across cohorts.
Finally, mechanisms can be explored to quantify and address different types of biases such as sample selection bias and response bias. For example, models could be developed to predict response propensity based on user profile and behavioral attributes, which could then be used to compensate for response bias through techniques such as inverse probability weighting.
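The inverse-probability-weighting idea can be sketched as follows. The propensity values here are hypothetical; in practice they would come from a response-propensity model built on profile and behavioral attributes.

```python
def ipw_weighted_quantile(values, propensities, q):
    """Inverse-probability-weighted quantile: each submission is weighted
    by 1/p, where p is its modeled probability of responding. Groups that
    under-respond (low p) are up-weighted, counteracting response bias."""
    pairs = sorted(zip(values, (1.0 / p for p in propensities)))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= q * total:
            return v
    return pairs[-1][0]
```

With equal propensities this reduces to the ordinary empirical quantile; down-weighting over-represented respondents shifts the estimate toward the under-represented group.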
Privacy-preserving machine learning models in a practical setting [e.g., Chaudhuri et al, Papernot et al]
Applicability of provably privacy-preserving machine learning approaches
Would require a redesign; build richer predictions models while preserving privacy
Outlier detection during submission stage
Using user profile & behavioral features