The recently launched LinkedIn Salary product has been designed with the goal of providing compensation insights to the world's professionals and thereby helping them optimize their earning potential. We describe the overall design and architecture of the statistical modeling system underlying this product. We focus on the unique data mining challenges while designing and implementing the system, and describe the modeling components such as Bayesian hierarchical smoothing that help to compute and present robust compensation insights to users. We report on extensive evaluation with nearly one year of de-identified compensation data collected from over one million LinkedIn users, thereby demonstrating the efficacy of the statistical models. We also highlight the lessons learned through the deployment of our system at LinkedIn.
Presented at ACM International Conference on Information and Knowledge Management (ACM CIKM), 2017.
Recipient of Best Case Studies Paper Award at ACM CIKM, 2017.
Corresponding paper: Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary, ACM CIKM, 2017 (available at https://arxiv.org/abs/1703.09845).
Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary
1. Computing Robust Compensation Insights via LinkedIn Salary
Krishnaram Kenthapadi
AI @ LinkedIn
(Joint work with Stuart Ambler, Liang Zhang, Deepak Agarwal)
5. Current Reach (November 2017)
▪ A few million responses out of several million members targeted
– Targeted via emails since early 2016
▪ Countries: US, CA, UK, DE
▪ Insights available for a large fraction of US monthly
active users
6. Data Privacy Challenges
▪ Minimize the risk of inferring any one individual’s compensation data
▪ Protection against data breach
– No single point of failure
Achieved by a combination of techniques: encryption, access control, de-identification, aggregation, thresholding
K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
7. De-identification Example

Original submission (all attributes):
| Title | Region | Company | Industry | Years of exp | Degree | Field of Study | Skills | $$ |
| User Exp Designer | SF Bay Area | Google | Internet | 12 | BS | Interactive Media | UX, Graphics, ... | 100K |

Sliced into cohort-level entries along limited attribute combinations:
| Title | Region | $$ |
| User Exp Designer | SF Bay Area | 100K |
| User Exp Designer | SF Bay Area | 115K |
| ... | ... | ... |

| Title | Region | Industry | $$ |
| User Exp Designer | SF Bay Area | Internet | 100K |

| Title | Region | Years of exp | $$ |
| User Exp Designer | SF Bay Area | 10+ | 100K |

| Title | Region | Company | Years of exp | $$ |
| User Exp Designer | SF Bay Area | Google | 10+ | 100K |

#data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
Note: Original submission stored as encrypted objects.
8. Modeling Challenges
▪ Evaluation
▪ Modeling on de-identified data
▪ Robustness and stability
▪ Outlier detection
9. Problem Statement
▪ How do we compute robust, reliable compensation insights based on de-identified compensation data, while addressing the product requirements such as coverage?
12. Coverage vs. Data Quality / Data Privacy Tradeoff
▪ The threshold (min # data points for returning insights) controls the tradeoff: a higher threshold gives better data quality and data privacy; a lower threshold gives better coverage
▪ Can we achieve them simultaneously?
13. Bayesian Smoothing
▪ Large sample size cohorts (>= 20): Report empirical percentiles
▪ Small sample size cohorts (< 20):
– Empirical percentiles unreliable & unstable
– Idea:
▪ Exploit hierarchical structure
▪ “Borrow strength” from the ancestral cohort that has enough data and best fit
– Bayesian hierarchical smoothing
▪ “Combine” cohort estimates with actual observed entries
▪ Greater weighting for observed data as #observed entries increases
▪ Cohorts with no data:
– Build regression models to predict the salary insights
14. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Cohort hierarchy diagram: ancestors of the cohort are obtained by dropping attributes, e.g., (UX Designer, SF Bay Area, Internet Industry), (UX Designer, Internet Industry, 10+ yrs), (UX Designer, SF Bay Area, 10+ yrs), (SF Bay Area, Internet Industry, 10+ yrs), then (UX Designer, SF Bay Area), ..., down to single-attribute cohorts (UX Designer), (SF Bay Area), (Internet Industry), (10+ yrs), and finally “All data”. Which ancestor should the cohort borrow strength from?]
15. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Same cohort hierarchy diagram, with the available ancestors that have enough data highlighted.]
16. UX Designer, SF Bay Area, Internet Industry, 10+ yrs
[Same cohort hierarchy diagram, with the best ancestor highlighted.]
Best ancestor: the ancestor that results in max (log) likelihood for the observed entries; its distribution serves as the prior distribution. Compute the posterior distribution based on the observed entries.
17. Bayesian Smoothing
[Assume: the compensation data follows a log-normal distribution]
– Use Gaussian-gamma distribution as the conjugate prior
Steps
– Apply logarithmic transformation to all data entries
– Find the “best” ancestral cohort
– Obtain prior log-normal distribution from this cohort
– Compute posterior log-normal distribution based on observed entries for the
cohort of interest
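The steps above can be sketched in code. This is a minimal illustration under the slide's assumptions (log-normal salaries, Normal-Gamma conjugate prior), not the production system; the prior pseudo-counts `kappa0` and `alpha0` stand in for the tuned smoothing parameters mentioned in the notes.

```python
import math
from statistics import NormalDist

def smoothed_percentiles(entries, prior_mu, prior_var,
                         kappa0=5.0, alpha0=3.0, beta0=None,
                         quantiles=(0.10, 0.50, 0.90)):
    """Bayesian smoothing sketch: combine the prior (mean/variance of
    log-salary from the best ancestral cohort) with the cohort's observed
    entries via a Normal-Gamma conjugate update, then report percentiles
    of the resulting log-normal distribution."""
    logs = [math.log(x) for x in entries]          # log transformation
    n = len(logs)
    xbar = sum(logs) / n
    ss = sum((y - xbar) ** 2 for y in logs)        # within-cohort scatter
    if beta0 is None:
        beta0 = alpha0 * prior_var                 # so prior variance ~ prior_var
    # Standard Normal-Gamma posterior updates
    kappa_n = kappa0 + n
    mu_n = (kappa0 * prior_mu + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - prior_mu) ** 2 / (2.0 * kappa_n)
    sigma = math.sqrt(beta_n / alpha_n)            # plug-in posterior std dev
    nd = NormalDist(mu_n, sigma)
    return {q: math.exp(nd.inv_cdf(q)) for q in quantiles}
```

Note how the posterior mean `mu_n` is a weighted average of the prior mean and the observed mean, with the observed data getting more weight as the number of entries grows, matching the "greater weighting for observed data" bullet.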
19. Regression Model for Title-Region Cohorts
▪ Smoothing motivation:
– Only 30% of title-region cohorts have 30+ data points
– Title-only or region-only parent cohorts not ideal
▪ Regression model used to obtain the prior distribution
– instead of falling back to the parent cohort
▪ Inference motivation:
– Infer salary insights for title-region cohorts with no data
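As an illustration of this idea, the sketch below fits a simple additive decomposition of log(salary) into title and region effects. This is a stand-in for the paper's regression model (which is richer), but it shows how such a model can produce a prior estimate for a title-region cohort, including one with no data.

```python
import math
from collections import defaultdict

def fit_additive_log_model(rows):
    """rows: iterable of (title, region, salary).
    Fits log(salary) ~ mu + title_effect + region_effect by simple
    mean decomposition, and returns a predict(title, region) function
    usable as a prior for sparse or empty title-region cohorts."""
    logs = [(t, r, math.log(s)) for t, r, s in rows]
    mu = sum(y for _, _, y in logs) / len(logs)    # global mean of log-salary
    t_acc, r_acc = defaultdict(list), defaultdict(list)
    for t, _, y in logs:
        t_acc[t].append(y - mu)
    t_eff = {t: sum(v) / len(v) for t, v in t_acc.items()}
    for t, r, y in logs:                           # region effect on residuals
        r_acc[r].append(y - mu - t_eff[t])
    r_eff = {r: sum(v) / len(v) for r, v in r_acc.items()}
    def predict(title, region):
        # Unseen titles/regions fall back to the global mean (zero effect)
        return math.exp(mu + t_eff.get(title, 0.0) + r_eff.get(region, 0.0))
    return predict
```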
20. Offline Evaluation of Smoothing
▪ Observed entries => training (90%) + test (10%)
▪ Goodness-of-fit analysis using log-likelihood of test data
▪ Quantile coverage test for statistical consistency (non-parametric)
– Fraction of the test data that lies between 10th and 90th percentiles of
predicted insights
– Ideally: 80%
– Cohorts with >=5 samples: Smoothing (83%), Empirical (71%)
– Cohorts with 3-4 samples: Smoothing (86%), Empirical (39%)
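The quantile coverage test can be sketched as follows (an illustrative reconstruction; the exact percentile convention used in the evaluation may differ):

```python
def quantile_coverage(train, test, lo_q=0.10, hi_q=0.90):
    """Fraction of hold-out entries that fall between the lo_q and hi_q
    percentiles estimated from the training split. For a well-calibrated
    model this should be close to hi_q - lo_q (ideally 80%)."""
    s = sorted(train)
    def pct(q):
        # Nearest-rank empirical percentile (one of several conventions)
        idx = min(len(s) - 1, max(0, int(round(q * (len(s) - 1)))))
        return s[idx]
    lo, hi = pct(lo_q), pct(hi_q)
    inside = sum(1 for x in test if lo <= x <= hi)
    return inside / len(test)
```

The same function can be run with percentiles taken from the smoothed distribution instead of `pct` to reproduce the smoothed-vs-empirical comparison above.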
23. Mapping BLS OES Dataset to LinkedIn Taxonomy
▪ BLS occupations coarser
– 805 Standard Occupational Classification (SOC) codes vs. 25K LinkedIn standardized titles
▪ Title mapping:
– BLS SOC code --> O*Net alternate titles --> LinkedIn standardized titles
▪ BLS regions finer
▪ Region mapping:
– BLS regions --> Zip codes --> LI regions
▪ Coverage for 6.5K standardized titles (and nearly all
(~285) US region codes), 1.5M <titleId, region code> pairs
24. Outlier Detection: Box and Whisker Method
▪ Cold-start: mapped BLS data
▪ Then, with member-submitted compensation entries
▪ Floored at federal minimum wage
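A minimal sketch of the box-and-whisker (Tukey fences) method, with an optional lower bound such as the annualized federal minimum wage. The fence multiplier `k` and the quartile interpolation scheme are illustrative assumptions, not taken from the paper.

```python
def iqr_fences(values, k=1.5, floor=None):
    """Tukey box-and-whisker fences: entries outside [lo, hi] are treated
    as outliers. `floor` (e.g., annualized minimum wage) bounds the lower
    fence so plausible low salaries are not over-pruned."""
    s = sorted(values)
    n = len(s)
    def quartile(q):
        # Linear interpolation between closest ranks
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return s[i] if i + 1 >= n else s[i] + frac * (s[i + 1] - s[i])
    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    if floor is not None:
        lo = max(lo, floor)
    return lo, hi

def prune_outliers(values, k=1.5, floor=None):
    lo, hi = iqr_fences(values, k, floor)
    return [v for v in values if lo <= v <= hi]
```

During cold-start, the fences would be computed from the mapped BLS data for the cohort rather than from the (still sparse) member submissions.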
25. Deployment Challenges & Lessons Learned
▪ Extensible APIs to support evolving product needs
▪ Lack of good public “ground truth” datasets
▪ Coverage vs. robustness tradeoffs via simulations
▪ Choosing smoothing threshold
26. Summary & Reflections
▪ LinkedIn Salary: a new internet application
– Robust, reliable compensation insights via statistical
modeling techniques
▪ Bayesian Hierarchical Smoothing
▪ Outlier Detection
– Empirical evaluation & deployment lessons
▪ Future Directions
– Career marketplace efficiency using compensation insights
– Detecting inconsistencies in the insights across cohorts
▪ Position transition graphs, salaries from job postings, …
– Addressing sample selection bias & response bias
27. Thanks & Pointers
▪ Related paper: K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System
for Secure Collection and Presentation of Structured Compensation Insights to Job
Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
▪ Team:
Careers Engineering
Ahsan Chudhary
Alan Yang
Alex Navasardyan
Brandyn Bennett
Hrishikesh S
Jim Tao
Juan Pablo Lomeli Diaz
Lu Zheng
Patrick Schutz
Ricky Yan
Stephanie Chou
Joseph Florencio
Santosh Kumar Kancha
Anthony Duerr
Data Relevance Engineering
Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul Jain,
Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
Product Managers: Ryan Sandler, Keren Baruch
UED: Julie Kuang
Marketing: Phil Bunge
Business Operations: Prateek Janardhan
BA: Fiona Li
Testing: Bharath Shetty
ProdOps/VOM: Sunil Mahadeshwar
Security: Cory Scott, Tushar Dalvi, and team
linkedin.com/salary
Editor's Notes
Corresponding paper: K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal, Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary, 2017
(arxiv.org/abs/1703.09845)
Why LinkedIn Salary:
Compensation a key factor when choosing a new job opportunity
But, not easily available (asymmetry between job seekers and job providers)
Goal: help job seekers explore compensation along different dimensions, make more informed career decisions / optimize their earning potential
Compensation data can also help improve other LinkedIn product/services such as job recommendations
Other social benefits:
Better understand the monetary dimensions of the economic graph
Greater transparency / address pay inequality
Greater efficiency in the labor marketplace (reduce asymmetry of knowledge)
Encourage workers to learn skills needed for obtaining well paying jobs (narrow the skills gap)
Example link: https://www.linkedin.com/salary/explorer?countryCode=us&regionCode=84&titleId=3114
In the publicly launched LinkedIn Salary product, users can explore compensation insights by searching for different titles and regions. For a given title and location, we present the quantiles (10th and 90th percentiles, median) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how the pay varies based on factors such as region, experience, education, company size, and industry, and which locations, industries, or companies pay the most.
We started reaching out to members during early 2016.
The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn users. We designed a give-to-get model based data collection process as follows. First, cohorts (such as User Experience Designers in San Francisco Bay Area) with a sufficient number of LinkedIn users are selected. Within each cohort, emails are sent to a random subset of users, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding users with the compensation insights, and also reach out to the remaining users in those cohorts, promising corresponding insights immediately upon submission of their compensation data.
Data collection process (at a high level):
Select (title, region) cohorts with enough members
Wave 1: Emails sent to a random subset in each cohort, requesting members to submit their salary (with promise of insights once there is enough data)
Wave 2: Once there is enough data, get back to the responding members with insights, and also reach out to the remaining members, promising immediate insights
Considering the sensitive nature of compensation data and the desire for preserving privacy of users, we designed our system such that there is protection against data breach, and any one individual's compensation data cannot be inferred by observing the outputs of the system.
Encryption:
Member attributes and compensation data encrypted separately
De-identification:
Slice data points along limited number of attributes towards de-identification
Aggregation:
Sliced data grouped before processing, subject to minimum threshold
Note: for illustration purposes, we have hidden / abstracted several details. In particular, the member attribute fields and the compensation data are stored in encrypted form (with different set of keys).
Once we have enough entries to meet the threshold, we put the sliced data in a queue so that it can be associated with compensation data.
Modeling Challenges
Evaluation: In contrast to several other user-facing products such as movie and job recommendations, we face unique evaluation and data quality challenges. Users themselves may not have a good perception of the true compensation range, and hence it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., BLS OES dataset), mapping such datasets to LinkedIn's taxonomy is inevitably noisy.
Modeling on de-identified data: Due to the privacy requirements, the salary modeling system has access only to cohort level data containing de-identified compensation submissions (e.g., salaries for UX Designers in San Francisco Bay Area), limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes such as title, country, region, company, and years of experience, and contains aggregated compensation entries obtained from individuals having the same values of those attributes. Within a cohort, each individual entry consists of values for different compensation types such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without associated user name, id, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited since we have access only to the de-identified data, and cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.
Robustness and Stability: While some cohorts may each have a large sample size, a large number of cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we need to ensure that the compensation insights are robust and stable even when there is data sparsity. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. A related challenge is whether we can reliably infer the insights for cohorts with no data at all.
Outlier Detection: As the quality of the insights depends on the quality of submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes/misunderstandings during submission, or intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when this problem was more challenging, and there may not be sufficient data across say, related cohorts.
Our problem can thus be stated as follows: How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we design the system taking into account the unique privacy and security challenges, while addressing the product requirements?
How do we design the salary modeling system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products?
How do we compute robust, reliable compensation insights based on de-identified compensation data, while addressing the product requirements such as coverage?
We describe the overall design and architecture of the salary modeling system deployed as part of the recently launched LinkedIn Salary product. Our system consists of an online component that uses a service oriented architecture for retrieving compensation insights corresponding to the query from the user facing product, and an offline component for processing de-identified compensation data and generating compensation insights.
LinkedIn Salary Platform: The REST Server provides compensation insights on request by instances of REST Client. The REST API allows retrieval of individual insights, or lists of them. For each cohort, an insight includes, when data is available, the quantiles (10th and 90th percentiles, median), and histograms for base salary, bonus, and other compensation types. For robustness of the insights in the face of small numbers of submissions and changes as data is collected, we report quantiles such as 10th and 90th percentiles and median, rather than absolute range and mean.
LinkedIn Salary Use Case: For an eligible user, compensation insights are obtained via a REST Client from the REST Server implementing the Salary Platform REST API. These are then presented as part of LinkedIn Salary product. Based on the product and business needs, the eligibility can be defined in terms of criteria such as whether the user has submitted his/her compensation data within the last one year (give-to-get model), or whether the user has a premium membership.
Our Salary Platform has four service APIs to give the information needed for LinkedIn Salary insight page: (1) a “criteria” finder to obtain the core compensation insight for a cohort, (2) a “facets” finder to provide information on cohorts with insights for filters such as industry and years of experience, (3) a “relatedInsights” finder to obtain compensation insights for related titles, regions, etc., and (4) a “topInsights” finder to obtain compensation insights by top paying industries, companies, etc. These finders were carefully designed to be extensible as the product needs evolve over time. For instance, although we had originally designed the “criteria” finder to provide insights during the compensation collection stage, we were able to reuse and extend it for LinkedIn Salary and other use cases.
Offline System for Computing Compensation Insights: The insights are generated using an offline workflow (discussed in depth here), that consumes the De-identified Submissions data (corresponding to cohorts having at least a minimum number of entries) on HDFS, and then pushes the results to the Insights and Lists Voldemort key-value stores for use by the REST Server. The offline workflow includes the modeling components such as outlier detection and Bayesian hierarchical smoothing, which we describe next.
Bootstrapping problem when we started data collection: How do we ensure quality of compensation insights when very little data is available? Afterwards, how to address reliability for cohorts with very little or no data at all?
Idea: use statistical modeling techniques to address the modeling challenges, and to borrow strengths from related cohorts
There is an inherent trade-off between the quality of compensation insights (the higher the threshold for # samples, the better) and the product coverage (the lower the threshold for # samples, the better). In other words, there exists a tension between the validity of our summary statistics (which argues for requiring many data points before we're willing to display compensation insights for a cohort) and coverage (having a lot of cohorts to show).
We addressed this issue by performing statistical data smoothing for cohorts with small sample size. For cohorts with large sample size (e.g., >= 20), we report the empirical percentiles. For cohorts with small sample size, we apply Bayesian hierarchical smoothing, by combining cohort estimates with the actual observed entries. For example, for a (title, region) cohort, the intuition is that the estimates could be obtained by making use of data for the underlying title and the underlying region cohorts. Further, there will be greater weighting to the data based on observed entries, as the number of observed entries increases.
More specifically, we consider two approaches for the estimation model: (1) a log-linear model to estimate log(salary) for a given cohort as a function of title, region, industry, etc, (2) a hierarchical model to determine the best ancestral cohort for a given cohort to determine the prior mean & variance. Then, we obtain smoothed estimates of percentiles by using this prior mean and variance.
Idea: “borrow strength” from y’s ancestral cohorts that have enough data to estimate a distribution (e.g. 20+ points). The list of candidate parent cohorts can be picked as the set of all cohorts that contain y as a subset, and with sample size meeting the threshold (e.g., 20). Denote this set as P.
Since the size of P is very likely greater than 1, and mathematically it is much easier to start with a single ancestor (extending this approach to multiple parents can be considered later), the first step is to pick the “best” ancestral cohort out of P.
Algorithm
Find “best” ancestral cohort and view it as the prior distribution
Best ancestor: ancestor that results in max (log) likelihood for the observed entries in the cohort of interest
Compute posterior distribution based on observed entries
More details:
The compensation data overall is log-normally distributed.
Hence, we apply logarithmic transformation, and then use Gauss-Gamma conjugate prior distribution
There are two key smoothing parameters, which are tuned for different segments by performing a grid search and computing the likelihood of observing the hold-out data (described under smoothing validation)
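The best-ancestor selection described above can be sketched as follows: each candidate ancestor's fitted log-normal distribution is scored by the log-likelihood it assigns to the cohort's observed entries, and the maximizer is chosen.

```python
import math

def log_normal_loglik(entries, mu, sigma):
    """Log-likelihood of salary entries under log-normal(mu, sigma),
    where mu and sigma are the mean and std dev of log(salary)."""
    ll = 0.0
    for x in entries:
        z = (math.log(x) - mu) / sigma
        ll += -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z
    return ll

def best_ancestor(cohort_entries, ancestors):
    """ancestors: {name: (mu, sigma)} fitted on the log-salaries of each
    candidate ancestral cohort in P (those meeting the sample threshold).
    Returns the name of the ancestor maximizing the likelihood of the
    cohort's observed entries."""
    return max(ancestors,
               key=lambda a: log_normal_loglik(cohort_entries, *ancestors[a]))
```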
24 models corresponding to three countries and eight compensation types.
Evaluated using cross-validation, and also by computing the coefficient of determination, R2
Smoothed distribution results in greater likelihood for the observed data from the hold-out set, compared to using just the empirical distribution
We evaluated statistical smoothing using a “goodness-of-fit” analysis and a quantile coverage test.
With the former analysis, we observed that combining ancestor and cohort data with statistical smoothing results in better goodness-of-fit for the observed data in the hold-out set, compared to using ancestor or cohort alone. An intuitive explanation is that statistical smoothing provides better stability and robustness of insights. Inferring a distribution based on just the observed entries in the training set for a cohort and using it to fit the corresponding hold-out set is not as robust as using the smoothed distribution to fit the hold-out set, especially when the cohort contains very few entries.
In the latter test, we measured what fraction of a hold-out set lies between two quantiles say, 10th and 90th percentiles, computed based on the training set (1) empirically without smoothing, and (2) after applying smoothing. Ideally, this fraction should equal 80%. We observed that the fractions computed using smoothed percentiles are significantly better than those computed using empirical percentiles. This fraction was 85% with smoothing (close to the ideal of 80%), but only 54% with empirical approach. We observed similar results for various segments such as cohorts containing a company, an industry, or a degree.
These two tests together establish that statistical smoothing leads to significantly better and robust compensation insights. By employing statistical smoothing, we were able to reduce the threshold used for displaying compensation insights in LinkedIn Salary product, thereby achieving significant increase in product coverage, while simultaneously preserving the quality of the insights.
Bootstrapping problem when we started data collection: How do we ensure quality of compensation insights when very little data is available?
Idea: make use of reliable, publicly available compensation datasets
BLS OES dataset
Initial thought: why not use as prior distribution?
Accurate mapping is a challenging task, due to different taxonomies
805 BLS special occupation codes (SOC) vs 25K standardized titles
Please note that there is noise in the various mappings (especially BLS SOC --> O*Net Alternate titles --> v2 standardized titles), and hence the resulting data values are more suited for computing the range for outlier detection (since this has sufficient room for accommodating the noise), than for presenting this data directly to users.
We next present the challenges encountered and the lessons learned through the production deployment of both our salary modeling offline workflow and REST service as part of the overall LinkedIn Salary system for more than a year, initially during the compensation collection and later as part of the publicly launched product. We performed several iterative deployments of the offline workflow and the REST service, always maintaining a sandbox version (for testing) and a production version of our system. The service was deployed to multiple servers across different data centers to ensure better fault-tolerance and latency.
One of the greatest challenges has been the lack of good public “ground truth” datasets. In the United States, the Internal Revenue Service has fairly complete salary data, but it is not public. The Bureau of Labor Statistics makes available aggregate data, but that brings with it the challenge of mapping to LinkedIn taxonomy. Some state governments, for example California, make available government employee salaries, but government jobs differ from private-sector jobs.
We ran simulations, sampling smaller cohorts from larger ones, to get an idea of how much variation in the reported quantiles might be expected if we decreased the sample size threshold. This analysis helped us understand the tradeoffs between increasing coverage and ensuring robust insights, and guided the product team's decisions on whether to reduce the threshold. In fact, this analysis helped motivate the need for applying Bayesian smoothing, after which we were able to further reduce the threshold while retaining robustness of insights. The choice of smoothing threshold (h = 20) was determined by similar tradeoffs. While applying smoothing is desirable for even larger cohorts from the perspective of robustness, a practical limitation is that the smoothed histograms have to be computed based on a parametrized (log-normal) distribution, resulting in all smoothed histograms having identical shape (truncated log-normal distribution from 10th to 90th percentiles). As the empirical histogram was considered more valuable from the product perspective, we decided not to display any histogram for smoothed cohorts, and chose a relatively low threshold.
We are excited about the immense potential for LinkedIn Salary to help job seekers, and the related research possibilities to better understand and improve the efficiency of career marketplace.
We are also currently pursuing several directions to extend this work. For instance, we have recently extended this work to infer insights for cohorts with no data.
Broadly, we would like to improve quality of insights and product coverage via better data collection and processing, including improvement of the statistical smoothing methodology, better estimation of variance in the regression models, creation of intermediary levels such as company clusters between companies and industries, and improved outlier detection.
Another direction is to use other datasets (e.g., position transition graphs, salaries extracted from job postings) to detect/correct inconsistencies in the insights across cohorts.
Finally, mechanisms can be explored to quantify and address different types of biases such as sample selection bias and response bias. For example, models could be developed to predict response propensity based on user profile and behavioral attributes, which could then be used to compensate for response bias through techniques such as inverse probability weighting.
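The inverse-probability-weighting idea can be sketched as follows. The propensity values here are hypothetical; in practice they would come from a response-propensity model built on profile and behavioral attributes.

```python
def ipw_weighted_quantile(values, propensities, q):
    """Inverse-probability-weighted quantile: each submission is weighted
    by 1/p, where p is its modeled probability of responding. Groups that
    under-respond (low p) are up-weighted, counteracting response bias."""
    pairs = sorted(zip(values, (1.0 / p for p in propensities)))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= q * total:
            return v
    return pairs[-1][0]
```

With equal propensities this reduces to the ordinary empirical quantile; down-weighting over-represented respondents shifts the estimate toward the under-represented group.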
Privacy-preserving machine learning models in a practical setting [e.g., Chaudhuri et al, Papernot et al]
Applicability of provably privacy-preserving machine learning approaches
Would require a redesign; build richer predictions models while preserving privacy
Outlier detection during submission stage
Using user profile & behavioral features