Preserving privacy of users is a key requirement of web-scale analytics and reporting applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. We focus on the problem of computing robust, reliable analytics in a privacy-preserving manner, while satisfying product requirements. We present PriPeARL, a framework for privacy-preserving analytics and reporting, inspired by differential privacy. We describe the overall design and architecture, and the key modeling components, focusing on the unique challenges associated with privacy, coverage, utility, and consistency. We perform an experimental study in the context of ads analytics and reporting at LinkedIn, thereby demonstrating the tradeoffs between privacy and utility needs, and the applicability of privacy-preserving mechanisms to real-world data. We also highlight the lessons learned from the production deployment of our system at LinkedIn.
Presented at ACM CIKM 2018. Link to our paper: https://arxiv.org/pdf/1809.07754
PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
1. PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
CIKM 2018
Krishnaram Kenthapadi, Thanh Tran
Data @ LinkedIn
2. Analytics Products at LinkedIn
Profile View Analytics
Content Analytics
Ad Campaign Analytics
All showing demographics of members engaging with the product
3. Product Requirements: Utility and Privacy
Utility
• Insights into the audience engaging with the product (e.g., profile, article, or ad)
→ Desirable for the aggregate statistics to be available and accurate.
• Different aspects of data consistency:
- Repeated queries
- Over time
- Totals vs. demographic breakdowns
- Hierarchy (e.g., time, entity)
Privacy
• Member actions could be considered sensitive information (e.g., a click on an article or an ad).
→ An individual’s action must not be inferable from the results of analytics.
• Assume malicious use cases, e.g., an attacker can set up ad campaigns to infer the behavior of a specific member.
4. Application: LinkedIn Ads Analytics
Objective: compute robust, reliable analytics in a privacy-preserving manner, while addressing product desiderata such as utility, coverage, and consistency.
[Diagram: LMS ad flow — Advertiser, Ad, Ad Targeting, LI Ad Serving, Ad Analytics]
5. Possible Attacks
Targeting “senior directors in the US who studied at Cornell” matches ~16K LinkedIn members
→ over the minimum targeting threshold.
But a demographic breakdown (e.g., company = X) may match exactly one person
→ the attacker can determine whether that person clicked on the ad or not.
Simple mitigations are insufficient:
• Enforcing a minimum reporting threshold: the attacker could create fake profiles. E.g., if the threshold is 10, create 9 fake profiles that all click.
• Rounding mechanism (e.g., reporting counts in increments of 10): still amenable to attacks, e.g., using incremental counts over time to infer individuals’ actions.
⇒ Need rigorous techniques to preserve member privacy, without revealing exact aggregate counts.
6. Differential Privacy: Definition
● ε-Differential privacy: for neighboring databases D and D′ (differing in one record), the distributions of the curator’s outputs on the two databases are nearly the same.
● The parameter ε (ε > 0) quantifies information leakage.
○ Smaller ε → more private.
Dwork, McSherry, Nissim, Smith [TCC 2006]
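Formally (the standard statement from the cited paper, not spelled out on the slide): a randomized mechanism A is ε-differentially private if, for every pair of neighboring databases D and D′ and every set S of outputs,
Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S].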
7. Differential Privacy: Random Noise Addition
● Differential privacy can be achieved via random noise addition.
● Common approach: noise drawn from the Laplace distribution.
○ Let s be the L1 sensitivity of the query function f:
s = max_{D, D′} ||f(D) − f(D′)||₁, where D and D′ differ by one record,
○ and let ε be the privacy parameter.
○ Then the Laplace distribution has scale parameter s/ε.
Dwork, McSherry, Nissim, Smith [TCC 2006]
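A minimal sketch of this mechanism in Python (illustrative names, not the PriPeARL implementation):

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Add Laplace(0, s/eps) noise to obtain an eps-differentially private answer.
    scale = sensitivity / epsilon          # Laplace scale parameter b = s / eps
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Counting query: one record changes COUNT(*) by at most 1, so s = 1
# and the noise has scale 1/eps.
noisy_clicks = laplace_mechanism(true_value=1342, sensitivity=1.0, epsilon=1.0)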
8. Ad Analytics Canonical Queries
SELECT COUNT(*)
FROM table(statType, entity)                          -- e.g., clicks on a given ad
WHERE timestamp ≥ startTime AND timestamp ≤ endTime
AND dAttr = dVal                                      -- e.g., Title = “Senior Director”
● The application admits a predetermined query form; the same form also applies to other analytics applications.
● Privacy is preserved by adding Laplace noise.
○ Protects privacy at the event level.
9. PriPeARL: A Framework for Privacy-Preserving Analytics
Pseudo-random noise generation, inspired by differential privacy:
[Pipeline: query attributes (entity id — creative / campaign / campaign group / account; demographic dimension; stat type — impressions, clicks; time range) + fixed secret seed
→ cryptographic hash, normalized to (0,1) → uniformly random fraction
→ Laplace noise (fixed ε)
→ reported count = true count + noise]
To satisfy consistency requirements:
● Pseudo-random noise → the same query gets the same result over time, avoiding averaging attacks.
● For non-canonical queries (e.g., arbitrary time ranges, aggregates over multiple entities):
○ Use the hierarchy to partition into canonical queries.
○ Compute noise for each canonical query and sum up the noisy counts (see the sketch below).
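A minimal sketch of this pipeline, assuming SHA-256 as the cryptographic hash (all names and parameters are illustrative, not the PriPeARL codebase):

import hashlib
import math

def pseudo_random_noise(entity_id, dimension, stat_type, time_range,
                        secret_seed, epsilon, sensitivity=1.0):
    # Deterministic Laplace noise: the same canonical query always maps to the
    # same noise value, so repeated queries are consistent over time and an
    # averaging attack gains nothing.
    # 1. Cryptographic hash of the query attributes plus the fixed secret seed.
    key = f"{secret_seed}|{entity_id}|{dimension}|{stat_type}|{time_range}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # 2. Normalize the hash to a uniformly random fraction in (0, 1).
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64
    # 3. Push the fraction through the inverse Laplace CDF with scale s/eps.
    b = sensitivity / epsilon
    return -b * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))

def reported_count(true_count, **query):
    # Reported count = true count + pseudo-random noise (clamped non-negative).
    return max(0, round(true_count + pseudo_random_noise(**query)))

def reported_range_count(daily_true_counts, base_query):
    # A non-canonical time range is partitioned into canonical (here: daily)
    # sub-queries; each gets its own deterministic noise, and the noisy
    # counts are summed.
    return sum(reported_count(count, **{**base_query, "time_range": day})
               for day, count in daily_true_counts.items())

Because the noise is a deterministic function of the query attributes and the secret seed, re-issuing the same query returns the same reported count, which is the consistency property called out above.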
11. Performance Evaluation: Setup
● Experiments using LinkedIn ad analytics data
○ Consider distribution of impression and click queries
across (account, ad campaign) and demographic
breakdowns.
● Examine
○ Tradeoff between privacy and utility
○ Effect of varying minimum threshold (non-negative)
○ Top-n queries
12. Performance Evaluation: Results
Privacy and Utility Tradeoff
● For ε = 1, the average absolute and signed errors are small for both impression and click queries.
● The variance is also small: ~95% of queries have an error of at most 2.
Top-N Queries
● A common use case in LinkedIn applications.
● Measured via the Jaccard distance between the true and noisy top-n sets, as a function of ε and n.
● (This reflects the worst case, since queries with return sets of size ≤ n and error = 0 were omitted.)
13. Lessons Learned
● Lessons from privacy breaches → need “Privacy by Design”
● Consider business requirements and usability
○ Various consistency desiderata to ensure that results are useful and insightful
● Scaling across analytics applications
○ Abstract away application specifics, build libraries, and optimize for
performance
14. Acknowledgements
▹ Team:
▸ AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran
▸ Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe
▸ Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke
▹ Additional Acknowledgements
▸ Deepak Agarwal, Igor Perisic, Arun Swami, Ya Xu, Yang Zhou
15. Summary
▹ A framework to compute robust, privacy-preserving analytics
▸ Addresses challenges such as preserving member privacy, product coverage, utility, and data consistency
▹ Future work
▸ Utility maximization given constraints on the ‘privacy loss budget’ per user
⬩ E.g., noise with larger variance for impressions but less noise for clicks (or conversions)
⬩ E.g., more noise for broader time-range sub-queries and less noise for granular time-range sub-queries
▹ Tech report: K. Kenthapadi, T. T. L. Tran, “PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn,” ACM CIKM 2018 (https://arxiv.org/pdf/1809.07754)
16. What’s Next: Privacy for ML / Data Applications
▹ Hard open questions
▸ Can we simultaneously develop highly personalized models
and ensure that the models do not encode private information
of members?
▸ How do we guarantee member privacy over time without
exhausting the “privacy loss budget”?
▸ How do we enable privacy-preserving mechanisms for data
marketplaces?
▹ Thanks!