Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)

Privacy-preserving
Data Mining in Industry
WSDM 2019 Tutorial
February 2019
Krishnaram Kenthapadi (AI @ LinkedIn)
Ilya Mironov (Google AI)
Abhradeep Thakurta (UC Santa Cruz)
https://sites.google.com/view/wsdm19-privacy-tutorial

Fairness Privacy
Transparency Explainability

Fairness Privacy
Related WSDM’19 sessions:
1.Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned
(Monday, 13:30 – 17:00)
2.H.V. Jagadish's invited talk: Responsible Data Science (Tuesday, 14:45 – 15:30)
3.Session 4: FATE & Privacy (Tuesday, 16:15 – 17:30)
4.Aleksandra Korolova’s invited talk: Privacy-Preserving WSDM (Wednesday, 14:45–15:30)

Outline / Learning Outcomes
• Privacy breaches and lessons learned
• Evolution of privacy techniques
• Differential privacy: definition and techniques
• Privacy techniques in practice: Challenges and Lessons Learned
• Google’s RAPPOR
• Apple’s differential privacy deployment for iOS
• Privacy in AI @ LinkedIn (Analytics framework & LinkedIn Salary)
• Key Takeaways

Privacy: A Historical Perspective
Evolution of Privacy Techniques and Privacy Breaches

Privacy Breaches and Lessons Learned
Attacks on privacy
•Governor of Massachusetts
•AOL
•Netflix
•Web browsing data
•Facebook
•Amazon
•Australian Gov't

born July 31, 1945
resident of 02138
Massachusetts Group Insurance Commission (1997):
Anonymized medical history of state employees (all
hospital visits, diagnosis, prescriptions)
Latanya Sweeney (MIT grad student): $20 – Cambridge
voter roll
William Weld vs Latanya Sweeney

64
%uniquely identifiable with
ZIP + birth date + gender
(in the US population)
Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”,

Attacker's Advantage
Auxiliary information

August 4, 2006: AOL
Research publishes
anonymized search logs
of 650,000 users
August 9:
New York Times
AOL Data Release

Enough to succeed on a small fraction of inputs

Oct 2006: Netflix announces Netflix
Prize
• 10% of their users
• average 200 ratings per user
Narayanan, Shmatikov (2006):
Netflix Prize

Deanonymizing Netflix Data
Narayanan, Shmatikov, Robust De-
anonymization of Large Datasets (How to
Break Anonymity of the Netflix Prize
Dataset), 2008

● Noam Chomsky in Our Times
● Farenheit 9/11
● Jesus of Nazareth
● Queer as Folk

Key idea:
● Similar intuition as the attack on medical records
● Medical records: Each person can be identified
based on a combination of a few attributes
● Web browsing history: Browsing history is unique for
each person
● Each person has a distinctive social network  links
appearing in one’s feed is unique
● Users likely to visit links in their feed with higher
probability than a random user
● “Browsing histories contain tell-tale marks of identity”
Su et al, De-anonymizing Web Browsing Data with Social Networks, 2017
De-anonymizing Web Browsing Data with Social Networks

High dimensionality

Ad targeting:
Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM
Privacy Attacks On Ad Targeting

10 campaigns targeting 1 person (zip code, gender,
workplace, alma mater)
Facebook vs Korolova
Age
21
22
23
…
30
Ad Impressions in a week
0
0
8
…
0

10 campaigns targeting 1 person (zip code, gender,
workplace, alma mater)
Facebook vs Korolova
Interest
A
B
C
…
Z
Ad Impressions in a week
0
0
8
…
0

● Context: Microtargeted Ads
● Takeaway: Attackers can instrument ad campaigns to
identify individual users.
● Two types of attacks:
○ Inference from Impressions
○ Inference from Clicks
Facebook vs Korolova: Recap

Active

Items frequently bought together
Bought: A B C D E
Z: Customers Who Bought This Item Also Bought
Calandrino, Kilzer, Narayanan, Felten, Shmatikov, “You Might Also Like: Privacy Risks of Collaborative
Attacking Amazon.com
A C D E

Active
Observant

Homer et al., “Resolving individuals contributing trace
amounts of DNA to highly complex mixtures using high-
density SNP genotyping microarrays”, PLoS Genetics,
2008
Genetic data

Reference population
Bayesian Analysis

“In all mixtures, the identification
of the presence of a person’s
genomic DNA was possible.”

Zerhouni, NIH Director:
“As a result, the NIH has removed from
open-access databases the aggregate
results (including P values and genotype
counts) for all the GWAS that had been
available on NIH sites”
… one week later

Active
Observant
Clever

Australian Medicare Release
August 2016: For 10% of Australians (2.9M) medical records and
prescription information from 1984–2014 published by the
federal government.
● Patient: year of birth, gender
● Medical events, codes, the state, price paid
● Dates are perturbed by ±2 weeks
● Supplier IDs are “encrypted”

September 2016: U of Melbourne researchers re-identified
politicians, sports figures, people from news reports.
● 55K women are unique based on their childbirth event(s)
October 2016: Government introduced a bill criminalizing re-
identification of published government data. The bill is pending in
the committee.
“Health Data in an Open World”, Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague, https://arxiv.org/abs/1712.05627

Dinur-Nissim
0 1 1 0 1 0 0 0 1 1 0 1Data
query: 𝚺
Dinur-Nissim 2003:
If error is o(√n), then reconstruction is possible up to n−o(n)
...even if 23.9% of errors are arbitrary [DMT07]
...even with O(n) queries [DY08]

Dwork-Naor
Tore Dalenius desideratum (aka as “semantic security”):
“Access to a statistical database should not enable one to
learn anything about an individual that could not be learned
without access.” (1977)
Dwork-Naor (~2006):
If the database teaches us anything, there is always some auxiliary
information that breaks Dalenius desideratum.

Curator
Defining Privacy: Fool's Errand

Defining Privacy
39
CuratorCurator
+ your data
- your data

Differential Privacy
40
Databases D and D′ are neighbors if they differ in one person’s data.
Differential Privacy: The distribution of the curator’s output M(D) on database
D is (nearly) the same as M(D′).
CuratorCurator
+ your data
- your data
Dwork, McSherry, Nissim, Smith [TCC 2006]

ε-Differential Privacy: The distribution of the curator’s output M(D) on
database D is (nearly) the same as M(D′).
41
CuratorCurator
Parameter ε quantifies
information leakage
∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S].
+ your data
- your data
Dwork, McSherry, Nissim, Smith [TCC 2006]

ε-Differential Privacy: The distribution of the curator’s output M(D) on
database D is (nearly) the same as M(D′).
42
CuratorCurator
Parameter ε quantifies
information leakage
∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S]＋𝛿.
Parameter 𝛿 gives
some slack
Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
+ your data
- your data

43
f(D) f(D′)
— bad outcomes
— probability with record x
— probability without record x
“Bad Outcomes” Interpretation

● Prior on databases p
● Observed output O
● Does the database contain record x?
44
Bayesian Interpretation

● Robustness to auxiliary data
● Post-processing:
If M(D) is differentially private, so is f(M(D)).
● Composability:
Run two ε-DP mechanisms. Full interaction is 2ε-DP.
● Group privacy:
Graceful degradation in the presence of correlated inputs.
45

What Differential Privacy Isn’t
● Algorithm, architecture, or a rule book
● Secure Computation: what not how
● All-encompassing guarantee: trends may be
sensitive too

BBC: “Fitness app Strava lights up staff at military bases”

Differential Privacy: Takeaway points
• Privacy as a notion of stability of randomized algorithms in
respect to small perturbations in their input
• Worst-case definition
• Robust (to auxiliary data, correlated inputs)
• Composable
• Quantifiable
• Concept of a privacy budget
• Noise injection

ε-Differential Privacy: The distribution of the output M(D) on database
D is (nearly) the same as M(D′) for all adjacent databases D and D′:

Local Differential Privacy
ε-Differential Privacy: The distribution of the output M(D) on database
D is (nearly) the same as M(D′) for all adjacent databases D and D′:

Local-Differentially Private Mechanisms
● Stanley L. Warner, "Randomized response: a survey technique for
eliminating evasive answer bias", Journal of American Statistical
Association, March 1965.
● Arijit Chaudhuri, Rahul Mukerjee. Randomized
Response. Theory and Techniques. 1988.

Randomized Response (Warner 1965)
Q1: Are you a citizen of the United States?
Q2: Are you not a citizen of the United States?
𝜃 - the true fraction of citizens in the sample
Answer Q1 Answer Q2
p 1 − p
-DP

RAPPOR
Erlingsson, Pihur, Korolova. "RAPPOR: Randomized Aggregatable Privacy-Preserving
Ordinal Response." ACM CCS 2014.

RAPPOR: two-level randomized response
Can we do repeated surveys of sensitive attributes?
— Average of randomized responses will reveal a user’s true answer :-(
Solution: Memoize! Re-use the same random answer
— Memoization can hurt privacy too! Long, random bit sequence can
be a unique tracking ID :-(
Solution: Use 2-levels! Randomize the memoized response

● Store client value v into bloom filter B using hash functions
● Memoize a Permanent Randomized Response (PRR) B′
● Report an Instantaneous Randomized Response (IRR) S

● Store client value v into bloom filter B using hash functions
● Memoize a Permanent Randomized Response (PRR) B′
● Report an Instantaneous Randomized Response (IRR) S
f = ½
q = ¾ , p = ½

RAPPOR: Life of a report
Value
Bloom
Filter
PRR
IRR
“www.google.com”

Value
Bloom
Filter
PRR
IRR
P(1) =
0.25
P(1) =
0.75

Value
Bloom
Filter
PRR
IRR
P(1) =
0.50
P(1) =
0.75

Differential privacy of RAPPOR
● Permanent Randomized Response satisfies differential privacy at
● Instantaneous Randomized Response has differential privacy at
= 4 ln(3)
= ln(3)

Differential Privacy of RAPPOR:
Measurable privacy bounds
Each report offers differential privacy with
ε = ln(3)
Attacker’s guess goes from 0.1% → 0.3% in the worst case
Differential privacy even if attacker gets all reports (infinite data!!!)
Also… Base Rate Fallacy prevents attackers from finding needles in
haystacks

Cohorts
Bloom Filter: 2 bits out of 128 — too many false positives
...
user 0xA0FE91B76:
google.com
cohort 2cohort 1 cohort 128
h2

From Raw Counts to De-noised Counts
True bit counts, with no noise
De-noised RAPPOR reports

From De-Noised Count to Distribution
True bit counts, with no noise
De-noised RAPPOR reports
google.com:
yahoo.com:
bing.com:

From De-Noised Count to Distribution
Linear Regression:
minX ||B - A X||2
LASSO:
minX (||B - A X||2)2 + λ||X||1
Hybrid:
1. Find support of X via LASSO
2. Solve linear regression to find weights

Explaining RAPPOR
“Having the cake and eating it too…”
“Seeing the forest without seeing the trees…”

Microdata: An Individual’s Report

Microdata: An Individual’s Report
Each bit is flipped with
probability
25%

Google Chrome Privacy White Paper
https://www.google.com/chrome/browser/privacy/whitepaper.html
Phishing and malware protection
Google Chrome includes an optional feature called "Safe Browsing" to help protect you against phishing and malware attacks. This
helps prevent evil-doers from tricking you into sharing personal information with them (“phishing”) or installing malicious software
on your computer (“malware”). The approach used to accomplish this was designed specifically to protect your privacy and is also
used by other popular browsers.
If you'd rather not send any information to Safe Browsing, you can also turn these features off. Please be aware that Chrome will no
longer be able to protect you from websites that try to steal your information or install harmful software if you disable this feature.
We really don't recommend turning it off.
…
If a URL was indeed dangerous, Chrome reports this anonymously to Google to improve Safe Browsing. The data sent is randomized,
constructed in a manner that ensures differential privacy, permitting only monitoring of aggregate statistics that apply to tens of
thousands of users at minimum. The reports are an instance of Randomized Aggregatable Privacy-Preserving Ordinal Responses,
whose full technical details have been published in a technical report and presented at the 2014 ACM Computer and Communications
Security conference. This means that Google cannot infer which website you have visited from this.

Growing Pains
● Transitioning from a research prototype to a real product
● Scalability
● Versioning

Maintaining Candidates List
No missing candidates Three missing candidates
4%
13% 17%

RAPPOR Metrics in Chrome
https://chromium.googlesource.com/chromium/src/+log/master/tools/metrics/rappor/rappor.xml

Open Source Efforts
https://github.com/google/rappor
- demo you can run with a couple
of shell commands
- client library in several languages
- analysis tool and simulation
- documentation

Follow-up
- Bassily, Smith, “Local, Private, Efficient Protocols for Succinct
Histograms,” STOC 2015
- Kairouz, Bonawitz, Ramage, “Discrete Distribution Estimation under
Local Privacy”, https://arxiv.org/abs/1602.07387
- Qin et al., “Heavy Hitter Estimation over Set-Valued Data with Local
Differential Privacy”, CCS 2016

Key takeaway points
RAPPOR - locally differentially-private mechanism for reporting of
categorical and string data
● First Internet-scale deployment of differential privacy
● Explainable
● Conservative
● Open-sourced

Apple's On-Device Differential
Privacy
Abhradeep Thakurta, UC Santa Cruz

References
https://arxiv.org/abs/1709.02753

Phablet
Derp
Photobomb
Woot
Phablet
OMG
Woot
Troll
Prepone
Phablet
awwww
dp
Learning from private data
Learn new (and frequent) words typed

Learning from private data
Learn frequent emojis typed

Apple's On-Device Differential
Privacy: Discovering New Words

Roadmap
1. Private frequency estimation with count-min-sketch
2. Private heavy hitters with puzzle piece algorithm
3. Private heavy hitters with tree histogram protocol

Private frequency oracle
Building block for private heavy hitters
𝑑2𝑑1 𝑑 𝑛
All errors within
𝛾 = O( 𝑛 log|𝒮|)
frequency
Words (𝒮)
𝛾
"phablet"
frequency("phablet")

Private frequency oracle:
Design constraints
Computational and communication constraints:
Client side:
size of the domain (|S|) and n
Communication to server:
very few bits
Server-side cost for one query:

Design constraints
Client side:
# characters > 3,000
For 8-character words:
size of the domain |S|=3,000^8
number of clients ~ 1B
Efficiently [BS15] ~ n
Our goal ~ O(log |S|)

Design constraints
Client side:
O(log |S|)
O(1) bits
O(log |S|)

A starter solution: Randomized response
𝑑
0 1 0
𝑖
1 0 1
𝑖
Protects ε-differential privacy
(with the right bias)
Randomized response: d′

1 0 0
1 1 0
1 0 1
+ With bias
correction
frequency
All domain elements
Error in each estimate:
Θ( 𝑛 log|𝒮|)
Optimal error under privacy

Client side:
O(|S|)
O(|S|) bits
O(1)
1 0 1
𝑖

𝑑
0 01
0 01
0 01
Hash function: ℎ1
Hash function: ℎ2
Hash function: ℎ 𝑘
Number of hash bins: 𝑛
Computation= 𝑂(log|𝒮|)
𝑘 ≈ log|𝒮|
Non-private count-min sketch [CM05]

0 01
0 01
0 01
0 01
1 00
0 11
1
𝑘
1
+
245
127
9123
2132
𝑛
Reducing server computation

Reducing server computation
1
𝑘
1
Phablet
245
127
9123
2132
𝑛
9146
2212
Frequency estimate:
min (9146, 2212, 2132)
O( 𝑛log|𝒮|)
Server side query cost:
𝑂(log|𝒮|)
𝑘 ≈ log |𝒮|
"phablet"

Private count-min sketch
𝑑
Making client computation differentially private
0 01
0 01
0 01
1 01
1 00
0 00
𝑘𝜖-diff. private, since 𝑘 pieces of information

𝑑
Theorem: Sampling ensures 𝜖-differential privacy without hurting accuracy,
rather improves it by a factor of 𝑘
0 01 1 00

Reducing client communication
0 01 +1 +1-1
Hadamard transform

Reducing client communication
0 01 +1 +1-1
Hadamard transform
-1 +1
Communication: 𝑂(1) bit
Theorem: Hadamard transform and sampling
do not hurt accuracy

Client side:
O(log |S|)
O(1) bits
O(log |S|)
O( 𝑛log|𝒮|)

Private heavy hitters:
Using the frequency oracle
Domain 𝒮
Too many elements in 𝒮 to search.
Element s in S
Frequency(s)
Find all s in S with
frequency > γ

Puzzle piece algorithm
(works well in practice, no theoretical guarantees)
[Bassily Nissim Stemmer Thakurta, 2017 and Apple differential privacy team, 2017]

Private heavy hitters
Observation: If a word is frequent, its bigrams are frequent too.
Ph ab le t$ Frequency > 𝛾
Each bi-gram frequency > 𝛾

Natural algorithm: Cartesian product of frequent bi-grams
Sanitized
bi-grams, and the
complete word
ab
ad
ph
ba
ab
ax
le
ab
ab
Position P1 Position P2 Position P3
le
ab
t$
Position P4
Frequent bi-grams

ab
ad
ph
ba
ab
ax
le
ab
ab
le
ab
t$
Position P4
Frequent bi-grams Candidate words
P1 x P2 x P3 x P4
Find frequent
words

Candidate words
P1 x P2 x P3 x P4
Find frequent
words
Combinatorial explosion
In practice, all bi-grams are frequent

Puzzle piece algorithm
Ph ab le t$
≜
h=Hash(Phablet)
Hash: 𝒮 → 1, … , ℓ
Ph ab le t$h h h h
Privatized
bi-grams tagged
with the hash, and
the complete
word

Puzzle piece algorithm: Server side
ab 1
ad 5
Ph 3
ba 4
ab 3
ax 9
le 3
le 7
ab 1
le 1
ab 9
t$ 3
Position P4
Frequent bi-grams tagged with {1, … , ℓ}
Candidate words
P1 x P2 x P3 x P4
Find frequent
words
Combine only matching
bi-grams

Tree histogram algorithm
(works well in practice + optimal theoretical guarantees)
[Bassily Nissim Stemmer Thakurta, 2017]

Tree histograms (based on [CM05])
1 0 0
Any string in 𝒮:
log |𝒮| bits
Idea: Construct prefixes of the heavy hitter bit by bit

Tree histograms
0 1

Tree histograms
0 1
Level 1: Frequent prefix of length 1
Use private frequency oracle
If a string is a heavy hitter, its prefixes are too.

Tree histograms
00 01 10 11

Tree histograms
Level 2: Frequent prefix of length two
Idea: Each level has ≈ 𝑛 heavy hitters
00 01 10 11

Tree histograms
Client side:
O(log |S|)
O(1) bits
Server-side computation:
O(n log |S|)
Theorem: Finds all heavy hitters with frequency at least
𝑂( 𝑛 log|𝒮|)

Key takeaway points
• Keeping local differential privacy constant:
•One low-noise report is better than many noisy ones
•Weak signal with probability 1 is better than strong signal with small probability
• We can learn the dictionary – at a cost
• Longitudinal privacy remains a challenge

Microsoft: Discretization of continuous variables
"These guarantees are particularly strong when user’s behavior remains
approximately the same, varies slowly, or varies around a small number of
values over the course of data collection."

Microsoft's deployment
"Our mechanisms have been deployed by
Microsoft across millions of devices ... to protect
users’ privacy while collecting application usage
statistics."
B. Ding, J. Kulkarni, S. Yekhanin, NeurIPS 2017

Microsoft Research Blog, Dec 8, 2017

Privacy in AI @ LinkedIn
• Framework to compute robust, privacy-preserving analytics
• Privacy challenges/design for a large crowdsourced system (LinkedIn Salary)

Analytics & Reporting Products at LinkedIn
Profile View Analytics
140
Content Analytics
Ad Campaign Analytics
All showing
demographics of
members engaging with
the product

• Admit only a small # of predetermined query types
• Querying for the number of member actions, for a specified time period,
together with the top demographic breakdowns

• Admit only a small # of predetermined query types
• Querying for the number of member actions, for a specified time period,
together with the top demographic breakdowns
E.g., Clicks on a
given adE.g., Title = “Senior
Director”

Privacy Requirements
• Attacker cannot infer whether a member performed an action
• E.g., click on an article or an ad
• Attacker may use auxiliary knowledge
• E.g., knowledge of attributes associated with the target member (say,
obtained from this member’s LinkedIn profile)
• E.g., knowledge of all other members that performed similar action

Possible Privacy Attacks
144
Targeting:
Senior directors in US, who studied at Cornell
Matches ~16k LinkedIn members
→ over minimum targeting threshold
Demographic breakdown:
Company = X
May match exactly one person
→ can determine whether the person
clicks on the ad or not
Require minimum reporting threshold
Still amenable to attacks
(Refer our ACM CIKM’18 paper for details)
Rounding mechanism
E.g., report incremental of 10
Still amenable to attacks
E.g. using incremental counts over time to
infer individuals’ actions
Need rigorous techniques to preserve member privacy
(not reveal exact aggregate counts)

Key Product Desiderata
• Coverage & Utility
• Data Consistency
• for repeated queries
• over time
• between total and breakdowns
• across entity/action hierarchy
• for top k queries

Problem Statement
Compute robust, reliable analytics in a privacy-
preserving manner, while addressing the product
desiderata such as coverage, utility, and consistency.

Differential Privacy: Random Noise Addition
If ℓ1-sensitivity of f : D → ℝn:
maxD,D′ ||f(D) − f(D′)||1 = s,
then adding Laplacian noise to true output
f(D) + Laplacen(s/ε)
offers ε-differential privacy.
Dwork, McSherry, Nissim, Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, TCC 2006

PriPeARL: A Framework for Privacy-Preserving Analytics
K. Kenthapadi, T. T. L. Tran, ACM CIKM 2018
148
Pseudo-random noise generation, inspired by differential privacy
● Entity id (e.g., ad
creative/campaign/account)
● Demographic dimension
● Stat type (impressions, clicks)
● Time range
● Fixed secret seed
Uniformly Random
Fraction
● Cryptographic
hash
● Normalize to
(0,1)
Random
Noise
Laplace
Noise
● Fixed ε
True
Count
Noisy
Count
To satisfy consistency
requirements
● Pseudo-random noise → same query has same result over time, avoid
averaging attack.
● For non-canonical queries (e.g., time ranges, aggregate multiple entities)
○ Use the hierarchy and partition into canonical queries
○ Compute noise for each canonical queries and sum up the noisy
counts

Lessons Learned from Deployment (> 1 year)
• Semantic consistency vs. unbiased, unrounded noise
• Suppression of small counts
• Online computation and performance requirements
• Scaling across analytics applications
• Tools for ease of adoption (code/API library, hands-on how-to tutorial) help!

Summary
• Framework to compute robust, privacy-preserving analytics
• Addressing challenges such as preserving member privacy, product coverage,
utility, and data consistency
• Future
• Utility maximization problem given constraints on the ‘privacy loss budget’
per user
• E.g., noise with larger variance to impressions but less noise to clicks (or conversions)
• E.g., more noise to broader time range sub-queries and less noise to granular time range
sub-queries
• Reference: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving
Analytics and Reporting at LinkedIn, ACM CIKM 2018.

Acknowledgements
•Team:
• AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran
• Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian
Koeppe
• Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke
•Acknowledgements (in alphabetical order)
• Deepak Agarwal, Igor Perisic, Arun Swami

Outline
• LinkedIn Salary Overview
• Challenges: Privacy, Modeling
• System Design & Architecture
• Privacy vs. Modeling Tradeoffs

LinkedIn Salary (launched in Nov, 2016)

Salary Collection Flow via Email Targeting

Current Reach (February 2019)
• A few million responses out of several millions of members targeted
• Targeted via emails since early 2016
• Countries: US, CA, UK, DE, IN, …
• Insights available for a large fraction of US monthly active users

Data Privacy Challenges
• Minimize the risk of inferring any one individual’s compensation data
• Protection against data breach
• No single point of failure
Achieved by a combination of
techniques: encryption, access control,
, aggregation,
thresholding
K. Kenthapadi, A. Chudhary, and S.
Ambler, LinkedIn Salary: A System
for Secure Collection and
Presentation of Structured
Compensation Insights to Job
Seekers, IEEE PAC 2017
(arxiv.org/abs/1705.06976)

Modeling Challenges
• Evaluation
• Modeling on de-identified data
• Robustness and stability
• Outlier detection
X. Chen, Y. Liu, L. Zhang, and K.
Kenthapadi, How LinkedIn
Economic Graph Bonds
Information and Product:
Applications in LinkedIn Salary,
KDD 2018
K. Kenthapadi, S. Ambler,
L. Zhang, and D. Agarwal,
Bringing salary transparency to
the world: Computing robust
compensation insights via
LinkedIn Salary, CIKM 2017

Problem Statement
•How do we design LinkedIn Salary system taking into
account the unique privacy and security challenges,
while addressing the product requirements?

Differential Privacy? [Dwork et al, 2006]
• Rich privacy literature (Adam-Worthmann, Samarati-Sweeney, Agrawal-Srikant, …,
Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al)
• Limitation of anonymization techniques (as discussed in the first part)
• Worst case sensitivity of quantiles to any one user’s compensation data is
large
•  Large noise to be added, depriving reliability/usefulness
• Need compensation insights on a continual basis
• Theoretical work on applying differential privacy under continual observations
• No practical implementations / applications
• Local differential privacy / Randomized response based approaches (Google’s RAPPOR; Apple’s
iOS differential privacy; Microsoft’s telemetry collection) not applicable

Title Region
$$
User Exp
Designer
SF Bay
Area 100K
User Exp
Designer
SF Bay
Area 115K
... ...
...
Title Region
$$
User Exp
Designer
SF Bay
Area 100K
De-identification Example
Title Region Company Industry Years of
exp
Degree FoS Skills
$$
User Exp
Designer
SF Bay
Area
Google Internet 12 BS Interactive
Media
UX,
Graphics,
...
100K
Title Region Industry
$$
User Exp
Designer
SF Bay
Area
Internet
100K
Title Region Years of
exp $$
User Exp
Designer
SF Bay
Area
10+
100K
Title Region Company Years of
exp $$
User Exp
Designer
SF Bay
Area
Google 10+
100K
#data
points >
threshold?
Yes ⇒ Copy to
Hadoop (HDFS) Note: Original submission stored as encrypted objects.

Collection & Storage
• Allow members to submit their compensation info
• Extract member attributes
• E.g., canonical job title, company, region, by invoking LinkedIn standardization services
• Securely store member attributes & compensation data

De-identification & Grouping
• Approach inspired by k-Anonymity [Samarati-Sweeney]
• “Cohort” or “Slice”
• Defined by a combination of attributes
• E.g, “User experience designers in SF Bay Area”
• Contains aggregated compensation entries from corresponding individuals
• No user name, id or any attributes other than those that define the cohort
• A cohort available for offline processing only if it has at least k entries
• Apply LinkedIn standardization software (free-form attribute  canonical version)
before grouping
• Analogous to the generalization step in k-Anonymity

De-identification & Grouping
• Slicing service
• Access member attribute info &
submission identifiers (no
compensation data)
• Generate slices & track #
submissions for each slice
• Preparation service
• Fetch compensation data (using
submission identifiers), associate
with the slice data, copy to HDFS

Insights & Modeling
• Salary insight service
• Check whether the member is
eligible
• Give-to-get model
• If yes, show the insights
• Offline workflow
• Consume de-identified HDFS
dataset
• Compute robust compensation
insights
• Outlier detection
• Bayesian smoothing/inference
• Populate the insight key-value
stores

Security
Mechanisms
• Encryption of
member attributes
& compensation
data using different
sets of keys
• Separation of
processing
• Limiting access to
the keys

Security
Mechanisms
• Key rotation
• No single point of
failure
• Infra security

Preventing Timestamp Join based Attacks
• Inference attack by joining these on timestamp
• De-identified compensation data
• Page view logs (when a member accessed compensation collection web interface)
•  Not desirable to retain the exact timestamp
• Perturb by adding random delay (say, up to 48 hours)
• Modification based on k-Anonymity
• Generalization using a hierarchy of timestamps
• But, need to be incremental
•  Process entries within a cohort in batches of size k
• Generalize to a common timestamp
• Make additional data available only in such incremental batches

Privacy vs Modeling Tradeoffs
• LinkedIn Salary system deployed in production for ~2.5 years
• Study tradeoffs between privacy guarantees (‘k’) and data available for
computing insights
• Dataset: Compensation submission history from 1.5M LinkedIn members
• Amount of data available vs. minimum threshold, k
• Effect of processing entries in batches of size, k

Amount of
data
available vs.
threshold, k

Percent of
data available
vs. batch size,
k

Median delay
due to
batching vs.
batch size, k

Key takeaway points
• LinkedIn Salary: a new internet application, with
unique privacy/modeling challenges
• Privacy vs. Modeling Tradeoffs
• Potential directions
• Privacy-preserving machine learning models in a practical setting
[e.g., Chaudhuri et al, JMLR 2011; Papernot et al, ICLR 2017]
• Provably private submission of compensation entries?

Beyond Randomized Response
• LDP + Machine Learning:
• "Is interaction necessary for distributed private learning?"
Smith, Thakurta, Upadhyay, S&P 2017
• Federated Learning
• DP + Machine Learning
• Encode-Shuffle-Analyze architecture
"Prochlo: Strong Privacy for Analytics in the Crowd"
Bittau et al., SOSP 2017
• Amplification by Shuffling

LDP + Machine Learning
Interactivity as a major implementation constraint
...
parallel

LDP + Machine Learning
Interactivity as a major implementation constraint
...
sequential

"Is interaction necessary for distributed private learning?"
[Smith, Thakurta, Upadhyay, S&P2017]
• Single parameter learning (e.g., median):
• Maximal accuracy with full parallelism
• Multi-parameter learning:
• Polylog number of iterations
• Lower bounds

Federated Learning
"Practical secure aggregation for privacy-preserving machine learning"
Bonawitz, Ivanov, Kreuter, Marcedone, McMahan, Patel, Ramage, Segal,
Seth, ACM CCS 2017

"Generalization Implies Privacy" Fallacy
We don’t overfit, therefore
our model cannot possibly
violate privacy.

“Generalization Implies Privacy” Fallacy
Generalization
● average case
● model’s accuracy
Privacy
● worst case
● model’s parameters

“Generalization Implies Privacy” Fallacy
● Examples when it just ain’t so:
○ Person-to-person similarities
○ Support Vector Machines
● Models can be very large
○ Millions of parameters

ML + Differential Privacy
• [DP-SGD] Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang,
"Deep Learning with Differential Privacy", ACM CCS 2016
• [PATE] Papernot, Abadi, Erlingson, Goodfellow, Talwar, "Semi-supervised
Knowledge Transfer for Deep Learning from Private Training Data",
ICML 2017
• [PATE] Papernot, Song, Mironov, Raghunathan, Talwar, Erlingson,
"Scalable Private Learning with PATE", ICML 2018
https://github.com/tensorflow/privacy

Statistics + Differential Privacy
Harvard Privacy Tools Project:

Census 2020 and Differential Privacy

Key takeaway points
• Notion of differential privacy is a principled foundation for privacy-
preserving data analyses
• Local differential privacy is a powerful technique appropriate for
Internet-scale telemetry
• Other techniques (thresholding, shuffling) can be combined with
differentially private algorithms or be used in isolation.

References
Differential privacy:
review "A Firm Foundation For Private Data Analysis", C. ACM 2011
by Dwork
book "The Algorithmic Foundations of Differential Privacy"
by Dwork and Roth

References
Google's RAPPOR:
paper "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response", ACM
CCS 2014, Erlingsson, Pihur, Korolova
blog
Apple's implementation:
article "Learning with Privacy at Scale", Apple ML J., Dec 2017
paper "Practical Locally Private Heavy Hitters", NIPS 2017,
by Bassily, Nissim, Stemmer, Thakurta
paper "Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12" by
Tang, Korolova, Bai, Wang, Wang
LinkedIn’s privacy-preserving analytics framework
paper "PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn",
CIKM 2018, Kenthapadi, Tran
LinkedIn Salary:
paper "LinkedIn Salary: A System for Secure Collection and Presentation of Structured
Compensation Insights to Job Seekers", IEEE PAC 2017, Kenthapadi, Chudhary, Ambler
blog

Fairness Privacy
Related WSDM’19 sessions:
1.Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned
(Monday, 13:30 – 17:00)
2.H.V. Jagadish's invited talk: Responsible Data Science (Tuesday, 14:45 - 15:30)
3.Session 4: FATE & Privacy (Tuesday, 16:15 - 17:30)
4.Aleksandra Korolova’s invited talk: Privacy-Preserving WSDM (Wednesday, 14:45 - 15:30)

Thanks! Questions?
•Tutorial website:
https://sites.google.com/view/wsdm19-privacy-
tutorial
•Feedback most welcome 
• kkenthapadi@linkedin.com, mironov@google.com

PROCHLO:
Strong Privacy for Analytics in the Crowd
Bittau, Erlingsson, Maniatis, Mironov, Raghunathan,
Lie, Rudominer, Kode, Tinnes, Seefeld
SOSP 2017

The ESA Architecture and Its Prochlo realization
E
A
E
E
S
Σ
ESA: Encode, Shuffle, Analyze (ESA)
Prochlo: A hardened ESA realization using Intel's SGX + crypto

E
A
E
E
S
Σ
S S...
A
Σ
Σ
Σ
Local DP
Unlinkability
Randomized Thresholding Central DP

Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)

Similar to Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial) (20)

More from Krishnaram Kenthapadi

More from Krishnaram Kenthapadi (15)

Recently uploaded

Recently uploaded (20)

Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)