How do we protect the privacy of users when building large-scale AI-based systems? How do we develop machine learning models and systems that take fairness, accuracy, explainability, and transparency into account? Model fairness, explainability, and protection of user privacy are considered prerequisites for building trust in, and driving adoption of, AI systems in high-stakes domains. We will first motivate the need for adopting a “fairness, explainability, and privacy by design” approach when developing AI/ML models and systems for different consumer and enterprise applications, from the societal, regulatory, customer, end-user, and model-developer perspectives. We will then focus on the application of privacy-preserving AI techniques in practice through industry case studies. We will discuss the sociotechnical dimensions and practical challenges, and conclude with the key takeaways and open challenges.
2. What is Privacy?
• Right of/to privacy
• “Right to be let alone” [L. Brandeis & S. Warren, 1890]
• “No one shall be subjected to arbitrary interference with [their] privacy, family, home or correspondence, nor to attacks upon [their] honor and reputation.” [The United Nations Universal Declaration of Human Rights]
• “The right of a person to be free from intrusion into or publicity concerning matters of a personal nature” [Merriam-Webster]
• “The right not to have one's personal matters disclosed or publicized; the right to be left alone” [Nolo’s Plain-English Law Dictionary]
3. Data Privacy (or Information Privacy)
• “The right to have some control over how your personal information is collected and used” [IAPP]
• “Privacy has fast-emerged as perhaps the most significant consumer protection issue—if not citizen protection issue—in the global information economy” [IAPP]
4. Data Privacy vs. Security
• Data privacy: use & governance of personal data
• Data security: protecting data from malicious attacks & the exploitation of stolen data for profit
• Security is necessary, but not sufficient for addressing privacy.
5. Data Privacy: The Technical Problem
Given a dataset with sensitive personal information, how can we compute
and release functions of the dataset while protecting individual privacy?
Credit: Kobbi Nissim
6. William Weld vs. Latanya Sweeney
Massachusetts Group Insurance Commission (1997): released anonymized medical history of state employees.
Latanya Sweeney (MIT grad student): for $20, purchased the Cambridge voter roll and re-identified Governor Weld’s medical records by linking on his birth date (July 31, 1945) and his ZIP code (02138).
7. 64% of the US population is uniquely identifiable from ZIP code + birth date + gender.
Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, WPES 2006
8. A History of Privacy Failures …
Credit: Kobbi Nissim, Or Sheffet
9. Lessons Learned …
• Attacker’s advantages: auxiliary information; high dimensionality; needing to succeed on only a small fraction of inputs; being active; being observant …
• Unanticipated privacy failures from new attack methods
• Need for rigorous privacy notions & techniques
11. Algorithmic Bias
• Ethical challenges posed by AI systems
• Inherent biases present in society
• Reflected in training data
• AI/ML models prone to amplifying such biases
12. Laws against Discrimination
• Race: Civil Rights Act of 1964
• Sex: Equal Pay Act of 1963; Civil Rights Act of 1964
• Age: Age Discrimination in Employment Act of 1967
• Disability status: Rehabilitation Act of 1973; Americans with Disabilities Act of 1990
• Citizenship: Immigration Reform and Control Act
• And more...
14. Motivation & Business Opportunities
• Regulatory: We need to understand why the ML model made a given decision, and whether the decision it made was free from bias, both in training and at inference.
• Business: Providing explanations to internal teams (loan officers, customer service reps, forecasting teams) and to end users/customers.
• Data Science: Improving models, e.g., understanding whether a model is making inferences based on irrelevant data.
16. LinkedIn operates the largest professional network on the Internet
• 645M+ members
• 30M+ companies represented on LinkedIn
• 90K+ schools listed (high school & college)
• 35K+ skills listed
• 20M+ open jobs on LinkedIn Jobs
• 280B feed updates
20. Threat Models
• User Access Only: users store their own data; only noisy data or analytics are transmitted.
• Trusted Curator: data is stored by the organization and managed only by a trusted curator/admin; access is limited to noisy analytics or synthetic data.
• External Threat: data is stored by the organization, which has access to it; only privacy-enabled models are deployed.
22. Analytics & Reporting Products at LinkedIn
Profile View Analytics, Content Analytics, and Ad Campaign Analytics: all showing demographics of members engaging with the product.
24. Analytics & Reporting Products at LinkedIn
Admit only a small # of predetermined query types: querying for the number of member actions (e.g., clicks on a given ad), for a specified time period, together with the top demographic breakdowns (e.g., Title = “Senior Director”).
25. Privacy Requirements
Attacker cannot infer whether a member performed an action (e.g., clicked on an article or an ad).
Attacker may use auxiliary knowledge, e.g.:
• Knowledge of attributes associated with the target member (say, obtained from this member’s LinkedIn profile)
• Knowledge of all other members that performed the similar action (say, by creating fake accounts)
26. Possible Privacy Attacks
Targeting: senior directors in the US who studied at Cornell. Matches ~16K LinkedIn members → over the minimum targeting threshold.
Demographic breakdown: Company = X. May match exactly one person → the attacker can determine whether that person clicked on the ad or not.
Require a minimum reporting threshold? The attacker could create fake profiles! E.g., if the threshold is 10, create 9 fake profiles that all click.
Rounding mechanism? E.g., report in increments of 10. Still amenable to attacks, e.g., using incremental counts over time to infer individuals’ actions (see the sketch below).
Need rigorous techniques to preserve member privacy (not reveal exact aggregate counts).
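To see why naive defenses fail, here is a minimal sketch (with hypothetical numbers) of the averaging attack mentioned above: if fresh, independent noise were drawn on each repetition of a query, an attacker could simply repeat the query and average the responses, and the noise would cancel out.

```python
import numpy as np

rng = np.random.default_rng(0)
true_clicks = 17  # hypothetical exact count the attacker wants to learn

# Fresh, independent Laplace noise on every repetition of the same query:
# averaging the responses makes the noise cancel, and the estimate
# converges to the exact count.
responses = true_clicks + rng.laplace(scale=2.0, size=10_000)
print(round(float(responses.mean()), 2))  # ~17.0
```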
31. Differential Privacy
Databases D and D′ are neighbors if they differ in one person’s data.
Differential Privacy: the distribution of the curator’s output M(D) on database D is (nearly) the same as on D′.
Dwork, McSherry, Nissim, Smith [TCC 2006]
32. (ε, 𝛿)-Differential Privacy
The distribution of the curator’s output M(D) on database D is (nearly) the same as on D′:
∀S: Pr[M(D) ∊ S] ≤ exp(ε) ∙ Pr[M(D′) ∊ S] + 𝛿
Parameter ε quantifies information leakage; parameter 𝛿 gives some slack.
Dwork, McSherry, Nissim, Smith [TCC 2006]; Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
33. Differential Privacy: Random Noise Addition
If the ℓ1-sensitivity of f : D → ℝn is
s = maxD,D′ ||f(D) − f(D′)||1,
then adding Laplace noise to the true output,
f(D) + Laplacen(s/ε),
offers (ε, 0)-differential privacy.
Dwork, McSherry, Nissim, Smith [TCC 2006]
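A minimal sketch of this mechanism in Python (the function name and the example counts are illustrative, not from any particular library):

```python
import numpy as np

def laplace_mechanism(true_answer: np.ndarray, sensitivity: float,
                      epsilon: float) -> np.ndarray:
    """Add i.i.d. Laplace(sensitivity/epsilon) noise to each coordinate
    of the true answer, yielding (epsilon, 0)-differential privacy."""
    rng = np.random.default_rng()
    return true_answer + rng.laplace(scale=sensitivity / epsilon,
                                     size=np.shape(true_answer))

# A counting query changes by at most 1 when one person's data is added
# or removed, so its l1-sensitivity is 1.
print(laplace_mechanism(np.array([128.0, 42.0, 7.0]),
                        sensitivity=1.0, epsilon=0.5))
```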
34. PriPeARL: A Framework for Privacy-Preserving Analytics
K. Kenthapadi, T. T. L. Tran, ACM CIKM 2018
Pseudo-random noise generation, inspired by differential privacy (see the sketch below):
● Query attributes: entity id (e.g., ad creative/campaign/account), demographic dimension, stat type (impressions, clicks), time range, plus a fixed secret seed
● A cryptographic hash of these attributes, normalized to (0,1), gives a uniformly random fraction
● The fraction is transformed into Laplace noise (fixed ε)
● Noisy count = true count + random noise
To satisfy consistency requirements:
● Pseudo-random noise → the same query has the same result over time, avoiding averaging attacks
● For non-canonical queries (e.g., arbitrary time ranges, aggregates over multiple entities):
○ Use the hierarchy and partition into canonical queries
○ Compute the noise for each canonical query and sum up the noisy counts
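A minimal sketch of this pipeline, assuming SHA-256 as the cryptographic hash and an illustrative key layout (the attribute names, key format, and hash choice are my assumptions, not the deployed implementation): hash the canonical query attributes together with a fixed secret seed, normalize to a uniform fraction, and map it through the inverse Laplace CDF.

```python
import hashlib
import math

def pripearl_noise(entity_id: str, dimension: str, stat_type: str,
                   time_range: str, secret_seed: str, epsilon: float) -> float:
    """Deterministic Laplace noise for a canonical query: hashing the query
    attributes with a fixed secret seed means the same query always gets
    the same noise, so repeating a query cannot average the noise away."""
    key = "|".join([entity_id, dimension, stat_type, time_range, secret_seed])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Normalize the first 8 bytes of the hash to a uniform fraction in (0, 1).
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64
    # Inverse-CDF transform: uniform fraction -> Laplace(0, 1/epsilon).
    v = u - 0.5
    return -math.copysign(1.0, v) * math.log(1.0 - 2.0 * abs(v)) / epsilon

# Example with illustrative attribute values:
noisy_clicks = 1284 + pripearl_noise("ad_campaign_42", "title=Senior Director",
                                     "clicks", "2018-09-01/2018-09-07",
                                     "FIXED_SECRET_SEED", epsilon=1.0)
```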
36. Lessons Learned from Deployment (> 1 year)
Semantic consistency vs. unbiased, unrounded noise
Suppression of small counts
Online computation and performance requirements
Scaling across analytics applications
Tools for ease of adoption (code/API library, hands-on how-to tutorial) help!
Having a few entry points (all analytics apps built over Pinot) → wider adoption
37. Summary
Framework to compute robust, privacy-preserving analytics, addressing challenges such as preserving member privacy, product coverage, utility, and data consistency.
Future: utility maximization given constraints on the ‘privacy loss budget’ per user.
E.g., noise with larger variance for impressions but less noise for clicks (or conversions)
E.g., more noise for broader time-range sub-queries and less noise for granular time-range sub-queries
Reference: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018.
38. Acknowledgements
Team:
AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran
Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe
Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke
Acknowledgements: Deepak Agarwal, Igor Perisic, Arun Swami
41. Data Privacy Challenges
Minimize the risk of inferring any one individual’s compensation data
Protection against data breach
No single point of failure
42. Problem Statement
How do we design the LinkedIn Salary system taking into account the unique privacy and security challenges, while addressing the product requirements?
K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
43. De-identification Example
Original submission, with all attributes:
Title: User Exp Designer | Region: SF Bay Area | Company: Google | Industry: Internet | Years of exp: 12 | Degree: BS | FoS: Interactive Media | Skills: UX, Graphics, ... | $$: 100K
De-identified cohorts derived from such submissions (attribute combination → salary entries):
(Title, Region): User Exp Designer, SF Bay Area → 100K; 115K; ...
(Title, Region, Industry): User Exp Designer, SF Bay Area, Internet → 100K
(Title, Region, Years of exp): User Exp Designer, SF Bay Area, 10+ → 100K
(Title, Region, Company, Years of exp): User Exp Designer, SF Bay Area, Google, 10+ → 100K
A cohort is copied to Hadoop (HDFS) only if #data points > threshold (see the sketch below).
Note: Original submissions stored as encrypted objects.
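A minimal sketch of the cohort-thresholding step (the function name, field names, and threshold value are illustrative assumptions, not the production code):

```python
from collections import defaultdict

MIN_COHORT_SIZE = 20  # illustrative; the real threshold is a privacy/product decision

def release_cohorts(submissions, cohort_keys):
    """Group submissions into de-identified cohorts on the given attribute
    combination; release a cohort only if it has enough data points.
    (In the deployed system, qualifying cohorts are copied to HDFS and the
    raw submissions remain stored only as encrypted objects.)"""
    cohorts = defaultdict(list)
    for s in submissions:
        cohorts[tuple(s[k] for k in cohort_keys)].append(s["salary"])
    return {k: v for k, v in cohorts.items() if len(v) > MIN_COHORT_SIZE}

# E.g., the (Title, Region) cohorts:
# release_cohorts(submissions, ("title", "region"))
```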
45. Acknowledgements
Team:
AI/ML: Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul
Jain, Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
Application Engineering: Ahsan Chudhary, Alan Yang, Alex Navasardyan,
Brandyn Bennett, Hrishikesh S, Jim Tao, Juan Pablo Lomeli Diaz, Patrick
Schutz, Ricky Yan, Lu Zheng, Stephanie Chou, Joseph Florencio, Santosh
Kumar Kancha, Anthony Duerr
Product: Ryan Sandler, Keren Baruch
Other teams (UED, Marketing, BizOps, Analytics, Testing, Voice of
Members, Security, …): Julie Kuang, Phil Bunge, Prateek Janardhan, Fiona
Li, Bharath Shetty, Sunil Mahadeshwar, Cory Scott, Tushar Dalvi, and team
Acknowledgements
David Freeman, Ashish Gupta, David Hardtke, Rong Rong, Ram
46. Privacy Research @ Amazon - A Sampler
Work done by Oluwaseyi Feyisetan, Tom Diethe, Thomas Drake, Borja Balle
47. Simple but effective privacy-preserving mechanism
Task: subsample from a dataset, using additional information, in a privacy-preserving way.
Building on an existing exponential analysis of k-anonymity, amplified by sampling: mechanism M is (β, ε, δ)-differentially private.
Model uncertainty via a Bayesian neural network.
“Privacy-preserving Active Learning on Sensitive Data for User Intent Classification” [Feyisetan, Balle, Diethe, Drake; PAL 2019]
48. Differentially-private text redaction
Task: automatically redact sensitive text for privatizing various ML models.
Perturb sentences while maintaining meaning, e.g., “goalie wore a hockey helmet” → “keeper wear the nhl hat”
Apply metric DP and an analysis of word embeddings to scramble sentences: mechanism M is dχ-differentially private.
Establish plausible-deniability statistics (see the sketch below):
Nw := Pr[M(w) = w]
Sw := expected number of words output by M(w)
“Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations” [Feyisetan, Drake, Diethe, Balle; WSDM 2020]
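A minimal sketch of the general recipe for one word (the toy vocabulary and vectors below are invented for illustration; real systems use pretrained embeddings such as GloVe): sample noise with density proportional to exp(−ε·||z||), obtained as a uniform direction times a Gamma(n, 1/ε) magnitude, add it to the word’s vector, and decode to the nearest vocabulary word, so that semantically close words remain likely outputs.

```python
import numpy as np

def redact_word(word, emb, epsilon, rng=None):
    """Metric-DP style word replacement: perturb the word's embedding with
    noise whose density is proportional to exp(-epsilon * ||z||), then emit
    the vocabulary word nearest to the noisy vector."""
    rng = rng or np.random.default_rng()
    v = emb[word]
    n = v.shape[0]
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)               # uniform direction
    magnitude = rng.gamma(shape=n, scale=1.0 / epsilon)  # Gamma(n, 1/eps) radius
    noisy = v + magnitude * direction
    # Decode: nearest neighbor in the embedding vocabulary.
    return min(emb, key=lambda w: np.linalg.norm(emb[w] - noisy))

# Toy 2-d vocabulary for illustration only.
emb = {"goalie": np.array([1.0, 0.2]), "keeper": np.array([0.9, 0.3]),
       "helmet": np.array([-0.5, 1.0]), "hat": np.array([-0.4, 0.8])}
print(redact_word("goalie", emb, epsilon=2.0))
```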
49. Analysis of DP redaction
Show plausible deniability via the distributions of Nw & Sw as ε varies:
ε → 0: Nw decreases, Sw increases
ε → ∞: Nw increases, Sw decreases
[Figures: impact of ε on accuracy for multi-class classification and question-answering tasks.]
50. Improving data utility of DP text redaction
Task: redact text, but use additional structured information to better preserve utility.
Can we improve redaction for models that fail on extraneous words? (~recall-sensitive)
Extend dχ privacy to hyperbolic embeddings [Tifrea 2018]:
Hyperbolic embeddings utilize high-dimensional geometry to infuse embeddings with graph structure, e.g., uni- or bi-directional syllogisms from WebIsADb
New privacy analysis of the Poincaré model and sampling procedure
The mechanism takes advantage of density in the data to apply perturbations more precisely.
“Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text” [Feyisetan, Drake, Diethe; ICDM 2019]
[Figures: tiling of the Poincaré disk; hyperbolic GloVe embeddings projected into the B2 Poincaré disk.]
51. Analysis of Hyperbolic redaction
The new method improves both privacy and utility because of its ability to encode meaningful structure in the embeddings.
[Table: accuracy scores on classification tasks; * indicates results better than 1 baseline, ** better than 2 baselines.]
The plausible-deniability statistic Nw (Pr[M(w) = w]) is improved.
54. Fairness in ML
Application-specific challenges
Conversational AI systems: Unique bias/fairness/ethics considerations
E.g., Hate speech, Complex failure modes
Beyond protected categories, e.g., accent, dialect
Entire ecosystem (e.g., including apps such as Alexa skills)
Two-sided markets: e.g., fairness to buyers and to sellers, or to content
consumers and producers
Fairness in advertising (externalities)
Tools for ensuring fairness (measuring & mitigating bias) in the AI lifecycle (see the sketch after this list)
Pre-processing (representative datasets; modifying features/labels)
ML model training with fairness constraints
Post-processing
Experimentation & Post-deployment
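As a minimal illustration of the measuring and post-processing steps listed above (the function names and the demographic-parity criterion are my own choices for the sketch, not a specific toolkit’s API):

```python
import numpy as np

def demographic_parity_gap(scores, groups, threshold=0.5):
    """Measure bias as the gap in positive-prediction rates across groups."""
    rates = [(scores[groups == g] >= threshold).mean()
             for g in np.unique(groups)]
    return float(max(rates) - min(rates))

def per_group_thresholds(scores, groups, target_rate):
    """Post-processing mitigation: choose a threshold per group so that
    each group receives positive predictions at (roughly) the same rate."""
    return {g: float(np.quantile(scores[groups == g], 1.0 - target_rate))
            for g in np.unique(groups)}
```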
55. Explainability in ML
Actionable explanations
Balance between explanations & model secrecy
Robustness of explanations to failure modes (interaction between ML components)
Application-specific challenges
Conversational AI systems: contextual explanations
Gradation of explanations
Tools for explanations across AI lifecycle
Pre & post-deployment for ML models
Model developer vs. End user focused
56. Privacy in ML
Privacy for highly sensitive data: model training & analytics using secure enclaves, homomorphic encryption, federated learning / on-device learning, or a hybrid
Privacy-preserving model training, robust against adversarial membership inference attacks (dynamic settings + complex data/model pipelines)
Privacy-preserving mechanisms for data marketplaces
58. Acknowledgements
Amazon AWS AI team
Special thanks to Sergul Aydore, Satadal Bhattacharjee, William Brown, Sanjiv Das, Jason Gelman,
Kevin Haas, Tyler Hill, Michael Kearns, Jalaja Kurubarahalli, Andrea Olgiati, Luca Melis, Aaron Roth,
Sudipta Sengupta, Ankit Siva