Presented by: Melinda Thielbar, Fidelity Investments
Presented at All Things Open 2020
Abstract: Most AI researchers acknowledge that AI models can “learn” unwanted biases, but there is very little guidance on how to actually implement bias testing or what to do when bias is found. This talk will focus on how to choose the right bias tests for your model, and how to de-bias an AI model that fails those tests. We’ll compare and contrast three major open source software packages for AI bias testing and mitigation: AIF 360 from IBM, Google’s What-If Tool, and Audit AI from Pymetrics.
3. Today’s Agenda: Get You Started on Your AI Ethics Journey
• Bias Detection Overview and Terms
• Examples of Bias Detection with 3 Software Tools: AIF 360, Audit AI, and Google's What-If Tool
• Pros and Cons of Each Tool (not a full tutorial; we don't have enough time).
• Recommended Policies and Procedures for Implementing AI Ethically
4. AI Bias and Ethics: Software is a Tool
• Some tools are better than others for the job at hand.
• AIF 360: The only tool of the three with built-in bias mitigation.
• What-If: Interactive model understanding.
• Audit-AI: Tests whether the group classification rates are the same, using a
statistical test with p-values (or Bayes factor).
• All 3 tools are open source and have good support.
Recommendation:
• Audit-AI produces straightforward tests with minimum code. If the automated tests
are sufficient for your needs, it is the simplest to implement and understand.
• AIF 360 is the most comprehensive tool kit. It is the best for customization.
• What-If is best for understanding how the model works. It is best supported for
TensorFlow models: if you're using TensorFlow it's a great tool out of the box;
if not, it still works but requires some setup.
5. AI Ethics Bias ≠ Data Science Bias ≠ Fairness
• Fair: A human judgment about the morality of the process.
• Bias, when a data scientist says it: "These two groups do not have the same counts/means."
• Bias, when an AI ethics expert says it: "This model has systematic and repeatable errors that create unfair outcomes."
• Bias is an error.
• Biased models can cause us to miss opportunities.
• Biased models that are used as part of business practices can be a reputational
risk to your business.
6. Y = f(X, β) + bias
• Y: what you're trying to predict.
• f(X, β): the relationship between the features and what you're trying to predict.
• bias: error in assigning Y, e.g.
• human bias in hiring or lending,
• unconscious bias in labeling training data,
• systematic bias that limits opportunities for protected groups.
If the bias is large, and there are differences in features between the
dominant group and the protected group, the model learns how to tell who
is in the protected group instead of learning the relationships between
the features and the outcome.
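To make that mechanism concrete, here is a minimal, hypothetical simulation (synthetic data, scikit-learn; not from the talk's notebook). The true outcome depends only on a legitimate feature, but the assigned labels carry a systematic penalty against the protected group, so the fitted model puts weight on a group proxy it should ignore:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10_000

    group = rng.integers(0, 2, size=n)                # 1 = protected group
    x_legit = rng.normal(size=n)                      # genuinely predictive feature
    x_proxy = group + rng.normal(scale=0.5, size=n)   # proxy correlated with group

    # The true outcome depends only on the legitimate feature...
    y_true = (x_legit + rng.normal(scale=0.5, size=n) > 0).astype(int)
    # ...but the assigned label systematically denies ~30% of the protected group.
    y_assigned = np.where((group == 1) & (rng.random(n) < 0.3), 0, y_true)

    X = np.column_stack([x_legit, x_proxy])
    model = LogisticRegression().fit(X, y_assigned)
    print("coefficients [legit, proxy]:", model.coef_[0])
    # The proxy feature picks up a negative weight it would not have with unbiased labels.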
Where Does This Kind of Bias Come From?
7. Our Example: Healthcare Costs
An insurance company wants to find people with the highest predicted costs and offer
them services designed to improve their health.
This is simulated data based on a real example.
https://www.healthcarefinancenews.com/news/study-finds-racial-bias-optum-algorithm
Singh and Ramamurthy “Understanding racial bias in health using the Medical
Expenditure Panel Survey data”
But if we look at our protected group, we see that they have slightly worse
health measures, yet also lower costs.
8. What Does Bias Testing Accomplish?
Tests for Predictive Accuracy by Group
• Based on the true/false negative rate:
• False Negative Rate
• Based on the true/false positive rate:
• Statistical Parity
• Equal Opportunity
• Based on both:
• Average Odds
• Generalized Entropy
• Theil Index
Predictive accuracy is calculated from the target, which may itself be biased.
That means the tests above are affected (and possibly invalidated) when we believe we
are working with a biased target.
Bias Tests Not Affected by Biases in the Target
• Disparate Impact
Recommendation:
1) Always use Disparate Impact.
2) Decide which is the more negative outcome and choose your other bias
tests accordingly.
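For reference, disparate impact is just a ratio of favorable-outcome rates between groups, so it can be computed without any special tooling. A minimal pandas sketch (the dataframe and column names are placeholders, not from the talk's notebook):

    import pandas as pd

    def disparate_impact(df, group_col, pred_col, privileged, unprivileged):
        """Ratio of favorable-prediction rates: unprivileged rate / privileged rate."""
        rate_unpriv = df.loc[df[group_col] == unprivileged, pred_col].mean()
        rate_priv = df.loc[df[group_col] == privileged, pred_col].mean()
        return rate_unpriv / rate_priv

    # e.g. disparate_impact(scored_df, "race", "pred", privileged=1, unprivileged=0)
    # A ratio below roughly 0.8 is a common red flag (the "four-fifths rule").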
9. Pick Fairness Metrics that Guard Against
the Most Negative Outcome
                   True Label
Predicted Label    Positive          Negative
Positive           True Positive     False Positive
Negative           False Negative    True Negative
Examples:
• A model that predicts whether a prisoner will re-offend if offered parole
should check False Positive rate by protected traits.
• A model that assigns people to a health-improvement program based on
prediction of cost should review the False Negative rate.
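Either check is easy to compute per group from a labeled scoring set. A short pandas sketch, assuming hypothetical binary 0/1 columns for the true label and the prediction:

    import pandas as pd

    def error_rates_by_group(df, group_col, y_true_col, y_pred_col):
        """False positive and false negative rates per group (binary 0/1 columns)."""
        rows = {}
        for g, sub in df.groupby(group_col):
            negatives = (sub[y_true_col] == 0).sum()
            positives = (sub[y_true_col] == 1).sum()
            fp = ((sub[y_pred_col] == 1) & (sub[y_true_col] == 0)).sum()
            fn = ((sub[y_pred_col] == 0) & (sub[y_true_col] == 1)).sum()
            rows[g] = {"FPR": fp / negatives if negatives else float("nan"),
                       "FNR": fn / positives if positives else float("nan")}
        return pd.DataFrame(rows).T

    # Example: error_rates_by_group(scored_df, "race", "y_true", "y_pred")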
10. Bias Testing Demonstration
Audit-AI:
1. Import the packages.
2. Specify a Pandas dataframe with class membership and predictions.
AIF 360:
1. Import the packages.
2. Convert your Pandas dataframe to a structured dataset of the appropriate type.
3. Estimate a model.
4. Create a model explainer.
5. Run bias-detection methods using the model explainer.
6. Print out bias-detection results.
7. Write a function to loop through different cut-offs and store results in a dataframe.
8. Graph the results.
What-If:
1. Import the packages.
2. Change the Pandas dataframe to a set of lists.
3. Create a WitWidget with input and output columns.
4. Explore interactively.
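A rough sketch of the first two workflows (not verbatim from the talk's notebook; the tiny dataframe and column names below are placeholders, and exact signatures should be checked against each project's docs):

    import pandas as pd

    # Placeholder dataframe: group membership, binary predictions, raw scores.
    df = pd.DataFrame({"race":  [0, 0, 0, 1, 1, 1],
                       "pred":  [1, 1, 0, 1, 0, 0],
                       "score": [0.9, 0.7, 0.4, 0.8, 0.45, 0.3]})

    # --- Audit-AI: two steps, automated statistical tests ---
    from auditai.misc import bias_test_check
    bias_test_check(labels=df["race"], results=df["score"], category="race")

    # --- AIF 360: convert the dataframe to a structured dataset, then run metrics ---
    from aif360.datasets import BinaryLabelDataset
    from aif360.metrics import BinaryLabelDatasetMetric

    dataset = BinaryLabelDataset(df=df[["race", "pred"]],
                                 label_names=["pred"],
                                 protected_attribute_names=["race"])
    metric = BinaryLabelDatasetMetric(dataset,
                                      privileged_groups=[{"race": 1}],
                                      unprivileged_groups=[{"race": 0}])
    print("Disparate impact:", metric.disparate_impact())
    print("Statistical parity difference:", metric.statistical_parity_difference())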
11. What to Do If You Detect Bias?
Software solution: Bias Mitigation
Of the three tools, only AIF 360 has built-in bias mitigation (note that its
adversarial debiasing uses TensorFlow).
Demonstration is in the notebook.
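As one illustration, a pre-processing mitigation like AIF 360's Reweighing takes only a few lines; this is a sketch, where dataset_train stands for a BinaryLabelDataset like the one built earlier and the group definitions are placeholders:

    from aif360.algorithms.preprocessing import Reweighing

    # Reweight training examples so the label is independent of the protected
    # attribute, then fit the model using the resulting instance weights.
    rw = Reweighing(unprivileged_groups=[{"race": 0}],
                    privileged_groups=[{"race": 1}])
    dataset_transf = rw.fit_transform(dataset_train)
    # e.g. model.fit(dataset_transf.features, dataset_transf.labels.ravel(),
    #                sample_weight=dataset_transf.instance_weights)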
12. Real-World Examples of Bias
This image is instructive. It’s also interesting to read the lengthy debate it
sparked in the AI community. Salient points:
• One researcher was able to produce a better image just by re-training the
algorithm and starting the gradient descent in a different location.
• The team that constructed this algorithm was almost exclusively made up
of white men.
13. And Consider Our Toy Example
By simple inspection, the sickest people are not the most expensive.
And if that's the case, assigning people to a health improvement program to lower costs is
not going to have the expected business outcome.
This model fails twice: it's biased, and it doesn't lower costs!
This is a good reason to use the What-If tool: it gives you visibility into simple relationships.
14. AI Bias and Ethics: Software is a Tool
• Strengths of Each Tool
• AIF 360: Production Monitoring and Testing
• The only tool with both bias testing and built-in mitigation.
• Highly customizable.
• Easy to automate.
• What-If: Interactive model understanding.
• Intuitive and visual.
• Allows you to ask questions you wouldn’t have thought of asking.
• Audit-AI: Statistical bias testing
• Provides a p-value that answers the question “Is this difference due to
random chance?”
• Only tool that approaches bias tests in this way.
• Weaknesses
• AIF 360 is “fiddly”. Setup is onerous if you just need a quick test.
• What-If Tool is easiest and most fully featured with TensorBoard and TensorFlow
models. It also works on a sample; with a very large data set, a sample might tell
an incomplete story.
• Audit-AI: With very large data sets, most statistical tests will produce small p-values.
15. AI Bias and Ethics: Software is a Tool
• How to get better at this:
• Certified Ethical Emerging Technologist Professional Certificate from
Coursera (taught by Renee Cummings and 6 other experts in this field).
• Towards a Code of Ethics for Artificial Intelligence by Paula Boddington
• Montreal AI Ethics Institute: https://montrealethics.ai/
• Lighthouse3: https://lighthouse3.com/
• Case Studies
• COMPAS model, ProPublica: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
• Optum health model: https://www.healthcarefinancenews.com/news/study-finds-racial-bias-optum-algorithm
16. Read the Docs!
• AIF 360
• GitHub: https://github.com/Trusted-AI/AIF360
• My Favorite Tutorial: https://nbviewer.jupyter.org/github/IBM/AIF360/blob/master/examples/tutorial_medical_expenditure.ipynb
• Audit-AI
• GitHub: https://github.com/pymetrics/audit-ai
• Everything you need is in the GitHub repo. The implementation is that simple.
• What-If Tool
• GitHub: https://github.com/PAIR-code/what-if-tool
• Video Demos: https://pair-code.github.io/what-if-tool/
For example, the COMPAS model from this ProPublica article often categorized black defendants as higher recidivism risks, even when they were lower risk than white offenders. This meant black defendants were falsely denied parole more often than white defendants. Checking the False Positive rate would have been important for this model.
How to set up WIT:
https://github.com/PAIR-code/what-if-tool
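A minimal notebook sketch of that setup, following the list-of-lists pattern used in the WIT demos (df and my_predict_fn are placeholders; check the repo above for exact signatures):

    from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

    # Convert the dataframe to the list-of-lists format WIT expects,
    # passing the column names alongside.
    examples = df.values.tolist()
    config = WitConfigBuilder(examples, df.columns.tolist())
    # Optionally attach a model to get live predictions:
    # config.set_custom_predict_fn(my_predict_fn)
    WitWidget(config, height=800)  # renders the interactive tool in Jupyter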
On the left is a pixelated photo of President Barack Obama. On the right is a face generated from that image by PULSE, an AI algorithm that purports to create realistic faces from down-sampled images.
Many people argue that AI bias is exclusively about biased samples, but researchers point out that they were able to get much more inclusive results with the same data just by re-training the algorithm.
Source: https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias