
Page 1

Fair AI in Practice
http://aif360.mybluemix.net
https://github.com/ibm/aif360
https://pypi.org/project/aif360

Rachel K. E. Bellamy
Principal Research Staff Member
Manager, Human-AI Collaboration

IBM Research AI

CodeFreeze, Jan 16, 2019

Page 2

AI Fairness 360 Agenda

Fairness in AI—overview

Causes of unfairness in AI?

Fairness & the ML Pipeline

Metrics

Mitigation

The AI Fairness 360 Toolkit

Resources and interactive web demo

Simple code tutorial – credit scoring

How to participate and contribute

Summary and future challenges

Questions/Discussion

Page 3

Why is this a problem?
AI is now used in many high-stakes decision-making applications:

Credit Employment Admission Sentencing

Page 4

What does it take to trust a decision made by a machine?
(Other than that it is 99% accurate)

Is it fair? Is it easy to understand?

Did anyone tamper with it? Is it accountable?

Page 5

High Visibility Fairness Examples

§ Word Embeddings (2016)

§ Face Recognition (2018)

§ Predictive Policing (2017)

§ Job Recruiting (2018)

§ Sentiment Analysis (2017)

§ Photo Classification (2015)

§ Adaptive Chatbot (2016)

§ Recidivism Assessment Tool (2016)

§ Online Ads (2013)

Page 6

Sentiment Analysis (Motherboard, Oct 25, 2017)
“determines the degree to which sentences expressed a negative or positive sentiment, on a scale of -1 to 1”

“We dedicate a lot of efforts to making sure the NLP API avoids bias, but we don't always get it right. This is an example of one of those times, and we are sorry. We take this seriously and are working on improving our models. We will correct this specific case, and, more broadly, building more inclusive algorithms is crucial to bringing the benefits of machine learning to everyone.”

Google spokesperson

https://motherboard.vice.com/en_us/article/j5jmj8/google-artificial-intelligence-bias

Google’s Sentiment analyzer thinks being gay is bad. This is the latest example of how bias creeps into artificial intelligence.

Page 7

Photo Classification Software (CBS News, July 1, 2015)
“ability to recognize the content of photos and group them by category”

“We're appalled and genuinely sorry that this happened. We are taking immediate action to prevent this type of result from appearing. There is still clearly a lot of work to do with automatic image labeling, and we're looking at how we can prevent these types of mistakes from happening in the future.”

Google spokesperson

https://www.cbsnews.com/news/google-photos-labeled-pics-of-african-americans-as-gorillas/

“Jacky Alciné, a Brooklyn computer programmer of Haitian descent, tweeted a screenshot of Google's new Photos app showing that it had grouped pictures of him and a black female friend under the heading ‘Gorillas.’”

Google apologizes for mis-tagging photos of African Americans

Page 8

Job Recruiting (Reuters, Oct 2018)
“The team had been building computer programs since 2014 to review job applicants’ resumes with the aim of mechanizing the search for top talent”

https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

“Amazon's system taught itself that male candidates were preferable. It penalized resumes that included the word ‘women's,’ as in ‘women's chess club captain.’ And it downgraded graduates of two all-women's colleges …”

“The Seattle company ultimately disbanded the team by the start of last year because executives lost hope for the project”

Page 9

Facial Recognition (MIT News, Feb 2018)
“general-purpose facial-analysis systems, which could be used to match faces in different photos as well as to assess characteristics such as gender, age, and mood.”

“error rate of 0.8 percent for light-skinned men, 34.7 percent for dark-skinned women”

https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212

Study finds gender and skin tone bias in commercial artificial-intelligence systems. Examination of facial-analysis software shows an error rate of 0.8 percent for light-skinned men, 34.7 percent for dark-skinned women.

Page 10

Recidivism Assessment (ProPublica, May 2016)
“used to inform decisions about who can be set free at every stage of the criminal justice system”

“The formula was particularly likely to falsely flag black defendants as future criminals, … at almost twice the rate as white defendants.”

“White defendants were mislabeled as low risk more often than black defendants.”

“Northpointe does not agree that the results of your analysis, or the claims being made based upon that analysis, are correct or that they accurately reflect the outcomes from the application of the model.”

https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Page 11

Online Advertisements (MIT Technology Review, Feb 2013)

“Google searches involving black-sounding names are more likely to serve up ads suggestive of a criminal record than white-sounding names, says computer scientist”

“AdWords does not conduct any racial profiling. We also have an “anti” and violence policy which states that we will not allow ads that advocate against an organisation, person or group of people. It is up to individual advertisers to decide which keywords they want to choose to trigger their ads.”

Google Spokesperson

Racism is poisoning online ad delivery, says Harvard professor.

Page 12

Bias in Word Embeddings (July 2016)

w2vNEWS embedding, 300-dim word2vec
• proven to be immensely useful
• high quality, publicly available, and easy to incorporate into any application

Ex) Paris is to France as Tokyo is to ??

https://arxiv.org/abs/1607.06520
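The analogy probe on this slide can be reproduced directly against the pretrained vectors. A minimal sketch using gensim (not from the slides), assuming the GoogleNews-vectors-negative300.bin file has already been downloaded:

# Probe analogies in the pretrained word2vec news embedding (illustrative sketch).
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "Paris is to France as Tokyo is to ??"
print(w2v.most_similar(positive=["France", "Tokyo"], negative=["Paris"], topn=1))

# The same vector arithmetic is what Bolukbasi et al. (2016) use to surface
# gender-stereotyped analogies in the embedding.
print(w2v.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))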

Page 13

Predictive Policing (New Scientist, Oct 2017)
“hope is that such systems will bring down crime rates while simultaneously reducing human bias in policing.”

“Their study suggests that the software merely sparks a “feedback loop” that leads to officers being repeatedly sent to certain neighbourhoods – typically ones with a high number of racial minorities – regardless of the true crime rate in that area.”

https://www.newscientist.com/article/mg23631464-300-biased-policing-is-made-worse-by-errors-in-pre-crime-algorithms/

“the software ends up overestimating the crime rate … without taking into account the possibility that more crime is observed there simply because more officers have been sent there – like a computerised version of confirmation bias.”

Biased policing is made worse by errors in pre-crime algorithms

Page 14

Summary of Examples

§ Intended use of services can provide great value
  • Increased productivity
  • Overcome human biases

§ Biases were often found by accident
  • But there will be people trying to “break” the system to show limitations or for financial gain

§ The stakes are high
  • Injustice
  • Significant public embarrassments

(Hardt, 2017)

Algorithmic fairness is one of the hottest topics in the ML/AI research community

Page 15

AI Fairness 360 Agenda

Fairness in AI—overview

Causes of unfairness in AI?

Fairness & the ML Pipeline

Metrics

Mitigation

The AI Fairness 360 Toolkit

Resources and interactive web demo

Simple code tutorial – credit scoring

How to participate and contribute

Summary and future challenges

Questions/Discussion

Page 16

What is unwanted bias?
Group vs. individual fairness

Machine learning is always a form of statistical discrimination

Discrimination becomes objectionable when it places certain privileged groups at systematic advantage and certain unprivileged groups at systematic disadvantage

Illegal in certain contexts

Page 17

Where does it come from?

Unwanted bias in training data yields models with unwanted bias that scale out

Discrimination in labelling

Undersampling or oversampling

Page 18

AI Fairness 360 Agenda

Fairness in AI—overview

Causes of unfairness in AI?

Fairness & the ML Pipeline

Metrics

Mitigation

The AI Fairness 360 Toolkit

Resources and interactive web demo

Simple code tutorial – credit scoring

How to participate and contribute

Summary and future challenges

Questions/Discussion

Page 19

Fairness in building and deploying models

(d’Alessandro et al., 2017)

Page 20

Metrics, Algorithms

(Pipeline figure: dataset metric → pre-processing algorithm → in-processing algorithm → post-processing algorithm → classifier metric)
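Each box in the pipeline above corresponds to a family of classes in the aif360 package. A hedged sketch of representative imports (the specific algorithms named are examples, not an exhaustive list):

# Metrics can be computed on a dataset or on a classifier's predictions.
from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric

# Pre-processing: transform the training data before a model is fit.
from aif360.algorithms.preprocessing import Reweighing

# In-processing: build fairness constraints into model training itself.
from aif360.algorithms.inprocessing import PrejudiceRemover

# Post-processing: adjust the predictions of an already-trained model.
from aif360.algorithms.postprocessing import CalibratedEqOddsPostprocessing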

Page 21

Metrics, Algorithms, and Explainers

(Pipeline figure: dataset metric + explainer → pre-processing algorithm → in-processing algorithm → post-processing algorithm → classifier metric + explainer)
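A small runnable sketch of how a dataset metric and its explainer pair up (the toy DataFrame and the 'sex' attribute here are invented purely for illustration):

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.explainers import MetricTextExplainer

# Tiny made-up dataset: 'sex' is the protected attribute, 'label' the favorable outcome.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "score": [10, 8, 7, 9, 6, 5, 7, 4],
    "label": [1, 1, 0, 1, 1, 0, 0, 0],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"sex": 0}],
                                  privileged_groups=[{"sex": 1}])
explainer = MetricTextExplainer(metric)

print(metric.disparate_impact())       # the raw number
print(explainer.disparate_impact())    # the same metric wrapped in a plain-language explanation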

Page 22

21 (or more) definitions of fairness
and the need for a toolbox with guidance

There is no one definition of fairness applicable in all contexts

Some definitions even conflict

Requires a comprehensive set of fairness metrics and bias mitigation algorithms

Also requires some guidance to industry practitioners

21 Definitions of Fairness, FAT* 2018 Tutorial, Arvind Narayanan
https://www.youtube.com/watch?v=jIXIuYdnyyk

Page 23

Bias mitigation is not easy
Cannot simply drop protected attributes, because other features are correlated with them
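A toy illustration (entirely synthetic, not from the slides): even when the protected attribute is excluded from the features, correlated proxies can let a model reconstruct it almost perfectly.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
protected = rng.integers(0, 2, size=n)                # protected attribute, never shown to the model
zip_code = protected * 0.8 + rng.random(n) * 0.2      # hypothetical proxy feature, strongly correlated
income = rng.normal(50 + 10 * protected, 5, size=n)   # another weakly informative feature

X = np.column_stack([zip_code, income])               # protected attribute deliberately dropped
clf = LogisticRegression().fit(X, protected)          # ...yet it can be predicted from the proxies
print("protected attribute recovered with accuracy:", clf.score(X, protected))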

Page 24

AI Fairness 360 Differentiation

Datasets

Toolbox

Fairness metrics (30+)

Fairness metric explanations

Bias mitigation algorithms (9+)

Guidance

Industry-specific tutorials

Comprehensive bias mitigation toolbox (including unique algorithms from IBM Research)

Several metrics and algorithms that have no available implementations elsewhere

Extensible

Designed to translate new research from the lab to industry practitioners

Page 25

http://aif360.mybluemix.net
https://github.com/ibm/aif360
https://pypi.org/project/aif360

Page 26

AI Fairness 360
aif360.mybluemix.net

Designed to translate new research from the lab to industry practitioners: tutorials, education, glossary, resources.

Page 27

Designed to translate new research from the lab to industry practitioners: Guidance, glossary, education and industry-specific tutorials.

Comprehensive bias mitigation toolbox: 70+ fairness metrics (including metric explanations) and 9+ bias mitigation algorithms (including unique algorithms from IBM Research).

Page 28

Designed for extensibility, following scikit-learn's fit/predict paradigm.

https://github.com/IBM/AIF360/blob/master/examples/demo_optim_data_preproc.ipynb
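The linked notebook follows this pattern end to end. A condensed sketch in the spirit of the credit-scoring tutorial, assuming the German credit data files have already been placed where aif360 expects them; the protected-attribute encoding below follows that tutorial:

from aif360.datasets import GermanDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# German credit data with 'age' as the protected attribute (age >= 25 treated as privileged).
dataset = GermanDataset(protected_attribute_names=["age"],
                        privileged_classes=[lambda x: x >= 25],
                        features_to_drop=["personal_status", "sex"])

privileged = [{"age": 1}]
unprivileged = [{"age": 0}]

# Dataset metric before mitigation.
metric_orig = BinaryLabelDatasetMetric(dataset,
                                       unprivileged_groups=unprivileged,
                                       privileged_groups=privileged)
print("mean difference before reweighing:", metric_orig.mean_difference())

# Pre-processing mitigation: reweigh instances so favorable outcomes are balanced across groups.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
dataset_transf = rw.fit_transform(dataset)

metric_transf = BinaryLabelDatasetMetric(dataset_transf,
                                         unprivileged_groups=unprivileged,
                                         privileged_groups=privileged)
print("mean difference after reweighing:", metric_transf.mean_difference())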

Pages 29–42

Page 43

Class structure (flattened from the UML diagram; inheritance = “is a”, aggregation = “has a”):

Datasets
• Dataset: +metadata: dict; +copy(), +export(), +split()
• StructuredDataset (is a Dataset): +features, +labels, +scores, +protected_attributes, +instance_weights: np.ndarray; +feature_names, +label_names, +protected_attribute_names, +instance_names: list(str); +privileged_protected_attributes, +unprivileged_protected_attributes: list(np.ndarray); +ignore_fields: set(str); +align_datasets(), +convert_to_dataframe()
• BinaryLabelDataset (is a StructuredDataset): +favorable_label, +unfavorable_label: np.float64
• StandardDataset (is a BinaryLabelDataset)

Metrics
• Metric: -memoize()
• DatasetMetric (is a Metric; has a dataset): +privileged_groups, +unprivileged_groups: list(dict); +difference(), +ratio(), +num_instances()
• BinaryLabelDatasetMetric (is a DatasetMetric): +num_positives(), +num_negatives(), +base_rate(), +disparate_impact(), +statistical_parity_difference(), +consistency()
• SampleDistortionMetric (is a DatasetMetric; has a dataset and a distorted_dataset): +total(), +average(), +maximum(), +euclidean_distance(), +manhattan_distance(), +mahalanobis_distance()
• ClassificationMetric (has a dataset and a classified_dataset): +binary_confusion_matrix(), +generalized_binary_confusion_matrix(), +performance_metrics(), +average_(abs_)odds_difference(), +selection_rate(), +disparate_impact(), +statistical_parity_difference(), +generalized_entropy_index(), +between_group_generalized_entropy_index()

Algorithms
• Transformer: +fit(), +predict(), +transform(), +fit_predict(), +fit_transform(), -addmetadata()
• GenericPreProcessing (is a Transformer): +params…; +fit(), +transform()
• GenericInProcessing (is a Transformer): +params…; +fit(), +predict()
• GenericPostProcessing (is a Transformer): +params…; +fit(), +predict()

Explainers
• Explainer
• MetricTextExplainer (is an Explainer; has a metric): +(all metric functions)
• MetricJSONExplainer (is an Explainer; has a metric): +(all metric functions)
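Because every mitigation algorithm is a Transformer exposing the fit/predict/transform interface above, extending the toolkit means subclassing it. A hypothetical minimal sketch (the DropProtectedAttributes class and its logic are invented for illustration; only the Transformer base class and method names come from the toolkit):

from aif360.algorithms import Transformer

class DropProtectedAttributes(Transformer):
    """Toy pre-processing step that zeroes out protected-attribute columns.

    Not a real mitigation strategy (the earlier slide notes that correlated
    proxies remain), but it shows the shape of an aif360 Transformer."""

    def fit(self, dataset):
        # Nothing to learn for this toy example.
        return self

    def transform(self, dataset):
        new = dataset.copy(deepcopy=True)
        for name in new.protected_attribute_names:
            if name in new.feature_names:   # protected attributes usually also appear as features
                idx = new.feature_names.index(name)
                new.features[:, idx] = 0.0
        return new

An instance would then plug into the same fit_transform flow shown in the earlier credit-scoring sketch.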

Page 44

Page 45

AI Fairness 360 Agenda

Fairness in AI—overview

Causes of unfairness in AI?

Fairness & the ML Pipeline

Metrics

Mitigation

The AI Fairness 360 Toolkit

Resources and interactive web demo

Simple code tutorial – credit scoring

How to participate and contribute

Summary and future challenges

Questions/Discussion

Page 46

Checking and Mitigating Bias throughout the AI Lifecycle

Page 47

Group fairness metrics: situation 1

Page 48

Group fairness metrics: situation 2

Page 49

Group fairness metrics: statistical parity difference
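In the convention used by AIF360 (favorable outcome \hat{Y} = 1, unprivileged group D = u, privileged group D = p), statistical parity difference compares selection rates:

SPD = \Pr(\hat{Y} = 1 \mid D = u) - \Pr(\hat{Y} = 1 \mid D = p)

A value of 0 means both groups receive the favorable outcome at the same rate; negative values indicate the privileged group is favored.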

Page 50

Group fairness metrics: disparate impact
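Disparate impact is the ratio, rather than the difference, of the same two selection rates:

DI = \dfrac{\Pr(\hat{Y} = 1 \mid D = u)}{\Pr(\hat{Y} = 1 \mid D = p)}

A value of 1 indicates parity; the commonly cited four-fifths rule flags values below 0.8.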

Page 51

Group fairness metrics: equal opportunity difference
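Equal opportunity difference compares true positive rates instead of raw selection rates:

EOD = \mathrm{TPR}_{D = u} - \mathrm{TPR}_{D = p}, \quad \text{where } \mathrm{TPR} = \Pr(\hat{Y} = 1 \mid Y = 1)

A value of 0 means individuals with the favorable true outcome are equally likely to be selected in both groups.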

Page 52

Group fairness metrics: average odds difference
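Average odds difference averages the gaps in false positive rate and true positive rate between the two groups:

AOD = \tfrac{1}{2}\left[(\mathrm{FPR}_{D=u} - \mathrm{FPR}_{D=p}) + (\mathrm{TPR}_{D=u} - \mathrm{TPR}_{D=p})\right]

Again, 0 indicates that the classifier's error rates are balanced across groups.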

Page 53

Some remaining challenges…

§ Support for collecting and curating fair datasets prior to model building
§ Approaches for situations where only high-level demographics are available (e.g. neighborhood, school-level)
§ Support for fairness drift detection
§ More detailed guidance, e.g. what are potential protected variables, when to use a particular metric
§ …

For more discussion see: Holstein, K., Vaughan, J. W., Daumé III, H., Dudík, M., & Wallach, H. (2018). Improving fairness in machine learning systems: What do industry practitioners need? arXiv preprint arXiv:1812.05239.

Page 54

05/03/18 Facebook says it has a tool to detect bias in its artificial intelligence Quartz

05/25/18 Microsoft is creating an oracle for catching biased AI algorithms MIT Technology Review

05/31/18 Pymetrics open-sources Audit AI, an algorithm bias detection tool VentureBeat

06/07/18 Google Education Guide to Responsible AI Practices – Fairness Google

06/09/18 Accenture wants to beat unfair AI with a professional toolkit TechCrunch

Page 55

Fairness Measures: Framework to test a given algorithm on a variety of datasets and fairness metrics. https://github.com/megantosh/fairness_measures_code

Fairness Comparison: Extensible test-bed to facilitate direct comparisons of algorithms with respect to fairness measures. Includes raw & preprocessed datasets. https://github.com/algofairness/fairness-comparison

Themis-ML: Python library built on scikit-learn that implements fairness-aware machine learning algorithms. https://github.com/cosmicBboy/themis-ml

FairML: Looks at significance of model inputs to quantify prediction dependence on inputs. https://github.com/adebayoj/fairml

Aequitas: Web audit tool as well as Python library. Generates a bias report for a given model and dataset. https://github.com/dssg/aequitas

Fairtest: Tests for associations between algorithm outputs and protected populations. https://github.com/columbia/fairtest

Themis: Takes a black-box decision-making procedure and designs test cases automatically to explore where the procedure might be exhibiting group-based or causal discrimination. https://github.com/LASER-UMASS/Themis

Audit-AI: Python library built on top of scikit-learn with various statistical tests for classification and regression tasks. https://github.com/pymetrics/audit-ai

Page 56

AI Fairness 360 Agenda

Fairness in AI—overview

Causes of unfairness in AI?

Fairness & the ML Pipeline

Metrics

Mitigation

The AI Fairness 360 Toolkit

Resources and interactive web demo

Simple code tutorial – credit scoring

How to participate and contribute

Summary and future challenges

Questions/Discussion