Guidelines for Human-AI Interaction

https://aka.ms/aiguidelines


Page 1

https://aka.ms/aiguidelines

Page 2

INITIALLY
1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.

DURING INTERACTION
3. Time services based on context.
4. Show contextually relevant information.
5. Match relevant social norms.
6. Mitigate social biases.

WHEN WRONG
7. Support efficient invocation.
8. Support efficient dismissal.
9. Support efficient correction.
10. Scope services when in doubt.
11. Make clear why the system did what it did.

OVER TIME
12. Remember recent interactions.
13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.
16. Convey the consequences of user actions.
17. Provide global controls.
18. Notify users about changes.

https://aka.ms/aiguidelines

Page 3

Agenda

Page 4

Agenda

Page 5

Page 6

AI is fundamentally changing how we interact with computing systems

Page 7

MS Word

MS OneNote

MS PowerPoint

Page 8

Page 9

Page 10

Disclaimers

Page 11

(Overview of all 18 guidelines, repeated; see Page 2.)

https://aka.ms/aiguidelines

Page 12

Page 13

(Overview of all 18 guidelines, repeated; see Page 2.)

https://aka.ms/aiguidelines

Page 14

1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.

*Dan Russell. The Joy of Search.

Page 15

Page 16

(Overview of all 18 guidelines, repeated; see Page 2.)

https://aka.ms/aiguidelines

Page 17

3. Time services based on context.
4. Show contextually relevant information.
5. Match relevant social norms.
6. Mitigate social biases.

Remember to call your mom

Page 18

(Guidelines 3–6, repeated.)

Page 19

(Guidelines 3–6, repeated.)

Page 20

(Guidelines 3–6, repeated.)

“Information is not subject to biases, unless users are biased against fastest routes”

“There’s no way to set an avg walking speed. [The product] assumes users to be healthy”

Page 21

(Overview of all 18 guidelines, repeated; see Page 2.)

https://aka.ms/aiguidelines

Page 22

7. Support efficient invocation.
8. Support efficient dismissal.
9. Support efficient correction.
10. Scope services when in doubt.
11. Make clear why the system did what it did.

Page 23

(Guidelines 7–11, repeated.)

Page 24

(Guidelines 7–11, repeated.)

Page 25

(Overview of all 18 guidelines, repeated; see Page 2.)

https://aka.ms/aiguidelines

Page 26

Cupcakes are muffins that believed in miracles

Page 27

12. Remember recent interactions.
13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.
16. Convey the consequences of user actions.
17. Provide global controls.
18. Notify users about changes.

Page 28

Agenda

Page 29

Agenda

Page 30

Findings & Impact

Page 31

Opportunity Analysis

Engagements with Practitioners

Findings & Impact

Page 32

Academia: CHI 2019 Best Paper Honorable Mention

Practitioners: Being leveraged by product teams across the company throughout the design and development process

Industry: Cited and used in related organizations; translated to other languages

Page 33

Initial Impact

Engagements with Practitioners

Findings & Impact

Page 34

Page 35

Developing the Guidelines for Human-AI Interaction

Page 36

Page 37

Confirms that the guidelines are applicable to a broad range of products, features and scenarios.

Page 38

Applies / Violates

Page 39

[Chart: counts of guideline applications and violations, G1–G18.]

Page 40

12. Remember recent interactions.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 41

12. Remember recent interactions.

1. Make clear what the system can do.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 42

12. Remember recent interactions.

1. Make clear what the system can do.

4. Show contextually relevant information.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 43

17. Provide global controls.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 44

17. Provide global controls.

11. Make clear why the system did what it did.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 45

17. Provide global controls.

11. Make clear why the system did what it did.

2. Make clear how well the system can do what it can do.

[Chart: counts of guideline applications and violations, G1–G18 (repeated).]

Page 46

Types of content: examples, patterns, research, code

Tagged by guideline and scenario, with faceted search and filtering

Comments and ratings to support learning

Grow with examples and case studies submitted by practitioners

Page 47

Initial Impact

Opportunity Analysis

Engagements with Practitioners

Findings & Impact

Page 48

Workshops and Courses

Page 49

Q & A Break

Page 50

Agenda

Page 51

Agenda

Page 52

How can I implement the HAI Guidelines?

Interaction Design

AI System Engineering

Page 53

Interaction Design for AI requires ML & Eng Support

3. Time services based on context: hard to implement if the logging infrastructure is oblivious to context.

10. Scope services when in doubt: does the ML algorithm know or state that it is “in doubt”?

11. Make clear why the system did what it did: is the ML algorithm explainable?
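A minimal sketch of what Guideline 10 asks of the model side, assuming a scikit-learn-style classifier: abstain and hand off to a scoped-down experience whenever the top predicted probability falls below a threshold. The sentinel, threshold value, and names are illustrative assumptions, not recommendations from the talk.

```python
ABSTAIN = "abstain"  # sentinel meaning "the system is in doubt"

def predict_or_abstain(model, X, threshold=0.75):
    """Return the predicted label when confident, otherwise ABSTAIN.

    `threshold` is an illustrative hand-off point, not a recommended value.
    """
    proba = model.predict_proba(X)             # shape (n_samples, n_classes)
    top_p = proba.max(axis=1)                  # confidence of the top class
    labels = model.classes_[proba.argmax(axis=1)]
    return [lab if p >= threshold else ABSTAIN for lab, p in zip(labels, top_p)]

# Hypothetical usage: abstentions get a scoped-down experience
# (ask the user, show alternatives) instead of an automatic action.
# decisions = predict_or_abstain(fitted_classifier, X_new)
```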

Page 54

1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.

Page 55

(Guidelines 1–2, repeated.)

Page 56

(Guidelines 1–2, repeated.)

[Buolamwini, J. & Gebru, T. 2018]

90% accuracy

Page 57

(Guidelines 1–2, repeated.)

[Figure: failure explanation models with Pandora. Benchmark data plus descriptive features (content features such as object counts, e.g. cats, people; internal component features) feed the failure explanation models. Nushi et al., HCOMP 2018.]
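A rough sketch of the failure-explanation idea (not the actual Pandora implementation): fit a small, interpretable model, here a shallow decision tree, on human-readable descriptive features to predict where the AI system fails, then read off the conditions that describe error-prone slices. The feature names and data below are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical descriptive features for each benchmark example
# (e.g., counts of detected objects, simple content attributes).
feature_names = ["count_people", "count_cats", "is_outdoor"]
rng = np.random.default_rng(0)
X_desc = rng.integers(0, 4, size=(500, len(feature_names)))

# 1 = the AI system failed on this example, 0 = it succeeded.
# Failures are synthetic here; in practice they come from evaluation logs.
failed = ((X_desc[:, 0] >= 2) & (X_desc[:, 2] == 0)).astype(int)

# A small tree keeps the explanation human-readable: it communicates
# *where* the system does poorly, not just an average accuracy number.
explainer = DecisionTreeClassifier(max_depth=2, random_state=0)
explainer.fit(X_desc, failed)
print(export_text(explainer, feature_names=feature_names))
```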

Page 58

All (5.5%)

Page 59

Women (11.5%) · All (5.5%)

Page 60

No makeup (23.7%) · Women (11.5%) · All (5.5%)

Page 61

Short hair (30.8%) · Women (11.5%) · No makeup (23.7%)

Page 62

Not smiling (35.7%) · Women (11.5%) · No makeup (23.7%) · Short hair (30.8%) · All (5.5%)
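Slice-level numbers like these come from a disaggregated evaluation. A sketch, with made-up column names and data, of computing failure rates overall and per slice:

```python
import pandas as pd

# Hypothetical evaluation log: one row per example, with slice attributes
# and whether the system's output was wrong on that example.
df = pd.DataFrame({
    "gender":   ["woman", "woman", "man", "man", "woman", "man"],
    "smiling":  [False,   True,    True,  False, False,   True],
    "is_error": [True,    False,   False, False, True,    False],
})

overall = df["is_error"].mean()
by_gender = df.groupby("gender")["is_error"].mean()
not_smiling = df.loc[~df["smiling"], "is_error"].mean()

print(f"All: {overall:.1%}")               # overall failure rate
print(by_gender.map("{:.1%}".format))      # failure rate per gender slice
print(f"Not smiling: {not_smiling:.1%}")   # a single named slice
```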

Page 63

(Guidelines 1–2, repeated.)

[Nushi et al. HCOMP 2018]
[Wu et al. ACL 2019]
[Zhang et al. IEEE TVCG 2018]

Page 64

(Guidelines 1–2, repeated.)

Page 65

2. Make clear how well the system can do what it can do.

https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html

[Platt et al., 1999; Zadrozny & Elkan, 2001]
[Gal & Ghahramani, 2016; Osband et al., 2016]
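On the engineering side, one way to support Guideline 2 is to check, and if needed calibrate, the model's confidence scores so a reported 0.8 really corresponds to being right about 80% of the time. A minimal sketch along the lines of the scikit-learn calibration example linked above; the dataset and base model here are arbitrary:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An uncalibrated model often over- or under-states its confidence.
raw = GaussianNB().fit(X_train, y_train)

# Post-hoc calibration; "sigmoid" is Platt scaling, "isotonic" is the alternative.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    prob_pos = model.predict_proba(X_test)[:, 1]
    frac_true, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
    # Well calibrated: the fraction of positives in each bin tracks the
    # mean predicted probability of that bin.
    print(name, list(zip(mean_pred.round(2), frac_true.round(2))))
```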

Page 66

https://forecast.weather.gov/

Page 67

2. Make clear how well the system can do what it can do.

“Probably a yellow school bus driving down a street”
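On the UX side, calibrated confidence can be surfaced as hedged language, as in the caption above, rather than as a raw number. A tiny sketch; the wording tiers and cut-offs are illustrative assumptions:

```python
def hedge(caption: str, confidence: float) -> str:
    """Prefix a generated caption with language that matches model confidence."""
    if confidence >= 0.9:
        return caption                                   # confident: state it plainly
    if confidence >= 0.6:
        return f"Probably {caption[0].lower()}{caption[1:]}"
    return f"This might be {caption[0].lower()}{caption[1:]}"

print(hedge("A yellow school bus driving down a street", 0.72))
# -> Probably a yellow school bus driving down a street
```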

Page 68

3. Time services based on context.
7. Support efficient invocation.
8. Support efficient dismissal.

AI System: automated triggering vs. explicit invocation

Page 69

Page 70

(Guidelines 3, 7, 8 and automated triggering vs. explicit invocation, repeated.)

Page 71

3. Time services based on context.
7. Support efficient invocation.
8. Support efficient dismissal.

[Plot: precision vs. recall trade-off.]
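Deciding when an automated trigger should fire, versus waiting for explicit invocation, amounts to picking an operating point on this precision/recall trade-off. A sketch with synthetic data that chooses the lowest score threshold still meeting a precision floor, so proactive interruptions are rarely wrong and cheap dismissal covers the rest; the threshold policy itself is an assumption:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical trigger scores with ground truth for
# "should the assistant have offered help here?"
rng = np.random.default_rng(0)
should_trigger = rng.integers(0, 2, size=1000)
scores = np.clip(should_trigger * 0.3 + rng.normal(0.5, 0.2, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(should_trigger, scores)

# Fire automatically only where precision stays high, so proactive
# interruptions are rarely wrong; efficient dismissal covers the rest.
min_precision = 0.9
ok = precision[:-1] >= min_precision   # precision has len(thresholds) + 1 entries
chosen = thresholds[ok][0] if ok.any() else None
print("automated-trigger threshold:", chosen)
```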

Page 72

13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.

Page 73

(Guidelines 13–15, repeated.)

Page 74

(Guidelines 13–15, repeated.)

Page 75

15. Encourage granular feedback.
17. Provide global controls.

Page 76

(Guidelines 15 and 17, repeated.)

Page 77

(Guidelines 15 and 17, repeated.)

Page 78

Q & A

Page 79

Agenda

Page 80

Agenda

Page 81

11. Make clear why the system did what it did.
12. Remember recent interactions.
13. Learn from user behavior.

Intelligible, Transparent, Explainable AI

Page 82

Predictable ~ (human) simulate-able: predict exactly what it will do.

Intelligible ~ transparent: answer counterfactuals, i.e. predict how a change to the model’s inputs will change its output.

Explainable ~ interpretable: construct a rationalization for why (maybe) it did what it did.

Inscrutable ⊇ blackbox: inscrutable means too complex to understand; blackbox means nothing is known about it.

Caveat: my take; there is no consensus on these terms.
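The "answer counterfactuals" notion of intelligibility can be made concrete with a probe that changes one input and reports how the prediction moves. A sketch with a hypothetical loan-style feature set; the feature names, model, and data are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented loan-style features and labels, purely for illustration.
feature_names = ["income", "num_late_payments", "account_age_years"]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def counterfactual_delta(model, x, feature_idx, new_value):
    """How does the predicted probability change if one input feature changes?"""
    x_cf = x.copy()
    x_cf[feature_idx] = new_value
    before = model.predict_proba(x.reshape(1, -1))[0, 1]
    after = model.predict_proba(x_cf.reshape(1, -1))[0, 1]
    return before, after

before, after = counterfactual_delta(model, X[0], feature_idx=1, new_value=X[0, 1] + 2.0)
print(f"P(positive class): {before:.2f} -> {after:.2f} when num_late_payments rises by 2")
```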

Page 83

Reasons for Wanting Intelligibility

1. The AI May be Optimizing the Wrong Thing

2. Missing a Crucial Feature

3. Distributional Drift

4. Facilitating User Control in Mixed Human/AI Teams

5. User Acceptance

6. Learning for Human Insight

7. Legal Requirements

[Weld & Bansal CACM 2019]

Page 84

Intelligibility Useful in Both Cases

Deploy Autonomous AI

Human-AI Team

[Background: excerpts from a draft paper proposing a negative log-utility loss (NELU) that optimizes for human-AI team utility, rather than AI accuracy alone, in AI-advised decision making with a confidence-based handover region.]
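The core idea in that draft, that team utility rather than raw accuracy is what matters once a human can inspect and override the AI, can be illustrated with a toy utility calculation. In the sketch below, predictions above a handover threshold are accepted as-is and predictions below it are inspected by the human; all costs, the threshold, and the assumption that inspection yields a correct decision are illustrative, not the paper's actual formulation:

```python
import numpy as np

def expected_team_utility(confidence, ai_correct, handover=0.8,
                          reward=1.0, inspect_cost=0.3, error_cost=1.0):
    """Toy utility of AI-advised decisions under a confidence-based handover rule.

    Above `handover` the human accepts the AI's answer; below it the human
    inspects (paying `inspect_cost`) and is assumed to then decide correctly.
    All numbers are illustrative assumptions.
    """
    confidence = np.asarray(confidence, dtype=float)
    ai_correct = np.asarray(ai_correct, dtype=bool)
    accept = confidence >= handover
    utility = np.where(
        accept,
        np.where(ai_correct, reward, -error_cost),  # accepted: right or wrong
        reward - inspect_cost,                      # inspected: correct but costly
    )
    return utility.mean()

# Two hypothetical classifiers with identical accuracy (3 of 4 correct),
# differing only in where the single error lands relative to the handover region.
conf_a, correct_a = [0.95, 0.95, 0.60, 0.60], [True, False, True, True]
conf_b, correct_b = [0.95, 0.60, 0.60, 0.60], [True, False, True, True]
print(expected_team_utility(conf_a, correct_a))  # confident error is accepted: lower utility
print(expected_team_utility(conf_b, correct_b))  # error falls in handover region: higher utility
```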

Page 85

(Reasons for Wanting Intelligibility, repeated.)

Page 86

(Reasons for Wanting Intelligibility, repeated.)

Page 87

Page 88

But Humans Err as Well

[Figure: example misclassifications (“Panda!”, “Firetruck?!?”), contrasting AI errors with human errors.]

Page 89

[Diagram: human errors and AI errors overlap; joint errors, AI-specific errors, and preventable errors.]

Page 90

[Diagram: human errors, AI errors, joint errors, AI-specific errors, preventable errors (repeated).]

Intelligible AI → Better Teamwork

Page 91: aiguidelines · 2020. 2. 8. · Opportunity Analysis Engagements with Practitioners Findings & Impact. Academia Practitioners Industry CHI 2019 Best Paper Honorable Mention Being

A Simple Human-AI Team

[Paper thumbnail: G. Bansal, B. Nushi, E. Kamar, W. S. Lasecki, D. S. Weld, and E. Horvitz, "Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance," 2019. In AI-advised decision making, the human's mental model of the AI's error boundary (knowing when the AI errs) determines when to accept or override its recommendations.]

Readmission-prediction classifier (ML model) → Human decision maker

When can I trust it?

How can I adjust it?
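
As a purely illustrative aside (not from the paper), the toy simulation below makes the point that team accuracy hinges on when the human trusts the AI. All numbers, the confidence model, and the trust threshold are invented for illustration.

```python
import random

def simulate_team(n=10_000, ai_acc=0.90, human_acc=0.80, trust_threshold=0.75, seed=0):
    """Toy AI-advised decision making: the human accepts the AI's recommendation
    when its confidence clears trust_threshold, and otherwise decides alone."""
    rng = random.Random(seed)
    ai_right = human_right = team_right = 0
    for _ in range(n):
        ai_ok = rng.random() < ai_acc
        human_ok = rng.random() < human_acc
        # Crude confidence model: the AI tends to be more confident when it is right.
        confidence = rng.uniform(0.6, 1.0) if ai_ok else rng.uniform(0.3, 0.9)
        decision_ok = ai_ok if confidence >= trust_threshold else human_ok
        ai_right += ai_ok
        human_right += human_ok
        team_right += decision_ok
    return ai_right / n, human_right / n, team_right / n

print(simulate_team())  # (AI accuracy, human accuracy, team accuracy)
```

In this toy model the team only beats the AI alone when the trust threshold lines up with the AI's actual error boundary, which is the mental-model question raised above.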


Small decision tree over semantically meaningful primitives


When can I trust it? How can I adjust it?


Linear model over semantically meaningful primitives


When can I trust it? How can I adjust it?


[Screenshot of a paper page. Its Figure 4 reproduces part of Figure 1 from Caruana et al.: 3 (of 56 total) components of a GA2M model trained to predict a patient's risk of dying from pneumonia. Two line graphs show the contribution (log odds) of individual features to risk, (a) patient's age and (b) the boolean variable asthma, and a heat map (c) shows the contribution of the pairwise interaction between age and cancer rate.]

Part of Fig 1 from R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," KDD 2015.

1 (of 56) components of learned GA2M: risk of pneumonia death
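
For reference, a GA2M prediction is just an intercept plus a sum of per-feature shape functions and a few pairwise terms, score = β0 + Σ_j f_j(x_j). The sketch below uses hand-written shape functions (invented numbers, not the learned Caruana et al. model) to show why each component can be inspected in isolation.

```python
import math

# Invented shape functions; a real GA2M learns these (splines, small trees) from data.
def f_age(age):                      # contribution (log odds) of patient age
    return 0.04 * (age - 50)

def f_asthma(has_asthma):            # contribution of the boolean asthma flag
    return -0.8 if has_asthma else 0.0

def f_age_cancer(age, has_cancer):   # one pairwise (GA2M) interaction term
    return 0.02 * (age - 50) if has_cancer else 0.0

BETA_0 = -2.0                        # intercept

def risk_log_odds(age, has_asthma, has_cancer):
    """GA2M-style score: intercept plus a sum of per-feature (and selected
    pairwise) contributions, each of which can be plotted and inspected alone."""
    return BETA_0 + f_age(age) + f_asthma(has_asthma) + f_age_cancer(age, has_cancer)

def risk_probability(age, has_asthma, has_cancer):
    return 1.0 / (1.0 + math.exp(-risk_log_odds(age, has_asthma, has_cancer)))

print(round(risk_probability(72, has_asthma=True, has_cancer=False), 3))
```

With linear shape functions this reduces to an ordinary linear model; with splines or small trees it stays intelligible while capturing non-linear effects.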



2 (of 56) components of learned GA2M: risk of pneumonia death



When can I trust it? How can I adjust it?

3 (of 56) components of learned GA2M: risk of pneumonia death


Deep cascade of CNNs

Variational networks

Transfer learning

GANs

Kidney MRI, from [Lundervold & Lundervold 2018] https://www.sciencedirect.com/science/article/pii/S0939388918301181

Input: Pixels

Features are not semantically meaningful


Intelligible?

Yes → Use directly

No → Map to a simpler model and interact with the simpler model (explanations, controls)


Inscrutable Model

Too complex → simplify by currying (instance-specific explanation) or by approximating

Features not semantically meaningful → map to a new vocabulary

Usually have to do both of these!


Usually have to do all of these!

Inscrutable Model → Simpler Explanatory Model



LIME - Local Approximations

To explain the prediction for a point p:

1. Sample points around p

2. Use the complex model to predict labels for each sample

3. Weigh samples according to their distance from p

4. Learn a new simple model on the weighted samples (possibly using different features)

5. Use the simple model as the explanation (see the sketch below)

Slide adapted from Marco Ribeiro – see "Why Should I Trust You?: Explaining the Predictions of Any Classifier," M. Ribeiro, S. Singh, C. Guestrin, SIGKDD 2016
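
A minimal sketch of those five steps for a tabular black box, using a hypothetical black_box_predict function; this illustrates the recipe and is not Ribeiro et al.'s released LIME implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(black_box_predict, p, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """LIME-style local approximation around a single point p.

    black_box_predict maps an (n, d) array to a score/probability per row.
    Returns the coefficients of a weighted linear model fit near p."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)

    samples = p + rng.normal(scale=scale, size=(n_samples, p.size))   # 1. sample around p
    labels = black_box_predict(samples)                               # 2. label with complex model
    dists = np.linalg.norm(samples - p, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)               # 3. weigh by proximity to p
    simple = Ridge(alpha=1.0)
    simple.fit(samples, labels, sample_weight=weights)                # 4. fit simple model
    return simple.coef_                                               # 5. coefficients = explanation

# Example with a made-up black box:
coefs = local_surrogate(lambda X: 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1]))), p=[0.3, -0.1])
print(coefs)
```

On images the same recipe applies, but the simple model is fit over superpixel on/off features rather than raw pixels, as on the following slide.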


To create features for the explanatory classifier, compute 'superpixels' using an off-the-shelf image segmenter

Hope that the features/values are semantically meaningful

To sample points around p, set some superpixels to grey

The explanation is the set of superpixels with high coefficients…
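
A rough sketch of the image-specific sampling step, assuming scikit-image's slic as the off-the-shelf segmenter (an assumption; any segmenter would do) and a neutral fill colour for switched-off superpixels.

```python
import numpy as np
from skimage.segmentation import slic

def perturb_superpixels(image, n_samples=100, n_segments=50, seed=0):
    """Create LIME-style perturbed copies of an (H, W, 3) image by greying out
    random subsets of superpixels. Returns (binary masks z, perturbed images)."""
    rng = np.random.default_rng(seed)
    segments = slic(image, n_segments=n_segments)        # superpixel id per pixel
    ids = np.unique(segments)
    fill = np.full_like(image, image.mean())             # "absent" superpixels get the mean colour

    zs = rng.integers(0, 2, size=(n_samples, ids.size))  # which superpixels stay switched on
    perturbed = []
    for z in zs:
        keep = np.isin(segments, ids[z == 1])            # (H, W) mask of kept pixels
        perturbed.append(np.where(keep[..., None], image, fill))
    return zs, np.stack(perturbed)
```

Each perturbed image is then scored by the complex model, samples are weighted by how many superpixels were kept, and a sparse linear model over the binary masks picks out the superpixels shown in the explanation.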


[Tradeoff spectrum: Understandable / Over-Simplification at one end; Accurate / Inscrutable at the other]

Any model simplification is a Lie


[Diagram: inscrutable model (1) → ? → simpler explanatory model (2)]

Need Desiderata


Fewer conjuncts (regardless of probability)

Explanations consistent with listener’s prior beliefs

Presenting an explanation made people believe P was true

If explanation ~ previous, effect was strengthened


… Even when the explainer is wrong…

when not to trust


When can I trust it?

How can I adjust it?


Except…

In these papers, Accuracy(Humans) << Accuracy(AI)

So… the rational decision is to omit the humans (not explain)


[Screenshot of a draft paper on team-aware optimization: a classifier that advises a human should concentrate its accuracy on examples outside the human's "handover" region, the confidence band in which the human will inspect, and possibly override, the recommendation anyway (overriding is modeled as costlier than accepting). The draft proposes a negative log-utility loss (NELU) that optimizes for team utility rather than AI accuracy alone, and reports that on multiple real data sets and model classes it yields higher team utility than standard log-loss.]
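
The gist can be rendered as a re-weighted log-loss. This is only a toy sketch of the idea described above, with invented thresholds and weights, not the draft's actual NELU formulation.

```python
import numpy as np

def handover_weighted_log_loss(p_true, handover_low=0.5, handover_high=0.8, inside_weight=0.2):
    """Toy team-aware loss for a binary classifier.

    p_true: model probability assigned to the true label of each example.
    Predictions whose confidence falls inside [handover_low, handover_high) sit in
    the handover region, where the human will inspect (and maybe override) them
    anyway, so their log-loss is down-weighted; confident predictions that the
    human will simply accept keep full weight."""
    p_true = np.clip(np.asarray(p_true, dtype=float), 1e-12, 1.0)
    confidence = np.maximum(p_true, 1.0 - p_true)      # confidence in the predicted label
    inside = (confidence >= handover_low) & (confidence < handover_high)
    weights = np.where(inside, inside_weight, 1.0)
    return float(np.mean(-weights * np.log(p_true)))

print(handover_weighted_log_loss([0.95, 0.60, 0.70, 0.99]))
```

The point of the sketch is only that the objective couples training to how the human will use the prediction; the draft derives the actual utility-based loss.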


Assistance Architecture

0) Solo Human (No AI)

1) AI Recommends

2) AI also gives its confidence

3) AI also explains (LIME-like)

4) AI gives human explanations


Explanations are Convincing

AI Correct

AI Incorrect


Explanations are Convincing

Better Explanations are More Convincing

AI Correct

AI Incorrect


That Other Question…

Beyond Accuracy: TheRole of Mental Models in Human-AI Team Performance

Gagan Bansal1 Besmira Nushi2 EceKamar 2 Walter S. Lasecki3

Daniel S. Weld1,4 Eric Horvitz2

1University of Washington 2Microsoft Research 3University of Michigan 4Allen Institute for Artificial Intelligence

Abstract

Decisions made by human-AI teams (e.g., AI-advised hu-mans) are increasingly common in high-stakes domains suchas healthcare, criminal justice, and finance. Achieving highteam performance depends on more than just the accuracy ofthe AI system: Since the human and the AI may have differ-ent expertise, the highest team performance is often reachedwhen they both know how and when to complement one an-other. We focus on a factor that is crucial to supporting suchcomplementary: thehuman’smental model of theAI capabil-ities, specifically the AI system’s error boundary (i.e. know-ing “When does the AI err?”). Awareness of this lets the hu-man decidewhen to accept or override theAI’srecommenda-tion. Wehighlight two key properties of anAI’serror bound-ary, parsimony and stochasticity, and a property of the task,dimensionality. We show experimentally how these proper-ties affect humans’ mental models of AI capabilities and theresulting team performance. We connect our evaluations torelated work and propose goals, beyond accuracy, that meritconsideration during model selection and optimization to im-proveoverall human-AI team performance.

1 IntroductionWhile many AI applications address automation, numer-ous others aim to team with people to improve joint per-formance or accomplish tasks that neither the AI nor peo-ple can solve alone (Gillies et al. 2016; Kamar 2016;Chakraborti and Kambhampati 2018; Lundberg et al. 2018;Lundgard et al. 2018). In fact, many real-world, high-stakesapplications deploy AI inferences to help human expertsmake better decisions, e.g., with respect to medical diag-noses, recidivism prediction, and credit assessment. Rec-ommendations from AI systems—even if imperfect—canresult in human-AI teams that perform better than eitherthe human or the AI system alone (Wang et al. 2016;Jaderberg et al. 2019).

Successfully creating human-AI teamsputsadditional de-mands on AI capabilities beyond task-level accuracy (Grosz1996). Specifically, even though team performance dependson individual human and AI performance, the team perfor-mance will not exceed individual performance levels if hu-

Copyright c 2019, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

Figure 1: AI-advised human decision making for read-mission prediction: The doctor makes final decisions us-ing the classifier’s recommendations. Check marks denotecases where the AI system renders a correct prediction, andcrosses denote instances where the AI inference is erro-neous. Thesolid line represents theAI error boundary, whilethedashed lineshowsapotential human mental model of theerror boundary.

man and the AI cannot account for each other’s strengthsand weaknesses. In fact, recent research shows that AI ac-curacy does not always translate to end-to-end team perfor-mance(Yin, Vaughan, and Wallach 2019; Lai and Tan 2018).Despite this, and even for applications where people are in-volved, most AI techniques continue to optimize solely foraccuracy of inferences and ignore team performance.

Whilemany factors influence team performance, westudyone critical factor in this work: humans’ mental model ofthe AI system that they are working with. In settings wherethe human is tasked with deciding when and how to makeuse of the AI system’s recommendation, extracting benefitsfrom the collaboration requires the human to build insights(i.e., a mental model) about multiple aspects of the capabil-ities of AI systems. A fundamental attribute is recognitionof whether the AI system will succeed or fail for a partic-ular input or set of inputs. For situations where the humanuses theAI’soutput to makedecisions, this mental model oftheAI’serror boundary—which describes theregionswhereit is correct versus incorrect—enables the human to predictwhen the AI will err and decide when to override the auto-mated inference.

We focus here on AI-advised human decision making, a simple but widespread form of human-AI team, for example, …


Page 116

Page 117

• Deep neural paper embeddings

• Explain with linear bigrams

Beta: s2-sanity.apps.allenai.org

Page 118

[Figure: scatter plot with "+" and "o" instances, a highlighted point x', and the labels "Tuning", "Agents", "Turing".]

If all one cared about was the explanatory model, one could change these parameters… but not even the features are shared with the neural model!

Page 119

[Figure: the same scatter plot, now with additional "+" instances generated around x'.]

Instead… we generate new training instances by varying the feedback feature, weighted by distance to x'…

Page 120

[Figure: the same scatter plot with the generated "+" instances around x'.]

Instead… we generate new training instances by varying the feedback feature, weighted by distance to x'… and retrain.
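A rough sketch of that perturb-and-retrain idea, under assumptions of mine (a generic scikit-learn classifier, a single numeric feedback feature, a label implied by the feedback, and Gaussian distance weighting; the names and the weighting scheme are illustrative, not the exact method on the slide):

import numpy as np
from sklearn.linear_model import LogisticRegression

def incorporate_feedback(model, X, y, x_prime, feedback_feature, new_value,
                         feedback_label=1, n_copies=50, bandwidth=2.0):
    # Generate synthetic training instances around x_prime by varying the
    # feedback feature, weight all training data by proximity to x_prime,
    # and retrain the model on the augmented, re-weighted set.
    rng = np.random.default_rng(0)
    copies = np.tile(x_prime, (n_copies, 1)).astype(float)
    copies[:, feedback_feature] = new_value + rng.normal(0.0, 0.1, size=n_copies)
    X_aug = np.vstack([X, copies])
    y_aug = np.concatenate([y, np.full(n_copies, feedback_label)])
    dists = np.linalg.norm(X_aug - x_prime, axis=1)
    weights = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))  # closer to x' = more influence
    model.fit(X_aug, y_aug, sample_weight=weights)
    return model

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
model = incorporate_feedback(LogisticRegression(), X, y,
                             x_prime=X[0], feedback_feature=2, new_value=1.5)

As the slides note, injecting feedback as re-weighted data rather than as parameter edits avoids requiring the explanatory model and the neural model to share features.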

Page 121

Good News:

Less Good News: no significant improvement in feed quality (team performance) as measured by click-through.

Page 122

Summary

Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance

Gagan Bansal (1), Besmira Nushi (2), Ece Kamar (2), Walter S. Lasecki (3), Daniel S. Weld (1,4), Eric Horvitz (2)

(1) University of Washington, (2) Microsoft Research, (3) University of Michigan, (4) Allen Institute for Artificial Intelligence


[Figure: a readmission-prediction ML model (classifier) advises a human decision maker, who asks "When can I trust it?" and "How can I adjust it?"]

Page 123

…interactions, or computed in expectation from the system itself.

In this paper, we argue that a classifier that is intended to work with a human should maximize the number of accurate examples beyond the handover threshold by focusing the optimization effort on improving the accuracy of such examples and putting less emphasis on examples within the threshold, since such examples would be inspected and overridden by the human anyway. We achieve this goal by proposing a new optimization loss function, negative log-utility loss (NELU), that optimizes for team utility rather than AI accuracy alone. Current classifiers trained via log-loss to optimize accuracy do not take this aspect of AI-advised decision making into account, and therefore only aim at maximizing the number of correct instances on either side of the decision boundary.

Take, for instance, the illustrative example shown in Figure 1. The left side shows a standard classifier (h1) trained with negative log-loss, while the right side shows a classifier trained with negative log-utility loss (h2). The shaded gray area shows the handover region, in which it is rational for the human to inspect the prediction and, if it is inaccurate, override the decision. Although h1 is more accurate (i.e., more accurate instances on either side of the decision boundary), if it were used in an AI-advised decision making setting, its high accuracy in the handover region would not be useful to the team: the human would not trust the classifier on those instances and would spend more time solving the problem herself, which translates to suboptimal team utility. For h2, even though the overall accuracy is lower (B is on the wrong side of the decision boundary), when combined with a human it could achieve higher team utility, because the human can safely trust the machine outside of the handover region and, at the same time, there are fewer data points to inspect within the handover region (notice that the set of data points A has now moved outside of this region).
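As a very rough sketch of that idea (down-weighting the training loss on instances the human would inspect anyway), consider the following schematic stand-in; it is not the paper's actual NELU definition:

import numpy as np

def schematic_team_utility_loss(p_true, handover_threshold=0.8, inspected_weight=0.2):
    # Schematic stand-in (NOT the paper's exact NELU): ordinary negative
    # log-likelihood, but examples whose predicted confidence falls inside the
    # handover region are down-weighted, since the human inspects those anyway.
    # p_true: predicted probability of the true label for each example (binary task).
    p_true = np.clip(np.asarray(p_true, dtype=float), 1e-12, 1.0)
    confidence = np.maximum(p_true, 1.0 - p_true)      # confidence of the argmax label
    weights = np.where(confidence >= handover_threshold, 1.0, inspected_weight)
    return float(np.mean(weights * -np.log(p_true)))

# Confident examples dominate; low-confidence (handover-region) examples matter less.
print(schematic_team_utility_loss([0.95, 0.90, 0.55, 0.45]))

A real implementation would fold the override cost and human accuracy into the per-example weights and keep the loss differentiable; the sketch only illustrates the re-weighting intuition.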

While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper, accounting for team-based utility functions, is a first step towards human-centered optimization.

In sum, we make the following contributions:

1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.

2. We show that log-loss, the most popular loss function, is insufficient (it ignores team utility) and develop a new loss function to overcome its issues.

3. We show that, on multiple real datasets and machine learning models, our new loss achieves higher utility than log-loss.

4. We provide a thorough explanation of when and why it is best to use the negative log-utility loss by showing experimental results with varying domain-related cost parameters.

Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Override meta-decision is costlier than Accept.


2 Problem Description

We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker d to help make decisions (Figure 2a). If h(x) denotes the classifier's output, a probability distribution over Y, the recommendation r_h(x) consists of a label argmax h(x) and a confidence value max h(x), i.e., r_h(x) := (argmax h(x), max h(x)). Using this recommendation, the user computes a final decision d(x, r_h(x)). The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of the team is to maximize the cumulative utility. Before deriving a closed…
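As a toy, hypothetical rendering of that setup (the cost values, the human-accuracy assumption, and the fixed handover threshold are my own placeholders, not the paper's):

import numpy as np

def expected_team_utility(probs, y_true, human_acc=0.85, reward_correct=1.0,
                          cost_wrong=-1.0, cost_override=-0.2, handover_threshold=0.8):
    # Toy rendering of AI-advised decision making with Accept/Override
    # meta-decisions. All cost values and human_acc are placeholders.
    pred = probs.argmax(axis=1)   # argmax h(x): recommended label
    conf = probs.max(axis=1)      # max h(x): recommendation confidence
    total = 0.0
    for label, c, truth in zip(pred, conf, y_true):
        if c >= handover_threshold:                      # Accept: team outcome = AI outcome
            total += reward_correct if label == truth else cost_wrong
        else:                                            # Override: pay inspection cost,
            total += cost_override                       # then the human decides
            total += human_acc * reward_correct + (1 - human_acc) * cost_wrong
    return total / len(y_true)

# Hypothetical classifier outputs over Y = {0, 1}:
probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.30, 0.70], [0.85, 0.15]])
y_true = np.array([0, 1, 1, 0])
print(expected_team_utility(probs, y_true))

With these placeholder costs, a classifier that pushes more of its correct predictions above the handover threshold scores higher than one that is merely more accurate overall, which is the effect the proposed loss targets.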



Human-AI Team

Helpful

Page 124

[Flowchart] Is the model intelligible? Yes → use it directly, with explanations and controls. No → map it to a simpler model and interact with the simpler model.
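One common way to realize the "map to a simpler model" branch is to train an interpretable surrogate on the black-box model's own predictions; the sketch below is my illustration of that pattern, not necessarily the approach intended on the slide. It distills a random forest into a shallow decision tree and reports how faithfully the surrogate mimics it.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# A black box standing in for an unintelligible model (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = ((X[:, 0] + X[:, 1] ** 2) > 0.5).astype(int)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# "Map to a simpler model": fit an interpretable surrogate to the black box's
# predictions, then expose the surrogate's rules as explanations/controls.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

print("Surrogate fidelity:", (surrogate.predict(X) == black_box.predict(X)).mean())
print(export_text(surrogate, feature_names=[f"f{i}" for i in range(4)]))

The surrogate's printed rules can then back the explanations and controls from the "yes" branch, while the black box still makes the actual predictions.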

Page 125

[Poster: the 18 Guidelines for Human-AI Interaction, grouped by phase: Initially, During interaction, When wrong, and Over time.]

https://aka.ms/aiguidelines

Thanks! Questions?

Page 126

Tutorial website: https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/articles/aaai-2020-tutorial-guidelines-for-human-ai-interaction/

Learn the guidelines
• Introduction to guidelines for human-AI interaction
• Interactive cards with examples of the guidelines in practice

Use the guidelines in your work
• Printable cards (PDF)
• Printable poster (PDF)

Find out more
• Guidelines for human-AI interaction design, Microsoft Research Blog
• AI guidelines in the creative process: How we're putting the human-AI guidelines into practice at Microsoft, Microsoft Design on Medium
• How to build effective human-AI interaction: Considerations for machine learning and software engineering, Microsoft Research Blog