68
Learning How-To Knowledge from the Web Yuke Zhu IROS 2019

Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Learning How-To Knowledge from the Web

Yuke Zhu

IROS 2019

Page 2: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Advances in Artificial Intelligence

Visual Recognition Machine Translation Question Answering

Page 3: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

The Unsung Hero: Web Data

SQuAD QA Dataset [Rajpurkar et al. 2016]

100,000+ questions posed

by crowdworkers on a set

of Wikipedia articles

Google NMT[Wu et al. 2016]

WMT En→Fr dataset

with 36M sentence pairs

ImageNet[Deng et al. 2009]

14 million web images

annotated by AMT workers

Visual Recognition Machine Translation Question Answering

Page 4: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Traditional form of automation Intelligent robots in real world

The Unsung Hero: Web Data

?

What’s the role of web data in improving robot intelligence?

Page 5: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

What knowledge do we need for robotics?

“To accelerate or to brake?”

Page 6: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

What knowledge do we need for robotics?

Knowledge of “That-Is”

car

Heavy & Fast

bike

Slow

Declarative knowledge

Understanding the world

v Easy to articulate

(conscious)

v Describes facts

of the world

Page 7: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

What knowledge do we need for robotics?

Declarative knowledge

Understanding the world

Procedural knowledge

Interacting with the world

v Easy to articulate

(conscious)

v Describes facts

of the world

vDescribes how to

perform tasks

vHard to pinpoint

(unconscious)

Knowledge of “How-To”

Page 8: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Robotics

Procedural

Knowledge

(“How-To”)

Declarative

Knowledge

(“That-Is”)

Understanding

the World

Interacting with

the World

Page 9: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Understanding

the World

Interacting with

the WorldRobotics

Procedural

Knowledge

(“How-To”)

Declarative

Knowledge

(“That-Is”)

Learning Declarative (“That-Is”) Knowledge from the Web

Understanding the world is the cornerstone of interacting with the world.

Page 10: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

The Visual Genome Project

A large-scale visual knowledge base of

structured image concepts

Krishna, Zhu, Groth, Johnson, Hata, Kravitz, Chen, Kalantidis, Li, Shamma, Bernstein, and Fei-Fei, IJCV 2017

Page 11: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Visual Genome

knob

bowl

bowl

drawer

holder

knife

counter

Scene Graph: Objects

Questions

1. Q: What’s the color of the counter? A: Black.

2. Q: How many drawers can you see? A: Two.

3. Q: What’s the material of the pots? A: Metal.

……

Region Descriptions

1. There a green bowl on the black counter.

2. The cabinet door is closed.

3. Six knives are placed in the knife holder.

……

large

openable

metal

graspable

black

+ Attributes

has

on

with

next to

on

in

+ Relationships

Page 12: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Visual Genome

knob

has

bowl

on

bowl

drawer

with

next to

large

holder

knife

openable

on

in

metal

graspable

black counter

Scene Graph: Objects + Attributes + Relationships

Questions Region Descriptions

1. Q: What’s the color of the counter? A: Black.

2. Q: How many drawers can you see? A: Two.

3. Q: What’s the material of the pots? A: Metal.

……

1. There a green bowl on the black counter.

2. The cabinet door is closed.

3. Six knives are placed in the knife holder.

……

108K Images

1.7M Questions 5.4M Region Descriptions

3.8M Objects

2.8M Attributes

2.3M Relationships

Page 13: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Visual Genome

An ontology of visual concepts

https://visualgenome.org

Page 14: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

green onions sitting on the counter

a big white bowl

knives in a holder

wooden drawer is closed

two ceramic jars

Johnson et al. CVPR’16; Krishna, Zhu, et al. IJCV’17

Page 15: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

A: In the daytime.

Q: When was the picture taken?

A: Seven.

Q: How many drawer knobs can you see?

Zhu et al. CVPR’16, Zhu et al. CVPR’17

A: Black.

Q: What color is the countertop?

Page 16: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Xu, Zhu, Choy, Fei-Fei, CVPR’17

knob

has

bowl

on

bowl

drawer

with

next to

large

holder

knife

openable

on

in

metal

graspable

black counter

Page 17: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Understanding

the World

Interacting with

the World

Visual Genome learns Declarative Knowledge from the web.

We built a large-scale visual knowledge base via online crowdsourcing.

Procedural

Knowledge

(“How-To”)

Declarative

Knowledge

(“That-Is”)

Page 18: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Learning Procedural Knowledge needs new methodology.

Understanding

the World

Interacting with

the World

It is hard to pinpoint and difficult to verbally described.

Procedural

Knowledge

(“How-To”)

Declarative

Knowledge

(“That-Is”)

Page 19: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Learning Procedural (“How-To”) Knowledge from the Web

Three Key Questions

vWhat’s a good representation of procedural knowledge?

vHow do we learn procedural knowledge from the web?

vHow can robots take advantage of such knowledge?

Page 20: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Part I: Learning from Video Demonstrations

Part II: Learning from Crowd Teleoperation

Page 21: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Part II: Learning from Crowd Teleoperation

Part I: Learning from Video Demonstrations

Page 22: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Web videos supply massive knowledge of how to solve new tasks.

Source: The Verge, Pew Research Center

Page 23: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Humans learn efficiently from video demonstrations.

Meltzoff & Moore 1977; Meltzoff & Moore 1989, Meltzoff 1988

Imitation of Televised Models by Infants

Andrew N. Meltzoff, Child Development 1988

Babies (14-24 months) can learn by imitating

demonstrations from the TV screen.

Page 24: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

prepare dinner

prepare dinner

cook foodwash dishes

grasp wash place cut boil

Our Goal: Learning procedural knowledge as compositional task structures

from video demonstrations of a task

Page 25: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

single video

demonstration

meta-learning

model

policy for the

demonstrated task

Page 26: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

a lot of training videos

(seen tasks)

policy for the

demonstrated task

supervision

One-Shot Imitation Learning from Videos

meta-learning

model

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 27: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

single test video

(unseen task)

policy for the

demonstrated task

One-Shot Imitation Learning from Videos

meta-learning

model

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 28: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos

[Duan et al. 17; Finn et al. 2017; Wang et al. 2017; Yu et al. 2018]

modeling demonstration

as a flat sequence

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 29: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos

modeling demonstration

as a flat sequence

modeling demonstration

as a compositional structure

[Duan et al. 17; Finn et al. 2017; Wang et al. 2017; Yu et al. 2018]

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 30: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Pin: pick_and_place

Pout: pick

EOP: False

Output Task Spec.

Env. Observation Input Task Spec.

Args: block_E

Pin: pick

Pout: move_to

EOP: False

move_to (block_E) return return

Args: block_E

Pin: pick

Pout: grip

EOP: True

retu

rn

Pin: block_stacking

Pout: pick_and_place

EOP: False

Output Task Spec.

Env. Observation Input Task Spec.

return

Pin: block_stacking

Pout: pick_and_place

EOP: False

Output Task Spec.

Pin: pick_and_place

Pout: drop

EOP: True

Output Task Spec.

Args: N/A

Pin: place

Pout: release

EOP: TruePin: place

Pout: move_to

EOP: False

Args: block_B

retu

rn

Env. Observation Input Task Spec. Env. Observation Input Task Spec.

Env. Observation Input Task Spec.

Env. Observation Input Task Spec.

Env. Observation Input Task Spec.

Env. Observation Input Task Spec.

grip (block_E) move_to (block_B) returnrelease()

… …

return

Pick and Place

Block Stacking

Pick Pick

Pick and Place

Place Place

Block Stacking

Move_to (Blue) Grip (Blue) Move_to (Red) Release( )

Neural Task Programming (NTP): Hierarchical Policy Learning as Neural Program Induction

Page 31: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)

demonstration policy

next program

pick(blue)

observation

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

current program

pick_place(blue, green)

end-to-end

neural network

(LSTM)

meta-learning

model

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 32: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

{( , )}

One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)

demonstration

next program

pick(blue)

observation

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

end-to-end

neural network

(LSTM)

current program

pick_place(blue, green)

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Training supervision

video demonstration hierarchical program trace

Page 33: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

0

0.2

0.4

0.6

0.8

1

50 100 400 1000

un

se

en

ta

sk s

uc

ce

ss r

ate

number of training tasks

Flat NTP (Ours)

N/A

N/A

One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)

Qualitative Quantitative

(the higher the better)

Object Sorting

Autonomous Execution

Demo

8x

Better generalization with less

training data than flat baselines

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

Page 34: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)

demonstration policy

compositional

model prior

meta-learning

model

Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018

end-to-end

neural network

(LSTM)

Page 35: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Task Graph

Generator

Neural Task Graph

observation

Task Graph

Executor

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

demonstration policy

meta-learning

model

Page 36: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Task Graph

Generator

Neural Task Graph

observation

Task Graph

Executor

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

demonstration policy

meta-learning

model

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 37: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Task Graph

Generator

Neural Task Graph

observation

Task Graph

Executor

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

demonstration policy

meta-learning

model

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 38: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Task Graph

Nodes States

place(red)

place(green)

pic

k(g

reen

) pic

k(r

ed

)

pick(orange)

Edges Actions

Conjugate Task Graph

place(green) pick(green)

pick(orange)

pick(red) place(red)

Nodes ActionsinfiniteEdges States (Preconditions)

valid states

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

finite

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 39: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

place(green) pick(green)

pick(orange)

pick(red) place(red)

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

current observation

node

localizer

edge

classifier

selectednode

next action

pick(red)

selectededge

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 40: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

{( , )}

Training supervision

video demonstration action sequence

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Task Graph

Generator

Neural Task Graph

observation

Task Graph

Executor

State obs.

Task Demonstration

Robot API

Completed Tasks

NTP Env.

Robot API```

Task

1

```

Task

2

Task 1 Final State

Task 2 Final State

Task Conditional Output Policies

demonstration

policy

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 41: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

0

0.2

0.4

0.6

0.8

1

50 100 400 1000

un

se

en

ta

sk s

uc

ce

ss r

ate

number of training tasks

Flat NTP (Ours)

0

0.2

0.4

0.6

0.8

1

50 100 400 1000

un

se

en

ta

sk s

uc

ce

ss r

ate

number of training tasks

Flat NTP (Ours) NTG (Ours)

N/A

N/A

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Qualitative Quantitative

(the higher the better)

Recovery from Intermediate Failures

Autonomous Execution 20x

Weaker supervision, less training

data, and better generalization

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

Page 42: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)

Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019

OrientingNeedle Positioning NeedlePushing Needle

through Tissue

PullingSuturewith

LeftHand

Predicted

Path

Video

OrientingNeedle Positioning NeedlePushing Needle

through Tissue

PullingSuturewith

LeftHand

Predicted

Graph

Applying NTG to the real-world surgical video dataset JIGSAWS

Page 43: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Next Goal: Learning task knowledge from web videos

Page 44: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Summary - Part I

Extracting how-to knowledge about the compositional task

structure of complex tasks from video demonstrations

Meta-learning models with compositional priors generalize

better than black-box models

vs.

task graph

black box

Page 45: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

NTP and NTG learn how-to knowledge in the form of compositional task

structures while motor skills are abstracted away.

prepare dinner

prepare dinner

cook foodwash dishes

grasp wash place cut boil

modeled as pre-defined “API calls”

Page 46: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

NTP and NTG learn how-to knowledge in the form of compositional task

structures while motor skills are abstracted away.

How can we collect data

for learning motor skills

from the web?

Manually defining motor skills is intractable.

We need to learn from data.

Page 47: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Part I: Learning from Video Demonstrations

Part II: Learning from Crowd Teleoperation

Page 48: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Vecerik et al. 2017: 100 demos

Finn et al. 2017: 30 demosRajeswaran et al. 2018: 25 demos

Large demonstration datasets is hard to collect.

Humans need to demonstrate not label.

Zhu et al. 2018: 30 demos

Imitation Learning Reinforcement & Self-Supervised Learning

Levine et al. 2016

Pinto et al. 2016

Kalashnikov et al. 2018

Data can be low quality due to lack of expert.

Fang et al. 2018

Data is critical for learning robot motor skills.

Page 49: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Data is critical for learning robot skills.

How to scale up high-quality human supervision for robotics?

Provide a natural way for anyone to provide demonstrations

Page 50: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

RoboTurk in action

+

roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018

Web-based Crowd Teleoperation with RoboTurk

RoboTurk: Crowdsourcing Platform for Large-Scale Demonstration Collection

real-time streaming

from remote robot

6-DoF

controller

Page 51: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

cloud

users

remote

robots

server

Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018roboturk.stanford.edu

Web-based Crowd Teleoperation with RoboTurk

Page 52: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

cloud

users

remote

robots

server

User Interface

Web Browser View

Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018roboturk.stanford.edu

Web-based Crowd Teleoperation with RoboTurk

Page 53: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

RoboTurk Pilot Dataset

137.5 hours of demonstrations

22 hours of total platform usage

2218 successful demonstrations

surreal.stanford.edu Zhu*, Fan*, Zhu, Liu, Zeng, Gupta, Creus-Costa, Savarese, Fei-Fei, CoRL 2018

teleoperated demonstrations

roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018

Web-based Crowd Teleoperation with RoboTurk

Page 54: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Bin Picking (Can) Nut Assembly (Round)

Policy Learning from Teleoperated Demonstrations

Learning from the Masses

Page 55: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

RoboTurk Pilot Dataset

137.5 hours of demonstrations

22 hours of total platform usage

2218 successful demonstrations0

200

400

600

800

1000

0 1 10 100 1000

Task P

erf

orm

ance (re

ward

)

Number of Demonstrations

Reinforcement and Imitation Learning

Pure RL

assembly

pick & place

Zhu*, Fan*, Zhu, Liu, Zeng, Gupta, Creus-Costa, Savarese, Fei-Fei, CoRL 2018

Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018

surreal.stanford.edu

roboturk.stanford.edu

Reinforcement and Imitation Learning: Data

Page 56: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Webcam

Kinect

Robot

61

RoboTurk on Physical Robots

Page 57: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Real-time

62

Scalable Data Collection

Page 58: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Gao et al.

2014

Zhang et al.

2018

Yu, Finn et al.

2018

Sharma et al.

2018

1.662.35

4.08

13.7

0

2

4

6

8

10

12

14

16

JIGSAWS Deep Imitation DAML MIME

Data

set

Siz

e (H

ours

)

Dataset Size Comparison

Mandlekar, Booher, Spero, Tung, Gupta, Zhu, Garg, Savarese, Fei-Fei, IROS 2019

Page 59: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Gao et al.

2014

Zhang et al.

2018

Yu, Finn et al.

2018

Sharma et al.

2018

1.66 2.35 4.08

13.7

111.25

0

20

40

60

80

100

120

JIGSAWS Deep Imitation DAML MIME Ours

Data

set

Siz

e (H

ours

)

10x

Dataset Size Comparison

Mandlekar, Booher, Spero, Tung, Gupta, Zhu, Garg, Savarese, Fei-Fei, IROS 2019

Page 60: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

RoboTurk for

everyone, everywhere

Page 61: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Summary - Part II

RoboTurk scales up demonstration collection with teleoperated

crowdsourcing from web users

Large-scale crowdsourced data enables us to train more effective

motor skill learning algorithms.

Page 62: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Come to our IROS Presentation

Learn More about RoboTurk?

RoboTurk: Human Reasoning and Dexterity for Large-

Scale Dataset Creation

Tuesday 15:45-16:00, Award Session II: Paper TuBT4.5

Page 63: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Part I: Learning from Web Videos

Part II: Learning from Crowd Teleoperation

Extracting compositional task structures from video data

Crowdsourcing teleoperated demonstrations for skill learning

Page 64: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Conclusions

vWhat’s a good representation of procedural knowledge?

vHow do we learn procedural knowledge from the web?

vHow can robots take advantage of such knowledge?

High-level task structures & low-level motor skills

Large-scale web videos & crowd teleoperation from online users

Machine learning algorithms, e.g., meta-learning & imitation learning

Page 65: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Conclusions

Open Question:

How to integrate procedural knowledge and

declarative knowledge into a unified knowledge

ontology for building intelligent algorithms in

robotics?

Page 66: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Acknowledgements

Fei-Fei Li Silvio Savarese Animesh Garg Danfei Xu De-An HuangAjay Mandlekar

Page 67: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Robotics

Procedural

Knowledge

(“How-To”)

Declarative

Knowledge

(“That-Is”)

Understanding

the World

Interacting with

the World

http://ai.stanford.edu/~yukez/

[email protected]

Page 68: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore

Conclusions

Open Question:

How to integrate procedural knowledge and

declarative knowledge into a unified knowledge

ontology for building intelligent algorithms in

robotics?