Language acquisition framework for robots: From grounded language acquisition to spoken dialogues

LCore: A Language Acquisition Framework for Robots
From Grounded Language Acquisition to Spoken Dialogues
2013/12/13

Komei Sugiura and Naoto Iwahashi
National Institute of Information and Communications Technology, Japan
komei.sugiura@nict.go.jp

Open problem: grounded language processing
• Language processing based on non-verbal information (vision,
motion, context, experience, …) is still very difficult
– e.g. “Put the blue cup away”, “Give me the usual”
• What is missing in dialog processing for robots?
– Physical situatedness / symbol grounding
– Shared experience

“blue cup”: multiple candidates
“the usual”: umbrella, remote, drink, …

Spoken dialogue system + Robot ≠ Robot dialogue
• Robot dialogue

– Categorization/prediction of real-world information
– Handling real-world properties
– Linguistic interaction

• Why is this difficult?

– Machine learning, CV, manipulation, symbol grounding problem,
speech recognition,…
[Figure: object category hierarchy with the nodes Tableware, Cup, Tea cup, Cutlery, Fork, Plate, Knife]

Robot Language Acquisition Framework

[Iwahashi 10, “Robots That Learn to Communicate: A Developmental Approach…”]

• Task: Object manipulation dialogues
• Key features
– Fully grounded vocabulary
– Imitation learning
– Incremental & interactive learning
– Language independent


LCore functions
• Phoneme learning
• Learning question answering
• Word learning
• Visual feature learning
• Grammar learning
• Affordance learning
• Disambiguation of word ellipsis
• Imitation learning
• Utterance understanding
• Role reversal imitation
• Robot-directed utterance detection
• Active-learning-based dialogue

Learning modules
[Figure: modules Word, Grammar, Motion-object relationship]
• Learning nouns/adjectives
– Learning probabilistic distributions of visual features
– Learning phoneme sequences
• Learning verbs
– Estimation of related objects
– Learning trajectories
– Learning phoneme sequences

Symbol grounding: Learning nouns and adjectives
• Visual features modeled by Gaussians
– Input: visual features of objects
• Out-of-vocabulary word = phoneme sequence + waveform
– Voice conversion (Eigenvoice GMM) to robot voice

[Figure: generative models for the words BLUE and RED, with an unknown object to be named]
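The idea of one Gaussian per word over visual features can be sketched as follows. This is a minimal illustration, not the LCore implementation: the three-dimensional colour-like features, the diagonal covariance, the variance floor, and the function names `fit_word_model` and `best_word` are all assumptions made for the example.

```python
import numpy as np

def fit_word_model(features):
    """Fit a diagonal Gaussian to the visual features observed with one word."""
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    var = x.var(axis=0) + 1e-3  # assumed variance floor, keeps the density finite
    return mean, var

def log_likelihood(model, feature):
    """Log density of one feature vector under a diagonal Gaussian."""
    mean, var = model
    diff = np.asarray(feature, dtype=float) - mean
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var))

def best_word(models, feature):
    """Return the word whose Gaussian assigns the feature the highest likelihood."""
    return max(models, key=lambda w: log_likelihood(models[w], feature))

# Hypothetical colour-like features (not data from the slides).
models = {
    "blue": fit_word_model([[0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.1, 0.3, 0.95]]),
    "red": fit_word_model([[0.9, 0.1, 0.1], [0.8, 0.2, 0.2], [0.95, 0.1, 0.15]]),
}
unknown_object = [0.15, 0.2, 0.85]
word = best_word(models, unknown_object)
```

Because each word is a generative model, the same Gaussians can score an unseen object for every known word, which is what the comparison in `best_word` does.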

Imitation learning of object manipulation [Sugiura+ 07]
• Difficulty: Clustering trajectories in the world coordinate system does not work
• Proposed method
– Input: Position sequences of all objects
– Estimation of reference point and coordinate system by EM algorithm
– Number of states is optimized by cross-validation

Place A on B
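Why clustering in the world coordinate system fails for a motion like “Place A on B” can be shown with a simplified sketch. It does not reproduce the proposed method: instead of HMMs trained with the EM algorithm over whole trajectories, it fits a single Gaussian to the end points only, and it compares just two coordinate systems. The demonstration data and the function names are hypothetical.

```python
import numpy as np

def gaussian_loglik(points):
    """Total log likelihood of points under a diagonal Gaussian fitted to them."""
    x = np.asarray(points, dtype=float)
    var = x.var(axis=0) + 1e-4  # assumed variance floor
    diff = x - x.mean(axis=0)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var)))

def choose_coordinate_system(end_points, reference_points):
    """Pick the coordinate system in which the end points are most consistent."""
    world = gaussian_loglik(end_points)
    relative = gaussian_loglik(np.asarray(end_points) - np.asarray(reference_points))
    return ("reference", relative) if relative > world else ("world", world)

# Hypothetical demonstrations: B sits somewhere different each time,
# and A always ends about 0.1 above it.
reference_points = np.array([[0.0, 0.0], [3.0, 1.0], [-2.0, 4.0], [5.0, -1.0]])
offsets = np.array([[0.0, 0.1], [0.02, 0.12], [-0.01, 0.09], [0.01, 0.11]])
end_points = reference_points + offsets
system, score = choose_coordinate_system(end_points, reference_points)
```

In world coordinates the end points are scattered wherever B happened to be, so the fitted Gaussian is broad; relative to B they collapse into a tight cluster, and the likelihood comparison selects the reference-centred system.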

Imitation learning using reference-point-dependent HMMs
[Sugiura+ 07][Sugiura+ 11]

Searching for the optimal coordinate system
[Equation lost in extraction. Quantities labeled on the slide: coordinate system type, reference object ID, HMM parameters, position at time t]
• Delta parameters
[Equation lost in extraction]

* Sugiura, K. et al, “Learning, Recognition, and Generation of Motion by …”, Advanced Robotics, Vol.25, No.17, 2011

Results: motion learning
No verb was estimated to have the world coordinate system (WCS)
-> All learned verbs are reference-point-dependent

[Figure: position and velocity for the motion “place-on”; training-set log likelihood for the verbs place-on, move-closer, raise, jump-over, move-away, rotate, and move-down]

Transformation of reference-point-dependent HMMs [Sugiura+ 11]
• What is the problem?
– Simple HMMs do not generate continuous trajectories
– Trajectories are situation-dependent
• Reference-point-dependent HMM
– Input: (motion ID, object ID) e.g. <place-on, Object 1, Object 3>
– Output: Maximum likelihood trajectory

[Figure: the HMM “Place-on” combined with the current situation to give a trajectory in the world coordinate system, for “Place X on Y”]

* Sugiura, K. (2011), “Learning, Generation, and Recognition of Reference-Point-Dependent Probabilistic…”

Generating continuous trajectory using delta parameters
[Tokuda+ 00]

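Maximum likelihood trajectory
[Equation and symbols lost in extraction. Quantities defined on the slide:]
– time series of (position, velocity, acceleration)
– state sequence
– HMM parameters
– filter
– matrix of covariance matrices of each output probability density function (OPDF)
– time series of position
– vector of mean vectors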

*Tokuda, K. et al, “Speech parameter generation algorithms for HMM-based speech synthesis”, 2000
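A minimal sketch of generating a smooth trajectory from per-frame means and variances, in the spirit of [Tokuda+ 00]. With W the filter that maps positions to (position, delta) features, Σ the stacked covariances, and μ the stacked means, the maximum likelihood positions c solve (WᵀΣ⁻¹W) c = WᵀΣ⁻¹μ. The symbols, the one-dimensional setting, the use of velocity only (no acceleration), the central-difference window, and the fixed state sequence are assumptions for this example, not details recovered from the slide.

```python
import numpy as np

def build_window_matrix(T):
    """Stack static and delta (velocity) windows into W with shape (2T, T)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static row: the position itself
        # delta row: 0.5 * (c[t+1] - c[t-1]), with indices clipped at the edges
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    return W

def ml_trajectory(mu, var):
    """mu, var: arrays of shape (T, 2) holding per-frame means and variances
    of (position, velocity), already expanded along a fixed state sequence."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    T = mu.shape[0]
    W = build_window_matrix(T)
    precision = np.diag(1.0 / var.reshape(-1))  # diagonal inverse covariance
    A = W.T @ precision @ W
    b = W.T @ precision @ mu.reshape(-1)
    return np.linalg.solve(A, b)

# Hypothetical two-state model: position mean 0 for 5 frames, then 1 for 5 frames.
T = 10
mu = np.zeros((T, 2))
mu[5:, 0] = 1.0
var = np.tile([0.1, 0.01], (T, 1))
traj = ml_trajectory(mu, var)
```

Without the delta rows the solution would simply repeat the state means and jump between states; the velocity constraint is what spreads the transition over several frames.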

Quantitative results
• Evaluation measure
– Euclidean distance
– Normalized by frame number T

Trajectory by Subject
Trajectory by proposed method
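One way to read the evaluation measure is sketched below. The slide states only that the Euclidean distance is normalized by the frame number T; summing per-frame distances between time-aligned trajectories of equal length is an assumption, as is the function name.

```python
import numpy as np

def normalized_distance(traj_a, traj_b):
    """Sum of per-frame Euclidean distances divided by the number of frames T."""
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("trajectories must have the same shape (T, dims)")
    T = a.shape[0]
    return float(np.linalg.norm(a - b, axis=1).sum() / T)

# Hypothetical 2-frame trajectories: the first frames differ by a 3-4-5 triangle.
distance = normalized_distance([[0, 0], [0, 0]], [[3, 4], [0, 0]])
```

Dividing by T keeps the measure comparable across motions of different duration.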

SPOKEN LANGUAGE UNDERSTANDING USING NON-LINGUISTIC INFORMATION

Utterance understanding in LCore (1)
• User utterances are understood by using multimodal
information learned in a statistical learning framework

[Figure: modules contributing to the shared belief]
• Vision (Bayesian learning of a Gaussian)
• Motion (HMM)
• Speech (HMM)
• Motion-object relationship (Bayesian learning of a Gaussian)
• Context (MCE learning)

Integration of multimodal information
• Shared belief Ψ: weighted sum of five modules
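[Equation lost in extraction: Ψ is defined over the utterance, action, scene, and context, with one weighted term each for Speech, Motion, Vision, Motion-object relationship, and Context]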
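The weighted-sum structure of the shared belief can be sketched as follows. The module scores, the weights, the candidate actions, and the names below are hypothetical; in LCore the terms come from the learned models and the weights are learned, which this sketch does not attempt.

```python
MODULES = ("speech", "motion", "vision", "motion_object", "context")

def shared_belief(scores, weights):
    """Weighted sum of the five per-module log scores for one candidate action."""
    return sum(weights[m] * scores[m] for m in MODULES)

# Hypothetical weights and per-module log scores.
weights = {"speech": 1.0, "motion": 1.0, "vision": 1.0,
           "motion_object": 1.0, "context": 0.5}
candidates = {
    "place Elmo on box": {"speech": -2.0, "motion": -1.0, "vision": -0.5,
                          "motion_object": -0.3, "context": -0.2},
    "place Elmo on cup": {"speech": -2.0, "motion": -1.0, "vision": -0.5,
                          "motion_object": -2.5, "context": -1.0},
}
# Understanding = choosing the action that maximizes the shared belief.
best_action = max(candidates, key=lambda a: shared_belief(candidates[a], weights))
```

Here the two candidates are equally good for speech, motion, and vision, so the motion-object relationship and context terms decide between them.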

Inter-module learning
[Figure: multimodal understanding, confidence learning, and utterance/motion generation; one user intention expressed as “Place Elmo on box”, “Place Elmo”, or “Place it”]

Grounded utterance disambiguation
• Simple dialog systems
– U: “Place the cup (on the table).”
– R: “You said place the cup.”
– Which “cup”? Where to?
-> Risk of motion failure
• Generating confirmation utterances using physical information
– R: “I’ll place the red cup on the table, is it OK?”


Multimodal utterance understanding

[Figure: candidate actions for the utterance “Place-on Elmo”, ranked from 1st to 30th, with the margin indicated]

Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, 2011
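A small sketch of a margin over ranked candidates. Taking the margin as the gap between the best and second-best scores is an assumption based on the layout of the figure; the scores and the function name are hypothetical.

```python
def margin(scores):
    """Gap between the best and the second-best candidate score."""
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]

# Hypothetical shared-belief scores for three candidate actions.
d = margin([-6.5, -3.9, -10.0])
```

A large margin means the top interpretation clearly beats the alternatives; a small one signals an ambiguous utterance.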


Confirmation by paraphrasing user’s utterance

• Learning phase
– Bayesian logistic regression
– Input: margin (d), Output: probability
• Execution phase
– Decision-making on responses based on expected utility
[Figure: probability plotted against margin]
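The two phases can be sketched as below. Several things here are assumptions: plain maximum-likelihood logistic regression stands in for the Bayesian logistic regression named on the slide, the training pairs are invented, and the three responses with their utility values are hypothetical choices made only to show the expected-utility decision.

```python
import numpy as np

def fit_logistic(margins, successes, lr=0.1, steps=5000):
    """Fit P(success | d) = sigmoid(w * d + b) by gradient ascent."""
    d = np.asarray(margins, dtype=float)
    y = np.asarray(successes, dtype=float)
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * d + b)))
        w += lr * np.mean((y - p) * d)  # gradient of the mean log likelihood
        b += lr * np.mean(y - p)
    return w, b

def success_probability(params, d):
    """Predicted probability that the top-ranked interpretation is correct."""
    w, b = params
    return float(1.0 / (1.0 + np.exp(-(w * d + b))))

# Hypothetical utilities: (value if understanding is correct, value if wrong).
UTILITIES = {"execute": (1.0, -5.0), "confirm": (-0.2, -1.5), "reject": (-1.0, -1.0)}

def choose_response(p, utilities=UTILITIES):
    """Pick the response with the highest expected utility."""
    expected = {r: p * ok + (1.0 - p) * bad for r, (ok, bad) in utilities.items()}
    return max(expected, key=expected.get)

# Learning phase on invented (margin, success) pairs.
params = fit_logistic([0.1, 0.3, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0],
                      [0, 0, 1, 0, 1, 1, 1, 1])
# Execution phase for a new utterance with margin 4.0.
response = choose_response(success_probability(params, 4.0))
```

With these utilities the robot executes when it is confident, asks for confirmation in the middle range, and rejects when the predicted probability is very low.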

Quantitative result: Risk reduction
[Figure: baseline vs. proposed method, compared on failure rate, rejection rate, confirmation rate, and number of confirmation utterances; annotation: “Decreased to 1/4”]

Reduction of motion failure in learning phase [Sugiura+ 11]
• So far…
– Learning utterance understanding probabilities
• Idea
– Learning-by-asking

Phase               Operator   Motion executor
Active learning     Robot      User
(Passive) learning  User       Robot
Execution           User       Robot

Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, Vol. 25, No. 17, 2011

Reduction of motion failure in learning phase
• Problem:
– Motion failure is required in the learning phase to avoid over-fitting
[Figure: active learning phase, learning phase, and execution phase, with motion failure, motion success, and “safe” training data marked]

What kind of commands are effective for learning?
• Proposed method: Active Learning-based command generation
• Objective: Reduce the number of interactions
• [Input = image], [Output = utterance]
• Expected Log Loss Reduction (ELLR) [Roy, 2001] is used to select the optimal utterance
• Active learning: a form of supervised learning in which inputs can be selected by the algorithm
Target action        Robot utterance             Loss
Act=A, Objs=<1,3>    “Place-on Elmo blue box”    35.8
Act=A, Objs=<1,3>    “Place-on Elmo”             12.3
Act=A, Objs=<1,2>    “Place-on Elmo”             28.1
:                    :                           :
Act=B, Objs=<2>      “Raise box”                 332.3
:                    :                           :
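The selection step alone can be sketched with the rows listed on the slide. This does not compute the expected log loss itself; it assumes the loss values are already available, that the lowest value is the preferred one, and that the “Raise box” row pairs with Act=B as reconstructed from the slide layout. Elided rows are left out.

```python
# (target action, robot utterance, loss), copied from the rows shown on the slide.
candidates = [
    ("Act=A, Objs=<1,3>", "Place-on Elmo blue box", 35.8),
    ("Act=A, Objs=<1,3>", "Place-on Elmo", 12.3),
    ("Act=A, Objs=<1,2>", "Place-on Elmo", 28.1),
    ("Act=B, Objs=<2>", "Raise box", 332.3),
]

def select_command(rows):
    """Return the candidate with the lowest loss (assumed selection rule)."""
    return min(rows, key=lambda row: row[2])

target_action, utterance, loss = select_command(candidates)
```

Among the listed rows this picks the elliptical command “Place-on Elmo” for objects <1,3>, the kind of ambiguous utterance from which the robot has the most to learn.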

Reduction of motion failure in learning phase

[Figure: test-set likelihood against number of episodes for (1) proposed and (2) baseline, annotated “Fast convergence”; number of motion failures for proposed vs. baseline, annotated “Motion failure risk reduced”]
