2. Human-Centered Multimedia
Founded: April 2001
Chair: Elisabeth André
Research Topics:
Human-Computer Interaction
Social Signal Processing
Affective Computing
Embodied Conversational Agents
Social Robotics
3. Motivation
There is another level in human communication which is just as important as the spoken message: nonverbal communication.
How can we enrich the precise and useful functions of computers with the human ability to shape the meaning of a message through nonverbal signals?
4. Observation
Social signal processing has developed from a side issue into a major area of research.
Yet the effort undertaken has not translated well into applications. Why is this?
[Timeline of milestones at ACM MM, 1998-2015: Special Session on Face and Gesture Recognition; Brave New Topic: Affective Multimodal HCI; keynote "Honest Signals"; 1st HCM Workshop; 3 workshops on "Social Cues"; 1/3 of Grand Challenge papers on Affective Computing]
5. Challenge: Real-Life Applications
Of a total of 434 publications on SSPNet, 10% include the term "real(-)time" and are related to detection.
Only 2% address multi-modal detection.
[Pie chart "Social Signal Processing in the Wild": modality focus of the real-time detection papers: face (15), gesture (9), speech (9), interaction (8), physiological (2), multimodal (13); meta-analysis by J. Wagner]
6. Organization of the Talk
Analysis of Emotional and Social Signals
Generation of Expressive Behaviors in Virtual Agents and Robots
Applications of Social Signal Processing and Embodied Agents:
Socially Sensitive Robots
Training of Presentation Skills in
• Job Interviews
• Public Speaking
Providing Information on Social Context to Blind People
7. Challenge: Noisy and Corrupted Data
We can only rely on previously seen data.
We have to deal with noisy and corrupted data.
[Diagram: a signal stream over time with noisy and missing segments leading up to "now"]
8. Challenge: Non-Prototypical Behaviors
Previous research focused on the analysis of prototypical samples in as pure a form as possible.
In daily life, we also observe subtle, blended and suppressed emotions, i.e. non-prototypical emotional displays.
Pictures from Ekman and Friesen’s database of emotional faces
9. Accuracy Drops with Naturalness
Systems developed under laboratory conditions often perform poorly in real-world scenarios.
[Chart: recognition accuracy drops with the naturalness of the data, from roughly 100% for acted to 80% for read and 70% for Wizard-of-Oz speech]
10. Contextualized Analysis
Improvement by context-sensitive analysis (a minimal sketch follows the list):
Gender-specific information (Vogt & André 2006)
Success/failure of the student in tutoring applications (Conati & McLaren 2009)
Dialogue behavior of the virtual agent/robot (Baur et al. 2014)
Learning context using (B)LSTMs (Metallinou et al. 2014)
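To make the idea concrete, here is a minimal sketch of context-sensitive model selection, using gender as the context variable; the data layout and the scikit-learn classifier are assumptions for illustration, not the setup of the cited papers:

```python
# Minimal sketch: one emotion classifier per context value (here: gender),
# with the matching model selected at recognition time.
# Data layout and classifier choice are illustrative assumptions.
from sklearn.svm import SVC

def train_context_models(samples):
    """samples: iterable of (feature_vector, emotion_label, gender) tuples."""
    models = {}
    for gender in ("female", "male"):
        X = [f for f, _, g in samples if g == gender]
        y = [e for _, e, g in samples if g == gender]
        models[gender] = SVC().fit(X, y)
    return models

def classify(models, features, gender):
    # The context (gender) picks the specialized model.
    return models[gender].predict([features])[0]
```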
11. Challenge: Multimodal Fusion
A meta-study by D'Mello and Kory on multimodal affect detection shows that the improvement over unimodal detection correlates with the naturalness of the corpus: >10% for acted but only <5% for natural data.
In natural interaction, people draw on a mixture of strategies to express emotion, leading to a complementary rather than consistent display of social behaviour.
S.K. D'Mello, J.M. Kory: Consistent but modest: a
meta-analysis on unimodal and multimodal affect
detection accuracies from 30 studies. ICMI 2012: 31-38
12. Event-Based Fusion
In the case of contradictory cues, fusion methods trust the "right" modality just as often as the "wrong" one.
[Diagram: per-sample correct and incorrect classifications for single modalities vs. fusion techniques]
J. Wagner, E. André, F. Lingenfelser, J. Kim: Exploring Fusion Methods for Multimodal Emotion Recognition with Missing Data. T. Affective Computing 2(4): 206-218 (2011)
13. Event-Based Fusion
The amount of misclassified samples is significantly higher when the annotations of the modalities mismatch.
[Chart: modality annotations agree ("Yes") for 71% of samples and disagree ("No") for 29%, with misclassification rates of 36% vs. 62%]
15. Synchronous Fusion
Synchronous fusion approaches consider multiple modalities within the same time frame (a minimal sketch follows).
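For illustration, a minimal feature-level sketch of synchronous fusion; the frame format and classifier interface are assumptions. Per-frame features of all modalities are concatenated within the shared time frame before a single classifier decides:

```python
import numpy as np

def synchronous_fusion(audio_frames, video_frames, classifier):
    """Feature-level fusion of streams aligned to a common frame rate.

    audio_frames, video_frames: arrays of shape (n_frames, n_features_*),
    where row t of both arrays covers the same time frame.
    """
    fused = np.concatenate([audio_frames, video_frames], axis=1)
    return classifier.predict(fused)  # one decision per shared time frame
```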
16. Asynchronous Fusion
Asynchronous fusion algorithms refer back to past time frames with the help of a memory mechanism and are therefore able to capture the asynchronous nature of the observed modalities (a minimal sketch follows).
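A minimal sketch of the memory idea, with a decaying trace per modality standing in for the recurrent memory (e.g. an LSTM) that would be used in practice; the score format is an assumption:

```python
import numpy as np

def asynchronous_fusion(streams, decay=0.9):
    """streams: dict modality -> array (n_frames, n_classes) of frame scores.

    Each modality keeps a decaying memory of its past evidence, so a cue in
    one modality can still be combined with a slightly later cue in another.
    """
    n_frames = next(iter(streams.values())).shape[0]
    memory = {m: np.zeros(s.shape[1]) for m, s in streams.items()}
    decisions = []
    for t in range(n_frames):
        for m, s in streams.items():
            memory[m] = decay * memory[m] + (1 - decay) * s[t]
        combined = np.mean(list(memory.values()), axis=0)
        decisions.append(int(np.argmax(combined)))
    return decisions
```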
18. Event-Based Fusion
Take into account temporal relationships between
channels and learn when to combine information
Move from segmentation-based processing to
asynchronous event-driven approaches
More robust in the case of missing or noisy data
[Diagram: laughter events ("haha", "hehe") detected on different channels at different times and combined by event-driven fusion]
F. Lingenfelser, J. Wagner, E. André, G. McKeown, W. Curran: An Event Driven Fusion Approach
for Enjoyment Recognition in Real-time. ACM Multimedia 2014: 377-386
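The event-driven idea can be sketched as follows; this is a strong simplification of the cited approach, and the event format, decay constant and threshold are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Event:
    modality: str      # e.g. "audio" (a detected "haha") or "video" (a smile)
    confidence: float  # detector confidence in [0, 1]
    timestamp: float   # seconds

def enjoyment_score(events, now, half_life=2.0):
    """Combine asynchronous events; the influence of an event decays with age."""
    score = 0.0
    for e in events:
        age = now - e.timestamp
        if age >= 0:
            score += e.confidence * 0.5 ** (age / half_life)
    return score  # compare against a threshold to decide "enjoyment present"

events = [Event("audio", 0.8, 10.0), Event("video", 0.6, 10.7)]
print(enjoyment_score(events, now=11.0))  # both events still contribute
```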
19. SSI Framework
The Social Signal Interpretation (SSI) framework is an attempt to provide a general architecture that tackles the challenges we have discussed:
collection of large and rich multi-modal corpora
investigation of advanced fusion techniques
simplifying the development of online systems
Johannes Wagner, Florian Lingenfelser, Tobias
Baur, Ionut Damian, Felix Kistler, Elisabeth André:
The social signal interpretation (SSI) framework:
multimodal signal processing and recognition in
real-time. ACM Multimedia 2013: 831-834
SSI is freely available under:
http://www.openssi.net
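SSI itself is a C++ framework whose pipelines chain sensors, transformers and consumers over streaming data. As a purely conceptual illustration of that architecture (the class and component names below are mine, not SSI's API), such a pipeline boils down to:

```python
# Conceptual sensor -> transformer -> consumer pipeline; names are
# illustrative stand-ins, not SSI's actual API.
class Pipeline:
    def __init__(self, sensor, transformers, consumer):
        self.sensor = sensor              # produces raw signal chunks
        self.transformers = transformers  # e.g. filtering, feature extraction
        self.consumer = consumer          # e.g. classifier, logger, GUI

    def run(self):
        for chunk in self.sensor():
            for transform in self.transformers:
                chunk = transform(chunk)
            self.consumer(chunk)

# Toy wiring: a fake audio sensor, a mean-energy feature, and printing.
Pipeline(
    sensor=lambda: iter([[0.1, 0.2], [0.3, 0.1]]),
    transformers=[lambda c: sum(abs(x) for x in c) / len(c)],
    consumer=print,
).run()
```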
24. Generation of Facial Expressions
FACS (Facial Action Coding System) can be used to generate and recognize facial expressions. Action Units are used to describe emotional expressions.
Seven Action Units were identified for the robotic face (out of 40 Action Units for the human face):
Lower face: lip corner puller (AU 12), lip corner depressor (AU 15) and lip opening (AU 25)
Upper face: inner brow raiser (AU 1), brow lowerer (AU 4), upper lid raiser (AU 5) and eye closure (AU 43)
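As an illustration, a plausible emotion-to-AU mapping restricted to the seven available units; the talk does not spell out the mapping actually used on the robot, so the combinations below are standard FACS-based approximations:

```python
# Hypothetical mapping from basic emotions to the seven Action Units
# available on the robotic face (standard FACS combinations, approximated).
ROBOT_AUS = {
    1: "inner brow raiser", 4: "brow lowerer", 5: "upper lid raiser",
    12: "lip corner puller", 15: "lip corner depressor",
    25: "lip opening", 43: "eye closure",
}

EMOTION_TO_AUS = {
    "joy":      [12, 25],   # full FACS joy (AU 6+12) lacks AU 6 on the robot
    "sadness":  [1, 4, 15],
    "surprise": [1, 5, 25], # AU 2 and AU 26 unavailable, approximated
    "anger":    [4, 5],
}

def activate(emotion):
    for au in EMOTION_TO_AUS[emotion]:
        print(f"AU {au}: {ROBOT_AUS[au]}")

activate("joy")
```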
26. Realization of Social Lies for the Hanson Robokind
Social lies make up a considerable part of human conversation.
Social lies, as used for politeness reasons, are generally accepted.
Humans often show deceptive cues in their nonverbal behavior while lying.
Humanoid robots should show deceptive cues when telling social lies as well.
27. Deceptive Cues
Deceptive cues in human faces, according to Ekman and colleagues (a parameter sketch follows the list):
Micro-expressions: A false emotion is displayed, but the felt emotion is unconsciously expressed for a fraction of a second.
Masks: The felt emotion is intentionally masked by a non-corresponding facial expression.
Timing: The longer an expression is shown, the more likely it is to accompany a lie.
Asymmetry: Voluntarily shown facial expressions tend to be displayed in an asymmetrical way.
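A minimal sketch of how the timing and asymmetry cues could be parameterized for a robot smile; the keyframe interface and the numeric values are hypothetical and need not match the realization by Endrass et al.:

```python
# Hypothetical keyframes for a genuine vs. deceptive smile (AU 12 intensity
# per lip corner). Values illustrate the cues, not the robot's real API.
def smile_keyframes(deceptive=False):
    left = right = 1.0
    hold = 1.5                 # seconds at the apex of the expression
    if deceptive:
        right *= 0.6           # asymmetry cue: one lip corner pulled less
        hold = 4.0             # timing cue: expression held unnaturally long
    return [
        {"t": 0.0,        "au12_left": 0.0,  "au12_right": 0.0},    # onset
        {"t": 0.5,        "au12_left": left, "au12_right": right},  # apex
        {"t": 0.5 + hold, "au12_left": left, "au12_right": right},  # hold
        {"t": 1.0 + hold, "au12_left": 0.0,  "au12_right": 0.0},    # offset
    ]
```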
30. Real versus Faked Smile
[Images: a smile with blended anger (in the eye region) vs. a real smile]
31. Results of a Study
Faked smiles were easier to detect from the mouth region.
Robots with an asymmetrical smile were rated as significantly less happy than robots with a genuine smile.
Results are in line with research on virtual agents:
Rehm & André, AAMAS 2005:
• Agents that fake emotions are perceived as less trustworthy
and less convincing
• Subjects were not able to name reasons for their uneasiness
with the deceptive agent
B. Endrass, M. Häring, G. Akila, E. André: Simulating
Deceptive Cues of Joy in Humanoid Robots. IVA 2014:
174-177
33. Social Feedback Loop
[Diagram: social feedback loop: sensors capture the user's social behavior, behavior analysis feeds feedback generation, and an explicit hint on social behavior, together with the implicit social response, helps the user improve their social skills]
34. Behavior Analysis
Real-time multimodal analysis and classification of social signals (a rough sketch of the expressivity features follows):
Expressivity features (energy, openness, fluidity)
Facial expressions (smiles, lip biting)
Speech quality (speech rate, loudness, pitch)
Engagement, nervousness
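For illustration, rough versions of the three expressivity features over a window of skeleton frames; the exact definitions used in the system are not given in the talk, so the formulas and joint indices below are assumptions:

```python
import numpy as np

def expressivity(joints, fps=30):
    """joints: array (n_frames, n_joints, 3) of 3D joint positions.

    Assumed feature definitions:
    energy   - mean joint speed,
    openness - mean distance between the hand joints (indices assumed),
    fluidity - inverse of acceleration variability (smoother = higher).
    """
    vel = np.diff(joints, axis=0) * fps          # joint velocities
    acc = np.diff(vel, axis=0) * fps             # joint accelerations
    energy = np.linalg.norm(vel, axis=2).mean()
    openness = np.linalg.norm(joints[:, 7] - joints[:, 11], axis=1).mean()
    fluidity = 1.0 / (1.0 + np.linalg.norm(acc, axis=2).std())
    return {"energy": energy, "openness": openness, "fluidity": fluidity}
```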
35. Evaluation
Location: Parkschule in Stadtbergen, Germany
Participants: 20 pupils (10m/10f), 13-16 years old, job seeking; two practitioners
I. Damian, T. Baur, B. Lugrin, P. Gebhard, G. Mehlmann, E. André: Games are Better than Books: In-Situ Comparison of an Interactive Job Interview Game with Conventional Training. AIED 2015: 84-94
37. Experimental Setting
Day 1 - Pre-Interviews: 20 pupils, 2 practitioners; task: mock interviews; duration: ~10 min; 2x performance questionnaires (user + practitioner)
Day 2 - Training (Control): 10 pupils; task: reading a job interview guide; duration: ~10 min; user experience questionnaires
Day 2 - Training (TARDIS): 10 pupils; task: interaction with TARDIS + NovA; duration: ~10 min; user experience questionnaires
Day 3 - Post-Interviews: 20 pupils, 2 practitioners; task: mock interviews; duration: ~10 min; 2x performance questionnaires (user + practitioner)
38. Results
The overall behavior of the pupils who had interacted with TARDIS was rated significantly better by the job trainers than that of the pupils who had prepared for the job interview using books.
Only for the pupils who trained with TARDIS were we able to measure statistically significant improvements:
Their use of smiles appeared more appropriate.
Their use of eye contact appeared more appropriate.
They appeared significantly less nervous.
39. "[...] using the system, pupils seem to be highly motivated and able to learn how to improve their behaviour [...] they usually lack such motivation during class"
"[...] transports the experience into the youngster's own world"
"[...] makes the feedback be much more believable"
40. Augmenting Social Interactions
I. Damian, C.S. Tan, T. Baur, J. Schöning,
K. Luyten, E. André: Augmenting Social
Interactions: Realtime Behavioural
Feedback using Social Signal Processing
Techniques. CHI 2015: 565-574
42. Social Feedback Loop
[Diagram: variant of the social feedback loop: sensors capture social behavior, behavior analysis drives explicit feedback generation, and haptic feedback helps the user improve their social skills]
43. Study 1: Quantitative study in a controlled environment
15 speakers, 2 observers
Task: hold a 5-min presentation
2 conditions: system on, system off (within subjects; randomized order, 2 weeks apart)
Data acquisition: social signal recordings, questionnaires (speaker/observers)
44. Objective analysis of the recordings: the amount of inappropriate behaviour decreased when the system was on.
[Chart: % inappropriate behaviour (lower is better), system off vs. system on]
46. Study 2: Qualitative study in a real presentation setting
3 speakers, 13 observers
Task: present PhD progress
Data acquisition: semi-structured interviews
47. "[...] once I saw the feedback that I was talking too fast, I tried to adapt"
48. "[...] most of the time I did not perceive the system, only when I consciously looked at the feedback"
49. "It was a good feeling seeing everything [the icons] green ... it's like applause, or as if someone looks at you and nods. However, the green lasts longer than a nod [laughs]"
53. User Study
Users: 7 blind and visually impaired participants
Criteria: no nystagmus, unrestricted eye movements
Age  Gender  Visual impairment              Control method
68   male    Cataract                       center point
49   female  Cataract (early stage)         eye gaze
43   female  Optic atrophy                  eye gaze
73   male    Congenital blindness           center point
68   male    Optic nerve damage (accident)  center point
87   female  Macular degeneration           eye gaze
70   male    Retinal degeneration           eye gaze
54. Experiment
Scenario: Two videos of a speaker giving a monologue are shown.
Task: Rate the emotional state of the speaker.
Results: The videos were rated more accurately with the system on.
56. Overall Conclusions
Social and emotional sensitivity are key elements of human intelligence.
Social signals are particularly difficult to interpret, requiring us to understand and model their causes and consequences.
Offline applications start from overly optimistic recognition rates.
More work needs to be devoted to interactive online applications.
More information and software is available under:
http://www.hcm-lab.de
57. Current Work: Mobile Social Signal Processing
SSJ: Realtime Social Signal Processing for Java/Android
SSI – Unix/Android build compatibility