Please tweet to us using: @Affectiva and #EmoDev16
Key Websites:
Affectiva: http://affectiva.com
Developer Portal: http://developer.affectiva.com
Affectiva Demo: http://go.affectiva.com/affectiva-demo
Emotion AI Developer Day brought together the largest remote conference of emotion recognition developers in the world, including Affectiva staff, affective computing thought leaders, and companies offering complementary technologies. The event gave attendees opportunities to learn, as well as to help shape the future of Affectiva.
Find us on:
Facebook: https://www.facebook.com/Affectiva/
Twitter: https://twitter.com/Affectiva
LinkedIn: https://www.linkedin.com/company/affectiva_2
Jay Turcot - Emotion AI Developer Day 2016
1. @affectiva
Metrics & How Affectiva Software
works
Jay Turcot
Director of Applied AI, Affectiva Inc.
@pjturcot
Nov 16, 2016
Emotion AI Developer Day 2016
2. Outline
• Why the face? & FACS
• How the technology works
• Metrics and emotions
• How it’s used
• Digging deeper
• Computer vision pipeline
• Static image vs. video analysis
• Pose & Luminance
3. Why the face?
• Spontaneous
• Real time feedback
• Front-facing camera
• Transmits rich information
• Emotional state
• Intensity
• Human interpretable
4. Ekman and Friesen
Facial Action Coding System -- 1978
• Codifies facial expressions
• Action Units
• Independent movements of the face
• Numeric codes
• Associated with specific facial muscles
• 5 intensity ratings (A,B,C,D,E)
• Example:
• AU 1 – Inner Eyebrow Raise
• AU 9 – Nose wrinkle
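To make the coding concrete, here is a minimal sketch (not from the SDK) of how Action Units and their A–E intensity grades could be represented in code; the AU names and the small subset chosen are taken from the slides, everything else is an assumption for illustration.

```python
# Hypothetical representation of FACS Action Units and A-E intensity ratings.
from dataclasses import dataclass

# Small illustrative subset of Action Units mentioned in this deck.
ACTION_UNITS = {
    1: "Inner Brow Raiser",
    9: "Nose Wrinkler",
    15: "Lip Corner Depressor",
    17: "Chin Raiser",
}

# FACS uses five intensity grades, A (trace) through E (maximum).
INTENSITY_GRADES = ["A", "B", "C", "D", "E"]

@dataclass
class ActionUnitEvent:
    au: int           # numeric FACS code, e.g. 15
    intensity: str    # one of "A".."E"

    def describe(self) -> str:
        return f"AU{self.au}{self.intensity} ({ACTION_UNITS.get(self.au, 'unknown')})"

print(ActionUnitEvent(au=1, intensity="B").describe())  # -> AU1B (Inner Brow Raiser)
```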
5. FACS: example
• Common language:
• Not bad meme
• Frown (North America only!)
• FACS:
• 15E+17E
• Lip corner depressor (AU15)
• Chin raiser (AU17)
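A FACS code such as 15E+17E is just a "+"-separated list of AU numbers with optional A–E intensity letters, so it can be split mechanically. The helper below is a hypothetical parser for that notation, not part of any Affectiva API.

```python
import re

def parse_facs_code(code: str):
    """Split a FACS string such as '15E+17E' into (AU number, intensity) pairs."""
    pairs = []
    for token in code.split("+"):
        match = re.fullmatch(r"(\d+)([A-E]?)", token.strip())
        if match:
            pairs.append((int(match.group(1)), match.group(2) or None))
    return pairs

print(parse_facs_code("15E+17E"))  # -> [(15, 'E'), (17, 'E')]
```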
6. Our computer vision
algorithms identify key
landmarks on the face
Machine learning algorithms
analyze pixels in those regions
to classify facial expressions
Combinations of facial
expressions are mapped
to emotions
How it works
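The three steps above (locate landmarks, analyze pixels to classify expressions, map expressions to emotions) can be pictured as a simple per-frame pipeline. The sketch below is purely illustrative; the function and parameter names are invented stand-ins for whatever detector, landmark model, and classifiers an implementation would actually use.

```python
# Illustrative per-frame pipeline (not the actual Affectiva implementation):
# detect faces -> locate landmarks -> classify expressions from pixels -> map to emotions.

def analyze_frame(frame, face_detector, landmark_model, expression_classifiers, emotion_mapper):
    results = []
    for face_box in face_detector(frame):                  # 1. find faces
        landmarks = landmark_model(frame, face_box)        # 2. key facial landmarks
        expressions = {name: clf(frame, landmarks)          # 3. per-expression probabilities
                       for name, clf in expression_classifiers.items()}
        emotions = emotion_mapper(expressions)              # 4. combine into emotion metrics
        results.append({"box": face_box,
                        "expressions": expressions,
                        "emotions": emotions})
    return results
```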
14. Face Detection
• Initial detection of a face in an image
• Position and scale (bounding box)
• In the SDK
• Near-frontal & upright faces*
• Looking for faces in multiple positions
and multiple scales is time consuming
• For multi-face, face detection needs to
run periodically to scan for new
individuals
(until max faces is reached)
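As a rough illustration of the periodic re-scan described on this slide, the sketch below uses an OpenCV Haar cascade as a stand-in detector (the SDK's internal detector is not exposed here); the re-detection interval and the face cap are made-up values.

```python
# Sketch of periodic re-detection in multi-face mode, using OpenCV as a stand-in detector.
import cv2

REDETECT_EVERY_N_FRAMES = 30   # hypothetical scan interval
MAX_FACES = 4                  # hypothetical face cap

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def maybe_detect_new_faces(gray_frame, frame_idx, tracked_faces):
    """Scan for new faces only periodically, and only while under the face cap."""
    if frame_idx % REDETECT_EVERY_N_FRAMES != 0 or len(tracked_faces) >= MAX_FACES:
        return tracked_faces
    boxes = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    for box in boxes:
        if len(tracked_faces) < MAX_FACES:
            tracked_faces.append(tuple(box))   # (x, y, w, h) bounding box
    return tracked_faces
```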
15. Landmark tracking
• Locate 2D position of face landmarks
• 34 landmarks
• Allows head angle estimate
• In the SDK
• Once tracking, can follow face through
position/scale/orientation changes
• Tracking allows a wider range of non-frontal
angles & upside down
• Sudden position changes can disrupt tracking
• Confidence in tracking is checked at every frame
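A tracking loop of this kind might look roughly like the sketch below: fit landmarks frame by frame, check a confidence score on every frame, and fall back to full face detection when tracking is lost. All function names here are hypothetical placeholders.

```python
# Hypothetical landmark-tracking loop with per-frame confidence checking.

def bounding_box(landmarks):
    """Tight box around a set of (x, y) landmark points."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

def track_faces(frames, detect_face, fit_landmarks, tracking_confidence, min_confidence=0.5):
    face_box = None
    for frame in frames:
        if face_box is None:
            face_box = detect_face(frame)            # (re)initialize from a full scan
            if face_box is None:
                yield None
                continue
        landmarks = fit_landmarks(frame, face_box)   # e.g. 34 2D points
        if tracking_confidence(frame, landmarks) < min_confidence:
            face_box = None                          # sudden motion etc.: force re-detection
            yield None
        else:
            face_box = bounding_box(landmarks)       # follow position/scale/orientation changes
            yield landmarks
```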
16. Expression detection
• Detect a facial expression
• Probability of expression
(correlated with intensity)
• In the SDK
• Frame-by-frame analysis
• Detection relies on visual texture
information: shading, wrinkling
• Robustness gained through observing
thousands of real-world examples
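To illustrate texture-based scoring (as opposed to landmark geometry alone), here is a hedged sketch that computes HOG descriptors over a normalized face crop with scikit-image and feeds them to a pre-trained probabilistic classifier; this is a generic stand-in, not the SDK's actual features or model.

```python
# Sketch of frame-by-frame expression scoring from texture information.
from skimage.feature import hog

def expression_probability(face_crop_gray, classifier):
    """Return P(expression) for one normalized grayscale face crop (e.g. 96x96)."""
    # Texture descriptors capture shading and wrinkling, not just landmark geometry.
    features = hog(face_crop_gray, orientations=8,
                   pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return float(classifier.predict_proba(features.reshape(1, -1))[0, 1])
```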
17. Expression interpretation
• Interpret expressions
• Basic 6 emotions, contempt
• Valence, engagement
• Emoji
• In the SDK
• Emotional interpretations are available on a
frame-by-frame basis
• Great starting point for analysis
• You can always drill down into underlying
expressions
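Conceptually, the roll-up from expressions to emotion metrics is a mapping from expression probabilities to higher-level scores. The toy example below uses invented weights purely to show the shape of such a mapping; it is not Affectiva's actual formula.

```python
# Illustrative roll-up of expression probabilities into higher-level metrics.
# The weights are made up for the example, not Affectiva's actual mapping.

EMOTION_WEIGHTS = {
    "joy":     {"smile": 1.0},
    "sadness": {"lip_corner_depressor": 0.6, "inner_brow_raise": 0.4},
}

def interpret(expressions: dict) -> dict:
    emotions = {
        emotion: sum(weight * expressions.get(exp, 0.0) for exp, weight in weights.items())
        for emotion, weights in EMOTION_WEIGHTS.items()
    }
    # Valence: positive expressions push up, negative expressions push down (toy rule).
    valence = expressions.get("smile", 0.0) - expressions.get("brow_furrow", 0.0)
    return {**emotions, "valence": valence}

print(interpret({"smile": 0.9, "brow_furrow": 0.1}))
```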
19. Video vs. Images
• Expression detection requires a baseline
(neutral)
• With video, you can build an estimate of a
person’s neutral face
• Person specific appearance
• Improves accuracy & sensitivity
• With still image, no additional information is
available
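One simple way to picture a per-person neutral baseline is a slowly adapting running estimate that is subtracted from the raw per-frame score, as in the sketch below; the adaptation rate and clipping are assumptions for the example, not the SDK's method. With a single still image there is no history, so no such correction is possible.

```python
# Hypothetical per-person neutral baseline for video, assuming raw per-frame scores in [0, 1].

class NeutralBaseline:
    def __init__(self, rate: float = 0.01):
        self.rate = rate
        self.baseline = None   # running estimate of this person's neutral appearance

    def update(self, raw_score: float) -> float:
        if self.baseline is None:
            self.baseline = raw_score
        else:
            # Slowly adapt toward the person's typical (neutral) response.
            self.baseline += self.rate * (raw_score - self.baseline)
        # Report the deviation from neutral, clipped back to [0, 1].
        return max(0.0, min(1.0, raw_score - self.baseline))
```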
20. Pose & Luminance
• Metrics are robust to lighting
• Lighting of the face!
(and not lighting of the room)
• Pose is more challenging for some
expressions
• Beyond certain ranges we are no longer
confident in metrics and stop reporting
them
(despite tracking the face)
• Improvements in future releases as we
learn from more and more data
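The behavior of withholding metrics outside a supported pose range can be sketched as a simple gate on head pose; the angle thresholds below are invented for illustration and are not the SDK's actual limits.

```python
# Hypothetical gating of metrics by head pose: the face is still tracked, but expression
# metrics are withheld once the pose leaves the supported range.

POSE_LIMITS_DEG = {"yaw": 30.0, "pitch": 20.0, "roll": 45.0}   # invented thresholds

def gated_metrics(metrics: dict, head_pose: dict):
    """Return metrics only when the pose is within range; otherwise report None values."""
    in_range = all(abs(head_pose.get(axis, 0.0)) <= limit
                   for axis, limit in POSE_LIMITS_DEG.items())
    return metrics if in_range else {name: None for name in metrics}
```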
Summary:
Our system currently detects 20 independent facial actions, which can combine to make hundreds of different facial expressions
(optional) We make use of the Facial Action Coding System (FACS), and decouple the detection of what occurs on the face, from the interpretation
Talk Track:
Our system currently detects 20 independent facial actions, 15 of which are shown here. These actions are based on the Facial Action Coding System, or FACS, which codifies facial movements into these independent facial actions. You can think of them almost as a one-to-one mapping with facial muscle groups, so things like raising the inner and outer eyebrows, raising your upper lip, or closing your eyelids. These actions can be combined to make hundreds of different facial expressions, though in practice, only certain combinations happen naturally.
The reason we make use of FACS is that we want to be able to reliably detect and describe the expressions, what we see occurring on the face, and decouple that from the emotional interpretation of what that might mean.
Summary:
Face detector, landmark detector, analyze pixels/texture to detect expressions and roll those into inferences on emotional state
Landmarks are used to localize, but our technology analyzes pixels/shading
Expressions are detected on the face, and these expressions are rolled up into high-level interpretation metrics: emotions, valence
Talk track:
Our current product works by using a number of different computer vision technologies.
First, faces are detected [1].
For each face, we then localize a set of landmarks, which tells us more details about the position, scale, and orientation of the face. We don't use the landmark positions to determine expression, but rather to help localize in fine detail where regions of interest such as the eyes, mouth, and eyebrows are.
The image content, specifically the pixels/textures around these landmarks, is analyzed by the system. Looking at the pixels allows us to factor in appearance information such as wrinkling and shading, enabling the technology to detect more subtle expressions than would be possible by looking at the landmark positions alone. Models trained through machine learning are applied to recognize (classify) 20 different independent facial actions, reported on every frame of the video. These independent actions can combine to form hundreds of different facial expressions.
The facial actions are then mapped into a number of different higher-level interpretive outputs: emotions, valence, arousal, and even emoji.
Notes:
[1] – Face detection is a commonplace technology that is even built into digital cameras and smartphones now.
Summary:
Our system currently detects 20 independent facial actions, which can combine to make hundreds of different facial expressions
(optional) We make use of the Facial Action Coding System (FACS), and decouple the detection of what occurs on the face, from the interpretation
Additional notes:
- The reason the FACS mapping isn't strictly 1-to-1 with muscle groups is that some actions engage two muscle groups simultaneously, such as the eyebrow lower (AU04), which both lowers and draws the eyebrows in. The lowering muscle is also engaged in AU09 when wrinkling the nose.
- An example of the benefit of decoupling the expression from the emotion: if you look at a typical taxonomy of emotions, the only reason a person would lower their eyebrows is that they are angry, which we know isn't the case. In reality there are a number of reasons people lower their eyebrows, for example during cognitive load, that is, confusion and concentration.
- Also, true emotion detection will likely be context specific, and being able to robustly describe what happens on the face simplifies the interpretation.
Summary:
Expressions are rolled up into higher level emotions
(optional) We make use of the Facial Action Coding System (FACS), and decouple the detection of what occurs on the face, from the interpretation
Summary:
Another way our individual expression detection results are combined is by mapping them to the nearest emoji
Talk Track:
Also, our SDK reports back the closest matching emoji to a person's facial expression. This again highlights how the system is capable of detecting a wide variety of expressions that occur on a person's face.
Additional notes:
- Emoji does a better job of describing facial expression than language
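One way to think about the emoji output is as a nearest-neighbor lookup against prototype expression profiles, as in the toy sketch below; the prototypes and distance measure are illustrative only and do not reflect the SDK's actual mapping.

```python
# Toy nearest-emoji lookup: pick the emoji whose prototype expression profile is closest
# to the observed one. Prototypes and distance are invented for the example.

EMOJI_PROTOTYPES = {
    "😀": {"smile": 1.0, "brow_furrow": 0.0},
    "😞": {"smile": 0.0, "lip_corner_depressor": 1.0},
    "😐": {},
}

def nearest_emoji(expressions: dict) -> str:
    def distance(proto: dict) -> float:
        keys = set(proto) | set(expressions)
        return sum((proto.get(k, 0.0) - expressions.get(k, 0.0)) ** 2 for k in keys)
    return min(EMOJI_PROTOTYPES, key=lambda e: distance(EMOJI_PROTOTYPES[e]))

print(nearest_emoji({"smile": 0.8}))  # -> 😀
```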
Summary:
We also detect gender, an estimate of age, and whether they are wearing glasses
Not our primary focus, but some partners have found these estimates useful
Talk Track:
In addition to detecting emotional states, we can also report estimates of basic demographic and appearance information about an individual, such as their age and gender.
Additional notes:
- Gender estimation is currently about 90% accurate; higher accuracy is on the roadmap for a future SDK release.
- Age estimation is done in discrete age bands.
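Reporting age in discrete bands amounts to bucketing a continuous estimate; the band edges in this small sketch are assumptions for illustration, not the SDK's actual ranges.

```python
# Illustrative discretization of a continuous age estimate into bands (edges are invented).

AGE_BANDS = [(0, 17, "under 18"), (18, 24, "18-24"), (25, 34, "25-34"),
             (35, 44, "35-44"), (45, 64, "45-64"), (65, 200, "65+")]

def age_band(estimated_age: float) -> str:
    for low, high, label in AGE_BANDS:
        if low <= estimated_age <= high:
            return label
    return "unknown"

print(age_band(29.4))  # -> 25-34
```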
Summary:
Models trained using machine learning
Showcase that lots of data helps model performance, for both simple and complex models (with diminishing returns)
Our team is constantly researching new algorithms to better leverage more data and push the accuracy even higher
Deep learning can really push the upper limit
Talk Track:
The core technology that drives the expression detection is called machine learning. The goal of machine learning research is to build models that learn from many examples (a process called training). As you feed the models more examples, for instance of different people smiling at varying intensities and in varying lighting conditions, you see the accuracy of the models increase. In this way we leverage all the data that Affectiva has collected over the years from around the globe to build extremely accurate and reliable models. Our research team is always exploring new computer vision and machine learning techniques that can benefit from additional data and push our accuracy even higher. For example, deep learning, a very popular area of research, can push accuracies extremely high but requires a lot of data to do it.
Additional notes:
This specific image explores the role of training data in training AU04 using: (1) a linear SVM, (2) an RBF-approximation SVM, and (3) an RBF SVM.
You can see model accuracy increase with additional data, but there are diminishing returns.
Extremely fast linear models trained with lots of data
Paper: Facial Action Unit Detection using Active Learning and an Efficient Non-Linear Kernel Approximation
http://www.affectiva.com/wp-content/uploads/2016/02/Facial-Action-Unit-Detection-using-Active-Learning-and-an-Efficient-Non-Linear-Kernel-Approximation.pdf
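The three model families named in the note above (linear SVM, RBF-approximation SVM, full RBF SVM) can be reproduced in spirit with scikit-learn, trained on increasing amounts of data to see the diminishing-returns effect. The sketch below uses synthetic data; in the real system the inputs would be texture features and AU labels, and the Nystroem transformer plus a linear SVM merely stands in for the paper's kernel-approximation approach.

```python
# Sketch: compare a linear SVM, an approximate-RBF SVM, and a full RBF SVM as training data grows.
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))                       # synthetic "features"
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)  # synthetic "AU present" labels
X_test = rng.normal(size=(500, 64))
y_test = (X_test[:, 0] + 0.5 * X_test[:, 1] ** 2 > 0.5).astype(int)

models = {
    "linear SVM": LinearSVC(),
    "RBF-approximation SVM": make_pipeline(Nystroem(n_components=100), LinearSVC()),
    "RBF SVM": SVC(kernel="rbf"),
}

for n in (200, 500, 1000, 2000):   # more training data helps, with diminishing returns
    for name, model in models.items():
        model.fit(X[:n], y[:n])
        print(n, name, round(model.score(X_test, y_test), 3))
```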
Summary:
In addition to basic accuracy, we explicitly check how robust our classifiers are to environmental conditions like lighting and head angles
Talk Track:
In addition to basic accuracy, we explicitly check how robust our classifiers are to environmental conditions like lighting and head angles
Additional notes:
- Let's look at some examples
- The face is normalized to 96x96 pixels
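Checking robustness explicitly means reporting accuracy per environmental bucket (for example a head-angle range or a face-luminance level) rather than a single overall number. The helper below is a generic sketch of that bookkeeping; the field names and the bucketing function are assumptions, not Affectiva's evaluation harness.

```python
# Sketch of a robustness check: accuracy reported per environmental condition bucket.
from collections import defaultdict

def accuracy_by_condition(samples, predict, bucket_of):
    """samples: dicts with a normalized 'face' crop (e.g. 96x96), a 'label', and condition info."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        bucket = bucket_of(s)                        # e.g. "yaw 20-40 deg, dim lighting"
        total[bucket] += 1
        correct[bucket] += int(predict(s["face"]) == s["label"])
    return {b: correct[b] / total[b] for b in total}
```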