Fully Automatic Facial Expression Analysis

Past work on automatic analysis of facial expressions has focused mostly on detecting prototypic expressions of basic emotions such as happiness and anger. The method proposed here enables the detection of a much larger range of facial behavior by recognizing the facial muscle actions (action units, AUs) that compound expressions. AUs are interpretation agnostic, leaving the inference about conveyed intent to higher order decision making (e.g., emotion recognition). The proposed fully automatic method not only allows the recognition of 22 AUs but also explicitly models their temporal characteristics (i.e., sequences of temporal segments: neutral, onset, apex, and offset). To do so, it uses a facial point detector based on Gabor-feature-based boosted classifiers to automatically localize 20 facial fiducial points. These points are tracked through a sequence of images using a method called particle filtering with factorized likelihoods (PFFL). To encode AUs and their temporal activation models from the tracking data, it applies a combination of GentleBoost, support vector machines (SVMs), and hidden Markov models (HMMs). We attain an average AU recognition rate of 95.3% when tested on a benchmark set of deliberately displayed facial expressions and 72% when tested on spontaneous expressions.

FACIAL EXPRESSIONS synchronize dialogue by means of brow raising and nodding, clarify the content and intent of what is said by means of lip reading and emblems like a wink, signal comprehension or disagreement, and convey messages about cognitive, psychological, and affective states. Therefore, attaining machine understanding of facial behavior would be highly beneficial for fields as diverse as computing technology, medicine, and security, in applications like ambient interfaces, empathetic tutoring, interactive gaming, research on pain and depression, health support appliances, monitoring of stress and fatigue, and deception detection. Because of this practical importance and the theoretical interest of cognitive and medical scientists, machine analysis of facial expressions has attracted the interest of many researchers in computer vision.

Two main streams in the current research on automatic analysis of facial expressions consider facial affect (emotion) detection and facial muscle action detection. These streams stem directly from the two major approaches to facial expression measurement in psychological research: message judgment and sign judgment. The aim of the former is to infer what underlies a displayed facial expression, such as affect or personality, while the aim of the latter is to describe the "surface" of the shown behavior, such as facial movement or facial component shape.
Thus, a frown can be judged as "anger" in a message-judgment approach and as a facial movement that lowers and pulls the eyebrows closer together in a sign-judgment approach. Most facial expression analyzers developed so far adhere to the message judgment stream and attempt to recognize a small set of prototypic emotional facial expressions, such as the six basic emotions proposed by Ekman.
In sign judgment approaches, a widely used method for manual labeling of facial actions is the Facial Action Coding System (FACS). FACS associates facial expression changes with the actions of the muscles that produce them. It defines 9 different action units (AUs) in the upper face, 18 in the lower face, and 5 AUs that cannot be classified as belonging to either the upper or the lower face. Additionally, it defines so-called action descriptors: 11 for head position, 9 for eye position, and 14 additional descriptors for miscellaneous actions (for examples, see Fig. 1).
Fig. 1    Examples of upper and lower face AUs defined in the FACS.
AUs are considered to be the smallest visually discernible facial movements. FACS also provides the rules for the recognition of AUs' temporal segments (onset, apex, and offset) in a face video. Using FACS, human coders can manually code nearly any anatomically possible facial expression, decomposing it into the specific AUs and their temporal segments that produced the expression. As AUs are independent of any interpretation, they can be used as the basis for any higher order decision-making process, including the recognition of basic emotions, cognitive states like (dis)agreement and puzzlement, psychological states like pain, and social signals like emblems, regulators, and illustrators. Hence, AUs are extremely suitable as midlevel parameters in an automatic facial behavior analysis system.

The focus of the research efforts in the field was first on automatic recognition of AUs in either static face images or face image sequences picturing facial expressions produced on command. One of the main criticisms that these works received from both cognitive and computer scientists is that the methods are not applicable in real-life situations, where subtle changes in facial expression typify the displayed facial behavior rather than the exaggerated AU activations typical of deliberately displayed facial expressions.
Automatic recognition of facial expression configuration has been the main focus of the research efforts in the field. However, both the configuration and the dynamics of facial expressions are important for the interpretation of human facial behavior. Facial expression temporal dynamics are essential for the categorization of complex psychological states like various types of pain and mood. They are also the key parameter in the differentiation between posed and spontaneous facial expressions. Some of the past work in the field has used aspects of the temporal dynamics of facial expression, such as the speed of a facial point displacement or the persistence of facial parameters over time. However, this was mainly done either to increase the performance of facial expression analyzers or to report on the intensity of (a component of) the shown facial expression, not to explicitly analyze the properties of facial actions' temporal dynamics. Note, in particular, that this work does not report on the explicit analysis of the temporal segments of AUs (e.g., the duration and the speed of the onset and offset of the actions). This led to studies on the automatic segmentation of AU activation into temporal segments (neutral, onset, apex, and offset) in frontal- and profile-view face videos. The focus of the research in the field has also started to shift toward automatic AU recognition in spontaneous facial expressions.

The first step in any facial information extraction process is face detection, i.e., the identification of all regions in the scene that contain a human face. The second step in facial expression analysis is to extract geometric features (facial points and shapes of facial components) and/or appearance features (descriptions of the texture of the face, such as wrinkles and furrows).

The most commonly employed face detector in automatic facial expression analysis is the real-time Viola-Jones face detector. The Viola-Jones face detector consists of a cascade of classifiers. Each classifier employs integral-image filters, which can be computed very fast at any location and scale. This is essential to the speed of the detector. For each stage in the cascade, a subset of features is chosen using a feature selection procedure. The C++ code of the face detector runs at about 500 Hz on a 3.2-GHz Pentium 4.
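The speed of these filters rests on the integral image, in which the sum of any rectangular region of pixels can be computed with at most four array lookups, regardless of the rectangle's size. A minimal sketch of the idea (an illustration, not the original C++ implementation):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using four lookups on the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]          # subtract the strip above the rectangle
    if left > 0:
        total -= ii[bottom, left - 1]        # subtract the strip left of the rectangle
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]       # add back the doubly subtracted corner
    return total
```

A Haar-like feature is then just a difference of two or three such rectangle sums, which is why every feature costs the same constant time at any scale.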

Methods for facial feature point detection can be classified as either texture-based methods (modeling the local texture around a given facial point) or shape-based methods (which regard all facial points as a shape that is learned from a set of labeled faces). A typical texture-based method uses log-Gabor filters. Combined texture- and shape-based methods typically employ AdaBoost to estimate, for each pixel in an input image, how likely it is to be a facial feature point, and then use a shape model as a filter to select the most likely positions of the feature points.
Although these detectors can be used to localize the 20 facial characteristic points illustrated in Fig. 3, none performs the detection with high accuracy. They usually regard the localization of a point as successful if the distance between the automatically labeled point and the manually labeled point is less than 30% of the true interocular distance DI (the distance between the eyes, more specifically between the inner eye corners). However, this is an unacceptably large error in the case of facial expression analysis, since subtle changes in the facial expression will be missed due to errors in facial point localization.
We therefore adopt the fiducial facial point detector proposed by Vukadinovic and Pantic. When used to initialize a point tracking algorithm, this method is accurate enough to allow geometric-feature-based expression recognition. The outline of the developed fully automated method for the detection of the target 20 facial characteristic points is illustrated in Fig. 3. The locations of the facial components are approximated by analyzing the histograms of the relevant regions. Based on these approximate locations, a search region is defined for every point to detect. In these regions of interest (ROIs), a sliding-window search is performed.
At each location in the ROI, Gabor-filter responses are calculated and fed into the GentleBoost-based point detectors. The location with the highest output determines the predicted point location. Typical results of this algorithm are illustrated in Fig. 5.2. The point detection algorithm is tolerant to changes in illumination as long as they remain locally constant. If illumination is uneven in the direct neighborhood of a facial point, the point detector may fail for that point.
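As a rough illustration of the kind of features involved, the sketch below builds a 2-D Gabor kernel (a Gaussian envelope multiplied by a sinusoidal carrier) and evaluates a small bank of orientations on an image patch. The function names and parameter values are illustrative assumptions, not those of the actual detector:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a 2-D Gabor filter: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by the filter orientation theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_responses(patch, wavelength=4.0, sigma=2.0, n_orient=4):
    """Responses of one patch to a small bank of Gabor orientations."""
    ks = patch.shape[0]
    return np.array([
        float((patch * gabor_kernel(ks, wavelength, th, sigma)).sum())
        for th in np.linspace(0, np.pi, n_orient, endpoint=False)
    ])
```

In a sliding-window detector, such a response vector would be computed at every candidate location in the ROI and passed to the boosted point classifier.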

After the fiducial facial points are found in the first frame, we track their positions through the entire image sequence. Standard particle filtering techniques are commonly used for facial point tracking in facial expression analysis.
The tracking scheme that we adopt is based on particle filtering. The main idea behind particle filtering is to maintain a set of solutions that form an efficient representation of the conditional probability p(α|Y), where α is the state of a temporal event to be tracked and Y = {y1, . . . , yt} is the set of noisy observations up to the current time instant t. By maintaining a set of solutions instead of a single estimate, particle filtering is able to track multimodal conditional probabilities p(α|Y), and it is therefore robust to missing and inaccurate data and particularly attractive for estimation and prediction in nonlinear, non-Gaussian systems. In the particle filtering framework, the a posteriori probability p(α|Y) is represented by a weighted set of particles.
The PFFL tracking scheme assumes that the state α can be partitioned into substates αi such that α = (α1, . . . , αn). At each frame of the input image sequence, we obtain a particle-based representation of p(α|Y) in two stages. At the first stage of the PFFL tracking scheme, each facial point i is tracked for one frame independently of the other facial points. At the second stage, interdependences between the substates are taken into account by means of a scheme that samples complete particles from the proposal distribution g(α), which is defined as the product of the posteriors of each αi given the observations, i.e., g(α) = ∏i p(αi|Y). Finally, each of the particles produced in this way is reweighted by evaluating the joint probability p(α|α−), where α− is the state at the previous time instant, so that the set of particles with their new weights represents the a posteriori probability p(α|Y).
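The generic predict-weight-resample cycle that underlies any particle filter can be sketched for a 1-D state as follows. This is a textbook illustration under assumed Gaussian motion and observation models, not the full PFFL scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation, motion_std=1.0, obs_std=2.0):
    """One predict / weight / resample cycle for a 1-D state."""
    n = len(particles)
    # predict: propagate each particle through a random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, n)
    # weight: Gaussian observation likelihood
    weights = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    weights /= weights.sum()
    # resample: draw particles proportionally to their weights
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

# toy run: a broad initial guess concentrates around repeated observations near 1.1
particles = rng.normal(0.0, 5.0, 200)
weights = np.full(200, 1.0 / 200)
for obs in [1.0, 1.2, 1.1]:
    particles, weights = particle_filter_step(particles, weights, obs)
```

Because the whole particle set is carried forward, a multimodal posterior (e.g., two plausible point locations) survives the update, which a single-estimate tracker could not represent.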

Contractions of the facial muscles alter the shape and location of the facial components. Some of these changes are observable from the movements of the 20 facial points that we track in the input sequence. To classify the movements of the tracked points in terms of AUs and their temporal activation models, changes in the position of the points over time are first represented as a set of midlevel parameters.

Before the midlevel parameters can be calculated, all rigid head motion in the input sequence must be eliminated. We register each frame of the input image sequence with the first frame using an affine transformation T1 based on three referential points: the nasal spine point and the inner corners of the eyes (see Fig. 3). We use these points as the referential points because contractions of the facial muscles do not affect them.
Interperson variations in the size and location of the facial points are minimized by applying an affine transformation T2 to every tracked facial point in each frame. T2 is obtained by comparing the locations of the referential points of a given subject in the first frame with the corresponding points in a selected expressionless "standard" face. Thus, after tracking any of the 20 characteristic facial points in an input sequence containing k frames, we obtain a set of coordinates (p1, . . . , pk) corresponding to the locations of the pertinent point p in each of the k frames. The registered coordinates are then obtained by applying both transformations:

p_t^r = T2(T1(p_t)),  t = 1, . . . , k.
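Since three non-collinear correspondences determine an affine transformation exactly, T1 and T2 can each be estimated by solving a small linear system over the referential points. A sketch, with illustrative coordinates standing in for the nasal spine and inner eye corners:

```python
import numpy as np

def affine_from_points(src, dst):
    """Solve for A (2x2) and b (2,) with dst = A @ src + b from three correspondences."""
    # each correspondence [x, y] contributes one row of the linear system
    M = np.hstack([src, np.ones((len(src), 1))])        # shape (3, 3)
    params, *_ = np.linalg.lstsq(M, dst, rcond=None)    # shape (3, 2): A^T stacked on b
    return params[:2].T, params[2]

def apply_affine(A, b, pts):
    """Transform an (n, 2) array of points."""
    return pts @ A.T + b
```

With the transform recovered from the three rigid referential points, applying it to all 20 tracked points removes the rigid motion while preserving the non-rigid, expression-related displacements.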

Using this registration technique, four out of the six degrees of freedom of head movement can be dealt with, and the remaining two can be handled partially. The tracked points returned by the PFFL tracker contain random noise due to the probabilistic nature of particle filtering. Therefore, we apply a temporal smoothing filter to arrive at a registered set of points Ps that contains less noise:

p_s(t) = (1/(2ws + 1)) · Σ_{i=t−ws}^{t+ws} p_r(i)

where t denotes the frame number and ps and pr are elements of the collections Ps and Pr, respectively. The window side-lobe size ws of the temporal smoothing filter was chosen after visual inspection of the smoothed tracker's output.
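The exact filter shape is not restated here, so the following sketch assumes a simple symmetric moving average with window half-size ws, clamped at the sequence boundaries:

```python
import numpy as np

def smooth_track(track, ws=2):
    """Symmetric moving average over frames; track has shape (k, 2) for one point."""
    k = len(track)
    smoothed = np.empty_like(track, dtype=float)
    for t in range(k):
        lo, hi = max(0, t - ws), min(k, t + ws + 1)   # clamp the window at the ends
        smoothed[t] = track[lo:hi].mean(axis=0)
    return smoothed
```

A linear averaging filter like this attenuates frame-to-frame jitter while leaving slow, expression-related point trajectories essentially unchanged.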

AU recognition from input image sequences is based on SVMs. SVMs are very well suited for this task because the high dimensionality of the feature space (representation space) does not affect the training time, which instead depends only on the number of training examples. Furthermore, SVMs generalize well even when few training data are provided. However, note that classification performance decreases when the dimensionality of the feature set is far greater than the number of samples available in the training set.
The feature selection is implemented as follows. For every d ∈ D, where D is the set of 22 AUs that our system can recognize in an input sequence, we apply GentleBoost, resulting in a set of selected features Gd. To detect the 22 AUs occurring alone or in combination in the current frame of the input sequence, we train a separate SVM to detect the activity of every AU. More specifically, we use Gd to train and test the SVM classifier for the relevant AU. An advantage of feature selection by a boosting algorithm is that it tries to optimize the actual classification problem instead of merely reducing the overall variability in the data.
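A toy version of GentleBoost feature selection with weighted regression stumps can be sketched as follows; thresholding each feature at its median is a simplifying assumption made here for brevity:

```python
import numpy as np

def gentleboost_select(X, y, n_rounds=5):
    """Toy GentleBoost feature selection; X is (n, d), y is in {-1, +1}.

    Each round fits, for every feature j, a weighted least-squares stump
    f(x) = a * [x_j > theta] + b and keeps the feature with the lowest
    weighted squared error; the sample weights are then updated.
    """
    n, d = X.shape
    w = np.ones(n) / n
    selected = []
    for _ in range(n_rounds):
        best_err, best_j, best_f = np.inf, None, None
        for j in range(d):
            theta = np.median(X[:, j])
            mask = X[:, j] > theta
            wn, wm = w[~mask].sum(), w[mask].sum()
            b = (w[~mask] * y[~mask]).sum() / max(wn, 1e-12)
            a = (w[mask] * y[mask]).sum() / max(wm, 1e-12) - b
            f = a * mask + b
            err = (w * (y - f) ** 2).sum()
            if err < best_err:
                best_err, best_j, best_f = err, j, f
        selected.append(best_j)
        w = w * np.exp(-y * best_f)      # GentleBoost weight update
        w /= w.sum()
    return selected

# toy data in which only feature 3 carries the label
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 3] > 0, 1.0, -1.0)
```

The indices returned play the role of Gd: only these features would then be fed to the SVM for the corresponding AU.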
To encode the temporal segments of the AUs found to be activated in the input image sequence, we proceed as follows. An AU can be in one of the following four phases:
· The onset phase, where the muscles are contracting and the appearance of the face changes as the facial action grows stronger.
· The apex phase, where the facial action is at its peak and there are no more changes in facial appearance due to this particular facial action.
· The offset phase, where the muscles are relaxing and the face returns to its neutral appearance.
· The neutral phase, where there are no signs of activation of the particular facial action.
Here, we use two approaches to detect an AU temporal model.
· mc-SVMs: In the first approach, we employ a one-versus-one strategy for multiclass SVMs (mc-SVMs). For each AU and every pair of temporal segments, we train a separate subclassifier specialized in the discrimination between the two temporal segments. This results in |C|(|C| − 1)/2 subclassifiers that need to be trained, with C = {neutral, onset, apex, offset} and |·| denoting the cardinality of a set. For each frame t of an input sequence, every subclassifier returns a prediction of the class c ∈ C, and a majority vote is cast to determine the final output ct of the mc-SVM for the current frame t. For each classifier separating classes ci, cj ∈ C, i ≠ j, we apply GentleBoost, resulting in a set of selected features Gi,j. We use Gi,j to train the subclassifier specialized in discriminating between the two temporal segments in question.
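The voting step can be sketched as follows, with trivial stand-in functions in place of the trained pairwise SVMs:

```python
from itertools import combinations
from collections import Counter

CLASSES = ["neutral", "onset", "apex", "offset"]

def ovo_predict(pairwise, x):
    """Majority vote over the |C|(|C|-1)/2 pairwise subclassifiers.

    `pairwise[(ci, cj)]` is a function returning the winning class for frame x.
    """
    votes = Counter(pairwise[(ci, cj)](x) for ci, cj in combinations(CLASSES, 2))
    return votes.most_common(1)[0][0]

# stand-ins for trained SVMs: every pair involving "onset" votes "onset",
# any other pair votes for its first class
pairwise = {}
for ci, cj in combinations(CLASSES, 2):
    winner = "onset" if "onset" in (ci, cj) else ci
    pairwise[(ci, cj)] = (lambda w: (lambda x: w))(winner)
```

With |C| = 4, six subclassifiers vote per frame; in the toy setup above, "onset" collects three votes and wins.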
· Hybrid SVM-HMM: In the second approach, we propose to apply hybrid SVM-HMMs to the problem of AU temporal model detection. Traditionally, HMMs have been used very effectively to model time in classification problems. However, while the sequence of the temporal phases of a facial action over time can be represented very well by HMMs, the HMM suffers from poor discrimination between temporal phases at a single moment in time. The emission probabilities, which are computed for each frame of an input video for the HMM hidden states, are normally modeled by fitting Gaussian mixtures on the features. These Gaussian mixtures are fitted using likelihood maximization, which assumes the correctness of the models (i.e., the feature values should follow a Gaussian distribution) and thus suffers from poor discrimination. Moreover, it results in mixtures trained to model each class and not to discriminate one class from the other.
SVMs, on the other hand, are not suitable for modeling time, but they discriminate extremely well between classes. Using them to compute the emission probabilities might very well result in improved recognition. We therefore again train one-versus-one SVMs to distinguish the temporal phases neutral, onset, apex, and offset, just as described for the mc-SVMs above. We then use the output of the component SVMs to compute the emission probabilities. In this way, we arrive at a hybrid SVM-HMM system. This approach has previously been applied with success to speech recognition.
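At decoding time, the hybrid system amounts to finding the most likely phase sequence given per-frame emission scores, e.g., by Viterbi decoding over the four phase states. A generic sketch, where the transition and emission values are illustrative rather than learned:

```python
import numpy as np

STATES = ["neutral", "onset", "apex", "offset"]

def viterbi(log_emis, log_trans, log_init):
    """Most likely state sequence; log_emis has shape (T, n_states)."""
    T, n = log_emis.shape
    delta = log_init + log_emis[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (n, n): previous -> next state
        back[t] = scores.argmax(axis=0)              # best predecessor for each state
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack from the last frame
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]
```

In the hybrid system, `log_emis[t]` would come from the SVM-derived posteriors for frame t, while `log_trans` encodes the plausible phase order (e.g., onset tends to be followed by apex).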
The HMM consists of four states, one for each temporal phase. From each SVM we obtain, using Platt's method, the pairwise class probabilities μij ≡ p(ci|ci or cj, x) of the class ci given the feature vector x and given that x belongs to either ci or cj. These pairwise probabilities are then transformed into the posterior probabilities p(ci|x), which serve as the emission probabilities of the HMM.
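The exact coupling formula is not restated here; one common rule (due to Price et al.) estimates p(ci|x) ≈ 1/(Σ_{j≠i} 1/μij − (K − 2)), followed by normalization. A sketch under that assumption:

```python
import numpy as np

def couple_pairwise(mu):
    """Couple pairwise probabilities into a posterior over K classes.

    mu[i, j] = p(c_i | c_i or c_j, x) for i != j (Platt-scaled SVM outputs).
    """
    K = mu.shape[0]
    p = np.empty(K)
    for i in range(K):
        others = [mu[i, j] for j in range(K) if j != i]
        p[i] = 1.0 / (sum(1.0 / m for m in others) - (K - 2))
    return p / p.sum()
```

When the pairwise values are mutually consistent (μij = pi/(pi + pj)), this rule recovers the underlying posterior exactly; otherwise it gives a reasonable approximation.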

To detect the six basic emotions, we use the same set of features described in the midlevel parametric representation. Here, we train a multiclass combination of GentleBoost, support vector machines, and hidden Markov models (MCGentleSVMs) with a structure similar to that of the AU temporal segment detector. Again, we train one-versus-one GentleBoost Support Vector Machines (GentleSVMs) to distinguish between pairs of emotions. Because the neutral expression is also present in every video, we also learn classifiers that distinguish between each emotion and the neutral expression. We thus learn 21 binary classifiers and use their outputs to determine the emission probabilities of the HMM. In contrast with the AU temporal segment detector, we do not use the emotions as the state variables; instead, we learn the optimal number of states.

We used four different data sets: 
The CK-db of volitional facial displays.
The MMI facial expression database (MMI-db).
The DS118 data set of spontaneous facial displays.
The triad data set of spontaneous human behavior.

The CK-db was developed for research on the recognition of the six basic emotions and their corresponding AUs. The database contains over 2000 near-frontal-view videos of facial displays. It is currently the most commonly used database for studies on automatic facial expression analysis. All facial displays were made on command, and the recordings were made under constant lighting conditions. Two certified FACS coders provided AU coding for all videos.

The MMI facial expression database has five parts. Two FACS experts AU-coded the database. The two coders made the final decisions on AU coding by consensus, and this final AU coding was used for the study presented in this paper.

The DS118 data set was collected to study facial expression in patients with heart disease. Spontaneous facial displays were video recorded during a clinical interview that elicited AUs related to disgust, contempt, and other negative emotions, as well as smiles. The facial actions displayed in the data are often very subtle. Due to confidentiality issues, this FACS-coded data set is not publicly available. Only the AU coding made by human observers and the tracking data were made available to us.

The triad data set was collected to study the effects of alcohol on the behavior of so-called social drinkers. The recordings are long (over 15 min) and contain displays of diverse facial and bodily gesturing. No AU coding of the data was made publicly available.


  • It would be highly beneficial for fields as diverse as computing technology, medicine, and security.
  • It has applications like ambient interfaces, empathetic tutoring, interactive gaming, research on pain and depression, health support appliances, monitoring of stress and fatigue, and deception detection.


  • It enables the explicit analysis of the temporal segments of action units (AUs).
  • It is applicable in real-life situations.
  • It is highly accurate.

It can only recognize facial expressions as long as the face is viewed from a pseudofrontal view. If the head has an out-of-plane rotation greater than 20°, the system will fail.

A system that can recognize facial expressions even when the face is not viewed from a pseudofrontal view, that is, when the head has an out-of-plane rotation greater than 20°, remains to be developed.

Accurate, fully automatic facial expression analysis would have many real-world applications. In this work, we have shown not only that fully automatic, highly accurate AU activation detection based on geometric features is possible, but also that the four temporal phases of an AU can be detected with high accuracy and that geometric features are very well suited for this task.
