Fully Automatic Facial Expression Analysis

Past work on automatic analysis of facial expressions has focused mostly on detecting prototypic expressions of basic emotions such as happiness and anger. The method proposed here enables the detection of a much larger range of facial behavior by recognizing the facial muscle actions (action units, AUs) that compound expressions. AUs are interpretation agnostic, leaving the inference about conveyed intent to higher order decision making (e.g., emotion recognition). The proposed fully automatic method not only allows the recognition of 22 AUs but also explicitly models their temporal characteristics (i.e., sequences of temporal segments: neutral, onset, apex, and offset). To do so, it uses a facial point detector based on Gabor-feature-based boosted classifiers to automatically localize 20 facial fiducial points. These points are tracked through a sequence of images using a method called particle filtering with factorized likelihoods (PFFL). To encode AUs and their temporal activation models from the tracking data, it applies a combination of GentleBoost, support vector machines (SVMs), and hidden Markov models (HMMs). We attain an average AU recognition rate of 95.3% when tested on a benchmark set of deliberately displayed facial expressions and 72% when tested on spontaneous expressions.

FACIAL EXPRESSIONS synchronize dialogue by means of brow raising and nodding, clarify the content and intent of what is said by means of lip reading and emblems like a wink, signal comprehension or disagreement, and convey messages about cognitive, psychological, and affective states. Therefore, attaining machine understanding of facial behavior would be highly beneficial for fields as diverse as computing technology, medicine, and security, in applications like ambient interfaces, empathetic tutoring, interactive gaming, research on pain and depression, health support appliances, monitoring of stress and fatigue, and deception detection. Because of this practical importance and the theoretical interest of cognitive and medical scientists, machine analysis of facial expressions has attracted the interest of many researchers in computer vision.

Two main streams in the current research on automatic analysis of facial expressions consider facial affect (emotion) detection and facial muscle action detection. These streams stem directly from the two major approaches to facial expression measurement in psychological research: message judgment and sign judgment. The aim of the former is to infer what underlies a displayed facial expression, such as affect or personality, while the aim of the latter is to describe the "surface" of the shown behavior, such as facial movement or facial component shape.
Thus, a frown can be judged as "anger" in a message-judgment approach and as a facial movement that lowers and pulls the eyebrows closer together in a sign-judgment approach. Most facial expression analyzers developed so far adhere to the message judgment stream and attempt to recognize a small set of prototypic emotional facial expressions, such as the six basic emotions proposed by Ekman.
In sign judgment approaches, a widely used method for manual labeling of facial actions is the Facial Action Coding System (FACS). FACS associates facial expression changes with the actions of the muscles that produce them. It defines 9 different action units (AUs) in the upper face, 18 in the lower face, and 5 AUs that cannot be classified as belonging to either the upper or the lower face. Additionally, it defines so-called action descriptors: 11 for head position, 9 for eye position, and 14 additional descriptors for miscellaneous actions (for examples, see Fig. 1).
Fig. 1    Examples of upper and lower face AUs defined in the FACS.
AUs are considered to be the smallest visually discernible facial movements. FACS also provides the rules for the recognition of AUs' temporal segments (onset, apex, and offset) in a face video. Using FACS, human coders can manually code nearly any anatomically possible facial expression, decomposing it into the specific AUs and their temporal segments that produced the expression. As AUs are independent of any interpretation, they can be used as the basis for any higher order decision-making process, including the recognition of basic emotions, cognitive states like (dis)agreement and puzzlement, psychological states like pain, and social signals like emblems, regulators, and illustrators. Hence, AUs are extremely suitable as midlevel parameters in an automatic facial behavior analysis system.

The focus of the research efforts in the field was first on automatic recognition of AUs in either static face images or face image sequences picturing facial expressions produced on command. One of the main criticisms that these works received from both cognitive and computer scientists is that the methods are not applicable in real-life situations, where subtle changes in facial expression typify the displayed facial behavior rather than the exaggerated AU activations typical of deliberately displayed facial expressions.
Automatic recognition of facial expression configuration has been the main focus of the research efforts in the field. However, both the configuration and the dynamics of facial expressions are important for the interpretation of human facial behavior. Facial expression temporal dynamics are essential for the categorization of complex psychological states like various types of pain and mood. They are also the key parameter in the differentiation between posed and spontaneous facial expressions. Some of the past work in the field has used aspects of the temporal dynamics of facial expression, such as the speed of a facial point displacement or the persistence of facial parameters over time. However, this was mainly done either to increase the performance of facial expression analyzers or to report on the intensity of (a component of) the shown facial expression, not to explicitly analyze the properties of facial actions' temporal dynamics. Note, in particular, that this work does not report on the explicit analysis of the temporal segments of AUs (e.g., the duration and the speed of the onset and offset of the actions). This led to studies on the automatic segmentation of AU activation into temporal segments (neutral, onset, apex, and offset) in frontal- and profile-view face videos. The focus of the research in the field has also started to shift toward automatic AU recognition in spontaneous facial expressions.

The first step in any facial information extraction process is face detection, i.e., the identification of all regions in the scene that contain a human face. The second step in facial expression analysis is to extract geometric features (facial points and shapes of facial components) and/or appearance features (descriptions of the texture of the face, such as wrinkles and furrows).

The most commonly employed face detector in automatic facial expression analysis is the real-time Viola-Jones face detector. The Viola-Jones face detector consists of a cascade of classifiers. Each classifier employs integral-image filters, which can be computed very fast at any location and scale. This is essential to the speed of the detector. For each stage in the cascade, a subset of features is chosen using a feature selection procedure. The C++ code of the face detector runs at about 500 Hz on a 3.2-GHz Pentium 4.
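The speed of these filters rests on the integral image, in which the sum of any rectangular region of pixels can be computed with at most four array lookups, regardless of the rectangle's size. A minimal sketch of the idea (an illustration, not the original C++ implementation):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using four lookups on the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]          # subtract the strip above the rectangle
    if left > 0:
        total -= ii[bottom, left - 1]        # subtract the strip left of the rectangle
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]       # add back the doubly subtracted corner
    return total
```

A Haar-like feature is then just a difference of two or three such rectangle sums, which is why every feature costs the same constant time at any scale.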

Methods for facial feature point detection can be classified as either texture-based methods (modeling the local texture around a given facial point) or shape-based methods (which regard all facial points as a shape that is learned from a set of labeled faces). A typical texture-based method uses log-Gabor filters. Combined texture- and shape-based methods typically employ AdaBoost to estimate, for each pixel in an input image, how likely it is to be a facial feature point, and then use a shape model as a filter to select the most likely positions of the feature points.
Although these detectors can be used to localize the 20 facial characteristic points illustrated in Fig. 3, none performs the detection with high accuracy. They usually regard the localization of a point as successful if the distance between the automatically labeled point and the manually labeled point is less than 30% of the true interocular distance DI (the distance between the eyes, more specifically between the inner eye corners). However, this is an unacceptably large error in the case of facial expression analysis, since subtle changes in the facial expression will be missed due to errors in facial point localization.
We therefore adopt the fiducial facial point detector proposed by Vukadinovic and Pantic. When used to initialize a point tracking algorithm, this method is accurate enough to allow geometric-feature-based expression recognition. The outline of the developed fully automated method for the detection of the target 20 facial characteristic points is illustrated in Fig. 3. The locations of the facial components are approximated by analyzing the histograms of the relevant regions. Based on these approximate locations, a search region is defined for every point to detect. In these regions of interest (ROIs), a sliding-window search is performed.
At each location in the ROI, Gabor-filter responses are calculated and fed into the GentleBoost-based point detectors. The location with the highest output determines the predicted point location. Typical results of this algorithm are illustrated in Fig. 5.2. The point detection algorithm is tolerant to changes in illumination as long as they remain locally constant. If illumination is uneven in the direct neighborhood of a facial point, the point detector may fail for that point.
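As a rough illustration of the kind of features involved, the sketch below builds a 2-D Gabor kernel (a Gaussian envelope multiplied by a sinusoidal carrier) and evaluates a small bank of orientations on an image patch. The function names and parameter values are illustrative assumptions, not those of the actual detector:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a 2-D Gabor filter: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by the filter orientation theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_responses(patch, wavelength=4.0, sigma=2.0, n_orient=4):
    """Responses of one patch to a small bank of Gabor orientations."""
    ks = patch.shape[0]
    return np.array([
        float((patch * gabor_kernel(ks, wavelength, th, sigma)).sum())
        for th in np.linspace(0, np.pi, n_orient, endpoint=False)
    ])
```

In a sliding-window detector, such a response vector would be computed at every candidate location in the ROI and passed to the boosted point classifier.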

After the fiducial facial points are found in the first frame, we track their positions through the entire image sequence. Standard particle filtering techniques are commonly used for facial point tracking in facial expression analysis.
The tracking scheme that we adopt is based on particle filtering. The main idea behind particle filtering is to maintain a set of solutions that form an efficient representation of the conditional probability p(α|Y), where α is the state of a temporal event to be tracked and Y = {y1, . . . , yt} is the set of noisy observations up to the current time instant t. By maintaining a set of solutions instead of a single estimate, particle filtering is able to track multimodal conditional probabilities p(α|Y), and it is therefore robust to missing and inaccurate data and particularly attractive for estimation and prediction in nonlinear, non-Gaussian systems. In the particle filtering framework, the a posteriori probability p(α|Y) is represented by a weighted set of particles.
The PFFL tracking scheme assumes that the state α can be partitioned into substates αi such that α = (α1, . . . , αn). At each frame of the input image sequence, we obtain a particle-based representation of p(α|Y) in two stages. At the first stage of the PFFL tracking scheme, each facial point i is tracked for one frame independently of the other facial points. At the second stage, interdependences between the substates are taken into account by means of a scheme that samples complete particles from the proposal distribution g(α), which is defined as the product of the posteriors of each αi given the observations, i.e., g(α) = ∏i p(αi|Y). Finally, each of the particles produced in this way is reweighted by evaluating the joint probability p(α|α−), where α− is the state at the previous time instant, so that the set of particles with their new weights represents the a posteriori probability p(α|Y).
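The generic predict-weight-resample cycle that underlies any particle filter can be sketched for a 1-D state as follows. This is a textbook illustration under assumed Gaussian motion and observation models, not the full PFFL scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation, motion_std=1.0, obs_std=2.0):
    """One predict / weight / resample cycle for a 1-D state."""
    n = len(particles)
    # predict: propagate each particle through a random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, n)
    # weight: Gaussian observation likelihood
    weights = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    weights /= weights.sum()
    # resample: draw particles proportionally to their weights
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

# toy run: a broad initial guess concentrates around repeated observations near 1.1
particles = rng.normal(0.0, 5.0, 200)
weights = np.full(200, 1.0 / 200)
for obs in [1.0, 1.2, 1.1]:
    particles, weights = particle_filter_step(particles, weights, obs)
```

Because the whole particle set is carried forward, a multimodal posterior (e.g., two plausible point locations) survives the update, which a single-estimate tracker could not represent.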

Contractions of the facial muscles alter the shape and location of the facial components. Some of these changes are observable from the movements of the 20 facial points that we track in the input sequence. To classify the movements of the tracked points in terms of AUs and their temporal activation models, changes in the position of the points over time are first represented as a set of midlevel parameters.

Before the midlevel parameters can be calculated, all rigid head motion in the input sequence must be eliminated. We register each frame of the input image sequence with the first frame using an affine transformation T1 based on three referential points: the nasal spine point and the inner corners of the eyes (see Fig. 3). We use these points as the referential points because contractions of the facial muscles do not affect them.
Interperson variations in the size and location of the facial points are minimized by applying an affine transformation T2 to every tracked facial point in each frame. T2 is obtained by comparing the locations of the referential points of a given subject in the first frame with the corresponding points in a selected expressionless "standard" face. Thus, after tracking any of the 20 characteristic facial points in an input sequence containing k frames, we obtain a set of coordinates (p1, . . . , pk) corresponding to the locations of the pertinent point p in each of the k frames. The registered coordinates are then obtained by applying both transformations:

p_t^r = T2(T1(p_t)),  t = 1, . . . , k.
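Since three non-collinear correspondences determine an affine transformation exactly, T1 and T2 can each be estimated by solving a small linear system over the referential points. A sketch, with illustrative coordinates standing in for the nasal spine and inner eye corners:

```python
import numpy as np

def affine_from_points(src, dst):
    """Solve for A (2x2) and b (2,) with dst = A @ src + b from three correspondences."""
    # each correspondence [x, y] contributes one row of the linear system
    M = np.hstack([src, np.ones((len(src), 1))])        # shape (3, 3)
    params, *_ = np.linalg.lstsq(M, dst, rcond=None)    # shape (3, 2): A^T stacked on b
    return params[:2].T, params[2]

def apply_affine(A, b, pts):
    """Transform an (n, 2) array of points."""
    return pts @ A.T + b
```

With the transform recovered from the three rigid referential points, applying it to all 20 tracked points removes the rigid motion while preserving the non-rigid, expression-related displacements.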

Using this registration technique, four out of the six degrees of freedom of head movement can be dealt with, and the remaining two can be handled partially. The tracked points returned by the PFFL tracker contain random noise due to the probabilistic nature of particle filtering. Therefore, we apply a temporal smoothing filter to arrive at a registered set of points Ps that contains less noise:

p_s(t) = (1/(2ws + 1)) · Σ_{i=t−ws}^{t+ws} p_r(i)

where t denotes the frame number and ps and pr are elements of the collections Ps and Pr, respectively. The window side-lobe size ws of the temporal smoothing filter was chosen after visual inspection of the smoothed tracker's output.
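The exact filter shape is not restated here, so the following sketch assumes a simple symmetric moving average with window half-size ws, clamped at the sequence boundaries:

```python
import numpy as np

def smooth_track(track, ws=2):
    """Symmetric moving average over frames; track has shape (k, 2) for one point."""
    k = len(track)
    smoothed = np.empty_like(track, dtype=float)
    for t in range(k):
        lo, hi = max(0, t - ws), min(k, t + ws + 1)   # clamp the window at the ends
        smoothed[t] = track[lo:hi].mean(axis=0)
    return smoothed
```

A linear averaging filter like this attenuates frame-to-frame jitter while leaving slow, expression-related point trajectories essentially unchanged.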

AU recognition from input image sequences is based on SVMs. SVMs are very well suited for this task because the high dimensionality of the feature space (representation space) does not affect the training time, which instead depends only on the number of training examples. Furthermore, SVMs generalize well even when few training data are provided. However, note that classification performance decreases when the dimensionality of the feature set is far greater than the number of samples available in the training set.
The feature selection is implemented as follows. For every d ∈ D, where D is the set of 22 AUs that our system can recognize in an input sequence, we apply GentleBoost, resulting in a set of selected features Gd. To detect the 22 AUs occurring alone or in combination in the current frame of the input sequence, we train a separate SVM to detect the activity of every AU. More specifically, we use Gd to train and test the SVM classifier for the relevant AU. An advantage of feature selection by a boosting algorithm is that it tries to optimize the actual classification problem instead of merely reducing the overall variability in the data.
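A toy version of GentleBoost feature selection with weighted regression stumps can be sketched as follows; thresholding each feature at its median is a simplifying assumption made here for brevity:

```python
import numpy as np

def gentleboost_select(X, y, n_rounds=5):
    """Toy GentleBoost feature selection; X is (n, d), y is in {-1, +1}.

    Each round fits, for every feature j, a weighted least-squares stump
    f(x) = a * [x_j > theta] + b and keeps the feature with the lowest
    weighted squared error; the sample weights are then updated.
    """
    n, d = X.shape
    w = np.ones(n) / n
    selected = []
    for _ in range(n_rounds):
        best_err, best_j, best_f = np.inf, None, None
        for j in range(d):
            theta = np.median(X[:, j])
            mask = X[:, j] > theta
            wn, wm = w[~mask].sum(), w[mask].sum()
            b = (w[~mask] * y[~mask]).sum() / max(wn, 1e-12)
            a = (w[mask] * y[mask]).sum() / max(wm, 1e-12) - b
            f = a * mask + b
            err = (w * (y - f) ** 2).sum()
            if err < best_err:
                best_err, best_j, best_f = err, j, f
        selected.append(best_j)
        w = w * np.exp(-y * best_f)      # GentleBoost weight update
        w /= w.sum()
    return selected

# toy data in which only feature 3 carries the label
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 3] > 0, 1.0, -1.0)
```

The indices returned play the role of Gd: only these features would then be fed to the SVM for the corresponding AU.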
To encode the temporal segments of the AUs found to be activated in the input image sequence, we proceed as follows. An AU can be in one of the following four phases:
· The onset phase, where the muscles are contracting and the appearance of the face changes as the facial action grows stronger.
· The apex phase, where the facial action is at its peak and there are no more changes in facial appearance due to this particular facial action.
· The offset phase, where the muscles are relaxing and the face returns to its neutral appearance.
· The neutral phase, where there are no signs of activation of the particular facial action.
Here, we use two approaches to detect an AU temporal model.
· mc-SVMs: In the first approach, we employ a one-versus-one strategy for multiclass SVMs (mc-SVMs). For each AU and every pair of temporal segments, we train a separate subclassifier specialized in the discrimination between the two temporal segments. This results in |C|(|C| − 1)/2 subclassifiers that need to be trained, with C = {neutral, onset, apex, offset} and |·| denoting the cardinality of a set. For each frame t of an input sequence, every subclassifier returns a prediction of the class c ∈ C, and a majority vote is cast to determine the final output ct of the mc-SVM for the current frame t. For each classifier separating classes ci, cj ∈ C, i ≠ j, we apply GentleBoost, resulting in a set of selected features Gi,j. We use Gi,j to train the subclassifier specialized in discriminating between the two temporal segments in question.
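The voting step can be sketched as follows, with trivial stand-in functions in place of the trained pairwise SVMs:

```python
from itertools import combinations
from collections import Counter

CLASSES = ["neutral", "onset", "apex", "offset"]

def ovo_predict(pairwise, x):
    """Majority vote over the |C|(|C|-1)/2 pairwise subclassifiers.

    `pairwise[(ci, cj)]` is a function returning the winning class for frame x.
    """
    votes = Counter(pairwise[(ci, cj)](x) for ci, cj in combinations(CLASSES, 2))
    return votes.most_common(1)[0][0]

# stand-ins for trained SVMs: every pair involving "onset" votes "onset",
# any other pair votes for its first class
pairwise = {}
for ci, cj in combinations(CLASSES, 2):
    winner = "onset" if "onset" in (ci, cj) else ci
    pairwise[(ci, cj)] = (lambda w: (lambda x: w))(winner)
```

With |C| = 4, six subclassifiers vote per frame; in the toy setup above, "onset" collects three votes and wins.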
· Hybrid SVM-HMM: In the second approach, we propose to apply hybrid SVM-HMMs to the problem of AU temporal model detection. Traditionally, HMMs have been used very effectively to model time in classification problems. However, while the sequence of the temporal phases of a facial action over time can be represented very well by HMMs, the HMM suffers from poor discrimination between temporal phases at a single moment in time. The emission probabilities, which are computed for each frame of an input video for the HMM hidden states, are normally modeled by fitting Gaussian mixtures on the features. These Gaussian mixtures are fitted using likelihood maximization, which assumes the correctness of the models (i.e., the feature values should follow a Gaussian distribution) and thus suffers from poor discrimination. Moreover, it results in mixtures trained to model each class and not to discriminate one class from the other.
SVMs, on the other hand, are not suitable for modeling time, but they discriminate extremely well between classes. Using them to compute the emission probabilities might very well result in improved recognition. We therefore again train one-versus-one SVMs to distinguish the temporal phases neutral, onset, apex, and offset, just as described for the mc-SVMs above. We then use the output of the component SVMs to compute the emission probabilities. In this way, we arrive at a hybrid SVM-HMM system. This approach has previously been applied with success to speech recognition.
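At decoding time, the hybrid system amounts to finding the most likely phase sequence given per-frame emission scores, e.g., by Viterbi decoding over the four phase states. A generic sketch, where the transition and emission values are illustrative rather than learned:

```python
import numpy as np

STATES = ["neutral", "onset", "apex", "offset"]

def viterbi(log_emis, log_trans, log_init):
    """Most likely state sequence; log_emis has shape (T, n_states)."""
    T, n = log_emis.shape
    delta = log_init + log_emis[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (n, n): previous -> next state
        back[t] = scores.argmax(axis=0)              # best predecessor for each state
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack from the last frame
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]
```

In the hybrid system, `log_emis[t]` would come from the SVM-derived posteriors for frame t, while `log_trans` encodes the plausible phase order (e.g., onset tends to be followed by apex).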
The HMM consists of four states, one for each temporal phase. From each SVM we obtain, using Platt's method, the pairwise class probabilities μij ≡ p(ci|ci or cj, x) of the class ci given the feature vector x and given that x belongs to either ci or cj. These pairwise probabilities are then transformed into the posterior probabilities p(ci|x), which serve as the emission probabilities of the HMM.
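The exact coupling formula is not restated here; one common rule (due to Price et al.) estimates p(ci|x) ≈ 1/(Σ_{j≠i} 1/μij − (K − 2)), followed by normalization. A sketch under that assumption:

```python
import numpy as np

def couple_pairwise(mu):
    """Couple pairwise probabilities into a posterior over K classes.

    mu[i, j] = p(c_i | c_i or c_j, x) for i != j (Platt-scaled SVM outputs).
    """
    K = mu.shape[0]
    p = np.empty(K)
    for i in range(K):
        others = [mu[i, j] for j in range(K) if j != i]
        p[i] = 1.0 / (sum(1.0 / m for m in others) - (K - 2))
    return p / p.sum()
```

When the pairwise values are mutually consistent (μij = pi/(pi + pj)), this rule recovers the underlying posterior exactly; otherwise it gives a reasonable approximation.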

To detect the six basic emotions, we use the same set of features described in the midlevel parametric representation. Here, we train a multiclass combination of GentleBoost, support vector machines, and hidden Markov models (MCGentleSVMs) with a structure similar to that of the AU temporal segment detector. Again, we train one-versus-one GentleBoost Support Vector Machines (GentleSVMs) to distinguish between pairs of emotions. Because the neutral expression is also present in every video, we also learn classifiers that distinguish between each emotion and the neutral expression. We thus learn 21 binary classifiers and use their outputs to determine the emission probabilities of the HMM. In contrast with the AU temporal segment detector, we do not use the emotions as the state variables; instead, we learn the optimal number of states.

We used four different data sets: 
The CK-db of volitional facial displays.
The MMI facial expression database (MMI-db).
The DS118 data set of spontaneous facial displays.
The triad data set of spontaneous human behavior.

The CK-db was developed for research on the recognition of the six basic emotions and their corresponding AUs. The database contains over 2000 near-frontal-view videos of facial displays. It is currently the most commonly used database for studies on automatic facial expression analysis. All facial displays were made on command, and the recordings were made under constant lighting conditions. Two certified FACS coders provided AU coding for all videos.

The MMI facial expression database has five parts. Two FACS experts AU-coded the database. The two coders made the final decisions on AU coding by consensus, and this final AU coding was used for the study presented in this paper.

The DS118 data set was collected to study facial expression in patients with heart disease. Spontaneous facial displays were video recorded during a clinical interview that elicited AUs related to disgust, contempt, and other negative emotions, as well as smiles. The facial actions displayed in the data are often very subtle. Due to confidentiality issues, this FACS-coded data set is not publicly available. Only the AU coding made by human observers and the tracking data were made available to us.

The triad data set was collected to study the effects of alcohol on the behavior of so-called social drinkers. The recordings are long (over 15 min) and contain displays of diverse facial and bodily gesturing. No AU coding of the data was made publicly available.


  • It would be highly beneficial for fields as diverse as computing technology, medicine, and security.
  • It has applications like ambient interfaces, empathetic tutoring, interactive gaming, research on pain and depression, health support appliances, monitoring of stress and fatigue, and deception detection.


  • It enables the explicit analysis of the temporal segments of action units (AUs).
  • It is applicable in real-life situations.
  • It is highly accurate.

It can only recognize facial expressions as long as the face is viewed from a pseudofrontal view. If the head has an out-of-plane rotation greater than 20°, the system will fail.

A system that can recognize facial expressions even when the face is not viewed from a pseudofrontal view, that is, when the head has an out-of-plane rotation greater than 20°, remains to be developed.

Accurate, fully automatic facial expression analysis would have many real-world applications. In this work, we have shown not only that fully automatic, highly accurate AU activation detection based on geometric features is possible, but also that the four temporal phases of an AU can be detected with high accuracy and that geometric features are very well suited for this task.
