Can a person’s interpretation of a scene, as reflected in their gaze patterns, be harnessed to recognize different classes of actions? Behavioral data were taken from a previous study in which participants (n=8) viewed 500 images from the PASCAL VOC 2012 Actions image set. Each image was freely viewed for 3 seconds and was followed by a 10-AFC test in which the depicted human action had to be selected from among 10 action classes: walking, running, jumping, riding-horse, riding-bike, phoning, taking-photo, using-computer, reading, and playing-instrument. To quantify the spatio-temporal information in gaze, we labeled segments in each image (person, upper-body, lower-body, context) and derived gaze features, which included the number of transitions between segment pairs, the average and maximum of the fixation-density map per segment, dwell time per segment, and a measure of when fixations were made on the person versus the context. For baseline comparison, we also derived purely visual features using a Convolutional Neural Network trained on fixed subregions of the persons. Three linear Support Vector Machine classifiers were trained: one using visual features alone, one using gaze features alone, and one using both in combination. Although average precision across the ten action categories was poor, the gaze classifier revealed four distinct, behaviorally meaningful subgroups (walking+running+jumping, riding-horse+riding-bike, phoning+taking-photo, and using-computer+reading+playing-instrument), where actions within each subgroup were highly confusable. Retraining the classifiers to discriminate between these four subgroups significantly improved performance for the gaze classifier, from 43.9% to 81.2% (for the phoning+taking-photo subgroup: gaze = 81.6%, vision = 65.4%).
Moreover, the gaze+vision classifier outperformed both the gaze-alone and vision-alone classifiers, suggesting that gaze features and vision features each contribute to the classification decision. These results have implications for both behavioral and computer vision research: gaze patterns can reveal how people group similar actions, which in turn can improve automated action recognition.
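As an illustration only, two of the gaze features named above (dwell time per segment and the number of transitions between segment pairs) could be computed along the following lines. This is a minimal sketch, not the study's pipeline. It assumes each fixation has already been assigned to exactly one segment and carries a duration in milliseconds. The function name `gaze_features`, the input format, the unordered treatment of segment pairs, and the integer subgroup ids are all hypothetical choices made for this example.

```python
from itertools import combinations

# Segment labels as listed in the abstract.
SEGMENTS = ("person", "upper-body", "lower-body", "context")
# Unordered segment pairs (an assumption; the study may have used ordered pairs).
PAIRS = tuple(combinations(SEGMENTS, 2))

# The four confusable subgroups reported above, mapped to arbitrary integer ids.
ACTION_TO_SUBGROUP = {
    "walking": 0, "running": 0, "jumping": 0,
    "riding-horse": 1, "riding-bike": 1,
    "phoning": 2, "taking-photo": 2,
    "using-computer": 3, "reading": 3, "playing-instrument": 3,
}


def gaze_features(fixations):
    """Return (dwell, transitions, vector) for one viewing of one image.

    fixations: time-ordered list of (segment_label, duration_ms) tuples
    (hypothetical input format).
    """
    dwell = {seg: 0.0 for seg in SEGMENTS}
    transitions = {pair: 0 for pair in PAIRS}
    previous = None
    for segment, duration_ms in fixations:
        dwell[segment] += duration_ms
        # Count a transition only when gaze moves to a different segment.
        if previous is not None and previous != segment:
            pair = tuple(sorted((previous, segment), key=SEGMENTS.index))
            transitions[pair] += 1
        previous = segment
    # Flatten into a fixed-length vector: 4 dwell times + 6 transition counts.
    vector = [dwell[s] for s in SEGMENTS] + [transitions[p] for p in PAIRS]
    return dwell, transitions, vector


if __name__ == "__main__":
    example = [("context", 200), ("upper-body", 300), ("upper-body", 150),
               ("context", 250), ("lower-body", 100)]
    dwell, transitions, vector = gaze_features(example)
    print(dwell)
    print(transitions)
    print(vector)
```

A fixed-length vector of this kind is the sort of input a linear classifier expects, and `ACTION_TO_SUBGROUP` shows how the ten action labels could be collapsed into the four subgroups before retraining.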
This work was supported in part by NSF grants IIS-1161876, IIS-1111047, and the SubSample Project by the DIGITEO institute, France. We thank Minh Hoai for providing precomputed CNN features.