This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Search for Publication

Year(s) from:  to 
Keywords (separated by spaces):

Acquisition of a 3D Audio-Visual Corpus of Affective Speech

G. Fanelli, J. Gall, H. Romsdorfer, T.Weise and L. Van Gool
270, 2010
Computer Vision Lab, ETH Zuerich


Communication between humans deeply relies on our capability of experiencing, expressing, and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, prerequisite of which is the collection of affective corpora. Currently available datasets still represent a bottleneck because of the difficulties arising during the acquisition and labeling of authentic affective data. In this work, we present a new audio-visual corpus for possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression in the form of dense dynamic 3D face geometries. We also introduce an acquisition setup for labeling the data with very little manual effort. We acquire high-quality data by working in a controlled environment and resort to video clips to induce affective states. In order to obtain the physical prosodic parameters of each utterance, the annotation process includes: transcription of the corpus text into the phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals. We employ a real-time 3D scanner for the recording of dense dynamic facial geometries and track the faces throughout the sequences, achieving full spatial and temporal correspondences. The corpus is not only relevant for affective visual speech synthesis or view-independent facial expression recognition, but also for studying the correlations between audio and facial features in the context of emotional speech.

Download in pdf format
  author = {G. Fanelli and J. Gall and H. Romsdorfer and T.Weise and L. Van Gool},
  title = {Acquisition of a 3D Audio-Visual Corpus of Affective Speech},
  year = {2010},
  month = {February},
  number = {270},
  institution = {Computer Vision Lab, ETH Zuerich},
  keywords = {affective speech, audio-visual corpus, 3D face tracking}