Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2




Introduction

Speech and facial expressions are among the most important channels of human communication. During verbal interactions, both the sounds we produce and the deformations our faces undergo reveal a lot about our emotional states, moods, and intentions. In the future, we foresee computers able to capture these subtle affective signals from the people they interact with, and interfaces able to send such signals back to human users in the form of believable animations of virtual characters.

To achieve such goals, the research community needs corpora for studying how humans communicate their feelings, and much effort is therefore being put into data collection. In general, however, there is a trade-off between the authenticity of the acquired data and its quality: on the one hand, naturalistic emotions are less likely to occur in controlled environments; on the other hand, data collected 'in the wild' are very noisy and of little use to applications like face synthesis.

The Biwi 3D Audiovisual Corpus of Affective Communication represents a compromise between the quality of the recorded data and the authenticity of the represented affective states. The corpus was acquired at ETHZ, in collaboration with SYNVO GmbH.

Acquisition

The corpus comprises a total of 1109 sentences uttered by 14 native English speakers (6 males and 8 females). A real-time 3D scanner and a professional microphone were used to capture the facial movements and the speech of the speakers. The dense dynamic face scans were acquired at 25 frames per second, and the RMS error of the 3D reconstruction is about 0.5 mm. To ease automatic speech segmentation, we carried out the recordings in an anechoic room, with walls covered by sound-absorbing materials, as shown in the picture.

Each sentence was recorded twice: once in a neutral mode and once in an emotional mode.

Annotation

The corpus has been annotated in terms of the speech, the facial movements, and the emotional content of the recorded sequences. For the speech signal, a phonological representation of the utterances, phone segmentation, fundamental frequency, and signal intensity are provided. The depth signal is converted into a sequence of 3D meshes, providing full spatial and temporal correspondences across all sequences and speakers, a vital requirement for building advanced statistical models for animation or recognition applications. Renderings of the tracked faces were shown to anonymous internet users in an online survey. The participants watched the videos and rated each of the following affective attributes on a scale from 'not at all' to 'very': emotional, negativeness, anger, sadness, fear, contempt, nervousness, disgust, frustration, stress, excitement, confidence, surprise, happiness, positiveness.
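Because the tracked meshes are in full correspondence, a frame can be compared with a neutral face vertex by vertex. The following is a minimal sketch of that idea, not part of the corpus tools. It assumes that corresponding vertices share the same index in both meshes and that each mesh is given as a list of (x, y, z) tuples; the function name `vertex_displacements` and the sample coordinates are hypothetical.

```python
import math

def vertex_displacements(neutral, frame):
    """Return the Euclidean distance moved by each vertex.

    Assumption: vertex i in `neutral` corresponds to vertex i in `frame`,
    which is what full spatial correspondence would allow.
    """
    if len(neutral) != len(frame):
        raise ValueError("meshes must have the same number of vertices")
    return [math.dist(a, b) for a, b in zip(neutral, frame)]

# Hypothetical two-vertex meshes, for illustration only.
neutral_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame_mesh = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
displacements = vertex_displacements(neutral_mesh, frame_mesh)
```

Per-vertex displacements of this kind are a common starting point for statistical models of facial deformation, since every sequence and speaker can be described in the same vertex space.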

The above image shows the 3D tracked faces aligned to the speech segmentation and some simple speech features. The two plots refer to the same sentence and same speaker, in emotional (left) and neutral mode (right).
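Aligning the tracked faces with the phone segmentation amounts to mapping each frame index to a time stamp and looking up the phone active at that time. The sketch below illustrates this under stated assumptions: frame 0 is taken to coincide with time 0 of the audio, and the segmentation is represented as a list of (label, start, end) tuples in seconds. Both the representation and the sample segments are hypothetical and do not reflect the corpus file format.

```python
FPS = 25  # scan rate stated for the corpus

def frame_time(index, fps=FPS):
    """Time in seconds of a frame, assuming frame 0 starts at t = 0."""
    return index / fps

def phone_at(t, segments):
    """Return the label of the segment containing time t, or None."""
    for label, start, end in segments:
        if start <= t < end:  # half-open interval [start, end)
            return label
    return None

# Hypothetical segmentation, for illustration only.
segments = [("h", 0.00, 0.08), ("e", 0.08, 0.20)]
labels = [phone_at(frame_time(i), segments) for i in range(5)]
```

With half-open intervals, a frame that falls exactly on a boundary is assigned to the segment that starts there, so every time stamp belongs to at most one phone.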

Examples

A sample of the face data can be found here. The .tgz file contains one .obj file corresponding to the neutral face of one of the subjects, one .obj file containing a frame from a sequence, and the corresponding RGB images (.png).
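The .obj files are plain text and can be read with very little code. The following is a minimal sketch of a reader for vertex positions and faces only. It assumes standard Wavefront OBJ `v` and `f` records with positive 1-based indices; texture coordinates, normals, and relative (negative) indices are not handled. The function name `parse_obj` and the in-memory demo data are hypothetical.

```python
import io

def parse_obj(stream):
    """Parse vertex positions and faces from a Wavefront OBJ text stream.

    Only 'v' and 'f' records are read; all other records are skipped.
    """
    vertices = []  # list of (x, y, z) floats
    faces = []     # list of tuples of 0-based vertex indices
    for line in stream:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices.append(tuple(float(c) for c in parts[1:4]))
        elif parts[0] == "f":
            # Face entries may look like "i", "i/t", "i//n" or "i/t/n";
            # keep only the vertex index and convert it to 0-based.
            faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:]))
    return vertices, faces

# Tiny in-memory stand-in for a real file (not data from the corpus).
demo = io.StringIO("v 0 0 0\nv 1 0 0\nv 0 1 0\nvt 0 0\nf 1/1 2/1 3/1\n")
vertices, faces = parse_obj(demo)
```

To read a file from disk, the same function can be called on an open text file, for example `with open(path) as f: parse_obj(f)`, where `path` points to one of the extracted .obj files.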

The following videos show two example sentences pronounced in both the neutral and the emotional mode. Both the original scans (on the right) and the tracked faces (on the left) are rendered with (top) and without (bottom) texture.

Example 1 - Neutral
Example 1 - Emotional
Example 2 - Neutral
Example 2 - Emotional

Requesting the data

The database can be obtained upon request, for research purposes only. A license agreement must first be signed (students may not sign) and sent to Irene Zarza.

Download the EULA

Related publications

G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool
A 3D Audio-Visual Corpus of Affective Communication
IEEE Transactions on Multimedia
Vol. 12, No. 6, pp. 591-598, October 2010.

G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool
Acquisition of a 3D Audio-Visual Corpus of Affective Speech
ETH BIWI Technical Report No. 270.

G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool
3D Vision Technology for Capturing Multimodal Corpora: Chances and Challenges
LREC Workshop on Multimodal Corpora, Malta, May 2010.