The MultiModal - MultiMedia Seminar Series
Since the winter term 06/07, the Computer Vision Lab (BIWI) and the TIK Speech Processing Group have offered a joint Multimodal - Multimedia seminar.
This seminar series is an initiative that originated from BIWI's and TIK's joint participation in the SNF NCCR IM2.
| Date, Time, Location | Speaker, Title and Abstract |
|---|---|
| 11.12.2007, 10:00<br>ETZ E8 | Simplifying principles for the control of movement and locomotion<br>Alain Berthoz, Collège de France<br>The planning, control and memory of arm movements, locomotion and navigation in animals and humans require extremely complex neural mechanisms that have been elaborated over millions of years of evolution. The many degrees of freedom to be controlled, the multiple possible reference frames and the changing external conditions have challenged living organisms to find ways of performing these tasks rapidly and efficiently. I will describe the simplifying principles that have been found for each of these functions. They include various anticipatory mechanisms involving a top-down selection of relevant sensory inputs, task-specific attention mechanisms, and so on. Natural movement is governed by laws such as the 1/3 power law, which links the curvature and velocity of a trajectory, segmental limb coordination, end-point control of gesture, the use of composite variables to transform nonlinear problems into linear ones, and the reduction of the number of degrees of freedom through the anatomical organization of neural projections. In addition, the problem of reference frames has been solved in several clever ways, e.g., stabilization of the head during complex movements, the use of gaze as a reference frame for guiding movement to a goal, or Listing's law for eye rotation as a solution to the non-commutativity of rotations. For navigation, recent findings in cognitive neuroscience and modeling have revealed the systems that underlie the ability of the animal or human brain to use several strategies for solving navigation tasks. These include specialized systems for coding place, direction and movement, the distinction between egocentric and allocentric strategies, the task-dependent selection of pertinent sensory inputs, and mechanisms for "filling in" incomplete information.<br>Finally, although nearly all computations for movement or navigation control use a Euclidean geometric framework, simplifying mechanisms may in fact come from the use of non-Euclidean geometries. Although robotics has found many elegant solutions to the same problems, some of the solutions found by evolution may be usefully implemented in robots and humanoids following a bio-inspired, neurorobotics approach. I will mention some of these achievements. In addition, modern robotics provides neuroscientists with thoughtful models and original principles that may guide the study of brain mechanisms. |
| 9.8.2007, 10:30<br>ETZ E8 | Information theoretic feature extraction for multimodal signal/image processing<br>Jean-Philippe Thiran, EPFL, Signal Processing Institute (http://ltswww.epfl.ch/~thiran)<br>Exploiting the complementary and redundant information contained in multimodal signals is a real challenge. In multimodal classification tasks, extracting relevant features from the different channels requires measuring their marginal information as well as the joint information shared by the channels. How to extract such information and exploit it to optimize classification performance is the topic of our research. In this talk, we will summarize the information-theoretic framework that we have been developing for several years. We will then show how to use it in practice in applications such as multimodal medical image processing, multimodal audio-visual speaker detection and audio-visual speech recognition. |
| 10.5.2007<br>Location tba | Multimodal Fusion Strategies for Video Description, Indexing and Retrieval<br>Eric Bruno, Research Associate, Viper Group, University of Geneva<br>This talk discusses the problem of multimodal fusion for multimedia content abstraction, indexing and retrieval. Building on the VICODE platform, designed for video content description and exploration, we detail a multimodal, contextual fusion strategy based on conditional random fields (CRFs) for automated video structuring and labelling. In the second part of the presentation, we present a multimodal indexing technique based on dissimilarity spaces. We then compare several fusion methods for retrieving, from dissimilarity spaces, the relevant documents queried by users. All the proposed algorithms are evaluated on the TRECVid corpus in an application setting. |
| 1.2.2007<br>ETZ E9 | Building Browsers with JFerret<br>Mike Flynn, Senior Research Scientist, IDIAP<br>JFerret is a multimedia meta-browser. It takes an XML description of a browser and creates the user interface dynamically. Browsers can play multiple synchronized video and audio tracks and display a large range of common interaction components. I will demonstrate various browsers produced for the AMI and IM2 projects for exploring recordings of meetings.<br>Mike Flynn graduated as a computer scientist in 1981. He initially worked on commercial communications and fault-tolerant systems, then moved into research on automatic VLSI design, formal methods and early object orientation. In 1991 he joined Xerox PARC, where he produced a prototype Video Diary and the well-known Forget-Me-Not memory prosthesis. A spin-off from this work became a Xerox product, "MobileDoc". In 2001 he worked for British Telecom Research Labs on satellite provision of broadband to the public, followed by a move to Symbian as an architect of the "next generation" of mobile phone software. In 2003 he joined IDIAP to work on new ideas for multimodal browsers. He holds nine patents and has published across a wide range of computer science topics. |
| 29.1.2007<br>ETZ F76.1 | ASR-based pronunciation training: Does it really work?<br>Prof. Dr. Helmer Strik, Radboud University Nijmegen<br>We developed a system that gives automatic feedback on pronunciation by means of ASR (automatic speech recognition) technology, and then tested it. We studied a group of immigrants who were following regular, teacher-fronted Dutch classes and who were assigned to three groups using either [1] Dutch CAPT, our ASR-based Computer Assisted Pronunciation Training (CAPT) system, which provides feedback on a number of Dutch speech sounds that are problematic for second-language learners; [2] a CAPT system without feedback; or [3] no CAPT system at all. Participants were tested before and after the training. The results show that the ASR-based feedback was effective in correcting the errors addressed in the training. |
| 18.1.2007<br>ETZ E9 | A Discriminative Approach for the Retrieval of Images from Text Queries<br>This work proposes a new approach to the retrieval of images from text queries. In contrast to previous work based on generative models and likelihood maximization, this method relies on a discriminative approach: the parameters are selected to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. To minimize this loss, we present two models: first, a simple linear system (which can be kernelized) using the same features as previous work; second, a neural network trained by stochastic gradient descent, in which the extraction of a global image representation from local block descriptors is learned jointly with the retrieval problem. Both discriminative models are compared to state-of-the-art approaches on the Corel dataset and yield statistically significantly better performance. |
| 9.11.2006, 10:15<br>ETZ F91 | Pull the Strings and obey your Master: The Digital Marionette<br>Stefan Mueller Arisona, Computer Systems Institute, ETH Zurich<br>The Digital Marionette, an interactive media art installation, shows the audience the look and feel of a puppet in the multimedia era: the wooden marionette is replaced by a Lara Croft-like cyber character, and the traditional strings attached to the puppet's control handles turn into a network of computer cables. The translation from old to new, from analogue to digital, takes place via eight computer mice, which track the movements of the individual strings attached to the puppet, and through digital speech recognition. Using this expressive interface, the puppet master can anonymously conduct the digital marionette, which reveals itself to the audience with an oversized, computer-generated face and a loud voice. The talk addresses the artistic and educational concept of the installation and highlights its technical realisation. |