Dr.-Ing. Christian Hacker


Research

Multimodal classification of the focus of attention

Abstract

In the German SmartWeb project, the user was interacting with the Web via a smartphone to get information on, for example, points of interest. To overcome the tedious use of devices such as push-to-talk, but still be able to tell whether the user is addressing the system or talking to herself or to a third person, we developed a module that monitors speech and video in parallel. Our database was recorded in a real-life setting, indoors as well as outdoors, under unfavourable acoustic and lighting conditions. With acoustic features, we classify up to four different types of addressing (talking to the system: On-Talk; reading from the display: Read Off-Talk; paraphrasing information presented on the screen: Paraphrasing Off-Talk; talking to a third person or to oneself: Spontaneous Off-Talk). With the camera of the smartphone, we record the user's face and decide whether she is looking at the phone or somewhere else. We use three different types of turn features based on classification scores of frame-based face detection and word-based analysis: 13 acoustic-prosodic features, 18 linguistic features, and 9 video features. The classification rate for acoustics alone is up to 62 % for the four-class problem, and up to 77 % for the most important two-class problem, "the user is focussing on the interaction with the system or not". For video alone, it is 45 % and 71 %, respectively. By combining the two modalities and using linguistic information in addition, classification performance for the two-class problem rises to 85 %.

Investigated problem


Classifying whether the user of a smartphone is communicating with the system by speech input or not.
The image shows the user from the perspective of the smartphone camera:

Classification of On-View/Off-View, ROT (Read Off-Talk), POT (Paraphrasing Off-Talk), SOT (Spontaneous Off-Talk), and NOT (No Off-Talk = On-Talk):

Classification


For the automatic classification of the focus of attention, we used Haar wavelets and AdaBoost for On-View/Off-View detection and prosodic features for the On-Talk/ROT/POT/SOT classification. Linguistic features (POS = part of speech, e.g. "a content word follows") are used in the third classification task. The fusion of the modalities is based on meta-features that describe each sub-system in a low-dimensional feature space:
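To make the video path concrete, here is a minimal sketch of frame-based On-View/Off-View detection. It uses OpenCV's pretrained frontal-face Haar cascade as a stand-in for the project's own Haar-wavelet/AdaBoost classifier; the file name and the 0.5 decision threshold are illustrative assumptions, not the tuned values of the system.

import cv2

# OpenCV's pretrained frontal-face Haar cascade; a stand-in for the
# project's own Haar-wavelet/AdaBoost On-View detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def on_view_score(video_path):
    """Fraction of frames in which a frontal face is detected.

    A frontal face in the smartphone camera is taken as evidence
    that the user is looking at the display (On-View).
    """
    capture = cv2.VideoCapture(video_path)
    frames = hits = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames += 1
        if len(cascade.detectMultiScale(gray, scaleFactor=1.1,
                                        minNeighbors=5)) > 0:
            hits += 1
    capture.release()
    return hits / frames if frames else 0.0

# Turn-level decision; the threshold is illustrative, not a tuned value.
label = "On-View" if on_view_score("turn.avi") >= 0.5 else "Off-View"

Such frame-level detection scores are what the turn-level video features entering the fusion step are computed from.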

Demonstrator

The demonstrator allows step-by-step speech recording, recognition, and classification (left). In the right part, the word-based recognition results are shown together with On-Talk scores. In the middle, the sentence-based scores for On-Talk/Off-Talk, On-View/Off-View, and, after fusion, On-Focus/Off-Focus are shown.
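The fusion producing the On-Focus/Off-Focus score can be sketched as follows. The three meta-features (mean word-level On-Talk score, fraction of words classified as On-Talk, On-View frame ratio), the toy training turns, and the logistic-regression fuser are illustrative assumptions, not the project's exact configuration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per turn: [mean word-level On-Talk score,
#                    fraction of words classified On-Talk,
#                    On-View frame ratio]
# Illustrative meta-features that condense each sub-system's output
# into a low-dimensional description of the turn.
X_train = np.array([[0.82, 0.90, 0.95],
                    [0.75, 0.80, 0.88],
                    [0.30, 0.25, 0.15],
                    [0.45, 0.40, 0.20]])
y_train = np.array([1, 1, 0, 0])  # 1 = On-Focus, 0 = Off-Focus

fusion = LogisticRegression().fit(X_train, y_train)

# A new turn: ambiguous acoustics, but the user looks at the display.
print(fusion.predict_proba([[0.55, 0.50, 0.90]])[0, 1])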

Own Publications


Batliner, Anton; Hacker, Christian; Nöth, Elmar: To Talk or not to Talk with a Computer - Taking into Account the User's Focus of Attention. Journal on Multimodal User Interfaces, vol. 2, no. 28, pp. 171-186, 2008.

Batliner, Anton; Hacker, Christian; Kaiser, Moritz; Mögele, Hannes; Nöth, Elmar: Taking into Account the User's Focus of Attention with the Help of Audio-Visual Information: Towards less Artificial Human-Machine Communication. In: Krahmer, Emiel; Swerts, Marc; Vroomen, Jean (Eds.): AVSP 2007 (International Conference on Auditory-Visual Speech Processing, Hilvarenbeek, 31.08.-03.09.2007), 2007, pp. 51-56.
[download poster]

Hacker, Christian; Batliner, Anton; Nöth, Elmar: Are You Looking at Me, are You Talking with Me - Multimodal Classification of the Focus of Attention. In: Sojka, P.; Kopecek, I.; Pala, K. (Eds.): Text, Speech and Dialogue. 9th International Conference, TSD 2006, Brno, Czech Republic, 11.-15.09.2006, Proceedings. Berlin, Heidelberg: Springer, 2006, pp. 581-588. ISBN 978-3-540-39090-9.

Nöth, Elmar; Hacker, Christian; Batliner, Anton: Does Multimodality Really Help? The Classification of Emotion and of On/Off-Focus in Multimodal Dialogues - Two Case Studies. In: Grgic, Mislav; Grgic, Sonja (Eds.): Proceedings Elmar-2007 (Zadar, 12.-14.09.2007). Zadar: Croatian Society Electronics in Marine - ELMAR, 2007, pp. 9-16. ISBN 978-953-7044-05-3.

Batliner, Anton; Hacker, Christian; Nöth, Elmar: To Talk or not to Talk with a Computer: On-Talk vs. Off-Talk. In: Fischer, Kerstin (Ed.): How People Talk to Computers, Robots, and Other Artificial Communication Partners (Bremen, April 21-23, 2006), 2006, pp. 79-100.

Maier, Andreas; Hacker, Christian; Steidl, Stefan; Nöth, Elmar; Niemann, Heinrich: Robust Parallel Speech Recognition in Multiple Energy Bands. In: Kropatsch, Walter G.; Sablatnig, Robert; Hanbury, Allan (Eds.): Pattern Recognition, 27th DAGM Symposium, Vienna, Austria, 30 August - 2 September 2005, Proceedings, Vol. 1. Berlin: Springer, 2005, pp. 133-140. ISBN 3-540-28703-5.