The workshop will consist of nine invited talks by the following
speakers.
Invited speakers
Each talk will take about 20 minutes, leaving 10 minutes for discussion.
Joachim Denzler, University of Erlangen, Germany.
Recently the strategy of active vision has become more and more
important in the field of computer vision. Systems for scene
exploration rely strongly on the assumption that camera parameters
have been chosen for optimal image acquisition. For speech
recognition, especially in dialogue, parallels can be drawn: there,
the system must try to acquire from the user the information that is
missing due to errors in speech recognition or semantic analysis,
ambiguities, or unmodeled utterances.
John R. Kender,
Columbia University, USA.
We discuss a conceptual framework for representing, manipulating,
measuring, and communicating in natural language several ideas about
topological (non-metric) spatial locations, object spatial contexts,
and user expectations of spatial relationships. We articulate a
theory of spatial relations, how they can be represented as fuzzy
predicates internally, and how they can be appropriately derived from
imagery; then, how they can be augmented or filtered using prior
knowledge; and lastly, how they can produce natural language
statements about location and space. This framework quantifies the
notions of context and vagueness, so that all spatial relations are
measurably accurate, provably efficient, and matched to users'
expectations.
(Joint work with Alicia Abella, now of AT&T Research)
Naoyuki Okada, Kyushu Institute of Technology, Iizuka, Japan.
In each subfield of artificial intelligence, such as image understanding,
speech understanding, robotics, etc., a tremendous amount of research
effort has so far yielded considerable results. Unfortunately, these
results have turned out to be too disparate to combine with one another
in any straightforward way. We have been conducting a case study, the
AESOPWORLD project, which aims at establishing an architectural
foundation for an "integrated" intelligent agent.
Jeffrey Mark Siskind,
NEC Research Institute, Inc., USA.
This paper presents a novel framework for training models to recognise simple
spatial-motion events, such as those described by the verbs pick up, put down,
push, pull, drop, tip, and tap, and for classifying novel observations into
previously trained classes.
Hans-Hellmut Nagel,
University of Karlsruhe, Germany.
Image sequences make it possible to capture the temporal development of a scene
in the field of view of a recording video camera. The algorithmic evaluation of
digitized video sequences explicates information encoded in their
spatio-temporal grayvalue variation.
Deb Roy,
MIT Media Lab, USA.
I will present a computational model of early language learning which
acquires a lexicon from raw auditory and visual input. The model uses
a multidirectional inference process in which acoustic events and
co-occurring visual observations serve as mutual constraints for
learning to segment speech, form visual categories, and ultimately
establish associations between spoken words and their visual
semantics.
Bernd Neumann, University of Hamburg, Germany.
Description Logics (DL) are an interesting candidate for the knowledge
services required for high-level vision. This talk presents
recent work aimed at improving the usefulness of DL for spatial and
temporal reasoning and other aspects of high-level vision.
Sven Wachsmuth,
University of Bielefeld, Germany.
The interaction of image and speech processing is a crucial property of
multimedia systems. Classical systems that draw inferences on purely
qualitative high-level descriptions miss a lot of information when
confronted with erroneous, vague, or incomplete data.
"Knowledge Based Image and Speech Analysis for Service Robots"
"From Images to Sentences Via Spatial Relations"
"Towards Affective Integration of Vision, Behaviour, and
Speech Processing"
"Visual Event Perception"
"From Video to Language -- a Detour via Logic vs. Jumping to Conclusions"
"Sensory-Grounded Language Learning"
"Event Recognition With Description Logics"
"Connecting Concepts in Vision and Speech Processing"
Time Table

Time    Speaker                 Title
 9.00   Gerhard Sagerer         Welcome and Opening
 9.10   Sven Wachsmuth          Connecting Concepts in Vision and Speech Processing
 9.40   Joachim Denzler         Knowledge Based Image and Speech Analysis for Service Robots
10.10   Naoyuki Okada           Towards Affective Integration of Vision, Behaviour, and Speech Processing
10.40   coffee break
11.10   John R. Kender          From Images to Sentences Via Spatial Relations
11.40   Hilary Buxton           --
12.10   Hans-Hellmut Nagel      From Video to Language -- a Detour via Logic vs. Jumping to Conclusions
12.40   lunch
14.00   Bernd Neumann           Event Recognition With Description Logics
14.30   Deb Roy                 Sensory-Grounded Language Learning
15.00   Jeffrey Mark Siskind    Visual Event Perception
15.30   Gerhard Sagerer         Final Discussion
Knowledge Based Image and Speech Analysis for Service Robots
LME at U Erlangen-Nuremberg,
denzler@informatik.uni-erlangen.de
Active image analysis as well as speech understanding and dialogue
are important abilities of a service robot. Such an autonomous mobile system
moves in a natural environment and has to interact with humans.
This talk presents a knowledge based approach to both image and speech
analysis using semantic networks. For the active vision part, the framework
of semantic networks is used to model camera actions at different
levels; this is achieved by integrating concepts for camera actions
into the knowledge base. For speech understanding, the same framework is
used to conduct a dialogue. Both image and speech analysis use the semantic
network formalism ERNEST and an iterative control algorithm with real-time
and any-time capabilities. This common framework makes the integration
and use in service robots straightforward.
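The any-time control mentioned above can be illustrated with a small generic
sketch in Python: a best-first search over partial interpretations that can be
stopped at any moment and returns the best hypothesis found so far. This is
only an illustration of the any-time idea, not the ERNEST control algorithm
itself; the expand, score, and is_complete functions are placeholders to be
supplied by a concrete knowledge base.

    import heapq
    import time

    def anytime_interpret(initial, expand, score, is_complete, deadline_s=0.05):
        # Best-first search over partial interpretations with a soft deadline.
        #   expand(h)      -> iterable of refined hypotheses derived from h
        #   score(h)       -> larger is better (placeholder for a real match measure)
        #   is_complete(h) -> True if h is a full interpretation
        start = time.monotonic()
        counter = 0                                   # tie-breaker for the heap
        queue = [(-score(initial), counter, initial)]
        best, best_score = initial, float("-inf")     # best complete result so far
        while queue and time.monotonic() - start < deadline_s:
            _, _, hypo = heapq.heappop(queue)
            if is_complete(hypo):
                if score(hypo) > best_score:
                    best, best_score = hypo, score(hypo)
                continue                              # complete hypotheses are not expanded
            for succ in expand(hypo):
                counter += 1
                heapq.heappush(queue, (-score(succ), counter, succ))
        return best   # whatever is best when the deadline strikes (possibly still partial)

    # Toy usage: hypotheses are integers, refined by adding 1 or 2; a hypothesis
    # is complete once it reaches 10, and scores are highest close to 10.
    result = anytime_interpret(0, lambda n: [n + 1, n + 2],
                               lambda n: -abs(10 - n), lambda n: n >= 10)
    print(result)   # prints 10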
From Images to Sentences Via Spatial Relations
Department of Computer Science, Columbia University,
jrk@cs.columbia.edu
The work makes explicit two critical heuristics for reducing the
complexity of the relationships implicit in imagery, one a general
rule for single object descriptions, and the other a general rule for
rank ordering object relationships.
A derived working system combines various aspects of computer science
and linguistics in such a way as to be extensible to many
environments. The system has been demonstrated and evaluated both in a
landmark navigation task and in a medical task, two quite different
domains.
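The talk overview above mentions representing spatial relations internally as
fuzzy predicates. As a rough illustration of that idea in Python (not the
system described in the talk), the sketch below assigns graded truth values to
"left of" and "near" computed from bounding boxes; the membership functions,
parameters, and coordinates are invented for this example.

    def fuzzy_left_of(box_a, box_b, softness=20.0):
        # Degree in [0, 1] to which object A is left of object B.
        # Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
        gap = box_b[0] - box_a[2]          # gap between A's right edge and B's left edge
        return max(0.0, min(1.0, 0.5 + gap / softness))

    def fuzzy_near(box_a, box_b, scale=200.0):
        # Degree in [0, 1] to which the two object centres are near each other.
        ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
        bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
        dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        return max(0.0, 1.0 - dist / scale)

    # Fuzzy conjunction (min) lets a composite statement such as
    # "the cup is left of and near the plate" receive a graded truth value.
    cup = (40, 60, 80, 100)
    plate = (120, 55, 200, 110)
    degree = min(fuzzy_left_of(cup, plate), fuzzy_near(cup, plate))
    print(f"'cup is left of and near the plate' holds to degree {degree:.2f}")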
Towards Affective Integration of Vision, Behaviour, and Speech Processing
AI Lab at KIT,
okada@ai.kyutech.ac.jp
In this article, we first review our agent model proposed previously.
It integrates seven activities of intelligence: recognition, planning,
action, desire, emotion, memory, and language. We then show how
recognition and action interact with each other through planning; image
understanding is a part of this interaction. Next, we describe language
processing for communication, in which recognition and planning are
essential. Finally, we discuss the activity of emotion and show an
application of our agent to affective communication. This computational
treatment of emotion can be expected to provide a basis for friendly
human-computer interfaces.
Visual Event Perception
Vision Research at NECI,
Qobi@research.nj.nec.com
Simple colour- and motion-based segmentation and tracking techniques are used
to produce a time series of feature vectors constructed from the 2D object
positions, orientations, shapes, and sizes.
Hidden Markov models are trained on this time series data and used to classify
novel occurrences into previously trained classes.
The particular choice of features used allows the system to construct
meaningful semantic representations of the event classes that it has learned.
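As an illustration of this classification scheme, the sketch below trains one
hidden Markov model per event class on sequences of per-frame feature vectors
and assigns a novel sequence to the class whose model yields the highest
log-likelihood. The hmmlearn library and the random toy data are assumptions
of this sketch, not part of the original system.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_event_models(sequences_by_class, n_states=4):
        # sequences_by_class: {event name: [array of shape (frames, n_features), ...]}
        models = {}
        for name, seqs in sequences_by_class.items():
            X = np.vstack(seqs)                  # concatenated per-frame feature vectors
            lengths = [len(s) for s in seqs]     # lengths of the individual sequences
            model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
            model.fit(X, lengths)
            models[name] = model
        return models

    def classify_event(models, sequence):
        # Assign the sequence to the class whose HMM gives the highest log-likelihood.
        return max(models, key=lambda name: models[name].score(sequence))

    # Toy usage with random 6-dimensional feature vectors
    # (e.g. x, y, orientation, width, height, area per tracked object).
    rng = np.random.default_rng(0)
    data = {
        "pick up":  [rng.normal(0.0, 1.0, size=(30, 6)) for _ in range(5)],
        "put down": [rng.normal(2.0, 1.0, size=(30, 6)) for _ in range(5)],
    }
    models = train_event_models(data)
    print(classify_event(models, rng.normal(2.0, 1.0, size=(30, 6))))   # likely "put down"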
From Video to Language -- a Detour via Logic vs. Jumping to Conclusions
IAKS at U Karlsruhe,
KOGS at Fraunhofer IITB,
hhn@iitb.fhg.de
Highlighting explicated information -- for example a moving body -- for easier
detection and tracking by a human observer constitutes only the most basic
possibility. It is desirable in general to transform the explicated information
into a representation which can be more easily adapted to the requirements of
human interaction. Natural language communication clearly offers a flexible way
to communicate many aspects about the temporal developments within a scene.
Attempts to transform the geometric representation obtained by image sequence
evaluation into a natural language description suggest that it is advantageous
to first convert the intermediate geometrical results into a fuzzy metric
temporal logic representation. This representation can in turn be processed by
logic operations to extract those pieces of partial information that are of
particular interest at a given moment. Examples from video sequences recorded
at inner-city road intersections will be used to illustrate problems and
solution approaches in this direction.
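A minimal sketch of evaluating fuzzy predicates along a metric time axis (not
the speaker's actual formalism) is given below: per-frame membership degrees
are combined with min for "throughout an interval" and with max for "at some
moment of an interval". The trajectory data and the membership function are
invented for this example.

    def degree_fast(speed_kmh, low=30.0, high=60.0):
        # Fuzzy membership of "moving fast": 0 below `low`, 1 above `high`,
        # linear in between.
        return min(1.0, max(0.0, (speed_kmh - low) / (high - low)))

    def always(degrees):
        # Degree to which a predicate holds throughout an interval (fuzzy "and" = min).
        return min(degrees)

    def sometime(degrees):
        # Degree to which a predicate holds at some moment of an interval (fuzzy "or" = max).
        return max(degrees)

    # Speeds of a tracked vehicle sampled once per second near an intersection.
    speeds = [25.0, 38.0, 52.0, 61.0, 58.0, 44.0]
    fast = [degree_fast(v) for v in speeds]
    print("always fast   :", round(always(fast), 2))     # low, since the vehicle starts slowly
    print("sometime fast :", round(sometime(fast), 2))   # high, since the peak exceeds 60 km/h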
Sensory-Grounded Language Learning
MIT Media Lab,
dkroy@media.mit.edu
The model has been implemented on a custom-built robotic platform. The
system gathers images of objects taken from multiple perspectives with
a CCD camera that is mounted on the robot. Acoustic and visual
analysis methods have been developed to extract robust representations
from microphone and camera input. A novel on-line algorithm operates
on these audio-visual representations to learn a sensory-grounded
lexicon. The acquired lexicon can be used to understand and generate
spoken words. The current system learns color and shape terms by
interacting with a person who presents the robot with objects while
talking about them.
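How cross-modal co-occurrence can ground a lexicon may be illustrated with the
following generic sketch (it is not the speaker's actual on-line algorithm):
hypothesised acoustic units and co-occurring visual categories are counted, and
pairs with high pointwise mutual information are adopted as word-meaning
associations. The toy observations and the threshold are invented for this
example.

    from collections import Counter
    from math import log

    observations = [   # (hypothesised acoustic unit, co-occurring visual category)
        ("red", "RED"), ("ball", "ROUND"), ("red", "RED"), ("cup", "CUP"),
        ("red", "ROUND"), ("ball", "ROUND"), ("cup", "CUP"), ("ball", "RED"),
    ]

    pair_counts = Counter(observations)
    word_counts = Counter(w for w, _ in observations)
    cat_counts = Counter(c for _, c in observations)
    n = len(observations)

    def pmi(word, category):
        # Pointwise mutual information between an acoustic unit and a visual category.
        joint = pair_counts[(word, category)] / n
        return log(joint / ((word_counts[word] / n) * (cat_counts[category] / n)))

    # Adopt strongly associated pairs into the lexicon.
    lexicon = sorted((w, c) for (w, c) in pair_counts if pmi(w, c) > 0.5)
    print(lexicon)   # [('ball', 'ROUND'), ('cup', 'CUP'), ('red', 'RED')]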
This research is motivated by two complementary goals. The first goal
is to develop a cognitively plausible model of early word learning
which makes use of multimodal sensory input. A study of parent-infant
interactions during the early word learning phase has been undertaken
to evaluate the model in this regard. I will present results of using
recordings from these experiments as input to the model.
A second motivation is to develop human-computer interfaces which can
tune to the language usage patterns of individuals. The goal is to
create an adaptive and natural interface which learns to understand
the user, regardless of individual differences in acoustic and
semantic form. I will present on-going efforts to integrate the word
learning system into various application domains including
entertainment, education, command-and-control, and assistive aids.
Event Recognition With Description Logics
AI Lab at U Hamburg,
Cognitive Systems Group,
neumann@informatik.uni-hamburg.de
Connecting Concepts in Vision and Speech Processing
Applied Computer Science at U Bielefeld,
swachsmu@techfak.uni-bielefeld.de
I will present an architecture that integrates various levels of processing
by using multiple representations of the visually observed scene.
They are vertically connected by Bayesian networks in order to find the
most plausible interpretation of the scene.
The interpretation of a spoken utterance naming an object in the visually
observed scene is modeled as another partial representation of the scene.
Concepts in the speech processing and vision components are connected
by a Bayesian network.
In this way, objects in the scene can be identified very robustly by
matching the verbal representation to the representation of the visually
observed scene.
The same scheme can also be used to integrate pointing gestures.
Another possibility is to feed the identification results back to the
speech processing in an incremental way, so that the syntax processing
component can resolve syntactic ambiguities by looking into the scene.
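The probabilistic coupling of verbal and visual evidence can be illustrated
with a small hand-rolled sketch (not the implemented system): given the
recognised colour and type of each object on the table and assumed confusion
probabilities for vague spoken words, the posterior over which object the
speaker means follows from Bayes' rule with a uniform prior.

    # Visual interpretation of the scene: recognised colour and type per object.
    scene = {
        "obj1": {"colour": "red",    "type": "bolt"},
        "obj2": {"colour": "orange", "type": "bolt"},
        "obj3": {"colour": "red",    "type": "cube"},
    }

    # Assumed P(spoken word | true attribute): speakers use vague or imprecise
    # terms, e.g. "red" is sometimes used for orange objects.
    p_word_given_colour = {("red", "red"): 0.90, ("red", "orange"): 0.35}
    p_word_given_type = {("bolt", "bolt"): 0.95, ("bolt", "cube"): 0.05}

    def posterior(spoken_colour, spoken_type):
        # P(intended object | utterance), assuming a uniform prior over the objects.
        scores = {}
        for obj, attrs in scene.items():
            pc = p_word_given_colour.get((spoken_colour, attrs["colour"]), 0.01)
            pt = p_word_given_type.get((spoken_type, attrs["type"]), 0.01)
            scores[obj] = pc * pt
        total = sum(scores.values())
        return {obj: round(s / total, 3) for obj, s in scores.items()}

    print(posterior("red", "bolt"))   # "obj1" receives most of the probability mass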
The implemented system is used in an assembly task environment.
A human instructs a robot, which is to carry out the construction.
To this end, both communication partners perceive an
arrangement of building blocks on a table.
These can be assembled by screwing or plugging.
The human speaker has to specify the objects in the scene verbally by
describing their properties without knowing the exact terms.
The system therefore has to interpret mostly vague descriptions.
It is able to view the objects in the scene and has a speech
recognition and speech understanding unit.
Last Update: May-2-99