The workshop will consist of nine invited talks by the following
speakers.
Invited speakers
Joachim Denzler, University of Erlangen-Nuremberg, Germany.
Recently, the strategy of active vision has become increasingly
important in the field of computer vision. Systems for scene
exploration rely strongly on the assumption that camera parameters
have been chosen for optimal image acquisition. Parallels can be drawn
for speech recognition, especially in dialogue: there, the system must
try to acquire from the user the information that is missing due to
errors in speech recognition, semantic analysis, ambiguities, or
unmodeled utterances.
John R. Kender,
Columbia University, USA.
We discuss a conceptual framework for representing, manipulating,
measuring, and communicating in natural language several ideas about
topological (non-metric) spatial locations, object spatial contexts,
and user expectations of spatial relationships. We articulate a
theory of spatial relations, how they can be represented as fuzzy
predicates internally, and how they can be appropriately derived from
imagery; then, how they can be augmented or filtered using prior
knowledge; and lastly, how they can produce natural language
statements about location and space. This framework quantifies the
notions of context and vagueness, so that all spatial relations are
measurably accurate, provably efficient, and matched to users'
expectations.
(Joint work with Alicia Abella, now of AT&T Research)
Naoyuki Okada, Kyushu Institute of Technology, Iizuka, Japan.
In each subfield of artificial intelligence, such as image understanding,
speech understanding, and robotics, a tremendous amount of research
effort has so far yielded considerable results. Unfortunately, these
results have ended up too different from one another to combine
straightforwardly. We have been conducting a case study, the AESOPWORLD
project, aiming at establishing an architectural foundation for an
"integrated" intelligent agent.
Hans-Hellmut Nagel,
University of Karlsruhe, Germany.
Image sequences make it possible to capture the temporal development of a
scene in the field of view of a recording video camera. The algorithmic
evaluation of digitized video sequences explicates information encoded in
their spatio-temporal gray-value variation.
Deb Roy,
MIT Media Lab, USA.
I will present a computational model of early language learning which
acquires a lexicon from raw auditory and visual input. The model uses
a multidirectional inference process in which acoustic events and
co-occurring visual observations serve as mutual constraints for
learning to segment speech, form visual categories, and ultimately
establish associations between spoken words and their visual
semantics.
Bernd Neumann, University of Hamburg, Germany.
Description Logics (DL) are an interesting candidate for the knowledge
services required by high-level vision. This talk presents recent work
aimed at improving the usefulness of DL for spatial and temporal
reasoning and other aspects of high-level vision.
Sven Wachsmuth,
University of Bielefeld, Germany.
The interaction of image and speech processing is a crucial property of
multimedia systems. Classical systems that draw inferences on purely
qualitative high-level descriptions miss a lot of information when
confronted with erroneous, vague, or incomplete data.
"Knowledge Based Image and Speech Analysis for Service Robots"
"From Images to Sentences Via Spatial Relations"
"Towards Affective Integration of Vision, Behaviour, and
Speech Processing"
"From Video to Language -- a Detour via Logic vs. Jumping to Conclusions"
"Sensory-Grounded Language Learning"
"Event Recognition With Description Logics"
"Connecting Concepts in Vision and Speech Processing"
Knowledge Based Image and Speech Analysis for Service Robots
LME at U Erlangen-Nuremberg,
denzler@informatik.uni-erlangen.de
Active image analysis as well as speech understanding and dialogue are
important abilities of a service robot. Such an autonomous mobile system
moves in a natural environment and has to interact with humans.
This talk presents a knowledge-based approach to both image and speech
analysis using semantic networks. For the active vision part, the framework
of semantic networks is used to model camera actions at different levels;
this is achieved by integrating concepts for camera actions into the
knowledge base. For speech understanding, the same framework is used to
conduct a dialogue. Both image and speech analysis use the semantic network
system ERNEST and an iterative control algorithm with real-time and
any-time capabilities. This common framework makes integration into and
use within service robots straightforward.
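To make the idea of modeling camera actions as concepts more concrete, the
following minimal Python sketch shows concept nodes connected by part-of
links and an iterative, interruptible control loop. All class names,
attributes, and numbers are illustrative assumptions; they do not reflect
the actual ERNEST formalism or its control algorithm.

    # Minimal sketch of a semantic network with camera-action concepts.
    # Names and structure are illustrative only; they do not reflect ERNEST.

    class Concept:
        def __init__(self, name, parts=(), action=None):
            self.name = name          # concept label, e.g. "OfficeScene"
            self.parts = list(parts)  # part-of links to other concepts
            self.action = action      # optional camera action attached to the concept
            self.score = 0.0          # current degree of verification

    def instantiate(concept, sensor_data):
        """One cheap instantiation step; returns an updated score in [0, 1]."""
        # Placeholder: a real system would match the concept against image
        # or speech data instead of looking up a fixed score increment.
        concept.score = min(1.0, concept.score + sensor_data.get(concept.name, 0.1))
        return concept.score

    def anytime_control(root, sensor_data, budget):
        """Iterative control: refine the best partial interpretation until the
        time budget is exhausted; the current best result is always available."""
        agenda = [root]
        best = root
        for _ in range(budget):            # each iteration is interruptible
            concept = max(agenda, key=lambda c: c.score, default=None)
            if concept is None:
                break
            instantiate(concept, sensor_data)
            if concept.action:
                concept.action()           # e.g. trigger a camera movement
            agenda.remove(concept)
            agenda.extend(concept.parts)   # expand part-of links
            if concept.score > best.score:
                best = concept
        return best

    # Example: a tiny knowledge base with a camera action as a concept.
    zoom = Concept("ZoomOnObject", action=lambda: print("zooming camera"))
    scene = Concept("OfficeScene", parts=[zoom])
    anytime_control(scene, {"OfficeScene": 0.5}, budget=3)

The point of such an any-time loop is that the currently best partial
interpretation can be returned whenever the robot's time budget runs out.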
From Images to Sentences Via Spatial Relations
Department of Computer Science,
jrk@cs.columbia.edu
The work makes explicit two critical heuristics for reducing the
complexity of the relationships implicit in imagery: one a general
rule for single-object descriptions, and the other a general rule for
rank-ordering object relationships.
A derived working system combines various aspects of computer science
and linguistics in such a way as to be extensible to many
environments. The system has been demonstrated in both a landmark
navigation task and a medical task, two very separate domains, and
has been evaluated in both.
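As a rough illustration of spatial relations represented as fuzzy
predicates derived from imagery, the following Python sketch scores
"left of" from object centroids. The cosine-based membership function
and the coordinate convention are assumptions made for illustration,
not the formulation used in this work.

    import math

    # Sketch: one way a spatial relation such as "left of" can be scored as a
    # fuzzy predicate from image coordinates. The membership function below is
    # an illustrative choice only.

    def left_of(reference, target):
        """Return a membership value in [0, 1] for 'target is left of reference'.

        Both arguments are (x, y) centroids in image coordinates, with x
        increasing to the right.
        """
        dx = target[0] - reference[0]
        dy = target[1] - reference[1]
        angle = math.atan2(dy, dx)                   # 0 rad points right, pi points left
        membership = (1.0 - math.cos(angle)) / 2.0   # 1.0 directly left, 0.0 directly right
        return membership

    # A vague location such as "roughly left of the door" keeps its degree of truth:
    door, chair = (120.0, 80.0), (40.0, 95.0)
    print(f"left_of(door, chair) = {left_of(door, chair):.2f}")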
Towards Affective Integration of Vision, Behaviour, and Speech Processing
AI Lab at KIT,
okada@ai.kyutech.ac.jp
In this talk, we first review our previously proposed agent model.
It integrates seven activities of intelligence: recognition, planning,
action, desire, emotion, memory, and language. We then show how
recognition and action interact with each other through planning; image
understanding is part of this interaction. Next, we show language
processing for communication, in which recognition and planning are
essential. Finally, we discuss the activity of emotion to show an
application of our agent to affective communication. This computation
of emotions is expected to provide a basis for friendly human-computer
interfaces.
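The following Python sketch is a purely illustrative agent loop that touches
the seven activities listed above; its module boundaries, control flow, and
data are assumptions and do not describe the AESOPWORLD architecture.

    # Illustrative agent loop naming the seven activities from the abstract
    # (recognition, planning, action, desire, emotion, memory, language).

    class IntegratedAgent:
        def __init__(self):
            self.memory = []            # episodic record shared by all activities
            self.emotion = "neutral"    # coarse affective state

        def recognize(self, percept):
            self.memory.append(("percept", percept))
            return percept

        def desire(self):
            # Pick a goal; here simply the most recent request, if any.
            return next((p for k, p in reversed(self.memory) if k == "request"), None)

        def plan(self, goal, state):
            # Trivial planner: one action that mentions both goal and state.
            return [("act_on", goal, state)] if goal else []

        def act(self, step):
            self.memory.append(("action", step))
            return f"executed {step}"

        def speak(self, text):
            # Language activity used for communication; emotion colors the phrasing.
            prefix = "gladly " if self.emotion == "positive" else ""
            return prefix + text

        def step(self, percept, request=None):
            if request:
                self.memory.append(("request", request))
            state = self.recognize(percept)          # recognition
            goal = self.desire()                     # desire
            for action in self.plan(goal, state):    # planning
                result = self.act(action)            # action
                self.emotion = "positive"            # crude affect update (emotion)
                print(self.speak(result))            # language

    agent = IntegratedAgent()
    agent.step(percept="red block on table", request="pick up the red block")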
From Video to Language -- a Detour via Logic vs. Jumping to Conclusions
IAKS at U Karlsruhe,
KOGS at Fraunhofer IITB,
hhn@iitb.fhg.de
Highlighting explicated information, for example a moving body, for easier
detection and tracking by a human observer constitutes only the most basic
possibility. In general, it is desirable to transform the explicated
information into a representation that can be adapted more easily to the
requirements of human interaction. Natural language communication clearly
offers a flexible way to convey many aspects of the temporal developments
within a scene.
Attempts to transform the geometric representation obtained by image
sequence evaluation into a natural language description suggest that it is
advantageous to first transform the intermediate geometric results into a
fuzzy metric temporal logic representation. This can in turn be processed by
logic operations to extract those pieces of information that are of
particular interest at a given moment. Examples from video sequences
recorded at inner-city road intersections will be used to illustrate
problems and solution approaches in this direction.
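As a toy illustration of evaluating fuzzy predicates over geometric results
before verbalizing them, the following Python sketch grades "slow" and
"decelerating" over a sequence of tracked vehicle speeds and combines them
with simple fuzzy operators. The membership functions, operators, and
numbers are assumptions for illustration, not the fuzzy metric temporal
logic formalism of the talk.

    # Illustrative fuzzy predicates over a vehicle trajectory.

    def slow(speed_kmh):
        """Fuzzy degree to which a speed counts as 'slow' (ramp between 30 and 10 km/h)."""
        return max(0.0, min(1.0, (30.0 - speed_kmh) / 20.0))

    def decelerating(speed_now, speed_before):
        """Fuzzy degree of deceleration, saturating at a drop of 10 km/h per step."""
        return max(0.0, min(1.0, (speed_before - speed_now) / 10.0))

    def fuzzy_and(a, b):
        return min(a, b)          # a common choice of t-norm

    def eventually(values):
        return max(values)        # temporal 'at some time point' as a maximum

    # Time-stamped speeds (km/h) of a tracked vehicle approaching an intersection.
    speeds = [48.0, 41.0, 33.0, 22.0, 12.0]

    # Degree to which "the vehicle is slowing down while being slow" ever holds,
    # which could later be verbalized as "the car slows down before the intersection".
    degrees = [
        fuzzy_and(slow(s_now), decelerating(s_now, s_prev))
        for s_prev, s_now in zip(speeds, speeds[1:])
    ]
    print(f"eventually(slow and decelerating) = {eventually(degrees):.2f}")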
Sensory-Grounded Language Learning
Vision and Modeling Group at the MIT Media Lab,
dkroy@media.mit.edu
The model has been implemented on a custom-built robotic platform. The
system gathers images of objects taken from multiple perspectives with
a CCD camera that is mounted on the robot. Acoustic and visual
analysis methods have been developed to extract robust representations
from microphone and camera input. A novel on-line algorithm operates
on these audio-visual representations to learn a sensory-grounded
lexicon. The acquired lexicon can be used to understand and generate
spoken words. The current system learns color and shape terms by
interacting with a person who presents the robot with objects while
talking about them.
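One ingredient of such cross-modal learning is scoring how strongly a
candidate spoken-word segment co-occurs with a visual category. The Python
sketch below uses pointwise mutual information over toy co-occurrence counts
as an illustrative stand-in; the data and the scoring choice are
assumptions, not the algorithm used in the presented model.

    import math
    from collections import Counter

    # Each episode pairs the word candidates heard with the visual category shown.
    # The data below are invented for illustration.
    episodes = [
        ({"red", "ball"}, "RED"),
        ({"red", "cup"}, "RED"),
        ({"blue", "ball"}, "BLUE"),
        ({"blue", "cup"}, "BLUE"),
    ]

    word_counts = Counter()
    visual_counts = Counter()
    pair_counts = Counter()
    n = len(episodes)

    for words, visual in episodes:
        visual_counts[visual] += 1
        for w in words:
            word_counts[w] += 1
            pair_counts[(w, visual)] += 1

    def pmi(word, visual):
        """Pointwise mutual information between a word candidate and a visual category."""
        joint = pair_counts[(word, visual)] / n
        if joint == 0.0:
            return float("-inf")
        return math.log2(joint / ((word_counts[word] / n) * (visual_counts[visual] / n)))

    print(f"pmi('red',  'RED') = {pmi('red', 'RED'):.2f}")   # informative word
    print(f"pmi('ball', 'RED') = {pmi('ball', 'RED'):.2f}")  # uninformative word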
This research is motivated by two complementary goals. The first goal
is to develop a cognitively plausible model of early word learning
which makes use of multimodal sensory input. A study of parent-infant
interactions during the early word learning phase has been undertaken
to evaluate the model in this regard. I will present results of using
recordings from these experiments as input to the model.
A second motivation is to develop human-computer interfaces which can
tune to the language usage patterns of individuals. The goal is to
create an adaptive and natural interface which learns to understand
the user, regardless of individual differences in acoustic and
semantic form. I will present on-going efforts to integrate the word
learning system into various application domains including
entertainment, education, command-and-control, and assistive aids.
Event Recognition With Description Logics
AI Lab at U Hamburg,
Cognitive Systems Group,
neumann@informatik.uni-hamburg.de
Connecting Concepts in Vision and Speech Processing
Applied Computer Science at U Bielefeld,
swachsmu@techfak.uni-bielefeld.de
I will present an architecture that integrates various levels of processing
by using multiple representations of the visually observed scene.
These representations are vertically connected by Bayesian networks in
order to find the most plausible interpretation of the scene.
The interpretation of a spoken utterance naming an object in the visually
observed scene is modeled as another partial representation of the scene.
Concepts in the speech processing and vision components are connected
by a Bayesian network.
In this way, objects in the scene can be identified very robustly by
matching the verbal representation to the representation of the visually
observed scene.
The same scheme can be used to integrate pointing gestures.
Another possibility is to feed the identification results back to the
speech processing in an incremental way; the syntax processing component
can then resolve syntactic ambiguities by looking into the scene.
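As a small illustration of connecting a speech concept and a vision concept
in a Bayesian network, the following Python sketch infers the intended
object from a heard color word and a classified image color by enumeration.
The network structure, states, and probabilities are toy assumptions, not
those of the presented system.

    # Tiny Bayesian network connecting a speech concept (the color word heard)
    # and a vision concept (the color classified in the image) to the identity
    # of the intended object:
    #
    #   Object --> SpokenColorWord
    #   Object --> ObservedColor

    objects = ["red_bolt", "blue_cube"]
    prior = {"red_bolt": 0.5, "blue_cube": 0.5}

    # P(spoken color word | intended object): speech recognition is noisy.
    p_word = {
        "red_bolt":  {"red": 0.8, "blue": 0.1, "none": 0.1},
        "blue_cube": {"red": 0.1, "blue": 0.8, "none": 0.1},
    }

    # P(observed color class | intended object): vision is noisy too.
    p_color = {
        "red_bolt":  {"red": 0.7, "blue": 0.3},
        "blue_cube": {"red": 0.2, "blue": 0.8},
    }

    def identify(word, color):
        """Posterior over the intended object given both modalities (exact
        inference by enumeration, trivial for a network this small)."""
        joint = {o: prior[o] * p_word[o][word] * p_color[o][color] for o in objects}
        z = sum(joint.values())
        return {o: v / z for o, v in joint.items()}

    # A vague utterance ("the red one") plus a noisy image measurement
    # still yields a robust identification:
    print(identify(word="red", color="red"))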
The implemented system is used in an assembly task environment:
a human instructs a robot, which then carries out the construction.
Both communication partners perceive an arrangement of building blocks
on a table, which can be assembled by screwing or plugging.
The human speaker has to specify objects in the scene verbally by
describing their properties without knowing the exact terms, so the
system has to interpret mostly vague descriptions.
The system is able to view the objects in the scene and has a speech
recognition and speech understanding unit.
Last Update: May-2-99