Bielefeld University, Faculty of Technology, winter term 2011/2012

Intelligent Systems Lab Project: Dialoginterface for Webservices

Participants

Alexander Baumann
Marcel Dulisch
Dimitri Heil

Supervisors

Thies Pfeiffer

Motivation

One way to ease everyday tasks is to offer a service that lets people obtain information, or even use other services such as food delivery or quick public transport information, through natural speech.
Advantages would be:

To have one single and central service that can deliver any information instead of many different ones.
To link up personal preferences and configurations to this single service to obtain the desired information faster and easier.
The service would be language independent. You could get information from, e.g., a Chinese delivery service by speaking Portuguese.

Application Scenario

You are watching a movie with your friends and you get hungry. So you tell the service what you want to eat and where you want to order it. The service sends an email or a fax to the delivery service and you get your pizza, lasagne or whatever you desire.
You are at your friend's house in the next city and want to be home at a specific time. You just tell the service when you want to arrive at your home train station, and it tells you the departure time of the train or bus you have to take.

Objectives

The project goals are

To develop a central service
It should be usable with a mobile or any other phone.
Automatic speech recognition for German.
Speech synthesis in German.
Dialogue management with VoiceXML.
Obtain information by web scraping at runtime.
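
To illustrate the last objective, the following is a minimal sketch of scraping menu items at runtime. It is written in Python for brevity, whereas the project itself uses PHP-based scrapers. The HTML snippet, the class name "dish" and the names DishParser and scrape_dishes are assumptions made up for this example; a real delivery site would need its own extraction rules.

```python
from html.parser import HTMLParser

# Hypothetical menu page; a real delivery site would use different markup.
SAMPLE_HTML = """
<ul class="menu">
  <li class="dish">Pizza Margherita</li>
  <li class="dish">Lasagne</li>
  <li class="note">Free delivery</li>
</ul>
"""


class DishParser(HTMLParser):
    """Collects the text of every <li class="dish"> element."""

    def __init__(self):
        super().__init__()
        self.dishes = []
        self._in_dish = False

    def handle_starttag(self, tag, attrs):
        # Start a new entry only for list items marked as dishes.
        if tag == "li" and dict(attrs).get("class") == "dish":
            self._in_dish = True
            self.dishes.append("")

    def handle_data(self, data):
        # Text inside a dish element is appended to the current entry.
        if self._in_dish:
            self.dishes[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_dish = False


def scrape_dishes(html):
    """Return the dish names found in the given HTML string."""
    parser = DishParser()
    parser.feed(html)
    parser.close()
    return [d.strip() for d in parser.dishes if d.strip()]


if __name__ == "__main__":
    print(scrape_dishes(SAMPLE_HTML))
```

The resulting list of dish names is the kind of data that could then be turned into a recognition grammar for the dialogue.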

Description

The project is designed to be used with any phone. To achieve this, a central service was created that manages incoming calls, speech recognition, speech synthesis, data retrieval and dialogue flow of all kinds. The VoiceXML standard provides a well-defined format on which the dialogue management is based. JVoiceXML is used as a VoiceXML browser that interprets the dialogue files generated by the service. To make the service reachable from any phone, the open source software Zanzibar, which uses the JVoiceXML project, can be combined with an Asterisk PBX server. Speech recognition is handled by Windows Speech Recognition, and MARY TTS is used for speech synthesis. As mentioned above, the dialogue flow is managed within VoiceXML files. These files and the corresponding grammar files are generated by PHP scripts using information obtained from PHP-based web scrapers. Thanks to the flexibility of JVoiceXML and Asterisk, the information gathered for an order or query can be transmitted in various ways: a simple HTTP request, an email sent to a specific address (e.g. that of the pizza place), a call or a fax to an appropriate phone number, or any message passing between a script and a corresponding backend.
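
The generation step described above can be sketched roughly as follows. This is an illustration in Python, not the project's PHP code: the function name build_order_vxml, the German prompt text, the field name "dish" and the submit URL are all assumptions chosen for the example. It produces a VoiceXML 2.1 form with one field and an inline SRGS grammar listing the scraped dishes.

```python
from xml.sax.saxutils import escape, quoteattr


def build_order_vxml(dishes, submit_url):
    """Return a VoiceXML document whose inline grammar accepts the given dishes.

    All element contents (prompt text, field name) are illustrative assumptions.
    """
    # One grammar <item> per dish; escape() protects characters such as "&".
    items = "\n".join(
        f"            <item>{escape(dish)}</item>" for dish in dishes
    )
    # quoteattr() escapes the URL and adds the surrounding quotes.
    next_attr = quoteattr(submit_url)
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="de-DE">
  <form id="order">
    <field name="dish">
      <prompt>Was möchten Sie bestellen?</prompt>
      <grammar type="application/srgs+xml" version="1.0" root="dish" mode="voice">
        <rule id="dish">
          <one-of>
{items}
          </one-of>
        </rule>
      </grammar>
      <filled>
        <submit next={next_attr} namelist="dish"/>
      </filled>
    </field>
  </form>
</vxml>
"""


if __name__ == "__main__":
    # Hypothetical backend URL; the real service could also send email or fax.
    print(build_order_vxml(["Pizza Margherita", "Lasagne"],
                           "http://example.invalid/order"))
```

Generating the grammar from scraped data means that a changed menu only requires regenerating the file; the dialogue structure itself stays the same.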

Results

Our results can be seen in the demonstration video. It shows how you can order anything that is on the menu of the delivery service. The conversation flow can be improved by editing the VoiceXML file accordingly.
A demonstration is available as Interaction Video (mp4, 20 Mb) (YouTube Link: Video). The video also shows some problems of the speech recognizer. It does not recognize words from foreign languages, so they have to be pronounced as in the initial language; e.g. Italian terms cannot be recognized when the initial language is German. In addition, the voice synthesizer sounds very robotic and its speech is not fluent enough; both can be improved in many ways.

Discussion and Conclusion

In conclusion, our project showed that it is possible to create a remarkably powerful service with a wide field of application, but it is hindered by the following problems:

Most speech recognizers and synthesizers are not open source, and the ones that are aren't very powerful. For better recognition, one either has to pay for a commercial product or the open source recognizers have to be improved considerably.
The sound of the synthesized voice has the same problem. Paid voices are much more humanlike and are more likely to be accepted by people.
The number of languages available in open source software is still very limited compared to non-open-source products. For example, Nuance can deliver over 56 languages and dialects, whereas for Sphinx only about six language packs are usable, although more can admittedly be added with considerable effort.

Outlook

SIWI can be enhanced and improved. For example, the voice synthesizer can be improved by adding new voices to the MARY TTS project, which would increase the system's acceptance.
The voice could also be a personalization option in the service's settings.
Speech recognition is a subject of current research and is therefore not yet fully developed. As research progresses, future developers should consider integrating a different, more capable speech recognizer. Unfortunately, most of the available speech recognizers aren't open source, but a financial investment would pay off.
