Speech Repairs, Intonational Boundaries and Discourse Markers: Modeling Speakers’ Utterances in Spoken Dialog
by Peter Anthony Heeman
University of Rochester, Rochester, New York, 1997
Abstract

Interactive spoken dialog provides many new challenges for natural language understanding
systems. One of the most critical challenges is simply determining the speaker’s
intended utterances: both segmenting a speaker’s turn into utterances and determining
the intended words in each utterance. Even assuming perfect word recognition, the latter
problem is complicated by speech repairs, in which the
speaker goes back and changes (or repeats) something she just said. The words that are
replaced or repeated are no longer part of the intended utterance, and so need to be identified.
The two problems of segmenting the turn into utterances and resolving speech
repairs are strongly intertwined with a third problem: identifying discourse markers.
Lexical items that can function as discourse markers, such as well and okay, are
ambiguous as to whether they are introducing an utterance unit, signaling a speech
repair, or simply part of the context of an utterance, as in that’s okay. Spoken
dialog systems need to address these three issues together and early on in the processing
stream. In fact, just as these three issues are closely intertwined with each other,
they are also intertwined with identifying the syntactic role or part-of-speech (POS) of
each word and the speech recognition problem of predicting the next word given the
previous words.
In this thesis, we present a statistical language model for resolving these issues.
Rather than finding the best word interpretation for an acoustic signal, we redefine the
speech recognition problem so that it also identifies the POS tags, discourse markers,
speech repairs and intonational phrase endings (a major cue in determining utterance
units). Adding these extra elements to the speech recognition problem actually allows it
to better predict the words involved, since we are able to make use of the predictions of
boundary tones, discourse markers and speech repairs to better account for what word
will occur next. Furthermore, we can take advantage of acoustic cues, such as silences,
that tend to co-occur with speech repairs and intonational phrase endings but that
current language models can only regard as noise in the acoustic signal.
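As a rough sketch of this reformulation (the notation below is illustrative rather than the thesis’s own): standard speech recognition seeks the word sequence W that best accounts for the acoustic signal A, whereas the redefined problem searches jointly over the words and the added annotations,

    \hat{W} = \arg\max_{W} \Pr(A \mid W)\,\Pr(W)

    (\hat{W},\hat{P},\hat{D},\hat{R},\hat{I}) = \arg\max_{W,P,D,R,I} \Pr(A \mid W)\,\Pr(W,P,D,R,I)

where P stands for the POS tags, D for the discourse-marker status of each word, R for the speech-repair annotations, and I for the intonational phrase endings, so that the language model term can exploit these annotations when predicting the next word.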
The output of this language model is a much fuller account of the speaker’s turn, with
a part-of-speech tag assigned to each word, intonational phrase endings and discourse markers
identified, and speech repairs detected and corrected. In fact, the identification of
intonational phrase endings and discourse markers, together with the resolution of
speech repairs, allows the speech recognizer to model the speaker’s utterances, rather than simply the
words involved, and thus it can return a more meaningful analysis of the speaker’s turn
for later processing.