Trang Tran
Published: 2020
Total Pages: 109
Get eBook
Prosody comprises aspects of speech that communicate information beyond written words related to syntax, sentiment, intent, discourse, and comprehension. Decades of research have confirmed the importance of prosody in human speech perception and production, yet spoken language technology has made limited use of prosodic information. This limitation is due to several reasons. Words (written or transcribed) are often treated as discrete units while speech signals are continuous, which makes it challenging to combine these two modalities appropriately in spoken language systems. In addition, as variable as text can often be, text has fewer sources of variation than speech. Different meanings of a written or transcribed sentence can be communicated through punctuation, but a sentence can be spoken in many more ways, where prosody is often essential in conveying information not reflected in the word sequence. Moreover, given the highly variable nature of speech, most successful systems require a lot of data that covers these different aspects, which in turn requires powerful computing technology that was not available until recently. Given these challenges, and taking advantage of the recent advances in both the speech processing and natural language processing communities, this work aims to develop new mechanisms for integrating prosody in spoken language systems, using spontaneous and expressive speech. This thesis focuses on two language understanding tasks: (a) constituency parsing (identifying the syntactic structure of a sentence), motivated by the fact that prosodic boundaries align with constituent boundaries, and (b) dialog act recognition (identifying the segmentation and intents of utterances in discourse), motivated by the fact that prosodic boundaries signal dialog act boundaries, and intonational cues help disambiguate intents. Both parsing and dialog act recognition are important components of spoken language systems. This work makes several contributions. From the modeling perspective, we propose a method for integrating prosody effectively in spoken language understanding systems, which is shown empirically to advance the state of the art in parsing and dialog act recognition tasks. Further, our methods can be extended to other spoken language processing tasks. Through many experiments and analyses, our work contributes to a better understanding and design of language systems. Finally, speech understanding has broad impact on many areas, as it facilitates accessibility and allows for more natural human-computer interactions in education, health care, elder care, and AI-assisted domains in general.