Anastasiia Tsukanova
Published: 2019
Total Pages: 0
Get eBook
The thesis is set in the domain of articulatory speech synthesis and consists of three major parts: the first two are dedicated to the development of two articulatory speech synthesizers and the third addresses how we can relate them to each other. The first approach results from a rule-based approach to articulatory speech synthesis that aimed to have a comprehensive control over the articulators (the jaw, the tongue, the lips, the velum, the larynx and the epiglottis). This approach used a dataset of static mid-sagittal magnetic resonance imaging (MRI) captures showing blocked articulation of French vowels and a set of consonant-vowel syllables; that dataset was encoded with a PCA-based vocal tract model. Then the system comprised several components: using the recorded articulatory configurations to drive a rule-based articulatory speech synthesizer as a source of target positions to attain (which is the main contribution of this first part); adjusting the obtained vocal tract shapes from the phonetic perspective; running an acoustic simulation unit to obtain the sound. The results of this synthesis were evaluated visually, acoustically and perceptually, and the problems encountered were broken down by their origin: the dataset, its modeling, the algorithm for managing the vocal tract shapes, their translation to the area functions, and the acoustic simulation. We concluded that, among our test examples, the articulatory strategies for vowels and stops are most correct, followed by those of nasals and fricatives. The second explored approach started off a baseline deep feed-forward neural network-based speech synthesizer trained with the standard recipe of Merlin on the audio recorded during real-time MRI (RT-MRI) acquisitions: denoised (and yet containing a considerable amount of noise of the MRI machine) speech in French and force-aligned state labels encoding phonetic and linguistic information. This synthesizer was augmented with eight parameters representing articulatory information--the lips opening and protrusion, the distance between the tongue and the velum, the velum and the pharyngeal wall and the tongue and the pharyngeal wall--that were automatically extracted from the captures and aligned with the audio signal and the linguistic specification. The jointly synthesized speech and articulatory sequences were evaluated objectively with dynamic time warping (DTW) distance, mean mel-cepstrum distortion (MCD), BAP (band aperiodicity prediction error), and three measures for F0: RMSE (root mean square error), CORR (correlation coefficient) and V/UV (frame-level voiced/unvoiced error). The consistency of articulatory parameters with the phonetic label was analyzed as well. I concluded that the generated articulatory parameter sequences matched the original ones acceptably closely, despite struggling more at attaining a contact between the articulators, and that the addition of articulatory parameters did not hinder the original acoustic model. The two approaches above are linked through the use of two different kinds of MRI speech data. This motivated a search for such coarticulation-aware targets as those that we had in the static case to be present or absent in the real-time data. To compare static and real-time MRI captures, the measures of structural similarity, Earth mover's distance, and SIFT were utilized; having analyzed these measures for validity and consistency, I qualitatively and quantitatively studied their temporal behavior, interpreted it and analyzed the identified similarities. I concluded that SIFT and structural similarity did capture some articulatory information and that their behavior, overall, validated the static MRI dataset. [...].