Voice quality enhancement for vocal tract rehabilitation
Bianca Sutcliffe, L. Wiggins, D. Rubin, V. Aharonson
2018 3rd Biennial South African Biomedical Engineering Conference (SAIBMEC), published 2018-04-01
DOI: https://doi.org/10.1109/SAIBMEC.2018.8363197
Citations: 0
Abstract
Vocal rehabilitation devices used by patients after laryngectomy produce unnatural-sounding speech. Our study aims to improve the quality of these synthetically generated voices by giving them human-like characteristics. A simplified source-filter model, linear predictive coding coefficients, and line spectral frequencies were used to model the vocal tract and to manipulate the acoustic features of the resulting speech. Two mapping functions were employed to map the features of the synthetically generated voice to those of a human voice: a Gaussian mixture model and a linear regression model. The models were trained on a set of 50 human and 50 synthetic voice utterances. Both mapping functions yielded significant changes in the transformed synthetic voices, whose spectra were similar to those of the human voices. The linear regression mapping produced slightly better results than the Gaussian mixture model mapping. Listening tests confirmed this result but indicated that voices re-synthesized from the transformed model coefficients improved on the synthetic voice yet still sounded unnatural. This may imply that the vocal tract model does not capture the information that produces the subjective perception of “artificial speech”. Future work will investigate a more elaborate model that includes the speech production excitation and radiation signals and the transformation of their features. Such models have the potential to improve the conversion of synthetically generated electrolarynx voice into human-sounding speech.
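The feature pipeline named in the abstract (linear predictive coding coefficients, line spectral frequencies, and a linear regression mapping) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the model order, the frame length, or the regression details, so the function names (`lpc`, `lpc_to_lsf`, `fit_linear_map`, `apply_linear_map`), the autocorrelation method, and the least-squares fit with a bias term are all assumptions made for illustration. The Gaussian mixture model mapping is not shown.

```python
import numpy as np


def lpc(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin (an assumed choice).

    Returns the polynomial coefficients [1, a_1, ..., a_p] of A(z).
    """
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_to_lsf(a, eps=1e-8):
    """Convert LPC coefficients to line spectral frequencies in radians.

    The LSFs are the angles in (0, pi) of the roots of the sum and
    difference polynomials P(z) and Q(z) built from A(z).
    """
    ext = np.append(a, 0.0)
    p_poly = ext + ext[::-1]
    q_poly = ext - ext[::-1]
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    # Drop the trivial roots at z = 1 and z = -1 and the conjugate halves.
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])


def fit_linear_map(src, tgt):
    """Least-squares fit of tgt ~ [src, 1] @ W (hypothetical mapping form).

    src and tgt are (frames, features) arrays of time-aligned feature vectors.
    """
    x = np.hstack([src, np.ones((src.shape[0], 1))])
    w, *_ = np.linalg.lstsq(x, tgt, rcond=None)
    return w


def apply_linear_map(w, src):
    """Apply a mapping fitted by fit_linear_map to new source features."""
    x = np.hstack([src, np.ones((src.shape[0], 1))])
    return x @ w
```

In such a sketch, `src` would hold LSF vectors from the synthetic voice and `tgt` the time-aligned LSF vectors from the human voice. A linear map does not guarantee that the transformed LSFs stay in ascending order, so a practical system would likely need to check or enforce the ordering before converting back to filter coefficients for re-synthesis.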