Influence of the surprisal power adjustment on spoken word duration in emotional speech in Serbian
Authors: Jelena Lazić, Sanja Vujnović
Journal: Computer Speech and Language, volume 94, Article 101803 (impact factor 3.1; JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2025-04-11
DOI: 10.1016/j.csl.2025.101803
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000282
Citation count: 0
Abstract
Emotional speech analysis has attracted interest across multiple disciplines, yet it remains challenging because of its complexity and multimodality, and computer systems still struggle with robustness when handling emotional speech. Despite these difficulties, the wide range of potential applications, especially in the current era of intelligent agents and humanoid systems, has increased interest in the field. The development of machine learning models has brought a variety of novel techniques, including pre-trained language models. In this work, we used these models to study emotional speech from an information-theoretic perspective. Specifically, we analyzed language processing difficulty, measured by word-level spoken duration, and its relation to the distribution of information over speech, measured by word-level surprisal values. We analyzed a dataset of audio recordings in Serbian, a low-resource language, recorded under five different emotional states of the speakers. Seven state-of-the-art machine learning language models were employed to estimate surprisal values, which were then used as predictors of word-level spoken duration. Our results agree with related studies on English and indicate that surprisal values estimated by machine learning models may be good predictors of speech parameters in Serbian. Furthermore, modulating the power of surprisal values led to different outcomes across the speakers' emotional states. This suggests that the role of surprisal in speech production may differ under different emotional conditions.
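The abstract does not spell out the regression setup, so the following is only a minimal sketch of what a surprisal power adjustment could look like, assuming surprisal is defined as -log2 p(word | context) and that the adjustment raises surprisal to an exponent k before it enters a linear model of word duration. The function names, the grid of k values, and the numbers are hypothetical and synthetic; they are not the authors' models, data, or results.

```python
import numpy as np


def surprisal(prob):
    """Surprisal in bits: -log2 p(word | context)."""
    return -np.log2(np.asarray(prob, dtype=float))


def fit_power(surp, duration, k):
    """Least-squares fit of duration ~ a + b * surp**k.

    Returns (a, b, r2), where r2 is the coefficient of determination.
    """
    x = np.asarray(surp, dtype=float) ** k
    y = np.asarray(duration, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept + powered surprisal
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return float(coef[0]), float(coef[1]), 1.0 - ss_res / ss_tot


# Synthetic word probabilities (assumption: these would come from a language model).
probs = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125])
s = surprisal(probs)

# Synthetic durations in seconds, generated with exponent 1.5 purely for illustration.
durations = 0.1 + 0.02 * s ** 1.5

# Hypothetical grid of exponents; pick the one with the highest R^2.
results = {k: fit_power(s, durations, k) for k in (0.5, 1.0, 1.5, 2.0)}
best_k = max(results, key=lambda k: results[k][2])
```

Under these assumptions, comparing the best-fitting exponent separately for each emotional state would be one way to probe the kind of difference the abstract reports; a real analysis would likely also need control predictors such as word length and frequency.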
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.