Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction
Jin Sakuma, S. Fujie, Tetsunori Kobayashi
2022 IEEE Spoken Language Technology Workshop (SLT), published 2023-01-09
DOI: 10.1109/SLT54892.2023.10023458 (https://doi.org/10.1109/SLT54892.2023.10023458)
Citations: 6
Abstract
Appropriate response timing is essential for smooth dialog progression. Conventionally, prosodic, temporal, and linguistic features have been used to determine this timing. In addition to these conventional features, we propose using the syntactic completeness expected a certain time ahead, which indicates whether the other party is about to finish speaking. We generate the next token sequence from intermediate speech recognition results using a language model and obtain the probability that the end of the utterance appears $K$ tokens ahead, where $K$ varies from 1 to $M$. This yields an $M$-dimensional vector, which we call the estimates of syntactic completeness (ESC). We evaluated the method on a simulated dialog database of a restaurant information center. The results confirmed that considering ESC improves response timing estimation, especially the accuracy of quick responses, compared with a method that uses only conventional features.
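
To make the ESC idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy bigram table in place of a real language model, a hypothetical end-of-utterance token `<eou>`, and greedy decoding for the generated continuation; the abstract does not specify any of these. The names `BIGRAM`, `next_token_probs`, and `esc_vector` are illustrative only, and all probabilities are made up.

```python
from typing import Dict, List

EOS = "<eou>"  # hypothetical end-of-utterance token (assumption)

# Toy bigram "language model": P(next token | previous token).
# The vocabulary and probabilities are invented for illustration.
BIGRAM: Dict[str, Dict[str, float]] = {
    "i": {"want": 0.9, EOS: 0.1},
    "want": {"sushi": 0.6, "ramen": 0.3, EOS: 0.1},
    "sushi": {"please": 0.3, EOS: 0.7},
    "ramen": {"please": 0.4, EOS: 0.6},
    "please": {EOS: 1.0},
}


def next_token_probs(context: List[str]) -> Dict[str, float]:
    """Return the next-token distribution given the last token of the context."""
    # Unknown tokens are assumed to end the utterance in this toy model.
    return BIGRAM.get(context[-1], {EOS: 1.0})


def esc_vector(context: List[str], m: int) -> List[float]:
    """Estimate an M-dimensional ESC vector for a partial recognition result.

    Entry K-1 holds the model's probability of the end-of-utterance token
    at K tokens ahead, measured along a greedily generated continuation
    (greedy decoding is an assumption of this sketch).
    """
    if not context:
        raise ValueError("context must contain at least one token")
    tokens = list(context)
    esc: List[float] = []
    for _ in range(m):
        probs = next_token_probs(tokens)
        esc.append(probs.get(EOS, 0.0))
        # Continue the rollout with the most likely non-EOS token.
        candidates = {t: p for t, p in probs.items() if t != EOS}
        if not candidates:
            break  # the model can only end the utterance here
        tokens.append(max(candidates, key=candidates.get))
    # Pad positions beyond a forced end of utterance with zeros.
    esc.extend([0.0] * (m - len(esc)))
    return esc


print(esc_vector(["i"], 3))                   # early in the utterance
print(esc_vector(["i", "want", "sushi"], 3))  # close to a complete utterance
```

In this toy setting, a partial result that is close to a complete utterance puts high probability on the first entries of the vector, which is the kind of signal a timing estimator could combine with prosodic, temporal, and linguistic features.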