Modeling rater judgments of interpreting quality: Ordinal logistic regression using neural-based evaluation metrics, acoustic fluency measures, and computational linguistic indices
Authors: Chao Han, Xiaolei Lu, Shirong Chen
Journal: Research Methods in Applied Linguistics, Vol. 4, No. 1, Article 100194
Published: 2025-03-05
DOI: 10.1016/j.rmal.2025.100194
Citations: 0
Abstract
Human raters remain central to interpreting quality assessment (IQA); however, recent years have witnessed a growing body of research exploring automatic assessment. These studies have used machine translation evaluation metrics, acoustic fluency measures, and computational linguistic indices as separate approaches to model rater judgments of interpreting quality. Nonetheless, limited research has integrated these three types of measures within a single framework. To address this gap, this exploratory and proof-of-concept study adopts an integrative approach, combining these three types of measures to model rater judgments of interpreting quality as a classification problem. Using a dataset of 161 Chinese-to-English interpretations, we applied ordinal logistic regression analysis to identify significant predictors across fidelity, fluency, and linguistic dimensions. The analyses yielded two sets of significant predictors, including (a) COMET-22, mean length of unfilled pauses, mean length of run, and mean word length, and (b) BLEURT-20, phonation time ratio, speech rate, mean word length, type-token ratio for content words, type-token ratio for all words, and mean word frequency for content words. These models performed well on the testing dataset, particularly for classifying interpretations into four bands of overall interpreting quality (e.g., accuracy = .643, 1-off accuracy = .805), based on the Rasch-calibrated scores from human evaluation. These findings suggest that this integrated approach may enhance the precision and scalability of IQA and has the potential to reduce logistical burdens in large-scale professional interpreter certification exams and language proficiency tests.
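The abstract reports both exact accuracy (.643) and 1-off accuracy (.805) for classifying interpretations into four quality bands. The sketch below is not the authors' code; it is a minimal, generic illustration of how these two metrics are typically computed for an ordinal scale, where 1-off accuracy counts a prediction as correct if it falls within one band of the human-assigned band. The band values in the example are hypothetical.

```python
def band_accuracy(true_bands, pred_bands):
    """Exact-match and 1-off accuracy for ordinal band predictions.

    1-off accuracy treats a prediction as correct when it lands
    within one band of the reference band -- a common leniency
    metric for ordinal rating scales.
    """
    if len(true_bands) != len(pred_bands):
        raise ValueError("true_bands and pred_bands must be the same length")
    n = len(true_bands)
    exact = sum(t == p for t, p in zip(true_bands, pred_bands)) / n
    one_off = sum(abs(t - p) <= 1 for t, p in zip(true_bands, pred_bands)) / n
    return exact, one_off

# Hypothetical human-assigned vs. model-predicted bands (1 = lowest, 4 = highest)
true_b = [1, 2, 2, 3, 4, 4, 3, 2]
pred_b = [1, 2, 3, 3, 4, 3, 1, 2]
exact, one_off = band_accuracy(true_b, pred_b)
# exact counts only identical bands; one_off also accepts adjacent bands
```

By construction, 1-off accuracy is always at least as high as exact accuracy, which is why the reported .805 exceeds the .643 exact figure.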