{"title":"基于多表排序损失的特征敏感负样本对话响应一致性评价","authors":"YeongJun Hwang, Dongjun Kang, JinYeong Bak","doi":"10.1016/j.engappai.2025.110609","DOIUrl":null,"url":null,"abstract":"<div><div>Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) have limitations when it comes to assessing diverse and creative responses because they heavily rely on reference responses. For learnable metrics which utilize contrastive learning, challenges are encountered due to the use of randomly selected negative samples that do not reflect conversational features (i.e. topic, emotion, intention) and the lack of granularity in assessing response appropriateness. To address these limitations, we propose the Feature sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model. This model aims to evaluate dialogue coherency in degrees while considering conversational sensitive features. This approach involves sampling feature-sensitive responses that share conversational features with ground truth responses and utilizing them as hard negative samples. The model is trained using Multi-Listwise Ranking (MListR) loss, which is designed to learn the ranking between negative samples and identify response features. The experimental results demonstrate that Feature sensitive Multi-Listwise Ranking exhibits stronger correlations with human judgment compared to other response coherency evaluation metrics. By considering conversational features and training the model using a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"150 ","pages":"Article 110609"},"PeriodicalIF":8.0000,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dialogue response coherency evaluation with feature sensitive negative sample using multi list-wise ranking loss\",\"authors\":\"YeongJun Hwang, Dongjun Kang, JinYeong Bak\",\"doi\":\"10.1016/j.engappai.2025.110609\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) have limitations when it comes to assessing diverse and creative responses because they heavily rely on reference responses. For learnable metrics which utilize contrastive learning, challenges are encountered due to the use of randomly selected negative samples that do not reflect conversational features (i.e. topic, emotion, intention) and the lack of granularity in assessing response appropriateness. To address these limitations, we propose the Feature sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model. This model aims to evaluate dialogue coherency in degrees while considering conversational sensitive features. This approach involves sampling feature-sensitive responses that share conversational features with ground truth responses and utilizing them as hard negative samples. 
The model is trained using Multi-Listwise Ranking (MListR) loss, which is designed to learn the ranking between negative samples and identify response features. The experimental results demonstrate that Feature sensitive Multi-Listwise Ranking exhibits stronger correlations with human judgment compared to other response coherency evaluation metrics. By considering conversational features and training the model using a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.</div></div>\",\"PeriodicalId\":50523,\"journal\":{\"name\":\"Engineering Applications of Artificial Intelligence\",\"volume\":\"150 \",\"pages\":\"Article 110609\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2025-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering Applications of Artificial Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0952197625006098\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625006098","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Dialogue response coherency evaluation with feature sensitive negative sample using multi list-wise ranking loss
Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) are poorly suited to assessing diverse and creative responses because they rely heavily on reference responses. Learnable metrics that use contrastive learning face two challenges: randomly selected negative samples do not reflect conversational features (i.e., topic, emotion, and intention), and the resulting metrics lack granularity in assessing response appropriateness. To address these limitations, we propose the Feature sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model, which evaluates dialogue coherency in graded degrees while accounting for sensitive conversational features. The approach samples feature-sensitive responses that share conversational features with the ground-truth response and uses them as hard negative samples. The model is trained with the Multi-Listwise Ranking (MListR) loss, which is designed to learn the ranking among negative samples and to identify response features. Experimental results demonstrate that FMListR correlates more strongly with human judgment than other response coherency evaluation metrics. By considering conversational features and training with a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.
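To make the training objective concrete, below is a minimal PyTorch sketch of a list-wise ranking loss in the spirit of MListR: the coherence scorer is pushed to rank the ground-truth response above feature-sensitive hard negatives, which in turn rank above random negatives. The abstract does not give the exact formulation, so the ListMLE (Plackett-Luce) form, the function names, and the three-tier candidate ordering are illustrative assumptions, not the paper's actual loss.

```python
# Hedged sketch: a list-wise ranking loss over one positive response and
# several negatives. Assumes each batch row's candidates are pre-sorted in
# the desired rank order: [ground truth, hard negatives..., random negatives...].
import torch

def listwise_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """ListMLE negative log-likelihood of the target permutation.

    scores: (batch, list_size) coherence scores, columns already in the
    desired rank order (best candidate first).
    """
    # log P(permutation) = sum_k [ s_k - logsumexp(s_k, ..., s_{n-1}) ].
    # Reverse-cumulative logsumexp gives the suffix normalizer at each rank.
    suffix_lse = torch.logcumsumexp(scores.flip(dims=[1]), dim=1).flip(dims=[1])
    log_likelihood = (scores - suffix_lse).sum(dim=1)
    return -log_likelihood.mean()

# Usage: 1 ground-truth + 2 hypothetical hard negatives + 2 random negatives.
batch, n_candidates = 8, 5
scores = torch.randn(batch, n_candidates, requires_grad=True)
loss = listwise_ranking_loss(scores)
loss.backward()
```

A list-wise objective of this kind differs from a plain contrastive (pairwise) loss in that it supervises the full ordering over negatives of varying hardness, which is what allows coherency to be learned in degrees rather than as a binary positive/negative decision.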
Journal introduction:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, with remarkable advancements emerging across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.