基于测试驱动信息论的组合分布语义:以西班牙语歌词为例

IF 7.2 1区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Adrián Ghajari , Alejandro Benito-Santos , Salvador Ros , Víctor Fresno , Elena González-Blanco
{"title":"基于测试驱动信息论的组合分布语义:以西班牙语歌词为例","authors":"Adrián Ghajari ,&nbsp;Alejandro Benito-Santos ,&nbsp;Salvador Ros ,&nbsp;Víctor Fresno ,&nbsp;Elena González-Blanco","doi":"10.1016/j.knosys.2025.113549","DOIUrl":null,"url":null,"abstract":"<div><div>Song lyrics pose unique challenges for semantic similarity assessment due to their metaphorical language, structural patterns, and cultural nuances - characteristics that often challenge standard natural language processing (NLP) approaches. These challenges stem from a tension between compositional and distributional semantics: while lyrics follow compositional structures, their meaning depends heavily on context and interpretation. The Information Theory-based Compositional Distributional Semantics framework offers a principled approach by integrating information theory with compositional rules and distributional representations. We evaluate eight embedding models on Spanish song lyrics, including multilingual, monolingual contextual, and static embeddings. Results show that multilingual models consistently outperform monolingual alternatives, with the domain-adapted ALBERTI achieving the highest F1 macro scores (78.92 ± 10.86). Our analysis reveals that monolingual models generate highly anisotropic embedding spaces, significantly impacting performance with traditional metrics. The Information Contrast Model metric proves particularly effective, providing improvements up to 18.04 percentage points over cosine similarity. Additionally, composition functions maintaining longer accumulated vector norms consistently outperform standard averaging approaches. Our findings have important implications for NLP applications and challenge standard practices in similarity calculation, showing that effectiveness varies with both task nature and model characteristics.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"319 ","pages":"Article 113549"},"PeriodicalIF":7.2000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Test-driving information theory-based compositional distributional semantics: A case study on Spanish song lyrics\",\"authors\":\"Adrián Ghajari ,&nbsp;Alejandro Benito-Santos ,&nbsp;Salvador Ros ,&nbsp;Víctor Fresno ,&nbsp;Elena González-Blanco\",\"doi\":\"10.1016/j.knosys.2025.113549\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Song lyrics pose unique challenges for semantic similarity assessment due to their metaphorical language, structural patterns, and cultural nuances - characteristics that often challenge standard natural language processing (NLP) approaches. These challenges stem from a tension between compositional and distributional semantics: while lyrics follow compositional structures, their meaning depends heavily on context and interpretation. The Information Theory-based Compositional Distributional Semantics framework offers a principled approach by integrating information theory with compositional rules and distributional representations. We evaluate eight embedding models on Spanish song lyrics, including multilingual, monolingual contextual, and static embeddings. Results show that multilingual models consistently outperform monolingual alternatives, with the domain-adapted ALBERTI achieving the highest F1 macro scores (78.92 ± 10.86). Our analysis reveals that monolingual models generate highly anisotropic embedding spaces, significantly impacting performance with traditional metrics. The Information Contrast Model metric proves particularly effective, providing improvements up to 18.04 percentage points over cosine similarity. Additionally, composition functions maintaining longer accumulated vector norms consistently outperform standard averaging approaches. Our findings have important implications for NLP applications and challenge standard practices in similarity calculation, showing that effectiveness varies with both task nature and model characteristics.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"319 \",\"pages\":\"Article 113549\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2025-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125005957\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125005957","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

由于歌词的隐喻语言、结构模式和文化差异,这些特征经常挑战标准的自然语言处理(NLP)方法,因此歌词对语义相似性评估提出了独特的挑战。这些挑战源于构成语义和分布语义之间的紧张关系:虽然歌词遵循构成结构,但它们的意义在很大程度上取决于上下文和解释。基于信息论的组合分布语义框架通过将信息论与组合规则和分布表示相结合,提供了一种有原则的方法。我们评估了西班牙语歌词的八种嵌入模型,包括多语言、单语言上下文和静态嵌入。结果表明,多语言模型的表现始终优于单语言模型,其中领域适应的ALBERTI获得了最高的F1宏得分(78.92±10.86)。我们的分析表明,单语言模型产生高度各向异性的嵌入空间,显著影响传统指标的性能。信息对比模型度量被证明特别有效,比余弦相似度提高了18.04个百分点。此外,保持较长累计向量规范的组合函数始终优于标准的平均方法。我们的研究结果对NLP应用具有重要意义,并对相似性计算的标准实践提出了挑战,表明相似性计算的有效性随任务性质和模型特征而变化。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Test-driving information theory-based compositional distributional semantics: A case study on Spanish song lyrics
Song lyrics pose unique challenges for semantic similarity assessment due to their metaphorical language, structural patterns, and cultural nuances - characteristics that often challenge standard natural language processing (NLP) approaches. These challenges stem from a tension between compositional and distributional semantics: while lyrics follow compositional structures, their meaning depends heavily on context and interpretation. The Information Theory-based Compositional Distributional Semantics framework offers a principled approach by integrating information theory with compositional rules and distributional representations. We evaluate eight embedding models on Spanish song lyrics, including multilingual, monolingual contextual, and static embeddings. Results show that multilingual models consistently outperform monolingual alternatives, with the domain-adapted ALBERTI achieving the highest F1 macro scores (78.92 ± 10.86). Our analysis reveals that monolingual models generate highly anisotropic embedding spaces, significantly impacting performance with traditional metrics. The Information Contrast Model metric proves particularly effective, providing improvements up to 18.04 percentage points over cosine similarity. Additionally, composition functions maintaining longer accumulated vector norms consistently outperform standard averaging approaches. Our findings have important implications for NLP applications and challenge standard practices in similarity calculation, showing that effectiveness varies with both task nature and model characteristics.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Knowledge-Based Systems
Knowledge-Based Systems 工程技术-计算机:人工智能
CiteScore
14.80
自引率
12.50%
发文量
1245
审稿时长
7.8 months
期刊介绍: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信