MAEVa: A hybrid approach for matching agroecological experiment variables

Natural Language Processing Journal, Volume 13, Article 100180. Pub Date: 2025-09-11. DOI: 10.1016/j.nlp.2025.100180

Oussama Mechhour, Sandrine Auzoux, Clément Jonquet, Mathieu Roche


Citation count: 0

Abstract



Source variables, or observable properties, used to describe agroecological experiments are heterogeneous, nonstandardized, and multilingual, which makes them hard to understand, explain, and use in cropping system modeling and in multicriteria evaluations of agroecological system performance. Annotating the data with a controlled vocabulary, here the candidate variables of the Agroecological Global Information System (AEGIS), offers a solution. Text similarity measures play crucial roles in tasks such as word-sense disambiguation, schema matching in databases, and data annotation. Commonly used measures include (1) string-based similarity, (2) corpus-based similarity, (3) knowledge-based similarity, and (4) hybrid similarity, which combines two or more of the preceding measures. This work presents a hybrid approach called Matching Agroecological Experiment Variables (MAEVa), which combines well-known techniques (PLMs, multi-head attention, TF–IDF) and tailors them to the challenges of aligning source and candidate variables in agroecology. MAEVa integrates the following components: (1) our key innovation, which consists of extending pretrained language models (PLMs) (i.e., BERT, SBERT, SimCSE) with an external multi-head attention layer for matching variable names; (2) an analysis of the relevance and impact of various data collection techniques (snippet extraction, scientific articles) and of prompt-based data augmentation on TF–IDF for matching variable descriptions; (3) a linear combination of components (1) and (2); and (4) a voting-based method for selecting the final matching results. Experimental results demonstrate that extending PLMs with an external multi-head attention layer improves the matching of variable names. Furthermore, TF–IDF benefits consistently from an enriched corpus, regardless of the specific enrichment technique employed.
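The way components (2) to (4) fit together can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the name matcher is a character n-gram Jaccard score standing in for the PLM plus multi-head attention component (which would require a deep learning framework), the weight `alpha` and the idea of voting across several weights are guesses at what "linear combination" and "voting-based method" could look like, and the variable names and descriptions are invented rather than taken from AEGIS.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def name_similarity(a, b, n=3):
    """ASSUMPTION: character n-gram Jaccard, a stand-in for PLM embedding similarity."""
    def grams(s):
        return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)

def match(sources, candidates, alpha):
    """Best candidate index per source under alpha * name + (1 - alpha) * description."""
    vecs = tfidf_vectors([d for _, d in sources] + [d for _, d in candidates])
    src_vecs, cand_vecs = vecs[:len(sources)], vecs[len(sources):]
    best = []
    for (s_name, _), sv in zip(sources, src_vecs):
        scores = [alpha * name_similarity(s_name, c_name) + (1 - alpha) * cosine(sv, cv)
                  for (c_name, _), cv in zip(candidates, cand_vecs)]
        best.append(max(range(len(candidates)), key=scores.__getitem__))
    return best

def vote(sources, candidates, alphas=(0.25, 0.5, 0.75)):
    """ASSUMPTION: majority vote over matchers that differ only in their weight alpha."""
    runs = [match(sources, candidates, a) for a in alphas]
    return [Counter(picks).most_common(1)[0][0] for picks in zip(*runs)]

# Hypothetical (name, description) pairs, invented for this example.
sources = [("rdt_grain", "grain yield of the crop at harvest"),
           ("temp_sol", "soil temperature measured at 10 cm depth")]
candidates = [("soil_temperature", "temperature of the soil at a given depth"),
              ("grain_yield", "yield of grain harvested from the crop"),
              ("rainfall", "daily rainfall amount")]

for (s_name, _), idx in zip(sources, vote(sources, candidates)):
    print(s_name, "->", candidates[idx][0])
```

In this toy setting the description channel does most of the work, because the source names are abbreviated or in another language. That is the situation in which an enriched description corpus would be expected to help a TF–IDF matcher.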
