MAEVa: A hybrid approach for matching agroecological experiment variables
Oussama Mechhour, Sandrine Auzoux, Clément Jonquet, Mathieu Roche
Natural Language Processing Journal, volume 13, Article 100180 (published 2025-09-11)
DOI: 10.1016/j.nlp.2025.100180
https://www.sciencedirect.com/science/article/pii/S2949719125000561
Source variables, or observable properties, used to describe agroecological experiments are heterogeneous, nonstandardized, and multilingual, which makes them difficult to understand, explain, and use in cropping system modeling and in multicriteria evaluations of agroecological system performance. Annotating the data with a controlled vocabulary, namely the candidate variables of the agroecological global information system (AEGIS), offers a solution. Text similarity measures play crucial roles in tasks such as word-sense disambiguation, schema matching in databases, and data annotation. Commonly used measures include (1) string-based similarity, (2) corpus-based similarity, (3) knowledge-based similarity, and (4) hybrid similarity, which combines two or more of these measures. This work presents a hybrid approach called Matching Agroecological Experiment Variables (MAEVa), which combines well-known techniques (pretrained language models (PLMs), multi-head attention, TF–IDF) and tailors them to the challenges of aligning source and candidate variables in agroecology. MAEVa integrates the following components: (1) our key innovation, which extends PLMs (i.e., BERT, SBERT, SimCSE) with an external multi-head attention layer for matching variable names; (2) an analysis of the relevance and impact of various data collection techniques (snippet extraction, scientific articles) and of prompt-based data augmentation on TF–IDF for matching variable descriptions; (3) a linear combination of components (1) and (2); and (4) a voting-based method for selecting the final matching results. Experimental results demonstrate that extending PLMs with an external multi-head attention layer improves the matching of variable names. Furthermore, TF–IDF benefits consistently from an enriched corpus, regardless of the specific enrichment technique employed.
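To make the four components more concrete, the following is a minimal sketch of how a name score and a description score might be combined linearly and then put to a vote. It is not the authors' implementation. The abstract does not give the combination weights or say what the voters are, so the weight `alpha` and the choice of letting each weight cast one vote are assumptions. The PLM with multi-head attention of component (1) is replaced here by a plain string similarity (`difflib.SequenceMatcher`) so that the example runs with the standard library only. The variable names, descriptions, and function names are hypothetical.

```python
import difflib
import math
from collections import Counter

# Hypothetical data, invented for illustration only (not taken from the paper).
SOURCE = {"name": "rdt_grain",
          "description": "grain yield at harvest in tonnes per hectare"}
CANDIDATES = {
    "grain_yield": "yield of grain measured at harvest per hectare",
    "soil_ph": "acidity of the soil measured in water",
    "plant_height": "height of the plant at flowering",
}


def name_similarity(a, b):
    """Stand-in for component (1): plain string similarity, NOT a PLM."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per whitespace-tokenized document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: terms that occur in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: f * idf[t] for t, f in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combine(name_score, desc_score, alpha):
    """Component (3): linear combination; alpha is an assumed weight in [0, 1]."""
    return alpha * name_score + (1.0 - alpha) * desc_score


def vote(proposals):
    """Component (4), assumed form: majority vote; ties go to the earliest proposal."""
    return Counter(proposals).most_common(1)[0][0]


def match_variable(source, candidates, alphas=(0.3, 0.5, 0.7)):
    """Each alpha proposes its best-scoring candidate; the majority wins."""
    labels = list(candidates)
    vectors = tfidf_vectors([source["description"]] + [candidates[c] for c in labels])
    src_vec, cand_vecs = vectors[0], vectors[1:]
    proposals = []
    for alpha in alphas:
        scores = {
            label: combine(name_similarity(source["name"], label),
                           cosine(src_vec, vec), alpha)
            for label, vec in zip(labels, cand_vecs)
        }
        proposals.append(max(scores, key=scores.get))
    return vote(proposals)


print(match_variable(SOURCE, CANDIDATES))  # prints grain_yield
```

In this toy setting both signals agree, so every weight proposes the same candidate. Voting starts to matter when the name score and the description score favor different candidates and the outcome depends on the weight.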