C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
Authors: Amuguleng Wang, Yilagui Qi, Dahu Baiyila
Journal: Alexandria Engineering Journal, vol. 111, pp. 385-395 (published 2024-10-24)
DOI: 10.1016/j.aej.2024.10.041
URL: https://www.sciencedirect.com/science/article/pii/S1110016824011967
A reverse dictionary is an electronic dictionary that accepts a user-provided natural language description and returns semantically matching words. Despite substantial research achievements in Mongolian lexicography, Mongolian reverse dictionaries have not yet been studied. To address this, we propose C-BERT, a model that combines lexical semantic clustering with BERT-based classification. First, the K-means algorithm was used to cluster preprocessed entries from well-known Mongolian dictionaries into 5000 clusters, forming the training set. We then adjusted the training set's data distribution through random negative sampling and fine-tuned the CINO-large model to obtain C-BERT. When a user submits a description, C-BERT matches it against the central words of the 5000 clusters and selects the top 125 clusters. It then matches the description against the target words within those clusters and recommends the top 100 semantically relevant candidates. Compared with seven baseline models, C-BERT demonstrates superior performance, particularly on datasets with human-generated descriptions, where its synonym accuracy@10 and accuracy@100 reach 16.5% and 71%, respectively. Because of the clustering step, C-BERT improves inference speed more than tenfold, which significantly enhances its practical utility. Accordingly, we have developed a user-friendly online application platform based on C-BERT for a broad range of users, available at http://mrdp.net/.
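The two-stage lookup described in the abstract (shortlist clusters by their central words, then rank only the words inside the shortlisted clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper scores matches with a fine-tuned CINO-large model, whereas the sketch below substitutes plain cosine similarity over made-up toy vectors. The names `cosine`, `two_stage_retrieve`, `centers`, and `members` are hypothetical, and only the defaults 125 and 100 come from the abstract.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed stand-in for the fine-tuned model's match score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query: Vector,
    centers: Dict[int, Vector],             # cluster id -> vector of its central word
    members: Dict[int, Dict[str, Vector]],  # cluster id -> {word: vector}
    top_clusters: int = 125,                # shortlist size stated in the abstract
    top_words: int = 100,                   # candidate count stated in the abstract
) -> List[Tuple[str, float]]:
    # Stage 1: score the query against each cluster's central word only.
    ranked = sorted(centers, key=lambda c: cosine(query, centers[c]), reverse=True)
    shortlist = ranked[:top_clusters]

    # Stage 2: score only the words that belong to the shortlisted clusters.
    scored = [
        (word, cosine(query, vec))
        for c in shortlist
        for word, vec in members.get(c, {}).items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_words]


# Toy data: 3 clusters with 2-dimensional vectors (all values are made up).
centers = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (-1.0, 0.0)}
members = {
    0: {"alpha": (0.9, 0.1), "beta": (0.8, 0.3)},
    1: {"gamma": (0.1, 0.9)},
    2: {"delta": (-0.9, 0.1)},
}
result = two_stage_retrieve((1.0, 0.2), centers, members, top_clusters=2, top_words=2)
print([word for word, _ in result])  # ['alpha', 'beta']
```

In the toy run, cluster 2 is dropped in the first stage, so "delta" is never scored. Under the assumption that clusters are of roughly equal size, shortlisting 125 of 5000 clusters would leave about 2.5% of the vocabulary to score in the second stage, which is the kind of saving that makes a clustering step attractive for inference speed.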
About the journal:
Alexandria Engineering Journal is an international journal devoted to publishing high quality papers in the field of engineering and applied science. Alexandria Engineering Journal is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). The papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering