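Combining Latent Semantic Analysis and Pre-trained Model for Vietnamese Text Readability Assessment: Combining Statistical Semantic Embeddings and Pre-trained Model for Vietnamese Long-Sequence Readability Assessment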

Proceedings of the 4th International Conference on Information Technology and Computer Communications. Pub Date: 2022-06-23. DOI: 10.1145/3548636.3548643

Nam T. Doan, Thi-Anh-Thi Le, An-Vinh Lương, Dinh Dien


Citations: 1

Abstract


Combining Latent Semantic Analysis and Pre-trained Model for Vietnamese Text Readability Assessment: Combining Statistical Semantic Embeddings and Pre-trained Model for Vietnamese Long-Sequence Readability Assessment

Alongside the rapid development of text processing, readability assessment, the task of measuring how easy or difficult a text is to read, remains important and challenging. The task is well established for high-resource languages such as English, which have abundant NLP tools and corpora, but much less so for low-resource languages, especially Vietnamese. Most previous studies of Vietnamese text readability assessment focus on shallow text characteristics and have yet to address deeper readability features. In our study, we propose a new way to construct semantic features for Vietnamese. We observe that the difficulty level of terms affects the difficulty level of the knowledge involved in text comprehension. In particular, our approach is based on the difficulty distribution of terms in a text, generated by the Latent Semantic Analysis (LSA) technique, which reduces the dependence on experts for annotating and discovering typical features in a narrow domain. The proposed feature can serve as a new, automatically derived feature for Vietnamese text readability assessment. Furthermore, LSA is a statistical approach, which makes it more stable and feasible for low-resource languages. In addition, we integrate PhoBERT, a pre-trained language model for Vietnamese, to generate bidirectional contextual word representations for long Vietnamese sequences as a semantic feature. In experiments on a Vietnamese readability dataset, our approach achieves promising performance against strong baselines, with a best accuracy of 94.52% and a weighted F1 score of 94.09%.
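The abstract does not spell out how the LSA-based feature is computed, so the following is only a minimal sketch of generic LSA, not the authors' method. It assumes a raw term-document count matrix, a truncated SVD computed with NumPy, and a document feature formed by averaging term vectors. The function names, the toy corpus, and the choice of `k` are all hypothetical, and real Vietnamese text would first need word segmentation rather than whitespace splitting.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i, j] counts term i in document j."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)), dtype=float)
    for j, doc in enumerate(docs):
        for tok in doc.split():
            matrix[index[tok], j] += 1.0
    return vocab, matrix


def lsa_term_vectors(matrix, k):
    """Truncated SVD: keep the top-k singular directions as term embeddings."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Scale each retained left singular vector by its singular value.
    return u[:, :k] * s[:k]


def document_feature(doc, vocab, term_vectors):
    """Average the LSA vectors of the known terms in a document (an assumption)."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = [term_vectors[index[t]] for t in doc.split() if t in index]
    if not rows:
        return np.zeros(term_vectors.shape[1])
    return np.mean(rows, axis=0)


# Hypothetical toy corpus; tokens are placeholders, not real data.
docs = [
    "cat sits on mat",
    "dog sits on rug",
    "quantum field theory predicts particle mass",
]
vocab, matrix = build_term_document_matrix(docs)
term_vectors = lsa_term_vectors(matrix, k=2)
feature = document_feature(docs[0], vocab, term_vectors)
```

In a pipeline like the one the abstract describes, a vector of this kind would presumably be combined with PhoBERT representations before classification, but that combination step is not described in the abstract and is not sketched here.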

