Combining Latent Semantic Analysis and Pre-trained Model for Vietnamese Text Readability Assessment: Combining Statistical Semantic Embeddings and Pre-trained Model for Vietnamese Long-Sequence Readability Assessment
Nam T. Doan, Thi-Anh-Thi Le, An-Vinh Lương, Dinh Dien
Proceedings of the 4th International Conference on Information Technology and Computer Communications, published 2022-06-23. DOI: https://doi.org/10.1145/3548636.3548643
Abstract
With the rapid development of text processing, readability assessment, the task of measuring how easy or difficult a text is to read, has become both important and challenging. The task is well established in high-resource languages such as English, which have abundant NLP tools and corpora, but it remains difficult for low-resource languages, especially Vietnamese. Most previous studies of Vietnamese text readability assessment focus on shallow text characteristics and have yet to address deeper readability features. In this study, we propose a novel approach for Vietnamese that constructs features reflecting semantics. We observe that the difficulty level of terms affects the difficulty level of the knowledge conveyed, which is strongly involved in text comprehension. In particular, our approach is based on the difficulty distribution of terms in a text, generated by the Latent Semantic Analysis (LSA) technique, and it reduces the dependence on experts for annotating and discovering typical features in a narrow domain. The proposed feature can therefore serve as a new, automatically derived feature for Vietnamese text readability assessment. Furthermore, LSA is a statistical approach, which makes it more stable and feasible for low-resource languages. In addition, we integrate PhoBERT, a pre-trained language model for Vietnamese, to generate bidirectional contextual word representations for long Vietnamese sequences as a semantic feature. In experiments on a Vietnamese readability dataset, our proposed approach achieves promising performance against strong competitive baselines, with the best configuration reaching an accuracy of 94.52% and a weighted F1 score of 94.09%.
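For readers unfamiliar with the two reported metrics, the following minimal sketch shows how accuracy and a support-weighted F1 score are commonly computed for a multi-class task. It is not the authors' evaluation code: the function names, the three difficulty labels, and the toy predictions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of gold examples per class
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical gold labels and predictions (not taken from the paper).
y_true = ["easy", "easy", "medium", "medium", "hard", "hard"]
y_pred = ["easy", "easy", "medium", "hard", "hard", "hard"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, weighted F1={wf1:.4f}")
```

Weighting each class's F1 by its support means that frequent difficulty levels contribute more to the final score, which is why weighted F1 can differ from accuracy when classes are imbalanced or errors are concentrated in one class.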