Nam T. Doan, Thi-Anh-Thi Le, An-Vinh Lương, Dinh Dien
Proceedings of the 4th International Conference on Information Technology and Computer Communications, published 2022-06-23. DOI: 10.1145/3548636.3548643 (https://doi.org/10.1145/3548636.3548643)
Combining Latent Semantic Analysis and Pre-trained Model for Vietnamese Text Readability Assessment: Combining Statistical Semantic Embeddings and Pre-trained Model for Vietnamese Long-Sequence Readability Assessment
Alongside the rapid development of text processing, readability assessment, the task of measuring how easy or difficult a text is to read, remains important and challenging. The task is well established and steadily improving for high-resource languages such as English, which have abundant NLP tools and corpora, but low-resource languages, Vietnamese in particular, lack that advantage. Most previous studies of Vietnamese text readability assessment focus on shallow text characteristics and have yet to address deeper readability features. In this study, we propose a new way to construct features for Vietnamese that reflect semantics. Our starting observation is that the difficulty level of the terms in a text affects the difficulty level of the knowledge it conveys, which is strongly involved in text comprehension. Specifically, our approach is based on the difficulty distribution of the terms in a text, generated with Latent Semantic Analysis (LSA); it reduces the dependence on experts for annotation and for discovering typical features in a narrow domain. The proposed feature can serve as a new, automatically extracted feature for Vietnamese text readability assessment. Furthermore, because LSA is a statistical approach, it is more stable and feasible for low-resource languages. We also integrate PhoBERT, a pre-trained language model for Vietnamese, to generate bidirectional contextual word representations of long Vietnamese sequences as a semantic feature. In experiments on a Vietnamese readability dataset, our approach achieves promising performance against strong baselines, with a best accuracy of 94.52% and a weighted F1 score of 94.09%.
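For readers unfamiliar with LSA, the following is a minimal sketch of the general technique only: a term-document count matrix is factorized with a truncated SVD to obtain low-dimensional term and document vectors. It is not the authors' pipeline; the abstract does not specify how the term difficulty distribution is derived from the LSA space. The toy corpus, the function names `lsa` and `cosine`, the whitespace tokenization, and the choice `k=2` are all assumptions made for illustration. Real Vietnamese text would additionally need word segmentation before this step.

```python
import numpy as np


def lsa(docs, k):
    """Rank-k LSA on whitespace-tokenized documents (illustrative sketch).

    Returns (vocab, term_vectors, doc_vectors).
    """
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Term-document count matrix: rows are terms, columns are documents.
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            counts[index[tok], j] += 1.0

    # Thin SVD: counts = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))

    term_vectors = u[:, :k] * s[:k]    # one k-dimensional row per term
    doc_vectors = vt[:k, :].T * s[:k]  # one k-dimensional row per document
    return vocab, term_vectors, doc_vectors


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy corpus (English placeholders, already tokenized).
docs = [
    "the cat sleeps",
    "the dog sleeps",
    "latent semantic analysis algorithm",
    "latent semantic analysis model",
]

vocab, term_vectors, doc_vectors = lsa(docs, k=2)
print(term_vectors.shape, doc_vectors.shape)
print(cosine(doc_vectors[0], doc_vectors[1]))  # same topic group
print(cosine(doc_vectors[0], doc_vectors[2]))  # different topic group
```

In this toy corpus the two topic groups share no vocabulary, so documents within a group end up close in the reduced space while documents from different groups are nearly orthogonal. A feature extractor along the lines the abstract describes would build on such term vectors, but the details are left to the paper.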