ACM Transactions on Asian and Low-Resource Language Information Processing最新文献_第10页

Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression 预训练语言模型压缩的解释引导知识提炼

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-29 DOI: 10.1145/3639364

Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu

{"title":"Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression","authors":"Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu","doi":"10.1145/3639364","DOIUrl":"https://doi.org/10.1145/3639364","url":null,"abstract":"Knowledge distillation is widely used in pre-trained language model compression, which can transfer knowledge from a cumbersome model to a lightweight one. Though knowledge distillation based model compression has achieved promising performance, we observe that explanations between the teacher model and the student model are not consistent. We argue that the student model should study not only the predictions of the teacher model but also the internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD) in this paper, which utilizes explanations to represent the thinking process and improve knowledge distillation. To obtain explanations in our distillation framework, we select three typical explanation methods rooted in different mechanisms, namely gradient-based, perturbation-based, and feature selection methods, Then, to improve computational efficiency, we propose different optimization strategies to utilize the explanations obtained by these three different explanation methods, which could provide the student model better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations can improve the performance of the student model. Moreover, our EGKD could also be applied to model compression with different architectures.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"247 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139071925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Leveraging Dual Gloss Encoders in Chinese Biomedical Entity Linking 在中文生物医学实体链接中利用双词汇编码器

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-28 DOI: 10.1145/3638555

Tzu-Mi Lin, Man-Chen Hung, Lung-Hao Lee

引用次数: 0

Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation: The Detection of Multilingual South African Abusive Language using Skip-gram and Domain Adaptation: ACM Transactions on Asian and Low-Resource Language Information Processing: Vol 0, No ja 利用多层次领域联合适应，通过跳过图改进对南非多语种辱骂性语言的检测：使用跳格和领域适应检测多语种南非辱骂性语言》：ACM Transactions on Asian and Low-Resource Language Information Processing：Vol 0, No ja

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-28 DOI: 10.1145/3638759

Oluwafemi Oriola, Eduan Kotzé

{"title":"Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation: The Detection of Multilingual South African Abusive Language using Skip-gram and Domain Adaptation: ACM Transactions on Asian and Low-Resource Language Information Processing: Vol 0, No ja","authors":"Oluwafemi Oriola, Eduan Kotzé","doi":"10.1145/3638759","DOIUrl":"https://doi.org/10.1145/3638759","url":null,"abstract":"The distinctiveness and sparsity of low-resource multilingual South African abusive language necessitate the development of a novel solution to automatically detect different classes of abusive language instances using machine learning. Skip-gram has been used to address sparsity in machine learning classification problems but is inadequate in detecting South African abusive language due to the considerable amount of rare features and class imbalance. Joint Domain Adaptation has been used to enlarge features of a low-resource target domain for improved classification outcomes by jointly learning from the target domain and large-resource source domain. This paper, therefore, builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. Contrary to the existing Joint Domain Adaptation approaches, a Joint Multilevel Domain Adaptation model involving adaptation of monolingual source domain instances and multilingual target domain instances with high frequency of rare features was executed at the first level, and adaptation of target-domain features and first-level features at the next level. Both surface-level and embedding word features were used to evaluate the proposed model. In the evaluation of surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models with accuracy of 0.92 and F1-score of 0.68. In the evaluation of embedding features, the proposed model outperformed the state-of-the-art models with accuracy of 0.88 and F1-score of 0.64. The Joint Multilevel Domain Adaptation model significantly improved the average information gain of the rare features in different language categories and reduced class imbalance.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"27 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139064260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Ibn-Ginni: An Improved Morphological Analyzer for Arabic 伊本-吉尼：改进的阿拉伯语态分析器

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-28 DOI: 10.1145/3639050

Waleed Nazih, Amany Fashwan, Amr El-Gendy, Yasser Hifny

{"title":"Ibn-Ginni: An Improved Morphological Analyzer for Arabic","authors":"Waleed Nazih, Amany Fashwan, Amr El-Gendy, Yasser Hifny","doi":"10.1145/3639050","DOIUrl":"https://doi.org/10.1145/3639050","url":null,"abstract":"Arabic is a morphologically rich language, which means that the Arabic language has a complicated system of word formation and structure. The affixes in the Arabic language (i.e., prefixes and suffixes) can be added to root words to generate different meanings and grammatical functions. These affixes can indicate aspects such as tense, gender, number, case, person, and more. In addition, the meaning and function of words can be modified in Arabic using an internal structure known as morphological patterns. Computational morphological analyzers of Arabic are vital to developing Arabic language processing toolkits. In this paper, we introduce a new morphological analyzer (Ibn-Ginni) that inherits the speed and quality of the Buckwalter Arabic Morphological Analyzer (BAMA). The BAMA has poor coverage of the classical Arabic language. Hence, the coverage of classical Arabic is improved by using the Alkhalil analyzer. Although it is slow, it was used to generate a huge number of solutions for 3 million unique Arabic words collected from different resources. These wordform-based solutions were converted to stem-based solutions, refined manually, and added to the database of BAMA, resulting in substantial improvements in the quality of the analysis. Hence, Ibn-Ginni is a hybrid system between BAMA and Alkhalil analyzers and may be considered an efficient large-scale analyzer. The Ibn-Ginni analyzer analyzed 0.6 million more words than the BAMA analyzer. Therefore, our analyzer significantly improves the coverage of the Arabic language. Besides, the Ibn-Ginni analyzer is high-speed at providing solutions; the average time to analyze a word is 0.3 ms. Using a corpus designed for benchmarking Arabic morphological analyzers, our analyzer was able to find all solutions for 72.72% of the words. Moreover, the analyzer did not provide all possible morphological solutions for 24.24% of the words. The analyzer and its morphological database are publicly available on GitHub.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"20 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139064478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Hypergraph Neural Network for Emotion Recognition in Conversations 超图神经网络用于对话中的情感识别

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-27 DOI: 10.1145/3638760

Cheng Zheng, Haojie Xu, Xiao Sun

引用次数: 0

Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language 利用自回归特征提取和主题建模对低资源语言阿拉伯语进行基于方面的情感分析

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-27 DOI: 10.1145/3638050

Asmaa Hashem Sweidan, Nashwa El-Bendary, Esraa Elhariri

{"title":"Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language","authors":"Asmaa Hashem Sweidan, Nashwa El-Bendary, Esraa Elhariri","doi":"10.1145/3638050","DOIUrl":"https://doi.org/10.1145/3638050","url":null,"abstract":"This paper proposes an approach for aspect-based sentiment analysis of Arabic social data, especially the considerable text corpus generated through communications on Twitter for expressing opinions in Arabic-language tweets during the COVID-19 pandemic. The proposed approach examines the performance of several pre-trained predictive and autoregressive language models; namely, BERT (Bidirectional Encoder Representations from Transformers) and XLNet, along with topic modeling algorithms; namely, LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization), for aspect-based sentiment analysis of online Arabic text. In addition, Bi-LSTM (Bidirectional Long Short Term Memory) deep learning model is used to classify the extracted aspects from online reviews. Obtained experimental results indicate that the combined XLNet-NMF model outperforms other implemented state-of-the-art methods through improving the feature extraction of unstructured social media text with achieving values of 0.946 and 0.938, for average sentiment classification accuracy and F-measure, respectively.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"22 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139064412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

The Computational Method for Supporting Thai VerbNet Construction 支持泰语动词网构建的计算方法

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-26 DOI: 10.1145/3638533

Krittanut Chungnoi, Rachada Kongkachandra, Sarun Gulyanon

引用次数: 0

Dual-branch Multitask Fusion Network for Offline Chinese Writer Identification 用于离线中文作家识别的双分支多任务融合网络

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-26 DOI: 10.1145/3638554

Haixia Wang, Qingran Miao, Qun Xiao, Yilong Zhang, Yingyu Mao

{"title":"Dual-branch Multitask Fusion Network for Offline Chinese Writer Identification","authors":"Haixia Wang, Qingran Miao, Qun Xiao, Yilong Zhang, Yingyu Mao","doi":"10.1145/3638554","DOIUrl":"https://doi.org/10.1145/3638554","url":null,"abstract":"Chinese characters are complex and contain discriminative information, meaning that their writers have the potential to be recognized using less text. In this study, offline Chinese writer identification based on a single character was investigated. To extract comprehensive features to model Chinese characters, explicit and implicit information as well as global and local features are of interest. A dual-branch multitask fusion network is proposed which contains two branches for global and local feature extraction simultaneously, and introduces auxiliary tasks to help the main task. Content recognition, stroke number estimation, and stroke recognition are considered as three auxiliary tasks for explicit information. The main task extracts implicit information of writer identity. The experimental results validated the positive influences of auxiliary tasks on the writer identification task, with the stroke number estimation task being most helpful. In-depth research was conducted to investigate the influencing factors in Chinese writer identification, with respect to character complexity, stroke importance, and character number, which provides a systematic reference for the actual application of neural networks in Chinese writer identification.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"18 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139052827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Machine Learning-Based Readability Model for Gujarati Texts 基于机器学习的古吉拉特语文本可读性模型

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-21 DOI: 10.1145/3637826

Chandrakant K. Bhogayata

{"title":"A Machine Learning-Based Readability Model for Gujarati Texts","authors":"Chandrakant K. Bhogayata","doi":"10.1145/3637826","DOIUrl":"https://doi.org/10.1145/3637826","url":null,"abstract":"This study aims to develop a machine learning-based model to predict the readability of Gujarati texts. The dataset was fifty prose passages from Gujarati literature. Fourteen lexical and syntactic readability text features were extracted from the dataset using a machine learning algorithm of the unigram POS tagger and three Python programming scripts. Two samples of native Gujarati speaking secondary and higher education students rated the Gujarati texts for readability judgment on a 10-point scale of 'easy' to 'difficult' with the interrater agreement. After dimensionality reduction, seven text features as the independent variables and the mean readability rating as the dependent variable were used to train the readability model. As the students' level of education and gender were related to their readability rating, four readability models for school students, university students, male students, and female students were trained with a backward stepwise multiple linear regression algorithm of supervised machine learning. The trained model is comparable across the raters' groups. The best model is the university students' readability rating model. The model is cross-validated. It explains 91% and 88% of the variance in readability ratings at training and cross-validation, respectively, and its effect size and power are large and high.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"131 50","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138953509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Ensemble Strategy with Gradient Conflict for Multi-Domain Neural Machine Translation 多域神经机器翻译的梯度冲突集合策略

IF 2 4区计算机科学

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2023-12-21 DOI: 10.1145/3638248

Zhibo Man, Yujie Zhang, Yu Li, Yuanmeng Chen, Yufeng Chen, Jinan Xu

引用次数: 0