{"title":"FedREAS: A Robust Efficient Aggregation and Selection Framework for Federated Learning","authors":"Shuming Fan, Chenpei Wang, Xinyu Ruan, Hongjian Shi, Ruhui Ma, Haibing Guan","doi":"10.1145/3670689","DOIUrl":"https://doi.org/10.1145/3670689","url":null,"abstract":"<p>In the field of Natural Language Processing (NLP), Deep Learning (DL) and Neural Network (NN) technologies have been widely applied to machine translation and sentiment analysis and have demonstrated outstanding performance. In recent years, NLP applications have also combined multimodal data, such as visual and audio, continuously improving language processing performance. At the same time, the size of Neural Network models is increasing, and many models cannot be deployed on devices with limited computing resources. Deploying models on cloud platforms has become a trend. However, deploying models in the cloud introduces new privacy risks for endpoint data, despite overcoming computational limitations. Federated Learning (FL) methods protect local data by keeping the data on the client side and only sending local updates to the central server. However, the FL architecture still has problems, such as vulnerability to adversarial attacks and non-IID data distribution. In this work, we propose a Federated Learning aggregation method called FedREAS. The server uses a benchmark dataset to train a global model and obtains benchmark updates in this method. Before aggregating local updates, the server adjusts the local updates using the benchmark updates and then returns the adjusted benchmark updates. Then, based on the similarity between the adjusted local updates and the adjusted benchmark updates, the server aggregates these local updates to obtain a more robust update. This method also improves the client selection process. FedREAS selects suitable clients for training at the beginning of each round based on specific strategies, the similarity of the previous round’s updates, and the submitted data. We conduct experiments on different datasets and compare FedREAS with other Federated Learning methods. The results show that FedREAS outperforms other methods regarding model performance and resistance to attacks.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"70 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Study on Intelligent Scoring of English Composition Based on Machine Learning from the Perspective of Natural Language Processing","authors":"Jing Tang","doi":"10.1145/3625545","DOIUrl":"https://doi.org/10.1145/3625545","url":null,"abstract":"<p>Knowledge management is crucial to the teaching and learning process in the current era of digitalization. The idea of \"learning via working together\" is making Natural Language Processing a popular tool to improve the learning process based on the intelligent system for evaluating the composition. English language learning is highly dependent on the composition written by the students under various topics. Teachers are facing huge difficulties in the evaluation of the composition as the level of writing by the students will vary for individual. In this research, Natural Language Processing concept is utilized for getting trained with the student's writing skills and Multiprocessor Learning Algorithm (MLA) combined with Convolutional Neural Network (CNN) (MLA-CNN) for evaluating the composition and declaring the scores for the students. The model's composition scoring rate is validated using a range of learning rate settings. Some theoretical notions for smart teaching are proposed, and it is hoped that this automatic composition scoring model would be used to grade student writing in English classes. When applied to the automatic scoring of students' English composition in schools, the suggested composition scoring system trained by the MLP-CNN has great performance and lays the groundwork for the educational applications of ML inside AI. The study results proved that the proposed model has provided an accuracy of 98%.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"2 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"X-Phishing-Writer: A Framework for Cross-Lingual Phishing Email Generation","authors":"Shih-Wei Guo, Yao-Chung Fan","doi":"10.1145/3670402","DOIUrl":"https://doi.org/10.1145/3670402","url":null,"abstract":"<p>Cybercrime is projected to cause annual business losses of $10.5 trillion by 2025, a significant concern given that a majority of security breaches are due to human errors, especially through phishing attacks. The rapid increase in daily identified phishing sites over the past decade underscores the pressing need to enhance defenses against such attacks. Social Engineering Drills (SEDs) are essential in raising awareness about phishing, yet face challenges in creating effective and diverse phishing email content. These challenges are exacerbated by the limited availability of public datasets and concerns over using external language models like ChatGPT for phishing email generation. To address these issues, this paper introduces X-Phishing-Writer, a novel cross-lingual Few-Shot phishing email generation framework. X-Phishing-Writer allows for the generation of emails based on minimal user input, leverages single-language datasets for multilingual email generation, and is designed for internal deployment using a lightweight, open-source language model. Incorporating Adapters into an Encoder-Decoder architecture, X-Phishing-Writer marks a significant advancement in the field, demonstrating superior performance in generating phishing emails across 25 languages when compared to baseline models. Experimental results and real-world drills involving 1,682 users showcase a 17.67% email open rate and a 13.33% hyperlink click-through rate, affirming the framework’s effectiveness and practicality in enhancing phishing awareness and defense.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"53 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Algerian Sarcasm Detection from Texts and Images","authors":"Kheira Zineb Bousmaha, Khaoula Hamadouche, Hadjer Djouabi, Lamia Hadrich-Belguith","doi":"10.1145/3670403","DOIUrl":"https://doi.org/10.1145/3670403","url":null,"abstract":"<p>In recent years, the number of Algerian Internet users has significantly increased, providing a valuable opportunity for collecting and utilizing opinions and sentiments expressed online. They now post not just texts but also images. However, to benefit from this wealth of information, it is crucial to address the challenge of sarcasm detection, which poses a limitation in sentiment analysis. Sarcasm often involves the use of non-literal and ambiguous language, making its detection complex. To enhance the quality and relevance of sentiment analysis, it is essential to develop effective methods for sarcasm detection. By overcoming this limitation, we can fully harness the expressed online opinions and benefit from their valuable insights for a better understanding of trends and sentiments among the Algerian public. In this work, our aim is to develop a comprehensive system that addresses sarcasm detection in Algerian dialect, encompassing both text and image analysis. We propose a hybrid approach that combines linguistic characteristics and machine learning techniques for text analysis. Additionally, for image analysis, we utilized the deep learning model VGG-19 for image classification, and employed the EasyOCR technique for Arabic text extraction. By integrating these approaches, we strive to create a robust system capable of detecting sarcasm in both textual and visual content in the Algerian dialect. Our system achieved an accuracy of 92.79% for the textual models and 89.28% for the visual model.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"7 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KannadaLex: A lexical database with psycholinguistic information","authors":"Shreya R. Aithal, Muralikrishna Sn, Raghavendra Ganiga, Ashwath Rao, Govardhan Hegde","doi":"10.1145/3670688","DOIUrl":"https://doi.org/10.1145/3670688","url":null,"abstract":"<p>Databases containing lexical properties are of primary importance to psycholinguistic research and speech-language therapy. Several lexical databases for different languages have been developed in the recent past, but Kannada, a language spoken by 50.8 million people, has no comprehensive lexical database yet. To address this, <i>KannadaLex</i>, a Kannada lexical database is built as a language resource that contains orthographic, phonological, and syllabic information about words that are sourced from newspaper articles from the last decade. Along with these vital statistics like the phonological neighbourhood, syllable complexity summed syllable and bigram syllable frequencies, and lemma and inflectional family information are stored. The database is validated by correlating frequency, a well-established psycholinguistic feature, with other numerical features. The developed lexical database contains 170K words from varied disciplines, complete with psycholinguistic features. This <i>KannadaLex</i> is a comprehensive resource for psycholinguists, speech therapists, and linguistic researchers for analyzing Kannada and other similar languages. Psycholinguists require lexical data for choosing stimuli to conduct experiments that study the factors that enable humans to acquire, use, comprehend, and produce language. Speech and language therapists query these databases for developing the most efficient stimuli for evaluating, diagnosing, and treating communication disorders, and rehabilitation of speech after brain injuries.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"70 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Document-Level Relation Extraction Based on Machine Reading Comprehension and Hybrid Pointer-sequence Labeling","authors":"xiaoyi wang, Jie Liu, Jiong Wang, Jianyong Duan, guixia guan, qing zhang, Jianshe Zhou","doi":"10.1145/3666042","DOIUrl":"https://doi.org/10.1145/3666042","url":null,"abstract":"<p>Document-level relational extraction requires reading, memorization and reasoning to discover relevant factual information in multiple sentences. It is difficult for the current hierarchical network and graph network methods to fully capture the structural information behind the document and make natural reasoning from the context. Different from the previous methods, this paper reconstructs the relation extraction task into a machine reading comprehension task. Each pair of entities and relationships is characterized by a question template, and the extraction of entities and relationships is translated into identifying answers from the context. To enhance the context comprehension ability of the extraction model and achieve more precise extraction, we introduce large language models (LLMs) during question construction, enabling the generation of exemplary answers. Besides, to solve the multi-label and multi-entity problems in documents, we propose a new answer extraction model based on hybrid pointer-sequence labeling, which improves the reasoning ability of the model and realizes the extraction of zero or multiple answers in documents. Extensive experiments on three public datasets show that the proposed method is effective.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"8 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage","authors":"Bing Qiu, Jiahao Huo","doi":"10.1145/3665794","DOIUrl":"https://doi.org/10.1145/3665794","url":null,"abstract":"<p>Stylistic analysis enables open-ended and exploratory observation of languages. To fill the gap in the quantitative analysis of the stylistic systems of Middle Chinese, we construct lexical features based on the evolutive core word usage and scheme a Bayesian method for feature parameters estimation. The lexical features are from the Swadesh list, each of which has different word forms along with the language evolution during the Middle Ages. We thus count the varied word of those entries along with the language evolution as the linguistic features. With the Bayesian formulation, the feature parameters are estimated to construct a high-dimensional random feature vector in order to obtain the pair-wise dissimilarity matrix of all the texts based on different distance measures. Finally, we perform the spectral embedding and clustering to visualize, categorize and analyze the linguistic styles of Middle Chinese texts. The quantitative result agrees with the existing qualitative conclusions and furthermore, betters our understanding of the linguistic styles of Middle Chinese from both the inter-category and intra-category aspects. It also helps unveil the special styles induced by the indirect language contact.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"47 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCBG: Semantic-Constrained Bidirectional Generation for Emotional Support Conversation","authors":"Yangyang Xu, Zhuoer Zhao, Xiao Sun","doi":"10.1145/3666090","DOIUrl":"https://doi.org/10.1145/3666090","url":null,"abstract":"<p>The Emotional Support Conversation (ESC) task aims to deliver consolation, encouragement, and advice to individuals undergoing emotional distress, thereby assisting them in overcoming difficulties. In the context of emotional support dialogue systems, it is of utmost importance to generate user-relevant and diverse responses. However, previous methods failed to take into account these crucial aspects, resulting in a tendency to produce universal and safe responses (e.g., “I do not know” and “I am sorry to hear that”). To tackle this challenge, a semantic-constrained bidirectional generation (SCBG) framework is utilized for generating more diverse and user-relevant responses. Specifically, we commence by selecting keywords that encapsulate the ongoing dialogue topics based on the context. Subsequently, a bidirectional generator generates responses incorporating these keywords. Two distinct methodologies, namely statistics-based and prompt-based methods, are employed for keyword extraction. Experimental results on the ESConv dataset demonstrate that the proposed SCBG framework improves response diversity and user relevance while ensuring response quality.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"90 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MizBERT: A Mizo BERT Model","authors":"Robert Lalramhluna, Sandeep Dash, Dr.Partha Pakray","doi":"10.1145/3666003","DOIUrl":"https://doi.org/10.1145/3666003","url":null,"abstract":"<p>This research investigates the utilization of pre-trained BERT transformers within the context of the Mizo language. BERT, an abbreviation for Bidirectional Encoder Representations from Transformers, symbolizes Google’s forefront neural network approach to Natural Language Processing (NLP), renowned for its remarkable performance across various NLP tasks. However, its efficacy in handling low-resource languages such as Mizo remains largely unexplored. In this study, we introduce <i>MizBERT</i>, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, <i>MizBERT</i> has been tailored to accommodate the nuances of the Mizo language. Evaluation of <i>MizBERT’s</i> capabilities is conducted using two primary metrics: Masked Language Modeling (MLM) and Perplexity, yielding scores of 76.12% and 3.2565, respectively. Additionally, its performance in a text classification task is examined. Results indicate that <i>MizBERT</i> outperforms both the multilingual BERT (mBERT) model and the Support Vector Machine (SVM) algorithm, achieving an accuracy of 98.92%. This underscores <i>MizBERT’s</i> proficiency in understanding and processing the intricacies inherent in the Mizo language.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141151258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features","authors":"Gayathri G L, Krithika Swaminathan, Divyasri Krishnakumar, Thenmozhi D, Bharathi B","doi":"10.1145/3664619","DOIUrl":"https://doi.org/10.1145/3664619","url":null,"abstract":"<p>In recent years, a significant portion of the content on various platforms on the internet has been found to be offensive or abusive. Abusive comment detection can go a long way in preventing internet users from facing the adverse effects of coming in contact with abusive language. This problem is particularly challenging when the comments are found in low-resource languages like Tamil or Tamil-English code-mixed text. So far, there has not been any substantial work on abusive comment detection using imbalanced datasets. Furthermore, significant work has not been performed, especially for Tamil code-mixed data, that involves analysing the dataset for classification and accordingly creating a custom vocabulary for preprocessing. This paper proposes a novel approach to classify abusive comments from an imbalanced dataset using a customised training vocabulary and a combination of statistical feature selection with language-agnostic feature selection while making use of explainable AI for feature refinement. Our model achieved an accuracy of 74% and a macro F1-score of 0.46.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"66 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}