{"title":"Sentiment analysis of Mizo using lexical features in low resource based models","authors":"Mercy Lalthangmawii , Thoudam Doren Singh","doi":"10.1016/j.nlp.2025.100181","DOIUrl":"10.1016/j.nlp.2025.100181","url":null,"abstract":"<div><div>Sentiment analysis is a vital area of natural language processing (NLP) for interpreting emotions in user-generated content. Although significant progress has been made for widely spoken languages, low-resource languages such as Mizo remain underexplored. This study addresses this gap by developing the first comprehensive sentiment analysis framework for Mizo language. We created a meticulously annotated data set that captures positive, negative, and neutral sentiments. Using classical machine learning models enhanced with lexicon features and transfer learning with XLM-RoBERTa, we demonstrate the feasibility of sentiment analysis in low-resource settings. Our approach achieves an accuracy of 82% with Logistic Regression and 78% with XLM-RoBERTa, which establishes a benchmark for future research in Mizo sentiment analysis.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"13 ","pages":"Article 100181"},"PeriodicalIF":0.0,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145159702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RACHNA: Racial hoax code mixed Hindi–English with novel language augmentation","authors":"Shanu SidharthKumar Dhawale , Rahul Ponnusamy , Prasanna Kumar Kumaresan , Sajeetha Thavareesan , Saranya Rajiakodi , Bharathi Raja Chakravarthi","doi":"10.1016/j.nlp.2025.100183","DOIUrl":"10.1016/j.nlp.2025.100183","url":null,"abstract":"<div><div><strong>Warning</strong>: This paper contains derogatory language that may be offensive to some readers. As a type of misinformation, hoaxes seek to propagate incorrect information in order to gain popularity on social media. Racial hoaxes are a particular kind of hoax that is particularly harmful since they falsely link individuals or groups to crimes or incidents. This involves nuanced challenges of identifying false accusations, fabrications, and stereotypes that falsely impact other social, ethnic or out groups in negative actions. On the other hand, social media comments frequently incorporate many languages and are written in scripts that are not native to the user. They also rarely adhere to inflexible grammar norms. Lack of code-mixed racial hoax annotated data for a Low-resource languages like Code-Mixed Hindi and English make this issue more challenging. In order to address this, we collected 210,768 sentences and generated a racial hoax-annotated, code-mixed corpus of 5,105 YouTube comment postings in Hindi–English as HoaxMixPlus corpus. We outline the method of building the corpus and assigning the binary values indicating the presence of racial hoax which fills a critical gap in understanding and combating racialized misinformation along with inter-annotator agreement. We display the results of analysis, training using this corpus as a benchmark, new methodologies which includes dictionary based approach by correctly identifying code-mixed words as well as novel language augmentation strategies like transliteration and language tags. We evaluate several models on this dataset and demonstrate that our augmentation strategies lead to consistent performance gains.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"13 ","pages":"Article 100183"},"PeriodicalIF":0.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145098436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position-aware end-to-end cross-document event coreference resolution for Dutch","authors":"Loic De Langhe, Orphée De Clercq, Veronique Hoste","doi":"10.1016/j.nlp.2025.100184","DOIUrl":"10.1016/j.nlp.2025.100184","url":null,"abstract":"<div><div>Natural language understanding entails the ability to comprehend the relations between various people, objects or events throughout one, or multiple, text(s). Event coreference resolution (ECR) is a discourse-based natural language processing (NLP) task which aims to link those textual events, be they real or fictional, that refer to the same conceptual event. In this paper, we introduce a novel end-to-end approach for cross-document ECR which combines expert-level positional knowledge and graph-based representations in order to create a memory-efficient and accurate system meant for the detection and resolution of events in large document collections. We make three fundamental architectural changes to a current state-of-the-art cross-document ECR system and show that our approach outperforms this earlier model (+ 4% CONLL F1) on a large Dutch ECR dataset. Moreover, we show through in-depth qualitative and quantitative analysis that our proposed approach consistently detects more relevant events and suffers notably less from the typical issues models exhibit when predicting coreference chains.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"13 ","pages":"Article 100184"},"PeriodicalIF":0.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145098437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAEVa: A hybrid approach for matching agroecological experiment variables","authors":"Oussama Mechhour , Sandrine Auzoux , Clément Jonquet , Mathieu Roche","doi":"10.1016/j.nlp.2025.100180","DOIUrl":"10.1016/j.nlp.2025.100180","url":null,"abstract":"<div><div>Source variables or observable properties used to describe agroecological experiments are heterogeneous, nonstandardized, and multilingual, which makes them challenging to understand, explain, and use in cropping system modeling and multicriteria evaluations of agroecological system performance. Data annotation via a controlled vocabulary, known as candidate variables from the agroecological global information system (AEGIS), offers a solution. Text similarity measures play crucial roles in tasks such as word-sense disambiguation, schema matching in databases, and data annotation. Commonly used measures include (1) string-based similarity, (2) corpus-based similarity, (3) knowledge-based similarity, and (4) hybrid-based similarity, which combine two or more of these measures. This work presents a hybrid approach called Matching Agroecological Experiment Variables (MAEVa), which combines well-known techniques (PLMs, multi-head attention, TF–IDF) tailored to the challenges of aligning source and candidate variables in agroecology. MAEVa integrates the following components: (1) Our key innovation, which consists of extending pretrained language models (PLMs) (i.e., BERT, SBERT, SimCSE) with an external multi-head attention layer for matching variable names; (2) An analysis of the relevance and impact of various data collection techniques (snippet extraction, scientific articles) and prompt-based data augmentation on TF–IDF for matching variable descriptions; (3) A linear combination of components (1) and (2); and (4) A voting-based method for selecting the final matching results. Experimental results demonstrate that extending PLMs with an external multi-head attention layer improves the matching of variable names. Furthermore, TF–IDF benefits consistently from the presence of an enriched corpus, regardless of the specific enrichment technique employed.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"13 ","pages":"Article 100180"},"PeriodicalIF":0.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145098435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A synergistic multi-stage RAG architecture for boosting context relevance in data science literature","authors":"Ahmet Yasin Aytar, Kamer Kaya, Kemal Kılıç","doi":"10.1016/j.nlp.2025.100179","DOIUrl":"10.1016/j.nlp.2025.100179","url":null,"abstract":"<div><div>Navigating the voluminous and rapidly evolving data science literature presents a significant bottleneck for researchers and practitioners. Standard Retrieval-Augmented Generation (RAG) systems often struggle with retrieving precisely relevant context from this dense academic corpus. This paper introduces a synergistic multi-stage RAG architecture specifically tailored to overcome these challenges. Our approach integrates structured document parsing (GROBID), domain-specific embedding fine-tuning derived from textbooks, semantic chunking for coherence, and proposes a novel ’Abstract First’ retrieval strategy that prioritizes concise, high-signal summaries. Through rigorous evaluation using the RAGAS framework and a custom data science query set, we demonstrate that this integrated architecture significantly boosts Context Relevance by over 15-fold compared to baseline RAG, surpassing configurations using only subsets of these enhancements. These findings underscore the critical importance of multi-stage optimization and highlight the surprising efficacy of the abstract-centric retrieval method for specialized academic domains, offering a validated pathway to more effective literature navigation in data science.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"13 ","pages":"Article 100179"},"PeriodicalIF":0.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145060827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing social media discourse of avian influenza outbreaks","authors":"Marzieh Soltani , Shayan Sharif , Rozita Dara","doi":"10.1016/j.nlp.2025.100176","DOIUrl":"10.1016/j.nlp.2025.100176","url":null,"abstract":"<div><div>The ongoing avian influenza outbreaks have had significant implications for the global poultry industry in addition to a wide range of wild birds and mammals. To enhance our understanding of public perceptions and reactions during such outbreaks, the present study examined social media discourse surrounding avian influenza on X (formerly known as Twitter). By employing advanced large language models, including DistilBERT for post filtering (average 89.5% accuracy via 5-fold cross-validation) along with Mixtral-8x7B, BERTopic, and RoBERTa for sentiment and topic/user analysis, this research categorizes the discussions and sentiments expressed by users over time. Our analysis focused on three aspects: main topics, sentiment, and temporal patterns of user engagement surrounding avian influenza outbreaks. Sentiment analysis revealed that a majority of posts related to economic impact (81.2%), wildlife (71.7%), and human cases (67.9%) expressed negative sentiment. Through topic modeling, prevalent topics of concern were identified in discussions, including concerns about transmission to humans and mammals, as well as issues related to food security and food prices. Additionally, the analysis of user engagement patterns showed distinct categories of users and highlighted the contributions of top users in shaping the discourse. Emotion analysis showed that over 80% of posts on major topics conveyed emotions such as anger, sadness, and fear, especially during periods of high case reports. The present study underscores the potential of social media analysis to understand public reactions to avian influenza outbreaks and to facilitate effective responses to public concerns and needs.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100176"},"PeriodicalIF":0.0,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144886249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RadarTD: A Radar Text Dataset for multi-parameter optimization","authors":"Jackson S. Zaunegger , Paul G. Singerman , Ram M. Narayanan , Muralidhar Rangaswamy","doi":"10.1016/j.nlp.2025.100178","DOIUrl":"10.1016/j.nlp.2025.100178","url":null,"abstract":"<div><div>This paper introduces the radar text dataset (RadarTD) for technical language modeling. This dataset is comprised of sentences containing radar parameters, values, and units determined from published radar literature. Additionally, each statement is assigned a sentiment, goal priority, and goal direction label. In this work, we show how RadarTD may be used to train simple Natural Language Processing (NLP) models to identify the attributes of each sentence listed in RadarTD. Once the NLP models have identified these attributes from text, we can use this information to develop Language Based Cost Functions (LBCF). Our study shows that the proposed text classification model achieves a classification accuracy between 96.7% and 97.8%, while the proposed named entity recognition model achieves an F1 score of 99.7. These findings suggest that the developed models are capable of achieving good performance for both text classification and named entity recognition for autonomous radar applications. We then illustrate an example of how these models could be used with Language Based Cost Functions to develop multi-parameter radar optimization schemes. We also provide a method of providing scalarization weights for each parameter, to improve the results of the optimization process.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100178"},"PeriodicalIF":0.0,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144907138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large language models for ingredient substitution in food recipes using supervised fine-tuning and direct preference optimization","authors":"Thevin Senath , Kumuthu Athukorala , Ransika Costa , Surangika Ranathunga , Rishemjit Kaur","doi":"10.1016/j.nlp.2025.100177","DOIUrl":"10.1016/j.nlp.2025.100177","url":null,"abstract":"<div><div>In this paper, we address the challenge of recipe personalization through ingredient substitution. We make use of Large Language Models (LLMs) to build an ingredient substitution system designed to predict plausible substitute ingredients within a given recipe context. Given that the use of LLMs for this task has been barely done, we carry out an extensive set of experiments to determine the best LLM, prompt, and the fine-tuning setups. We further experiment with methods such as multi-task learning, two-stage fine-tuning, and Direct Preference Optimization (DPO). The experiments are conducted using the publicly available Recipe1MSub corpus. The best results are produced by the Mistral7-Base LLM after fine-tuning and DPO. This result outperforms the strong baseline available for the same corpus with a Hit@1 score of 22.04. Although LLM results lag behind the baseline with respect to other metrics such as Hit@3 and Hit@10, we believe that this research represents a promising step towards enabling personalized and creative culinary experiences by utilizing LLM-based ingredient substitution.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100177"},"PeriodicalIF":0.0,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144865457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed training of large language models: A survey","authors":"Fanlong Zeng , Wensheng Gan , Yongheng Wang , Philip S. Yu","doi":"10.1016/j.nlp.2025.100174","DOIUrl":"10.1016/j.nlp.2025.100174","url":null,"abstract":"<div><div>The emergence of large language models (LLMs) such as ChatGPT has opened up groundbreaking possibilities, enabling a wide range of applications in diverse fields, including healthcare, law, and education. A recent research report highlighted that the performance of these models is often closely tied to their parameter scale, raising a pressing question: how can we effectively train LLMs? This concern is at the forefront of many researchers’ minds. Currently, several distributed training frameworks, such as Megatron-LM and DeepSpeed, are widely used. In this paper, we provide a comprehensive overview of the current state of LLMs, beginning with an introduction to their development status. We then dig into the common parallel strategies employed in LLM distributed training, followed by an examination of the underlying technologies and frameworks that support these models. Next, we discuss the state-of-the-art optimization techniques used in LLMs. Finally, we summarize some key challenges and limitations of current LLM training methods and outline potential future directions for the development of LLMs.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100174"},"PeriodicalIF":0.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144714199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting indicators of violence in digital text using deep learning","authors":"Abbas Z. Kouzani, Muhammad Nouman","doi":"10.1016/j.nlp.2025.100175","DOIUrl":"10.1016/j.nlp.2025.100175","url":null,"abstract":"<div><div>Individuals who experience violence often use digital platforms to share their experiences and find assistance. Artificial intelligence (AI) techniques have emerged as one of the successful technological strategies used for the detection of indicators of violence in various forms of data, particularly text communications. A hybrid deep learning model is introduced in this paper for the detection of violence indicators in online text communications. It enables the extraction of word embeddings from texts to infer the contextual relationships among words. Additionally, it uses a classifier capable of processing sequential data in both forward as well as backward directions. This approach enables the retention of long-term dependencies from texts while maintaining semantic relationships between words. The word embeddings extraction is implemented with the use of the bidirectional encoder representations from transformer algorithm. The sequence processing classification is implemented by incorporating a combination of parallel layers consisting of the bidirectional long–short-term memory as well as the bidirectional gated recurrent unit algorithms. The developed deep learning architecture is experimentally tested, and the associated results are compared with those of several other machine learning models. The findings are presented and discussed.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100175"},"PeriodicalIF":0.0,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144672206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}