{"title":"Deception detection using machine learning (ML) and deep learning (DL) techniques: A systematic review","authors":"Shanjita Akter Prome , Neethiahnanthan Ari Ragavan , Md Rafiqul Islam , David Asirvatham , Anasuya Jegathevi Jegathesan","doi":"10.1016/j.nlp.2024.100057","DOIUrl":"10.1016/j.nlp.2024.100057","url":null,"abstract":"<div><p>Deception detection is a crucial concern in our daily lives because of its effect on social interactions. The human face is a rich source of data that offers trustworthy markers of deception, and deception detection systems based on facial expressions are non-intrusive, cost-effective, and mobile. Over the last decade, numerous studies have been conducted on deception/lie detection using several advanced techniques, and researchers have devoted considerable attention to inventing more effective and efficient solutions. However, there are still many opportunities for innovative deception detection methods. Thus, in this literature review, we conduct a statistical analysis following the PRISMA protocol and extract articles from five e-databases. The main objectives of this paper are (i) to provide an overview of machine learning (ML) and deep learning (DL) techniques for deception detection, (ii) to outline the existing literature, and (iii) to address the current challenges and research prospects for further study. While significant issues in deception detection methods are acknowledged, the review highlights key conclusions and offers a systematic analysis of state-of-the-art techniques, emphasizing contributions and opportunities. 
The findings illuminate current trends and future research prospects, fostering ongoing development in the field.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100057"},"PeriodicalIF":0.0,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000050/pdfft?md5=eef92a93b295ca392877e0d65bfe7ec7&pid=1-s2.0-S2949719124000050-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139638524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep temporal modelling of clinical depression through social media text","authors":"Nawshad Farruque , Randy Goebel , Sudhakar Sivapalan , Osmar Zaïane","doi":"10.1016/j.nlp.2023.100052","DOIUrl":"https://doi.org/10.1016/j.nlp.2023.100052","url":null,"abstract":"<div><p>We describe the development of a model to detect user-level clinical depression based on a user’s temporal social media posts. Our model uses a Depression Symptoms Detection (DSD) classifier, which is trained on the largest existing samples of clinician-annotated tweets for clinical depression symptoms. We subsequently use our DSD model to extract clinically relevant features, e.g., depression scores and their consequent temporal patterns, as well as user posting activity patterns, e.g., quantifying their “no activity” or “silence.” Furthermore, to evaluate the efficacy of these extracted features, we create three kinds of datasets, including a test dataset, from two existing well-known benchmark datasets for user-level depression detection. We then provide accuracy measures based on single features, baseline features and feature ablation tests, at several different levels of temporal granularity. The relevant data distributions and clinical depression detection related settings can be exploited to draw a complete picture of the impact of different features across our created datasets. Finally, we show that, in general, only semantically oriented representation models perform well. However, clinical features may enhance overall performance provided that the training and testing distributions are similar and there is more data in a user’s timeline. 
The consequence is that the predictive capability of depression scores increases significantly when used in more sensitive clinical depression detection settings.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100052"},"PeriodicalIF":0.0,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000493/pdfft?md5=0d6383093fc7867b461d44edd1c64ce4&pid=1-s2.0-S2949719123000493-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139550053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying hidden patterns of fake COVID-19 news: An in-depth sentiment analysis and topic modeling approach","authors":"Tanvir Ahammad","doi":"10.1016/j.nlp.2024.100053","DOIUrl":"10.1016/j.nlp.2024.100053","url":null,"abstract":"<div><p>Spreading misinformation and fake news about COVID-19 has become a critical concern. It contributes to a lack of trust in public health authorities, hinders efforts to control the virus’s spread, and risks people’s lives. This study aims to gain insights into the types of misinformation spread and develop an in-depth analytical approach for analyzing COVID-19 fake news. It combines Sentiment Analysis (SA) and Topic Modeling (TM) to improve the accuracy of topic extraction from large volumes of unstructured text by considering the sentiment of the words. A dataset containing 10,254 news headlines from various sources was collected and prepared, and rule-based SA was applied to label the dataset with three sentiment tags. Among the TM models evaluated, Latent Dirichlet Allocation (LDA) demonstrated the highest coherence score of 0.66 for 20 coherent negative sentiment-based topics and 0.573 for 18 coherent positive fake news topics, outperforming Non-negative Matrix Factorization (NMF) (coherence: 0.43) and Latent Semantic Analysis (LSA) (coherence: 0.40). The topics extracted from the experiments highlight that misinformation primarily revolves around the COVID vaccine, crime, quarantine, medicine, and political and social aspects. This research offers insight into the effects of COVID-19 fake news, provides a valuable method for detecting and analyzing misinformation, and emphasizes the importance of understanding the patterns and themes of fake news for protecting public health and promoting scientific accuracy. 
Moreover, it can aid in developing real-time monitoring systems to combat misinformation, extending beyond COVID-19-related fake news and enhancing the applicability of the findings.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100053"},"PeriodicalIF":0.0,"publicationDate":"2024-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000013/pdfft?md5=8f1425dee06c23636d0b5b055c7010af&pid=1-s2.0-S2949719124000013-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139394597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review of sentiment analysis for Afaan Oromo: Current trends and future perspectives","authors":"Jemal Abate , Faizur Rashid","doi":"10.1016/j.nlp.2023.100051","DOIUrl":"10.1016/j.nlp.2023.100051","url":null,"abstract":"<div><p>Sentiment analysis, commonly referred to as opinion mining, is a fast-expanding area that seeks to ascertain the sentiment expressed in textual data. While sentiment analysis has been extensively studied for major languages such as English, research focusing on low-resource languages like Afaan Oromo is still limited. This review article surveys the existing techniques and approaches used for sentiment analysis specifically for Afaan Oromo, a widely spoken language in Ethiopia. The review highlights the effectiveness of combining neural network architectures, such as Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) models, with clustering techniques like Gaussian Mixture Models (GMM) and classifiers such as Support Vector Machines (SVM) in sentiment analysis for Afaan Oromo. These approaches have demonstrated promising results in various domains, including social media content and SMS texts. However, the lack of a standardized corpus for Afaan Oromo NLP tasks remains a major challenge, which indicates the need for comprehensive data collection and preparation. 
Additionally, challenges related to domain-specific language, informal expressions, and context-specific polarity orientations pose difficulties for sentiment analysis in Afaan Oromo.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100051"},"PeriodicalIF":0.0,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000481/pdfft?md5=e70b97eefccb0378b45c08e181baa491&pid=1-s2.0-S2949719123000481-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139195139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a large sized curated and annotated corpus for discriminating between human written and AI generated texts: A case study of text sourced from Wikipedia and ChatGPT","authors":"Aakash Singh, Deepawali Sharma, Abhirup Nandy, Vivek Kumar Singh","doi":"10.1016/j.nlp.2023.100050","DOIUrl":"https://doi.org/10.1016/j.nlp.2023.100050","url":null,"abstract":"<div><p>The recently launched large language models have the capability to generate text and engage in human-like conversations and question-answering. Owing to their capabilities, these models are now being widely used for a variety of purposes, ranging from question answering to writing scholarly articles. These models produce such good outputs that it is becoming very difficult to identify which texts are written by human beings and which by these programs. This has also led to different kinds of problems, such as out-of-context literature, lack of novelty in articles, and issues of plagiarism and lack of proper attribution and citations to the original texts. Therefore, there is a need for suitable computational resources for developing algorithmic approaches that can identify and discriminate between human and machine generated texts. This work contributes towards this research problem by providing a large, curated, and annotated corpus comprising 44,162 text articles sourced from Wikipedia and ChatGPT. Some baseline models are also applied to the developed dataset, and the results obtained are analyzed and discussed. 
The curated corpus offers a valuable resource that can be used to advance the research in this important area and thereby contribute to the responsible and ethical integration of AI language models into various fields.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100050"},"PeriodicalIF":0.0,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S294971912300047X/pdfft?md5=48afd2554f84aa4af2b6e1f9fb5dbc60&pid=1-s2.0-S294971912300047X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139100584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of GPT-3 family large language models including ChatGPT and GPT-4","authors":"Katikapalli Subramanyam Kalyan","doi":"10.1016/j.nlp.2023.100048","DOIUrl":"https://doi.org/10.1016/j.nlp.2023.100048","url":null,"abstract":"<div><p>Large language models (LLMs) are a special class of pretrained language models (PLMs) obtained by scaling model size, pretraining corpus and computation. LLMs, because of their large size and pretraining on large volumes of text data, exhibit special abilities which allow them to achieve remarkable performance without any task-specific training on many natural language processing tasks. The era of LLMs started with OpenAI’s GPT-3 model, and the popularity of LLMs has increased exponentially after the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey paper with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss the performance of GLLMs in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. 
To summarize, this comprehensive survey paper will serve as a good resource for both academic and industry researchers seeking to stay updated with the latest research related to GLLMs.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100048"},"PeriodicalIF":0.0,"publicationDate":"2023-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000456/pdfft?md5=72753bb0aac6b7c01d0dc8bddfb62121&pid=1-s2.0-S2949719123000456-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139100583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gender bias in transformers: A comprehensive review of detection and mitigation strategies","authors":"Praneeth Nemani , Yericherla Deepak Joel , Palla Vijay , Farhana Ferdouzi Liza","doi":"10.1016/j.nlp.2023.100047","DOIUrl":"10.1016/j.nlp.2023.100047","url":null,"abstract":"<div><p>Gender bias in artificial intelligence (AI) has emerged as a pressing concern with profound implications for individuals’ lives. This paper presents a comprehensive survey that explores gender bias in Transformer models from a linguistic perspective. While the existence of gender bias in language models has been acknowledged in previous studies, there remains a lack of consensus on how to measure and evaluate this bias effectively. Our survey critically examines the existing literature on gender bias in Transformers, shedding light on the diverse methodologies and metrics employed to assess bias. Several limitations in current approaches to measuring gender bias in Transformers are identified, encompassing the utilization of incomplete or flawed metrics, inadequate dataset sizes, and a dearth of standardization in evaluation methods. Furthermore, our survey delves into the potential ramifications of gender bias in Transformers for downstream applications, including dialogue systems and machine translation. We underscore the importance of fostering equity and fairness in these systems by emphasizing the need for heightened awareness and accountability in developing and deploying language technologies. 
This paper serves as a comprehensive overview of gender bias in Transformer models, providing novel insights and offering valuable directions for future research in this critical domain.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100047"},"PeriodicalIF":0.0,"publicationDate":"2023-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000444/pdfft?md5=bfc905884945510f2b2e207d895b481c&pid=1-s2.0-S2949719123000444-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139022065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EPAG: A novel enhanced move recognition algorithm based on continuous learning mechanism with positional embedding","authors":"Hao Wen , Jie Wang , Xiaodong Qiao","doi":"10.1016/j.nlp.2023.100049","DOIUrl":"https://doi.org/10.1016/j.nlp.2023.100049","url":null,"abstract":"<div><p>The identification of moves in abstracts plays a vital role in efficiently locating content and providing clarity to an article. Existing move recognition algorithms have a limited capacity to capture adjacent-word positional information when word changes in Chinese expressions alter the contextual semantics. This paper introduces EPAG, a novel enhanced move recognition algorithm with an improved pre-trained framework and downstream model for unstructured abstracts of Chinese scientific and technological papers. The proposed algorithm first performs data segmentation and vocabulary training. The EPAG framework is then leveraged to incorporate word positional information, facilitating deep semantic learning and targeted feature extraction. 
Experimental results demonstrate that the proposed algorithm achieves 13.37% higher accuracy on the split dataset than on the original dataset and a 7.55% improvement in accuracy over the basic comparison model.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100049"},"PeriodicalIF":0.0,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000468/pdfft?md5=e65848012bae8aeab0939b4bcb600659&pid=1-s2.0-S2949719123000468-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138769459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context is not key: Detecting Alzheimer’s disease with both classical and transformer-based neural language models","authors":"Behrad TaghiBeyglou , Frank Rudzicz","doi":"10.1016/j.nlp.2023.100046","DOIUrl":"10.1016/j.nlp.2023.100046","url":null,"abstract":"<div><p>Natural language processing (NLP) has exhibited potential in detecting Alzheimer’s disease (AD) and related dementias, particularly due to the impact of AD on spontaneous speech. Recent research has emphasized the significance of context-based models, such as Bidirectional Encoder Representations from Transformers (BERT). However, these models often come at the expense of increased complexity and computational requirements, which are not always accessible. In light of these considerations, we propose a straightforward and efficient word2vec-based model for AD detection, and evaluate it on the Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSS) challenge dataset. Additionally, we explore the efficacy of fusing our model with classic linguistic features and compare this to other contextual models by fine-tuning BERT-based and Generative Pre-training Transformer (GPT) sequence classification models. We find that simpler models achieve a remarkable accuracy of 92% in classifying AD cases, along with a root mean square error of 4.21 in estimating Mini-Mental Status Examination (MMSE) scores. 
Notably, our models outperform all state-of-the-art models in the literature for classifying AD cases and estimating MMSE scores, including contextual language models.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100046"},"PeriodicalIF":0.0,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000432/pdfft?md5=336a0f84783ed1740358a38f35a9194c&pid=1-s2.0-S2949719123000432-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138610270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking topic models on scientific articles using BERTeley","authors":"Eric Chagnon, Ronald Pandolfi, Jeffrey Donatelli, Daniela Ushizima","doi":"10.1016/j.nlp.2023.100044","DOIUrl":"10.1016/j.nlp.2023.100044","url":null,"abstract":"<div><p>The introduction of BERTopic marked a crucial advancement in topic modeling and presented a topic model that outperformed both traditional and modern topic models in terms of topic modeling metrics on a variety of corpora. However, unique issues arise when topic modeling is performed on scientific articles. This paper introduces BERTeley, an innovative tool built upon BERTopic, designed to alleviate these shortcomings and improve the usability of BERTopic when conducting topic modeling on a corpus consisting of scientific articles. This is accomplished through BERTeley’s three main features: scientific article preprocessing, topic modeling using pre-trained scientific language models, and topic model metric calculation. Furthermore, an experiment was conducted comparing topic models using four different language models in three corpora consisting of scientific articles.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100044"},"PeriodicalIF":0.0,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000419/pdfft?md5=ba7f61749a42e9736def8c59c69a58d2&pid=1-s2.0-S2949719123000419-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138620209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}