{"title":"AugGPT: Leveraging ChatGPT for Text Data Augmentation","authors":"Haixing Dai;Zhengliang Liu;Wenxiong Liao;Xiaoke Huang;Yihan Cao;Zihao Wu;Lin Zhao;Shaochen Xu;Fang Zeng;Wei Liu;Ninghao Liu;Sheng Li;Dajiang Zhu;Hongmin Cai;Lichao Sun;Quanzheng Li;Dinggang Shen;Tianming Liu;Xiang Li","doi":"10.1109/TBDATA.2025.3536934","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3536934","url":null,"abstract":"Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning (FSL) scenario, where the data in the target domain is generally much scarcer and of lowered quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation to better capture data invariance and increase the sample size. However, current text data augmentation methods either can’t ensure the correct labeling of the generated data (lacking faithfulness), or can’t ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models (LLM), especially the development of ChatGPT, we propose a text data augmentation approach based on ChatGPT (named “AugGPT”). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. 
Experiment results on multiple few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"907-918"},"PeriodicalIF":7.5,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expertise or Hallucination? A Comprehensive Evaluation of ChatGPT's Aptitude in Clinical Genetics","authors":"Yingbo Zhang;Shumin Ren;Jiao Wang;Chaoying Zhan;Mengqiao He;Xingyun Liu;Rongrong Wu;Jing Zhao;Cong Wu;Chuanzhu Fan;Bairong Shen","doi":"10.1109/TBDATA.2025.3536939","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3536939","url":null,"abstract":"Whether viewed as an expert or as a source of ‘knowledge hallucination’, the use of ChatGPT in medical practice has stirred ongoing debate. This study sought to evaluate ChatGPT's capabilities in the field of clinical genetics, focusing on tasks such as ‘Clinical genetics exams’, ‘Associations between genetic diseases and pathogenic genes’, and ‘Limitations and trends in clinical genetics’. Results indicated that ChatGPT performed exceptionally well in question-answering tasks, particularly in clinical genetics exams and diagnosing single-gene diseases. It also effectively outlined the current limitations and prospective trends in clinical genetics. However, ChatGPT struggled to provide comprehensive answers regarding multi-gene or epigenetic diseases, particularly with respect to genetic variations or chromosomal abnormalities. In terms of systematic summarization and inference, some randomness was evident in ChatGPT's responses. In summary, while ChatGPT possesses a foundational understanding of general knowledge in clinical genetics due to hyperparameter learning, it encounters significant challenges when delving into specialized knowledge and navigating the complexities of clinical genetics, particularly in mitigating ‘Knowledge Hallucination’. 
To optimize its performance and depth of expertise in clinical genetics, integration with specialized knowledge databases and knowledge graphs is imperative.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"919-932"},"PeriodicalIF":7.5,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Modal Assessment Framework for Comparison of Specialized Deep Learning and General-Purpose Large Language Models","authors":"Mohammad Nadeem;Shahab Saquib Sohail;Dag Øivind Madsen;Ahmed Ibrahim Alzahrani;Javier Del Ser;Khan Muhammad","doi":"10.1109/TBDATA.2025.3536937","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3536937","url":null,"abstract":"Recent years have witnessed tremendous advancements in AI tools (e.g., ChatGPT, GPT-4, and Bard), driven by the growing power, reasoning, and efficiency of Large Language Models (LLMs). LLMs have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. Despite their proficiency in general queries, specialized tasks such as metaphor understanding and fake news detection often require finely tuned models, posing a comparison challenge with specialized Deep Learning (DL). We propose an assessment framework to compare task-specific intelligence with general-purpose LLMs on suicide and depression tendency identification. For this purpose, we trained two DL models on a suicide and depression detection dataset, followed by testing their performance on a test set. Afterward, the same test dataset is used to evaluate the performance of four LLMs (GPT-3.5, GPT-4, Google Bard, and MS Bing) using four classification metrics. The BERT-based DL model performed the best among all, with a testing accuracy of 94.61%, while GPT-4 was the runner-up with accuracy 92.5%. Results demonstrate that LLMs do not outperform the specialized DL models but are able to achieve comparable performance, making them a decent option for downstream tasks without specialized training. 
However, LLMs outperformed specialized models on the reduced dataset.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1001-1012"},"PeriodicalIF":7.5,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"2024 Reviewers List*","authors":"","doi":"10.1109/TBDATA.2025.3526356","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3526356","url":null,"abstract":"","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"310-313"},"PeriodicalIF":7.5,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843074","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advances in Robust Federated Learning: A Survey With Heterogeneity Considerations","authors":"Chuan Chen;Tianchi Liao;Xiaojun Deng;Zihou Wu;Sheng Huang;Zibin Zheng","doi":"10.1109/TBDATA.2025.3527202","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3527202","url":null,"abstract":"In the field of heterogeneous federated learning (FL), the key challenge is to efficiently and collaboratively train models across multiple clients with different data distributions, model structures, task objectives, computational capabilities, and communication resources. This diversity leads to significant heterogeneity, which increases the complexity of model training. In this paper, we first outline the basic concepts of heterogeneous FL and summarize the research challenges in FL in terms of five aspects: data, model, task, device and communication. In addition, we explore how existing state-of-the-art approaches cope with the heterogeneity of FL, and categorize and review these approaches at three different levels: data-level, model-level, and architecture-level. Subsequently, the paper extensively discusses privacy-preserving strategies in heterogeneous FL environments. Finally, the paper discusses current open issues and directions for future research, aiming to promote the further development of heterogeneous FL.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1548-1567"},"PeriodicalIF":7.5,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big Data-Driven Advancements and Future Directions in Vehicle Perception Technologies: From Autonomous Driving to Modular Buses","authors":"Hongyi Lin;Yang Liu;Liang Wang;Xiaobo Qu","doi":"10.1109/TBDATA.2025.3527208","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3527208","url":null,"abstract":"The rapid development of Big Data and artificial intelligence (AI) is revolutionizing the automotive and transportation industries, leading to the creation of the Autonomous Modular Bus (AMB). Designed to address the key challenges of modern public transportation systems, the AMB adopts a modular dynamic assembly approach. However, existing research on the AMB predominantly focuses on operational aspects, whereas in-transit docking remains the primary obstacle to its commercial deployment. This challenge stems from the fact that current perception accuracy in autonomous vehicles is limited to the decimeter level, with insufficient capability to manage adverse weather and complex traffic conditions. To enable AMBs to achieve full-scenario autonomous driving capabilities, this paper reviews current perception technologies from three perspectives: single-vehicle single-sensor perception, multi-sensor fusion perception, and cooperative perception. It examines the characteristics of existing perception solutions and evaluates their applicability to AMB-specific requirements. 
Furthermore, considering the unique challenges of in-transit docking, this paper identifies and proposes four future research directions for advancing AMB perception systems as well as general autonomous driving technologies.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1568-1587"},"PeriodicalIF":7.5,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of Rumors and Their Sources in Social Networks: A Comprehensive Survey","authors":"Otabek Sattarov;Jaeyoung Choi","doi":"10.1109/TBDATA.2024.3522801","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3522801","url":null,"abstract":"With the recent advancements in social network platform technology, an overwhelming amount of information is spreading rapidly. In this situation, it can become increasingly difficult to discern what information is false or true. If false information proliferates significantly, it can lead to undesirable outcomes. Hence, when we receive some information, we can pose the following two questions: (i) Is the information true? (ii) If not, who initially spread that information? The first problem is the rumor detection issue, while the second is the rumor source detection problem. A rumor-detection problem involves identifying and mitigating false or misleading information spread via various communication channels, particularly online platforms and social media. Rumors can range from harmless ones to deliberately misleading content aimed at deceiving or manipulating audiences. Detecting misinformation is crucial for maintaining the integrity of information ecosystems and preventing harmful effects such as the spread of false beliefs, polarization, and even societal harm. Therefore, it is very important to quickly distinguish such misinformation while simultaneously finding its source to block it from spreading on the network. However, most of the existing surveys have analyzed these two issues separately. In this work, we first survey the existing research on the rumor-detection and rumor source detection problems with joint detection approaches, simultaneously. This survey deals with these two issues together so that their relationship can be observed and it provides how the two problems are similar and different. 
The limitations arising from the rumor detection, rumor source detection, and their combination problems are also explained, and some challenges to be addressed in future works are presented.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1528-1547"},"PeriodicalIF":7.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Privacy-Preserving Federated Item Ranking in Online Marketplaces: Exploiting Platform Reputation for Effective Aggregation","authors":"Guilherme Ramos;Ludovico Boratto;Mirko Marras","doi":"10.1109/TBDATA.2024.3505055","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3505055","url":null,"abstract":"Online marketplaces often collect products to sell from several other platforms (and sellers) and produce a unique ranking/score of these products to users. Keeping as private the user preferences provided in each (individual) platform is a need and a challenge at the same time. We are currently used to rating items in the marketplace itself which, in turn, can produce more effective rankings. Hence, the shaping of an effective item ranking would require a sharing of the user ratings between the individual platforms and the marketplace, thus impacting users’ privacy. In this paper, we propose the initial steps towards a change of paradigm, where the ratings are kept as private in each platform. Under this paradigm, each platform produces its rankings, then aggregated by the marketplace, in a federated fashion. To ensure that the marketplace’s rankings maintain their effectiveness, we exploit the concept of reputation of the individual platform, so that the final marketplace ranking is weighted by the reputation of each platform providing its ranking. 
Experiments on three datasets, covering different use cases, show that our approach can produce effective rankings, improving robustness to attacks, while keeping user preference data private within each seller platform.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"303-309"},"PeriodicalIF":7.5,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial TBD Special Issue on Graph Machine Learning for Recommender Systems","authors":"Senzhang Wang;Changdong Wang;Di Jin;Shirui Pan;Philip S. Yu","doi":"10.1109/TBDATA.2024.3452328","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3452328","url":null,"abstract":"","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 6","pages":"682-682"},"PeriodicalIF":7.5,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10750533","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Centric Graph Learning: A Survey","authors":"Yuxin Guo;Deyu Bo;Cheng Yang;Zhiyuan Lu;Zhongjian Zhang;Jixi Liu;Yufei Peng;Chuan Shi","doi":"10.1109/TBDATA.2024.3489412","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3489412","url":null,"abstract":"The history of artificial intelligence (AI) has witnessed the significant impact of high-quality data on various deep learning models, such as ImageNet for AlexNet and ResNet. Recently, instead of designing more complex neural architectures as model-centric approaches, the attention of AI community has shifted to data-centric ones, which focuses on better processing data to strengthen the ability of neural models. Graph learning, which operates on ubiquitous topological data, also plays an important role in the era of deep learning. In this survey, we comprehensively review graph learning approaches from the data-centric perspective, and aim to answer three crucial questions: (1) when to modify graph data, (2) what part of the graph data needs modification to unlock the potential of various graph models, and (3) how to safeguard graph models from problematic data influence. Accordingly, we propose a novel taxonomy based on the stages in the graph learning pipeline, and highlight the processing methods for different data structures in the graph data, i.e., topology, feature and label. Furthermore, we analyze some potential problems embedded in graph data and discuss how to solve them in a data-centric manner. 
Finally, we provide some promising future directions for data-centric graph learning.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"1-20"},"PeriodicalIF":7.5,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}