Journal of Big Data: Latest Articles

The adaptive community-response (ACR) method for collecting misinformation on social media
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-24 DOI: 10.1186/s40537-024-00894-w
Julian Kauk, Helene Kreysa, André Scherag, Stefan R. Schweinberger
Abstract: Social media can be a major accelerator of the spread of misinformation, thereby potentially compromising both individual well-being and social cohesion. Despite significant recent advances, the study of online misinformation is a relatively young field facing several (methodological) challenges. In this regard, the detection of online misinformation has proven difficult, as online large-scale data streams require (semi-)automated, highly specific and therefore sophisticated methods to separate posts containing misinformation from irrelevant posts. In the present paper, we introduce the adaptive community-response (ACR) method, an unsupervised technique for the large-scale collection of misinformation on Twitter (now known as ’X’). The ACR method is based on previous findings showing that Twitter users occasionally reply to misinformation with fact-checking by referring to specific fact-checking sites (crowdsourced fact-checking). In a first step, we captured such misinforming but fact-checked tweets. These tweets were used in a second step to extract specific linguistic features (keywords), enabling us to collect also those misinforming tweets that were not fact-checked at all as a third step. We initially present a mathematical framework of our method, followed by an explicit algorithmic implementation. We then evaluate ACR on the basis of a comprehensive dataset consisting of \(>25\) million tweets, belonging to \(>300\) misinforming stories. Our evaluation shows that ACR is a useful extension to the methods pool of the field, enabling researchers to collect online misinformation more comprehensively. Text similarity measures clearly indicated correspondence between the claims of false stories and the ACR tweets, even though ACR performance was heterogeneously distributed across the stories. A baseline comparison to the fact-checked tweets showed that the ACR method can detect story-related tweets to a comparable degree, while being sensitive to different types of tweets: Fact-checked tweets tend to be driven by high outreach (as indicated by a high number of retweets), whereas the sensitivity of the ACR method extends to tweets exhibiting lower outreach. Taken together, ACR’s capacity as a valuable methodological contribution to the field is based on (i) the adoption of prior, pioneering research in the field, (ii) a well-formalized mathematical framework and (iii) an empirical foundation via a comprehensive set of indicators.
Citations: 0
Optimizing IoT intrusion detection system: feature selection versus feature extraction in machine learning
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-24 DOI: 10.1186/s40537-024-00892-y
Jing Li, Mohd Shahizan Othman, Hewan Chen, Lizawati Mi Yusuf
Abstract: Internet of Things (IoT) devices are widely used but also vulnerable to cyberattacks that can cause security issues. To protect against this, machine learning approaches have been developed for network intrusion detection in IoT. These often use feature reduction techniques like feature selection or extraction before feeding data to models. This helps make detection efficient for real-time needs. This paper thoroughly compares feature extraction and selection for IoT network intrusion detection in machine learning-based attack classification framework. It looks at performance metrics like accuracy, f1-score, and runtime, etc. on the heterogenous IoT dataset named Network TON-IoT using binary and multiclass classification. Overall, feature extraction gives better detection performance than feature selection as the number of features is small. Moreover, extraction shows less feature reduction compared with that of selection, and is less sensitive to changes in the number of features. However, feature selection achieves less model training and inference time compared with its counterpart. Also, more space to improve the accuracy for selection than extraction when the number of features changes. This holds for both binary and multiclass classification. The study provides guidelines for selecting appropriate intrusion detection methods for particular scenarios. Before, the TON-IoT heterogeneous IoT dataset comparison and recommendations were overlooked. Overall, the research presents a thorough comparison of feature reduction techniques for machine learning-driven intrusion detection in IoT networks.
Citations: 0
Measuring regularity of human physical activities with entropy models
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-24 DOI: 10.1186/s40537-024-00891-z
Keqin Shi, Zhen Chen, Weiqiang Sun, Weisheng Hu
Abstract: Regularity is an important aspect of physical activity that can provide valuable insights into how individuals engage in physical activity over time. Accurate measurement of regularity not only advances our understanding of physical activity behavior but also facilitates the development of human activity modeling and forecasting. Furthermore, it can inform the design and implementation of tailored interventions to improve population health outcomes. In this paper, we aim to assess the regularity of physical activities through longitudinal sensor data, which reflects individuals’ all physical activities over an extended period. We explore three entropy models, including entropy rate, approximate entropy, and sample entropy, which can potentially offer a more comprehensive evaluation of physical activity regularity compared to metrics based solely on periodicity or stability. We propose a framework to validate the performance of entropy models on both synthesized and real-world physical activity data. The results indicate entropy rate is able to identify not only the magnitude and amount of noise but also macroscopic variations of physical activities, such as differences on duration and occurrence time. Simultaneously, entropy rate is highly correlated with the predictability of real-world samples, further highlighting its applicability in measuring human physical activity regularity. Leveraging entropy rate, we further investigate the regularity for 686 individuals. We find the composition of physical activities can partially explain the difference in regularity among individuals, and the majority of individuals exhibit temporal stability of regularity.
Citations: 0
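To make the idea of an entropy-based regularity score concrete, here is a minimal sketch of sample entropy, one of the three models the abstract names. It is not the authors' implementation: the function name `sample_entropy`, the defaults `m=2` and `r=0.2`, and the use of `r` as an absolute tolerance (rather than a fraction of the standard deviation, as is common in practice) are assumptions made for illustration.

```python
import math
import random


def sample_entropy(series, m=2, r=0.2):
    """Sample entropy SampEn(m, r) of a 1-D sequence.

    Assumption: r is an absolute tolerance on the Chebyshev distance
    between templates; many applications scale it by the series' std.
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    def count_matches(length):
        # Use the same n - m starting points for both template lengths
        # and never compare a template with itself.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(series[i + k] - series[j + k]) <= r
                       for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)       # matching pairs of length-m templates
    a = count_matches(m + 1)   # matching pairs of length-(m + 1) templates
    if a == 0 or b == 0:
        return math.inf        # no matches: entropy is undefined / unbounded
    return math.log(b / a)     # equals -ln(a / b)


# A strictly periodic signal is perfectly regular.
periodic = [0, 1] * 50
# A seeded pseudo-random signal is far less regular.
rng = random.Random(0)
noisy = [rng.random() for _ in range(200)]

print(sample_entropy(periodic))
print(sample_entropy(noisy))
```

Lower values indicate a more regular series. The quadratic pair loop is kept for readability and would be too slow for long sensor recordings.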
Data-driven multinomial random forest: a new random forest variant with strong consistency
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-23 DOI: 10.1186/s40537-023-00874-6
JunHao Chen, XueLi Wang, Fei Lei
Abstract: In this paper, we modify the proof methods of some previously weakly consistent variants of random forest into strongly consistent proof methods, and improve the data utilization of these variants in order to obtain better theoretical properties and experimental performance. In addition, we propose the Data-driven Multinomial Random Forest (DMRF) algorithm, which has the same complexity with BreimanRF (proposed by Breiman) while satisfying strong consistency with probability 1. It has better performance in classification and regression tasks than previous RF variants that only satisfy weak consistency, and in most cases even surpasses BreimanRF in classification tasks. To the best of our knowledge, DMRF is currently a low-complexity and high-performing variation of random forest that achieves strong consistency with probability 1.
Citations: 0
Machine learning-based network intrusion detection for big and imbalanced data using oversampling, stacking feature embedding and feature extraction
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-22 DOI: 10.1186/s40537-024-00886-w
Md. Alamin Talukder, Md. Manowarul Islam, Md Ashraf Uddin, Khondokar Fida Hasan, Selina Sharmin, Salem A. Alyami, Mohammad Ali Moni
Abstract: Cybersecurity has emerged as a critical global concern. Intrusion Detection Systems (IDS) play a critical role in protecting interconnected networks by detecting malicious actors and activities. Machine Learning (ML)-based behavior analysis within the IDS has considerable potential for detecting dynamic cyber threats, identifying abnormalities, and identifying malicious conduct within the network. However, as the number of data grows, dimension reduction becomes an increasingly difficult task when training ML models. Addressing this, our paper introduces a novel ML-based network intrusion detection model that uses Random Oversampling (RO) to address data imbalance and Stacking Feature Embedding based on clustering results, as well as Principal Component Analysis (PCA) for dimension reduction and is specifically designed for large and imbalanced datasets. This model’s performance is carefully evaluated using three cutting-edge benchmark datasets: UNSW-NB15, CIC-IDS-2017, and CIC-IDS-2018. On the UNSW-NB15 dataset, our trials show that the RF and ET models achieve accuracy rates of 99.59% and 99.95%, respectively. Furthermore, using the CIC-IDS2017 dataset, DT, RF, and ET models reach 99.99% accuracy, while DT and RF models obtain 99.94% accuracy on CIC-IDS2018. These performance results continuously outperform the state-of-art, indicating significant progress in the field of network intrusion detection. This achievement demonstrates the efficacy of the suggested methodology, which can be used practically to accurately monitor and identify network traffic intrusions, thereby blocking possible threats.
Citations: 0
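The Random Oversampling (RO) step mentioned in the abstract can be illustrated with a small sketch. This is a generic version of the technique, not the paper's pipeline: the function name `random_oversample`, the `seed` parameter, and the toy rows and labels are all hypothetical, and the paper's actual settings are not given in the abstract.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class.

    Illustrative only; in practice this is applied to the training split
    so that duplicated rows do not leak into evaluation data.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # All original rows that belong to this class.
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four "normal" flows and a single "attack" flow.
rows = [[0.1], [0.2], [0.3], [0.4], [9.0]]
labels = ["normal", "normal", "normal", "normal", "attack"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Because oversampling only repeats existing rows, it changes the class balance without adding new information, which is why it is usually combined with further feature processing such as the embedding and PCA steps the abstract describes.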
Comprehensive study of driver behavior monitoring systems using computer vision and machine learning techniques
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-22 DOI: 10.1186/s40537-024-00890-0
Fangming Qu, Nolan Dang, Borko Furht, Mehrdad Nojoumian
Abstract: The flourishing realm of advanced driver-assistance systems (ADAS) as well as autonomous vehicles (AVs) presents exceptional opportunities to enhance safe driving. An essential aspect of this transformation involves monitoring driver behavior through observable physiological indicators, including the driver’s facial expressions, hand placement on the wheels, and the driver’s body postures. An artificial intelligence (AI) system under consideration alerts drivers about potentially unsafe behaviors using real-time voice notifications. This paper offers an all-embracing survey of neural network-based methodologies for studying these driver bio-metrics, presenting an exhaustive examination of their advantages and drawbacks. The evaluation includes two relevant datasets, separately categorizing ten different in-cabinet behaviors, providing a systematic classification for driver behaviors detection. The ultimate aim is to inform the development of driver behavior monitoring systems. This survey is a valuable guide for those dedicated to enhancing vehicle safety and preventing accidents caused by careless driving. The paper’s structure encompasses sections on autonomous vehicles, neural networks, driver behavior analysis methods, dataset utilization, and final findings and future suggestions, ensuring accessibility for audiences with diverse levels of understanding regarding the subject matter.
Citations: 0
Algorithms of the Möbius function by random forests and neural networks
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-21 DOI: 10.1186/s40537-024-00889-7
Huan Qin, Yangbo Ye
Abstract: The Möbius function \(\mu(n)\) is known for containing limited information on the prime factorization of \(n\). Its known algorithms, however, are all based on factorization and hence are exponentially slow on \(\log n\). Consequently, a faster algorithm of \(\mu(n)\) could potentially lead to a fast algorithm of prime factorization which in turn would throw doubt upon the security of most public-key cryptosystems. This research introduces novel approaches to compute \(\mu(n)\) using random forests and neural networks, harnessing the additive properties of \(\mu(n)\). The machine learning models are trained on a substantial dataset with 317,284 observations (80%), comprising five feature variables, including values of \(n\) within the range of \(4\times 10^9\). We implement the Random Forest with Random Inputs (RFRI) and Feedforward Neural Network (FNN) architectures. The RFRI model achieves a predictive accuracy of 0.9493, a recall of 0.5865, and a precision of 0.6626. On the other hand, the FNN model attains a predictive accuracy of 0.7871, a recall of 0.9477, and a precision of 0.2784. These results strongly support the effectiveness and validity of the proposed algorithms.
Citations: 0
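For reference, the factorization-based baseline that the abstract contrasts with its learned models can be sketched in a few lines. This is the textbook trial-division approach, not the paper's random forest or neural network method; the function name `mobius` is chosen here for illustration.

```python
def mobius(n):
    """Möbius function via trial division.

    Returns 0 if n has a squared prime factor, otherwise (-1)**k where
    k is the number of distinct prime factors of n (so mobius(1) == 1).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # p**2 divides the original n
            result = -result      # one more distinct prime factor
        p += 1
    if n > 1:
        result = -result          # the remaining cofactor is a single prime
    return result


print([mobius(k) for k in range(1, 11)])
```

The loop runs up to about the square root of n, which is exponential in the number of digits of n. That is the sense in which factorization-based methods are slow in \(\log n\).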
Can we predict multi-party elections with Google Trends data? Evidence across elections, data windows, and model classes
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-17 DOI: 10.1186/s40537-023-00868-4
Jan Behnert, Dean Lajic, Paul C. Bauer
Abstract: Google trends (GT), a service aggregating search queries on Google, has been used to predict various outcomes such as the spread of influenza, automobile sales, unemployment claims, and travel destination planning [1, 2]. Social scientists also used GT to predict elections and referendums across different countries and time periods, sometimes with more, sometimes with less success. We provide unique evidence on the predictive power of GT in the German multi-party systems, forecasting four elections (2009, 2013, 2017, 2021). Thereby, we make several contributions: First, we present one of the first attempts to predict a multi-party election using GT and highlight the specific challenges that originate from this setting. In doing so, we also provide a comprehensive and systematic overview of prior research. Second, we develop a framework that allows for fine-grained variation of the GT data window both in terms of its width and distance to the election. Subsequently, we test the predictive accuracy of several thousand models resulting from those fine-grained specifications. Third, we compare the predictive power of different model classes that are purely GT data based but also incorporate polling data as well as previous elections. Finally, we provide a systematic overview of the challenges one faces in using GT data for predictions part of which have been neglected in prior research.
Citations: 0
Deep learning enables the quantification of browning capacity of human adipose samples
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-11 DOI: 10.1186/s40537-024-00879-9
Yuxin Wang, Shiman Zuo, Nanfei Yang, Ani Jian, Wei Zheng, Zichun Hua, Pingping Shen
Abstract:
Background: The recruitment of thermogenic adipocytes in human fat depots markedly improves metabolic disorders such as type 2 diabetes mellitus (T2DM). However, identification and quantification of thermogenic cells in human fats, especially in metabolic disorders patients, remains a major challenge. Here, we aim to provide a stringent validation of human thermogenic adipocyte signature genes, and construct transcriptome-based models to quantify the browning degree of human fats.
Methods: Evidence from RNA-seq, microarray analyses and experimental approaches were integrated to isolate robust human brown-like fat signature genes. Meta-analysis was employed to validate the performance of known human brown-like fat marker genes. Autoencoder was used to reveal the browning levels of human adipose samples for supervised machine learning. Ensemble machine learning was applied to devised molecular metrics for quantifying browning degree of human fats. Obesity and T2DM datasets were used to validate the performance of the molecular metrics in adipose-related metabolic disorders.
Results: Human brown-like adipocytes were heterogeneous populations which showed distinct transcriptional patterns and biological features. Only DHRS11, REEP6 and STX11 were robust signature genes that were consistently up-regulated in different human brown-like fats, especially in creatine-induced UCP1-independent adipocytes. The molecular metrices based on the expression patterns of the three signature genes, named human browning capacity index (HBI) and absolute HBI (absHBI), were superior to 26 traditional human brown-like fat marker genes and previously reported browning classifier in prediction of browning levels of human adipocytes and adipose tissues as well as primary cell cultures upon various physiological and pharmacological stimuli. Notably, these molecular metrics also reflected the insulin sensitivity and glucose-lipid metabolic activity of human adipose samples from obesity and T2DM patients.
Conclusions: In summary, this study provides promising signatures and computational tools for evaluating browning levels of human adipose samples in response to physiological and medical intervention. The metrices construction pipeline provides an alternative approach for training machine learning models using unlabeled samples.
Citations: 0
Federated Freeze BERT for text classification
IF 8.1, Zone 2 (Computer Science)
Journal of Big Data Pub Date : 2024-02-09 DOI: 10.1186/s40537-024-00885-x
Omar Galal, Ahmed H. Abdel-Gawad, Mona Farouk
Abstract: Pre-trained BERT models have demonstrated exceptional performance in the context of text classification tasks. Certain problem domains necessitate data distribution without data sharing. Federated Learning (FL) allows multiple clients to collectively train a global model by sharing learned models rather than raw data. However, the adoption of BERT, a large model, within a Federated Learning framework incurs substantial communication costs. To address this challenge, we propose a novel framework, FedFreezeBERT, for BERT-based text classification. FedFreezeBERT works by adding an aggregation architecture on top of BERT to obtain better sentence embedding for classification while freezing BERT parameters. Keeping the model parameters frozen, FedFreezeBERT reduces the communication costs by a large factor compared to other state-of-the-art methods. FedFreezeBERT is implemented in a distributed version where the aggregation architecture only is being transferred and aggregated by FL algorithms such as FedAvg or FedProx. FedFreezeBERT is also implemented in a centralized version where the data embeddings extracted by BERT are sent to the central server to train the aggregation architecture. The experiments show that FedFreezeBERT achieves new state-of-the-art performance on Arabic sentiment analysis on the ArSarcasm-v2 dataset with a 12.9% and 1.2% improvement over FedAvg/FedProx and the previous SOTA respectively. FedFreezeBERT also reduces the communication cost by 5\(\times\) compared to the previous SOTA.
Citations: 0
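The FedAvg aggregation named in the abstract can be sketched as a weighted average of client parameters. This is a simplified illustration, not the FedFreezeBERT code: the function name `fed_avg` is hypothetical, and a flat list of floats stands in for the parameters of the small aggregation architecture that each client trains on top of a frozen BERT.

```python
def fed_avg(client_weights, client_sizes):
    """FedAvg-style aggregation: average each parameter across clients,
    weighting every client by its number of local training examples.

    Assumption: each client's trainable parameters are flattened into a
    list of floats of equal length. Frozen parameters are never sent,
    so they do not appear here at all.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one non-empty weight vector per client size")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client sizes must sum to a positive number")

    dim = len(client_weights[0])
    return [
        sum(weights[i] * size
            for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients; the second holds three times as much data as the first.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Only the vectors passed to the aggregation function travel between clients and server, so shrinking them to the aggregation architecture is what lowers the communication cost relative to exchanging a full BERT model.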