{"title":"FATS: Feature Distribution Analysis-Based Test Selection for Deep Learning Enhancement","authors":"Li Li;Chuanqi Tao;Hongjing Guo;Jingxuan Zhang;Xiaobing Sun","doi":"10.1109/TBDATA.2023.3334648","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3334648","url":null,"abstract":"Deep Learning has been applied to many applications across different domains. However, the distribution shift between test data and training data is a major factor impacting the quality of deep neural networks (DNNs). To address this issue, existing research mainly focuses on enhancing DNN models by retraining them with labeled test data. However, labeling test data is costly, which seriously reduces the efficiency of DNN testing. To solve this problem, test selection strategically selects a small set of tests to label. Unfortunately, existing test selection methods seldom account for data distribution shift. To fill this gap, this paper proposes a test selection approach named Feature Distribution Analysis-Based Test Selection (FATS). FATS analyzes the distributions of test data and training data and then adopts learning to rank (a kind of supervised machine learning for ranking tasks) to intelligently combine the analysis results for test selection. We conduct an empirical study on popular datasets and DNN models, and compare FATS with seven test selection methods. 
Experiment results show that FATS effectively alleviates the impact of distribution shifts and outperforms the compared methods, with average accuracy improvements of 19.6% to 69.7% for DNN model enhancement.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"132-145"},"PeriodicalIF":7.2,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140123522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
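The FATS record above hinges on ranking unlabeled test inputs by how far their features fall from the training distribution, then labeling only the top-ranked ones. A toy sketch of that idea (not the paper's actual algorithm, which combines several distribution analyses through learning to rank; the function name and the single Mahalanobis-distance criterion are assumptions made for this sketch):

```python
import numpy as np

def select_tests_by_shift(train_feats, test_feats, budget):
    """Rank test samples by squared Mahalanobis distance from the training
    feature distribution; return indices of the `budget` most-shifted ones."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])          # keep the covariance invertible
    inv_cov = np.linalg.inv(cov)
    diffs = test_feats - mu
    dists = np.einsum('ij,jk,ik->i', diffs, inv_cov, diffs)
    return np.argsort(dists)[::-1][:budget]     # most out-of-distribution first

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 4))     # in-distribution training features
in_dist = rng.normal(0.0, 1.0, size=(80, 4))
shifted = rng.normal(3.0, 1.0, size=(20, 4))    # distribution-shifted test inputs
test = np.vstack([in_dist, shifted])
picked = select_tests_by_shift(train, test, budget=20)
```

Under this criterion the shifted samples dominate the selected set, which is the behavior a shift-aware selector needs before spending any labeling budget.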
{"title":"Graph Structure Aware Contrastive Multi-View Clustering","authors":"Rui Chen;Yongqiang Tang;Xiangrui Cai;Xiaojie Yuan;Wenlong Feng;Wensheng Zhang","doi":"10.1109/TBDATA.2023.3334674","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3334674","url":null,"abstract":"Multi-view clustering has become a research hotspot in recent decades because of its effectiveness in heterogeneous data fusion. Although a large number of related studies have been developed, most of them concern only the characteristics of the data themselves and overlook the inherent connections among samples, hindering them from exploring the structural knowledge of graph space. Moreover, many current works tend to highlight the compactness of one cluster without taking the differences between clusters into account. To tackle these two drawbacks, in this article, we propose a graph structure aware contrastive multi-view clustering (namely, GCMC) approach. Specifically, we incorporate a well-designed graph autoencoder with a conventional multi-layer perceptron autoencoder to extract the structural and high-level representations of multi-view data, so that the underlying correlations among samples can be effectively exploited for model learning. Then the contrastive learning paradigm is performed on multiple pseudo-label distributions to ensure that positive pairs of pseudo-label representations share complementarity across views while the divergence between negative pairs is sufficiently large. This makes each semantic cluster more discriminative, i.e., jointly satisfying intra-cluster compactness and inter-cluster exclusiveness. 
Through comprehensive experiments on eight widely-known datasets, we show that the proposed approach outperforms state-of-the-art competitors.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"260-274"},"PeriodicalIF":7.2,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Don’t Be Misled by Emotion! Disentangle Emotions and Semantics for Cross-Language and Cross-Domain Rumor Detection","authors":"Yu Shi;Xi Zhang;Yuming Shang;Ning Yu","doi":"10.1109/TBDATA.2023.3334634","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3334634","url":null,"abstract":"Cross-language and cross-domain rumor detection is a crucial research topic for maintaining a healthy social media environment. Previous studies reveal that the emotions expressed in posts are important features for rumor detection. However, existing studies typically leverage the entangled representation of semantics and emotions, ignoring the fact that different languages and domains have different emotions toward rumors. Therefore, it inevitably leads to a biased adaptation of the features learned from the source to the target language and domain. To address this issue, this paper proposes a novel approach to adapt the knowledge obtained from the source to the target dataset by disentangling the emotional and semantic features of the datasets. Specifically, the proposed method mainly consists of three steps: (1) disentanglement, which encodes rumors into two separate semantic and emotional spaces to prevent emotional interference; (2) adaptation, merging semantics with the emotions from another language and domain for contrastive alignment to ensure effective adaptation; (3) joint training strategy, which enables the above two steps to work in synergy and mutually promote each other. 
Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art baselines.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"249-259"},"PeriodicalIF":7.2,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AMDECDA: Attention Mechanism Combined With Data Ensemble Strategy for Predicting CircRNA-Disease Association","authors":"Lei Wang;Leon Wong;Zhu-Hong You;De-Shuang Huang","doi":"10.1109/TBDATA.2023.3334673","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3334673","url":null,"abstract":"Accumulating evidence from recent research reveals that circRNA is tightly bound to human complex disease and plays an important regulatory role in disease progression. Identifying disease-associated circRNAs plays a key role in the research of disease pathogenesis. In this study, we propose a new model, AMDECDA, for predicting circRNA-disease associations (CDAs) by combining an attention mechanism and a data ensemble strategy. Firstly, we fuse heterogeneous information including circRNA Gaussian interaction profiles (GIP), disease semantics, and disease GIP, and then use the attention mechanism of the Graph Attention Network (GAT) to focus on the critical information in the data, reasonably allocate resources, and extract their essential features. Finally, the ensemble deep RVFL network (edRVFL) is utilized to quickly and accurately predict CDAs in the non-iterative manner of closed-form solutions. In the five-fold cross-validation experiment on the benchmark dataset, AMDECDA achieves an accuracy of 93.10%, a sensitivity of 97.56%, and an AUC of 0.9235. In comparison with previous models, AMDECDA is highly competitive. Furthermore, 26 of the top 30 unknown CDAs ranked by AMDECDA's predicted scores are confirmed by the related literature. 
These results indicate that AMDECDA can effectively predict latent CDAs and provide help for further biological wet-lab experiments.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"320-329"},"PeriodicalIF":7.5,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records","authors":"Wenrui Li;Xiaoyu Wang;Yuetian Sun;Snezana Milanovic;Mark Kon;Julio Enrique Castrillón-Candás","doi":"10.1109/TBDATA.2023.3328433","DOIUrl":"10.1109/TBDATA.2023.3328433","url":null,"abstract":"It has long been a recognized problem that many datasets contain significant levels of missing numerical data. Addressing this problem is a potentially critical prerequisite for applying machine learning methods to such datasets. However, this is a challenging task. In this article, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is <italic>exact</italic>, and is significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS) data records, Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. It has superior accuracy as compared with methods recommended in the recent report from HCUP. Benchmark tests show up to 75% reductions in error. 
Furthermore, the results are also superior to recent state-of-the-art methods such as discriminative deep learning.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"122-131"},"PeriodicalIF":7.2,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135261593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
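For Gaussian data, the BLUP named in the imputation record above is the conditional mean of the missing coordinates given the observed ones. A minimal single-level sketch of that predictor (the paper's contribution is a multi-level formulation for massive data, not shown here; the known mean and covariance below are assumptions of the toy setup):

```python
import numpy as np

def blup_impute(row, mu, cov):
    """Fill NaN entries of `row` with the Gaussian conditional mean
    mu_m + C_mo C_oo^{-1} (x_o - mu_o), which is the BLUP."""
    miss = np.isnan(row)
    obs = ~miss
    out = row.copy()
    if miss.any():
        C_oo = cov[np.ix_(obs, obs)]            # observed-observed covariance
        C_mo = cov[np.ix_(miss, obs)]           # missing-observed covariance
        out[miss] = mu[miss] + C_mo @ np.linalg.solve(C_oo, row[obs] - mu[obs])
    return out

mu = np.array([1.0, 2.0, 3.0])
cov = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
row = np.array([2.0, np.nan, 3.5])              # second coordinate is missing
filled = blup_impute(row, mu, cov)
```

Because the first coordinate is strongly correlated with the missing one and observed above its mean, the imputed value lands well above the unconditional mean of 2.0.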
Seungeun Oh;Hyelin Nam;Jihong Park;Praneeth Vepakomma;Ramesh Raskar;Mehdi Bennis;Seong-Lyun Kim
{"title":"Mix2SFL: Two-Way Mixup for Scalable, Accurate, and Communication-Efficient Split Federated Learning","authors":"Seungeun Oh;Hyelin Nam;Jihong Park;Praneeth Vepakomma;Ramesh Raskar;Mehdi Bennis;Seong-Lyun Kim","doi":"10.1109/TBDATA.2023.3328424","DOIUrl":"10.1109/TBDATA.2023.3328424","url":null,"abstract":"In recent years, split learning (SL) has emerged as a promising distributed learning framework that can utilize Big Data in parallel without privacy leakage while reducing client-side computing resources. In the initial implementation of SL, however, the server serves multiple clients sequentially, incurring high latency. Parallel implementation of SL can alleviate this latency problem, but existing Parallel SL algorithms compromise scalability due to a fundamental structural problem. To this end, our previous works have proposed two scalable Parallel SL algorithms, dubbed SGLR and LocFedMix-SL, by solving the aforementioned fundamental problem of the Parallel SL structure. In this article, we propose a novel Parallel SL framework, coined Mix2SFL, that improves both accuracy and communication efficiency while still ensuring scalability. Mix2SFL first supplies more samples to the server through a manifold mixup between the smashed data uploaded to the server, as in SmashMix of LocFedMix-SL, and then averages the split-layer gradient, as in GradMix of SGLR, followed by local model aggregation as in SFL. Numerical evaluation corroborates that Mix2SFL achieves improved performance in both accuracy and latency compared to the state-of-the-art SL algorithm with scalability guarantees. 
Moreover, its convergence speed and privacy guarantees are validated through the experimental results.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"238-248"},"PeriodicalIF":7.2,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10301639","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135318089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BL: An Efficient Index for Reachability Queries on Large Graphs","authors":"Changyong Yu;Tianmei Ren;Wenyu Li;Huimin Liu;Haitao Ma;Yuhai Zhao","doi":"10.1109/TBDATA.2023.3327215","DOIUrl":"10.1109/TBDATA.2023.3327215","url":null,"abstract":"Reachability queries have important applications in many fields such as social networks, the Semantic Web, and biological information networks. How to improve query efficiency on directed acyclic graphs (<italic>DAGs</italic>) has always been the main problem in reachability query research. Existing methods either cannot prune enough unreachable pairs or cannot perform well on both index size and query time. In this paper, we propose BL (<italic>Bridging Label</italic>), a general index framework for reachability queries in large graphs. First, we summarize the relationships between BL and existing label indices. Second, we propose a specific index, named minBL, which avoids redundant labels. Moreover, we propose TFD-minBL and CTFD-minBL, which generate minBL under the TFD-based permutation in a single pass and incrementally, respectively. Finally, we conduct extensive experiments on real and synthetic datasets. The experimental results show that our methods are much faster and incur less storage overhead than existing reachability query methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"108-121"},"PeriodicalIF":7.2,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135159141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
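Label-based reachability indices like the BL framework described above answer or prune many queries from precomputed per-node labels before any traversal. A minimal sketch of the general idea using a one-dimensional topological-rank label (this is not the paper's BL/minBL labeling, which prunes far more aggressively; it only illustrates label-based pruning on a DAG):

```python
def topo_rank(succ):
    """Assign each node of a DAG a rank consistent with edge direction
    (reverse DFS post-order is a valid topological order)."""
    visited, order = set(), []
    def dfs(u):
        visited.add(u)
        for v in succ.get(u, []):
            if v not in visited:
                dfs(v)
        order.append(u)
    for u in succ:
        if u not in visited:
            dfs(u)
    return {u: i for i, u in enumerate(reversed(order))}

def reachable(succ, rank, src, dst):
    """DFS that uses the rank label to prune: if rank[src] > rank[dst],
    src cannot possibly reach dst, so the branch is cut without searching."""
    if src == dst:
        return True
    if rank[src] > rank[dst]:
        return False
    return any(reachable(succ, rank, v, dst) for v in succ.get(src, []))

dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': [], 'e': ['c']}
rank = topo_rank(dag)
```

A query such as `reachable(dag, rank, 'b', 'c')` is answered negatively from the labels alone, with no graph search at all; richer labels shrink the searched fraction further.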
{"title":"Scalable Evidential K-Nearest Neighbor Classification on Big Data","authors":"Chaoyu Gong;Jim Demmel;Yang You","doi":"10.1109/TBDATA.2023.3327220","DOIUrl":"10.1109/TBDATA.2023.3327220","url":null,"abstract":"The <i>K</i>-Nearest Neighbor (K-NN) algorithm has garnered widespread utilization in real-world scenarios, due to its exceptional interpretability that other classification algorithms may not have. The evidential K-NN (EK-NN) algorithm builds upon the same nearest neighbor search procedure as K-NN, and provides more informative classification outcomes. However, EK-NN is not practical for Big Data because it is computationally complex. First, the search for the <i>K</i> nearest neighbors of test samples among <inline-formula><tex-math>$n$</tex-math></inline-formula> training samples requires <inline-formula><tex-math>$O(n^{2})$</tex-math></inline-formula> operations. Additionally, estimating parameters involves performing complicated matrix calculations that increase in scale as the dataset becomes larger. To address these issues, we propose two scalable EK-NN classifiers, Global Exact EK-NN and Local Approximate EK-NN, under the distributed Spark framework. Along with the Local Approximate EK-NN, a new distributed gradient descent algorithm is developed to learn parameters. Data parallelism is used to reduce negative impacts caused by data distribution differences. 
Experimental results show that our algorithms achieve state-of-the-art scaling efficiency and accuracy on large datasets with more than 10 million samples.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"226-237"},"PeriodicalIF":7.2,"publicationDate":"2023-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135158255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
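The evidential K-NN in the record above lets each neighbor contribute a basic mass assignment that is then combined by Dempster's rule, yielding per-class masses plus an explicit ignorance mass. A small single-machine sketch in the style of Denoeux-type EK-NN (not the paper's distributed Spark classifiers; `alpha` and `gamma` are illustrative constants, not learned parameters):

```python
import numpy as np

def combine(m1, m2, classes):
    """Dempster's rule for masses over singleton classes plus 'T' (ignorance)."""
    out = {c: 0.0 for c in classes}
    out['T'] = 0.0
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            if a == 'T' and b == 'T':
                out['T'] += ma * mb
            elif a == 'T':
                out[b] += ma * mb
            elif b == 'T':
                out[a] += ma * mb
            elif a == b:
                out[a] += ma * mb
            else:
                conflict += ma * mb              # disagreeing singletons
    return {k: v / (1.0 - conflict) for k, v in out.items()}

def eknn_predict(X_train, y_train, x, k=3, alpha=0.95, gamma=1.0):
    """Each of the k nearest neighbors assigns mass alpha*exp(-gamma*d^2) to
    its own class and the remainder to ignorance; masses are then combined."""
    d = np.linalg.norm(X_train - x, axis=1)
    classes = sorted(set(y_train.tolist()))
    m = {c: 0.0 for c in classes}
    m['T'] = 1.0                                 # start from total ignorance
    for i in np.argsort(d)[:k]:
        s = alpha * np.exp(-gamma * d[i] ** 2)
        mi = {c: 0.0 for c in classes}
        mi['T'] = 1.0 - s
        mi[int(y_train[i])] = s
        m = combine(m, mi, classes)
    return max(classes, key=lambda c: m[c]), m

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
y_train = np.array([0, 0, 1, 1])
label, masses = eknn_predict(X_train, y_train, np.array([0.2, 0.0]))
```

The mass left on `'T'` is what makes the output more informative than a plain K-NN vote: a query far from all neighbors keeps most of its mass on ignorance instead of being forced into a class.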
{"title":"Adaptively-Accelerated Parallel Stochastic Gradient Descent for High-Dimensional and Incomplete Data Representation Learning","authors":"Wen Qin;Xin Luo;MengChu Zhou","doi":"10.1109/TBDATA.2023.3326304","DOIUrl":"10.1109/TBDATA.2023.3326304","url":null,"abstract":"High-dimensional and incomplete (HDI) interactions among numerous nodes are commonly encountered in Big Data-related applications, like user-item interactions in a recommender system. Owing to its high efficiency and flexibility, a stochastic gradient descent (SGD) algorithm can enable efficient latent feature analysis (LFA) of HDI data for its precise representation, thereby enabling efficient solutions to knowledge acquisition issues like missing data estimation. However, LFA on HDI data involves a bilinear issue, making SGD-based LFA a sequential process, i.e., the update to one feature can impact the updates to the others, so altering the update sequence of SGD-based LFA on HDI data can affect the training results. Therefore, a parallel SGD algorithm for LFA should be designed with care. Existing parallel SGD-based LFA models suffer from a) a low parallelization degree, and b) slow convergence, which significantly restrict their scalability. To address these vital issues, this paper presents an <underline>A</underline>daptively-accelerated <underline>P</underline>arallel <underline>S</underline>tochastic <underline>G</underline>radient <underline>D</underline>escent (AP-SGD) algorithm for LFA by: a) establishing a novel local minimum-based data splitting and scheduling scheme to reduce the scheduling cost among threads, thereby achieving a high parallelization degree; and b) incorporating the adaptive momentum method into the learning scheme, thereby accelerating the convergence rate by making the learning rate and acceleration coefficient self-adaptive. The convergence of the resulting AP-SGD-based LFA model is theoretically proved. 
Experimental results on three HDI matrices generated by real industrial applications demonstrate that the AP-SGD-based LFA model outperforms state-of-the-art parallel SGD-based LFA models in both estimation accuracy for missing data and parallelization degree. Hence, it has the potential for efficient representation of HDI data in industrial scenarios.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"92-107"},"PeriodicalIF":7.2,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135107835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
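The sequential coupling the AP-SGD record describes is visible in even the plainest SGD-based LFA loop: each update reads latent factors written by earlier updates, so the processing order matters and naive parallelization conflicts. A minimal sequential sketch on a toy sparse matrix (illustrative only; the paper's contributions are the splitting/scheduling scheme and adaptive momentum, neither of which appears here):

```python
import numpy as np

def sgd_lfa(triples, n_rows, n_cols, rank=4, lr=0.03, reg=0.02, epochs=500):
    """Sequential SGD latent feature analysis of a sparse matrix given as
    (row, col, value) triples; returns latent factor matrices P and Q."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_rows, rank))
    Q = rng.normal(scale=0.1, size=(n_cols, rank))
    for _ in range(epochs):
        for u, i, r in triples:
            e = r - P[u] @ Q[i]                   # error on one observed entry
            P[u] += lr * (e * Q[i] - reg * P[u])
            Q[i] += lr * (e * P[u] - reg * Q[i])  # reads the just-updated P[u]:
    return P, Q                                   # updates are order-dependent

# Rank-1 ground truth; entry (2, 3) is held out and then estimated.
truth = np.outer([1.0, 2.0, 1.5], [1.0, 0.5, 2.0, 1.0])
triples = [(u, i, truth[u, i]) for u in range(3) for i in range(4)
           if (u, i) != (2, 3)]
P, Q = sgd_lfa(triples, n_rows=3, n_cols=4)
estimate = P[2] @ Q[3]                            # missing-data estimate
```

Two triples that touch the same row `u` or column `i` race on `P[u]` or `Q[i]`, which is exactly why a parallel variant needs a careful data-splitting and scheduling scheme.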
{"title":"A Federated Convolution Transformer for Fake News Detection","authors":"Youcef Djenouri;Ahmed Nabil Belbachir;Tomasz Michalak;Gautam Srivastava","doi":"10.1109/TBDATA.2023.3325746","DOIUrl":"10.1109/TBDATA.2023.3325746","url":null,"abstract":"We present a novel approach to detect fake news in Internet of Things (IoT) applications. By investigating federated learning and trusted authority methods, we address the issue of data security during training. Simultaneously, by investigating convolution transformers and user clustering, we deal with multi-modality issues in fake news data. First, we use dense embedding and the k-means algorithm to cluster users into groups that are similar to one another. We then develop a local model for each user using their local data. The server then receives the local models of users along with clustering information, and a trusted authority verifies their integrity there. We use two different types of aggregation in place of conventional federated learning systems. The initial step is to combine all users’ models to create a single global model. The second step entails compiling each user's model into a local model of comparable users. Both models are supplied to users, who then select the most suitable model for identifying fake news. 
By conducting extensive experiments using Twitter data, we demonstrate that the proposed method outperforms various baselines, achieving an average accuracy of 0.85 while the baselines do not exceed 0.81.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"214-225"},"PeriodicalIF":7.2,"publicationDate":"2023-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135008935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}