{"title":"Scalable Evidential K-Nearest Neighbor Classification on Big Data","authors":"Chaoyu Gong;Jim Demmel;Yang You","doi":"10.1109/TBDATA.2023.3327220","DOIUrl":"10.1109/TBDATA.2023.3327220","url":null,"abstract":"The K-Nearest Neighbor (K-NN) algorithm has garnered widespread utilization in real-world scenarios, due to its exceptional interpretability that other classification algorithms may not have. The evidential K-NN (EK-NN) algorithm builds upon the same nearest neighbor search procedure as K-NN, and provides more informative classification outcomes. However, EK-NN is not practical for Big Data because it is computationally complex. First, the search for K nearest neighbors of test samples from $n$ training samples requires $O(n^{2})$ operations. Additionally, estimating parameters involves performing complicated matrix calculations that increase in scale as the dataset becomes larger. To address these issues, we propose two scalable EK-NN classifiers, Global Exact EK-NN and Local Approximate EK-NN, under the distributed Spark framework. Along with the Local Approximate EK-NN, a new distributed gradient descent algorithm is developed to learn parameters. Data parallelism is used to reduce negative impacts caused by data distribution differences. 
Experimental results show that our algorithms are able to achieve state-of-the-art scaling efficiency and accuracy on large datasets with more than 10 million samples.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"226-237"},"PeriodicalIF":7.2,"publicationDate":"2023-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135158255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptively-Accelerated Parallel Stochastic Gradient Descent for High-Dimensional and Incomplete Data Representation Learning","authors":"Wen Qin;Xin Luo;MengChu Zhou","doi":"10.1109/TBDATA.2023.3326304","DOIUrl":"10.1109/TBDATA.2023.3326304","url":null,"abstract":"High-dimensional and incomplete (HDI) interactions among numerous nodes are commonly encountered in a Big Data-related application, like user-item interactions in a recommender system. Owing to its high efficiency and flexibility, a stochastic gradient descent (SGD) algorithm can enable efficient latent feature analysis (LFA) of HDI data for its precise representation, thereby enabling efficient solutions to knowledge acquisition issues like missing data estimation. However, LFA on HDI data involves a bilinear issue, making SGD-based LFA a sequential process, i.e., the update on a feature can impact the results on the others. Intervening in the sequence of SGD-based LFA on HDI data can affect the training results. Therefore, a parallel SGD algorithm for LFA should be designed with care. Existing parallel SGD-based LFA models suffer from a) low parallelization degree, and b) slow convergence, which significantly restrict their scalability. Aiming at addressing these vital issues, this paper presents an Adaptively-accelerated Parallel Stochastic Gradient Descent (AP-SGD) algorithm for LFA by: a) establishing a novel local minimum-based data splitting and scheduling scheme to reduce the scheduling cost among threads, thereby achieving high parallelization degree; and b) incorporating the adaptive momentum method into the learning scheme, thereby accelerating the convergence rate by making the learning rate and acceleration coefficient self-adaptive. The convergence of the achieved AP-SGD-based LFA model is theoretically proved. 
Experimental results on three HDI matrices generated by real industrial applications demonstrate that the AP-SGD-based LFA model outperforms state-of-the-art parallel SGD-based LFA models in both estimation accuracy for missing data and parallelization degree. Hence, it has the potential for efficient representation of HDI data in industrial scenes.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"92-107"},"PeriodicalIF":7.2,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135107835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Federated Convolution Transformer for Fake News Detection","authors":"Youcef Djenouri;Ahmed Nabil Belbachir;Tomasz Michalak;Gautam Srivastava","doi":"10.1109/TBDATA.2023.3325746","DOIUrl":"10.1109/TBDATA.2023.3325746","url":null,"abstract":"We present a novel approach to detect fake news in Internet of Things (IoT) applications. By investigating federated learning and trusted authority methods, we address the issue of data security during training. Simultaneously, by investigating convolution transformers and user clustering, we deal with multi-modality issues in fake news data. First, we use dense embedding and the k-means algorithm to cluster users into groups that are similar to one another. We then develop a local model for each user using their local data. The server then receives the local models of users along with clustering information, and a trusted authority verifies their integrity there. We use two different types of aggregation in place of conventional federated learning systems. The initial step is to combine all users’ models to create a single global model. The second step entails compiling each user's model into a local model of comparable users. Both models are supplied to users, who then select the most suitable model for identifying fake news. 
By conducting extensive experiments using Twitter data, we demonstrate that the proposed method outperforms various baselines, where it achieves an average accuracy of 0.85 in comparison to others that do not exceed 0.81.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"214-225"},"PeriodicalIF":7.2,"publicationDate":"2023-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135008935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Scalable Multi-View Clustering via Joint Learning of Many Bipartite Graphs","authors":"Jinghuan Lao;Dong Huang;Chang-Dong Wang;Jian-Huang Lai","doi":"10.1109/TBDATA.2023.3325045","DOIUrl":"10.1109/TBDATA.2023.3325045","url":null,"abstract":"This paper focuses on two limitations to previous multi-view clustering approaches. First, they frequently suffer from quadratic or cubic computational complexity, which restricts their feasibility for large-scale datasets. Second, they often rely on a single graph on each view, yet lack the ability to jointly explore many versatile graph structures for enhanced multi-view information exploration. In light of this, this paper presents a new Scalable Multi-view Clustering via Many Bipartite graphs (SMCMB) approach, which is capable of jointly learning and fusing many bipartite graphs from multiple views while maintaining high efficiency for very large-scale datasets. Different from the one-anchor-set-per-view paradigm, we first produce multiple diversified anchor sets on each view and thus obtain many anchor sets on multiple views, based on which the anchor-based subspace representation learning is enforced and many bipartite graphs are simultaneously learned. Then these bipartite graphs are efficiently partitioned to produce the base clusterings, which are further re-formulated into a unified bipartite graph for the final clustering. Note that SMCMB has almost linear time and space complexity. 
Extensive experiments on twenty general-scale and large-scale multi-view datasets confirm its superiority in scalability and robustness over the state-of-the-art.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"77-91"},"PeriodicalIF":7.2,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136372168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"eBoF: Interactive Temporal Correlation Analysis for Ensemble Data Based on Bag-of-Features","authors":"Zhifei Ding;Jiahao Han;Rongtao Qian;Liming Shen;Siru Chen;Lingxin Yu;Yu Zhu;Richen Liu","doi":"10.1109/TBDATA.2023.3324482","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3324482","url":null,"abstract":"We propose eBoF, a novel time-varying ensemble data visualization approach based on the Bag-of-Features (BoF) model. In the eBoF model, we extract a simple and monotone interval from all target variables of ensemble scalar data as a local feature patch. Each local feature of a semantically simple single interval can be defined as a feature patch within the BoF model, with the duration of each interval (i.e., feature patch) serving as its frequency. Feature clusters in ensemble runs are then identified based on the similarity of temporal correlations. eBoF generates clusters along with their probability distributions across all feature patches while preserving the geo-spatial information, which is often lost in traditional topic modeling or clustering algorithms. The probability distribution across different clusters can help to generate reasonable clustering results, evaluated by domain knowledge. We conduct case studies and performance tests to evaluate the eBoF model and gather feedback from domain experts to further refine it. 
Evaluation results suggest the proposed eBoF can provide insightful and comprehensive evidence on ensemble simulation data analysis.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 6","pages":"1726-1737"},"PeriodicalIF":7.2,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138138250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Label-Weighted Graph-Based Learning for Semi-Supervised Classification Under Label Noise","authors":"Naiyao Liang;Zuyuan Yang;Junhang Chen;Zhenni Li;Shengli Xie","doi":"10.1109/TBDATA.2023.3319249","DOIUrl":"10.1109/TBDATA.2023.3319249","url":null,"abstract":"Graph-based semi-supervised learning (GSSL) is a quite important technology due to its effectiveness in practice. Existing GSSL works often treat the given labels equally and ignore the unbalanced importance of labels. In some inaccurate systems, the collected labels usually contain noise (noisy labels) and the methods treating labels equally suffer from the label noise. In this article, we propose a novel label-weighted learning method on graphs for semi-supervised classification under label noise, which allows considering the contribution differences of labels. In particular, the label dependency of data is revealed by graph constraints. With the help of this label dependency, the proposed method develops the strategy of adaptive label weight, where label weights are assigned to labels adaptively. Accordingly, an efficient algorithm is developed to solve the proposed optimization objective, where each subproblem has a closed-form solution. Experimental results on a synthetic dataset and several real-world datasets show the advantage of the proposed method, compared to the state-of-the-art methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"55-65"},"PeriodicalIF":7.2,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135793726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Legal Transition Sequence Recognition of a Bounded Petri Net Using a Gate Recurrent Unit","authors":"Qingtian Zeng;Shuai Guo;Rui Cao;Ziqi Zhao;Hua Duan","doi":"10.1109/TBDATA.2023.3319252","DOIUrl":"10.1109/TBDATA.2023.3319252","url":null,"abstract":"The Gate Recurrent Unit (GRU) has seen little application to legal transition sequences of bounded Petri nets. A GRU-based method is proposed for the recognition of bounded Petri net legal transition sequences. First, in a Petri net, legal and non-legal transition sequences are generated according to a certain noise ratio. Then, the legal and non-legal transition sequences are inputted into GRU to recognize the legal transition sequences by encoding the maximum variation sequence length with a uniform length. The proposed method is validated with different Petri nets at different noise ratios and compared with seven widely-known baselines. The results show that the proposed method achieves excellent recognition accuracy and robustness in most situations. This solves the problem that existing methods cannot recognize the legal transition sequences of Petri nets in real time.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"66-76"},"PeriodicalIF":7.2,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135793555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Aspect Neural Tensor Factorization Framework for Patent Litigation Prediction","authors":"Han Wu;Guanqi Zhu;Qi Liu;Hengshu Zhu;Hao Wang;Hongke Zhao;Chuanren Liu;Enhong Chen;Hui Xiong","doi":"10.1109/TBDATA.2023.3313030","DOIUrl":"10.1109/TBDATA.2023.3313030","url":null,"abstract":"Patent litigation is an expensive and time-consuming legal process. To reduce costs, companies can proactively manage patents using predictive analysis to identify potential plaintiffs, defendants, and patents that may lead to litigation. However, there has been limited progress in predicting patent litigation due to the scarcity of lawsuits, the complexities of intentions, and the diversity of litigation characteristics. To this end, in this paper, we summarize the major causes of patent litigation into multiple aspects: the complex relations among plaintiffs, defendants and patents as well as the diverse content information from them. Along this line, we propose a Multi-aspect Neural Tensor Factorization (MANTF) framework for patent litigation prediction. First, a Pair-wise Tensor Factorization (PTF) module is designed to capture the complex relations among plaintiffs, defendants and patents inherent in a three-dimensional tensor, which will produce factorized latent vectors for companies and patents with pair-wise ranking estimators. Then, to better represent the patents and companies as an aid for PTF, we design a Patent Embedding Network (PEN) module and a Mask Company Embedding Network (MCEN) module to generate content-aware embedding for them, where PEN represents patents based on their meta, textual and graphical features, and MCEN represents companies by integrating their intrinsic features and competitions. Next, to integrate these three modules together, we leverage a Gaussian prior on the difference between factorized representations and content-aware embedding, and train MANTF in an end-to-end way. 
In the end, final predictions for patent litigation, i.e., the potentially litigated plaintiffs, defendants and patents, can be made with the well-trained model. We conduct extensive experiments on two real-world datasets, whose results prove that MANTF not only helps predict potential patent litigation but also shows robustness under various data sparse situations.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"35-54"},"PeriodicalIF":7.2,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135597577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial-Temporal Contrasting for Fine-Grained Urban Flow Inference","authors":"Xovee Xu;Zhiyuan Wang;Qiang Gao;Ting Zhong;Bei Hui;Fan Zhou;Goce Trajcevski","doi":"10.1109/TBDATA.2023.3316471","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3316471","url":null,"abstract":"The fine-grained urban flow inference (FUFI) problem aims to infer the fine-grained flow maps from coarse-grained ones, benefiting various smart-city applications by reducing electricity, maintenance, and operation costs. Existing models use techniques from image super-resolution and achieve good performance in FUFI. However, they often rely on supervised learning with a large amount of training data, and often lack generalization capability and face overfitting. We present a new solution: Spatial-Temporal Contrasting for Fine-Grained Urban Flow Inference (STCF). It consists of (i) two pre-training networks for spatial-temporal contrasting between flow maps; and (ii) one coupled fine-tuning network for fusing learned features. By attracting spatial-temporally similar flow maps while distancing dissimilar ones within the representation space, STCF enhances efficiency and performance. 
Comprehensive experiments on two large-scale, real-world urban flow datasets reveal that STCF reduces inference error by up to 13.5%, requiring significantly fewer data and model parameters than prior arts.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 6","pages":"1711-1725"},"PeriodicalIF":7.2,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138138249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PHAED: A Speaker-Aware Parallel Hierarchical Attentive Encoder-Decoder Model for Multi-Turn Dialogue Generation","authors":"Zihao Wang;Ming Jiang;Junli Wang","doi":"10.1109/TBDATA.2023.3316472","DOIUrl":"10.1109/TBDATA.2023.3316472","url":null,"abstract":"This article presents a novel open-domain dialogue generation model emphasizing the differentiation of speakers in multi-turn conversations. Differing from prior work that treats the conversation history as a long text, we argue that capturing relative social relations among utterances (i.e., generated by either the same speaker or different persons) helps the machine capture fine-grained context information from a conversation history to improve context coherence in the generated response. Given that, we propose a Parallel Hierarchical Attentive Encoder-Decoder (PHAED) model that can effectively leverage conversation history by modeling each utterance with the awareness of its speaker and contextual associations with the same speaker's previous messages. Specifically, to distinguish the speaker roles over a multi-turn conversation (involving two speakers), we regard the utterances from one speaker as responses and those from the other as queries. After understanding queries via a hierarchical encoder with inner-query and inter-query encodings, a Transformer-XL style decoder reuses the hidden states of previously generated responses to generate a new response. Our empirical results with three large-scale benchmarks show that PHAED significantly outperforms baseline models on both automatic and human evaluations. 
Furthermore, our ablation study shows that dialogue models with speaker tokens can generally decrease the possibility of generating non-coherent responses.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 1","pages":"23-34"},"PeriodicalIF":7.2,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135501710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}