{"title":"Core Maintenance on Dynamic Graphs: A Distributed Approach Built on H-Index","authors":"Qiang-Sheng Hua;Hongen Wang;Hai Jin;Xuanhua Shi","doi":"10.1109/TBDATA.2024.3352973","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352973","url":null,"abstract":"Core number is an essential tool for analyzing graph structure. Graphs in the real world are typically large and dynamic, requiring the development of distributed algorithms to refrain from expensive I/O operations and the maintenance algorithms to address dynamism. Core maintenance updates the core number of each vertex upon the insertion/deletion of vertices/edges. Although the state-of-the-art distributed maintenance algorithm (Weng et al. 2022) can handle multiple edge insertions/deletions simultaneously, it still has two aspects to improve. (I) Parallel processing is not allowed when inserting/removing edges with the same core number, reducing the degree of parallelism and raising the number of rounds. (II) During the implementation phase, only one thread is assigned to the vertices with the same core number, leading to the inability to fully utilize the distributed computing power. Furthermore, the h-index (Lü et al. 2016) based distributed core decomposition algorithm (Montresor et al. 2013) can fully utilize the distributed computing power where all vertices can be processed in parallel. However, it requires all vertices to recompute their core numbers upon graph changes. In this article, we propose a distributed core maintenance algorithm based on h-index, which circumvents the issues of algorithm (Weng et al. 2022). In addition, our algorithm avoids core numbers recalculation where the numbers do not change. In comparison to the state-of-the-art distributed maintenance algorithm (Weng et al. 2022), the time speedup ratio is at least 100 in the scenarios of both insertion and deletion. Compared to the distributed core decomposition algorithm (Montresor et al. 
2013), the average time speedup ratios are 2 and 8 for the cases of insertion and deletion, respectively.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"595-608"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10388383","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Heterogeneous Streaming Feature Selection Without Feature Type Information","authors":"Peng Zhou;Yunyun Zhang;Zhaolong Ling;Yuanting Yan;Shu Zhao;Xindong Wu","doi":"10.1109/TBDATA.2024.3350630","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350630","url":null,"abstract":"Feature selection aims to select an optimal minimal feature subset from the original datasets and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of Big Data. However, features may be generated dynamically and arrive individually over time in practice, which we call streaming features. Most existing streaming feature selection methods assume that all dynamically generated features are the same type or assume we can know the feature type for each new arriving feature in advance, but this is unreasonable and unrealistic. Therefore, this paper first studies a practical issue of Online Heterogeneous Streaming Feature Selection without the feature type information before learning, named OHSFS. Specifically, we first model the streaming feature selection issue as a minimax problem. Then, in terms of MIC (Maximal Information Coefficient), we derive a new metric $MIC_{Gain}$ to determine whether a new streaming feature should be selected. To speed up the efficiency of OHSFS, we present the metric $MIC_{Cor}$ that can directly discard low correlation features. Finally, extensive experimental results indicate the effectiveness of OHSFS. 
Moreover, OHSFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"470-485"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Multi-Scale Features Mutual Mapping Fusion Based on Reverse Knowledge Distillation for Industrial Anomaly Detection and Localization","authors":"Guoxiang Tong;Quanquan Li;Yan Song","doi":"10.1109/TBDATA.2024.3350539","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350539","url":null,"abstract":"Unsupervised anomaly detection methods based on knowledge distillation have exhibited promising results. However, there is still room for improvement in the differential characterization of anomalous samples. In this article, a novel anomaly detection and localization model based on reverse knowledge distillation is proposed, where an enhanced multi-scale feature mutual mapping feature fusion module is proposed to greatly extract discrepant features at different scales. This module helps enhance the difference in anomaly region representation in the teacher-student structure by inhomogeneously fusing features at different levels. Then, the coordinate attention mechanism is introduced in the reverse distillation structure to pay special attention to dominant issues, facilitating nice direction guidance and position encoding. Furthermore, an innovative single-category embedding memory bank, inspired by human memory mechanisms, is developed to normalize single-category embedding to encourage high-quality model reconstruction. Finally, in several categories of the well-known MVTec dataset, our model achieves better results than state-of-the-art models in terms of AUROC and PRO, with an overall average of 98.1%, 98.3%, and 95.0% for detection AUROC scores, localization AUROC scores, and localization PRO scores, respectively, across 15 categories. 
Extensive experiments are conducted on the ablation study to validate the contribution of each component of the model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"498-513"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Unsupervised Hashing via Exploiting Robust Cross-Modal Consistency","authors":"Xingbo Liu;Jiamin Li;Xiushan Nie;Xuening Zhang;Shaohua Wang;Yilong Yin","doi":"10.1109/TBDATA.2024.3350541","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350541","url":null,"abstract":"Unsupervised cross-modal hashing has received increasing attention because of its efficiency and scalability for large-scale data retrieval and analysis. However, existing unsupervised cross-modal hashing methods primarily focus on learning shared feature embedding, ignoring robustness and consistency across different modalities. To this end, this study proposes a novel method called scalable unsupervised hashing (SUH) for large-scale cross-modal retrieval. In the proposed method, latent semantic information and common semantic embedding within heterogeneous data are simultaneously exploited using multimodal clustering and collective matrix factorization, respectively. Furthermore, the robust norm is seamlessly integrated into the two processes, making SUH insensitive to outliers. Based on the robust consistency exploited from the latent semantic information and feature embedding, hash codes can be learned discretely to avoid cumulative quantitation loss. The experimental results on five benchmark datasets demonstrate the effectiveness of the proposed method under various scenarios.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"514-527"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-Shot Learning With Multi-Granularity Knowledge Fusion and Decision-Making","authors":"Yuling Su;Hong Zhao;Yifeng Zheng;Yu Wang","doi":"10.1109/TBDATA.2024.3350542","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350542","url":null,"abstract":"Few-shot learning (FSL) is a challenging task in classifying new classes from few labelled examples. Many existing models embed class structural knowledge as prior knowledge to enhance FSL against data scarcity. However, they fall short of connecting the class structural knowledge with the limited visual information which plays a decisive role in FSL model performance. In this paper, we propose a unified FSL framework with multi-granularity knowledge fusion and decision-making (MGKFD) to overcome the limitation. We aim to simultaneously explore the visual information and structural knowledge, working in a mutual way to enhance FSL. On the one hand, we strongly connect global and local visual information with multi-granularity class knowledge to explore intra-image and inter-class relationships, generating specific multi-granularity class representations with limited images. On the other hand, a weight fusion strategy is introduced to integrate multi-granularity knowledge and visual information to make the classification decision of FSL. It enables models to learn more effectively from limited labelled examples and allows generalization to new classes. Moreover, considering varying erroneous predictions, a hierarchical loss is established by structural knowledge to minimize the classification loss, where greater degree of misclassification is penalized more. 
Experimental results on three benchmark datasets show the advantages of MGKFD over several advanced models.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"486-497"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCOREH+: A High-Order Node Proximity Spectral Clustering on Ratios-of-Eigenvectors Algorithm for Community Detection","authors":"Yanhui Zhu;Fang Hu;Lei Hsin Kuo;Jia Liu","doi":"10.1109/TBDATA.2023.3346715","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3346715","url":null,"abstract":"The research on complex networks has achieved significant progress in revealing the mesoscopic features of networks. Community detection is an important aspect of understanding real-world complex systems. We present in this paper a High-order node proximity Spectral Clustering on Ratios-of-Eigenvectors (SCOREH+) algorithm for locating communities in complex networks. The algorithm improves SCORE and SCORE+ and preserves high-order transitivity information of the network affinity matrix. We optimize the high-order proximity matrix from the initial affinity matrix using the Radial Basis Functions (RBFs) and Katz index. In addition to the optimization of the Laplacian matrix, we implement a procedure that joins an additional eigenvector (the $(k+1)$th leading eigenvector) to the spectrum domain for clustering if the network is considered to be a “weak signal” graph. The algorithm has been successfully applied to both real-world and synthetic data sets. The proposed algorithm is compared with state-of-the-art algorithms, such as ASE, Louvain, Fast-Greedy, Spectral Clustering (SC), SCORE, and SCORE+. To demonstrate the high efficacy of the proposed method, we conducted comparison experiments on eleven real-world networks and a number of synthetic networks with noise. The experimental results in most of these networks demonstrate that SCOREH+ outperforms the baseline methods. 
Moreover, by tuning the RBFs and their shaping parameters, we may generate state-of-the-art community structures on all real-world networks and even on noisy synthetic networks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"301-312"},"PeriodicalIF":7.2,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Causal Chain Graph Structure via Alternate Learning and Double Pruning","authors":"Shujing Yang;Fuyuan Cao;Kui Yu;Jiye Liang","doi":"10.1109/TBDATA.2023.3346712","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3346712","url":null,"abstract":"Causal chain graphs model the dependency structure between individuals when the assumption of individual independence in causal inference is violated. However, causal chain graphs are often unknown in practice and require learning from data. Existing learning algorithms have certain limitations. Specifically, learning local information requires multiple subset searches, building the skeleton requires additional conditional independence testing, and directing the edges requires obtaining local information from the skeleton again. To remedy these problems, we propose a novel algorithm for learning causal chain graph structure. The algorithm alternately learns the adjacencies and spouses of each variable as local information and doubly prunes them to obtain more accurate local information, which reduces subset searches, improves its accuracy, and facilitates subsequent learning. It then directly constructs the chain graphs skeleton using the learned adjacencies without conditional independence testing. Finally, it directs the edges of complexes using the learned adjacencies and spouses to learn chain graphs without reacquiring local information, further improving its efficiency. We conduct theoretical analysis to prove the correctness of our algorithm and compare it with the state-of-the-art algorithms on synthetic and real-world datasets. 
The experimental results demonstrate our algorithm is more reliable than its rivals.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"442-456"},"PeriodicalIF":7.5,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascaded Knowledge-Level Fusion Network for Online Course Recommendation System","authors":"Wenjun Ma;Yibing Zhao;Xiaomao Fan","doi":"10.1109/TBDATA.2023.3346711","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3346711","url":null,"abstract":"In light of the global proliferation of the COVID-19 pandemic, there has been a notable surge in public interest towards Massive Open Online Courses (MOOCs) recently. Within the realm of personalized course-learning services, a large number of online course recommendation systems have been developed to cater to the diverse needs of learners. However, despite these advancements, there still exist three unsolved challenges: 1) how to effectively utilize the course information spanning from the title-level to the more granular keyword-level; 2) how to well capture the sequential information among learning courses; 3) how to identify the high-correlated courses in the course corpora. To address these challenges, we propose a novel solution known as Cascaded Knowledge-level Fusion Network (CKFN) for online course recommendation, incorporating a three-fold approach to maximize the utilization of course information: 1) two knowledge graphs spanning from the keyword-level to title-level; 2) a two-stage attention fusion mechanism; 3) a novel knowledge-aware negative sampling method. Experimental results on a real dataset of XuetangX demonstrate that CKFN surpasses existing baseline models by a substantial margin, thereby achieving the state-of-the-art recommendation performance. 
It means that CKFN can be potentially deployed into MOOCs platforms as a pivotal component to provide personalized course recommendation service.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"457-469"},"PeriodicalIF":7.5,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bi-Selection of Instances and Features Based on Neighborhood Importance Degree","authors":"Xiao Zhang;Zhaoqian He;Jinhai Li;Changlin Mei;Yanyan Yang","doi":"10.1109/TBDATA.2023.3342643","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3342643","url":null,"abstract":"As one of the most important concepts for classification learning, neighborhood granules obtained by dividing adjacent objects or instances can be regarded as the minimal elements to simulate human cognition. At present, neighborhood granules have been successfully applied to knowledge acquisition. Nevertheless, little work has been devoted to the simultaneous selection of features and instances by the use of neighborhood granules. To fill this gap, we investigate in this paper the issue of bi-selection of instances and features based on neighborhood importance degree (NID). First, the conditional neighborhood entropy is defined to measure decision uncertainty of a neighborhood granule. Considering both decision uncertainty and coverage ability of a neighborhood granule, we propose the concept of NID. Then, an instance selection algorithm is formulated to select representative instances based on NID. Furthermore, an NID-based feature selection algorithm is provided for a neighborhood decision system. By integrating the instance selection and feature selection methods, a bi-selection approach based on NID (BSNID) is finally proposed to select instances and features. Lastly, some numerical experiments are conducted to evaluate the performance of BSNID. 
The results demonstrate that BSNID can take account of both reduction ratio and classification accuracy and, therefore, performs satisfactorily in effectiveness.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"415-428"},"PeriodicalIF":7.5,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FIG: Feature-Weighted Information Granules With High Consistency Rate","authors":"Jianghe Cai;Yuhui Deng;Yi Zhou;Jiande Huang;Geyong Min","doi":"10.1109/TBDATA.2023.3343348","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3343348","url":null,"abstract":"Information granules are effective in revealing the structure of data. Therefore, it is a common practice in data mining to use information granules for classifying datasets. In the existing granular classifiers, the information granules are often classified according to the standard membership function only without considering the influence of different feature weights on the quality of granules and label classification results. In this article, we utilize the feature weighting of data to produce the information granules with high consistency rate called FIG. First, we use consistency rate and contribution scores to generate information granules. Then, we propose a granular two-stage classifier GTC based on FIG. GTC divides the data into fuzzy and fixed points and then calculates the interval matching degree to assign data points to the most suitable cluster in the second step. Finally, we compare FIG with two state-of-the-art granular models (T-GrM and FGC-rule), and classification accuracy is also compared with other classification algorithms. The extensive experiments on synthetic datasets and public datasets from UCI show that FIG has sufficient performance to describe the data structure and excellent capability under the constructed granular classifier GTC. Compared with T-GrM and FGC-rule, the time overhead required for FIG to obtain information granules is reduced by an average of 51.07%, the per unit quality of the granules is also increased by more than 14.74%. 
Compared with other classification algorithms, GTC improves accuracy by an average of 5.04%.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"400-414"},"PeriodicalIF":7.5,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}