{"title":"A Composable Generative Framework Based on Prompt Learning for Various Information Extraction Tasks","authors":"Zhigang Kan;Linhui Feng;Zhangyue Yin;Linbo Qiao;Xipeng Qiu;Dongsheng Li","doi":"10.1109/TBDATA.2023.3278977","DOIUrl":"10.1109/TBDATA.2023.3278977","url":null,"abstract":"Prompt learning is an effective paradigm that bridges the gap between pre-training tasks and their corresponding downstream applications. Approaches based on this paradigm have achieved outstanding results in various applications. However, how to design a general-purpose framework based on the prompt learning paradigm for various information extraction tasks remains an open question. In this article, we propose a novel composable prompt-based generative framework that can be applied to a wide range of tasks in the field of information extraction. Specifically, we reformulate information extraction tasks as filling slots in pre-designed type-specific prompts, which consist of one or more sub-prompts. A strategy of constructing composable prompts is proposed to enhance the generalization ability in data-scarce scenarios. Furthermore, to fit this framework, we transform relation extraction into the task of determining semantic consistency in prompts. The experimental results demonstrate that our approach surpasses the compared baselines on real-world datasets in both data-abundant and data-scarce scenarios. 
Further analysis of the proposed framework is presented, along with numerical experiments investigating the factors that affect performance on various tasks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1238-1251"},"PeriodicalIF":7.2,"publicationDate":"2023-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46298707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
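The slot-filling reformulation described in this abstract can be illustrated with a toy helper. A minimal sketch, assuming hypothetical bracket-delimited templates and slot names; the paper's actual type-specific prompts and sub-prompt construction strategy are not shown here:

```python
def compose_prompt(sub_prompts, slots):
    """Fill the [slot] markers of each sub-prompt and join the filled
    sub-prompts into one composed prompt, mirroring the idea of
    composable type-specific prompts (templates here are illustrative)."""
    filled = []
    for template in sub_prompts:
        for name, value in slots.items():
            template = template.replace("[" + name + "]", value)
        filled.append(template)
    return " ".join(filled)

# A composed prompt built from two hypothetical sub-prompts:
prompt = compose_prompt(
    ["[span] is a [type]."],
    {"span": "Paris", "type": "location"},
)
```

Under this view, extraction reduces to the model generating the slot values, and composability comes from reusing sub-prompts across task types.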
{"title":"ATLAS: GAN-Based Differentially Private Multi-Party Data Sharing","authors":"Zhenya Wang;Xiang Cheng;Sen Su;Jintao Liang;Haocheng Yang","doi":"10.1109/TBDATA.2023.3277716","DOIUrl":"10.1109/TBDATA.2023.3277716","url":null,"abstract":"In this article, we study the problem of differentially private multi-party data sharing, where the involved parties, assisted by a semi-honest curator, collectively generate a shared dataset while satisfying differential privacy. Inspired by the success of data synthesis with the generative adversarial network (GAN), we propose a novel GAN-based differentially private multi-party data sharing approach named ATLAS. In ATLAS, we extend the original GAN to multiple discriminators, letting each party hold a discriminator while the curator holds the generator. To update the generator without compromising any party's privacy, we decompose the calculation of the generator's gradient and selectively sanitize the discriminators' responses. Additionally, we propose two methods to improve the utility of the shared data: the collaborative discriminator filtering (CDF) method and the adaptive gradient perturbation (AGP) method. Specifically, the CDF method utilizes the trained discriminators to refine synthetic records, while the AGP method adaptively adjusts the noise scale during training to reduce the impact of differentially private noise on the final shared data. 
Extensive experiments on real-world datasets validate the superiority of our ATLAS approach.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1225-1237"},"PeriodicalIF":7.2,"publicationDate":"2023-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49027469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
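The abstract above does not spell out ATLAS's exact sanitization formula; a minimal sketch, assuming the standard clip-and-add-Gaussian-noise mechanism applied to each discriminator's gradient response (function names and the noise_multiplier default are illustrative assumptions, not the paper's API):

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def sanitize_response(grad, clip_norm=1.0, noise_multiplier=1.1, rng=random):
    """Sanitize one discriminator's response with the Gaussian mechanism:
    clip the gradient, then add N(0, sigma^2) noise per coordinate, with
    sigma = noise_multiplier * clip_norm (a common DP-SGD-style choice)."""
    clipped = clip_gradient(grad, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]
```

The curator would aggregate such sanitized responses to update the generator; AGP's contribution is adapting the noise scale over training, which this fixed-sigma sketch omits.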
{"title":"GCN-ST-MDIR: Graph Convolutional Network-Based Spatial-Temporal Missing Air Pollution Data Pattern Identification and Recovery","authors":"Yangwen Yu;Victor O. K. Li;Jacqueline C. K. Lam;Kelvin Chan","doi":"10.1109/TBDATA.2023.3277710","DOIUrl":"10.1109/TBDATA.2023.3277710","url":null,"abstract":"Missing data pattern identification and recovery (MDIR) is vital for accurate air pollution monitoring. To recover missing air pollution data, we propose GCN-ST-MDIR, a Graph Convolutional Network (GCN)-based MDIR framework that identifies daily missing data patterns and automatically selects the best recovery method. GCN-ST-MDIR presents four novelties: (1) A new graph construction is developed to improve GCN data representation for MDIR, using an S-T similarity matrix and domain-specific knowledge (e.g., weekend/weekday). (2) A TL component is used to pre-train the LSCE and ILSCE models. (3) A GCN structure outputs a selection indicator that determines the dominant missing pattern for each daily input; the pre-trained data recovery model's accuracy is incorporated into the GCN loss function to penalize incorrect indicators. (4) The output of the GCN structure is used as a score to combine LSCE and ILSCE. Results show that domain-specific S-T regularity and irregularity can be used as prior information for both the GCN and ILSCE/LSCE to enhance feature extraction. Our model considerably improves recovery performance compared to the baselines. GCN-ST-MDIR achieves an accuracy of 88.48% for general missing data recovery with consecutively and sporadically missing data. 
GCN-ST-MDIR can be extended to many other S-T MDIR challenges.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 5","pages":"1347-1364"},"PeriodicalIF":7.2,"publicationDate":"2023-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47277996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
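The abstract states that the GCN's output score combines the LSCE and ILSCE recoveries but does not give the combination rule; a plausible sketch, assuming a simple convex blend of the two recovery outputs (the actual rule in the paper may differ):

```python
def combine_recoveries(score, rec_a, rec_b):
    """Blend two candidate recoveries with a selection score in [0, 1]:
    score = 1.0 fully trusts rec_a, score = 0.0 fully trusts rec_b.
    Inputs are flat lists of recovered values (a simplification)."""
    assert 0.0 <= score <= 1.0, "selection score must lie in [0, 1]"
    return [score * a + (1.0 - score) * b for a, b in zip(rec_a, rec_b)]
```

In the framework described above, the score would come from the GCN's selection indicator for the day's dominant missing pattern, so the blend leans toward whichever pre-trained model fits that pattern.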
{"title":"Dual Uncertainty-Guided Mixing Consistency for Semi-Supervised 3D Medical Image Segmentation","authors":"Chenchu Xu;Yuan Yang;Zhiqiang Xia;Boyan Wang;Dong Zhang;Yanping Zhang;Shu Zhao","doi":"10.1109/TBDATA.2023.3258643","DOIUrl":"10.1109/TBDATA.2023.3258643","url":null,"abstract":"3D semi-supervised medical image segmentation is essential in computer-aided diagnosis, as it can reduce the time-consuming task of annotation. The challenges with current 3D semi-supervised segmentation algorithms include limited attention to volume-wise context information, an inability to generate accurate pseudo labels, and a failure to capture important details during data augmentation. This article proposes a dual uncertainty-guided mixing consistency network for accurate 3D semi-supervised segmentation that addresses these challenges. The proposed network consists of a Contrastive Training Module, which improves the quality of augmented images by retaining the invariance of data augmentation between the original data and their augmentations; a Dual Uncertainty Strategy, which calculates dual uncertainty between two different models to select a more confident area for subsequent segmentation; and a Mixing Volume Consistency Module, which uses dual uncertainty to guide the consistency between mixing before and after segmentation for the final segmentation and can fully learn volume-wise context information. Results from experiments on brain tumor and left atrial segmentation show that the proposed method outperforms state-of-the-art 3D semi-supervised methods, as confirmed by quantitative and qualitative analysis on the datasets. This demonstrates that the proposed method has the potential to become a medical tool for accurate segmentation. 
Code is available at: https://github.com/yang6277/DUMC.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1156-1170"},"PeriodicalIF":7.2,"publicationDate":"2023-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42422060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constraint-Driven Complexity-Aware Data Science Workflow for AutoBDA","authors":"Akila Siriweera;Incheon Paik;Huawei Huang","doi":"10.1109/TBDATA.2023.3256043","DOIUrl":"10.1109/TBDATA.2023.3256043","url":null,"abstract":"The Internet of Things, privacy concerns, and technical constraints increase the demand for edge-based data-driven services, one of the major goals of Industry 4.0 and Society 5.0. Big data analysis (BDA) is the preferred approach to unleashing hidden knowledge. However, BDA consumes excessive resources and time. These limitations hamper the meaningful adoption of BDA, especially in time- and situation-critical edge use cases, and hinder the goals of Industry 4.0 and Society 5.0. Automating the BDA process at the edge is a cognitive approach to addressing these concerns, and constructing the data science workflow is an indispensable challenge for successful automation. Therefore, as our first contribution, we conducted a systematic literature survey on data science workflow platforms. From this survey, we learned that the BDA workflow depends on diversified constraints and undergoes rigorous data-mining stages, which enlarge the solution space and introduce dynamic constraints, complexity issues, and NP-hardness into the BDA workflow. Graphplan is a heuristic AI-planning technique that can address these concerns. Therefore, as our second contribution, we adopted Graphplan to generate workflows for edge-based BDA automation. 
Experiments demonstrate that the proposed method achieved our objectives.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 6","pages":"1438-1457"},"PeriodicalIF":7.2,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62972028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Supervised Nodes-Hyperedges Embedding for Heterogeneous Information Network Learning","authors":"Mengran Li;Yong Zhang;Wei Zhang;Yi Chu;Yongli Hu;Baocai Yin","doi":"10.1109/TBDATA.2023.3275374","DOIUrl":"10.1109/TBDATA.2023.3275374","url":null,"abstract":"The exploration of self-supervised information mining of heterogeneous datasets has gained significant traction in recent years. Heterogeneous graph neural networks (HGNNs) have emerged as a highly promising method for handling heterogeneous information networks (HINs) due to their superior performance. These networks leverage aggregation functions to convert pairwise relations-based features from raw heterogeneous graphs into embedding vectors. However, real-world HINs contain valuable higher-order relations that are often overlooked but can provide complementary information. To address this issue, we propose a novel method called Self-supervised Nodes-Hyperedges Embedding (SNHE), which leverages hypergraph structures to incorporate higher-order information into the embedding process of HINs. Our method decomposes the raw graph structure into snapshots based on various meta-paths, which are then transformed into hypergraphs to aggregate high-order information within the data and generate embedding representations. Given the complexity of HINs, we develop a dual self-supervised structure that maximizes mutual information in the enhanced graph data space, guides the overall model update, and reduces redundancy and noise. We evaluate our proposed method on various real-world datasets for node classification and clustering tasks, and compare it against state-of-the-art methods. The experimental results demonstrate the efficacy of our method. 
Our code is available at https://github.com/limengran98/SNHE.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1210-1224"},"PeriodicalIF":7.2,"publicationDate":"2023-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42016844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Clustering Ensemble Method for Big Data","authors":"Mohammad Sultan Mahmud;Joshua Zhexue Huang;Rukhsana Ruby;Alladoumbaye Ngueilbaye;Kaishun Wu","doi":"10.1109/TBDATA.2023.3255003","DOIUrl":"10.1109/TBDATA.2023.3255003","url":null,"abstract":"Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in distributed computing. A popular method to tackle this problem is to use a random sample of the big dataset to compute an approximate result as an estimation of the true result computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimation of the true result of the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently in parallel on the nodes of a cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Since the random samples are disjoint and traditional consensus functions cannot be used, we propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses the component cluster centers to build a graph and the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found; a hierarchical clustering method is then used to generate the final set of $k$ cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of $k$ cluster centers. 
Finally, the $k$-means algorithm is used to allocate the entire dataset into $k$ clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1142-1155"},"PeriodicalIF":7.2,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44823513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
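The ensemble pipeline this abstract describes (cluster disjoint sample blocks independently, then integrate the component centers, then allocate all points) can be sketched in miniature. Note the paper integrates centers via METIS graph cuts plus hierarchical clustering, or via message passing; this toy version on 1-D data simply re-clusters the pooled component centers with k-means, which is a stand-in, not the paper's method:

```python
import random

def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's k-means on 1-D points; returns k cluster centers."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[nearest].append(p)
        # Keep a center unchanged if its group went empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def ensemble_cluster(blocks, k):
    """Cluster each random-sample block independently, pool the component
    centers, then cluster the pooled centers to get the final k centers."""
    pooled = []
    for block in blocks:
        pooled.extend(kmeans(block, k))
    return sorted(kmeans(pooled, k))
```

A final pass would then assign every record in the full dataset to its nearest final center, which is the k-means allocation step mentioned above.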
{"title":"Novel Multi-Feature Fusion Facial Aesthetic Analysis Framework","authors":"Huanyu Chen;Weisheng Li;Xinbo Gao;Bin Xiao","doi":"10.1109/TBDATA.2023.3255582","DOIUrl":"10.1109/TBDATA.2023.3255582","url":null,"abstract":"Machine learning has been used in facial beauty prediction studies. However, existing facial aesthetic feature extraction does not consider the integrity of facial geometric information, nor the impact of other facial attributes (e.g., expression) on aesthetics. We propose a novel multi-feature fusion facial aesthetic analysis framework (NMFA) to overcome this problem. First, we designed a facial shape feature, an intuitive and visual quantitative description based on B-splines. Second, we designed a representative low-dimensional facial structural feature to establish the theoretical basis of the facial structure, based on facial aesthetic structure and expression recognition theory. Next, we designed texture and holistic features based on Gabor filters and the VGG-face network. Finally, we used a multi-feature fusion strategy to fuse them for aesthetic evaluation. Experiments were conducted on four databases. 
The results reveal that the proposed method realizes the visualization of facial shape features, enriches geometric information, addresses the lack of facial geometric information and its poor interpretability, and achieves excellent performance with fewer parameters.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 5","pages":"1302-1320"},"PeriodicalIF":7.2,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45552573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Data Pricing for Data Marketplaces","authors":"Mengxiao Zhang;Fernando Beltrán;Jiamou Liu","doi":"10.1109/TBDATA.2023.3254152","DOIUrl":"10.1109/TBDATA.2023.3254152","url":null,"abstract":"A data marketplace is an online venue that brings data owners, data brokers, and data consumers together and facilitates commoditisation of data amongst them. Data pricing, as a key function of a data marketplace, demands quantifying the monetary value of data. A considerable number of studies on data pricing can be found in the literature. This article comprehensively reviews the state-of-the-art in data pricing to provide a general understanding of this emerging research area. Our key contribution lies in a new taxonomy of data pricing studies that unifies the different attributes determining data prices. Our framework first categorises these studies by the kind of market structure, be it sell-side, buy-side, or two-sided. In a sell-side market, the studies are further divided by query type, which defines the way a data consumer accesses data, while in a buy-side market, the studies are divided according to privacy notion, which defines the way to quantify the privacy of data owners. In a two-sided market, both privacy notion and query type are used as criteria. We systematically examine the studies falling into each category in our taxonomy. 
Lastly, we discuss gaps within the existing research and define future research directions.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 4","pages":"1038-1056"},"PeriodicalIF":7.2,"publicationDate":"2023-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41913229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Low Transformed Multi-Rank Tensor Completion With Deep Prior Regularization for Multi-Dimensional Image Recovery","authors":"Yao Li;Duo Qiu;Xiongjun Zhang","doi":"10.1109/TBDATA.2023.3254156","DOIUrl":"10.1109/TBDATA.2023.3254156","url":null,"abstract":"In this article, we study the robust tensor completion problem in three-dimensional image data, where only partial entries are available and the observed tensor is corrupted by Gaussian noise and sparse noise simultaneously. Compared with the existing tensor nuclear norm minimization for the low-rank component, we propose to use the transformed tensor nuclear norm to explore the global low-rankness of the underlying tensor. Moreover, the plug-and-play (PnP) deep prior denoiser is incorporated to preserve the local details of multi-dimensional images. Besides, the tensor $\ell_{1}$ norm is utilized to characterize the sparseness of the sparse noise. A symmetric Gauss-Seidel based alternating direction method of multipliers is designed to solve the resulting model under the PnP framework with deep prior denoiser. 
Extensive numerical experiments on hyperspectral and multispectral images, videos, color images, and magnetic resonance image datasets are conducted to demonstrate the superior performance of the proposed model in comparison with several state-of-the-art models.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"9 5","pages":"1288-1301"},"PeriodicalIF":7.2,"publicationDate":"2023-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43506548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
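The tensor l1-norm term for sparse noise in the model above corresponds, in ADMM-style solvers, to an elementwise soft-thresholding (proximal) update. A minimal sketch on a flat list of residual values; the actual solver operates on tensors and interleaves this step with the transformed-nuclear-norm and PnP denoiser updates, so this shows only the generic l1 prox, not the paper's full algorithm:

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink x toward zero by tau,
    zeroing anything with magnitude at most tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def sparse_noise_step(residual, tau):
    """One l1-prox update of the sparse-noise estimate, applied
    elementwise to the current residual (here a flat list of floats)."""
    return [soft_threshold(r, tau) for r in residual]
```

Because soft-thresholding zeroes small residuals and keeps only large outliers (shrunk by tau), it naturally separates sparse impulsive noise from the low-rank plus Gaussian part of the signal.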