{"title":"GANBLR: A Tabular Data Generation Model","authors":"Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Gang Li","doi":"10.1109/ICDM51629.2021.00103","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00103","url":null,"abstract":"Generative Adversarial Network (GAN) models have shown to be effective in a wide range of machine learning applications, and tabular data generation process has not been an exception. Notably, some state-of-the-art models of tabular data generation, such as CTGAN, TableGan, MedGAN, etc. are based on GAN models. Even though these models have resulted in superiour performance in generating artificial data when trained on a range of datasets, there is a lot of room (and desire) for improvement. Not to mention that existing methods do have some weaknesses other than performance. E.g., the current methods focus only on the performance of the model, and limited emphasis is given to the interpretation of the model. Secondly, the current models operate on raw features only, and hence they fail to exploit any prior knowledge on explicit feature interactions that can be utilized during data generation process. To alleviate the two above-mentioned limitations, in this work, we propose a novel tabular data generation model– Generative Adversarial Network modelling inspired from Naive Bayes and Logistic Regression’s relationship (GANBLR), which can not only address the interpretation limitation in existing tabular GAN-based models but can provide capability to handle explicit feature interactions. 
By extensively evaluating on wide range of datasets, we demonstrate GANBLR’S superiour performance as well as better interpretable capability (explanation of feature importance in the synthetic generation process) as compared to existing state-of-the-art tabular data generation models.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115774129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attacking Similarity-Based Sign Prediction","authors":"M. T. Godziszewski, Tomasz P. Michalak, Marcin Waniek, Talal Rahwan, Kai Zhou, Yulin Zhu","doi":"10.1109/ICDM51629.2021.00173","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00173","url":null,"abstract":"In this paper, we present a computational analysis of the problem of attacking sign prediction, whereby the aim of the attacker (a network member) is to hide from the defender (an analyst) the signs of a target set of links by removing the signs of some other, non-target, links. The problem turns out to be NP-hard if either local or global similarity measures are used for sign prediction. We propose a heuristic algorithm and test its effectiveness on several real-life and synthetic datasets.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125038290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conversion Prediction with Delayed Feedback: A Multi-task Learning Approach","authors":"Yilin Hou, Guangming Zhao, Chuanren Liu, Zhonglin Zu, Xiaoqiang Zhu","doi":"10.1109/ICDM51629.2021.00029","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00029","url":null,"abstract":"Online display advertising has become a vital business for large-scale E-commerce markets. As the main goal of advertisers is to reach interested customer prospects, accurate conversion prediction is essential for successful online display advertising. A particular challenge for conversion prediction is that conversions may occur long after the click events. Such delayed feedback makes it a non-trivial task to keep conversion prediction models updated and consistent with the latest customer distribution. Although several studies have been conducted to tackle the delayed feedback issue, the relationship between the early conversion and full term conversion has not been fully exploited to improve conversion prediction. In this paper, we consider conversion prediction as a multi-task learning problem by leveraging multiple conversion labels after different observation intervals. Specifically, we propose a multi-task model with an end-to-end architecture for conversion prediction. Our approach is guided by theoretical and probabilistic analysis of the early and full term conversions. Our mixture-of-experts module can integrate distinct characteristics of input features and optimize the task-specific experts. In addition, the multiple tasks are jointly learned with a regularization term to ensure the embedding consistency between tasks and prevent potential overfitting issues. 
In comparison with competitive benchmarks, our approach can significantly improve conversion prediction with delayed feedback and improve business performance of online display advertising.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126721490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fair Graph Auto-Encoder for Unbiased Graph Representations with Wasserstein Distance","authors":"Wei Fan, Kunpeng Liu, Rui Xie, Hao Liu, Hui Xiong, Yanjie Fu","doi":"10.1109/ICDM51629.2021.00122","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00122","url":null,"abstract":"The fairness issue is very important in deploying machine learning models as algorithms widely used in human society can be easily in discrimination. Researchers have studied disparity on tabular data a lot and proposed many methods to relieve bias. However, studies towards unfairness in graph are still at early stage while graph data that often represent connections among people in real-world applications can easily give rise to fairness issues and thus should be attached to great importance. Fair representation learning is one of the most effective methods to relieve bias, which aims to generate hidden representations of input data while obfuscating sensitive information. In graph setting, learning fair representations of graph (also called fair graph embeddings) is effective to solve graph unfairness problems. However, most existing works of fair graph embeddings only study fairness in a coarse granularity (i.e., group fairness), but overlook individual fairness. In this paper, we study fair graph representations from different levels. Specifically, we consider both group fairness and individual fairness on graph. To debias graph embeddings, we propose FairGAE, a fair graph auto-encoder model, to derive unbiased graph embeddings based on the tailor-designed fair Graph Convolution Network (GCN) layers. Then, to achieve multi-level fairness, we design a Wasserstein distance based regularizer to learn the optimal transport for fairer embeddings. To overcome the efficiency concern, we further bring up Sinkhorn divergence as the approximations of Wasserstein cost for computation. 
Finally, we apply the learned unbiased embeddings into the node classification task and conduct extensive experiments on two real-world graph datasets to demonstrate the improved performances of our approach.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117275121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurately Quantifying under Score Variability","authors":"André Gustavo Maletzke, Denis Moreira dos Reis, Waqar Hassan, Gustavo E. A. P. A. Batista","doi":"10.1109/ICDM51629.2021.00149","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00149","url":null,"abstract":"The quantification objective is to predict the class distribution of a data sample. Therefore, this task intrinsically involves a drift in the class distribution that causes a mismatch between the training and test sets. However, existing quantification approaches assume that the feature distribution is stationary. We analyse for the first time how score-based quantifiers are affected by concept drifts and propose a novel drift-resilient quantifier for binary classes. Our proposal does not model the different types of concept drifts. Instead, we model the changes that such changes cause in the classification scores. This observation simplifies our analysis since distribution changes can only increase, decrease or maintain the overlap of the positive and negative classes in a rank induced by the scores. Our paper has two main contributions. The first one is MoSS, a model for synthetic scores. We use this model to show that state-of-the-art quantifiers underperform in the occurrence of any concept drift that changes the score distribution. Our second contribution is a quantifier, DySyn, that uses MoSS to estimate the class distribution. 
We show that DySyn statistically outperforms state-of-the-art quantifiers in a comprehensive comparison with real-world and benchmark datasets in the presence of concept drifts.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129713241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-shot Key Information Extraction from Mixed-Style Tables: Pre-training on Wikipedia","authors":"Qingping Yang, Yingpeng Hu, Rongyu Cao, Hongwei Li, Ping Luo","doi":"10.1109/ICDM51629.2021.00187","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00187","url":null,"abstract":"Table, widely used in documents from various vertical domains, is a compact representation of data. There is always some strong demand to automatically extract key information from tables for further analysis. In addition, the set of keys that need to be extracted information is usually time-varying, which arises the issue of zero-shot keys in this situation. To increase the efficiency of these knowledge workers, in this study we aim to extract the values of a given set of keys from tables. Previous table-related studies mainly focus on relational, entity, and matrix tables. However, their methods fail on mixed-style tables, in which table headers might exist in any non-merged or merged cell, and the spatial relationships between headers and corresponding values are diverse. Here, we address this problem while taking mixed-style tables into account. To this end, we propose an end-to-end neural-based model, called Information Extraction in Mixed-style Table (IEMT). IEMT first uses BERT to extract textual semantics of the given key and the words in each cell. Then, it uses multi-layer CNN to capture the spatial and textual interactions among adjacent cells. Furthermore, to improve the accuracy on zero-shot keys, we pre-train IEMT on a dataset constructed on 0.4 million tables from Wikipedia and 140 million triplets from Ownthink. 
Experiments with the fine-tuning step on 26,869 financial tables show that the proposed model achieves 0.9323 accuracy for zero-shot keys, obtaining more than 8% increase compared with the model without pre-training.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"807 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130567523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hypergraph Convolutional Network for Group Recommendation","authors":"Renqi Jia, Xiaofei Zhou, Linhua Dong, Shirui Pan","doi":"10.1109/ICDM51629.2021.00036","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00036","url":null,"abstract":"Group activities have become an essential part of people’s daily life, which stimulates the requirement for intensive research on the group recommendation task, i.e., recommending items to a group of users. Most existing works focus on aggregating users’ interests within the group to learn group preference. These methods are faced with two problems. First, these methods only model the user preference inside a single group while ignoring the collaborative relations among users and items across different groups. Second, they assume that group preference is an aggregation of user interests, and factually a group may pursue some targets not derived from users’ interests. Thus they are insufficient to model the general group preferences which are independent of existing user interests. To address the above issues, we propose a novel dual channel Hypergraph Convolutional network for group Recommendation (HCR), which consists of member-level preference network and group-level preference network. In the member-level preference network, in order to capture cross-group collaborative connections among users and items, we devise a member-level hypergraph convolutional network to learn group members’ personal preferences. In the group-level preference network, the group’s general preference is captured by a group-level graph convolutional network based on group similarity. 
We evaluate our model on two real-world datasets and the experimental results show that the proposed model significantly and consistently outperforms state-of-the-art group recommendation techniques.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123523134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topic-Attentive Encoder-Decoder with Pre-Trained Language Model for Keyphrase Generation","authors":"Cangqi Zhou, Jinling Shang, Jing Zhang, Qianmu Li, Dianming Hu","doi":"10.1109/ICDM51629.2021.00200","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00200","url":null,"abstract":"Keyphrase annotation task aims to retrieve the most representative phrases that express the essential gist of documents. In reality, some phrases that best summarize documents are often absent from the original text, which motivates researchers to develop generation methods, being able to create phrases. Existing generation approaches usually adopt the encoder-decoder framework for sequence generation. However, the widely-used recurrent neural network might fail to capture long-range dependencies among items. In addition, intuitively, as keyphrases are likely to correlate with topical words, some methods propose to introduce topic models into keyphrase generation. But they hardly leverage the global information of topics. In view of this, we employ the Transformer architecture with the pre-trained BERT model as the encoder-decoder framework for keyphrase generation. BERT and Transformer are demonstrated to be effective for many text mining tasks. But they have not been extensively studied for keyphrase generation. Furthermore, we propose a topic attention mechanism to utilize the corpus-level topic information globally for keyphrase generation. Specifically, we propose BertTKG, a keyphrase generation method that uses a contextualized neural topic model for corpus-level topic representation learning, and then enhances the document representations learned by pre-trained language model for better keyphrase decoding. 
Extensive experiments conducted on three public datasets manifest the superiority of BertTKG.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116494442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive Modeling of Clinical Events with Mutual Enhancement Between Longitudinal Patient Records and Medical Knowledge Graph","authors":"Xiao Xu, Xian Xu, Yuyao Sun, Xiaoshuang Liu, Xiang Li, G. Xie, Fei Wang","doi":"10.1109/ICDM51629.2021.00089","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00089","url":null,"abstract":"In recent years, with the better availability of medical data such as Electronic Health Records (EHR), more and more data mining models have been developed to explore the data-driven insights for better human health. However, there are many challenges for analyzing EHR such as high-dimensionality, temporality, sparsity, etc., which make the data-driven models less reliable. Medical knowledge graph (MKG), which encodes comprehensive knowledge about the medical concepts and relationships extracted from medical literature, holds great promise to regularize the data-driven models as prior knowledge. Nonetheless, the MKGs are typically not complete, which limits its utility in helping with the data mining process. In this paper, we propose a mutual enhancement framework MendMKG for predictive modeling of clinical events with both EHR and MKG. In particular, MendMKG first conducts a self-supervised learning strategy to simultaneously pre-train a graph attention network for embedding nodes and complete the MKG. It iteratively performs (1) an embedding-based knowledge graph completion module to derive missing edges, (2) and a reconstruction module of unlabeled EHR data to select high-quality ones from these edges, which would be further appended to the MKG to update the embedding model. Through the iterations, the two modules mutually benefit each other. Then, MendMKG uses the pre-trained graph attention network and the updated MKG to generate the visit embeddings to represent patient’s historical visits, and predict the diagnosis in future visit, through a fine-tuning approach. 
Experimental results on real world EHR corpus are provided to demonstrate the superiority of the proposed framework, compared to a series of state-of-the-art baselines.11The source code and knowledge graph data have been anonymously uploaded to https://github.com/1317375434/MendMKG.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114699343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure-Aware Stabilization of Adversarial Robustness with Massive Contrastive Adversaries","authors":"Shuo Yang, Zeyu Feng, Pei Du, Bo Du, Chang Xu","doi":"10.1109/ICDM51629.2021.00092","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00092","url":null,"abstract":"Recent researches indicate that the impact of adversarial perturbations on deep learning models is reflected not only on the alteration of predicted labels but also on the distortion of data structure in the representation space. Significant improvement of the model’s adversarial robustness can be achieved by reforming the structure-aware representation distortion. Current methods generally utilize the one-to-one representation alignment or the triplet information between the positive and negative pairs. However, in this paper, we show that the representation structure of the natural and adversarial examples cannot be well and stably captured if we only focus on a localized range of contrastive examples. To achieve better and more stable adversarial robustness, we propose to adjust the adversarial distortion of representation structure by using Massive Contrastive Adversaries (MCA). Inspired by the Noise-Contrastive Estimation (NCE), MCA exploits the contrastive information by employing m negative instances. Compared with existing methods, our method recruits a much wider range of negative examples per update, so a better and more stable representation relationship between the natural and adversarial examples can be captured. Theoretical analysis shows that the proposed MCA inherently maximizes a lower bound of the mutual information (MI) between the representations of the natural and adversarial examples. 
Empirical experiments on benchmark datasets demonstrate that MCA can achieve better and more stable intra-class compactness and inter-class divergence, which further induces better adversarial robustness.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126410506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}