{"title":"A new multiple instance algorithm using structural information","authors":"Xiaoyan Zhu, Ting Wang, Jiayin Wang, Ying Xu, Yuqian Liu","doi":"10.1109/ICDM51629.2021.00204","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00204","url":null,"abstract":"Multiple instance learning (MIL) is a semi-supervised learning paradigm that predicts the label of a bag containing a wide diversity of instances. It has many applications and thus attracts increasing attention. In this paper, we propose a new MIL algorithm that uses the structural information of a bag to predict its label. In the proposed method, a bag is transformed into a graph, and spectral clustering is employed to divide the graph into several subgraphs. Then, the graph Fourier transform is utilized to extract the features of the subgraphs. Finally, an end-to-end neural network is used to predict the label of a bag from the extracted features. An empirical study with 25 datasets was conducted to validate the effectiveness of the proposed method. The experimental results show that the proposed method performs better than the 6 baseline methods on most datasets.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132557469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Lookahead Algorithm for Robust Subspace Recovery","authors":"Guihong Wan, H. Schweitzer","doi":"10.1109/ICDM51629.2021.00175","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00175","url":null,"abstract":"A common task in the analysis of data is to compute an approximate embedding of the data in a low-dimensional subspace. The standard algorithm for computing this subspace is the well-known Principal Component Analysis (PCA). PCA can be extended to the case where some data points are viewed as “outliers” that can be ignored, allowing the remaining data points (“inliers”) to be more tightly embedded. We develop a new algorithm that detects outliers so that they can be removed prior to applying PCA. The main idea is to rank each point by looking ahead and evaluating the change in the global PCA error if that point is considered as an outlier. Our technical contribution is showing that this lookahead procedure can be implemented efficiently, producing an accurate algorithm with running time not much above the running time of standard PCA algorithms.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":" 65","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113952574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group-Level Cognitive Diagnosis: A Multi-Task Learning Perspective","authors":"Jie Huang, Qi Liu, Fei Wang, Zhenya Huang, Songtao Fang, Runze Wu, Enhong Chen, Yu Su, Shijin Wang","doi":"10.1109/ICDM51629.2021.00031","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00031","url":null,"abstract":"Most cognitive diagnosis research in education has concentrated on individual assessment, aiming to discover the latent characteristics of students. However, in many real-world scenarios, group-level assessment is an important and meaningful task; e.g., class assessment across regions can reveal differences in teaching quality in different contexts. In this work, we consider assessing the cognitive ability of a group of students, which aims to mine a group’s proficiency on specific knowledge concepts. The significant challenge in this task is the sparsity of group-exercise response data, which seriously affects assessment performance. Existing works either do not make effective use of additional student-exercise response data or fail to reasonably model the relationship between group ability and individual ability in different learning contexts, resulting in sub-optimal diagnosis results. To this end, we propose a general Multi-Task based Group-Level Cognitive Diagnosis (MGCD) framework, which features three special designs: 1) We jointly model student-exercise responses and group-exercise responses in a multi-task manner to alleviate the sparsity of group-exercise responses; 2) We design a context-aware attention network to model the relationship between student knowledge state and group knowledge state in different contexts; 3) We model an interpretable cognitive layer to obtain student ability, group ability, and exercise factors (e.g., difficulty), and then leverage neural networks to learn complex interaction functions among them. Extensive experiments on real-world datasets demonstrate the generality of MGCD and the effectiveness of our attention design and multi-task learning.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114401566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DAC-ML: Domain Adaptable Continuous Meta-Learning for Urban Dynamics Prediction","authors":"Xin Zhang, Yanhua Li, Xun Zhou, Oren Mangoubi, Ziming Zhang, Vincent Filardi, Jun Luo","doi":"10.1109/ICDM51629.2021.00102","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00102","url":null,"abstract":"Given the underlying road network of an urban area, the problem of urban dynamics prediction aims to capture the patterns of urban dynamics and to forecast short-term urban traffic status continuously from the historical observations. This problem is of fundamental importance to urban traffic management, planning, and various business services. However, predicting urban dynamics is challenging due to the highly dynamic (i.e., varying across geographical locations and evolving over time) and uncertain (i.e., affected by unexpected factors) nature of urban traffic systems. Recent works adopt meta-learning approaches to capture irregular and rare patterns but make unrealistic assumptions such as single-domain uncertainties and explicit temporal task segmentation. In this paper, we solve the urban dynamics prediction problem from the Bayesian meta-learning perspective and propose a novel domain adaptable continuous meta-learning approach (DAC-ML) that does not require task segmentation. Trained on a sequence of spatial-temporal urban dynamics data, DAC-ML aims to detect and infer unobserved latent variations (from task and domain levels) and generalize well in a sequential prediction setting, where the underlying data generating process varies over time. Experimental results on three real-world datasets demonstrate that DAC-ML can outperform baselines in urban dynamics prediction, especially when obvious urban dynamics and temporal uncertainties are present.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"532 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116494069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial Learning of Balanced Triangles for Accurate Community Detection on Signed Networks","authors":"Yoonsuk Kang, Woncheol Lee, Yeon-Chang Lee, Kyungsik Han, Sang-Wook Kim","doi":"10.1109/ICDM51629.2021.00137","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00137","url":null,"abstract":"In this paper, we propose a framework for embedding-based community detection on signed networks. It first represents all the nodes of a signed network as vectors in a low-dimensional embedding space and conducts a clustering algorithm (e.g., k-means) on the vectors, thereby detecting a community structure in the network. When performing the embedding process, our framework learns only the edges belonging to balanced triangles whose edge signs follow the balance theory, significantly excluding noise edges from learning. To address the sparsity of balanced triangles in a signed network, our framework learns not only the edges in balanced real-triangles but also those in balanced virtual-triangles produced by our generator. Finally, our framework employs adversarial learning to generate more-realistic balanced virtual-triangles with fewer noise edges. Through extensive experiments using seven real-world networks, we validate the effectiveness of (1) learning edges belonging to balanced real/virtual-triangles and (2) employing adversarial learning for signed network embedding. We show that our framework consistently and significantly outperforms the state-of-the-art community detection methods on all datasets.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128500168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Stochastic Neural Network via Feature Distribution Calibration","authors":"Han Yang, Min Wang, Yun Zhou, Yongxin Yang","doi":"10.1109/ICDM51629.2021.00186","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00186","url":null,"abstract":"Stochastic neural networks (SNNs) have attracted increasing attention in recent years, as modeling sample uncertainty benefits several important tasks, such as adversarial defense, label noise robustness, and model calibration. Current implementations of stochastic neural networks rely mainly on Gaussian noise injection; e.g., the deep Variational Information Bottleneck (VIB) uses a fixed Gaussian prior to derive noise injection, and the simple and effective stochastic neural network (SE-SNN) uses a non-informative Gaussian prior. However, the Gaussian distribution assumption is insufficient to model the more complex distributions of data encountered in practice, such as skewed or multi-modal distributions. In this paper, we relax the strict Gaussian prior assumption and propose a novel distribution calibrated stochastic neural network (DCSNN) that integrates two successive steps: 1) The trained feature vector is preprocessed to bring its feature distribution closer to a Gaussian-like distribution. 2) The mean and variance of a Gaussian distribution are used to model the sample’s activation indeterminacy. The experimental results show that, compared with existing methods, our proposed method achieves state-of-the-art results across a variety of datasets, backbone architectures, and applications.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134166263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressibility of Distributed Document Representations","authors":"Blaž Škrlj, Matej Petković","doi":"10.1109/ICDM51629.2021.00166","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00166","url":null,"abstract":"Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec or similar. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient not only to significantly compress the initial representation but also to potentially improve its performance when considering the task of text classification. Smaller and less noisy representations are a desirable property during deployment, as models that are orders of magnitude smaller can significantly reduce the computational overhead and with it the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. CORE’s performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explore CORE’s behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"175 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116135601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Isolation Kernel Density Estimation","authors":"K. Ting, Takashi Washio, Jonathan R. Wells, Hang Zhang","doi":"10.1109/ICDM51629.2021.00073","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00073","url":null,"abstract":"This paper shows that an adaptive kernel density estimator (KDE) can be derived effectively from Isolation Kernel. Existing adaptive KDEs often employ a data-independent kernel such as the Gaussian kernel and therefore require an additional means to adapt the bandwidth locally in a given dataset. Because Isolation Kernel is a data-dependent kernel derived directly from data, no additional adaptive operation is required. The resulting estimator, called IKDE, is the only KDE that is both fast and adaptive; existing KDEs are either fast but non-adaptive, or adaptive but slow. In addition, using IKDE for anomaly detection, we identify two advantages of IKDE over LOF (Local Outlier Factor), contributing to significantly faster runtime.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125805104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Principal Component Analysis for Genome-Wide Association Studies","authors":"Anne Hartebrodt, Reza Nasirigerdeh, David B. Blumenthal, Richard Röttger","doi":"10.1109/ICDM51629.2021.00127","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00127","url":null,"abstract":"Federated learning (FL) has emerged as a privacy-aware alternative to centralized data analysis, especially for biomedical analyses such as genome-wide association studies (GWAS). The data remains with the owner, which enables studies previously impossible due to privacy protection regulations. Principal component analysis (PCA) is a frequent preprocessing step in GWAS, where the eigenvectors of the sample-by-sample covariance matrix are used as covariates in the statistical tests. Therefore, a federated version of PCA suitable for vertical data partitioning is required for federated GWAS. Existing federated PCA algorithms exchange the complete sample eigenvectors, a potential privacy breach. In this paper, we present a federated PCA algorithm for vertically partitioned data which does not exchange the sample eigenvectors and is hence suitable for federated GWAS. We show that it outperforms existing federated solutions in terms of convergence behavior and scalability. Additionally, we provide a user-friendly privacy-aware web tool to promote acceptance of federated PCA among GWAS researchers.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125880933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast computation of distance-generalized cores using sampling","authors":"Nikolaj Tatti","doi":"10.1007/s10115-023-01830-9","DOIUrl":"https://doi.org/10.1007/s10115-023-01830-9","url":null,"abstract":"","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123816631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}