{"title":"Adversarial Removal of Population Bias in Genomics Phenotype Prediction","authors":"Honggang Zhao, Wenlu Wang","doi":"10.1109/ICDMW58026.2022.00052","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00052","url":null,"abstract":"Many factors impact trait prediction from genotype data. One of the major confounding factors comes from the presence of population structure among sampled individuals, namely population stratification. When it exists, it leads to biased quantitative phenotype prediction, hampering unambiguous conclusions about prediction and limiting downstream uses such as disease evaluation or epidemiological surveys. Population stratification is an implicit bias that cannot be easily removed by data preprocessing. With the purpose of training a phenotype prediction model, we propose an adversarial training framework that ensures the genomics encoder is agnostic to sample populations. For better generalization, our adversarial training framework is orthogonal to the genomics encoder and phenotype prediction model. We experimentally validate our debiasing framework by testing on a real-world yield (phenotype) prediction dataset with soybean genomics. The developed framework is designed for general genomic data (e.g., human, livestock, and crops), while the phenotype can be either a continuous or a categorical variable.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Deep Knowledge Tracing","authors":"Wenxin Zhang, Yupei Zhang, Shuhui Liu, Xuequn Shang","doi":"10.1109/ICDMW58026.2022.00047","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00047","url":null,"abstract":"This study focuses on solving the problem of knowledge tracing in a practical situation, where the responses from students come in a stream. Most current works on deep knowledge tracing seek to integrate more side information or data structures, but they often fail to self-update in a dynamic learning situation. To this end, we propose an online deep knowledge tracing model, dubbed ODKT, which uses the online gradient descent algorithm to extend traditional deep knowledge tracing (DKT) to online learning. Rather than learning a perfect model, ODKT aims to train DKT step by step while it is in use. Experiments were conducted on four public datasets for knowledge tracing. The results demonstrate that the ODKT model is effective and more suitable for practical applications.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127975768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Backdoor Poisoning of Encrypted Traffic Classifiers","authors":"J. Holodnak, Olivia Brown, J. Matterer, Andrew Lemke","doi":"10.1109/ICDMW58026.2022.00080","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00080","url":null,"abstract":"Significant recent research has focused on applying deep neural network models to the problem of network traffic classification. At the same time, much has been written about the vulnerability of deep neural networks to adversarial inputs, both during training and inference. In this work, we consider launching backdoor poisoning attacks against an encrypted network traffic classifier. We consider attacks based on padding network packets, which has the benefit of preserving the functionality of the network traffic. In particular, we consider a handcrafted attack, as well as an optimized attack leveraging universal adversarial perturbations. We find that poisoning attacks can be extremely successful if the adversary has the ability to modify both the labels and the data (dirty label attacks) and somewhat successful, depending on the attack strength and the target class, if the adversary perturbs only the data (clean label attacks).","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114328334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Above Ground Biomass Estimation of a Cocoa Plantation using Machine Learning","authors":"Sabrina Sankar, Marvin B. Lewis, Patrick Hosein","doi":"10.1109/ICDMW58026.2022.00147","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00147","url":null,"abstract":"The rapid increase in carbon dioxide in the atmosphere and its associated effects on climate change and global warming have raised the importance of monitoring carbon sequestration levels. Estimating above ground biomass (AGB) is one way of monitoring carbon sequestration in forested areas. Quantifying above ground biomass using direct methods is costly, time-consuming and, in many cases, impractical. However, remote sensing technologies such as LiDAR (Light Detection And Ranging) capture three-dimensional information which can be used to perform this estimation. In particular, LiDAR can be used to estimate the diameter of a tree at breast height (DBH), and from this we can estimate its AGB. For this research we used LiDAR data, along with various Machine Learning (ML) algorithms (Multiple Linear Regression, Random Forest, Support Vector Regression and Regression Tree), to estimate the DBH of cocoa trees. Various feature selection methods were used to select the most significant features for our model. The best performing algorithm was Random Forest, which achieved an R² value of 0.83 and a Root Mean Square Error (RMSE) value of 0.062. This algorithm then estimated an AGB value of 28.75 ± 2.34 Mg/ha (megagrams per hectare). We compared this result with that obtained from locally-developed allometric equations for the same cocoa plot. The comparison showed our estimate to be 14.7% lower than that of the allometric equations. The results demonstrated that using ML with LiDAR measurements for AGB estimation is quite promising.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Modeling of Hierarchical Time Series Using Autoregressive Transformer and Conditional Normalizing Flow-based Reconciliation","authors":"Shiyu Wang, Fan Zhou, Yinbo Sun, Lintao Ma, James Zhang, Yang Zheng, Lei Lei, Yun Hu","doi":"10.1109/ICDMW58026.2022.00141","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00141","url":null,"abstract":"Multivariate time series forecasting with hierarchical structure is pervasive in real-world applications, demanding not only predicting each level of the hierarchy, but also reconciling all forecasts to ensure coherency, i.e., the forecasts should satisfy the hierarchical aggregation constraints. Moreover, the disparities of statistical characteristics between levels can be huge, worsened by non-Gaussian distributions and non-linear correlations. To this end, we propose a novel end-to-end hierarchical time series forecasting model, based on conditioned normalizing flow-based autoregressive transformer reconciliation, to represent complex data distributions while simultaneously reconciling the forecasts to ensure coherency. Unlike other state-of-the-art methods, we achieve the forecasting and reconciliation simultaneously without requiring any explicit post-processing step. In addition, by harnessing the power of deep models, we do not rely on any assumption such as unbiased estimates or Gaussian distribution. Our evaluation experiments are conducted on four real-world hierarchical datasets from different industrial domains (three public ones and a dataset from the application servers of Alipay; footnote 1: Alipay is the world's leading company in payment technology, https://en.wikipedia.org/wiki/Alipay), and the preliminary results demonstrate the efficacy of our proposed method.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124347423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid ConvLSTM Deep Neural Network for Noise Reduction and Data Augmentation for Prediction of Non-linear Dynamics of Streamflow","authors":"J. Rochac, N. Zhang, T. Deksissa, Jiajun Xu, Lara A. Thompson","doi":"10.1109/ICDMW58026.2022.00146","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00146","url":null,"abstract":"Long Short-Term Memory (LSTM) models are at the cutting edge of artificial learning and ecoinformatics with regard to water quantity prediction. However, one driver for more accurate, efficient, and robust water pollution prediction methods is climate change, and in particular global sea level rise. Statistical systems are no longer reliable, and new prediction models need to be explored due to the increasing nonlinearity of streamflow predictors and extreme sea level changes. Another driver is that, in places with legacy infrastructure, updated water monitoring systems and unreliable forecasting frameworks, state-of-the-art LSTM-based models suffer due to the presence of noisy data. This paper proposes multiple LSTM-based models with Scharr filtering to improve the streamflow prediction accuracy against noise. A hybrid ConvLSTM approach is realized to overcome the nonlinearity of the main predictors and the noise. The evaluation results demonstrate that the proposed hybrid ConvLSTM model can effectively improve the overall prediction accuracy for both real-world data and the noise-augmented data. The hybrid ConvLSTM model also obtained competitive and even better performance compared with several state-of-the-art methods. In addition, our proposed design achieves comparable performance in terms of prediction time.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123266297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EnD: Enhanced Dedensification for Graph Compressing and Embedding","authors":"Tanvir Hossain, Esra Akbas, Muhammad Ifte Khairul Islam","doi":"10.1109/ICDMW58026.2022.00092","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00092","url":null,"abstract":"Graph representation learning is essential in applying machine learning methods on large-scale networks. Several embedding approaches have shown promising outcomes in recent years. Nonetheless, on massive graphs, direct application of existing embedding methods may be time-consuming and space-inefficient. This paper presents a novel graph compression approach based on dedensification called Enhanced Dedensification with degree-based compression (EnD). The principal goal of our system is to assure decent compression of large graphs that eloquently favor their representation learning. For this purpose, we first compress the low-degree nodes and dedensify them to reduce the high-degree nodes' loads. Then, we embed the compressed graph instead of the original graph to decrease the representation learning cost. Our approach is a general meta-strategy that attains time and space efficiency over the original graph by applying the state-of-the-art graph embedding methods: Node2vec, DeepWalk, RiWalk, and xNetMf. Comprehensive experiments on large-scale real-world graphs validate the viability of our method, which shows sound performance on single and multi-label node classification tasks without losing accuracy.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127091399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forecasting Unobserved Node States with spatio-temporal Graph Neural Networks","authors":"Andreas Roth, T. Liebig","doi":"10.1109/ICDMW58026.2022.00101","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00101","url":null,"abstract":"Forecasting future states of sensors is key to solving tasks like weather prediction, route planning, and many others when dealing with networks of sensors. But complete spatial coverage of sensors is generally unavailable and would practically be infeasible due to limitations in budget and other resources during deployment and maintenance. Existing machine learning approaches are limited to the spatial locations where data was observed, which constrains downstream tasks. Inspired by the recent surge of Graph Neural Networks for spatio-temporal data processing, we investigate whether these can also forecast the state of locations with no sensors available. For this purpose, we develop a framework, named Forecasting Unobserved Node States (FUNS), that allows forecasting the state at entirely unobserved locations based on spatio-temporal correlations and the graph inductive bias. FUNS serves as a blueprint for optimizing models only on observed data and demonstrates good generalization capabilities for predicting the state at entirely unobserved locations during the testing stage. Our framework can be combined with any spatio-temporal Graph Neural Network that exploits spatio-temporal correlations with surrounding observed locations by using the network's graph structure. Our employed model builds on a previous model by also allowing us to exploit prior knowledge about locations of interest, e.g. the road type. Our empirical evaluation of both simulated and real-world datasets demonstrates that Graph Neural Networks are well-suited for this task.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130665518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MultiAspectEmo: Multilingual and Language-Agnostic Aspect-Based Sentiment Analysis","authors":"Joanna Szolomicka, Jan Kocoń","doi":"10.1109/ICDMW58026.2022.00065","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00065","url":null,"abstract":"The paper addresses the important problem of multilingual and language-agnostic approaches to the aspect-based sentiment analysis (ABSA) task, using modern approaches based on transformer models. We propose a new dataset based on automatic translation of the Polish AspectEmo dataset together with cross-lingual transfer of tags describing aspect polarity. The result is a MultiAspectEmo dataset translated into five other languages: English, Czech, Spanish, French and Dutch. In this paper, we also present the original TrAsp (Transformer-based Aspect Extraction and Classification) method, which is significantly better than methods from the literature in the ABSA task. In addition, we present multilingual and language-agnostic variants of this method, evaluated on the MultiAspectEmo and also the SemEval2016 datasets. We also test various language models for the ABSA task, including compressed models that give promising results while significantly reducing inference time and memory usage.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133207528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Heterogeneous Graph Neural Networks via Similarity Regularization Loss and Hierarchical Fusion","authors":"Zhilong Xiong, Jia Cai","doi":"10.1109/ICDMW58026.2022.00104","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00104","url":null,"abstract":"Recently, Graph Neural Networks (GNNs) have emerged as a promising and powerful method for tackling graph-structured data. However, most real-world graph-structured data contains distinct types of objects (nodes) and links (edges); such data is called a heterogeneous graph. The heterogeneity and rich semantic information increase the difficulty of handling heterogeneous graphs. Most current heterogeneous graph neural networks (HeteGNNs) can only be built on a very shallow structure. This is caused by a phenomenon called semantic confusion, where the node embeddings become indistinguishable as model depth grows, leading to the degradation of model performance. In this paper, we address this problem by proposing heterogeneous graph neural networks based on a similarity regularization loss and hierarchical fusion (SHGNN). The hierarchical fusion strategy is used to fuse the features of the node embeddings at each layer, which can improve the expressive power of the model, and a similarity regularization loss is then introduced, by which the problem of indistinguishability among nodes can be alleviated. Our approach can be flexibly combined with various HeteGNNs. Experimental results on real-world heterogeneous graph-structured data demonstrate the state-of-the-art performance of the proposed approach, which can efficiently mitigate the semantic confusion problem.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134336843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}