{"title":"Feature Extraction and Prediction of Combined Text and Survey Data using Two-Staged Modeling","authors":"A. A. Neloy, M. Turgeon","doi":"10.1109/ICDMW58026.2022.00064","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00064","url":null,"abstract":"Deep learning (DL) based natural language processing (NLP) has recently grown into one of the fastest research domains and retained remarkable improvement in many applications. Due to the significant amount of data, the adaptation of feature learning and symmetric data efficiency is a critical underlying task in such applications. However, their ability to extract features is limited due to a lack of proper model formation. Moreover, the use of these methods on smaller datasets is unexplored and underdeveloped compared to more popular research areas. This work introduces a two-stage modeling approach to combine classical statistical analysis with NLP problems in a real-world dataset. We effectively lay out a combination of the classical statistical model incorporating a stacked ensemble classifier and a DL framework of convolutional neural network (CNN) and Bidirectional Recurrent Neural Networks (Bi-RNN) to structure a more decomposed architecture with lower computational complexity. 
Additionally, the experimental results illustrating 96.69 % training and 70.56 % testing accuracy and hypothesis testing from our DL models followed by an ablation study empirically demonstrate the validation of our proposed combined modeling technique.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116606562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Piccialli, F. Giampaolo, Vincenzo Schiano Di Cola, Federico Gatta, Diletta Chiaro, E. Prezioso, Stefano Izzo, S. Cuomo
{"title":"A machine learning-based approach for mercury detection in marine waters","authors":"F. Piccialli, F. Giampaolo, Vincenzo Schiano Di Cola, Federico Gatta, Diletta Chiaro, E. Prezioso, Stefano Izzo, S. Cuomo","doi":"10.1109/ICDMW58026.2022.00074","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00074","url":null,"abstract":"Thanks to the widespread use of mobile devices, analyses that in the past had to be carried out in specifically designated and equipped laboratories and which required long processing times, may now take place outdoors and in real time. In marine science, for example, the development of a mobile and compact system for the on-site detection of heavy metals contamination in seawater would be helpful for scientists and society in at least two ways: i) reduction of time and costs associated with these experiments; ii) the implementation of a strategy for outdoor analysis, eventually embeddable in a lab-on-hardware system. This paper falls within the context of machine learning (ML) for utility pattern mining applied on interdisciplinary domains: starting from wellplate images, we provide a novel proof-of-concept (PoC) machine learning-based framework to assist scientists in their daily research on seawater samples, proposing a system which first automatically recognises wells in a multiwell and then predicts the degree of fluorescence in each of them, thus showing possible presence of heavy metals.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Udesh Kumarasinghe, Mohamed Nabeel, K. de Zoysa, K. Gunawardana, Charitha Elvitigala
{"title":"HeteroGuard: Defending Heterogeneous Graph Neural Networks against Adversarial Attacks","authors":"Udesh Kumarasinghe, Mohamed Nabeel, K. de Zoysa, K. Gunawardana, Charitha Elvitigala","doi":"10.1109/ICDMW58026.2022.00096","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00096","url":null,"abstract":"Graph neural networks (GNNs) have achieved remarkable success in many application domains including drug discovery, program analysis, social networks, and cyber security. However, it has been shown that they are not robust against adversarial attacks. In the recent past, many adversarial attacks against homogeneous GNNs and defenses have been proposed. However, most of these attacks and defenses are ineffective on heterogeneous graphs as these algorithms optimize under the assumption that all edge and node types are the same, and further they introduce semantically incorrect edges to perturbed graphs. Here, we first develop HetePR-BCD, a training time (i.e. poisoning) adversarial attack on heterogeneous graphs that outperforms the state of the art attacks proposed in the literature. Our experimental results on three benchmark heterogeneous graphs show that our attack, with a small perturbation budget of 15 %, degrades the performance up to 32 % (F1 score) compared to existing ones. It is concerning to mention that existing defenses are not robust against our attack. These defenses primarily modify the GNN's neural message passing operators assuming that adversarial attacks tend to connect nodes with dissimilar features, but this assumption does not hold in heterogeneous graphs. We construct HeteroGuard, an effective defense against training time attacks including HetePR-BCD on heterogeneous models. 
HeteroGuard outperforms the existing defenses by 3–8 % on F1 score depending on the benchmark dataset.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133034498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Degree-Related Bias in Link Prediction","authors":"Yu Wang, Tyler Derr","doi":"10.1109/ICDMW58026.2022.00103","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00103","url":null,"abstract":"Link prediction is a fundamental problem for network-structured data and has achieved unprecedented success in many real-world applications. Despite the significant progress being made towards improving its performance by characterizing underlined topological patterns or leveraging representation learning, few works have focused on the imbalanced performance among nodes of different degrees. In this paper, we propose a novel problem, degree-related bias and evaluation bias, on link prediction with an emphasis on recommender system applications. We first empirically demonstrate the performance difference among nodes with different degrees and then theoretically prove that Recall is an unbiased evaluation metric compared with F1, NDCG and Precision. Furthermore, we show that under the unbiased evaluation metric Recall, low-degree nodes tend to have higher performance than high-degree nodes in link prediction.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133810451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pavel Shumkovskii, A. Kovantsev, Elizaveta Stavinova, P. Chunaev
{"title":"MetaSieve: Performance vs. Complexity Sieve for Time Series Forecasting","authors":"Pavel Shumkovskii, A. Kovantsev, Elizaveta Stavinova, P. Chunaev","doi":"10.1109/ICDMW58026.2022.00037","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00037","url":null,"abstract":"Motivated by the problem of finding optimal Performance vs. Complexity trade-off in the task of forecasting time series data, we propose a model-agnostic method MetaSieve that performs data dichotomy (i.e., in fact, sieves the data instances in a meta-learning manner) according to a chosen quality level while iterating over the model's complexity. The method is inspired by classical iterative numerical optimization ones but is applied to sets of time series. As a result, the method is significantly less time consuming than a traditional brute force-based meta-learning algorithm. It further turns out in the experiments that the MetaSieve quality results are rather comparable to those of the brute force-based one thus one has a noticeable reduction in time consumption in exchange for a slight decrease of forecasting quality. Additionally, we experimentally show a good performance of a MetaSieve-based classifier that provides the Performance vs. Complexity classes a priori, i.e. before the actual forecasting, on synthetic and real-world time series data.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115615492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anne Marthe Sophie Ngo Bibinbe, A. J. Mahamadou, Michael Franklin Mbouopda, E. Nguifo
{"title":"DragStream: An Anomaly And Concept Drift Detector In Univariate Data Streams","authors":"Anne Marthe Sophie Ngo Bibinbe, A. J. Mahamadou, Michael Franklin Mbouopda, E. Nguifo","doi":"10.1109/ICDMW58026.2022.00113","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00113","url":null,"abstract":"Anomaly detection in data streams comes with different technical challenges due to the data nature. The main challenges include storage limitations, the speed of data arrival, and concept drifts. In the literature, methods for mining data streams in order to detect anomalies have been proposed. While some methods focus on tackling a specific issue, other methods handle diverse problems but may have high complexity (time and memory). In the present work, we propose DragStream, a novel subsequence anomaly and concept drift detection algorithm for univariate data streams. DragStream extends the subsequence anomaly detection method for time series data Drag to streaming data. Furthermore, the new method is inspired by the well-known Matrix Profile, Drag, and MILOF which are respectively point and subsequence anomaly detection methods for time series and data streams. We conducted intensive experiments and statistical analysis to evaluate the performance of the proposed approach against existing methods. The results show that our method is competitive in performance while being linear in time and memory complexity. 
Finally, we provide an open-source implementation of the new method.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123658573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reda Khoufache, M. Dilmi, Hanene Azzag, Etienne Gofinnet, M. Lebbah
{"title":"Emerging properties from Bayesian Non-Parametric for multiple clustering: Application for multi-view image dataset","authors":"Reda Khoufache, M. Dilmi, Hanene Azzag, Etienne Gofinnet, M. Lebbah","doi":"10.1109/ICDMW58026.2022.00013","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00013","url":null,"abstract":"Artificial Intelligence (AI) in supermarkets is moving fast with the recent advances in deep learning. One important project in the retail sector is the development of AI solutions for smart stores, mainly to improve product recognition. In this paper, we present a new framework to address the multi-view image classification using multiple clustering. The proposed framework combines a pre-trained Vision Transformer with a Bayesian Non-Parametric multiple clustering. In this work, we propose an MCMC-based inference approach to learn the column-partition and the row-partitions. This method infers multiple clustering solutions and allows to find automatically the number of clusters. Our method provides interesting results on a multi-view image dataset and emphasizes, on one hand, the power of pre-trained Vision Transformers combined with the multiple clustering algorithm, on the other hand, the usefulness of the Bayesian Non-Parametric modeling, which automatically performs a model selection.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123678262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Valuable Fuzzy Patterns via the RFM Model","authors":"Yanlin Qi, Fuyin Lai, Guoting Chen, Wensheng Gan","doi":"10.1109/ICDMW58026.2022.00075","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00075","url":null,"abstract":"This paper aims to propose an effective algorithm to discover valuable patterns by applying the fuzzy method to the RFM model. RFM analysis is a common method in customer relationship management, through which we can identify valuable customer groups. By combining RFM analysis with frequent pattern mining, valuable RFM-patterns can be found from the RFM-pattern-tree, such as the RFMP-growth algorithm. Aiming to mine patterns that have quantitative relationships among items, we introduce the fuzzy method in the RFM model, and we present a fuzzy-Rfu-tree algorithm in which a new pruning strategy is proposed to prune candidate patterns. Experiments show the effectiveness of the new algorithm. The new algorithm guarantees a high overlap degree with the RFM-patterns generated by RFMP-growth, with more valuable information (with additional fuzzy level) in the mined patterns.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126541686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unknown Type Streaming Feature Selection via Maximal Information Coefficient","authors":"Peng Zhou, Yunyun Zhang, Yuan-Ting Yan, Shu Zhao","doi":"10.1109/ICDMW58026.2022.00089","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00089","url":null,"abstract":"Feature selection aims to select an optimal minimal feature subset from the original datasets and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of big data. Most feature selection methods implicitly assume that we can know the feature type (categorical, numerical, or mixed) before learning, then design corresponding measurements to calculate the correlation between features. However, in practical applications, features may be generated dynamically and arrive one by one over time, which we call streaming features. Most existing streaming feature selection methods assume that all dynamically generated features are the same type or assume we can know the feature type for each new arriving feature on the fly, but this is unreasonable and unrealistic. Therefore, this paper firstly studies a practical issue of Unknown Type Streaming Feature Selection and proposes a new method to handle it, named UT-SFS. Extensive experimental results indicate the effectiveness of our new method. 
UT-SFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125894631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nitin Ramrakhiyani, Sangameshwar Patil, Manideep Jella, Alok Kumar, G. Palshikar
{"title":"Extracting Entities and Events from Cyber-Physical Security Incident Reports","authors":"Nitin Ramrakhiyani, Sangameshwar Patil, Manideep Jella, Alok Kumar, G. Palshikar","doi":"10.1109/ICDMW58026.2022.00083","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00083","url":null,"abstract":"Cyber-physical systems are an important part of many industries such as the chemical process industry, manufacturing industry, automobiles, and even sophisticated weaponry. Given the economic importance and influence of these systems, they have increasingly faced the cybersecurity attacks. In this paper, we provide a dataset of real-life security incident reports on cyber-physical systems annotated with entities and events that are important for analysing such security incidents. We analyze and identify the limitations of the 'Domain Objects' in Structured Threat Information Expression (STIX) standard as well as recent research literature for the entity type classification schemes in cybersecurity domain. We propose an updated classification scheme for entity types in the cybersecurity domain. The enhanced coverage provided by the entity scheme is important for automated information extraction and natural language understanding of textual reports containing details of the cybersecurity incident reports. We use deep-learning based sequence labelling techniques and cybersecurity domain specific word embeddings to set up a benchmark for entity and event extraction for cyber-physical security incident report analysis. 
The annotated dataset of real-life industrial security incidents will be made available for research purpose.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121620742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}