{"title":"Detecting fuzzy-rough conditional anomalies","authors":"Qian Hu , Zhong Yuan , Jusheng Mi , Jun Zhang","doi":"10.1016/j.ins.2024.121560","DOIUrl":"10.1016/j.ins.2024.121560","url":null,"abstract":"<div><div>The purpose of conditional anomaly detection is to identify samples that significantly deviate from the majority of other samples under specific conditions within a dataset. It has been successfully applied to numerous practical scenarios such as forest fire prevention, gas well leakage detection, and remote sensing data analysis. To address the conditional anomaly detection problem, this paper utilizes the characteristics of fuzzy rough set theory to construct a conditional anomaly detection method that can effectively handle numerical or mixed attribute data. By defining the fuzzy inner boundary, the subset of contextual data is first divided into two parts, i.e., the fuzzy lower approximation and the fuzzy inner boundary. Subsequently, the fuzzy inner boundary is further divided into two distinct segments: the fuzzy abnormal boundary and the fuzzy main boundary. At this point, three-way regions can be obtained, i.e., the fuzzy abnormal boundary, the fuzzy main boundary, and the fuzzy lower approximation. Then, a fuzzy-rough conditional anomaly detection model is constructed based on the above three-way regions. 
Finally, a related algorithm is proposed for the detection model and its effectiveness is verified by data experiments.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121560"},"PeriodicalIF":8.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Developing Big Data anomaly dynamic and static detection algorithms: AnomalyDSD spark package","authors":"Diego García-Gil , David López , Daniel Argüelles-Martino , Jacinto Carrasco , Ignacio Aguilera-Martos , Julián Luengo , Francisco Herrera","doi":"10.1016/j.ins.2024.121587","DOIUrl":"10.1016/j.ins.2024.121587","url":null,"abstract":"<div><h3>Background</h3><div>Anomaly detection is the process of identifying observations that differ greatly from the majority of data. Unsupervised anomaly detection aims to find outliers in data that is not labeled; therefore, the anomalous instances are unknown. Exponential data generation has led to the era of Big Data. This scenario brings new challenges to classic anomaly detection problems due to the massive and unsupervised accumulation of data. Traditional methods are not able to cope with the computing and time requirements of Big Data problems.</div></div><div><h3>Methods</h3><div>In this paper, we propose four distributed algorithm designs for Big Data anomaly detection problems: HBOS_BD, LODA_BD, LSCP_BD, and XGBOD_BD. They have been designed following the MapReduce distributed methodology in order to be capable of handling Big Data problems.</div></div><div><h3>Results</h3><div>These algorithms have been integrated into a Spark package, namely AnomalyDSD, focused on static and dynamic Big Data anomaly detection tasks. 
Experiments using a real-world case study have shown the performance and validity of the proposals for Big Data problems.</div></div><div><h3>Conclusions</h3><div>With this proposal, we have enabled the practitioner to efficiently and effectively detect anomalies in Big Data datasets, where the early detection of an anomaly can lead to a proper and timely decision.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121587"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Significance-based decision tree for interpretable categorical data clustering","authors":"Lianyu Hu, Mudi Jiang, Xinying Liu, Zengyou He","doi":"10.1016/j.ins.2024.121588","DOIUrl":"10.1016/j.ins.2024.121588","url":null,"abstract":"<div><div>Numerous clustering algorithms prioritize accuracy, but in high-risk domains, the interpretability of clustering methods is crucial as well. The inherent heterogeneity of categorical data makes it particularly challenging for users to comprehend clustering outcomes. Currently, the majority of interpretable clustering methods are tailored for numerical data and utilize decision tree models, leaving interpretable clustering for categorical data as a less explored domain. Additionally, existing interpretable clustering algorithms often depend on external, potentially non-interpretable algorithms and lack transparency in the decision-making process during tree construction. In this paper, we tackle the problem of interpretable categorical data clustering by growing a decision tree in a statistically meaningful manner. We formulate the evaluation of candidate splits as a multivariate two-sample testing problem, where a single <em>p</em>-value is derived by combining significance evidence from all individual categories. This approach provides a reliable and controllable method for selecting the optimal split while determining its statistical significance. 
Extensive experimental results on real-world data sets demonstrate that our algorithm achieves comparable performance in terms of cluster quality, running efficiency, and explainability relative to its counterparts.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121588"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142539058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel self-training framework for semi-supervised soft sensor modeling based on indeterminate variational autoencoder","authors":"Hengqian Wang , Lei Chen , Kuangrong Hao , Xin Cai , Bing Wei","doi":"10.1016/j.ins.2024.121565","DOIUrl":"10.1016/j.ins.2024.121565","url":null,"abstract":"<div><div>In modern industrial processes, the high acquisition cost of labeled data can lead to a large number of unlabeled samples, which greatly impacts the accuracy of traditional soft sensor models. To this end, this paper proposes a novel semi-supervised soft sensor framework that can fully utilize the unlabeled data to expand the original labeled data, and ultimately improve the prediction accuracy. Specifically, an indeterminate variational autoencoder (IVAE) is first proposed to obtain pseudo-labels and their uncertainties for unlabeled data. On this basis, the IVAE-based self-training (ST-IVAE) framework is then proposed to expand the original small labeled dataset through continuous iteration. Within this framework, a variance-based oversampling (VOS) strategy is introduced to better utilize the pseudo-label uncertainty. Each sample can then be independently modeled for prediction by determining similar sample sets through comparison of the Kullback-Leibler (KL) divergences obtained from the proposed IVAE model. The effectiveness of the proposed semi-supervised framework is verified on two real industrial processes. 
Comparative results illustrate that, compared to state-of-the-art methodologies for semi-supervised soft sensing, the ST-IVAE framework can still predict well even in the presence of missing input data.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121565"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Some notes on the consequences of pretreatment of multivariate data","authors":"Ali S. Hadi , Rida Moustafa","doi":"10.1016/j.ins.2024.121580","DOIUrl":"10.1016/j.ins.2024.121580","url":null,"abstract":"<div><div>With the advent of data technologies, we have various types of data, such as structured, unstructured and semi-structured. Performing certain statistical or machine learning techniques may require careful preprocessing or pretreatment of the data to make them suitable for analysis. For example, given a data matrix <strong>X</strong>, which represents <em>n</em> multivariate observations or cases on <em>p</em> variables or features, the columns/rows of <strong>X</strong> may be pretreated before applying statistical or machine learning techniques to the data. While centering and/or scaling the variables alters neither the correlation structure nor the graphical representation of the data, centering/scaling the observations does. We investigate various row pretreatment methods more closely and show with theoretical proofs and numerical examples that centering/scaling the rows of <strong>X</strong> changes both the graphical structure of the observations in the multi-dimensional space and the correlation structure among the variables. 
There may be good reasons for performing row centering/scaling on the data, and we are not against it, but analysts who use such row operations should be aware of how these operations change the geometrical and correlation structures of the data and should also demonstrate that the process results in a new, more appropriate structure for their questions.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121580"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A diversity and reliability-enhanced synthetic minority oversampling technique for multi-label learning","authors":"Yanlu Gong , Quanwang Wu , Mengchu Zhou , Chao Chen","doi":"10.1016/j.ins.2024.121579","DOIUrl":"10.1016/j.ins.2024.121579","url":null,"abstract":"<div><div>The class imbalance issue is generally intrinsic in multi-label datasets due to the fact that they have a large number of labels and each sample is associated with only a few of them. This causes the trained multi-label classifier to be biased towards the majority labels. Multi-label oversampling methods have been proposed to handle this issue, and they fall into clone-based and Synthetic Minority Oversampling TEchnique-based (SMOTE-based) ones. However, the former duplicates minority samples and may result in over-fitting whereas the latter may generate unreliable synthetic samples. In this work, we propose a Diversity and Reliability-enhanced SMOTE for multi-label learning (DR-SMOTE). In it, the minority classes are determined according to their label imbalance ratios. A reliable minority sample is used as a seed to generate a synthetic one while a reference sample is selected for it to confine the synthesis region. Features of the synthetic samples are determined probabilistically in this region and their labels are set identically to those of the seeds. We carry out experiments with eleven multi-label datasets to compare DR-SMOTE against seven existing resampling methods based on four base multi-label classifiers. 
The experimental results demonstrate DR-SMOTE’s superiority over its peers in terms of several evaluation metrics.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121579"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sample feature enhancement model based on heterogeneous graph representation learning for few-shot relation classification","authors":"Zhezhe Xing , Yuxin Ye , Rui Song , Yun Teng , Ziheng Li , Jiawen Liu","doi":"10.1016/j.ins.2024.121583","DOIUrl":"10.1016/j.ins.2024.121583","url":null,"abstract":"<div><div>Few-Shot Relation Classification (FSRC) aims to predict novel relationships by learning from limited samples. Graph Neural Network (GNN) approaches for FSRC construct data as graphs, effectively capturing sample features through graph representation learning. However, they often face several challenges: 1) They tend to neglect the interactions between samples from different support sets and overlook the implicit noise in labels, leading to sub-optimal sample feature generation. 2) They struggle to deeply mine the diverse semantic information present in FSRC data. 3) Over-smoothing and overfitting limit the model's depth and adversely affect overall performance. To address these issues, we propose a Sample Representation Enhancement model based on Heterogeneous Graph Neural Network (SRE-HGNN) for FSRC. This method leverages inter-sample and inter-class associations (i.e., label mutual attention) to effectively fuse features and generate more expressive sample representations. Edge-heterogeneous GNNs are employed to enhance sample features by capturing heterogeneous information of varying depths through different edge attentions. Additionally, we introduce an attention-based neighbor node culling method, enabling the model to stack more layers and extract deeper inter-sample associations, thereby improving performance. 
Finally, experiments are conducted for the FSRC task, and SRE-HGNN achieves average accuracy improvements of 1.84% and 1.02% on two public datasets, respectively.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121583"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142539059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SGO: An innovative oversampling approach for imbalanced datasets using SVM and genetic algorithms","authors":"Jianfeng Deng, Dongmei Wang, Jinan Gu, Chen Chen","doi":"10.1016/j.ins.2024.121584","DOIUrl":"10.1016/j.ins.2024.121584","url":null,"abstract":"<div><div>Imbalanced datasets present a challenging problem in machine learning and artificial intelligence. Since most models typically assume balanced data distributions, imbalanced positive and negative examples can lead to significant bias in prediction or classification tasks. Current over-sampling methods frequently encounter issues like overfitting and boundary bias. A novel imbalanced data augmentation technique called SVM-GA over-sampling (SGO) is proposed in this paper, which integrates Support Vector Machines (SVM) with Genetic Algorithms (GA). Our approach leverages SVM to identify the decision boundary and uses GA to generate new minority samples along this boundary, effectively addressing both over-fitting and boundary biases. It has been experimentally validated that SGO outperforms the traditional methods on most datasets, providing a novel and effective approach to address imbalanced data problems, with potential application prospects and generalization value.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121584"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PMformer: A novel informer-based model for accurate long-term time series prediction","authors":"Yuewei Xue, Shaopeng Guan, Wanhai Jia","doi":"10.1016/j.ins.2024.121586","DOIUrl":"10.1016/j.ins.2024.121586","url":null,"abstract":"<div><div>When applied to long-term time series forecasting, Informer struggles to capture temporal dependencies effectively, leading to suboptimal forecasting accuracy. To address this issue, we propose PMformer, a novel model based on Informer for long-term time series prediction. First, we introduce a probabilistic patch sampling attention mechanism that utilizes a patch-based strategy to compute attention scores within randomly selected sequence patches. This localized approach enhances the model's capability to capture local temporal dependencies, allowing it to better understand and process critical local features in time series while reducing computational complexity. Additionally, we propose a multi-scale scaling sparse attention technique that balances attention distribution by combining coarse- and fine-grained attention scores, thereby improving the model's ability to capture global sequence information. Finally, we design a dilated causal pooling layer and a multilayer perceptual cross self-attention decoder to further enhance the model's prediction accuracy by capturing key information in long-term correlations and precisely focusing on sequences. We conducted experiments on both multivariate and univariate time series forecasting tasks. The results show that PMformer outperforms six baseline models, including PatchTST and FEDformer, in terms of MAE and MSE metrics. 
This demonstrates its superior ability to capture temporal dependencies, achieving more accurate predictions.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121586"},"PeriodicalIF":8.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142530662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TS-MAE: A masked autoencoder for time series representation learning","authors":"Qian Liu , Junchen Ye , Haohan Liang , Leilei Sun , Bowen Du","doi":"10.1016/j.ins.2024.121576","DOIUrl":"10.1016/j.ins.2024.121576","url":null,"abstract":"<div><div>Self-supervised learning (SSL) has been widely researched in recent years. In particular, generative self-supervised learning methods have achieved remarkable success in many AI domains, such as MAE in computer vision, the well-known BERT and GPT in natural language processing, and GraphMAE in graph learning. However, in the context of time series analysis, not only is the work that follows this line limited, but the performance has also not reached the potential promised in other fields. To fill this gap, we propose a simple and elegant masked autoencoder for time series representation learning. Firstly, unlike most existing work, which uses the Transformer as the backbone, we build our model based on neural ordinary differential equations, which possess excellent mathematical properties. Compared with the position encoding in Transformer, modeling the evolution patterns continuously could better extract the temporal dependency. Secondly, a timestamp-wise mask strategy is provided to cooperate with the autoencoder to avoid bias, and it can also reduce the cross-imputation between variables to learn more robust representations. 
Lastly, extensive experiments conducted on two classical tasks demonstrate the superiority of our model over the state-of-the-art ones.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121576"},"PeriodicalIF":8.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}