{"title":"A discrete equilibrium optimization algorithm for breast cancer diagnosis","authors":"Hichem Haouassi, Rafik Mahdaoui, Ouahiba Chouhal","doi":"10.3233/ida-226665","DOIUrl":"https://doi.org/10.3233/ida-226665","url":null,"abstract":"Illness diagnosis is the essential step in designating a treatment. Nowadays, Technological advancements in medical equipment can produce many features to describe breast cancer disease with more comprehensive and discriminant data. Based on the patient’s medical data, several data-driven models are proposed for breast cancer diagnosis using learning techniques such as naive Bayes, neural networks, and SVM. However, the models generated are hardly understandable, so doctors cannot interpret them. This work aims to study breast cancer diagnosis using the associative classification technique. It generates interpretable diagnosis models. In this work, an associative classification approach for breast cancer diagnosis based on the Discrete Equilibrium Optimization Algorithm (DEOA) named Discrete Equilibrium Optimization Algorithm for Associative Classification (DEOA-AC) is proposed. DEOA-AC aims to generate accurate and interpretable diagnosis rules directly from datasets. Firstly, all features in the dataset that contains continuous values are discretized. Secondly, for each class, a new dataset is created from the original dataset and contains only the chosen class’s instances. Finally, the new proposed DEOA is called for each new dataset to generate an optimal rule set. The DEOA-AC approach is evaluated on five well-known and recently used breast cancer datasets and compared with two recently proposed and three classical breast cancer diagnosis algorithms. The comparison results show that the proposed approach can generate more accurate and interpretable diagnosis models for breast cancer than other algorithms.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"81 1","pages":"1185-1204"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90504168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tianxiang Zheng, Xianmin Wang, Yixiang Chen, Fujia Yu, Jing Li
{"title":"Feature evolvable learning with image streams","authors":"Tianxiang Zheng, Xianmin Wang, Yixiang Chen, Fujia Yu, Jing Li","doi":"10.3233/ida-226799","DOIUrl":"https://doi.org/10.3233/ida-226799","url":null,"abstract":"Feature Evolvable Stream Learning (FESL) has received extensive attentions during the past few years where old features could vanish and new features could appear when learning with streaming data. Existing FESL algorithms are mainly designed for simple datasets with low-dimension features, nevertheless they are ineffective to deal with complex streams such as image sequences. Such crux lies in two facts: (1) the shallow model, which is supported to be feasible for the low-dimension streams, fails to reveal the complex nonlinear patterns of images, and (2) the linear mapping used to recover the vanished features from the new ones is inadequate to reconstruct the old features of image streams. In response, this paper explores a new online learning paradigm: Feature Evolvable Learning with Image Streams (FELIS) which attempts to make the online learners less restrictive and more applicable. In particular, we present a novel ensemble residual network (ERN), in which the prediction is weighted combination of classifiers learnt by the feature representations from several residual blocks, such that the learning is able to start with a shallow network that enjoys fast convergence, and then gradually switch to a deeper model when more data has been received to learn more complex hypotheses. Moreover, we amend the first residual block of ERN as an autoencoder, and then proposed a latent representation mapping (LRM) approach to exploit the relationship between the previous and current feature space of the image streams via minimizing the discrepancy of the latent representations from the two different feature spaces. We carried out experiments on both virtual and real scenarios over large-scale images, and the experimental results demonstrate the effectiveness of the proposed method.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"118 1","pages":"1047-1063"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84934681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jorge Meira, Bruno Veloso, V. Bolón-Canedo, G. Marreiros, Amparo Alonso-Betanzos, João Gama
{"title":"Data-driven predictive maintenance framework for railway systems","authors":"Jorge Meira, Bruno Veloso, V. Bolón-Canedo, G. Marreiros, Amparo Alonso-Betanzos, João Gama","doi":"10.3233/ida-226811","DOIUrl":"https://doi.org/10.3233/ida-226811","url":null,"abstract":"The emergence of the Industry 4.0 trend brings automation and data exchange to industrial manufacturing. Using computational systems and IoT devices allows businesses to collect and deal with vast volumes of sensorial and business process data. The growing and proliferation of big data and machine learning technologies enable strategic decisions based on the analyzed data. This study suggests a data-driven predictive maintenance framework for the air production unit (APU) system of a train of Metro do Porto. The proposed method assists in detecting failures and errors in machinery before they reach critical stages. We present an anomaly detection model following an unsupervised approach, combining the Half-Space-trees method with One Class K Nearest Neighbor, adapted to deal with data streams. We evaluate and compare our approach with the Half-Space-Trees method applied without the One Class K Nearest Neighbor combination. Our model produced few type-I errors, significantly increasing the value of precision when compared to the Half-Space-Trees model. Our proposal achieved high anomaly detection performance, predicting most of the catastrophic failures of the APU train system.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"36 1","pages":"1087-1102"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88704132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved hybrid structure learning strategy for Bayesian networks based on ensemble learning","authors":"Wenlong Gao, Zhimei Zeng, Xiaojie Ma, Yongsong Ke, Minqian Zhi","doi":"10.3233/ida-226818","DOIUrl":"https://doi.org/10.3233/ida-226818","url":null,"abstract":"In the application of Bayesian networks to solve practical problems, it is likely to encounter the situation that the data set is expensive and difficult to obtain in large quantities and the small data set is easy to cause the inaccuracy of Bayesian network (BN) scoring functions, which affects the BN optimization results. Therefore, how to better learn Bayesian network structures under a small data set is an important problem we need to pay attention to and solve. This paper introduces the idea of parallel ensemble learning and proposes a new hybrid Bayesian network structure learning algorithm. The algorithm adopts the elite-based structure learner using genetic algorithm (ESL-GA) as the base learner. Firstly, the adjacency matrices of the network structures learned by ESL-GA are weighted and averaged. Then, according to the preset threshold, the edges between variables with weak dependence are filtered to obtain a fusion matrix. Finally, the fusion matrix is modified as the adjacency matrix of the integrated Bayesian network so as to obtain the final Bayesian network structure. Comparative experiments on the standard Bayesian network data sets show that the accuracy and reliability of the proposed algorithm are significantly better than other algorithms.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"2 1","pages":"1103-1120"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89915534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel and balanced SVM algorithm on spark for data-intensive computing","authors":"Jianjiang Li, Jinliang Shi, Zhiguo Liu, Can Feng","doi":"10.3233/ida-226774","DOIUrl":"https://doi.org/10.3233/ida-226774","url":null,"abstract":"Support Vector Machine (SVM) is a machine learning with excellent classification performance, which has been widely used in various fields such as data mining, text classification, face recognition and etc. However, when data volume scales to a certain level, the computational time becomes too long and the efficiency becomes low. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which is optimized on the basis of the traditional Cascade SVM algorithm. PB-SVM contains three parts, i.e., Clustering Equal Division, Balancing Shuffle and Iteration Termination, which solves the problems of data skew of Cascade SVM and the large difference between local support vector and global support vector. We implement PB-SVM in AliCloud Spark distributed cluster with five kinds of public datasets. Our experimental results show that in the two-classification test on the dataset covtype, compared with MLlib-SVM and Cascade SVM on Spark, PB-SVM improves efficiency by 38.9% and 75.4%, and the accuracy is improved by 7.16% and 8.38%. Moreover, in the multi-classification test, compared with Cascade SVM on Spark on the dataset covtype, PB-SVM improves efficiency and accuracy by 94.8% and 18.26% respectively.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"58 1","pages":"1065-1086"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80024805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Filter pruning via feature map clustering","authors":"Wei Li, Yongxing He, Xiaoyu Zhang, Yongchuan Tang","doi":"10.3233/ida-226810","DOIUrl":"https://doi.org/10.3233/ida-226810","url":null,"abstract":"With the help of network compression algorithms, deep neural networks can be applied on low-power embedded systems and mobile devices such as drones, satellites, and smartphones. Filter pruning is a sub-direction of network compression research, which reduces memory and computational consumption by reducing the number of parameters of model filters. Previous works utilized the “more-simple-less-important” criterion for pruning filters. That is, filters with the smaller norm or more sparse weights in the network are preferentially pruned. In this paper, we found that feature maps are not fully positively correlated with the sparsity of filter weights by observing the visualization of feature maps and the corresponding filters. Hence, we came up with the idea that the priority of filter pruning should be determined by redundancy rather than sparsity. The redundancy of a filter is the measure of whether the output of the filter is repeated with other filters. Based on this, we defined a criterion called redundancy index to rank the filters and introduced it into our filter pruning strategy. Extensive experiments demonstrate the effectiveness of our approach on different model architectures, including VGGNet, GoogleNet, DenseNet, and ResNet. The models compressed with our strategy surpass the state-of-the-art in terms of Floating Point Operations Per Second (FLOPs), parameters reduction, and classification accuracy.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"10 1","pages":"911-933"},"PeriodicalIF":1.7,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87416909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active ordinal classification by querying relative information","authors":"Deniu He","doi":"10.3233/ida-226899","DOIUrl":"https://doi.org/10.3233/ida-226899","url":null,"abstract":"Collecting and learning with auxiliary information is a way to further reduce the labeling cost of active learning. This paper studies the problem of active learning for ordinal classification by querying low-cost relative information (instance-pair relation information) through pairwise queries. Two challenges in this study that arise are how to train an ordinal classifier with absolute information (labeled data) and relative information simultaneously and how to select appropriate query pairs for querying. To solve the first problem, we convert the absolute and relative information into the class interval-labeled training instances form by introducing a class interval concept and two reasoning rules. Then, we design a new ordinal classification model for learning with the class interval-labeled training instances. For query pair selection, we specify that each query pair consists of an unlabeled instance and a labeled instance. The unlabeled instance is selected by a margin-based critical instance selection method, and the corresponding labeled instance is selected based on an expected cost minimization strategy. Extensive experiments on twelve public datasets validate that the proposed method is superior to the state-of-the-art methods.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"12 1 1","pages":"977-1002"},"PeriodicalIF":1.7,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78330420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Safe co-training for semi-supervised regression","authors":"Liyan Liu, P. Huang, Hong Yu, Fan Min","doi":"10.3233/ida-226718","DOIUrl":"https://doi.org/10.3233/ida-226718","url":null,"abstract":"Co-training is a popular semi-supervised learning method. The learners exchange pseudo-labels obtained from different views to reduce the accumulation of errors. One of the key issues is how to ensure the quality of pseudo-labels. However, the pseudo-labels obtained during the co-training process may be inaccurate. In this paper, we propose a safe co-training (SaCo) algorithm for regression with two new characteristics. First, the safe labeling technique obtains pseudo-labels that are certified by both views to ensure their reliability. It differs from popular techniques of using two views to assign pseudo-labels to each other. Second, the label dynamic adjustment strategy updates the previous pseudo-labels to keep them up-to-date. These pseudo-labels are predicted using the augmented training data. Experiments are conducted on twelve datasets commonly used for regression testing. Results show that SaCo is superior to other co-training style regression algorithms and state-of-the-art semi-supervised regression algorithms.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"164 S343","pages":"959-975"},"PeriodicalIF":1.7,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72407843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic mutual information-based feature selection for multi-label learning","authors":"Kyung-jun Kim, C. Jun","doi":"10.3233/ida-226666","DOIUrl":"https://doi.org/10.3233/ida-226666","url":null,"abstract":"In classification problems, feature selection is used to identify important input features to reduce the dimensionality of the input space while improving or maintaining classification performance. Traditional feature selection algorithms are designed to handle single-label learning, but classification problems have recently emerged in multi-label domain. In this study, we propose a novel feature selection algorithm for classifying multi-label data. This proposed method is based on dynamic mutual information, which can handle redundancy among features controlling the input space. We compare the proposed method with some existing problem transformation and algorithm adaptation methods applied to real multi-label datasets using the metrics of multi-label accuracy and hamming loss. The results show that the proposed method demonstrates more stable and better performance for nearly all multi-label datasets.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"153 1","pages":"891-909"},"PeriodicalIF":1.7,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73720704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RegRL-KG: Learning an L1 regularized reinforcement agent for keyphrase generation","authors":"Yu Yao, Peng Yang, Guangzhen Zhao, Juncheng Leng","doi":"10.3233/ida-226561","DOIUrl":"https://doi.org/10.3233/ida-226561","url":null,"abstract":"Keyphrase generation (KG) aims at condensing the content from the source text to the target concise phrases. Though many KG algorithms have been proposed, most of them are tailored into deep learning settings with various specially designed strategies and may fail in solving the bias exposure problem. Reinforcement Learning (RL), a class of control optimization techniques, are well suited to compensate for some of the limitations of deep learning methods. Nevertheless, RL methods typically suffer from four core difficulties in keyphrase generation: environment interaction and effective exploration, complex action control, reward design, and task-specific obstacle. To tackle this difficult but significant task, we present RegRL-KG, including actor-critic based-reinforcement learning control and L1 policy regularization under the first principle of minimizing the maximum likelihood estimation (MLE) criterion by a sequence-to-sequence (Seq2Seq) deep learnining model, for efficient keyphrase generation. The agent utilizes an actor-critic network to control the generated probability distribution and employs L1 policy regularization to solve the bias exposure problem. Extensive experiments show that our method brings improvement in terms of the evaluation metrics on five scientific article benchmark datasets.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"104 1","pages":"1003-1021"},"PeriodicalIF":1.7,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90963063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}