{"title":"How to Measure the Researcher Impact with the Aid of its Impactable Area: A Concrete Approach Using Distance Geometry","authors":"Beniamino Cappelletti-Montano, Gianmarco Cherchi, Benedetto Manca, Stefano Montaldo, Monica Musio","doi":"10.1007/s00357-024-09490-2","DOIUrl":"https://doi.org/10.1007/s00357-024-09490-2","url":null,"abstract":"<p>Assuming that the subject of each scientific publication can be identified by one or more classification entities, we address the problem of determining a similarity function (distance) between classification entities based on how often two classification entities are used in the same publication. This similarity function is then used to obtain a representation of the classification entities as points of an Euclidean space of a suitable dimension by means of optimization and dimensionality reduction algorithms. This procedure allows us also to represent the researchers as points in the same Euclidean space and to determine the distance between researchers according to their scientific production. As a case study, we consider as classification entities the codes of the American Mathematical Society Classification System.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"56 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-task Support Vector Machine Classifier with Generalized Huber Loss","authors":"Qi Liu, Wenxin Zhu, Zhengming Dai, Zhihong Ma","doi":"10.1007/s00357-024-09488-w","DOIUrl":"https://doi.org/10.1007/s00357-024-09488-w","url":null,"abstract":"<p>Compared to single-task learning (STL), multi-task learning (MTL) achieves a better generalization by exploiting domain-specific information implicit in the training signals of several related tasks. The adaptation of MTL to support vector machines (SVMs) is a rather successful example. Inspired by the recently published generalized Huber loss SVM (GHSVM) and regularized multi-task learning (RMTL), we propose a novel generalized Huber loss multi-task support vector machine including linear and non-linear cases for binary classification, named as MTL-GHSVM. The new method extends the GHSVM from single-task to multi-task learning, and the application of Huber loss to MTL-SVM is innovative to the best of our knowledge. The proposed method has two main advantages: on the one hand, compared with SVMs with hinge loss and GHSVM, our MTL-GHSVM using the differentiable generalized Huber loss has better generalization performance; on the other hand, it adopts functional iteration to find the optimal solution, and does not need to solve a quadratic programming problem (QPP), which can significantly reduce the computational cost. 
Numerical experiments have been conducted on fifteen real datasets, and the results demonstrate the effectiveness of the proposed multi-task classification algorithm compared with the state-of-the-art algorithms.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"166 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning","authors":"Haixia Zhao, Jian Wu","doi":"10.1007/s00357-024-09491-1","DOIUrl":"https://doi.org/10.1007/s00357-024-09491-1","url":null,"abstract":"<p>Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies. This negatively affects data learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. In order to avoid the loss of important information, COM clusters the minority class based on the structural characteristics of the instances, among which rare instances and outliers are carefully portrayed through assigning a sampling weight to each of the clusters. Clusters with high densities are given low weights, and then, oversampling is performed within clusters to avoid overgeneralization. COM avoids intra-class imbalance effectively because low-density clusters are more likely than high-density ones to be selected to synthesize instances. Our study used the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"17 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Semi-supervised Clustering and Classification Under a Generalized Framework","authors":"Zhen Jiang, Lingyun Zhao, Yu Lu","doi":"10.1007/s00357-024-09489-9","DOIUrl":"https://doi.org/10.1007/s00357-024-09489-9","url":null,"abstract":"<p>Most machine learning algorithms rely on having a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, while unlabeled data can be readily accessible. Therefore, learning from both labeled and unlabeled data has become a hot topic of interest. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training style methods that construct diverse classifiers to learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from a few labeled data. Especially, in order to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that utilizes labeled data to guide the process of cluster-splitting. Within the CSCC framework, we introduce two loss functions to supervise the iterative updating of the semi-supervised clustering and classification models, respectively. 
Extensive experiments conducted on a variety of benchmark datasets validate the superiority of CSCC over other state-of-the-art methods.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"13 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slope Stability Classification Model Based on Single-Valued Neutrosophic Matrix Energy and Its Application Under a Single-Valued Neutrosophic Matrix Scenario","authors":"Jun Ye, Kaiqian Du, Shigui Du, Rui Yong","doi":"10.1007/s00357-024-09487-x","DOIUrl":"https://doi.org/10.1007/s00357-024-09487-x","url":null,"abstract":"<p>Since matrix energy (ME) implies the expressive merit of collective information, a classification method based on ME has not been investigated in the existing literature, which reflects its research gap in a matrix scenario. Therefore, the purpose of this paper is to propose a slope stability classification model based on the single-valued neutrosophic matrix (SVNM) energy to solve the current research gap in slope stability classification analysis with uncertain and inconsistent information. In this study, we first present SVNM and define the SVNM energy based on true, uncertain, and false MEs. Then, using a neutrosophication technique based on true, false, and uncertain Gaussian membership functions, the multiple sampling data of the stability affecting factors for each slope are transformed into SVNM. Next, a slope stability classification model based on the SVNM energy and score function is developed to solve the slope stability classification analysis under the full SVNM scenario of both the affecting factor weights and the affecting factors of slope stability. Finally, the developed classification model is applied to the classification analysis of 50 slope samples collected from different areas of Zhejiang province in China as a case study to verify its rationality and accuracy under the SVNM scenario. 
The accuracy of the classification results for the 50 slope samples is 100%.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"29 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Effective Crow Search Algorithm and Its Application in Data Clustering","authors":"Rajesh Ranjan, Jitender Kumar Chhabra","doi":"10.1007/s00357-024-09486-y","DOIUrl":"https://doi.org/10.1007/s00357-024-09486-y","url":null,"abstract":"<p>In today’s data-centric world, the significance of generated data has increased manifold. Clustering the data into a similar group is one of the dynamic research areas among other data practices. Several algorithms’ proposals exist for clustering. Apart from the traditional algorithms, researchers worldwide have successfully employed some metaheuristic approaches for clustering. The crow search algorithm (CSA) is a recently introduced swarm-based algorithm that imitates the performance of the crow. An effective crow search algorithm (ECSA) has been proposed in the present work, which dynamically attunes its parameter to sustain the search balance and perform an oppositional-based random initialization. The ECSA is evaluated over CEC2019 Benchmark Functions and simulated for data clustering tasks compared with well-known metaheuristic approaches and famous partition-based K-means algorithm over benchmark datasets. The results reveal that the ECSA performs better than other algorithms in the context of external cluster quality metrics and convergence rate.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"95 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions","authors":"Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas","doi":"10.1007/s00357-024-09479-x","DOIUrl":"https://doi.org/10.1007/s00357-024-09479-x","url":null,"abstract":"<p>Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"33 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141614134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering with Minimum Spanning Trees: How Good Can It Be?","authors":"Marek Gagolewski, Anna Cena, Maciej Bartoszuk, Łukasz Brzozowski","doi":"10.1007/s00357-024-09483-1","DOIUrl":"https://doi.org/10.1007/s00357-024-09483-1","url":null,"abstract":"<p>Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"4 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Matrix Feature Selection Strategy in Machine Learning Models for Certain Krylov Solver Prediction","authors":"Hai-Bing Sun, Yan-Fei Jing, Xiao-Wen Xu","doi":"10.1007/s00357-024-09484-0","DOIUrl":"https://doi.org/10.1007/s00357-024-09484-0","url":null,"abstract":"<p>Numerical simulation processes in scientific and engineering applications require efficient solutions of large sparse linear systems, and variants of Krylov subspace solvers with various preconditioning techniques have been developed. However, it is time-consuming for practitioners with trial and error to find a high-performance Krylov solver in a candidate solver set for a given linear system. Therefore, it is instructive to select an efficient solver intelligently among a solver set rather than exploratory application of all solvers to solve the linear system. One promising direction of solver selection is to apply machine learning methods to construct a mapping from the matrix features to the candidate solvers. However, the computation of some matrix features is quite difficult. In this paper, we design a new selection strategy of matrix features to reduce computing cost, and then employ the selected features to construct a machine learning classifier to predict an appropriate solver for a given linear system. 
Numerical experiments on two attractive GMRES-type solvers for solving linear systems from the University of Florida Sparse Matrix Collection and Matrix Market verify the efficiency of our strategy, not only reducing the computing time for obtaining features and construction time of classifier but also keeping more than 90% prediction accuracy.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"30 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster Validation Based on Fisher’s Linear Discriminant Analysis","authors":"Fabian Kächele, Nora Schneider","doi":"10.1007/s00357-024-09481-3","DOIUrl":"https://doi.org/10.1007/s00357-024-09481-3","url":null,"abstract":"<p>Cluster analysis aims to find meaningful groups, called clusters, in data. The objects within a cluster should be similar to each other and dissimilar to objects from other clusters. The fundamental question arising is whether found clusters are “valid clusters” or not. Existing cluster validity indices are computation-intensive, make assumptions about the underlying cluster structure, or cannot detect the absence of clusters. Thus, we present a new cluster validation framework to assess the validity of a clustering and determine the underlying number of clusters <span>(k^*)</span>. Within the framework, we introduce a new merge criterion analyzing the data in a one-dimensional projection, which maximizes the ratio of between-cluster- variance to within-cluster-variance in the clusters. Nonetheless, other local methods can be applied as a merge criterion within the framework. Experiments on synthetic and real-world data sets show promising results for both the overall framework and the introduced merge criterion.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}