Advances in Data Analysis and Classification: latest articles, page 3

Editorial for ADAC issue 2 of volume 18 (2024)

IF 1.4, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-06-10 DOI: 10.1007/s11634-024-00597-3

Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs

Citations: 0

Clustering large mixed-type data with ordinal variables

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-05-27 DOI: 10.1007/s11634-024-00595-5

Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm

Citations: 0

A two-group canonical variate analysis biplot for an optimal display of both means and cases

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-05-06 DOI: 10.1007/s11634-024-00593-7

Niel le Roux, Sugnet Gardner-Lubbe

Abstract: Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, $J$, is less than the number of variables, $p$, at most $J-1$ eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of the two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be exactly represented in a one-dimensional biplot but the individual samples are approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, but the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot when encountering the equally important case of comparing three groups in practice.

Vol. 19, no. 3, pp. 721-748. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00593-7.pdf
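
The statement that at most $J-1$ eigenvalues are nonzero, and that two group means fit exactly in one dimension, can be made concrete with a short derivation. This is a sketch under assumed notation that the abstract does not give: $B$ and $W$ are the between-group and within-group sums-of-squares matrices, $W$ is nonsingular, $\bar{x}_j$ and $n_j$ are the mean and size of group $j$, and $n = n_1 + n_2$.

```latex
% Two-sided eigenvalue problem (assumed notation: B between-group, W within-group)
B v = \lambda W v, \qquad
B = \sum_{j=1}^{J} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^{\top}.

% The weighted deviations sum to zero, so rank(B) <= J - 1:
\sum_{j=1}^{J} n_j (\bar{x}_j - \bar{x}) = 0
\;\Longrightarrow\; \operatorname{rank}(B) \le J - 1.

% Two groups (J = 2), with d = \bar{x}_1 - \bar{x}_2:
B = \frac{n_1 n_2}{n}\, d\, d^{\top}, \qquad
v \propto W^{-1} d, \qquad
\lambda = \frac{n_1 n_2}{n}\, d^{\top} W^{-1} d .
```

With $J = 2$ there is a single nonzero eigenvalue, so only one canonical dimension separates the means. Any second display dimension therefore has to be chosen by another criterion, which is the gap the abstract describes.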

Citations: 0

Clustering functional data via variational inference

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-04-30 DOI: 10.1007/s11634-024-00590-w

Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias

Citations: 0

Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings

IF 1.4, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-04-26 DOI: 10.1007/s11634-024-00594-6

Matteo Farnè

Abstract: In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data, and we apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame-by-frame via the cosine distance. We then compare the metrical, temporal and timbral features of the two executions by MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find that Rajna’s execution is faster and less varied, in terms of intensity and timbre, than Howard’s. This enquiry represents a case study able to show the potential of MIR from audio data in supporting traditional music score analyses and in providing objective information for statistically founded musical execution analyses.

Vol. 18, no. 3, pp. 797-822. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00594-6.pdf
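
The frame-by-frame novelty measure mentioned in the abstract can be illustrated with a minimal sketch. The toy spectral frames, the threshold of 0.5, and the function names are assumptions made for illustration; the paper's actual features and segmentation rule are not given in the abstract.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two spectral frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a silent frame counts as "no change"
    return 1.0 - dot / (norm_u * norm_v)

def novelty_curve(frames):
    """Cosine distance between each frame and its predecessor."""
    return [cosine_distance(frames[i - 1], frames[i])
            for i in range(1, len(frames))]

def segment_boundaries(novelty, threshold):
    """Indices of frames that start a new segment (novelty above threshold)."""
    return [i + 1 for i, d in enumerate(novelty) if d > threshold]

# Hypothetical 3-bin magnitude spectra: three similar frames, then a change.
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
novelty = novelty_curve(frames)
segment_starts = segment_boundaries(novelty, threshold=0.5)
```

In this toy input the spectral content changes only at the fourth frame, so that is the single boundary the threshold picks out.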

Citations: 0

Comparison of internal evaluation criteria in hierarchical clustering of categorical data

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-04-13 DOI: 10.1007/s11634-024-00592-8

Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova

Abstract: The paper discusses eleven internal evaluation criteria that can be used in the area of hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat the cluster quality: variability-based and distance-based. The paper follows three main aims. The first is to compare the examined criteria regarding their mutual similarity and their dependence on the properties of the clustered datasets and on the similarity measures used. The second is to analyze the relationships between internal and external cluster evaluation to determine how well the internal criteria can recognize the original number of clusters in datasets and to what extent they provide comparable results to the external criteria. The third aim is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging the cluster quality or determining the optimal number of clusters.

Vol. 19, no. 3, pp. 619-648.

Citations: 0

Multidimensional scaling for big data

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-04-13 DOI: 10.1007/s11634-024-00591-9

Pedro Delicado, Cristian Pachón-García

Abstract: We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using an $n \times n$ distance matrix as input, where $n$ is the number of individuals, and producing a low dimensional configuration: an $n \times r$ matrix with $r \ll n$. When $n$ is large, MDS is unaffordable with classical MDS algorithms because of their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces, where classical MDS methods can work. Two of these algorithms are original proposals. In order to check the performance of the algorithms as well as to compare them, we have done a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST: a real large data set with more than 800000 points. We conclude that all the algorithms are appropriate to use for obtaining an MDS configuration, but we recommend using one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An R package implementing the algorithms has been created.

Vol. 19, no. 3, pp. 649-670. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00591-9.pdf
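
A quick back-of-the-envelope calculation shows why a full distance matrix is out of reach at this scale and why partitioning helps. The 8-byte entry size (double precision) and the block size of 1000 points are assumptions chosen for illustration, not values taken from the paper.

```python
def distance_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed to hold a dense n x n distance matrix."""
    return n * n * bytes_per_entry

n = 800_000                      # lower bound on the EMNIST size quoted above
full_bytes = distance_matrix_bytes(n)

block_size = 1_000               # hypothetical partition size
num_blocks = -(-n // block_size) # ceiling division
block_bytes = distance_matrix_bytes(block_size)

print(f"full matrix : {full_bytes / 1e12:.2f} TB")
print(f"one block   : {block_bytes / 1e6:.0f} MB ({num_blocks} blocks)")
```

The full matrix needs about 5.12 TB, while each 1000-point block needs 8 MB. The sketch covers only the memory side; combining the per-block configurations into one coordinate system is the part the compared algorithms handle differently.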

Citations: 0

View selection in multi-view stacking: choosing the meta-learner

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-04-12 DOI: 10.1007/s11634-024-00587-5

Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij

Abstract: Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show few advantages that would make them preferable to the other three.

Vol. 19, no. 3, pp. 579-617. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00587-5.pdf
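
To show how a nonnegativity constraint on the meta-learner can act as view selection, here is a minimal sketch. It uses plain nonnegative least squares fitted by projected gradient descent as a stand-in; it is not the nonnegative lasso, adaptive lasso, or elastic net studied in the paper. The per-view predictions are made-up numbers standing in for base-learner outputs, and all names are hypothetical.

```python
def fit_nonnegative_weights(Z, y, steps=2000, lr=1.0):
    """Minimise mean squared error of Z @ w against y subject to w >= 0.

    Z: list of rows, one column of base-learner predictions per view.
    Uses projected gradient descent (gradient step, then clip at zero).
    """
    n, num_views = len(Z), len(Z[0])
    w = [0.0] * num_views
    for _ in range(steps):
        residual = [y[i] - sum(Z[i][v] * w[v] for v in range(num_views))
                    for i in range(n)]
        grad = [-sum(Z[i][v] * residual[i] for i in range(n)) / n
                for v in range(num_views)]
        w = [max(0.0, w[v] - lr * grad[v]) for v in range(num_views)]
    return w

# Hypothetical base-learner predictions for six objects and two views.
view_a = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]   # tracks the labels
view_b = [0.6, 0.5, 0.7, 0.4, 0.5, 0.3]   # runs against the labels
y = [0, 0, 0, 1, 1, 1]

Z = [list(row) for row in zip(view_a, view_b)]
weights = fit_nonnegative_weights(Z, y)
selected_views = [v for v, w in enumerate(weights) if w > 0]
```

An unconstrained fit would give the second view a negative weight. The constraint pins that weight at exactly zero, so the view drops out of the combined model.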

Citations: 0

Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data

IF 1.4, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-03-30 DOI: 10.1007/s11634-024-00589-3

Payel Sadhukhan, Sarbani Palit

Abstract: This work presents a novel undersampling scheme to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. Natural-nearest neighborhood is a parameter-free principle. Our scheme’s novelty lies in exploring the parameter-optimization-free natural nearest neighborhood principles. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label. Consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework where a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). The performance of the proposed method, NaNUML, indicates its ability to mitigate the class-imbalance issue in multi-label datasets to a considerable extent. We could also establish a statistically superior performance over other competing methods several times. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that the proposed method effectively handles the class-imbalance issue in multi-label datasets. In this work, we have presented a novel label-specific undersampling scheme, NaNUML, for multi-label datasets. NaNUML is based on the parameter-free natural neighbor search, and the key factor, the neighborhood size ‘k’, is determined without invoking any parameter optimization.

Vol. 18, no. 3, pp. 723-744.
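
The idea of a neighborhood size that is found rather than tuned can be sketched with one common formulation of natural neighbor search. The search radius grows round by round until every point has been chosen as a neighbor by some other point, or until the number of never-chosen points stops changing. This is a simplified illustration with hypothetical names and toy data. It is not the NaNUML undersampling scheme, whose details the abstract does not give.

```python
import math

def natural_neighbor_search(points):
    """Return (k, reverse_counts) for a simplified natural neighbor search.

    k is the round at which the search stabilises; reverse_counts[i] is how
    often point i was picked as one of the k nearest neighbours of others.
    """
    n = len(points)
    # For each point, the other points sorted from nearest to farthest.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j, i=i: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    reverse_counts = [0] * n
    previous_orphans = None
    k = 0
    while k < n - 1:
        for i in range(n):
            reverse_counts[order[i][k]] += 1   # i picks its (k+1)-th neighbour
        k += 1
        orphans = sum(1 for c in reverse_counts if c == 0)
        if orphans == 0 or orphans == previous_orphans:
            break
        previous_orphans = orphans
    return k, reverse_counts

# Four points on a unit square plus one far-away outlier (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
k, reverse_counts = natural_neighbor_search(points)
```

On this toy input the search stops after three rounds, and the outlier is never chosen as anyone's neighbor. A count of zero like this is the kind of signal a neighbor-based method can use to tell isolated points from dense regions.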

Citations: 0

Clustering ensemble extraction: a knowledge reuse framework

IF 1.3, Zone 4, Computer Science

Advances in Data Analysis and Classification Pub Date : 2024-03-27 DOI: 10.1007/s11634-024-00588-4

Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam

Abstract: Clustering ensemble combines several fundamental clusterings with a consensus function to produce the final clustering without gaining access to data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a huge library of various clusterings is not available, this function produces results of lower quality than those of the basic clustering. Expanding the diverse clusters in the collection to increase the performance of consensus, especially in cases where there is no access to specific data features or assumptions about the data distribution, remains an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. Then, it extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely Cluster-based extracted partitioning algorithm and Meta-cluster extracted algorithm, are defined and then applied to new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus function obtained by our proposed method outperformed the methods previously proposed in the literature regarding the clustering quality and efficiency.

Vol. 19, no. 3, pp. 551-578.
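
Comparing clusters from different base clusterings without touching the data features can be sketched as follows. The abstract does not name the similarity criterion, so Jaccard similarity between member sets is an assumption here, and the three base clusterings are toy inputs.

```python
def clusters_of(labels):
    """Turn a label vector into a list of clusters (sets of object indices)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return list(groups.values())

def jaccard(a, b):
    """Shared members divided by all members of two clusters."""
    return len(a & b) / len(a | b)

# Three hypothetical base clusterings of the same six objects.
base_clusterings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition as the first, labels swapped
]

all_clusters = [c for labels in base_clusterings for c in clusters_of(labels)]
similarity = [[jaccard(a, b) for b in all_clusters] for a in all_clusters]
```

Only cluster memberships are used, so relabelled but identical clusters get similarity 1.0. Highly similar clusters such as these are natural candidates for the same group before any consensus step.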

Citations: 0