Journal of Classification: Latest Articles

Editorial: Journal of Classification Vol. 41-2
IF 1.8, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-07-03 DOI: 10.1007/s00357-024-09485-z
P. McNicholas
{"title":"Editorial: Journal of Classification Vol. 41-2","authors":"P. McNicholas","doi":"10.1007/s00357-024-09485-z","DOIUrl":"https://doi.org/10.1007/s00357-024-09485-z","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141681706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A New Look at the Dirichlet Distribution: Robustness, Clustering, and Both Together
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-07-02 DOI: 10.1007/s00357-024-09480-4
Salvatore D. Tomarchio, Antonio Punzo, Johannes T. Ferreira, Andriette Bekker
{"title":"A New Look at the Dirichlet Distribution: Robustness, Clustering, and Both Together","authors":"Salvatore D. Tomarchio, Antonio Punzo, Johannes T. Ferreira, Andriette Bekker","doi":"10.1007/s00357-024-09480-4","DOIUrl":"https://doi.org/10.1007/s00357-024-09480-4","url":null,"abstract":"<p>Compositional data have peculiar characteristics that pose significant challenges to traditional statistical methods and models. Within this framework, we use a convenient mode parametrized Dirichlet distribution across multiple fields of statistics. In particular, we propose finite mixtures of unimodal Dirichlet (UD) distributions for model-based clustering and classification. Then, we introduce the contaminated UD (CUD) distribution, a heavy-tailed generalization of the UD distribution that allows for a more flexible tail behavior in the presence of atypical observations. Thirdly, we propose finite mixtures of CUD distributions to jointly account for the presence of clusters and atypical points in the data. Parameter estimation is carried out by directly maximizing the maximum likelihood or by using an expectation-maximization (EM) algorithm. Two analyses are conducted on simulated data to illustrate the effects of atypical observations on parameter estimation and data classification, and how our proposals address both aspects. Furthermore, two real datasets are investigated and the results obtained via our models are discussed.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
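
To make the mixture-fitting setup concrete, here is a minimal EM sketch for a finite mixture of plain Dirichlet components. The paper's mode parametrization, the unimodality constraint, and the contaminated (CUD) extension are not reproduced here; this only illustrates the generic E- and M-steps for compositional data, with the weighted M-step solved numerically.

```python
# Minimal EM for a k-component Dirichlet mixture (a sketch, not the
# authors' UD/CUD models). X holds compositional data: rows > 0, summing to 1.
import numpy as np
from scipy.stats import dirichlet
from scipy.optimize import minimize

def dirichlet_mixture_em(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = rng.uniform(1.0, 5.0, size=(k, d))   # component concentration vectors
    weights = np.full(k, 1.0 / k)                 # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities (posterior component probabilities per point)
        log_r = np.column_stack([np.log(weights[j]) +
                                 dirichlet.logpdf(X.T, alphas[j]) for j in range(k)])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions; update each alpha_j by a
        # numerically maximized, responsibility-weighted log-likelihood
        weights = r.mean(axis=0)
        for j in range(k):
            nll = lambda a: -np.sum(r[:, j] * dirichlet.logpdf(X.T, a))
            alphas[j] = minimize(nll, alphas[j], method="L-BFGS-B",
                                 bounds=[(1e-3, None)] * d).x
    return weights, alphas, r

# Toy usage: X = np.random.default_rng(1).dirichlet([2, 5, 3], size=300)
#            w, a, r = dirichlet_mixture_em(X, k=2)
```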
Automatic Topic Title Assignment with Word Embedding
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-07-01 DOI: 10.1007/s00357-024-09476-0
Gianpaolo Zammarchi, Maurizio Romano, Claudio Conversano
{"title":"Automatic Topic Title Assignment with Word Embedding","authors":"Gianpaolo Zammarchi, Maurizio Romano, Claudio Conversano","doi":"10.1007/s00357-024-09476-0","DOIUrl":"https://doi.org/10.1007/s00357-024-09476-0","url":null,"abstract":"<p>In this paper, we propose TAWE (title assignment with word embedding), a new method to automatically assign titles to topics inferred from sets of documents. This method combines the results obtained from the topic modeling performed with, e.g., latent Dirichlet allocation (LDA) or other suitable methods and the word embedding representation of words in a vector space. This representation preserves the meaning of the words while allowing to find the most suitable word that represents the topic. The procedure is twofold: first, a cleaned text is used to build the LDA model to infer a desirable number of latent topics; second, a reasonable number of words and their weights are extracted from each topic and represented in n-dimensional space using word embedding. Based on the selected weighted words, a centroid is computed, and the closest word is chosen as the title of the topic. To test the method, we used a collection of tweets about climate change downloaded from some of the main newspapers accounts on Twitter. Results showed that TAWE is a suitable method for automatically assigning a topic title.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
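
The titling step lends itself to a short sketch. Assuming the topic's top words come from gensim (e.g., `LdaModel.show_topic(t, topn=20)`) and `kv` is a pretrained gensim `KeyedVectors` embedding, the centroid-and-nearest-word rule described in the abstract might look as follows; preprocessing and model choices follow the paper, not this sketch.

```python
# Sketch of the TAWE titling rule: weighted centroid of a topic's top
# words in embedding space, then the nearest vocabulary word as title.
import numpy as np
from gensim.models import KeyedVectors  # kv: a loaded KeyedVectors instance

def tawe_title(topic_words, kv):
    """topic_words: list of (word, weight) pairs; kv: gensim KeyedVectors."""
    pairs = [(w, wt) for w, wt in topic_words if w in kv]  # in-vocabulary only
    if not pairs:
        return None  # no embeddable words; no title can be proposed
    vectors = np.array([kv[w] for w, _ in pairs])
    weights = np.array([wt for _, wt in pairs])
    centroid = weights @ vectors / weights.sum()  # weighted centroid in embedding space
    # the vocabulary word closest to the centroid serves as the topic title
    return kv.similar_by_vector(centroid, topn=1)[0][0]
```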
Normalised Clustering Accuracy: An Asymmetric External Cluster Validity Measure
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-06-28 DOI: 10.1007/s00357-024-09482-2
Marek Gagolewski
{"title":"Normalised Clustering Accuracy: An Asymmetric External Cluster Validity Measure","authors":"Marek Gagolewski","doi":"10.1007/s00357-024-09482-2","DOIUrl":"https://doi.org/10.1007/s00357-024-09482-2","url":null,"abstract":"<p>There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms’ outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes–Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
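
Based on my reading of the abstract (optimal set matching, per-reference-cluster averaging to correct for imbalanced cluster sizes, and normalisation against the 1/k chance level), a measure of this kind could be sketched as below; the paper's exact definition may differ in details, so treat this as an illustration rather than the official formula.

```python
# Sketch of a normalised optimal set-matching accuracy: match predicted
# clusters to reference clusters to maximise the mean per-reference-cluster
# accuracy, then rescale so that 0 is the chance level and 1 is perfect.
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalised_clustering_accuracy(y_true, y_pred):
    true_ids, y_true = np.unique(y_true, return_inverse=True)
    pred_ids, y_pred = np.unique(y_pred, return_inverse=True)
    k = len(true_ids)
    # confusion matrix: rows = predicted clusters, columns = reference clusters
    c = np.zeros((len(pred_ids), k))
    np.add.at(c, (y_pred, y_true), 1)
    rates = c / c.sum(axis=0, keepdims=True)   # per-reference-cluster accuracies
    row, col = linear_sum_assignment(rates, maximize=True)  # best matching
    avg = rates[row, col].sum() / k
    return (avg - 1.0 / k) / (1.0 - 1.0 / k)   # 0 at chance, 1 if perfect
```

Note the asymmetry: the column normalisation treats the reference partition as ground truth, so swapping the arguments generally changes the value.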
Sensitivity and Specificity versus Precision and Recall, and Related Dilemmas
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-06-26 DOI: 10.1007/s00357-024-09478-y
William Cullerne Bown
{"title":"Sensitivity and Specificity versus Precision and Recall, and Related Dilemmas","authors":"William Cullerne Bown","doi":"10.1007/s00357-024-09478-y","DOIUrl":"https://doi.org/10.1007/s00357-024-09478-y","url":null,"abstract":"<p>Many evaluations of binary classifiers begin by adopting a pair of indicators, most often sensitivity and specificity or precision and recall. Despite this, we lack a general, pan-disciplinary basis for choosing one pair over the other, or over one of four other sibling pairs. Related obscurity afflicts the choice between the receiver operating characteristic and the precision-recall curve. Here, I return to first principles to separate concerns and distinguish more than 50 foundational concepts. This allows me to establish six rules that allow one to identify which pair is correct. The choice depends on the context in which the classifier is to operate, the intended use of the classifications, their intended user(s), and the measurability of the underlying classes, but not skew. The rules can be applied by those who develop, operate, or regulate them to classifiers composed of technology, people, or combinations of the two.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
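
The basic difference between the two pairs is easy to see numerically: sensitivity and specificity condition on the actual class and are therefore unaffected by class skew, whereas precision conditions on the predicted class and shifts with it. A minimal, self-contained illustration (not drawn from the paper):

```python
# Compare the indicator pairs on two confusion matrices that describe the
# same per-class behaviour but different class skew.
def pair_indicators(tp, fp, fn, tn):
    return {
        "sensitivity (recall)": tp / (tp + fn),  # P(positive call | actual positive)
        "specificity":          tn / (tn + fp),  # P(negative call | actual negative)
        "precision":            tp / (tp + fp),  # P(actual positive | positive call)
    }

# 10x more negatives in the second case: sensitivity and specificity are
# unchanged (0.8 and 0.9), while precision drops from ~0.89 to ~0.44.
print(pair_indicators(tp=80, fp=10, fn=20, tn=90))
print(pair_indicators(tp=80, fp=100, fn=20, tn=900))
```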
Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-06-19 DOI: 10.1007/s00357-024-09477-z
Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian
{"title":"Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion","authors":"Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian","doi":"10.1007/s00357-024-09477-z","DOIUrl":"https://doi.org/10.1007/s00357-024-09477-z","url":null,"abstract":"<p>Clustering longitudinal data for growth curve modelling is considered in this paper, where we aim to optimally estimate the underpinning unknown group partition matrix. Instead of following the conventional soft clustering approach, which assumes the columns of the partition matrix to have i.i.d. multinomial or categorical prior distributions and uses a regression model with the response following a finite mixture distribution to estimate the posterior distribution of the partition matrix, we propose an iterative partition and regression procedure to find the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC), which, due to the involved combinatorial explosion, is difficult to compute via enumerating all candidate partition matrices. Thus, we develop a Gibbs sampling method to generate a Markov chain of candidate partition matrices that has its equilibrium probability distribution equal the one induced from eBIC. We further show that the best partition matrix, given a priori the number of latent clusters, can be consistently estimated and is computationally scalable based on this Markov chain. The number of latent clusters is also best estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed by a comprehensive simulation study before being applied to two real-world growth curve modelling examples involving longitudinal data clustering.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
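
A toy sketch may help fix the idea of Gibbs-sampling partitions from a criterion-induced distribution. Below, an ordinary BIC of per-cluster polynomial fits stands in for the paper's eBIC, and a quadratic curve in time stands in for the authors' growth-curve model; both are placeholder assumptions, so this illustrates only the Gibbs-over-labels loop, not the authors' procedure.

```python
# Placeholder sketch: sample each subject's cluster label from a
# distribution induced by an information criterion; track the best partition.
import numpy as np

def fit_bic(t, Y, labels, k, deg=2):
    """Total BIC of per-cluster polynomial fits; Y is (subjects, times)."""
    bic = 0.0
    for g in range(k):
        rows = Y[labels == g]
        if rows.size == 0:
            continue
        y = rows.ravel()
        X = np.vander(np.tile(t, rows.shape[0]), deg + 1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        n = y.size
        bic += n * np.log(rss / n) + (deg + 1) * np.log(n)
    return bic

def gibbs_partition(t, Y, k, n_sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=Y.shape[0])
    best = (np.inf, labels.copy())
    for _ in range(n_sweeps):
        for i in range(Y.shape[0]):
            # full conditional of subject i's label, induced by the criterion
            scores = np.empty(k)
            for g in range(k):
                labels[i] = g
                scores[g] = fit_bic(t, Y, labels, k)
            p = np.exp(-0.5 * (scores - scores.min()))
            labels[i] = rng.choice(k, p=p / p.sum())
            if scores[labels[i]] < best[0]:
                best = (scores[labels[i]], labels.copy())
    return best  # (criterion value, partition labels)
```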
Density Peak Clustering Using Grey Wolf Optimization Approach
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-06-05 DOI: 10.1007/s00357-024-09475-1
Preeti, Kusum Deep
{"title":"Density Peak Clustering Using Grey Wolf Optimization Approach","authors":"Preeti, Kusum Deep","doi":"10.1007/s00357-024-09475-1","DOIUrl":"https://doi.org/10.1007/s00357-024-09475-1","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141382124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Finding Outliers in Gaussian Model-based Clustering
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-05-30 DOI: 10.1007/s00357-024-09473-3
Katharine M. Clark, Paul D. McNicholas
{"title":"Finding Outliers in Gaussian Model-based Clustering","authors":"Katharine M. Clark, Paul D. McNicholas","doi":"10.1007/s00357-024-09473-3","DOIUrl":"https://doi.org/10.1007/s00357-024-09473-3","url":null,"abstract":"<p>Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and <i>post hoc</i> outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141188966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
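
The distributional fact the abstract leans on is the classical result that, for n observations from a p-variate Gaussian with estimated mean and covariance, the scaled sample squared Mahalanobis distance n·D²/(n−1)² follows a Beta(p/2, (n−p−1)/2) distribution. A quick Monte Carlo check of that statement (not of the OCLUST algorithm itself):

```python
# Empirical check of the beta law for scaled sample Mahalanobis distances.
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.standard_normal((n, p))
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # sample covariance, divisor n-1
d2 = np.einsum("ij,jk,ik->i", X - mu, S_inv, X - mu)  # squared Mahalanobis distances
scaled = n * d2 / (n - 1) ** 2
# Rough check only (the d2 are weakly dependent); expect a large p-value.
print(kstest(scaled, beta(p / 2, (n - p - 1) / 2).cdf))
```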
SNN-PDM: An Improved Probability Density Machine Algorithm Based on Shared Nearest Neighbors Clustering Technique
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-05-17 DOI: 10.1007/s00357-024-09474-2
Shiqi Wu, Hualong Yu, Yan Gu, Changbin Shao, Shang Gao
{"title":"SNN-PDM: An Improved Probability Density Machine Algorithm Based on Shared Nearest Neighbors Clustering Technique","authors":"Shiqi Wu, Hualong Yu, Yan Gu, Changbin Shao, Shang Gao","doi":"10.1007/s00357-024-09474-2","DOIUrl":"https://doi.org/10.1007/s00357-024-09474-2","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140964939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Novel Classification Algorithm Based on the Synergy Between Dynamic Clustering with Adaptive Distances and K-Nearest Neighbors
IF 2.0, CAS Q4, Computer Science
Journal of Classification Pub Date: 2024-05-11 DOI: 10.1007/s00357-024-09471-5
Mohammed Sabri, Rosanna Verde, Antonio Balzanella, Fabrizio Maturo, Hamid Tairi, Ali Yahyaouy, Jamal Riffi
{"title":"A Novel Classification Algorithm Based on the Synergy Between Dynamic Clustering with Adaptive Distances and K-Nearest Neighbors","authors":"Mohammed Sabri, Rosanna Verde, Antonio Balzanella, Fabrizio Maturo, Hamid Tairi, Ali Yahyaouy, Jamal Riffi","doi":"10.1007/s00357-024-09471-5","DOIUrl":"https://doi.org/10.1007/s00357-024-09471-5","url":null,"abstract":"<p>This paper introduces a novel supervised classification method based on dynamic clustering (DC) and K-nearest neighbor (KNN) learning algorithms, denoted DC-KNN. The aim is to improve the accuracy of a classifier by using a DC method to discover the hidden patterns of the apriori groups of the training set. It provides a partitioning of each group into a predetermined number of subgroups. A new objective function is designed for the DC variant, based on a trade-off between the compactness and separation of all subgroups in the original groups. Moreover, the proposed DC method uses adaptive distances which assign a set of weights to the variables of each cluster, which depend on both their intra-cluster and inter-cluster structure. DC-KNN performs the minimization of a suitable objective function. Next, the KNN algorithm takes into account objects by assigning them to the label of subgroups. Furthermore, the classification step is performed according to two KNN competing algorithms. The proposed strategies have been evaluated using both synthetic data and widely used real datasets from public repositories. The achieved results have confirmed the effectiveness and robustness of the strategy in improving classification accuracy in comparison to alternative approaches.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
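
The two-stage architecture can be sketched compactly. In the sketch below, plain k-means stands in for the paper's dynamic clustering with adaptive, per-cluster weighted distances, so it shows only the subgroup-then-KNN pipeline: each a priori class is split into subgroups, KNN is trained on subgroup labels, and predictions are mapped back to the parent class.

```python
# Sketch of the subgroup-then-KNN pipeline (k-means as a stand-in for the
# paper's dynamic clustering with adaptive distances).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def dc_knn_fit_predict(X_tr, y_tr, X_te, n_sub=3, k=5, seed=0):
    sub_labels = np.empty(len(y_tr), dtype=int)
    sub_to_class = {}
    next_id = 0
    for c in np.unique(y_tr):
        idx = np.flatnonzero(y_tr == c)
        km = KMeans(n_clusters=min(n_sub, len(idx)), n_init=10,
                    random_state=seed).fit(X_tr[idx])
        for g in range(km.n_clusters):
            sub_to_class[next_id + g] = c        # remember each subgroup's class
        sub_labels[idx] = next_id + km.labels_
        next_id += km.n_clusters
    # classify by subgroup, then report each subgroup's parent class
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, sub_labels)
    return np.array([sub_to_class[s] for s in knn.predict(X_te)])
```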