Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz
Title: Regularization-based methods for ordinal quantification
Journal: Data Mining and Knowledge Discovery
Published: 2024-08-15
DOI: https://doi.org/10.1007/s10618-024-01067-2
Abstract
Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of \(n>2\) classes. We make three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other's developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
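The smoothness assumption behind the regularization can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simple linear model q ≈ M p (predicted-class distribution q, unknown prevalences p), a second-difference Tikhonov penalty, and a crude clip-and-renormalize step in place of a proper simplex constraint. The names `M`, `q`, `tau`, `regularized_estimate`, and all numbers below are hypothetical.

```python
import numpy as np


def second_difference_matrix(n):
    """Return the (n-2) x n matrix C whose rows are [1, -2, 1] stencils."""
    C = np.zeros((n - 2, n))
    for i in range(n - 2):
        C[i, i:i + 3] = [1.0, -2.0, 1.0]
    return C


def roughness(p):
    """Sum of squared second differences; zero for constant or linear p."""
    C = second_difference_matrix(len(p))
    return float(np.sum((C @ p) ** 2))


def regularized_estimate(M, q, tau):
    """Minimize ||q - M p||^2 + tau * ||C p||^2 in closed form, then
    clip negatives and renormalize (an assumed, simplified projection)."""
    n = M.shape[1]
    C = second_difference_matrix(n)
    p = np.linalg.solve(M.T @ M + tau * (C.T @ C), M.T @ q)
    p = np.clip(p, 0.0, None)
    return p / p.sum()


# Hypothetical 5-class setup: M[i, j] = P(predicted class i | true class j).
M = np.array([
    [0.8, 0.2, 0.0, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.6, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.8],
])
# Made-up, jagged distribution of classifier predictions on one bag.
q = np.array([0.30, 0.10, 0.35, 0.05, 0.20])

for tau in (0.0, 0.1, 10.0):
    p = regularized_estimate(M, q, tau)
    print(tau, np.round(p, 3), round(roughness(p), 4))
```

Larger values of `tau` pull the estimate toward distributions whose neighboring classes have similar prevalence, which is the sense in which such a penalty discourages ordinally implausible, jagged estimates. With `tau` very large, the estimate approaches a linear trend across the class order, so the strength of the penalty would need to be tuned in practice.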
About the journal:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.