intRinsic: An R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset


Journal of Statistical Software, Pub Date: 2021-02-23, DOI: 10.18637/jss.v106.i09

Francesco Denti


Citations: 4

Abstract

This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
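The two nearest neighbors idea described in the abstract can be illustrated with a minimal sketch. It is written in Python with NumPy for illustration only; it is not the package's R code, and the function names `nn_ratios`, `twonn_mle`, and `twonn_posterior` are hypothetical. The sketch assumes the standard result from the two nearest neighbors literature that, under locally uniform density, the ratio of each point's second to first nearest-neighbor distance follows a Pareto distribution with unit scale and shape equal to the intrinsic dimension d. This gives a closed-form maximum likelihood estimate and, with a Gamma(a, b) prior on d (the hyperparameters here are arbitrary assumptions), a conjugate Gamma posterior.

```python
import numpy as np


def nn_ratios(X):
    """Ratio of second to first nearest-neighbor distance for each row of X."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances: O(n^2) memory, small n only.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # After partitioning at index 1, column 0 holds the smallest distance
    # and column 1 the second smallest in each row.
    two = np.partition(dist, 1, axis=1)[:, :2]
    return two[:, 1] / two[:, 0]


def twonn_mle(mu):
    """Maximum likelihood estimate of d under mu_i ~ Pareto(scale=1, shape=d)."""
    mu = np.asarray(mu, dtype=float)
    return mu.size / np.sum(np.log(mu))


def twonn_posterior(mu, a=1.0, b=1.0):
    """Shape and rate of the Gamma posterior of d under a Gamma(a, b) prior.

    The prior hyperparameters a and b are illustrative assumptions.
    """
    mu = np.asarray(mu, dtype=float)
    return a + mu.size, b + np.sum(np.log(mu))


# Simulated check: a 2-dimensional uniform sample embedded in 5 dimensions.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(1000, 2))
X = np.hstack([Z, np.zeros((1000, 3))])

mu = nn_ratios(X)
shape, rate = twonn_posterior(mu)
print("MLE of d:", twonn_mle(mu))
print("Posterior mean of d:", shape / rate)
```

Both estimates should land near 2, the true dimension of the simulated sample, even though the data have five coordinates. The sketch assumes no duplicated points, since a zero first-neighbor distance would make the ratio undefined.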



About the journal: The Journal of Statistical Software (JSS) publishes open-source software and corresponding reproducible articles discussing all aspects of the design, implementation, documentation, application, evaluation, comparison, maintenance, and distribution of software dedicated to improving the state of the art in statistical computing in all areas of empirical research. Open-source code and articles are jointly reviewed and published in this journal and should be accessible to a broad community of practitioners, teachers, and researchers in the field of statistics.