Estimating the Density Ratio between Distributions with High Discrepancy using Multinomial Logistic Regression

Trans. Mach. Learn. Res. | Pub Date: 2023-05-01 | DOI: 10.48550/arXiv.2305.00869

Akash Srivastava, Seung-Jun Han, Kai Xu, Benjamin Rhodes, Michael U Gutmann


Citations: 4

Abstract

Functions of the ratio of the densities $p/q$ are widely used in machine learning to quantify the discrepancy between the two distributions $p$ and $q$. For high-dimensional distributions, binary classification-based density ratio estimators have shown great promise. However, when densities are well separated, estimating the density ratio with a binary classifier is challenging. In this work, we show that the state-of-the-art density ratio estimators perform poorly on well-separated cases and demonstrate that this is due to distribution shifts between training and evaluation time. We present an alternative method that leverages multi-class classification for density ratio estimation and does not suffer from distribution shift issues. The method uses a set of auxiliary densities $\{m_k\}_{k=1}^K$ and trains a multi-class logistic regression to classify the samples from $p, q$, and $\{m_k\}_{k=1}^K$ into $K+2$ classes. We show that if these auxiliary densities are constructed such that they overlap with $p$ and $q$, then a multi-class logistic regression allows for estimating $\log p/q$ on the domain of any of the $K+2$ distributions and resolves the distribution shift problems of the current state-of-the-art methods. We compare our method to state-of-the-art density ratio estimators on both synthetic and real datasets and demonstrate its superior performance on the tasks of density ratio estimation, mutual information estimation, and representation learning. Code: https://www.blackswhan.com/mdre/
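The multi-class idea described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation. It assumes one-dimensional Gaussians for $p$ and $q$, a single hand-picked auxiliary density $m$ (so $K=1$), quadratic features, and plain gradient descent; all names (`features`, `log_ratio`, and so on) are hypothetical. With equal sample sizes per class, the class posteriors of a softmax classifier are proportional to the class densities, so the difference between the logits for the $p$ and $q$ classes is an estimate of $\log p/q$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # samples per class; equal sizes make the class priors cancel

# Assumed toy setup: p and q are well separated, m is a wide "bridge"
# density that overlaps both of them.
x_p = rng.normal(-2.0, 1.0, n)   # p = N(-2, 1)
x_q = rng.normal(2.0, 1.0, n)    # q = N(+2, 1)
x_m = rng.normal(0.0, 2.0, n)    # m = N(0, 2^2), the auxiliary density

x = np.concatenate([x_p, x_q, x_m])
y = np.repeat([0, 1, 2], n)      # labels: 0 = p, 1 = q, 2 = m

# Standardisation statistics for the raw features x and x^2.
mu = np.array([x.mean(), (x ** 2).mean()])
sd = np.array([x.std(), (x ** 2).std()])


def features(x):
    """Map x to [1, standardised x, standardised x^2]."""
    x = np.asarray(x, dtype=float)
    z = (np.stack([x, x ** 2], axis=-1) - mu) / sd
    return np.concatenate([np.ones(z.shape[:-1] + (1,)), z], axis=-1)


# Multinomial (softmax) logistic regression fitted by gradient descent.
Phi = features(x)
Y = np.eye(3)[y]                 # one-hot targets
W = np.zeros((3, 3))             # shape: (features, classes)
for _ in range(5000):
    logits = Phi @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    W -= 0.5 * Phi.T @ (probs - Y) / len(x)       # cross-entropy gradient step


def log_ratio(x):
    """Estimate log p(x)/q(x) as the logit of class p minus the logit of class q."""
    h = features(x) @ W
    return h[..., 0] - h[..., 1]


grid = np.array([-0.5, 0.5])
print("estimate:", log_ratio(grid))
print("truth:   ", -4.0 * grid)  # for these Gaussians, log p/q = -4x
```

In this toy setting the quadratic features make the model well specified, because the log-ratio of any two Gaussians is a quadratic function of $x$. For the high-dimensional problems the abstract targets, a more flexible classifier and carefully constructed auxiliary densities would be needed.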


