{"title":"℘-MinHash Algorithm for Continuous Probability Measures: Theory and Application to Machine Learning","authors":"Ping Li, Xiaoyun Li, G. Samorodnitsky","doi":"10.1145/3511808.3557413","DOIUrl":null,"url":null,"abstract":"This paper studies the scale-invariant \"probability Jaccard'' (ProbJ), noted as ℐ℘, which is another variant of weighted Jaccard similarity. The standard and commonly used Jaccard index is not invariant of data scaling. Thus, the probability Jaccard can be a potentially useful extension to probability distributions. Before our paper, the problem of hashing the ℐ℘ for continuous probability measures is an open problem, where rigorous definitions and analysis are still absent in literature. In our work, we solve this problem systematically and completely. Specifically, we formalize the definition of ℐ℘ in continuous measure space, and propose a general ℘-MinHash sampling algorithm which generates samples following any target distribution, and preserves ℐ℘ between two distributions by the hash collision. In addition, a refined early stopping rule is proposed under a practical boundedness assumption. We validate the theory through simulation and experiments, and demonstrate the application of our method in machine learning problems.","PeriodicalId":389624,"journal":{"name":"Proceedings of the 31st ACM International Conference on Information & Knowledge Management","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM International Conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3511808.3557413","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
This paper studies the scale-invariant "probability Jaccard" (ProbJ), denoted ℐ℘, which is another variant of weighted Jaccard similarity. The standard and commonly used Jaccard index is not invariant to data scaling, so the probability Jaccard can be a useful extension to probability distributions. Prior to this paper, hashing ℐ℘ for continuous probability measures was an open problem: rigorous definitions and analysis were absent from the literature. In this work, we solve the problem systematically and completely. Specifically, we formalize the definition of ℐ℘ on continuous measure spaces and propose a general ℘-MinHash sampling algorithm that generates samples following any target distribution and preserves ℐ℘ between two distributions through the probability of hash collision. In addition, a refined early stopping rule is proposed under a practical boundedness assumption. We validate the theory through simulation and experiments, and demonstrate the application of our method in machine learning problems.
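The abstract does not spell out the estimator itself, so the following is a minimal discrete-case sketch for reference, not the paper's continuous-measure ℘-MinHash. It uses the standard discrete probability Jaccard, J_P(x, y) = Σ_{i: x_i>0, y_i>0} 1 / Σ_j max(x_j/x_i, y_j/y_i), together with the known construction from prior work on discrete P-MinHash: draw shared Exp(1) variables e_i and hash x to argmin_i e_i / x_i, so that the collision probability of two hashes equals J_P(x, y). The function names (prob_jaccard, pminhash) and the NumPy-based setup are illustrative assumptions.

```python
import numpy as np

def prob_jaccard(x, y):
    """Exact probability Jaccard J_P between two non-negative
    weight vectors x and y (discrete case)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i in range(len(x)):
        if x[i] > 0 and y[i] > 0:
            # Term for element i: 1 / sum_j max(x_j / x_i, y_j / y_i)
            total += 1.0 / np.sum(np.maximum(x / x[i], y / y[i]))
    return total

def pminhash(x, k, seed=0):
    """Return k hash samples for vector x.

    Element i is scored by e_i / x_i with shared Exp(1) variables e_i
    (shared via the common seed); the argmin index is the sample.
    With shared randomness, Pr[hash(x) == hash(y)] = J_P(x, y).
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)        # same seed => shared e across vectors
    e = rng.exponential(size=(k, len(x)))    # e[r, i] ~ Exp(1)
    safe_x = np.where(x > 0, x, 1.0)         # avoid division by zero
    scores = np.where(x > 0, e / safe_x, np.inf)
    return scores.argmin(axis=1)

if __name__ == "__main__":
    x = [0.2, 0.5, 0.3, 0.0]
    y = [0.1, 0.4, 0.3, 0.2]
    k = 20000
    hx, hy = pminhash(x, k), pminhash(y, k)
    print("exact J_P     :", prob_jaccard(x, y))
    print("collision est.:", np.mean(hx == hy))
```

Note that both the exact value and the hash are unchanged when x is multiplied by a positive constant (the ratios x_j / x_i cancel the scale), which is the scale-invariance property the abstract emphasizes over the standard Jaccard index.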