Shangyu Chen , Pengfei Fang , Mehrtash Harandi , Trung Le , Jianfei Cai , Dinh Phung
{"title":"HVQ-VAE: Variational auto-encoder with hyperbolic vector quantization","authors":"Shangyu Chen , Pengfei Fang , Mehrtash Harandi , Trung Le , Jianfei Cai , Dinh Phung","doi":"10.1016/j.cviu.2025.104392","DOIUrl":null,"url":null,"abstract":"<div><div>Vector quantized-variational autoencoder (VQ-VAE) and its variants have made significant progress in creating discrete latent space via learning a codebook. Previous works on VQ-VAE have focused on discrete latent spaces in Euclidean or in spherical spaces. This paper studies the geometric prior of hyperbolic spaces as a way to improve the learning capacity of VQ-VAE. That being said, working with the VQ-VAE in the hyperbolic space is not without difficulties, and the benefits of using hyperbolic space as the geometric prior for the latent space have never been studied in VQ-VAE. We bridge this gap by developing the VQ-VAE with hyperbolic vector quantization. To this end, we propose the hyperbolic VQ-VAE (HVQ-VAE), which learns the latent embedding of data and the codebook in the hyperbolic space. Specifically, we endow the discrete latent space in the Poincaré ball, such that the clustering algorithm can be formulated and optimized in the Poincaré ball. Thorough experiments against various baselines are conducted to evaluate the superiority of the proposed HVQ-VAE empirically. We show that HVQ-VAE enjoys better image reconstruction, effective codebook usage, and fast convergence than baselines. We also present evidence that HVQ-VAE outperforms VQ-VAE in low-dimensional latent space.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104392"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001158","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Vector quantized-variational autoencoder (VQ-VAE) and its variants have made significant progress in creating discrete latent space via learning a codebook. Previous works on VQ-VAE have focused on discrete latent spaces in Euclidean or in spherical spaces. This paper studies the geometric prior of hyperbolic spaces as a way to improve the learning capacity of VQ-VAE. That being said, working with the VQ-VAE in the hyperbolic space is not without difficulties, and the benefits of using hyperbolic space as the geometric prior for the latent space have never been studied in VQ-VAE. We bridge this gap by developing the VQ-VAE with hyperbolic vector quantization. To this end, we propose the hyperbolic VQ-VAE (HVQ-VAE), which learns the latent embedding of data and the codebook in the hyperbolic space. Specifically, we endow the discrete latent space in the Poincaré ball, such that the clustering algorithm can be formulated and optimized in the Poincaré ball. Thorough experiments against various baselines are conducted to evaluate the superiority of the proposed HVQ-VAE empirically. We show that HVQ-VAE enjoys better image reconstruction, effective codebook usage, and fast convergence than baselines. We also present evidence that HVQ-VAE outperforms VQ-VAE in low-dimensional latent space.
期刊介绍:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems