Title: VHTformer: A joint query perception method for visual-haptic-textual information based on Transformer
Authors: Liang Li, Guochu Chen, Haiyan Wang, Baojiang Li, Bin Wang, Zizhen Yi, Chunbo Zhao
Journal: Applied Soft Computing, Volume 181, Article 113529 (Journal Article)
Publication date: 2025-06-28
DOI: 10.1016/j.asoc.2025.113529
URL: https://www.sciencedirect.com/science/article/pii/S1568494625008403
Impact factor: 7.2; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Multimodal information fusion research struggles to align heterogeneous modalities and to address data imbalance, especially when integrating visual, haptic, and textual information, three modalities that offer complementary perceptual and semantic features. Current research focuses on Transformers for unimodal and visual-haptic bimodal tasks and neglects tri-modal integration. Leveraging the semantic bridging capacity of text could address this limitation in cross-sensory learning. We propose VHTformer, a Transformer-based framework designed to unify visual, haptic, and textual modalities via joint query learning. The model uses hierarchical attention mechanisms: self-attention refines intra-modal features (e.g., extracting texture from haptic signals or contextual semantics from text), while cross-attention aligns spatial-semantic patterns across modalities through learnable joint queries. This enables synergistic fusion of geometric shapes (vision), material properties (haptics), and descriptive attributes (text). Experiments were conducted on three multimodal datasets (ObjectFolder 2.0, Touch and Go, and ObjectFolder Real) covering a total of 100+ object categories with diverse material and shape properties. To mitigate class imbalance and ensure statistical reliability, we adopted stratified 5-fold cross-validation. In addition, we evaluated the model under Gaussian noise injection to verify its robustness. VHTformer achieves up to 99.55% recognition accuracy and demonstrates strong robustness, highlighting the value of tri-modal integration for comprehensive object understanding.
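As a rough illustration of the joint-query idea described in the abstract, the sketch below lets a small set of query vectors attend over the concatenated tokens of all three modalities using plain scaled dot-product attention. It is not the authors' implementation: the function names, the 2-dimensional features, and all numeric values are assumptions made for illustration, and the learned query/key/value projections, multiple heads, and self-attention stages of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Scaled dot-product attention with keys = values = tokens.

    Learned projections are left out to keep the sketch short.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query to every token, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of the token vectors.
        fused.append([sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(dim)])
    return fused


# Hypothetical token features per modality (made-up values).
visual_tokens = [[1.0, 0.0], [0.8, 0.2]]
haptic_tokens = [[0.0, 1.0]]
text_tokens = [[0.5, 0.5]]

# In a trained model the joint queries would be learned parameters;
# here they are fixed by hand for illustration.
joint_queries = [[1.0, 0.0], [0.0, 1.0]]

# Each joint query attends over tokens from all three modalities at once.
all_tokens = visual_tokens + haptic_tokens + text_tokens
fused = cross_attention(joint_queries, all_tokens)
print(fused)
```

Each output row is a convex combination of the input tokens, so a query pointing along the first feature axis draws mostly on tokens that are strong in that feature, whichever modality they come from. That shared-query weighting is the sense in which cross-attention can align patterns across modalities.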
About the journal:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the web site will continuously be updated with new articles and the publication time will be short.