Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation

IF 5.3 · Zone 2, Computer Science · JCR Q2, ROBOTICS

IEEE Robotics and Automation Letters Pub Date : 2025-09-29 DOI:10.1109/LRA.2025.3615532

Xiang Li;Yupeng Zheng;Pengfei Li;Yilun Chen;Ya-Qin Zhang;Wenchao Ding

IEEE Robotics and Automation Letters, vol. 10, no. 11, pp. 11690-11697. https://ieeexplore.ieee.org/document/11183690/

Citations: 0

Abstract

Occupancy prediction provides critical geometric and semantic understanding for robotics but faces an efficiency-accuracy trade-off. Current dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this letter, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels: encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer; and (2) a Teacher-Guided Initialization policy, which employs optimized parameter warm-up to accelerate model convergence. Validated on the Occ-ScanNet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even surpassing its depth-enhanced version, OPUS$\dagger$. With depth integration, DiScene$\dagger$ attains new state-of-the-art (SOTA) performance, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments.
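The four distillation levels named in the abstract can be pictured as four loss terms combined into one training objective. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss forms, so mean squared error is used here as a stand-in for every level, and the function names, the `weights` dictionary, the flat feature vectors, and the `threshold` value are all hypothetical. The only detail taken from the abstract is that the anchor level transfers high-confidence knowledge, which the sketch models by masking out low-confidence teacher entries.

```python
from typing import Dict, Sequence


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have equal length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def high_confidence_mse(
    student: Sequence[float],
    teacher: Sequence[float],
    confidence: Sequence[float],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> float:
    """MSE over only those entries where the teacher's confidence passes the threshold."""
    kept = [(s, t) for s, t, c in zip(student, teacher, confidence) if c >= threshold]
    if not kept:
        return 0.0  # nothing confident enough to distill from
    return sum((s - t) ** 2 for s, t in kept) / len(kept)


def multi_level_distillation_loss(
    student: Dict[str, Sequence[float]],
    teacher: Dict[str, Sequence[float]],
    anchor_confidence: Sequence[float],
    weights: Dict[str, float],
    threshold: float = 0.9,
) -> float:
    """Weighted sum of four per-level terms (assumed MSE; the paper's losses may differ)."""
    terms = {
        level: mse(student[level], teacher[level])
        for level in ("encoder", "query", "prior")
    }
    # Anchor level: distill only from high-confidence teacher entries.
    terms["anchor"] = high_confidence_mse(
        student["anchor"], teacher["anchor"], anchor_confidence, threshold
    )
    return sum(weights[level] * value for level, value in terms.items())


# Toy usage with made-up numbers.
student_feats = {"encoder": [1.0, 2.0], "query": [0.0, 0.0], "prior": [1.0, 1.0], "anchor": [0.0, 5.0]}
teacher_feats = {"encoder": [1.0, 4.0], "query": [0.0, 0.0], "prior": [1.0, 3.0], "anchor": [1.0, 0.0]}
level_weights = {"encoder": 1.0, "query": 1.0, "prior": 1.0, "anchor": 1.0}
loss = multi_level_distillation_loss(student_feats, teacher_feats, [0.95, 0.2], level_weights)
```

In this toy call the second anchor entry is ignored because its confidence (0.2) falls below the assumed threshold, so a large student-teacher gap there does not contribute to the loss.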




Source journal

IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)

CiteScore: 9.60

Self-citation rate: 15.40%

Articles published: 1428

About the journal: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.