{"title":"Spatio-Temporal Pyramid Keypoint Detection With Event Cameras","authors":"Yuqing Zhu;Yuan Gao;Tianle Ding;Xiang Liu;Wenfei Yang;Tianzhu Zhang","doi":"10.1109/TCSVT.2025.3559299","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559299","url":null,"abstract":"Event cameras are bio-inspired sensors with diverse advantages, including high temporal resolution and minimal power consumption. Therefore, event cameras enjoy a wide range of applications in computer vision, among which event keypoint detection plays a vital role. However, repeatable event keypoint detection remains challenging because the lack of temporal inter-frame interaction leads to descriptors with limited temporal consistency, which restricts the ability to perceive keypoint motion. Besides, detectors learned at single scale features are not suitable for event keypoints with significant motion speed differences in high-speed scenarios. To deal with these problems, we propose a novel Spatio-Temporal Pyramid Keypoint Detection Network (STPNet) for event cameras via a temporally consistent descriptor learning (TCL) module and a spatially diverse detector learning (SDL) module. The proposed STPNet enjoys several merits. First, the TCL module generates temporally consistent descriptors for specific keypoint motion patterns. Second, the SDL module produces spatially diverse detectors for applications in high-speed motion scenarios. Extensive experimental results on three challenging benchmarks show that our method notably outperforms state-of-the-art event keypoint detection methods. Specifically, our STPNet can outperform the best event keypoint detection method by 0.21px in reprj. error on Event-Camera, 4% in IoU on N-Caltech101, 0.13px in reprj. error on HVGA ATIS Corner and 5.94% in matching accuracy on DSEC.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9384-9397"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Matryoshka Learning With Metric Transfer for Image-Text Matching","authors":"Pengzhe Wang;Lei Zhang;Zhendong Mao;Nenan Lyu;Yongdong Zhang","doi":"10.1109/TCSVT.2025.3558996","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558996","url":null,"abstract":"Image-text matching is a significant technology for vision-language tasks, as it bridges the semantic gap between visual and text modalities. Although existing methods have achieved remarkable progress, high-dimensional embeddings or ensemble methods are often used to achieve sufficiently good recall or accuracy, which significantly increase the computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment, however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) the distillation from teacher model to student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results. 2) distilling knowledge from each modality independently cannot ensure the preservation of cross-modal alignment established in the original embeddings, which can lead to the compressed ones failing to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with shared backbone to compress the different granularity information into a low-dimensional embedding, facilitating the integration of cross-modal matching and knowledge distillation in one single stage. Meanwhile, a novel metric transfer criterion is innovated to diversely align the metric relations across embedding spaces of different dimensions and modalities, ensuring a good cross-modal alignment after distillation. In this way, our MAMET transfers strong representation and generalization capability from the high-dimensional ensemble models to a basic network, which not only can get great performance boost, but also introduce no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, consistently achieving an average of 2%-20% performance improvement over state-of-the-art methods across various backbones and domains.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9502-9516"},"PeriodicalIF":11.1,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarse-to-Fine Hypergraph Network for Spatiotemporal Action Detection","authors":"Ping Li;Xingchao Ye;Lingfeng He","doi":"10.1109/TCSVT.2025.3558939","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558939","url":null,"abstract":"Spatiotemporal action detection localizes the action instances along both spatial and temporal dimensions, by identifying action start time and end time, action class, and object (e.g., actor) bounding boxes. It faces two primary challenges: 1) varying durations of actions and inconsistent tempo of action instances within the same class, and 2) modeling complex object interactions, which are not well handled by previous methods. For the former, we develop the coarse-to-fine attention module, which employs an efficient dynamic time warping to make a coarse estimation of action frames by eliminating context-agnostic features, and further adopts the attention mechanism to capture the first-order object relations within those action frames. This results in a finer-granularity of action estimation. For the latter, we design the ternary high-order hypergraph neural networks, which model the spatial relation, the motion dynamics, and the high-order relations of different objects across frames. This encourages the positive relation of the objects within the same actions, while suppressing the negative relation of those in different actions. Therefore, we present a Coarse-to-Fine Hypergraph Network, abbreviated as CFHN, for spatiotemporal action detection, by considering the object local context, the first-order object relations, and the high-order object relations together. It combines the spatiotemporal first-order and high-order features along the channel dimension to obtain satisfying detection results. Extensive experiments on several benchmarks including AVA, JHMDB-21, and UCF101-24 demonstrate the superiority of the proposed approach.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8653-8665"},"PeriodicalIF":11.1,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structured Light Image Planar-Topography Feature Decomposition for Generalizable 3D Shape Measurement","authors":"Mingyang Lei;Jingfan Fan;Long Shao;Hong Song;Deqiang Xiao;Danni Ai;Tianyu Fu;Yucong Lin;Ying Gu;Jian Yang","doi":"10.1109/TCSVT.2025.3558732","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558732","url":null,"abstract":"The application of structured light (SL) techniques has achieved remarkable success in three-dimensional (3D) measurements. Traditional methods generally calculate SL information pixel by pixel to obtain the measurement results. Recently, the rise of deep learning (DL) has led to significant developments in this task. However, existing DL-based methods generally learn all features within the image in an end-to-end manner, ignoring the distinction between SL and non-SL information. Therefore, these methods may encounter difficulties in focusing on subtle variations in SL patterns across different scenes, thereby degrading measurement precision. To overcome this challenge, we propose a novel SL Image Planar-Topography Feature Decomposition Network (SIDNet). To fully utilize the information from different SL modality images (fringe and speckle), we decompose different modalities into topography features (modality-specific) and planar features (modality-shared). A physics-driven decomposition loss is proposed to make the topography/planar features dissimilar/similar, which guides the network to distinguish between SL and non-SL information. Moreover, to obtain modality-fused features with global overview and local detail information, we propose a wrapped phase-driven feature fusion module. Specifically, a novel Tri-modality Mamba block is designed to integrate different sources with the guidance of the wrapped phase features. Extensive experiments demonstrate the superiority of our SIDNet in multiple simulated 3D measurement scenes. Moreover, our method shows better generalization ability than other DL models and can be directly applicable to unseen real-world scenes.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9517-9529"},"PeriodicalIF":11.1,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLENet: A Bio-Inspired Lightweight and Efficient Network for Left Ventricle Segmentation in Echocardiography","authors":"Xintao Pang;Fengjuan Yao;Yanming Zhang;Yue Sun;Edmundo Patricio Lopes Lao;Chuan Lin;Patrick Cheong-Iao Pang;Wei Wang;Wei Li;Zhifan Gao;Tao Tan","doi":"10.1109/TCSVT.2025.3558496","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558496","url":null,"abstract":"In echocardiography, accurate segmentation of the left ventricle at end-diastole (ED) and end-systole (ES) is crucial for quantitative assessment of left ventricular ejection fraction. However, as a dynamic imaging modality requiring real-time analysis and frequently performed in various clinical settings with portable devices, this challenges mainstream approaches that primarily enhance model performance by increasing the number of parameters and computational costs, while lacking targeted optimization for its characteristics. To address these challenges, we propose BLENet, a lightweight segmentation model inspired by biological vision mechanisms. By integrating key mechanisms from biological vision systems with medical image features, our model achieves efficient and accurate segmentation. Specifically, the center-surround antagonism of retinal ganglion cells and the lateral geniculate nucleus exhibits high sensitivity to contrast variations, corresponding to the distinct contrast between the ventricular chamber (hypoechoic) and myocardial wall (hyperechoic) in ultrasound images. Based on this, we designed an antagonistic module to enhance feature extraction in target regions. Subsequently, the directional selectivity mechanism in the V1 cortex aligns with the variable directional features of the ventricular boundary, inspiring our direction-selective module to improve segmentation accuracy. Finally, we introduce an adaptive wavelet fusion module in the decoding network to address the limited receptive field of convolutions and enhance feature integration in cardiac ultrasound. Experiments demonstrate that our model contains only 0.16M parameters and requires no pre-training. On the CAMUS dataset, it achieves Dice coefficient values of 0.951 and 0.927 for ED and ES phases respectively, while on the EchoNet-Dynamic dataset, it achieves 0.933 and 0.909, with an inference speed of 112 FPS on NVIDIA RTX 2080 Ti. Evaluation on an external clinical dataset indicates our model’s promising generalization and potential for clinical application.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9218-9233"},"PeriodicalIF":11.1,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge Approximation Text Detector","authors":"Chuang Yang;Xu Han;Tao Han;Han Han;Bingxuan Zhao;Qi Wang","doi":"10.1109/TCSVT.2025.3558634","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558634","url":null,"abstract":"Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via box-to-polygon strategy or decomposing a contour into pieces for fitting gradually, the deficiency of coarse contours or complex pipelines always exists in these models. Considering the above issues, we introduce <italic>EdgeText</i> to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, it is observed that the two long edges of texts can be regarded as smooth curves. It allows us to build contours via continuous and smooth edges that cover text regions tightly instead of fitting piecewise, which helps avoid the two limitations in current models. Inspired by this observation, EdgeText formulates the text representation as the edge approximation problem via parameterized curve fitting functions. In the inference stage, our model starts with locating text centers, and then creating curve functions for approximating text edges relying on the points. Meanwhile, truncation points are determined based on the location features. In the end, extracting curve segments from curve functions by using the pixel coordinate information brought by truncation points to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed. It encourages our model to pay attention to the recognition of edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) to force the proposed model to focus on the curve distribution and avoid being disturbed by text scales. Ablation experiments demonstrate that EdgeText can fit scene texts compactly and naturally. Comparisons show that EdgeText is superior to existing methods on multiple public datasets. Code is available at <uri>https://github.com/omtcyang/EdgeTD</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9234-9245"},"PeriodicalIF":11.1,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CMNet: Cross-Modal Coarse-to-Fine Network for Point Cloud Completion Based on Patches","authors":"Zhenjiang Du;Zhitao Liu;Guan Wang;Jiwei Wei;Sophyani Banaamwini Yussif;Zheng Wang;Ning Xie;Yang Yang","doi":"10.1109/TCSVT.2025.3557842","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557842","url":null,"abstract":"Point clouds serve as the foundational representation of 3D objects, playing a pivotal role in both computer vision and computer graphics. Recently, the acquisition of point clouds has been effortless because of the development of hardware devices. However, the collected point clouds may be incomplete due to environmental conditions, such as occlusion. Therefore, completing partial point clouds becomes an essential task. The majority of current methods address point cloud completion via the utilization of shape priors. While these methods have demonstrated commendable performance, they often encounter challenges in preserving the global structural and geometric details of the 3D shape. In contrast to those mentioned earlier, we propose a novel cross-modal coarse-to-fine network (CMNet) for point cloud completion. Our method utilizes additional image information to provide global information, thus avoiding the loss of structure. To ensure that the generated results contain sufficient geometric details, we propose a coarse-to-fine learning approach based on multiple patches. Specifically, we encode the image and use multiple generators to generate multiple coarse patches, which are combined into a complete shape. Subsequently, based on the coarse patches generated in advance, we generate fine patches by combining partial point cloud information. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9132-9147"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Circuits and Systems for Video Technology Publication Information","authors":"","doi":"10.1109/TCSVT.2025.3547202","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3547202","url":null,"abstract":"","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 4","pages":"C2-C2"},"PeriodicalIF":8.3,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949578","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Circuits and Systems Society Information","authors":"","doi":"10.1109/TCSVT.2025.3547204","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3547204","url":null,"abstract":"","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 4","pages":"C3-C3"},"PeriodicalIF":8.3,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949579","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143783370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial Introduction to the Special Issue on Segment Anything for Videos and Beyond","authors":"Wenguan Wang;Hengshuang Zhao;Xinggang Wang;Fisher Yu;David Crandall","doi":"10.1109/TCSVT.2025.3551468","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3551468","url":null,"abstract":"","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 4","pages":"2947-2950"},"PeriodicalIF":8.3,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949584","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}