{"title":"You Might Not Need Attention Diagonals","authors":"Yiming Cui;Xin Yao;Shijin Wang;Guoping Hu","doi":"10.1109/LSP.2025.3601497","DOIUrl":"https://doi.org/10.1109/LSP.2025.3601497","url":null,"abstract":"Pre-trained language models, such as GPT, BERT, have revolutionized natural language processing tasks across various fields. However, the current multi-head self-attention mechanisms in these models exhibit an “over self-confidence” issue, which has been underexplored in prior research, causing the model to attend heavily to itself rather than other tokens. In this study, we propose a simple yet efficient solution: discarding diagonal elements in the attention matrix, allowing the model to focus more on other tokens. Our experiments reveal that the proposed approach not only consistently improves upon vanilla attention in transformer models for diverse natural language understanding tasks, particularly for smaller models in resource-limited conditions, but also exhibits faster convergence in training speed. This effectiveness generalizes well across different languages, model types, and various natural language understanding tasks, while requiring almost no additional computation. Our findings challenge previous assumptions about multi-head self-attention and suggest a promising direction for developing more effective pre-trained language models.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3435-3439"},"PeriodicalIF":3.9,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dual-Path Multiple Instance Learning Network Guided by Image Quality Assessment for Cervical Whole Slide Image Classification","authors":"Lanlan Kang;Jian Wang;Jian Qin;Yongjun He;Bo Ding","doi":"10.1109/LSP.2025.3601043","DOIUrl":"https://doi.org/10.1109/LSP.2025.3601043","url":null,"abstract":"The existing cervical whole slide image classification methods ignore the influence of image quality, resulting in low classification accuracy. To address this, we propose a dual-path multiple instance learning classification method guided by image quality assessment. Specifically, a pre-trained quality assessment model assigns quality scores to patches, splitting them into high- and low-quality paths. In the high-quality path, patch features are weighted by their quality scores to emphasize reliable diagnostic regions. In the low-quality path, a key instance is selected using clustering and feature distance matching. Finally, a cross-attention module fuses features across quality levels. Our method achieves 94.64% accuracy and 91.74% AUC on a dataset of 2,434 WSIs collected from five medical centers, outperforming state-of-the-art methods.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3285-3289"},"PeriodicalIF":3.9,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grid-Free Radio Map Estimation via Unsupervised Implicit Continuous Representation","authors":"Xiaonan Chen;Jun Wang","doi":"10.1109/LSP.2025.3601038","DOIUrl":"https://doi.org/10.1109/LSP.2025.3601038","url":null,"abstract":"Radio map estimation (RME), also known as spectrum cartography (SC), aims to estimate instantaneous signal power distribution over a certain space-frequency region. Recent RME approaches typically discretize the to-be-estimated radio map into grid cells under a fixed resolution. Meshing subtly adds structural priors, e.g., low-rankness or deep image priors, to the radio map. These priors can effectively enhance the performance of RME, especially in blind scenarios. However, the downside is all the locations in a grid cell will share the same signal power, which is overly simplistic and contradicts the continuity nature of power propagation. This work puts forth a blind grid-free RME framework. We introduce implicit continuous representation (ICR), which learns a mapping between spatial coordinates and power propagation pattern of each transmitter. This mechanism conceptually enables estimating the signal power at any spatial location within a certain region. With some model-based interpretations and designated optimization criteria, the ICR-based framework could be fully unsupervised, using only sampled data for training. This implies that our approach is not prone to the prevalent generalizability issue. Experiments under simulated and ray-tracing datasets verify the effectiveness of the proposed approach.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3430-3434"},"PeriodicalIF":3.9,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Scale Cross-Dimensional Attention Network for Gland Segmentation","authors":"Chaozhi Yu;Hongnan Cheng;Yufei Huang;Zhizhe Lin;Teng Zhou","doi":"10.1109/LSP.2025.3600374","DOIUrl":"https://doi.org/10.1109/LSP.2025.3600374","url":null,"abstract":"Gland lesions affect a large global population. Accurately segmenting surface structures is crucial for assisting in the diagnosis of these diseases. In this direction, we investigate two key issues: 1) How to accurately segment gland morphology and irregular boundaries and 2) How to distinguish gland internal heterogeneity and its similarity to the background. The main results are that 1) parallel multi-scale attention (PMA) smooths the segmentation of blurred boundaries of varying sizes and improves detail accuracy. 2) Cross-dimensional attention (CDA) models the dependencies between gland channels and spatial dimensions to enhance the understanding of spatial information both inside and outside the gland, thereby more accurately distinguishing the gland from the background. Per the main results, we propose a multi-scale cross-dimensional attention network (MCANet) for gland segmentation. Extensive experiments on six real-world datasets demonstrate the superior performance of our method in gland segmentation.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3365-3369"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Least-Square Estimation of FM Rates for Removing LFM Interference in SAR Images","authors":"Peijun Jin;Huizhang Yang;Meng Sun;Xuanchen Guo;Shuolin Pan","doi":"10.1109/LSP.2025.3600375","DOIUrl":"https://doi.org/10.1109/LSP.2025.3600375","url":null,"abstract":"Ground and spaceborne radars can cause severe linear-frequency modulation (LFM) interference in space borne synthetic aperture radar (SAR) imagery. To address this problem, it is important to develop an efficient algorithm for LFM interference suppression in SAR images. For this purpose, in this paper we first propose a fast estimator for retrieving the frequency-modulation (FM) rates of the LFM interference based on least-square estimation. Then, we develop a spectral focusing-based algorithm for removing LFM interference using the estimated FM rates. Real-data and simulation experiments show that the proposed algorithm can accurately estimate the FM rates and effectively remove LFM interference signatures in interferometric wide-swath single-look-complex images.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3310-3314"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Altering Query Prompting With Contrastive Learning for Multimodal Intent Recognition","authors":"Yuxin Jia;Xueping Wang;Zhanpeng Shao;Min Liu","doi":"10.1109/LSP.2025.3599107","DOIUrl":"https://doi.org/10.1109/LSP.2025.3599107","url":null,"abstract":"Multimodal intent recognition utilizes heterogeneous modalities such as visual, auditory, and textual cues to infer user intent, serving as a pivotal component in human-machine interaction. Existing approaches, however, often rely on unimodal paradigms or shallow multimodal fusion, failing to model cross-modal semantic dependencies and struggling to extract discriminative features from non-verbal modalities, limiting their robustness in complex scenarios. To mitigate these limitations, we propose an Altering Query Prompting with Contrastive Learning framework (AQP-CL) that dynamically aligns and refines multimodal representations. Specifically, the Altering Query Prompting (AQP) module introduces a tri-modality rotation attention mechanism, where textual, visual, and acoustic modalities cyclically alternate as queries in cross-attention operations. This approach addresses modality bias while strengthening interdependencies between modalities, ultimately yielding intent-aware fused feature representations that preserve discriminative cues. The Label-semantic Augmented Contrastive Learning (LACL) strategy generates augmented samples through the intent-aware query prompt and enhances feature discrimination via NT-Xent loss on label tokens. By integrating high-confidence textual semantics from intent labels, LACL refines auxiliary modality features through contrastive alignment, ensuring robust cross-modal representation learning. Evaluations on IEMOCAP and MIntRec validate AQP-CL’s superiority, achieving state-of-the-art precision of 77.78% on IEMOCAP, a 3.41% improvement over existing methods.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3345-3349"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Task Diffusion With Masked Measurements","authors":"Mahdi Shamsi;Farokh Marvasti","doi":"10.1109/LSP.2025.3600370","DOIUrl":"https://doi.org/10.1109/LSP.2025.3600370","url":null,"abstract":"This letter addresses the problem of clustered multitask distributed estimation under masked measurements, where network nodes observe partial or incomplete data due to sensing limitations, communication constraints, or privacy requirements. We propose a novel extension of the Diffusion LMS (DLMS) algorithm that incorporates node-specific masking and a task-clustered structure. A tailored network-wide optimization problem is formulated to jointly handle masked observations and inter-cluster multitask estimation. Convergence analysis and simulation results demonstrate the effectiveness and robustness of the proposed approach in improving estimation performance under partial observability.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3490-3494"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145078695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StreamMel: Real-Time Zero-Shot Text-to-Speech Via Interleaved Continuous Autoregressive Modeling","authors":"Hui Wang;Yifan Yang;Shujie Liu;Jinyu Li;Lingwei Meng;Yanqing Liu;Jiaming Zhou;Haoqin Sun;Yan Lu;Yong Qin","doi":"10.1109/LSP.2025.3600376","DOIUrl":"https://doi.org/10.1109/LSP.2025.3600376","url":null,"abstract":"Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3530-3534"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain-Factored Untrained Deep Prior for Spectrum Cartography","authors":"Subash Timilsina;Sagar Shrestha;Lei Cheng;Xiao Fu","doi":"10.1109/LSP.2025.3599714","DOIUrl":"https://doi.org/10.1109/LSP.2025.3599714","url":null,"abstract":"Spectrum cartography (SC) aims to estimate the radio power map of multiple emitters over space and frequency using limited sensor data. Recent advances leverage learned deep generative models (DGMs) as structural priors, achieving state-of-the-art performance by capturing complex spatial-spectral patterns. However, DGMs require large training datasets and may suffer under distribution shifts. To address these limitations, we propose a training-free SC approach based on untrained neural networks (UNNs), which encode structural priors through architectural design. Our custom UNN exploits a spatio-spectral factorization model rooted in the physical structure of radio maps, enabling low sample complexity. Experiments show that our method matches the performance of DGM-based SC without any training data.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3440-3444"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DOA or Speaker Embedding: Which is Better for Multi-Microphone Target Speaker Extraction","authors":"Shuang Zhang;Jie Zhang;Yichi Wang;Haoyin Yan","doi":"10.1109/LSP.2025.3600168","DOIUrl":"https://doi.org/10.1109/LSP.2025.3600168","url":null,"abstract":"Target speaker extraction (TSE) is a useful front-end to improve the speech quality and intelligibility for speech applications, whereas direction-of-arrival (DOA) and speaker embedding are two of the most often-used assistive clues to identify the target speaker in audio-only multi-microphone systems. Both can significantly improve the TSE performance compared to blind TSE models, which however have not yet been comprehensively compared in literature. In order to show their pros and cons, in this work we therefore build a unified framework for a fair comparison that allows for both DOA and speaker embedding as the assistive clue. The DOA is used to calculate multichannel spatiotemporal speech features and a speaker encoder is designed to extract the speaker embedding, either of which is then fused with the noisy speech features for TSE. We can then evaluate their respective strengths in diverse acoustic conditions, e.g., varying noise level, microphone number, speaker location. Results show that given true DOA angles, the DOA-based TSE model always outperforms the speaker embedding based counterpart regardless of noise/microphone/location conditions, meaning the stronger discriminativity of DOA in terms of speaker identity. This superiority becomes smaller if the DOA mismatch increases, and the latter can do better in the large DOA mismatch case.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3350-3354"},"PeriodicalIF":3.9,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}