{"title":"On Conditional Independence Graph Learning From Multi-Attribute Gaussian Dependent Time Series","authors":"Jitendra K. Tugnait","doi":"10.1109/OJSP.2025.3578807","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578807","url":null,"abstract":"Estimation of the conditional independence graph (CIG) of high-dimensional multivariate Gaussian time series from multi-attribute data is considered. Existing methods for graph estimation for such data are based on single-attribute models where one associates a scalar time series with each node. In multi-attribute graphical models, each node represents a random vector or vector time series. In this paper we provide a unified theoretical analysis of multi-attribute graph learning for dependent time series using a penalized log-likelihood objective function formulated in the frequency domain using the discrete Fourier transform of the time-domain data. We consider both convex (sparse-group lasso) and non-convex (log-sum and SCAD group penalties) penalty/regularization functions. We establish sufficient conditions in a high-dimensional setting for consistency (convergence of the inverse power spectral density to the true value in the Frobenius norm), local convexity when using non-convex penalties, and graph recovery. We do not impose any incoherence or irrepresentability condition for our convergence results. 
We also empirically investigate selection of the tuning parameters based on the Bayesian information criterion, and illustrate our approach using numerical examples utilizing both synthetic and real data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"705-721"},"PeriodicalIF":2.9,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Random Matrix Theory Predictions of Dominant Mode Rejection SINR Loss due to Signal in the Training Data","authors":"Christopher C. Hulbert;Kathleen E. Wage","doi":"10.1109/OJSP.2025.3578812","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578812","url":null,"abstract":"Detection and estimation performance depends on signal-to-interference-plus-noise ratio (SINR) at the output of an array. The Capon beamformer (BF) designed with ensemble statistics achieves the optimum SINR in stationary environments. Adaptive BFs compute their weights using the sample covariance matrix (SCM) obtained from snapshots, i.e., training samples. SINR loss, the ratio of adaptive to optimal SINR, quantifies the number of snapshots required to achieve a desired average level of performance. For adaptive Capon BFs that invert the full SCM, Reed et al. derived the SINR loss distribution and Miller quantified how the desired signal’s presence in the snapshots degrades that loss. Abraham and Owsley designed dominant mode rejection (DMR) for cases where the number of snapshots is less than or approximately equal to the number of sensors. DMR’s success in snapshot-starved passive sonar scenarios led to its application in other areas such as hyperspectral sensing and medical imaging. DMR forms a modified SCM as a weighted combination of the identity matrix and the dominant eigensubspace containing the loud interferers, thereby eliminating the inverse of the poorly estimated noise subspace. This work leverages recent random matrix theory (RMT) results to develop DMR performance predictions under the assumption that the desired signal is contained in the training data. Using white noise gain and interference suppression predictions, the paper derives a lower bound on DMR’s average SINR loss and confirms its accuracy using Monte Carlo simulations. 
Moreover, this paper creates a new eigensubspace leakage estimator applicable to broader RMT applications.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"735-752"},"PeriodicalIF":2.9,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030297","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144550496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The First Cadenza Challenges: Using Machine Learning Competitions to Improve Music for Listeners With a Hearing Loss","authors":"Gerardo Roa-Dabike;Michael A. Akeroyd;Scott Bannister;Jon P. Barker;Trevor J. Cox;Bruno Fazenda;Jennifer Firth;Simone Graetzer;Alinka Greasley;Rebecca R. Vos;William M. Whitmer","doi":"10.1109/OJSP.2025.3578299","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578299","url":null,"abstract":"Listening to music can be an issue for those with a hearing impairment, and hearing aids are not a universal solution. This paper details the first use of an open challenge methodology to improve the audio quality of music for those with hearing loss through machine learning. The first challenge (CAD1) had 9 participants. The second was a 2024 ICASSP grand challenge (ICASSP24), which attracted 17 entrants. The challenge tasks concerned demixing and remixing pop/rock music to allow a personalized rebalancing of the instruments in the mix, along with amplification to correct for raised hearing thresholds. The software baselines provided for entrants to build upon used two state-of-the-art demix algorithms: Hybrid Demucs and Open-Unmix. Objective evaluation used HAAQI, the Hearing-Aid Audio Quality Index. No entries improved on the best baseline in CAD1. It is suggested that this arose because demixing algorithms are relatively mature, and recent work has shown that access to large (private) datasets is needed to further improve performance. Learning from this, for ICASSP24 the scenario was made more difficult by using loudspeaker reproduction and specifying gains to be applied before remixing. This also made the scenario more useful for listening through hearing aids. Nine entrants scored better than the best ICASSP24 baseline. Most of the entrants used a refined version of Hybrid Demucs and NAL-R amplification. The highest scoring system combined the outputs of several demixing algorithms in an ensemble approach. 
These challenges are now open benchmarks for future research with freely available software and data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"722-734"},"PeriodicalIF":2.9,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030066","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AVCaps: An Audio-Visual Dataset With Modality-Specific Captions","authors":"Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen","doi":"10.1109/OJSP.2025.3578296","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578296","url":null,"abstract":"This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. 
This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"691-704"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029114","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144511206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Action Anticipation Through Action Cluster Prediction","authors":"Jiuxu Chen;Nupur Thakur;Sachin Chhabra;Baoxin Li","doi":"10.1109/OJSP.2025.3578300","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578300","url":null,"abstract":"Predicting near-future human actions in videos has become a focal point of research, driven by applications such as human-assisting robotics, collaborative AI services, and surveillance video analysis. However, the inherent challenge lies in deciphering the complex spatial-temporal dynamics of typical video feeds. While existing works excel in constrained settings with fine-grained action ground-truth labels, the general unavailability of such labeling at the frame level poses a significant hurdle. In this paper, we present an innovative solution to anticipate future human actions without relying on any form of supervision. Our approach involves generating pseudo-labels for video frames through the clustering of frame-wise visual features. These pseudo-labels are then input into a temporal sequence modeling module that learns to predict future actions in terms of pseudo-labels. Alongside the action anticipation method, we propose GreedyMapper, an evaluation scheme that provides a practical solution to the many-to-one mapping challenge, a task that existing mapping algorithms struggle to address. Through comprehensive experimentation conducted on demanding real-world cooking datasets, our unsupervised method demonstrates superior performance compared to weakly-supervised approaches by a significant margin on the 50Salads dataset. 
When applied to the Breakfast dataset, our approach yields strong performance compared to the baselines in an unsupervised setting and delivers competitive results to (weakly) supervised methods under a similar setting.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"641-650"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029147","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144366940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Learning With Automated Dual-Level Hyperparameter Tuning","authors":"Rakib Ul Haque;Panagiotis Markopoulos","doi":"10.1109/OJSP.2025.3578273","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578273","url":null,"abstract":"Federated Learning (FL) is a decentralized machine learning (ML) approach where multiple clients collaboratively train a shared model over several update rounds without exchanging local data. Similar to centralized learning, determining hyperparameters (HPs) like learning rate and batch size remains challenging yet critical for model performance. Current adaptive HP-tuning methods are often domain-specific and heavily influenced by initialization. Moreover, model accuracy often improves slowly, requiring many update rounds. This slow improvement is particularly problematic for FL, where each update round incurs high communication costs in addition to computation and energy costs. In this work, we introduce FLAUTO, the first method to perform dynamic HP-tuning simultaneously at both local (client) and global (server) levels. This dual-level adaptation directly addresses critical bottlenecks in FL, including slow convergence, client heterogeneity, and high communication costs, distinguishing it from existing approaches. FLAUTO leverages training loss and relative local model deviation as novel metrics, enabling robust and dynamic hyperparameter adjustments without reliance on initial guesses. By prioritizing high performance in early update rounds, FLAUTO significantly reduces communication and energy overhead—key challenges in FL deployments. 
Comprehensive experimental studies on image classification and object detection tasks demonstrate that FLAUTO consistently outperforms state-of-the-art methods, establishing its efficacy and broad applicability.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"795-802"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029096","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multidimensional Polynomial Phase Estimation","authors":"Heedong Do;Namyoon Lee;Angel Lozano","doi":"10.1109/OJSP.2025.3577503","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3577503","url":null,"abstract":"An estimation method is presented for polynomial phase signals, i.e., those adopting the form of a complex exponential whose phase is polynomial in its indices. Transcending the scope of existing techniques, the proposed estimator can handle an arbitrary number of dimensions and an arbitrary set of polynomial degrees along each dimension; the only requirement is that the number of observations per dimension exceeds the highest degree thereon. Embodied by a highly compact sequential algorithm, this estimator is efficient at high signal-to-noise ratios (SNRs), exhibiting a computational complexity that is strictly linear in the number of observations and at most quadratic in the number of polynomial terms. To reinforce the performance at low and medium SNRs, where any phase estimator is bound to be hampered by the inherent ambiguity caused by phase wrappings, suitable functionalities are incorporated and shown to be highly effective.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"651-681"},"PeriodicalIF":2.9,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11027552","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144367013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mask Optimization for Image Inpainting Using No-Reference Image Quality Assessment","authors":"Taiki Uchiyama;Mariko Isogawa","doi":"10.1109/OJSP.2025.3577089","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3577089","url":null,"abstract":"Image inpainting is a technique designed to remove unwanted regions from images and restore them. This technique is expected to be applied in various applications, including image editing, virtual reality (VR), mixed reality (MR), and augmented reality (AR). Typically, the inpainting process is based on missing regions predefined by user-applied masks. However, the specified areas may not always be ideal for inpainting, and the quality of the inpainting results varies depending on the annotated masked region. Therefore, this paper addresses the task of <b>generating masks that improve inpainting results</b>. To this end, we propose a method that utilizes No-Reference Image Quality Assessment (NR-IQA), which can score image quality without a reference image, to generate masked regions that maximize inpainting quality.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"856-864"},"PeriodicalIF":2.7,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11025170","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144725309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Learning-Based Cross-Modality Prediction for Lossless Medical Imaging Compression","authors":"Daniel S. Nicolau;Lucas A. Thomaz;Luis M. N. Tavora;Sergio M. M. Faria","doi":"10.1109/OJSP.2025.3564830","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3564830","url":null,"abstract":"Multimodal medical imaging, which involves the simultaneous acquisition of different modalities, enhances diagnostic accuracy and provides comprehensive visualization of anatomy and physiology. However, this significantly increases data size, posing storage and transmission challenges. Standard image codecs fail to properly exploit cross-modality redundancies, limiting coding efficiency. In this paper, a novel approach is proposed to enhance the compression gain and to reduce the computational complexity of a lossless cross-modality coding scheme for multimodal image pairs. The scheme uses a deep learning-based approach with Image-to-Image translation based on a Generative Adversarial Network architecture to generate an estimated image of one modality from its cross-modal pair. Two different approaches for inter-modal prediction are considered: one using the original and the estimated images for the inter-prediction scheme and another considering a weighted sum of both images. Subsequently, a decider based on a Convolutional Neural Network is employed to estimate the best coding approach to be selected among the two alternatives, before the coding step. A novel loss function that considers the decision accuracy and the compression gain of the chosen prediction approach is applied to improve the decision-making task. The experimental results on PET-CT and PET-MRI datasets demonstrate that the proposed approach improves by 11.76% and 4.61% the compression efficiency when compared with the single modality intra-coding of the Versatile Video Coding. 
Additionally, this approach reduces the computational complexity by almost half in comparison to testing both schemes and selecting the more compression-efficient one.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"489-497"},"PeriodicalIF":2.9,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10978054","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Content-Adaptive Inference for State-of-the-Art Learned Video Compression","authors":"Ahmet Bilican;M. Akın Yılmaz;A. Murat Tekalp","doi":"10.1109/OJSP.2025.3564817","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3564817","url":null,"abstract":"While the BD-rate performance of recent learned video codec models in both low-delay and random-access modes exceeds that of the respective modes of traditional codecs on average over common benchmarks, the performance improvements for individual videos with complex/large motions are much smaller compared to scenes with simple motion. This is related to the inability of a learned encoder model to generalize to motion vector ranges that have not been seen in the training set, which causes loss of performance in both coding of flow fields as well as frame prediction and coding. As a remedy, we propose a generic (model-agnostic) framework to control the scale of motion vectors in a scene during inference (encoding) to approximately match the range of motion vectors in the test and training videos by adaptively downsampling frames. This results in down-scaled motion vectors enabling: i) better flow estimation and, hence, better frame prediction; and ii) more efficient flow compression. We show that the proposed framework for content-adaptive inference improves the BD-rate performance of the already state-of-the-art low-delay video codec DCVC-FM by up to 41% on individual videos without any model fine-tuning. 
We present ablation studies to show measures of motion and scene complexity can be used to predict the effectiveness of the proposed framework.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"498-506"},"PeriodicalIF":2.9,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10978087","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}