{"title":"Formant Tracking by Combining Deep Neural Network and Linear Prediction","authors":"Sudarsana Reddy Kadiri;Kevin Huang;Christina Hagedorn;Dani Byrd;Paavo Alku;Shrikanth Narayanan","doi":"10.1109/OJSP.2025.3530876","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530876","url":null,"abstract":"Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant estimates given by a model-driven linear prediction (LP) method. In the refinement process, the three lowest formants, initially estimated by the DNN-based method, are frame-wise replaced with local spectral peaks identified by the LP method. The LP-based refinement stage can be seamlessly integrated into the DNN without any training. As an LP method, the study advocates the use of quasi-closed phase forward-backward (QCP-FB) analysis. Three spectral representations are compared as DNN inputs: mel-frequency cepstral coefficients (MFCCs), the spectrogram, and the complex spectrogram. Formant tracking performance was evaluated by comparing the proposed refined DNN tracker with seven reference trackers, which included both signal processing and deep learning based methods. As evaluation data, ground truth formants of the Vocal Tract Resonance (VTR) corpus were used. The results demonstrate that the refined DNN trackers outperformed all conventional trackers. The best results were obtained by using the MFCC input for the DNN. The proposed MFCC refinement (MFCC-DNN<sub>QCP-FB</sub>) reduced estimation errors by 0.8 Hz, 12.9 Hz, and 11.7 Hz for the first (F1), second (F2), and third (F3) formants, respectively, compared to the Deep Formants refinement (DeepF<sub>QCP-FB</sub>).
When compared to the model-driven KARMA tracking method, the proposed refinement reduced estimation errors by 2.3 Hz, 55.5 Hz, and 143.4 Hz for F1, F2, and F3, respectively. A detailed evaluation across various phonetic categories and gender groups showed that the proposed hybrid refinement approach improves formant tracking performance across most test conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"222-230"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843356","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143430569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models","authors":"Ali N. Salman;Karen Rosero;Lucas Goncalves;Carlos Busso","doi":"10.1109/OJSP.2025.3530793","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530793","url":null,"abstract":"Recent advancements in <italic>dynamic facial expression recognition</italic> (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from over-fitting due to the limited size and diversity of the training data for fully <italic>supervised learning</italic> (SL) models. A significant challenge with existing models based on static features in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards more prevalent or basic emotional features present in the static domain, especially with posed expressions. Therefore, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as <italic>mixture of emotion-dependent experts</italic> (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse emotional static features to train DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as binary classifiers. Our DFER model combines these static representations with recurrent models, modeling their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, marking a significant improvement over the baseline, which presented a macro F1-score of 70.9%. The DFER baseline is similar to MoEDE, but it uses a single static feature extractor rather than stacked extractors.
Additionally, our proposed approach shows consistent improvements compared to four other popular baselines.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"323-332"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843404","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Classification Models With Sophisticated Counterfactual Images","authors":"Xiang Li;Ren Togo;Keisuke Maeda;Takahiro Ogawa;Miki Haseyama","doi":"10.1109/OJSP.2025.3530843","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530843","url":null,"abstract":"In deep learning, training data, which are mainly from realistic scenarios, often carry certain biases. This causes deep learning models to learn incorrect relationships between features when using these training data. However, because these models are <italic>black boxes</italic>, these problems cannot be solved effectively. In this paper, we aimed to 1) improve existing processes for generating language-guided counterfactual images and 2) employ counterfactual images to efficiently and directly identify model weaknesses in learning incorrect feature relationships. Furthermore, 3) we combined counterfactual images into datasets to fine-tune the model, thus correcting the model's perception of feature relationships. Through extensive experimentation, we confirmed the improvement in the quality of the generated counterfactual images, which can more effectively enhance the classification ability of various models.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"89-98"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843353","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feasibility Study of Location Authentication for IoT Data Using Power Grid Signatures","authors":"Mudi Zhang;Charana Sonnadara;Sahil Shah;Min Wu","doi":"10.1109/OJSP.2025.3530847","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530847","url":null,"abstract":"Ambient signatures related to the power grid offer an under-utilized opportunity to verify the time and location of sensing data collected by the Internet-of-Things (IoT). Such power signatures as the Electrical Network Frequency (ENF) have been used in multimedia forensics to answer questions about the time and location of audio-visual recordings. Going beyond multimedia data, this paper investigates a refined power signature of Electrical Network Voltage (ENV) for IoT sensing data and carries out a feasibility study of location verification for IoT data. ENV reflects the variations of the power system's supply voltage over time and is also present in the optical sensing data, akin to ENF. A physical model showing the presence of ENV in the optical sensing data is presented along with the corresponding signal processing mechanisms to estimate and utilize ENV signals from the power and optical sensing data as location stamps. 
Experiments are conducted in the State of Maryland in the United States to demonstrate the feasibility of using ENV signals for location authentication of IoT data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"405-416"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843385","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143740238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition","authors":"Lucas Goncalves;Huang-Cheng Chou;Ali N. Salman;Chi-Chun Lee;Carlos Busso","doi":"10.1109/OJSP.2025.3530274","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530274","url":null,"abstract":"<italic>Audio-visual emotion recognition</italic> (AVER) has been an important research area in <italic>human-computer interaction</italic> (HCI). Traditionally, audio-visual emotional datasets and corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies when annotations are made under different emotional stimuli conditions—whether through unimodal or multimodal stimuli. This study investigates the potential for enhanced AVER system performance by integrating diverse levels of annotation stimuli, reflective of varying perceptual evaluations. We propose a two-stage training method to train models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach utilizes different levels of annotation stimuli according to which modality is present within different layers of the model, effectively modeling annotation at the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct the experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieved the best performances in macro-/weighted-F1 scores.
Additionally, we measure the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"165-174"},"PeriodicalIF":2.9,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10842047","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Regularization With Reverse Sorted Sum of Squares via an Unrolled Difference-of-Convex Approach","authors":"Takayuki Sasaki;Kazuya Hayase;Masaki Kitahara;Shunsuke Ono","doi":"10.1109/OJSP.2025.3529312","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3529312","url":null,"abstract":"This paper proposes a sparse regularization method with a novel sorted regularization function. Sparse regularization is commonly used to solve underdetermined inverse problems. Traditional sparse regularization functions, such as the <inline-formula><tex-math>$L_{1}$</tex-math></inline-formula>-norm, suffer from problems like amplitude underestimation and vanishing perturbations. The reverse ordered weighted <inline-formula><tex-math>$L_{1}$</tex-math></inline-formula>-norm (ROWL) addresses these issues but introduces new challenges. These include developing an algorithm grounded in theory, not heuristics, reducing computational complexity, enabling the automatic determination of numerous parameters, and ensuring the number of iterations remains feasible. In this study, we propose a novel sparse regularization function called the reverse sorted sum of squares (RSSS) and then construct an unrolled algorithm that addresses both of the aforementioned problems and the four challenges. The core idea of our proposed method lies in transforming the optimization problem into a difference-of-convex programming problem, for which solutions are known.
In experiments, we applied the RSSS regularization method to image deblurring and super-resolution tasks and confirmed its superior performance compared to conventional methods, all with feasible computational complexity.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"57-67"},"PeriodicalIF":2.9,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10840312","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wasserstein Non-Negative Matrix Factorization for Multi-Layered Graphs and its Application to Mobility Data","authors":"Hirotaka Kaji;Kazushi Ikeda","doi":"10.1109/OJSP.2025.3528869","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3528869","url":null,"abstract":"Multi-layered graphs are popular in mobility studies because transportation data include multiple modalities, such as railways, buses, and taxis. Another example of a multi-layered graph is the time series of mobility when periodicity is considered. The graphs are analyzed using standard signal processing methods such as singular value decomposition and tensor analysis, which can estimate missing values. However, their feature extraction abilities are insufficient for optimizing mobility networks. This study proposes a method that combines the Wasserstein non-negative matrix factorization (W-NMF) with line graphs to obtain low-dimensional representations of multi-layered graphs. A line graph is defined as the dual graph of a graph, where the vertices correspond to the edges of the original graph, and the edges correspond to the vertices. Thus, the shortest path length between two vertices in the line graph corresponds to the distance between the edges in the original graph. Through experiments using synthetic and benchmark datasets, we show that the performance and robustness of our method are superior to conventional methods. 
Additionally, we apply our method to real-world taxi origin-destination data as a mobility dataset and discuss the findings.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"194-202"},"PeriodicalIF":2.9,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10840315","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Target Tracking Using a Time-Varying Autoregressive Dynamic Model","authors":"Ralph J. Mcdougall;Simon J. Godsill","doi":"10.1109/OJSP.2025.3528896","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3528896","url":null,"abstract":"Target tracking algorithms commonly use structured dynamic models which require prior training of fixed model parameters. These trackers have reduced accuracy if the target behaviour does not match the dynamic model. This work develops an algorithm that can infer target dynamic behaviour online, allowing the target dynamic to be time-varying as well. A time-varying target dynamic allows the target to change its level of maneuverability continuously through the trajectory, so the trajectory may have highly variable levels of agility. The developed tracker assumes the target dynamic can be described by an autoregressive model with time-varying parameters and constant, but unknown innovation variance. The autoregressive coefficients and innovation variance are then inferred online while simultaneously tracking the target. A data-association model is included to allow for clutter in the target measurements. 
This tracker is then compared against common structured trackers and is shown to approximate these models while achieving better state filtering and prediction accuracy for an agile target.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"147-155"},"PeriodicalIF":2.9,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10840270","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143403968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Multivariate Decision Trees for Fast QTMT Partitioning in VVC","authors":"Jose N. Filipe;Luis M. N. Tavora;Sergio M. M. Faria;Antonio Navarro;Pedro A. A. Assuncao","doi":"10.1109/OJSP.2025.3528897","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3528897","url":null,"abstract":"The demand for ultra-high definition (UHD) content has led to the development of advanced compression tools to enhance the efficiency of standard codecs. One such tool is the Quaternary Tree and Multi-Type Tree (QTMT) used in the Versatile Video Coding (VVC), which significantly improves coding efficiency over previous standards, but introduces substantially higher computational complexity. To address the challenge of reducing computational complexity with minimal impact on coding efficiency, this paper presents a novel approach for intra-coding 360<inline-formula><tex-math>$^{circ }$</tex-math></inline-formula> video in Equirectangular Projection (ERP) format. By exploiting distinct complexity and spatial characteristics of the North, Equator, and South regions in ERP images, the proposed method is devised upon a region-based approach, using novel linear multivariate decision trees to determine whether a given partition type can be skipped. Optimisation of model parameters and an adaptive thresholding method are also presented.
The experimental results show a Complexity Gain of approximately 16% with a negligible BD-Rate loss of only 0.06%, surpassing current state-of-the-art methods in terms of complexity gain per percentage point of BD-Rate loss.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"175-183"},"PeriodicalIF":2.9,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10840301","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Generative Class Incremental Learning Performance With a Model Forgetting Approach","authors":"Taro Togo;Ren Togo;Keisuke Maeda;Takahiro Ogawa;Miki Haseyama","doi":"10.1109/OJSP.2025.3528900","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3528900","url":null,"abstract":"This study presents a novel approach to Generative Class Incremental Learning (GCIL) by introducing a forgetting mechanism, aimed at dynamically managing class information for better adaptation to streaming data. GCIL is an active topic in computer vision and an important continual learning approach for generative models. In humans, the ability to forget is a crucial brain function that facilitates continual learning by selectively discarding less relevant information. However, in the field of machine learning models, the concept of intentional forgetting has not been extensively investigated. In this study, we aim to bridge this gap by incorporating forgetting mechanisms into GCIL, thereby examining their impact on the models' ability to learn in continual learning.
Through our experiments, we have found that integrating the forgetting mechanisms significantly enhances the models' performance in acquiring new knowledge, underscoring the positive role that strategic forgetting plays in the process of continual learning.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"203-212"},"PeriodicalIF":2.9,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10840249","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143430568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}