Enhancing Speech Activity Detection in Air Traffic Control Communication via Push-to-Talk Event Identification

IF 5.6; Zone 2, Engineering & Technology; Q1, ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Instrumentation and Measurement Pub Date : 2025-03-26 DOI:10.1109/TIM.2025.3554853

Dongyue Guo;Xuehang You;Wang Yue;Chunpeng Wang;Jianwei Zhang;Yi Lin

IEEE Transactions on Instrumentation and Measurement, vol. 74, pp. 1-12. https://ieeexplore.ieee.org/document/10942411/

Citations: 0

Abstract

Speech activity detection (SAD) is a foundational component of automatic speech recognition and understanding (ASRU) applications in the air traffic control (ATC) domain. However, mid-speech clipping and hangover, both caused by inaccurate identification of speech endpoints, pose significant challenges to existing SAD approaches in ATC communication environments. To address these challenges, this article proposes a novel ATC-SAD framework that improves the accuracy of SAD in ATC communication by detecting the release event of the push-to-talk (PTT) switch (denoted as the PTT event). Compared with conventional SAD approaches, the proposed framework not only distinguishes speech from nonspeech signals but also detects PTT events in audio streams, thereby effectively identifying speech endpoints. To mine informative features from audio signals for the SAD task, a multiview feature learning (MFL) module is designed to extract acoustic features from the time, frequency, and cepstrum domains. Furthermore, an attention-based feature aggregation (AFA) module is designed to project the acoustic features into the embedding space. A contrastive learning module is proposed to learn discriminative features among the three classes (speech, nonspeech, and PTT event), which is expected to improve classification performance. In addition, to explore more effective neural architectures, four classical neural networks serve as backbone networks for implementing the proposed ATC-SAD framework. Experimental results on a real-world ATC dataset demonstrate the superiority of the proposed framework over competitive baselines, achieving high accuracy and robustness in challenging ATC communication scenarios.
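To make the role of the PTT event concrete, the sketch below shows one way frame-level three-class predictions could be turned into speech segments. It is an illustration under stated assumptions, not the decoding procedure of the paper, which the abstract does not describe: the label ids, the function name `segments_from_frames`, and the `max_gap` fallback are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical label ids for the three classes named in the abstract.
NONSPEECH, SPEECH, PTT = 0, 1, 2


def segments_from_frames(labels: List[int], max_gap: int = 3) -> List[Tuple[int, int]]:
    """Group frame labels into (start, end) segments, with end exclusive.

    A segment opens at the first SPEECH frame. It closes at a PTT frame
    (the switch release) or, as an assumed fallback, once more than
    `max_gap` NONSPEECH frames have passed since the last SPEECH frame.
    """
    segments: List[Tuple[int, int]] = []
    start = None        # index of the first frame of the open segment
    last_speech = None  # index of the most recent SPEECH frame
    for i, lab in enumerate(labels):
        if lab == SPEECH:
            if start is None:
                start = i
            last_speech = i
        elif start is not None:
            if lab == PTT:
                # The release event marks the endpoint directly.
                segments.append((start, i))
                start = None
            elif i - last_speech > max_gap:
                # No PTT event seen: fall back to a silence timeout.
                segments.append((start, last_speech + 1))
                start = None
    if start is not None:
        # The stream ended while a segment was still open.
        segments.append((start, last_speech + 1))
    return segments


if __name__ == "__main__":
    frames = [0, 0, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0]
    print(segments_from_frames(frames))
```

In this toy sequence the one-frame pause at index 4 does not split the first utterance, which is the mid-speech clipping case, and the PTT frame at index 7 closes the segment without waiting for a silence timeout, which is the hangover case.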


Source journal

IEEE Transactions on Instrumentation and Measurement (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

9.00

Self-citation rate

23.20%

Articles published

1294

Review time

3.9 months

Journal description: Papers are sought that address innovative solutions to the development and use of electrical and electronic instruments and equipment to measure, monitor and/or record physical phenomena for the purpose of advancing measurement science, methods, functionality and applications. The scope of these papers may encompass: (1) theory, methodology, and practice of measurement; (2) design, development and evaluation of instrumentation and measurement systems and components used in generating, acquiring, conditioning and processing signals; (3) analysis, representation, display, and preservation of the information obtained from a set of measurements; and (4) scientific and technical support for the establishment and maintenance of technical standards in the field of Instrumentation and Measurement.