Overlapped/Non-Overlapped Speech Transition Point Detection Using Bag-of-Audio-Words
Shikha Baghel, S. Prasanna, P. Guha
2020 International Conference on Signal Processing and Communications (SPCOM), 2020-07-01
DOI: 10.1109/SPCOM50965.2020.9179591 (https://doi.org/10.1109/SPCOM50965.2020.9179591)
Citation count: 0
Abstract
Overlapped speech is an audio signal in which two or more speakers talk simultaneously. It is one of the main sources of error for speaker diarization systems. This work presents an initial study on identifying the transition points from overlapped to non-overlapped speech and vice versa. Characteristics of overlapped and non-overlapped speech are examined in terms of the vocal tract system, the excitation source, and the modulation spectrum. The Hilbert envelope (HE) of the Linear Prediction (LP) residual signal represents the excitation source characteristics of the speech signal. The Sum of Ten Largest Peaks (STLP) of the spectrum and the Mel-Frequency Cepstral Coefficients (MFCCs) represent vocal tract shape information. The modulation spectrum energy (ModSE) captures information about the slowly varying temporal envelope of speech. A Bag-of-Audio-Words (BoAW) based approach is used to detect the transition points. News debates are one of the main sources of naturally occurring overlapped speech, so the present work is evaluated on an Indian news debate scenario. A high Identification Rate (IR) and a low Spurious Rate (SR) are observed when all the features are used together as a 16-dimensional feature vector (13 MFCCs, HE of LP residual, STLP, and ModSE) for the detection task.
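The abstract does not say how the codebook is built, how long the analysis windows are, or how BoAW representations are compared, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that spectral peaks are local maxima of a magnitude spectrum, that a codebook of audio words is already available (for example from clustering training frames), and that a transition is scored by the L1 distance between the BoAW histograms on either side of a frame. All function names (`stlp`, `boaw_histogram`, `transition_scores`) and the toy data are hypothetical.

```python
def stlp(magnitudes, n_peaks=10):
    """Sum of the n_peaks largest local maxima of a magnitude spectrum.

    Assumption: a "peak" is a bin larger than its left neighbour and
    at least as large as its right neighbour.
    """
    peaks = [
        magnitudes[i]
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i - 1] < magnitudes[i] >= magnitudes[i + 1]
    ]
    return sum(sorted(peaks, reverse=True)[:n_peaks])


def nearest_word(frame, codebook):
    """Index of the codeword closest to the frame (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )


def boaw_histogram(frames, codebook):
    """Normalized histogram of audio-word counts over a run of frames."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_word(frame, codebook)] += 1
    return [c / len(frames) for c in counts]


def transition_scores(frames, codebook, half_window):
    """L1 distance between the BoAW histograms left and right of each index."""
    scores = {}
    for t in range(half_window, len(frames) - half_window + 1):
        left = boaw_histogram(frames[t - half_window:t], codebook)
        right = boaw_histogram(frames[t:t + half_window], codebook)
        scores[t] = sum(abs(a - b) for a, b in zip(left, right))
    return scores


# Toy data (hypothetical): 2-d features instead of the 16-d vector in the
# abstract, with an abrupt change of audio word at frame index 6.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1]] * 6 + [[0.9, 1.1]] * 6

scores = transition_scores(frames, codebook, half_window=3)
best_t = max(scores, key=scores.get)
print(best_t, scores[best_t])  # prints: 6 2.0
```

In this toy case the score peaks exactly where the audio-word distribution changes. With real features, the per-frame vector would be the 16-dimensional combination described above, and a threshold or peak-picking rule on the scores (also an assumption here) would be needed to decide which indices count as transition points.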