Coherence-based phonemic analysis on the effect of reverberation to practical automatic speech recognition
Authors: Hyeonuk Nam, Yong-Hwa Park
Journal: Applied Acoustics (impact factor 3.4; JCR Q1, Acoustics)
Publication date: 2024-08-20
Publication type: Journal Article
DOI: 10.1016/j.apacoust.2024.110233
URL: https://www.sciencedirect.com/science/article/pii/S0003682X24003840
Citations: 0
Abstract
Reverberation is one of the most critical obstacles to adopting automatic speech recognition (ASR) in real-life environments. A comprehensive understanding of the effect of reverberation on ASR is therefore required to design robust ASR systems for practical use. To deepen this understanding, we performed a phonemic analysis of a commercial ASR system. The analysis introduces a new metric, mean phoneme coherence (MPC), defined as the time–frequency-averaged coherence function between the clean and reverberated speech spectrograms of each phoneme. MPC measures the amount of spectral contamination on a phoneme under a given reverberation condition, and thus quantifies not only the amount of reverberation on the phoneme but also the contextual influence on the phoneme within a sentence spoken in that condition. Comparing MPC with word error rate (WER) in real reverberation conditions showed that MPC represents both the amount of reverberation and the intelligibility of speech under a given reverberation condition. Furthermore, the relationship between phoneme groups’ vulnerability to spectral contamination and ASR performance under reverberation is analyzed by comparing the median of each phoneme group’s MPC distribution with the phoneme group word accuracy (PGWA). The two quantities are only weakly correlated, which indicates that reverberation affects the intelligibility of phonemes differently. In addition, a comparative study among phoneme groups shows that nasals and semivowels have the least robust ASR performance under reverberation, while nasals and stops are the most vulnerable to spectral contamination. The results and discussion indicate what should be taken into account to make ASR robust to reverberation.
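The abstract defines MPC only in words, so the following is a minimal sketch of one way a time–frequency-averaged coherence could be computed, not the authors' implementation. Everything specific here is an assumption: the function names, the STFT frame length and hop, the use of magnitude-squared coherence, the five-frame moving average used to estimate the spectra, and the phoneme boundaries (given as sample indices, for example from a forced aligner). The synthetic noise and decaying impulse response merely stand in for real speech and a real room.

```python
import numpy as np

FRAME_LEN = 256  # assumed STFT frame length in samples
HOP = 128        # assumed hop size in samples


def stft(x):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[i * HOP : i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def smooth_over_time(s, width):
    """Moving average over `width` neighbouring frames in each frequency bin."""
    kernel = np.ones(width) / width
    return np.stack([np.convolve(s[:, k], kernel, mode="same")
                     for k in range(s.shape[1])], axis=1)


def tf_coherence(clean, reverb, width=5):
    """Magnitude-squared coherence per time-frequency bin, in [0, 1].

    Spectra are averaged over `width` frames (an assumed choice); without
    any averaging the estimate would be 1 in every bin.
    """
    X, Y = stft(clean), stft(reverb)
    s_xy = smooth_over_time(X * np.conj(Y), width)
    s_xx = smooth_over_time(np.abs(X) ** 2, width)
    s_yy = smooth_over_time(np.abs(Y) ** 2, width)
    return np.abs(s_xy) ** 2 / (s_xx * s_yy + 1e-12)


def mean_phoneme_coherence(clean, reverb, start, end):
    """Average the coherence over all bins of frames centred in [start, end)."""
    coh = tf_coherence(clean, reverb)
    centres = np.arange(coh.shape[0]) * HOP + FRAME_LEN // 2
    in_phoneme = (centres >= start) & (centres < end)
    if not in_phoneme.any():
        raise ValueError("phoneme segment contains no STFT frame centre")
    return float(coh[in_phoneme].mean())


# Synthetic stand-ins: white noise as "speech", a decaying-noise impulse response.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
tail = 0.3 * rng.standard_normal(2000) * np.exp(-np.arange(2000) / 500.0)
rir = np.concatenate(([1.0], tail))
reverb = np.convolve(clean, rir)[: len(clean)]

mpc_clean = mean_phoneme_coherence(clean, clean, 4000, 8000)
mpc_reverb = mean_phoneme_coherence(clean, reverb, 4000, 8000)
print(mpc_clean, mpc_reverb)
```

In this sketch a signal compared with itself gives a coherence close to 1, and the reverberated copy scores lower. That matches the abstract's reading of MPC as a measure of spectral contamination: lower coherence means more contamination. Because the reverberant tail of preceding sounds leaks into the analysed segment, the value also reflects the surrounding context and not only the phoneme itself.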
About the journal:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data.
Applied Acoustics encourages the exchange of practical experience in the following ways:
• Complete Papers
• Short Technical Notes
• Review Articles
and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.