A method of phonemic annotation for Chinese dialects based on a deep learning model with adaptive temporal attention and a feature disentangling structure

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-02-05 DOI:10.1016/j.csl.2024.101624

Bowen Jiang , Qianhui Dong , Guojin Liu

{"title":"A method of phonemic annotation for Chinese dialects based on a deep learning model with adaptive temporal attention and a feature disentangling structure","authors":"Bowen Jiang , Qianhui Dong , Guojin Liu","doi":"10.1016/j.csl.2024.101624","DOIUrl":null,"url":null,"abstract":"<div><p>Phonemic annotation is aimed at annotating a speech fragment with phonemic symbols. As the phonetic features of a speech fragment vary greatly among different languages including their dialects, it is a significant way to describe and write down the phonetic system of a language utilizing phonemic symbols. It is meaningful to develop an automatic and effective method for this task. In this paper, we first establish a Chinese dataset where each datum consists of an original speech signal and the corresponding phonemic characters which are annotated manually. Furthermore, we propose a deep learning model to realize automatic phonemic annotation for speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism is applied to communicate the encoder and decoder modules to prevent any loss of features adaptively. Meanwhile, a feature disentangling structure based on a generative adversarial network (GAN) is adopted to attenuate the interference towards the phonemic annotation task caused by unrelated tone features in the original speech signal and further improve the phonemic annotation performance. Extensive experimental results have verified the superiority of our model and proposed strategies over the utilized dataset.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"86 ","pages":"Article 101624"},"PeriodicalIF":3.1000,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S088523082400007X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Phonemic annotation is aimed at annotating a speech fragment with phonemic symbols. As the phonetic features of a speech fragment vary greatly among different languages including their dialects, it is a significant way to describe and write down the phonetic system of a language utilizing phonemic symbols. It is meaningful to develop an automatic and effective method for this task. In this paper, we first establish a Chinese dataset where each datum consists of an original speech signal and the corresponding phonemic characters which are annotated manually. Furthermore, we propose a deep learning model to realize automatic phonemic annotation for speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism is applied to communicate the encoder and decoder modules to prevent any loss of features adaptively. Meanwhile, a feature disentangling structure based on a generative adversarial network (GAN) is adopted to attenuate the interference towards the phonemic annotation task caused by unrelated tone features in the original speech signal and further improve the phonemic annotation performance. Extensive experimental results have verified the superiority of our model and proposed strategies over the utilized dataset.

查看原文本刊更多论文

基于具有自适应时空注意力的深度学习模型和特征分解结构的汉语方言音位标注方法

音位标注的目的是用音位符号标注语音片段。由于不同语言（包括方言）语音片段的语音特征差异很大，因此利用音位符号来描述和记录一种语言的语音系统是一种重要的方法。为音位标注任务开发一种自动有效的方法很有意义。在本文中，我们首先利用了一个中文数据集，其中每个数据都由原始语音信号和相应的人工标注音位字符组成。此外，我们还提出了一种深度学习模型，以实现对不同汉语方言语音片段的音位标注。该模型的整体结构是一个多对多的深度双向门控递归单元（GRU）网络，并采用自适应时间注意力机制来沟通编码器和解码器模块，以自适应地防止任何特征丢失。同时，还采用了基于生成对抗网络（GAN）的特征分离结构，以减弱原始语音片段中不相关的音调特征对音位标注任务的干扰，进一步提高音位标注性能。广泛的实验结果验证了我们的模型和建议的策略在所使用的数据集上的优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.