An Effective Hierarchical Graph Attention Network Modeling Approach for Pronunciation Assessment

IF 5.1 | Zone 2 (Computer Science) | JCR Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-08-26 DOI:10.1109/TASLP.2024.3449111

Bi-Cheng Yan;Berlin Chen

Volume 32, pages 3974-3985. Journal Article. https://ieeexplore.ieee.org/document/10648884/

Citations: 0

Abstract

Automatic pronunciation assessment (APA) aims to quantify second language (L2) learners’ pronunciation proficiency in a target language by providing fine-grained feedback with multiple aspect scores (e.g., accuracy, fluency, and completeness) at various linguistic levels (i.e., phone, word, and utterance). Most existing efforts follow a parallel modeling framework, which takes a sequence of phone-level pronunciation feature embeddings of a learner's utterance as input and then predicts multiple aspect scores across various linguistic levels. However, these approaches neither take the hierarchy of linguistic units into account nor explicitly consider the relatedness among the pronunciation aspects. In light of this, we put forward an effective modeling approach for APA, termed HierGAT, which is grounded on a hierarchical graph attention network. Our approach models the input utterance hierarchically as a heterogeneous graph that contains linguistic nodes at various levels of granularity. Building on a carefully designed hierarchical graph message passing mechanism, it captures intricate interdependencies within and across linguistic levels and factors in the language hierarchy of an utterance. Furthermore, we design a novel aspect attention module to encode relatedness among aspects. To our knowledge, we are the first to introduce multiple types of linguistic nodes into graph-based neural networks for APA and to perform a comprehensive qualitative analysis of their merits. A series of experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.
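To make the idea of a heterogeneous utterance graph more concrete, the following is a minimal sketch, not the authors' HierGAT implementation. It assumes a three-level hierarchy (phone, word, utterance) and substitutes plain dot-product attention with no learned parameters for the paper's graph attention layers. It shows only a bottom-up pass, whereas the abstract describes message passing both within and across levels. All function names (`build_hierarchy`, `attend`, `bottom_up_pass`), the feature vectors, and the example phone labels are hypothetical.

```python
import math


def build_hierarchy(words):
    """Build a phone/word/utterance graph from [(word, [phones]), ...].

    Returns (nodes, edges): nodes are (level, label) tuples and edges are
    (child_index, parent_index) pairs. Node 0 is the utterance node.
    """
    nodes = [("utterance", "<utt>")]
    edges = []
    for word, phones in words:
        word_idx = len(nodes)
        nodes.append(("word", word))
        edges.append((word_idx, 0))  # word -> utterance
        for phone in phones:
            nodes.append(("phone", phone))
            edges.append((len(nodes) - 1, word_idx))  # phone -> word
    return nodes, edges


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention: softmax(query . key) weighted mean of keys.

    Assumption: a real graph attention layer would use learned projections.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [
        sum(w * key[d] for w, key in zip(weights, keys))
        for d in range(len(query))
    ]


def bottom_up_pass(nodes, edges, feats):
    """Update parents from children: phones -> words, then words -> utterance."""
    out = [list(f) for f in feats]
    for level in ("phone", "word"):
        children = {}
        for child, parent in edges:
            if nodes[child][0] == level:
                children.setdefault(parent, []).append(out[child])
        for parent, kids in children.items():
            out[parent] = attend(out[parent], kids)
    return out


# Hypothetical two-word utterance with placeholder 2-dimensional features.
words = [("good", ["G", "UH", "D"]), ("day", ["D", "EY"])]
nodes, edges = build_hierarchy(words)
feats = [[1.0, 0.0] for _ in nodes]
updated = bottom_up_pass(nodes, edges, feats)
```

In this sketch each word node is replaced by an attention-weighted summary of its phones before the utterance node attends over the words, which is one simple way to let lower-level information flow up the hierarchy.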


Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

Self-citation rate

11.10%

Articles published

217

About the journal: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.