Linguistically informed automatic speech recognition in Sanskrit

IF 3.4 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language | Pub Date: 2025-07-12 | DOI: 10.1016/j.csl.2025.101861

Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal, Preethi Jyothi

Computer Speech and Language, volume 95, Article 101861. Journal Article. https://www.sciencedirect.com/science/article/pii/S0885230825000865

Citations: 0

Abstract

Automatic Speech Recognition (ASR) for Sanskrit poses distinctive challenges, primarily due to the language’s intricate linguistic and morphological characteristics. Recognizing the growing interest in this domain, we present the ‘Vāksañcayah’ speech corpus, a comprehensive collection that captures the linguistic depth and complexities of Sanskrit. Building upon our prior work, which focused on various acoustic model (AM) and language model (LM) units, we present an enhanced ASR system that integrates innovative subword tokenization methods and enriches the search space with linguistic insights. To address the high out-of-vocabulary (OOV) rates and the prevalence of infrequently used words in Sanskrit, we employ a subword-based language model, which mitigates these challenges and facilitates the generation of a subword-based search space. While effective in numerous scenarios, this model is limited in handling long-range dependencies and semantic context. To counter these limitations, we leverage Sanskrit’s rich morphological framework. The subword-based search space is transformed into a word-based format and augmented with morphological and lexical data derived from a lexically driven shallow parser. We then rescore transitions within this enriched space using a supervised morphological parser specifically designed for Sanskrit. Our proposed methodology achieves state-of-the-art results in Sanskrit ASR, with a Word Error Rate (WER) of 12.54, an improvement of 3.77 absolute points over the previous best. Additionally, we annotated 500 utterances with detailed morphological data and their corresponding lemmas, providing a basis for extensive linguistic analysis.
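For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate` and the sample strings are hypothetical, and the arithmetic at the end assumes the reported 12.54 and 3.77 are percentage points, which implies a previous best of 16.31.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical toy strings: one substitution and one deletion over four words.
toy_wer = word_error_rate("a b c d", "a x c")

# Arithmetic implied by the abstract (assumption: values are percentage points).
reported_wer = 12.54
absolute_gain = 3.77
previous_best = round(reported_wer + absolute_gain, 2)
relative_reduction = absolute_gain / previous_best
```

Under these assumptions, the 3.77-point absolute gain corresponds to a relative error reduction of roughly 23% against the implied previous best.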

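Source journal

Computer Speech and Language (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Articles published:

Review time: 22.9 weeks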

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.