SpeechColab leaderboard: An open-source platform for automatic speech recognition evaluation

Jiayu Du, Jinpeng Li, Guoguo Chen, Wei-Qiang Zhang

DOI: 10.1016/j.csl.2025.101805
Journal: Computer Speech and Language, Volume 94, Article 101805 (JCR Q2, Computer Science, Artificial Intelligence; Impact Factor 3.1)
Published: 2025-04-25
Full text: https://www.sciencedirect.com/science/article/pii/S0885230825000300
Code: https://github.com/SpeechColab/Leaderboard
Citations: 0
Abstract
In the wake of the surging tide of deep learning over the past decade, Automatic Speech Recognition (ASR) has garnered substantial attention, leading to the emergence of numerous publicly accessible ASR systems that are actively being integrated into our daily lives. Nonetheless, impartial and replicable evaluation of these ASR systems is challenging due to various subtleties. In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. With this platform: (i) We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems, covering both open-source models and industrial commercial services. (ii) We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes, including capitalization, punctuation, interjections, contractions, synonym usage, compound words, etc. These issues have gained prominence in the context of the transition towards End-to-End ASR systems. (iii) We propose and discuss a modification to the conventional Token-Error-Rate (TER) metric, called modified-TER (mTER), inspired by Kolmogorov Complexity and Normalized Information Distance (NID). The proposed metric is normalized and symmetric (with regard to reference and hypothesis). A large-scale empirical study is then presented comparing TER and mTER. The SpeechColab Leaderboard is accessible at https://github.com/SpeechColab/Leaderboard.
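To make the TER/mTER contrast concrete, here is a minimal sketch. The paper's exact mTER definition is not reproduced in this abstract, so the `mter` function below is an assumption: by analogy with Normalized Information Distance, it normalizes the token edit distance by the length of the longer sequence, which makes the score symmetric in reference and hypothesis and bounded in [0, 1], unlike conventional TER.

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n]

def ter(ref, hyp):
    """Conventional Token-Error-Rate: edits divided by reference length.
    Asymmetric, and can exceed 1.0 for long hypotheses."""
    return edit_distance(ref, hyp) / len(ref)

def mter(ref, hyp):
    """Assumed NID-style variant (hypothetical sketch, not the paper's
    exact formula): normalize by the longer of the two sequences."""
    return edit_distance(ref, hyp) / max(len(ref), len(hyp))

ref = "the quick brown fox".split()
hyp = "the quick brown fox jumps over".split()
print(ter(ref, hyp))                      # 0.5 (2 insertions / 4 ref tokens)
print(mter(ref, hyp) == mter(hyp, ref))   # True: symmetric by construction
```

The key design point is the denominator: dividing by `max(len(ref), len(hyp))` rather than `len(ref)` is what yields the normalization and symmetry properties the abstract attributes to mTER.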
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.