LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech
{"title":"LeBenchmark 2.0:法语语音自监督表征的标准化、可复制和增强型框架","authors":"Titouan Parcollet , Ha Nguyen , Solène Evain , Marcely Zanon Boito , Adrien Pupier , Salima Mdhaffar , Hang Le , Sina Alisamir , Natalia Tomashenko , Marco Dinarelli , Shucong Zhang , Alexandre Allauzen , Maximin Coavoux , Yannick Estève , Mickael Rouvier , Jerôme Goulian , Benjamin Lecouteux , François Portet , Solange Rossato , Fabien Ringeval , Laurent Besacier","doi":"10.1016/j.csl.2024.101622","DOIUrl":null,"url":null,"abstract":"<div><p>Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces <em>LeBenchmark 2.0</em> an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 h of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. <em>LeBenchmark 2.0</em> also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous <em>LeBenchmark</em> SSL models across the benchmark but also required up to four times more energy for pre-training.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"86 ","pages":"Article 101622"},"PeriodicalIF":3.1000,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech\",\"authors\":\"Titouan Parcollet , Ha Nguyen , Solène Evain , Marcely Zanon Boito , Adrien Pupier , Salima Mdhaffar , Hang Le , Sina Alisamir , Natalia Tomashenko , Marco Dinarelli , Shucong Zhang , Alexandre Allauzen , Maximin Coavoux , Yannick Estève , Mickael Rouvier , Jerôme Goulian , Benjamin Lecouteux , François Portet , Solange Rossato , Fabien Ringeval , Laurent Besacier\",\"doi\":\"10.1016/j.csl.2024.101622\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces <em>LeBenchmark 2.0</em> an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 h of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. 
<em>LeBenchmark 2.0</em> also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous <em>LeBenchmark</em> SSL models across the benchmark but also required up to four times more energy for pre-training.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"86 \",\"pages\":\"Article 101622\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-02-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000056\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000056","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Titouan Parcollet, Ha Nguyen, Solène Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Estève, Mickael Rouvier, Jerôme Goulian, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Laurent Besacier
Self-supervised learning (SSL) has driven unprecedented improvements in many domains, including computer vision and natural language processing. Speech processing has benefited greatly from SSL, as most current tasks in the field are now approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 h of speech, ten pre-trained SSL wav2vec 2.0 models, ranging from 26 million to one billion learnable parameters, shared with the community, and an evaluation protocol comprising six downstream tasks that complement existing benchmarks. LeBenchmark 2.0 also offers unique perspectives on pre-trained SSL models for speech: an investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models, and a discussion of the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark, but also require up to four times more energy for pre-training.
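To make the frozen-versus-fine-tuned distinction concrete, the sketch below shows how one of the released wav2vec 2.0 checkpoints could be used as a frozen feature extractor through the Hugging Face transformers library. The checkpoint identifier is an assumption based on the project's public naming scheme, not a name given in the abstract.

    import numpy as np
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # Assumed checkpoint name following the LeBenchmark naming scheme on the
    # Hugging Face Hub; substitute any of the released models.
    model_id = "LeBenchmark/wav2vec2-FR-7K-large"

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = Wav2Vec2Model.from_pretrained(model_id)
    model.eval()  # frozen setting: the SSL encoder receives no gradient updates

    # One second of silence at 16 kHz standing in for real French speech.
    waveform = np.zeros(16000, dtype=np.float32)
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():  # keep the pre-trained encoder frozen
        features = model(**inputs).last_hidden_state

    # features has shape (batch, frames, hidden_dim) and would be fed to a
    # lightweight downstream head (e.g., for ASR or speaker identification).
    print(features.shape)

In the fine-tuned setting compared by the paper, the encoder would instead be updated jointly with the downstream head rather than wrapped in torch.no_grad().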
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.