DARIJA-C: towards a Moroccan DARIJA Speech recognition and speech-to-text Translation Corpus
Maria Labied, A. Belangour, M. Banane
2023 1st International Conference on Advanced Innovations in Smart Cities (ICAISC), published 2023-01-23
DOI: 10.1109/ICAISC56366.2023.10085164
This paper introduces DARIJA-C, an automatically collected speech corpus for the Moroccan Arabic dialect “Darija”, intended for speech-to-text translation from Moroccan Darija into classical Arabic. Although the corpus is designed primarily for automatic speech-to-text translation, it can also be useful for automatic speech recognition of this dialect. To address both scale and sustainability, the DARIJA-C project uses crowdsourcing to collect and validate speech transcriptions and translations, providing an automatic web platform on which distinct, unknown speakers record speech along with its corresponding translation. The first versions of the DARIJA-C dataset will include only translations of Moroccan Darija speech into classical Arabic; later versions will add translations into other international languages such as French and English. The goal of this work is to build the largest crowdsourced corpus of Darija speech, which to our knowledge will be the first corpus for Moroccan Darija speech-to-text translation.