学习识别阿拉伯语和德语方言使用多核

Workshop on NLP for Similar Languages, Varieties and Dialects Pub Date : 2017-04-01 DOI:10.18653/v1/W17-1225

Radu Tudor Ionescu, Andrei M. Butnaru

{"title":"学习识别阿拉伯语和德语方言使用多核","authors":"Radu Tudor Ionescu, Andrei M. Butnaru","doi":"10.18653/v1/W17-1225","DOIUrl":null,"url":null,"abstract":"We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks prove that it achieves very good results. Indeed, we ranked on the first place in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and on the fifth place in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).","PeriodicalId":167439,"journal":{"name":"Workshop on NLP for Similar Languages, Varieties and Dialects","volume":"153 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":"{\"title\":\"Learning to Identify Arabic and German Dialects using Multiple Kernels\",\"authors\":\"Radu Tudor Ionescu, Andrei M. Butnaru\",\"doi\":\"10.18653/v1/W17-1225\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks prove that it achieves very good results. Indeed, we ranked on the first place in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and on the fifth place in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).\",\"PeriodicalId\":167439,\"journal\":{\"name\":\"Workshop on NLP for Similar Languages, Varieties and Dialects\",\"volume\":\"153 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"38\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on NLP for Similar Languages, Varieties and Dialects\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/W17-1225\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on NLP for Similar Languages, Varieties and Dialects","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/W17-1225","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 38

摘要

我们提出了一种机器学习方法，用于DSL 2017挑战赛的阿拉伯语方言识别(ADI)和德语方言识别(GDI)封闭共享任务。提出的方法使用多核学习将多个核结合起来。虽然我们的大多数核都是基于从语音记录中提取的字符p-grams(也称为n-grams)，但我们也使用基于i-vectors的核，这是一种音频记录的低维表示，仅为阿拉伯语数据提供。在学习阶段，我们独立使用核判别分析(KDA)和核岭回归(KRR)。我们的方法是肤浅和简单的，但在共享任务中获得的经验结果证明，它取得了很好的效果。事实上，我们在ADI共享任务中排名第一，F1加权得分为76.32%(比第二名高4.62%)，在GDI共享任务中排名第五，F1加权得分为63.67%(比第一名低2.57%)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning to Identify Arabic and German Dialects using Multiple Kernels

We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks prove that it achieves very good results. Indeed, we ranked on the first place in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and on the fifth place in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Workshop on NLP for Similar Languages, Varieties and Dialects

自引率

0.00%

发文量