Analysis of a Brazilian Indigenous corpus using machine learning methods

Anais do XVIII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2021) | Pub Date: 2021-11-29 | DOI: 10.5753/eniac.2021.18246

T. Lima, André C. A. Nascimento, P. Miranda, R. F. Mello


Citations: 1

Abstract

In Brazil, several minority languages are at serious risk of extinction. Appropriate documentation of these languages is a fundamental step toward preventing that. However, for some of them, only a small amount of text is digitally accessible. Meanwhile, there are many open issues in the identification of indigenous languages; addressing them may help reveal key similarities among these languages and connect related languages and dialects. Therefore, this paper proposes to study and automatically classify 26 neglected Brazilian native languages using a small amount of training data, in both supervised and unsupervised settings. Our findings indicate that applying machine learning models to the analysis of Brazilian Indigenous corpora is very promising, and we hope this work encourages more research on this topic in the coming years.
