SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing

Taiqi He, Lindia Tjuatja, Nathaniel R. Robinson, Shinji Watanabe, David R. Mortensen, Graham Neubig, Lori Levin
{"title":"SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing","authors":"Taiqi He, Lindia Tjuatja, Nathaniel R. Robinson, Shinji Watanabe, David R. Mortensen, Graham Neubig, L. Levin","doi":"10.18653/v1/2023.sigmorphon-1.22","DOIUrl":null,"url":null,"abstract":"In our submission to the SIGMORPHON 2023 Shared Task on interlinear glossing (IGT), we explore approaches to data augmentation and modeling across seven low-resource languages. For data augmentation, we explore two approaches: creating artificial data from the provided training data and utilizing existing IGT resources in other languages. On the modeling side, we test an enhanced version of the provided token classification baseline as well as a pretrained multilingual seq2seq model. Additionally, we apply post-correction using a dictionary for Gitksan, the language with the smallest amount of data. We find that our token classification models are the best performing, with the highest word-level accuracy for Arapaho and highest morpheme-level accuracy for Gitksan out of all submissions. We also show that data augmentation is an effective strategy, though applying artificial data pretraining has very different effects across both models tested.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"69 21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Special Interest Group on Computational Morphology and Phonology Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2023.sigmorphon-1.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

In our submission to the SIGMORPHON 2023 Shared Task on interlinear glossing (IGT), we explore approaches to data augmentation and modeling across seven low-resource languages. For data augmentation, we explore two approaches: creating artificial data from the provided training data and utilizing existing IGT resources in other languages. On the modeling side, we test an enhanced version of the provided token classification baseline as well as a pretrained multilingual seq2seq model. Additionally, we apply post-correction using a dictionary for Gitksan, the language with the smallest amount of data. We find that our token classification models are the best performing, with the highest word-level accuracy for Arapaho and highest morpheme-level accuracy for Gitksan out of all submissions. We also show that data augmentation is an effective strategy, though applying artificial data pretraining has very different effects across both models tested.
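The dictionary-based post-correction mentioned for Gitksan could plausibly work along the following lines. This is a minimal illustrative sketch, not the authors' implementation: the function name `correct_glosses`, the dictionary format (morpheme → set of attested glosses), and the toy entries are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of dictionary-based post-correction of predicted glosses.
# Assumption: we have a morpheme-to-gloss dictionary; when a morpheme has a
# single attested gloss and the model predicts something else, we override it.

def correct_glosses(morphemes, predicted_glosses, dictionary):
    """Return a corrected gloss line, trusting unambiguous dictionary entries."""
    corrected = []
    for morpheme, gloss in zip(morphemes, predicted_glosses):
        candidates = dictionary.get(morpheme, set())
        if len(candidates) == 1 and gloss not in candidates:
            corrected.append(next(iter(candidates)))  # override with dictionary gloss
        else:
            corrected.append(gloss)                   # keep the model's prediction
    return corrected

# Toy usage with made-up entries (not real Gitksan data):
dictionary = {"hlaa": {"INCEP"}, "dim": {"FUT"}}
print(correct_glosses(["hlaa", "dim", "wil"], ["PROSP", "FUT", "do"], dictionary))
# -> ['INCEP', 'FUT', 'do']
```

In a low-resource setting like Gitksan, such a deterministic override is attractive because the dictionary is likely more reliable than a model trained on very little data.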