Transmorph: a transformer-based morphological disambiguator for Turkish

IF 1.5; Region 4, Computer Science; JCR Q4, Computer Science, Artificial Intelligence

Turkish Journal of Electrical Engineering and Computer Sciences. Pub Date: 2022-01-01. DOI: 10.55730/1300-0632.3912

Hilal Özer, E. E. Korkmaz

Volume 30, pages 1897-1913

Citations: 1


Transmorph: a transformer based morphological disambiguator for Turkish

Turkish is an agglutinative language with a complex morphological structure, and a given word generally has more than one possible parse. Morphological disambiguation is therefore required before further processing, to determine the correct morphological analysis of each word. It is one of the first and most crucial steps in natural language processing, since its success determines the quality of later analyses. In our proposed morphological disambiguation method, we use a transformer-based sequence-to-sequence neural network architecture. Transformers are commonly used in various NLP tasks and produce state-of-the-art results in machine translation. However, to the best of our knowledge, transformer-based encoder-decoders have not been studied for morphological disambiguation. In this study, in addition to character-level tokenization, three subword input representations are evaluated: unigram, byte-pair, and WordPiece tokenization. The best accuracy, 96.25%, is achieved with the character-level input representation. Although the proposed model was developed for Turkish, it is not language-dependent and can be applied to a larger set of languages.
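To make the character-level input representation concrete, the following is a minimal Python sketch, not the authors' implementation. The special symbols (`<pad>`, `<s>`, `</s>`, `<sp>`), the helper names, and the tag notation of the example parses are assumptions chosen for illustration; the abstract specifies none of them. The sketch splits a sentence into characters, maps them to integer ids as a sequence-to-sequence encoder would expect, and computes word-level accuracy as the fraction of words whose predicted analysis matches the gold analysis.

```python
from typing import Dict, List

# Hypothetical special symbols; the abstract does not describe the vocabulary.
PAD, BOS, EOS, SPACE = "<pad>", "<s>", "</s>", "<sp>"


def char_tokenize(sentence: str) -> List[str]:
    """Split a sentence into characters, marking word boundaries explicitly."""
    tokens = [BOS]
    for i, word in enumerate(sentence.split()):
        if i > 0:
            tokens.append(SPACE)
        tokens.extend(word)  # one token per Unicode character
    tokens.append(EOS)
    return tokens


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign an integer id to every symbol seen in the corpus."""
    vocab = {PAD: 0, BOS: 1, EOS: 2, SPACE: 3}
    for sentence in sentences:
        for tok in char_tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab


def word_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of words whose predicted analysis equals the gold analysis."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Illustrative ambiguity: "evin" can mean "your house" (possessive)
# or "of the house" (genitive). The tag notation is an assumption.
candidate_parses = {
    "evin": ["ev+Noun+A3sg+P2sg+Nom", "ev+Noun+A3sg+Pnon+Gen"],
}

tokens = char_tokenize("evin kapısı")
vocab = build_vocab(["evin kapısı"])
ids = [vocab[t] for t in tokens]
```

Character-level input keeps the dotted "i" and dotless "ı" as distinct symbols and needs no learned subword vocabulary, at the cost of longer input sequences than unigram, byte-pair, or WordPiece tokenization.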


Source journal

Turkish Journal of Electrical Engineering and Computer Sciences (Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic)

CiteScore: 2.90

Self-citation rate: 9.10%

Review time: 6.9 months

Journal description: The Turkish Journal of Electrical Engineering & Computer Sciences is published electronically 6 times a year by the Scientific and Technological Research Council of Turkey (TÜBİTAK). It accepts English-language manuscripts in the areas of power and energy, environmental sustainability and energy efficiency, electronics, industry applications, control systems, information and systems, applied electromagnetics, communications, signal and image processing, tomographic image reconstruction, face recognition, biometrics, speech processing, video processing and analysis, object recognition, classification, feature extraction, parallel and distributed computing, cognitive systems, interaction, robotics, digital libraries and content, personalized healthcare, ICT for mobility, sensors, and artificial intelligence. Contribution is open to researchers of all nationalities.