Employing a Seq2Seq Model for Spelling Correction in Albanian Language
Authors: Evis Trandafili, Alba Haveriku, Antea Bendo
Published in: 2022 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2022-09-22
DOI: 10.23919/softcom55329.2022.9911495 (https://doi.org/10.23919/softcom55329.2022.9911495)
In this paper we present a model that detects and corrects spelling mistakes in the Albanian language. Most of the available literature discusses the implementation and optimization of spell checkers for English. Unfortunately, published work on processing Albanian remains scarce. We explain the process of implementing a spelling corrector for Albanian, from dataset creation to the presentation of results. The proposed model is based on the Sequence to Sequence (Seq2Seq) model with Bahdanau attention. Because public datasets in Albanian are lacking, we created a dataset of 958,116 sentences collected from electronic books, Wikipedia articles and various legal documents in Albanian. We experimented with the hyperparameter values of our neural network to find the settings that gave the best results. We propose that enriching the initial dataset, not only in size but also by linking it with other tools such as POS (Part of Speech) tagging, can achieve a higher level of accuracy.
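To make the Bahdanau attention step mentioned in the abstract more concrete, the following is a minimal sketch of additive attention for a single decoder step, written in plain Python. It is not the authors' implementation: the abstract does not state the framework, layer sizes, or weights, so the function names (`additive_attention`, `matvec`, `softmax`), the 2-dimensional vectors, and the identity projection matrices are all illustrative assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Additive (Bahdanau-style) attention for one decoder step.

    score_i = v . tanh(W_dec @ decoder_state + W_enc @ encoder_states[i])
    Returns the attention weights and the weighted context vector.
    """
    projected_dec = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_dec, projected_enc)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy values (assumptions for illustration only, not taken from the paper).
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input character
decoder_state = [0.5, -0.5]
v = [1.0, 1.0]

weights, context = additive_attention(
    decoder_state, encoder_states, identity, identity, v
)
print("attention weights:", weights)
print("context vector:", context)
```

In a character-level spelling corrector, each encoder state would typically correspond to one input character, and the context vector would be fed to the decoder when it predicts the next corrected character. In a trained model the matrices `W_dec`, `W_enc` and the vector `v` are learned parameters rather than fixed values.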