Extractive Question Answering for Kazakh Language
Magzhan Shymbayev, Yermek Alimzhanov
2023 IEEE International Conference on Smart Information Systems and Technologies (SIST), 2023-05-04
DOI: 10.1109/SIST58284.2023.10223508 (https://doi.org/10.1109/SIST58284.2023.10223508)
Citations: 0
Abstract
This article presents the development of an extractive question answering system for the Kazakh language, based on a BERT-like model. Training such a system requires a large dataset, on the order of tens of thousands of annotated question-answer pairs. Datasets of this size are unavailable for most languages, including Kazakh. To address this gap, the Kazakh Question Answering Dataset (KazQA) is introduced; it is derived from the Stanford Question Answering Dataset (SQuAD) by machine translation with the Google Cloud Translation API. Two large pretrained contextual language models, ALBERT and multilingual BERT, serve as baselines and are compared with KazBERT, a newly trained monolingual Kazakh model. The results demonstrate that the proposed approach can effectively produce question answering systems for the low-resource Kazakh language.
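Translating an extractive dataset raises one practical issue: each answer is stored as a character offset into its context, and that offset is no longer valid after translation. The abstract does not say how KazQA handles this, so the following is only a minimal sketch of one common strategy, assuming the context, question, and answer are translated separately and the answer is then searched for in the translated context. The function name `translate_squad_example`, the flat input keys, and the `translate` callable (a stand-in for a real machine translation call) are all hypothetical.

```python
from typing import Callable, Optional


def translate_squad_example(
    example: dict, translate: Callable[[str], str]
) -> Optional[dict]:
    """Translate one SQuAD-style record and re-locate its answer span.

    `translate` is a hypothetical stand-in for a machine translation call;
    it is NOT the Google Cloud Translation client API.
    Returns None when the translated answer does not appear verbatim in the
    translated context, so the caller can drop that example.
    """
    context = translate(example["context"])
    question = translate(example["question"])
    answer = translate(example["answer_text"])

    # Extractive QA needs a character offset into the context, so the span
    # is searched for again after translation (an assumed, simple strategy).
    start = context.find(answer)
    if start == -1:
        return None

    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }
```

Passing the translator in as a callable keeps the sketch runnable without network access: any function from string to string can be used for a dry run. Examples for which it returns `None` would have to be discarded or aligned by some other method.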