ParSQuAD: Machine Translated SQuAD dataset for Persian Question Answering
Negin Abadani, Jamshid Mozafari, A. Fatemi, Mohammd Ali Nematbakhsh, A. Kazemi
2021 7th International Conference on Web Research (ICWR), published 2021-05-19
DOI: 10.1109/ICWR51868.2021.9443126 (https://doi.org/10.1109/ICWR51868.2021.9443126)
Citations: 12
Abstract
Recent advances in Question Answering (QA) have improved state-of-the-art results. Because rich English training datasets are available for this task, most reported results are for English. Persian, by contrast, lacks such datasets, so less research has been done on it and the existing results are hard to compare. In the present work, we introduce the Persian Question Answering Dataset (ParSQuAD), translated from the well-known SQuAD 2.0 dataset. Our dataset comes in two versions, depending on whether it has been manually or automatically corrected. The result is the first large-scale QA training resource for Persian. We train three baseline models, one of which achieves an F1 score of 56.66% and an exact match ratio of 52.86% on the test set with the first version, and an F1 score of 70.84% and an exact match ratio of 67.73% with the second version.
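The abstract reports an F1 score and an exact match ratio without saying how they are computed. The sketch below shows how these two metrics are commonly defined for SQuAD-style extractive QA: exact match compares the normalized answer strings, and F1 is the harmonic mean of token-level precision and recall. It is an illustration, not the authors' evaluation script. The function names and the whitespace-only `normalize` step are assumptions; a real evaluation would likely also handle punctuation and Persian-specific character normalization.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: trim and split on whitespace only. Real scripts usually
    # also strip punctuation and apply language-specific normalization.
    return text.strip().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # SQuAD 2.0 contains unanswerable questions (empty gold answer):
    # score 1.0 only if both sides are empty, otherwise 0.0.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level figures such as those quoted above are typically obtained by averaging these per-question scores over the test set, taking the maximum over the gold answers when a question has several.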