Machine Learning Based Methods for Arabic Duplicate Question Detection
Mohamed Zouitni, Alami Hamza, Said Lafkiar, Nabil Burmani, Mohammed Taleb, Noureddine En-Nahnahi
2022 International Conference on Intelligent Systems and Computer Vision (ISCV), 2022-05-18
DOI: 10.1109/ISCV54655.2022.9806071
Abstract
Incorporating a duplicate question detection system can benefit applications such as community forums and question answering systems. Detecting questions that already have an answer improves the user experience by reducing search time and returning the correct answer. In this paper, we build several machine learning methods for Arabic duplicate question detection. First, a pre-processing step cleans and normalizes the questions. Next, we use Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and FastText to map questions from their textual form into a vector space. Then, we train several shallow learning methods (SVM, XGBoost, Random Forest, Logistic Regression) and deep learning methods (CNN, RNN, LSTM, GRU) to detect whether a pair of questions is a duplicate. Various experiments were conducted to evaluate the performance of our models. The results show that the GRU-based deep learning model with FastText representations outperformed the other models.
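The first two stages described above (normalization, then TF-IDF vectorization) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the normalization rules, the smoothed IDF formula, the function names, and the example questions are all assumptions made for illustration, and a plain cosine similarity stands in for the trained classifiers, which the abstract does not specify in detail.

```python
import math
import re
from collections import Counter

# Assumed normalization rules; the abstract does not list the exact steps used.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # short-vowel marks and tatweel
ALEF_FORMS = re.compile(r"[\u0622\u0623\u0625]")   # alef with madda / hamza variants


def normalize(text):
    """Strip diacritics, unify alef variants, drop punctuation, and tokenize."""
    text = DIACRITICS.sub("", text)
    text = ALEF_FORMS.sub("\u0627", text)   # map variants to bare alef
    text = re.sub(r"[^\w\s]", " ", text)    # removes e.g. the Arabic question mark
    return text.split()


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per tokenized question."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF (an assumption): ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example questions (not taken from the paper's dataset).
questions = [
    "ما هي عاصمةُ المغرب؟",   # "What is the capital of Morocco?" (with a diacritic)
    "ما هي عاصمة المغرب",     # same question, no diacritic or question mark
    "كيف أتعلم البرمجة؟",     # "How do I learn programming?"
]
tokens = [normalize(q) for q in questions]
vectors = tfidf_vectors(tokens)
sim_duplicate = cosine(vectors[0], vectors[1])
sim_different = cosine(vectors[0], vectors[2])
```

In a setup closer to the one the abstract describes, the pair of vectors (or features derived from them, such as this similarity score) would be passed to a trained classifier rather than compared directly.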