Improve Quora Question Pair Dataset for Question Similarity Task
H. T. Le, Dung T. Cao, Trung Bui, Long T. Luong, Huy-Quang Nguyen
2021 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1-5, 2021-08-19
DOI: 10.1109/RIVF51545.2021.9642071
Citations: 2
Abstract
Automatic detection of semantically equivalent questions is a task of the utmost importance in a question answering system. The Quora dataset, released for the Quora Question Pairs competition organized by Kaggle, has been used in many studies to train systems that identify duplicate questions. However, the ground truth labels in this dataset are not fully accurate: some question pairs are labeled incorrectly. In this paper, we concentrate on improving the quality of the Quora dataset by combining several strategies: BERT-based methods, rules, and human relabeling.
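The abstract does not say how the three strategies are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' pipeline. It assumes a rule that relabels pairs that are identical after normalization, and a scorer whose confident disagreement with the original label sends a pair to human review. The names `Pair`, `normalize`, `jaccard_score`, and `triage`, and the thresholds `low` and `high`, are hypothetical. A simple token-overlap score stands in for a fine-tuned BERT classifier so that the example runs without external dependencies.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    q1: str
    q2: str
    label: int  # original dataset label: 1 = duplicate, 0 = not duplicate


def normalize(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return " ".join(kept.split())


def jaccard_score(q1: str, q2: str) -> float:
    """Stand-in scorer (assumption): token overlap in [0, 1].

    In a BERT-based setup this would be replaced by the duplicate
    probability predicted by a fine-tuned classifier.
    """
    a, b = set(normalize(q1).split()), set(normalize(q2).split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def triage(pair: Pair, scorer=jaccard_score, low: float = 0.2, high: float = 0.8):
    """Return (action, label), where action is 'keep', 'relabel', or 'review'."""
    # Hypothetical rule: questions identical after normalization are duplicates.
    if normalize(pair.q1) == normalize(pair.q2):
        return ("keep", 1) if pair.label == 1 else ("relabel", 1)

    score = scorer(pair.q1, pair.q2)
    # The scorer confidently disagrees with the label: defer to a human annotator.
    if (pair.label == 1 and score <= low) or (pair.label == 0 and score >= high):
        return ("review", pair.label)

    # Otherwise the original label is left unchanged.
    return ("keep", pair.label)
```

For instance, `triage(Pair("How do I learn Python?", "how do i learn python", 0))` is relabeled by the rule, while a pair marked duplicate whose questions share no tokens is queued for review. The thresholds are illustrative and would need tuning against whichever scorer is actually used.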