H. T. Le, Dung T. Cao, Trung Bui, Long T. Luong, Huy-Quang Nguyen
2021 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1-5. Published 2021-08-19. DOI: https://doi.org/10.1109/RIVF51545.2021.9642071
Improve Quora Question Pair Dataset for Question Similarity Task
Automatic detection of semantically equivalent questions is a core task in question answering systems. The Quora dataset, released for the Quora Question Pairs competition organized by Kaggle, is now used by many researchers to train systems that identify duplicate questions. However, the ground-truth labels in this dataset are not fully accurate, and some question pairs are labeled incorrectly. In this paper, we concentrate on improving the quality of the Quora dataset by combining several strategies based on BERT, rules, and human relabeling.
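The abstract does not spell out how the three strategies are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' procedure. It assumes each question pair already carries a duplicate probability from some classifier (the `model_prob` field, the `Pair` class, the `needs_review` function, the 0.9 threshold, and the normalization rule are all hypothetical). Pairs whose label conflicts with a confident model score, or with a simple rule, are set aside for human relabeling.

```python
import re
from dataclasses import dataclass


@dataclass
class Pair:
    q1: str
    q2: str
    label: int         # dataset label: 1 = duplicate, 0 = not duplicate
    model_prob: float  # hypothetical classifier probability of "duplicate"


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def needs_review(pair: Pair, threshold: float = 0.9) -> bool:
    """Return True when the label looks suspicious enough for a human check."""
    # Illustrative rule (an assumption): questions that are identical after
    # normalization should not be labeled as non-duplicates.
    if pair.label == 0 and normalize(pair.q1) == normalize(pair.q2):
        return True
    # The model is confident the pair is a duplicate, but the label says no.
    if pair.label == 0 and pair.model_prob >= threshold:
        return True
    # The model is confident the pair is not a duplicate, but the label says yes.
    if pair.label == 1 and pair.model_prob <= 1 - threshold:
        return True
    return False


# Made-up example pairs, for illustration only.
pairs = [
    Pair("How do I learn Python?", "how do i learn python", 0, 0.55),
    Pair("What is BERT?", "Who founded Quora?", 1, 0.02),
    Pair("What is BERT?", "What does BERT stand for?", 1, 0.80),
]

# Candidates to hand to human annotators for relabeling.
flagged = [p for p in pairs if needs_review(p)]
```

Only the flagged pairs go to annotators in this sketch, which keeps the manual effort proportional to the number of suspicious labels rather than to the size of the dataset.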