An Enhanced Deep Learning Model for Duplicate Question Detection on Quora Question pairs using Siamese LSTM
Authors: M. Chandra, Andrea Rodrigues, Jossy P. George
Published in: 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), 2022-04-23
DOI: https://doi.org/10.1109/icdcece53908.2022.9792906
The question-answering platform Quora has millions of users, which increases the probability that questions with similar intent are asked more than once. Two users may phrase the same question in two different ways, and answering similar questions repeatedly degrades the user experience. Manually filtering such questions is tedious, so Quora attempts to detect and remove duplicate questions using a Random Forest model, which is not completely effective. Because Quora's questions and answers are text data, different Natural Language Processing techniques are used to transform the text into numerical vectors. In this research, log loss is the primary metric for evaluating the different models. The primary contribution is the use of a Siamese network to process the two questions in parallel and compute a vector representation of each question. The vectors computed by this method enable similarity detection that is more effective than existing models. GloVe word embeddings are used to capture the semantic similarity between two questions. A random classifier is built as the baseline model, and logistic regression, linear SVM, and XGBoost models are then used to reduce the log loss. Finally, a Siamese LSTM is proposed, which reduces the log loss dramatically.
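Since log loss is the metric against which every model in the abstract is compared, a small illustration of how it behaves may be useful. The sketch below is not the authors' code: the function names `log_loss` and `random_baseline`, the clipping constant `eps`, and the toy labels and probabilities are all assumptions made for illustration. It computes the mean binary cross-entropy and contrasts a random classifier with a constant 0.5 predictor and a reasonably confident one.

```python
import math
import random


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def random_baseline(n, seed=0):
    """Hypothetical random classifier: n probabilities drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Toy labels (assumed data): 1 = duplicate pair, 0 = non-duplicate pair.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
constant = [0.5] * len(labels)
confident = [0.9, 0.1, 0.8, 0.7, 0.2, 0.1, 0.9, 0.3]

print("constant 0.5 :", log_loss(labels, constant))
print("confident    :", log_loss(labels, confident))
print("random       :", log_loss(labels, random_baseline(len(labels))))
```

A constant prediction of 0.5 always yields a log loss of ln 2 (about 0.693), while uniformly random probabilities give an expected log loss of 1.0, so any trained model is useful only insofar as it scores below these reference values.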