Applying Transfer Learning for Improving Domain-Specific Search Experience Using Query to Question Similarity

Ankush Chopra, S. Agrawal, Sohom Ghosh
{"title":"Applying Transfer Learning for Improving Domain-Specific Search Experience Using Query to Question Similarity","authors":"Ankush Chopra, S. Agrawal, Sohom Ghosh","doi":"10.1145/3446132.3446403","DOIUrl":null,"url":null,"abstract":"Search is one of the most common platforms used to seek information. However, users mostly get overloaded with results whenever they use such a platform to resolve their queries. Nowadays, direct answers to queries are being provided as a part of the search experience. The question-answer (QA) retrieval process plays a significant role in enriching the search experience. Most off-the-shelf Semantic Textual Similarity models work fine for well-formed search queries, but their performances degrade when applied to a domain-specific setting having incomplete or grammatically ill-formed search queries in prevalence. In this paper, we discuss a framework for calculating similarities between a given input query and a set of predefined questions to retrieve the question which matches to it the most. We have used it for the financial domain, but the framework is generalized for any domain-specific search engine and can be used in other domains as well. We use Siamese network [6] over Long Short-Term Memory (LSTM) [3] models to train a classifier which generates un-normalized and normalized similarity scores for a given pair of questions. Moreover, for each of these question pairs, we calculate three other similarity scores: cosine similarity between their average word2vec embeddings [15], cosine similarity between their sentence embeddings [7] generated using RoBERTa [17] and their customized fuzzy-match score. Finally, we develop a meta-classifier using Support Vector Machines [19] for combining these five scores to detect if a given pair of questions is similar. We benchmark our model's performance against existing State Of The Art (SOTA) models on Quora Question Pairs (QQP) dataset1 as well as a dataset specific to the financial domain. After evaluating its performance on the financial domain specific data, we conclude that it not only outperforms several existing SOTA models on F1 score but also has decent accuracy.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3446132.3446403","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Search is one of the most commonly used platforms for seeking information. However, users are often overloaded with results whenever they use such a platform to resolve their queries. Nowadays, direct answers to queries are provided as part of the search experience. The question-answer (QA) retrieval process plays a significant role in enriching that experience. Most off-the-shelf Semantic Textual Similarity models work well for well-formed search queries, but their performance degrades when applied to a domain-specific setting where incomplete or grammatically ill-formed search queries are prevalent. In this paper, we discuss a framework for calculating similarities between a given input query and a set of predefined questions in order to retrieve the question that matches it best. We have applied it to the financial domain, but the framework generalizes to any domain-specific search engine and can be used in other domains as well. We use a Siamese network [6] over Long Short-Term Memory (LSTM) [3] models to train a classifier that generates un-normalized and normalized similarity scores for a given pair of questions. Moreover, for each of these question pairs, we calculate three other similarity scores: the cosine similarity between their average word2vec embeddings [15], the cosine similarity between their sentence embeddings [7] generated using RoBERTa [17], and a customized fuzzy-match score. Finally, we develop a meta-classifier using Support Vector Machines [19] to combine these five scores and detect whether a given pair of questions is similar. We benchmark our model's performance against existing State Of The Art (SOTA) models on the Quora Question Pairs (QQP) dataset as well as on a dataset specific to the financial domain. After evaluating its performance on the financial domain-specific data, we conclude that it not only outperforms several existing SOTA models on F1 score but also achieves decent accuracy.
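The abstract describes a five-score meta-classification setup: un-normalized and normalized Siamese LSTM scores, cosine similarity of average word2vec embeddings, cosine similarity of RoBERTa sentence embeddings, and a customized fuzzy-match score, combined by an SVM. The sketch below illustrates that combination in Python under loud assumptions: the Siamese LSTM scores are treated as precomputed inputs rather than reproduced, the paper's customized fuzzy-match score is replaced by a generic rapidfuzz token-set ratio, the word2vec file path and the all-distilroberta-v1 sentence encoder are placeholder choices, and the toy question pairs are invented for illustration. None of these specifics come from the paper itself.

```python
# Minimal sketch of the five-score meta-classification idea from the abstract.
# Stand-ins (not the paper's components): gensim word2vec averages, a
# sentence-transformers RoBERTa encoder, rapidfuzz token-set ratio, and
# placeholder Siamese LSTM scores passed in as precomputed values.
import numpy as np
from gensim.models import KeyedVectors
from sentence_transformers import SentenceTransformer
from rapidfuzz import fuzz
from sklearn.svm import SVC


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def avg_word2vec(text: str, kv: KeyedVectors) -> np.ndarray:
    """Average the word2vec vectors of the in-vocabulary tokens."""
    vecs = [kv[w] for w in text.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)


def pair_features(q1, q2, kv, sent_model, siamese_scores):
    """Build the five-dimensional feature vector for one question pair.

    siamese_scores is a (raw, normalized) tuple assumed to come from a
    separately trained Siamese LSTM, which is not reproduced here.
    """
    w2v_sim = cosine(avg_word2vec(q1, kv), avg_word2vec(q2, kv))
    e1, e2 = sent_model.encode([q1, q2])
    sent_sim = cosine(e1, e2)
    fuzzy = fuzz.token_set_ratio(q1, q2) / 100.0  # stand-in fuzzy-match score
    raw, norm = siamese_scores
    return [raw, norm, w2v_sim, sent_sim, fuzzy]


if __name__ == "__main__":
    # Placeholder resources: any pretrained word2vec file and any
    # RoBERTa-based sentence encoder will do for the sketch.
    kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
    sent_model = SentenceTransformer("all-distilroberta-v1")

    # Toy labeled pairs: (question1, question2, (siamese raw, norm), label).
    pairs = [
        ("how do I reset my card pin", "how to change debit card pin", (0.8, 0.9), 1),
        ("what is my account balance", "how to open a new account", (0.2, 0.1), 0),
    ]
    X = [pair_features(q1, q2, kv, sent_model, s) for q1, q2, s, _ in pairs]
    y = [label for *_, label in pairs]

    meta_clf = SVC(kernel="rbf")  # the SVM meta-classifier over the five scores
    meta_clf.fit(X, y)

    test = pair_features("reset pin of my card", "how to change debit card pin",
                         kv, sent_model, (0.75, 0.85))
    print("predicted similar:", bool(meta_clf.predict([test])[0]))
```

In practice the Siamese LSTM would be trained separately on labeled question pairs (e.g., QQP-style data) and the SVM meta-classifier fit on the resulting five-score feature vectors; the toy data above only demonstrates the shape of the pipeline.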