Document Retrieval Using Deep Learning

Sneha Choudhary, Haritha Guttikonda, Dibyendu Roy Chowdhury, G. Learmonth

2020 Systems and Information Engineering Design Symposium (SIEDS), April 2020. DOI: 10.1109/SIEDS49339.2020.9106632
Document retrieval has seen significant advancements in the last few decades. Recent developments in Natural Language Processing have made it possible to incorporate context and complex lexical patterns into document representations, opening new possibilities for advanced retrieval systems. Traditional approaches to indexing documents average word and sentence encodings to form fixed-length document embeddings. However, the common bag-of-words approach fails to capture semantic context, which can be critical for judging document-query relevance. We address this by leveraging Bidirectional Encoder Representations from Transformers (BERT) to create semantically rich document embeddings. BERT compensates for the limitations of Term Frequency-Inverse Document Frequency (TF-IDF) by incorporating contextual embeddings. In this paper, we propose an ensemble of BERT and TF-IDF for a document retrieval system, in which TF-IDF and BERT jointly score documents against a query to retrieve a final set of top-K documents. We critically compare our model against the standard TF-IDF method and demonstrate a significant performance improvement on the MS MARCO dataset (Microsoft-curated data of Bing queries).
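The ensemble idea described in the abstract, combining a lexical TF-IDF score with a semantic similarity score to rank documents, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting scheme (`alpha`), the min-max normalization, and the `semantic_scores` input (standing in for cosine similarities from a BERT-style encoder) are all assumptions for the sake of the example.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each tokenized document against the query terms with plain TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t in tf:
                idf = math.log(n_docs / df[t])
                s += (tf[t] / len(d)) * idf
        scores.append(s)
    return scores

def ensemble_top_k(query_terms, docs, semantic_scores, alpha=0.5, k=3):
    """Blend lexical and semantic scores, then return indices of the top-K documents.

    semantic_scores is a stand-in for cosine similarities between a BERT query
    embedding and BERT document embeddings (assumed precomputed here).
    """
    lexical = tfidf_scores(query_terms, docs)

    # Min-max normalize each score list so the two signals are comparable.
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

    combined = [alpha * l + (1 - alpha) * s
                for l, s in zip(norm(lexical), norm(semantic_scores))]
    ranked = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]

docs = [
    ["deep", "learning", "for", "retrieval"],
    ["cooking", "pasta", "recipes"],
    ["document", "retrieval", "systems"],
]
query = ["document", "retrieval"]
semantic = [0.9, 0.1, 0.8]  # hypothetical BERT cosine similarities
print(ensemble_top_k(query, docs, semantic, alpha=0.5, k=2))  # → [2, 0]
```

In a real system the semantic scores would come from encoding the query and documents with a pretrained BERT model; the paper's point is that blending this contextual signal with TF-IDF outperforms TF-IDF alone.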