Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content

Kosuke Kurihara, Yoshiyuki Shoji, Sumio Fujita, M. Dürst
{"title":"基于目标主题感知的用户生成内容短句检索Doc2Vec","authors":"Kosuke Kurihara, Yoshiyuki Shoji, Sumio Fujita, M. Dürst","doi":"10.1145/3366030.3366126","DOIUrl":null,"url":null,"abstract":"This paper proposes a new method of supplementing the context of short sentences for the training phase of Doc2Vec. Since CGM (Consumer Generated Media) sites and SNS sites become widespread, the importance of similarity calculation between a given query and a short sentence is increasing. As an example, a search by the query \"sad\" should find actual expressions such as \"I needed a handkerchief\" on a movie review site. Doc2Vec is one of the most widely used methods for vectorization of queries and sentences. However, Doc2Vec often exhibits low accuracy if the training data consists of short sentences, because they lack context. We modified Doc2Vec with the hypothesis that other posts for the same topic (i.e. reviews for the same movie in online movie review sites) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of the Doc2Vec with the PV-DM model; this model estimates the next term from a few previous terms and context. The model trained with item IDs vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. The experimental result demonstrates that our new method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.","PeriodicalId":446280,"journal":{"name":"Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content\",\"authors\":\"Kosuke Kurihara, Yoshiyuki Shoji, Sumio Fujita, M. Dürst\",\"doi\":\"10.1145/3366030.3366126\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a new method of supplementing the context of short sentences for the training phase of Doc2Vec. Since CGM (Consumer Generated Media) sites and SNS sites become widespread, the importance of similarity calculation between a given query and a short sentence is increasing. As an example, a search by the query \\\"sad\\\" should find actual expressions such as \\\"I needed a handkerchief\\\" on a movie review site. Doc2Vec is one of the most widely used methods for vectorization of queries and sentences. However, Doc2Vec often exhibits low accuracy if the training data consists of short sentences, because they lack context. We modified Doc2Vec with the hypothesis that other posts for the same topic (i.e. reviews for the same movie in online movie review sites) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of the Doc2Vec with the PV-DM model; this model estimates the next term from a few previous terms and context. The model trained with item IDs vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. 
The experimental result demonstrates that our new method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.\",\"PeriodicalId\":446280,\"journal\":{\"name\":\"Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3366030.3366126\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366030.3366126","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

This paper proposes a new method of supplementing the context of short sentences for the training phase of Doc2Vec. As CGM (Consumer Generated Media) sites and SNS sites have become widespread, similarity calculation between a given query and a short sentence is increasingly important. As an example, a search for the query "sad" should find actual expressions such as "I needed a handkerchief" on a movie review site. Doc2Vec is one of the most widely used methods for vectorizing queries and sentences. However, Doc2Vec often exhibits low accuracy when the training data consists of short sentences, because they lack context. We modified Doc2Vec under the hypothesis that other posts on the same topic (i.e., reviews of the same movie on online movie review sites) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of Doc2Vec with the PV-DM model; this model estimates the next term from a few previous terms and the context. The model trained with item IDs vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. The experimental results demonstrate that our new method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.
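To make the core idea concrete, the sketch below shows what target-topic tagging could look like with the gensim Doc2Vec implementation. This is not the authors' code: the toy corpus, the movie IDs, and the hyperparameters are illustrative assumptions. The only change from a vanilla setup is that each short review is tagged with the ID of the movie it discusses (the target topic) rather than a unique sentence ID, so all reviews of the same movie share one paragraph vector during PV-DM training (dm=1 in gensim).

```python
# Minimal sketch of target-topic aware Doc2Vec training and retrieval,
# assuming gensim and a hypothetical toy corpus of (movie_id, review) pairs.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical data: short review sentences, each attached to a target topic (a movie).
reviews = [
    ("movie_001", "I needed a handkerchief"),
    ("movie_001", "cried through the whole ending"),
    ("movie_002", "laughed until my sides hurt"),
]

# Tag each sentence with its target-topic ID instead of a per-sentence ID,
# so sentences about the same movie share context during training.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[movie_id])
    for movie_id, text in reviews
]

# dm=1 selects PV-DM, which predicts a word from a few preceding words
# plus the (here topic-level) paragraph vector.
model = Doc2Vec(corpus, dm=1, vector_size=100, window=5, min_count=1, epochs=40)

# At query time, embed both the query and a candidate sentence with the
# trained model and rank candidates by cosine similarity.
query_vec = model.infer_vector(simple_preprocess("sad"))
sentence_vec = model.infer_vector(simple_preprocess("I needed a handkerchief"))
cosine = np.dot(query_vec, sentence_vec) / (
    np.linalg.norm(query_vec) * np.linalg.norm(sentence_vec)
)
print(f"query-sentence similarity: {cosine:.3f}")
```

In this reading, the design choice is that the paragraph vector carries topic-level background (the movie) that a single short sentence cannot supply on its own; individual sentences are still embedded independently at inference time via infer_vector.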