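Seq2Set2Seq: A Two-stage Disentangled Method for Reply Keyword Generation in Social Media via Multi-label Prediction and Determinantal Point Processes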

Impact factor: 2.0; Zone 4 (Computer Science); JCR Q3 (Computer Science, Artificial Intelligence)

ACM Transactions on Asian and Low-Resource Language Information Processing. Pub Date: 2024-02-05. DOI: 10.1145/3644074

Jie Liu, Yaguang Li, Shizhu He, Shun Wu, Kang Liu, Shenping Liu, Jiong Wang, Qing Zhang


Citations: 0

Abstract



Seq2Set2Seq: A Two-stage Disentangled Method for Reply Keyword Generation in Social Media via Multi-label Prediction and Determinantal Point Processes

Social media produces a large amount of content every day. How to predict the potential influence of that content from the perspective of social reply feedback is a key issue that has not been explored. We therefore propose a novel task, reply keyword prediction in social media, which aims to predict the keywords of potential replies while covering as many aspects as possible. A prerequisite challenge is that no accessible social media dataset labels such keywords. To address this, we introduce a new dataset for studying reply keyword prediction in social media. The task can be seen as single-turn dialogue keyword prediction for an open-domain dialogue system. However, existing dialogue keyword prediction methods cannot be adopted directly because they have two main drawbacks. First, they provide no explicit mechanism for modeling topic complementarity between keywords, which is crucial in social media for controllably modeling all aspects of replies. Second, they do not explicitly model keyword collocations, which makes fine-grained prediction less controllable to optimize, since much less context is available than in dialogue. To address these issues, we propose a two-stage disentangled framework that optimizes complementarity and collocation explicitly and separately. In the first stage, a sequence-to-set paradigm uses multi-label prediction and determinantal point processes to generate a set of keyword seeds that satisfies complementarity. In the second stage, a set-to-sequence paradigm uses a seq2seq model, guided by the keyword seeds from that set, to generate finer-grained keywords with collocation. Experiments show that the method generates keywords that are not only more diverse but also more relevant and consistent. Furthermore, keywords obtained with this method yield better reply generation results in a retrieval-based system than those from other methods.
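
The first-stage idea in the abstract, choosing keyword seeds that are both likely and mutually complementary, can be illustrated with a small determinantal point process (DPP) selection routine. The following is a minimal sketch and not the authors' implementation: the abstract does not specify the kernel or the inference procedure, so the quality-times-similarity kernel, the greedy selection loop, the function names (`det`, `build_kernel`, `greedy_dpp`), and the toy keywords, scores, and feature vectors are all assumptions made for illustration. The quality scores stand in for the probabilities a multi-label predictor might produce.

```python
from typing import List, Sequence


def det(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular submatrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def build_kernel(quality: Sequence[float],
                 features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assumed kernel: L[i][j] = q_i * <f_i, f_j> * q_j."""
    n = len(quality)
    return [
        [quality[i] * quality[j]
         * sum(a * b for a, b in zip(features[i], features[j]))
         for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp(kernel: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily add the item that maximizes det(L_Y) until k items are chosen."""
    selected: List[int] = []
    candidates = set(range(len(kernel)))
    while candidates and len(selected) < k:
        best, best_det = None, 0.0
        for i in sorted(candidates):
            idx = selected + [i]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining candidate is redundant
            break
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: "price" and "cost" are near-duplicates.
keywords = ["price", "cost", "battery", "camera"]
quality = [0.9, 0.85, 0.6, 0.5]  # stand-in for multi-label scores
features = [
    [1.0, 0.0, 0.0],    # price
    [0.96, 0.28, 0.0],  # cost (cosine 0.96 with price)
    [0.0, 1.0, 0.0],    # battery
    [0.0, 0.0, 1.0],    # camera
]

selected = greedy_dpp(build_kernel(quality, features), k=3)
print([keywords[i] for i in selected])  # ['price', 'battery', 'camera']
```

Ranking by score alone would keep "price", "cost", and "battery"; the determinant penalizes the near-duplicate "cost", so the lower-scored but complementary "camera" is selected instead. Greedy selection only approximates the most probable set under the DPP, which is adequate for a toy example of this size.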


Source journal

ACM Transactions on Asian and Low-Resource Language Information Processing
Subject area: Computer Science, General Computer Science
CiteScore: 3.60
Self-citation rate: 15.00%
Articles published: 241

About the journal: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages and in low-resource languages of Africa, Australasia, Oceania, and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:

- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g., parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech Processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.

Papers that deal with theory, systems design, evaluation, and applications in these subjects are appropriate for TALLIP. Emphasis is placed on the originality and practical significance of the reported research.