A Framework for Task-specific Short Document Expansion

Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. Pub Date: 2016-10-24. DOI: 10.1145/2983323.2983811

Ramakrishna Bairi, Raghavendra Udupa, Ganesh Ramakrishnan

Citations: 4

Abstract

Collections that contain a large number of short texts (e.g., tweets and reviews) are becoming increasingly common. Analytical tasks involving short texts, such as classification and clustering, can be challenging because the texts are sparse and lack context. A frequently encountered problem is low accuracy on the task. A standard technique for handling short texts is to expand them before subjecting them to the task. However, existing work on short text expansion suffers from certain limitations: (i) it depends on domain knowledge to expand the text; (ii) it employs task-specific heuristics; and (iii) the expansion procedure is tightly coupled to the task. This makes it hard to adapt a procedure designed for one task to another. We present an expansion technique, TIDE (Task-specIfic short Document Expansion), that can be applied to several Machine Learning, NLP and Information Retrieval tasks on short texts (such as short text classification, clustering, and entity disambiguation) without using task-specific heuristics or domain-specific knowledge for expansion. At the same time, our technique is capable of learning to expand short texts in a task-specific way. That is, the same technique, applied to expand a short text in two different tasks, can learn to produce different expansions depending on which expansion benefits each task's performance. To speed up the learning process, we also introduce a technique called block learning. Our experiments with classification and clustering tasks show that our framework improves upon several baselines according to standard evaluation metrics, including accuracy and normalized mutual information (NMI).
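The abstract does not describe how TIDE works internally, so the sketch below is not the authors' method. It only illustrates the general idea of expanding a short text before running a task on it, using a fixed, task-agnostic heuristic: append terms from the most similar documents in a background corpus. The function names (`tokenize`, `cosine`, `expand`), the parameters `k` and `max_new_terms`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberately crude stand-in."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand(short_text, background, k=2, max_new_terms=4):
    """Append up to max_new_terms unseen terms drawn from the k most
    similar background documents (hypothetical baseline, not TIDE)."""
    query = Counter(tokenize(short_text))
    # Score every background document and keep only those that overlap.
    scored = [(cosine(query, Counter(tokenize(doc))), doc) for doc in background]
    neighbours = sorted(
        (pair for pair in scored if pair[0] > 0.0),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Count candidate terms that the short text does not already contain.
    candidates = Counter()
    for _, doc in neighbours:
        for term in tokenize(doc):
            if term not in query:
                candidates[term] += 1
    new_terms = [term for term, _ in candidates.most_common(max_new_terms)]
    if not new_terms:
        return short_text
    return short_text + " " + " ".join(new_terms)


background = [
    "apple releases new iphone with faster chip",
    "heavy rain floods the city streets",
    "iphone sales lift apple quarterly revenue",
]
expanded = expand("apple iphone", background)
print(expanded)
```

Because this heuristic is fixed, it produces the same expansion whatever the downstream task is. That is the limitation the abstract says TIDE addresses by learning which expansion benefits each task.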
