Deriving labeled training data for topic link detection by alternating words

2015 International Conference on Data and Software Engineering (ICoDSE). Pub Date: 2015-11-01. DOI: 10.1109/ICODSE.2015.7436976

Marc W. Abel, S. M. Chung

Citations: 0

Abstract

Although classifiers can be trained to estimate whether two short text segments relate to a common topic, obtaining training data for supervised learning presents a hurdle. The natural approach would be to train with topic-aligned pairs of text segments from a large corpus, but no method is available to locate such alignments. We propose that simply partitioning the words of a large document according to their odd and even positions will yield training data suitable for certain applications and sets of features. The reason is that the two partitioned texts remain topic-aligned along their respective lengths despite sharing no original word instances. We further show that parametrically introducing a small amount of overlap into the partitioned texts can greatly improve the precision of a classifier.
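The odd/even partition described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, whitespace tokenization, the fixed segment length, and in particular the overlap mechanism (copying each word into the opposite half with probability `overlap`) are assumptions, since the abstract does not say how overlap is introduced.

```python
import random


def alternate_split(text, overlap=0.0, seed=0):
    """Split text into two word lists by alternating word positions.

    `overlap` is a hypothetical parameter: the probability that a word is
    also copied into the opposite half. The abstract only says overlap is
    introduced "parametrically", so this mechanism is an assumption.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the split is reproducible
    first, second = [], []
    for position, word in enumerate(text.split()):
        # Words at even indices go to `first`, odd indices to `second`.
        own, other = (first, second) if position % 2 == 0 else (second, first)
        own.append(word)
        if rng.random() < overlap:
            other.append(word)  # shared word instance (assumed mechanism)
    return first, second


def labeled_pairs(first, second, segment_len=4):
    """Cut both halves into position-aligned segments labeled 1 (same topic).

    The fixed segment length is an illustrative choice, not from the paper.
    """
    count = min(len(first), len(second)) // segment_len
    return [
        (first[i * segment_len:(i + 1) * segment_len],
         second[i * segment_len:(i + 1) * segment_len],
         1)
        for i in range(count)
    ]


if __name__ == "__main__":
    a, b = alternate_split("the quick brown fox jumps over the lazy dog today")
    print(a)
    print(b)
    print(labeled_pairs(a, b, segment_len=2))
```

With `overlap=0.0` the two halves share no word instances, matching the base case in the abstract; raising `overlap` slightly lets some words appear in both halves. How negative (non-aligned) pairs are built is not stated in the abstract, so the sketch leaves them out.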
