An Approach to the Problem of Annotation of Research Publications

Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Pub Date: 2015-02-02. DOI: 10.1145/2684822.2697032

Ekaterina Chernyak

Citations: 22

Abstract

We explore an approach to multi-label annotation of research papers. We develop techniques for annotating (labeling) research papers in informatics and computer science with key phrases taken from the ACM Computing Classification System. The techniques rely on a phrase-to-text relevance measure, so that only the most relevant phrases enter the annotation. Three phrase-to-text relevance measures are compared experimentally in this setting: (a) the cosine relevance score between conventional vector space representations of the texts, coded with tf-idf weighting; (b) BM25, a popular characteristic of the probability of term generation; and (c) CPAMF, an in-house characteristic: the conditional probability of symbols averaged over matching fragments in suffix trees representing texts and phrases. In an experiment on a set of texts published in ACM journals and manually annotated by their authors, CPAMF outperforms both the cosine measure and BM25 by a wide margin.
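
To make the annotation setting concrete, the following is a minimal sketch of the two conventional baselines named above, the tf-idf cosine score and BM25, used to rank candidate key phrases against a text. It is not the authors' implementation. The whitespace tokenizer, the smoothed idf formulas, the BM25 parameters `k1` and `b`, and the names `Corpus`, `cosine_relevance`, `bm25_relevance`, and `annotate` are all assumptions made for illustration. CPAMF is not sketched because the abstract does not give its full definition.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


class Corpus:
    """Document frequencies and length statistics for a text collection."""

    def __init__(self, texts):
        self.docs = [tokenize(t) for t in texts]
        self.n = len(self.docs)
        # Number of documents in which each term occurs at least once.
        self.df = Counter(term for doc in self.docs for term in set(doc))
        self.avg_len = sum(len(doc) for doc in self.docs) / self.n

    def idf(self, term):
        # Smoothed idf (an assumed variant); always positive.
        return math.log((self.n + 1) / (self.df[term] + 1)) + 1.0


def cosine_relevance(phrase, text, corpus):
    """Cosine between tf-idf vectors of the phrase and the text."""
    def weights(tokens):
        return {t: c * corpus.idf(t) for t, c in Counter(tokens).items()}

    p, d = weights(tokenize(phrase)), weights(tokenize(text))
    dot = sum(w * d.get(t, 0.0) for t, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0


def bm25_relevance(phrase, text, corpus, k1=1.2, b=0.75):
    """BM25 score of the text, treating the phrase as the query."""
    doc = tokenize(text)
    tf = Counter(doc)
    length_norm = 1.0 - b + b * len(doc) / corpus.avg_len
    score = 0.0
    for term in tokenize(phrase):
        df = corpus.df[term]
        # Non-negative idf variant (an assumption, not taken from the paper).
        idf = math.log(1.0 + (corpus.n - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
    return score


def annotate(text, phrases, measure, corpus, top_k=3):
    """Return the top_k phrases ranked by the given relevance measure."""
    ranked = sorted(phrases, key=lambda p: measure(p, text, corpus),
                    reverse=True)
    return ranked[:top_k]


# Toy data, invented for illustration only.
texts = [
    "suffix trees index strings for fast substring search",
    "neural networks learn image features",
    "query ranking with inverted index for web search",
]
phrases = ["suffix trees", "neural networks", "web search"]
corpus = Corpus(texts)
print(annotate(texts[0], phrases, cosine_relevance, corpus, top_k=1))
print(annotate(texts[0], phrases, bm25_relevance, corpus, top_k=1))
```

Passing the measure as a function argument keeps `annotate` unchanged when a different phrase-to-text measure is plugged in, which mirrors the comparison of several measures in a single annotation setting.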
