Cross-lingual document similarity

Proceedings of the ITI 2012 34th International Conference on Information Technology Interfaces. Pub Date: 2012-06-25. DOI: 10.2498/iti.2012.0467

A. Muhic, Jan Rupnik, P. Skraba

Citations: 10

Abstract

In this paper we investigated how to compute similarities between documents written in different languages, based on a weakly aligned multilingual collection of documents. The cross-lingual similarities are computed from an aligned set of basis vectors obtained by applying either latent semantic indexing or the k-means algorithm to an aligned multilingual corpus. We evaluated the methods on two data sets: Wikipedia and the European Parliament Proceedings Parallel Corpus.
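The abstract does not spell out the computation, so the following is only a minimal sketch of one way the latent semantic indexing variant could work, under stated assumptions. It assumes the term-document matrices of the two languages are stacked so that each column is one aligned document pair, that a truncated SVD of the stacked matrix gives the aligned basis, and that documents are mapped into the shared space by least squares and compared by cosine similarity. The function names (`aligned_lsi_bases`, `project`, `cross_lingual_similarity`) and the toy data are hypothetical and are not taken from the paper. Cluster centroids from k-means on the stacked matrix could presumably be used as the basis in the same way.

```python
import numpy as np


def aligned_lsi_bases(X1, X2, k):
    """Return one basis per language from a joint truncated SVD (assumed setup).

    X1: (n_terms_lang1, n_docs) term-document matrix for language 1.
    X2: (n_terms_lang2, n_docs) term-document matrix for language 2.
    Column j of X1 and column j of X2 describe the same aligned document.
    """
    stacked = np.vstack([X1, X2])            # both vocabularies, shared columns
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :k]                           # top-k left singular vectors
    n1 = X1.shape[0]
    return U_k[:n1], U_k[n1:]                # split rows back by language


def project(basis, doc):
    """Least-squares coordinates of a document in the span of a basis."""
    coords, *_ = np.linalg.lstsq(basis, doc, rcond=None)
    return coords


def cross_lingual_similarity(B1, B2, d1, d2):
    """Cosine similarity of two documents after mapping to the shared space."""
    z1, z2 = project(B1, d1), project(B2, d2)
    denom = np.linalg.norm(z1) * np.linalg.norm(z2)
    return float(z1 @ z2 / denom) if denom > 0 else 0.0


# Toy aligned corpus (made up): documents 0-1 are about topic A, 2-3 about topic B.
X1 = np.array([[2, 1, 0, 0],
               [1, 2, 0, 0],
               [0, 0, 2, 1],
               [0, 0, 1, 2]], dtype=float)   # language 1: 4 terms
X2 = np.array([[1, 2, 0, 0],
               [2, 1, 0, 0],
               [0, 0, 3, 3]], dtype=float)   # language 2: 3 terms

B1, B2 = aligned_lsi_bases(X1, X2, k=2)

d1 = np.array([1.0, 1.0, 0.0, 0.0])          # new language-1 document, topic A
d2_same = np.array([1.0, 1.0, 0.0])          # new language-2 document, topic A
d2_other = np.array([0.0, 0.0, 1.0])         # new language-2 document, topic B

sim_same = cross_lingual_similarity(B1, B2, d1, d2_same)
sim_other = cross_lingual_similarity(B1, B2, d1, d2_other)
print(sim_same, sim_other)
```

On this toy data the same-topic pair scores close to 1 and the cross-topic pair close to 0, because the two topics use disjoint terms in both languages. Real corpora would need term weighting and a much larger k, neither of which is specified here.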



