Textual similarities based on a distributional approach

Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99
Pub Date: 1999-09-01
DOI: 10.1109/DEXA.1999.795163

Romaric Besançon, M. Rajman, Jean-Cédric Chappelier


Citations: 29

Abstract

The design of efficient textual similarity measures is an important issue in textual data exploration. Textual similarities are central, for example, to structuring document collections (e.g. clustering) and to information retrieval (IR), which relies on them to measure how well documents match a query. This paper presents and compares several textual similarity measures within the distributional semantics (DS) model for IR. The DS model extends the standard vector space model by also taking into account the co-frequencies between terms in a given reference corpus. These co-frequencies are considered to provide a distributional representation of the "semantics" of the terms, and the resulting co-occurrence profiles are used to represent documents as vectors. Practical retrieval experiments using DS-based similarity models were conducted within the AMARYLLIS evaluation campaign. The results are presented and indicate a significant improvement in performance over the standard approach.
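The idea of representing documents through term co-occurrence profiles can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the co-occurrence unit, the term weighting, or which similarity measures are compared, so the choices below (co-occurrence within one tokenized unit, raw frequency weights, cosine similarity) and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_profiles(corpus):
    """Map each term to a Counter of its co-frequencies with other terms.

    Assumption: two terms co-occur when they appear in the same corpus unit
    (one tokenized sentence or document); each unit counts once per pair.
    """
    profiles = {}
    for unit in corpus:
        terms = sorted(set(unit))
        for term in terms:
            profiles.setdefault(term, Counter())
        for a, b in combinations(terms, 2):
            profiles[a][b] += 1
            profiles[b][a] += 1
    return profiles


def ds_vector(tokens, profiles):
    """Represent a text as the frequency-weighted sum of its terms' profiles.

    Assumption: raw term frequency is used as the weight; terms unseen in
    the reference corpus contribute nothing.
    """
    vector = Counter()
    for term, freq in Counter(tokens).items():
        for other, cofreq in profiles.get(term, {}).items():
            vector[other] += freq * cofreq
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy reference corpus, already tokenized.
reference_corpus = [
    ["car", "engine", "road"],
    ["automobile", "engine", "road"],
    ["cat", "milk"],
]
profiles = cooccurrence_profiles(reference_corpus)

query = ["car"]
document = ["automobile"]

# Standard vector space model: the texts share no term.
vsm_score = cosine(Counter(query), Counter(document))
# DS-style representation: both terms co-occur with "engine" and "road".
ds_score = cosine(ds_vector(query, profiles), ds_vector(document, profiles))
print(vsm_score, ds_score)
```

In this toy case the plain term vectors of the query and the document have no term in common, so their cosine is zero, while their co-occurrence-based vectors point in nearly the same direction. This is the kind of effect the DS model is meant to capture; whether it helps on real collections is what the AMARYLLIS experiments evaluate.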
