Automatic Extraction of Meaning from the Web

2006 IEEE International Symposium on Information Theory. Pub Date: 2006-07-09. DOI: 10.1109/ISIT.2006.261979

Rudi Cilibrasi, P. Vitányi

Citations: 30

Abstract

We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like "red" or "Christianity". For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by Web users corresponding to particular semantic relations between (the names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression, and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
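The abstract names the two ingredients (compression and search page counts) but gives no formulas. As a rough illustration, the sketch below uses the normalized compression distance (NCD) and normalized Google distance (NGD) as they are usually defined in the authors' related work; that these are the exact measures meant here is an assumption. The function names, the use of `zlib` as the compressor, and the page-count arguments `fx`, `fy`, `fxy`, and `n` are hypothetical choices made for illustration, and no real search API is called.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of data after zlib compression (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two literal objects.

    Assumed form: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from page counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: assumed total number of indexed pages (all supplied by the caller).
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never co-occur, are treated as infinitely far apart.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

With a real compressor the NCD of an object with itself is small but not exactly zero, which is why the sketch is best read as a comparison tool: smaller values suggest more shared structure, and the absolute numbers depend on the compressor chosen.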


