Clustering and semantics preservation in cultural heritage information spaces

Javier Pereira, Felipe Schmidt, Pedro Contreras, F. Murtagh, H. Astudillo
RIAO Conference, published 2010-04-28
DOI: 10.5555/1937055.1937078 (https://doi.org/10.5555/1937055.1937078)
Citations: 2

Abstract

This paper analyzes how well the original semantic similarity among objects is preserved when dimensionality reduction is applied to the source data and clustering is then performed on the reduced data. An experiment compares the Baire (longest common prefix) ultrametric with K-Means, both applied after a random projection. The experiment uses a data matrix extracted from a cultural heritage database. Because the random projection produces a vector whose components lie in the interval [0, 1], clusters can be obtained at different levels of digit precision. The mean semantic similarity of the clusters is then calculated using a modified version of the Jaccard index. Our findings show that semantics is difficult to preserve with these methods. However, a Student's t-test on mean similarity indicates that the Baire clusters are semantically better than the K-Means clusters as the digit precision increases, at an increasing cost in orphan clustered objects. Despite this cost, we argue that the ultrametric technique provides an efficient way to detect semantic homogeneity in the original data space.
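The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the projection uses a single random weight vector and rescales the result to [0, 1]; objects are represented as sets of descriptors; an "orphan" is taken to be an object that ends up alone in its cluster; and the plain Jaccard index stands in for the modified version used in the paper, whose definition is not given here. All function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def random_projection(rows, seed=0):
    """Project each row onto one random vector and rescale to [0, 1].

    Assumption: min-max rescaling; the paper's exact projection may differ.
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(len(rows[0]))]
    raw = [sum(w * x for w, x in zip(weights, row)) for row in rows]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0  # guard against identical projections
    return [(v - lo) / span for v in raw]


def baire_clusters(values, precision):
    """Group objects whose values share the first `precision` decimal digits.

    Sharing a longest common prefix of at least `precision` digits is the
    same as falling in the same bucket after truncation (not rounding).
    """
    scale = 10 ** precision
    clusters = defaultdict(list)
    for idx, v in enumerate(values):
        key = min(int(v * scale), scale - 1)  # keep v == 1.0 in the top bucket
        clusters[key].append(idx)
    return dict(clusters)


def jaccard(a, b):
    """Plain Jaccard index between two descriptor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_cluster_similarity(cluster, descriptors):
    """Mean pairwise Jaccard similarity in one cluster; None for an orphan."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return None
    total = sum(jaccard(descriptors[i], descriptors[j]) for i, j in pairs)
    return total / len(pairs)


def count_orphans(clusters):
    """Number of objects that sit alone in their cluster."""
    return sum(1 for members in clusters.values() if len(members) == 1)


if __name__ == "__main__":
    # Toy projected values and descriptor sets, invented for this example.
    values = [0.1234, 0.1278, 0.1291, 0.5630, 0.9000]
    descriptors = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"d"}, {"e"}]
    for precision in (1, 2, 3):
        clusters = baire_clusters(values, precision)
        sims = [mean_cluster_similarity(c, descriptors) for c in clusters.values()]
        print(precision, clusters, sims, count_orphans(clusters))
```

On the toy data, raising the precision from 2 to 3 digits splits the one multi-object cluster into singletons, which mirrors the trade-off the abstract describes between tighter clusters and more orphan objects.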