Integrating element and term semantics for similarity-based XML document clustering
Authors: Jianwu Yang, W. K. Cheung, Xiaoou Chen
Venue: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05)
Publication date: 2005-09-19
DOI: 10.1109/WI.2005.80 (https://doi.org/10.1109/WI.2005.80)
Citations: 17
Abstract
The structured link vector model (SLVM) is a recently proposed document representation that takes both structural and semantic information into account when measuring XML document similarity. Its formulation includes an element similarity matrix that captures the semantic similarity between XML elements, the structural components of XML documents. In this paper, instead of applying heuristics to define the similarity matrix, we propose learning it iteratively from pairwise-similar training data. In addition, we extend SLVM to SLVM-LSI by incorporating term semantics through latent semantic indexing, while preserving the element-similarity-related properties of the original SLVM. For performance evaluation, we applied SLVM-LSI to similarity-based clustering of two XML datasets and found that it significantly outperforms the conventional vector space model and edit-distance-based methods. The similarity matrix, obtained as a byproduct of the learning, can provide higher-level knowledge about the semantic relationships between the XML elements.