Nafisse Samadi, Sri Devi Ravana (Corresponding Author)
XML CLUSTERING FRAMEWORK BASED ON DOCUMENT CONTENT AND STRUCTURE IN A HETEROGENEOUS DIGITAL LIBRARY
Malaysian Journal of Computer Science, published 2023-04-30
DOI: 10.22452/mjcs.vol36no2.2 (https://doi.org/10.22452/mjcs.vol36no2.2)
Abstract
As the volume of textually published information in digital libraries grows, efficient retrieval methods are required. Textual documents in a digital library vary in both structure and content. When these documents are organized in an XML structure, they can be represented at hierarchical levels of granularity, which improves precision through focused retrieval. In this way, contextual elements of each document can be retrieved from a known structure. One solution for retrieving these elements is clustering based on a combination of Content and Structural similarities. To achieve this, a novel two-level clustering framework based on Content and Structure is proposed. The framework decomposes a document into meaningful structural units and analyzes all of its rich text within its own structure. The proposed framework was evaluated on a heterogeneous XML document collection that covers a variety of data sources, structures, and content, and serves as a sample of a real digital library. The collection was built so that all of the study's objectives could be tested. The clustering results were evaluated by the Entropy criterion. Finally, the Content and Structure clustering was compared with the usual Content Only clustering to demonstrate the benefit of considering structural features over existing Content Only methods in the retrieval process. The total Entropy results of the two-level Content and Structural clustering are almost twice as good as those of the Content Only clustering approach. Consequently, the proposed framework can improve Information Retrieval systems from two points of view: i) considering the structural aspect of text-rich documents in the retrieval process, and ii) replacing document-level retrieval with element-level retrieval.
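The abstract reports its results with the Entropy criterion but does not give the formula. As an illustration only, the sketch below uses the common definition of clustering entropy: the entropy of the true class labels inside each cluster, averaged over clusters with weights proportional to cluster size. Whether the paper uses exactly this variant (for example, with or without normalization) is an assumption, and the function names and the sample labels are hypothetical. Under this definition a lower value means purer clusters.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (in bits) of the true-class labels inside a single cluster."""
    n = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) summed over the classes present in the cluster
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def total_entropy(clusters):
    """Size-weighted average of per-cluster entropies; lower is better."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters if c)


# Hypothetical example: each inner list holds the true class of the
# elements that a clustering run placed in the same cluster.
mixed = [["article", "article", "article", "thesis"], ["thesis", "thesis"]]
pure = [["article", "article", "article"], ["thesis", "thesis", "thesis"]]

print(round(total_entropy(mixed), 4))
print(total_entropy(pure))
```

A clustering whose clusters each contain a single class scores 0, while clusters that mix classes evenly score highest, which is why a smaller total Entropy is read as a better result.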
About the journal:
The Malaysian Journal of Computer Science (ISSN 0127-9084) has been published four times a year, in January, April, July and October, by the Faculty of Computer Science and Information Technology, University of Malaya, since 1985. Over the years, the journal has gained popularity and the number of paper submissions has increased steadily. Rigorous reviews by the referees have helped to maintain the journal's high standard. Its objectives are to promote the exchange of information and knowledge about research work and new inventions and developments in Computer Science, and about the use of Information Technology in building an information-rich society; and to assist academic staff from local and foreign universities, business and industrial sectors, government departments and academic institutions in publishing research results and studies in Computer Science and Information Technology through a scholarly publication. The journal is indexed and abstracted by Clarivate Analytics' Web of Science and Elsevier's Scopus.