S. Karthick, S. Mercy Shalinie, Ar Eswarimeena, P. Madhumitha, T. Naga Abhinaya
Effect of multi-word features on the hierarchical clustering of web documents
2014 International Conference on Recent Trends in Information Technology. Published 2014-04-10. DOI: 10.1109/ICRTIT.2014.6996185 (https://doi.org/10.1109/ICRTIT.2014.6996185)
Citations: 3
Abstract
Contemporary search engines and other automated web tools must extract relevant information from huge web archives. This is considered a difficult task because of the semi-structured and unstructured nature of web documents. Users need automated ways of organizing and cataloging web documents so that they can be queried efficiently. Clustering is typically employed to organize web archives and subsequently to handle user queries. This paper analyzes the effect of including multi-word features on the performance of a hierarchical clustering algorithm. Noun sequences are the predominant features considered in our work, whereas most previous research uses n-grams as features. The paper also analyzes how combining link-based and content-based representations of web documents and their inter-relationships affects clustering performance. Empirical evaluation of the hierarchical clustering engine suggests that including multi-word features improves the precision of the hierarchical clustering algorithm.
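As a rough illustration of the idea described in the abstract, the sketch below adds noun-sequence features to a bag-of-words representation and then clusters the documents hierarchically. It is not the authors' implementation. The toy noun lexicon `NOUNS` stands in for a real part-of-speech tagger, and the average-link agglomerative scheme, the cosine similarity, and the sample documents are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon; a real system would use a part-of-speech tagger.
NOUNS = {"search", "engine", "web", "document", "cluster",
         "algorithm", "football", "match", "league"}

def noun_sequences(text):
    """Return maximal runs of two or more consecutive nouns."""
    features, run = [], []
    for token in text.lower().split() + [""]:  # empty sentinel flushes the last run
        if token in NOUNS:
            run.append(token)
        else:
            if len(run) >= 2:
                features.append(" ".join(run))
            run = []
    return features

def featurize(text, multiword=True):
    """Count single words and, optionally, noun-sequence features."""
    counts = Counter(text.lower().split())
    if multiword:
        counts.update(noun_sequences(text))
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerate(vectors, k):
    """Average-link agglomerative clustering until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                sim = sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the most similar pair
        del clusters[j]
    return clusters

docs = [
    "search engine ranks web document",
    "web document cluster algorithm for search engine",
    "football match in the league",
    "league football match result",
]
vectors = [featurize(d) for d in docs]
clusters = agglomerate(vectors, k=2)
print(clusters)
```

Calling `featurize(d, multiword=False)` gives the single-word baseline, so the two representations can be compared on the same documents. Link-based features, which the paper also studies, are not modeled here.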