Visualizing document similarity using n-grams and latent semantic analysis

2016 SAI Computing Conference (SAI), Pub Date: 2016-07-13, DOI: 10.1109/SAI.2016.7555994

A. S. Hussein

Citations: 16

Abstract

As the number of information resources and documents grows rapidly, efficient tools with intuitive visualization capabilities are urgently needed to help users carry out document similarity analysis and/or plagiarism detection by discovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The method models the relationship between documents and their n-gram phrases, which are generated from the normalized text using morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words within the examined documents. Text indexing and stop-word removal are performed using a new technique that handles multiple long documents efficiently. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. The hidden associations between the documents and their unique n-gram phrases are then investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. As Arabic is one of the most morphologically rich and complex languages, this paper emphasizes similarity analysis and visualization of Arabic documents. Various experiments were carried out, revealing the strong capabilities of the proposed method in analyzing and visualizing literal similarities and some types of intelligent similarities.
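The pipeline outlined in the abstract (n-gram phrases, a TF-IDF model, LSA via SVD, pairwise similarity) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes plain whitespace tokenization in place of morphological analysis and tagging, a standard smoothed TF-IDF weighting in place of the heuristic-based pairwise matching algorithm, no stop-word removal, and English toy documents in place of Arabic text. The function names `word_ngrams`, `tfidf_matrix`, and `lsa_similarity` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build a (phrases x documents) TF-IDF matrix over the unique n-gram phrases."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: in how many documents each phrase occurs.
    df = Counter(phrase for c in counts for phrase in c)
    matrix = np.zeros((len(vocab), len(docs)))
    for j, c in enumerate(counts):
        for i, phrase in enumerate(vocab):
            if c[phrase]:
                # Smoothed IDF (an assumption; the paper's weighting is not given here).
                idf = math.log((1 + len(docs)) / (1 + df[phrase])) + 1.0
                matrix[i, j] = c[phrase] * idf
    return vocab, matrix


def lsa_similarity(matrix, k=2):
    """Project documents onto the top-k singular directions; return coords and cosine similarities."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    coords = (np.diag(s[:k]) @ vt[:k]).T  # one row of latent coordinates per document
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    unit = coords / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return coords, unit @ unit.T


# Toy corpus (made up for illustration): two near-duplicates and one unrelated text.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "stock markets fell sharply on monday morning",
]
vocab, matrix = tfidf_matrix(docs, n=2)
coords, sim = lsa_similarity(matrix, k=2)
print(np.round(sim, 3))
```

With `k=2`, the rows of `coords` are two-dimensional and could be passed directly to a scatter plot, which is one simple way to visualize the SVD results. In this toy corpus the two near-duplicate sentences share most of their bigrams, so they should land on nearly the same point, while the third document shares none and stays apart.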
