M. Rafi, Hamza Mustafa Khan, Haya Nadeem, H. Shakeel
2021 22nd International Arab Conference on Information Technology (ACIT), published 2021-12-21. DOI: https://doi.org/10.1109/acit53391.2021.9677217
Unsupervised Topic Aware Document-Level Semantic Representation for Document Clustering
Text representation is critical for many applications in natural/computational language processing and understanding. Textual documents are the basic unit of written or typed communication in many practical workplace settings. Humans are very good at understanding the purpose of a document and comprehending its semantics. Document clustering is a specialized form of clustering in which documents written in human language are automatically partitioned into groups (clusters) of distinguishable sub-collections. The document clustering process is very sensitive to document representation. This paper investigates document representation models for succinctly representing document-level semantics and employs a topic-aware document representation approach. It suggests several desirable features for effective document representation: (i) it should capture word-to-word relationships, (ii) it should derive from the basic words and their relations to form topic-units (larger compositions of words), and (iii) it should use a feature-weighting scheme based on frequent topic-units and document correlation. The task of document clustering is used to evaluate the representation schemes. Standard text mining datasets are used, and clustering is evaluated with internal and external clustering evaluation measures.
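The abstract does not specify how topic-units are built or weighted, so the following is only a minimal sketch of the three desirable features under stated assumptions, not the authors' method. It assumes that a topic-unit is a pair of words co-occurring in at least `min_df` documents, and that the weight of a unit is an IDF-style value. The function names (`topic_units`, `represent`, `cosine`, `purity`) and the toy documents are hypothetical. Purity stands in for an external clustering measure.

```python
from collections import Counter
from itertools import combinations
import math


def topic_units(docs, min_df=2):
    """Assumed definition: word pairs that co-occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        # Sorting first makes each pair a canonical (smaller, larger) tuple.
        df.update(combinations(words, 2))
    return {pair: n for pair, n in df.items() if n >= min_df}


def represent(docs, min_df=2):
    """Map each document to {topic_unit: weight} with an assumed IDF-style weight."""
    units = topic_units(docs, min_df)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        words = set(doc.lower().split())
        vectors.append({
            unit: math.log(1 + n_docs / count)
            for unit, count in units.items()
            if unit[0] in words and unit[1] in words
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def purity(clusters, labels):
    """External measure: share of documents in their cluster's majority class."""
    by_cluster = {}
    for cluster, label in zip(clusters, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


if __name__ == "__main__":
    docs = [
        "stock market prices fall",
        "market prices rise on stock news",
        "football team wins match",
        "team wins football cup",
    ]
    vectors = represent(docs)
    print(cosine(vectors[0], vectors[1]))  # same-topic pair
    print(cosine(vectors[0], vectors[2]))  # cross-topic pair
    print(purity([0, 0, 1, 1], ["fin", "fin", "sport", "sport"]))
```

In this toy corpus the two finance documents share the same topic-units and the two sports documents share a different set, so any similarity-based clustering over these vectors would separate them. A real evaluation would use standard text mining datasets and both internal and external measures, as the abstract describes.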