Text Clustering via Term Semantic Units
L. Jing, Jiali Yun, Jian Yu, Houkuan Huang
2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010-08-31
DOI: 10.1109/WI-IAT.2010.23 (https://doi.org/10.1109/WI-IAT.2010.23)
How best to represent text data is an important problem in text mining tasks such as information retrieval, clustering, and classification. In this paper, we propose a compact document representation based on term semantic units, which are identified from both implicit and explicit semantic information. The implicit semantic information is extracted from syntactic content via statistical methods such as latent semantic indexing and the information bottleneck. The explicit semantic information is mined from an external semantic resource (Wikipedia). The proposed compact representation maps a document collection into a low-dimensional space whose dimensions are the term semantic units, which are far fewer than the unique terms in the collection. Experimental results on real data sets show that the compact representation effectively improves the performance of text clustering.
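To make the idea of a compact representation concrete, the following minimal sketch shows how documents over many unique terms could be re-expressed over far fewer term semantic units. It is an illustration only, not the paper's method: the term-to-unit table `TERM_TO_UNIT`, the helper `to_unit_vector`, the toy documents, and the choice to sum term counts within a unit are all assumptions. In the paper the units are identified via latent semantic indexing, the information bottleneck, and Wikipedia, none of which is reproduced here.

```python
from collections import Counter

# Hypothetical term-to-unit assignment (an assumption for illustration).
# The paper derives such units from implicit and explicit semantic
# information; this table is simply written by hand.
TERM_TO_UNIT = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "apple": "fruit",
    "banana": "fruit",
}


def to_unit_vector(tokens, term_to_unit):
    """Sum term counts per semantic unit; terms without a unit are dropped."""
    counts = Counter()
    for tok in tokens:
        unit = term_to_unit.get(tok)
        if unit is not None:
            counts[unit] += 1
    return dict(counts)


# Toy documents over five unique terms.
docs = [
    ["car", "automobile", "truck", "car"],
    ["apple", "banana", "apple"],
    ["car", "apple"],
]

# Each document is now described by at most two units instead of five terms.
unit_vectors = [to_unit_vector(d, TERM_TO_UNIT) for d in docs]
```

Under these assumptions, documents that use different terms from the same unit (for example "car" and "automobile") end up with overlapping vectors, which is the property a clustering algorithm would then exploit.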