Document clustering based on time series
L. Matei, Stefan Trausan-Matu
2015 19th International Conference on System Theory, Control and Computing (ICSTCC), 2015-10-01
DOI: 10.1109/ICSTCC.2015.7321281 (https://doi.org/10.1109/ICSTCC.2015.7321281)
Citation count: 2
Abstract
This paper presents a novel document clustering algorithm that represents each document as a time series of words. Document clustering matters because it lets us group documents according to chosen criteria, which is especially useful now that large numbers of articles are available. Representing a document as a time series rather than with the vector model lets us use a different algorithm to compute the distance between documents: dynamic time warping. This representation, together with dynamic time warping, forms the basis for computing document similarity and for clustering the documents. The clustering algorithm used is hierarchical clustering. The method is applied to the named entities and to the parts of speech of the words that compose the documents. The test data is the Reuters corpus of newspaper articles.
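The pipeline described in the abstract (sequence representation, dynamic time warping distance, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The 0/1 mismatch cost between symbols, the average-linkage merge rule, the function names `dtw_distance` and `hierarchical_clusters`, and the toy part-of-speech sequences are all assumptions made for illustration; the abstract does not specify any of them.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Dynamic time warping distance between two symbol sequences."""
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Assumed local cost: 0 for identical symbols, 1 otherwise.
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligns with b[j-1]'s predecessor step
                cost[i][j - 1],      # b[j-1] is matched again against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[len(a)][len(b)]


def hierarchical_clusters(docs, k):
    """Agglomerative clustering (assumed average linkage) down to k clusters."""
    # Pairwise DTW distances between all documents, computed once.
    dist = {
        frozenset((i, j)): dtw_distance(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Mean distance over all cross-cluster document pairs.
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the two closest clusters; x < y, so deleting y is safe.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical documents, each reduced to its sequence of part-of-speech tags.
docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"],
    ["NUM", "NUM", "PUNCT", "NUM"],
    ["NUM", "PUNCT", "NUM", "NUM"],
]
print(hierarchical_clusters(docs, 2))  # → [[0, 1], [2, 3]]
```

Because dynamic time warping lets one symbol align with several consecutive symbols in the other sequence, the two numeric sequences above have distance 0 even though their tags appear in a different order. A vector model of word counts would not capture that kind of order-aware similarity in the same way.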