A hierarchical feature decomposition clustering algorithm for unsupervised classification of document image types
Dean Curtis, V. Kubushyn, E. Yfantis, M. Rogers
Sixth International Conference on Machine Learning and Applications (ICMLA 2007), 2007-12-13
DOI: 10.1109/ICMLA.2007.13
When medical paper documents have been scanned into digital images, knowing which document types exist in the system can support vital data indexing and retrieval. When millions of document images have been scanned, it is infeasible to rely on a supervised algorithm or on tedious manual effort to discover the document types; an unsupervised algorithm is the most sensible and practical approach. Many clustering techniques have been developed for unsupervised classification, but many of them require all data to be presented at once, the number of clusters to be known in advance, or both. The algorithm presented in this paper is a two-threshold technique that relies on a hierarchical decomposition of the features. On a subset of document images, it discovered document types at an acceptable level and confidently classified unknown document images.
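
The abstract does not spell out the feature hierarchy or the threshold rule, so the following is only a rough illustration of the general idea of clustering with two thresholds when the number of clusters is not known in advance. It is a generic two-threshold sequential scheme, not the authors' algorithm. The function names, the Euclidean distance, the running-mean centroids, and the threshold values t_low and t_high are all assumptions made for the sketch.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    k = min(range(len(dists)), key=dists.__getitem__)
    return k, dists[k]


def two_threshold_cluster(points, t_low, t_high):
    """Cluster feature vectors sequentially using two distance thresholds.

    Hypothetical sketch: a point closer than t_low to a centroid joins that
    cluster, a point farther than t_high from every centroid starts a new
    cluster, and anything in between is deferred to a later pass.
    """
    centroids = []  # running mean of each cluster
    counts = []     # number of points absorbed by each cluster
    labels = [None] * len(points)
    pending = list(range(len(points)))

    def start_cluster(i):
        centroids.append(list(points[i]))
        counts.append(1)
        labels[i] = len(centroids) - 1

    while pending:
        deferred = []
        progressed = False
        for i in pending:
            if not centroids:
                start_cluster(i)
                progressed = True
                continue
            k, d = nearest(points[i], centroids)
            if d <= t_low:
                # Confident match: update the running mean of cluster k.
                counts[k] += 1
                centroids[k] = [c + (x - c) / counts[k]
                                for c, x in zip(centroids[k], points[i])]
                labels[i] = k
                progressed = True
            elif d > t_high:
                # Clearly unlike every known cluster: open a new one.
                start_cluster(i)
                progressed = True
            else:
                # Ambiguous: decide after more clusters have formed.
                deferred.append(i)
        if not progressed:
            # Nothing resolved in this pass, so force one new cluster
            # to guarantee termination.
            start_cluster(deferred.pop(0))
        pending = deferred
    return labels, centroids


def classify(point, centroids, t_low):
    """Return the cluster index for an unseen point, or None if no
    centroid lies within t_low (the point is left unclassified)."""
    k, d = nearest(point, centroids)
    return k if d <= t_low else None


# Toy 2-D feature vectors standing in for document image features.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels, centroids = two_threshold_cluster(points, t_low=1.0, t_high=3.0)
```

In this sketch the number of clusters emerges from the data and the thresholds rather than being fixed beforehand, and classify only assigns an unseen point when it falls within t_low of an existing centroid, which is one simple way to classify only when confident.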