{"title":"使用深度特征的无监督词聚类","authors":"Mandar Kulkarni, S. Karande, S. Lodha","doi":"10.1109/DAS.2016.14","DOIUrl":null,"url":null,"abstract":"Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Unsupervised Word Clustering Using Deep Features\",\"authors\":\"Mandar Kulkarni, S. Karande, S. Lodha\",\"doi\":\"10.1109/DAS.2016.14\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations.\",\"PeriodicalId\":197359,\"journal\":{\"name\":\"2016 12th IAPR Workshop on Document Analysis Systems (DAS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 12th IAPR Workshop on Document Analysis Systems (DAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DAS.2016.14\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2016.14","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations.