Bishwajit Purkaystha, T. Datta, Md Saiful Islam, Marium-E-Jannat
Title: Layered Representation of Bengali Texts in Reduced Dimension Using Deep Feedforward Neural Network for Categorization
Published in: 2018 21st International Conference of Computer and Information Technology (ICCIT), 2018-12-01
DOI: 10.1109/ICCITECHN.2018.8631935
Citations: 3
Abstract
Automatic text categorization is a primary step in information retrieval, where the most relevant documents must be found in an enormous collection. It is also useful across a wide range of web applications, from portal sites and news indexing to spam filtering and genre tagging. A significant amount of research has been carried out in this field, most of it dominated by Support Vector Machine (SVM) models. Although these models have been very successful, they require careful feature engineering to achieve optimum results. In this paper, we propose a model for Bengali text categorization that does not require feature engineering and is able to capture nonlinearity in data. We first find a lower-dimensional representation of each document's tf-idf vector using denoising autoencoders, and then feed this transformed vector into a deep feedforward network to find its most plausible category. We also show empirically that our model achieves 94.05% accuracy on 12 categories, surpassing the best existing models for Bengali text categorization.
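
The first stage described in the abstract, compressing tf-idf vectors with a denoising autoencoder, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy count matrix, the smoothed idf formula, the masking-noise rate, the single tanh hidden layer, and the names `tfidf` and `train_denoising_autoencoder` are all assumptions made for the example. The deep feedforward classifier that would consume the resulting codes is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def tfidf(counts):
    """Convert a (docs x terms) count matrix into L2-normalised tf-idf rows."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    # Smoothed idf; an assumed variant, the abstract does not specify one.
    idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1
    x = tf * idf
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def train_denoising_autoencoder(x, hidden, epochs=200, lr=0.5, drop=0.3):
    """Train a one-hidden-layer autoencoder to rebuild x from a masked copy."""
    n, d = x.shape
    w1 = rng.normal(0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.1, (hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        noisy = x * (rng.random(x.shape) > drop)  # masking noise: zero ~30% of entries
        h = np.tanh(noisy @ w1 + b1)              # encoder
        out = h @ w2 + b2                         # linear decoder
        err = out - x                             # target is the clean input
        losses.append(float((err ** 2).mean()))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2 * err / err.size
        g_h = (g_out @ w2.T) * (1 - h ** 2)
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (noisy.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)

    def encode(v):
        """Map clean tf-idf vectors to the learned low-dimensional codes."""
        return np.tanh(v @ w1 + b1)

    return encode, losses

# Made-up term counts for six documents over an eight-term vocabulary.
counts = np.array([
    [3, 2, 1, 0, 0, 0, 1, 0],
    [2, 3, 0, 1, 0, 0, 0, 1],
    [4, 1, 2, 0, 0, 1, 0, 0],
    [0, 0, 0, 3, 2, 2, 1, 0],
    [0, 1, 0, 2, 3, 1, 0, 1],
    [0, 0, 1, 2, 2, 4, 0, 0],
])
x = tfidf(counts)
encode, losses = train_denoising_autoencoder(x, hidden=3)
codes = encode(x)  # 8-dimensional tf-idf vectors reduced to 3 dimensions
```

In a fuller pipeline, `codes` (rather than the raw tf-idf vectors) would be the input to the feedforward classifier, and several such autoencoders could be stacked to reduce the dimension layer by layer.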