Layered Representation of Bengali Texts in Reduced Dimension Using Deep Feedforward Neural Network for Categorization

Bishwajit Purkaystha, T. Datta, Md Saiful Islam, Marium-E-Jannat
Published in: 2018 21st International Conference of Computer and Information Technology (ICCIT)
Publication date: 2018-12-01
DOI: 10.1109/ICCITECHN.2018.8631935
Citations: 3

Abstract

Automatic text categorization is a primary step in information retrieval, where the most relevant documents must be found in an enormous collection. It is also useful across a wide range of web applications, from portal sites and news indexing to spam filtering and genre tagging. A significant amount of research has been carried out in this field, and it is mostly dominated by Support Vector Machine (SVM) models. Although these models have been very successful, they require careful feature engineering to achieve optimal results. In this paper, we propose a model for Bengali text categorization that does not require feature engineering and is able to capture nonlinearity in the data. We first find a lower-dimensional representation of the tf-idf vector of each document using denoising autoencoders, and then feed this transformed vector into a deep feedforward network to find its most plausible category. We also show empirically that our model achieves 94.05% accuracy on 12 categories, which surpasses the best existing models for Bengali text categorization.
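
The first stage described in the abstract (tf-idf vectors compressed by a denoising autoencoder) can be pictured with the minimal sketch below, written in plain Python. It is not the authors' implementation. The toy English tokens, the tf-idf variant (tf times ln(N/df) + 1), the masking noise, the single hidden layer of size 3, the learning rate, and the names `tfidf` and `DenoisingAutoencoder` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random

def tfidf(docs):
    """Tf-idf rows for tokenized docs; idf = ln(N/df) + 1 is an assumed variant."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    return [[d.count(w) / len(d) * (math.log(n / df[w]) + 1.0) for w in vocab]
            for d in docs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class DenoisingAutoencoder:
    """One sigmoid hidden layer, linear output, squared-error loss (assumed design)."""

    def __init__(self, n_in, n_hidden, rng):
        self.W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        # Reduced-dimension code: this is what a classifier would consume.
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sum(w * hk for w, hk in zip(row, h)) + b
                for row, b in zip(self.W2, self.b2)]

    def loss(self, data):
        # Mean squared reconstruction error, measured on clean inputs.
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yj - xj) ** 2 for yj, xj in zip(y, x))
        return total / len(data)

    def train_step(self, x, rng, mask_prob=0.3, lr=0.1):
        # Corrupt the input by zeroing random entries, but reconstruct the clean x.
        noisy = [0.0 if rng.random() < mask_prob else xi for xi in x]
        h = self.encode(noisy)
        y = self.decode(h)
        dy = [yj - xj for yj, xj in zip(y, x)]
        # Backpropagate through the decoder before its weights are updated.
        dh = [sum(self.W2[j][k] * dy[j] for j in range(len(dy)))
              for k in range(len(h))]
        dz = [dhk * hk * (1.0 - hk) for dhk, hk in zip(dh, h)]
        for j, dyj in enumerate(dy):
            for k, hk in enumerate(h):
                self.W2[j][k] -= lr * dyj * hk
            self.b2[j] -= lr * dyj
        for k, dzk in enumerate(dz):
            for i, xi in enumerate(noisy):
                self.W1[k][i] -= lr * dzk * xi
            self.b1[k] -= lr * dzk

# Hypothetical toy corpus standing in for tokenized Bengali documents.
docs = [
    ["cricket", "match", "team", "win"],
    ["team", "score", "match", "goal"],
    ["election", "vote", "party", "minister"],
    ["party", "vote", "parliament", "election"],
]

rng = random.Random(0)
X = tfidf(docs)
dae = DenoisingAutoencoder(n_in=len(X[0]), n_hidden=3, rng=rng)

before = dae.loss(X)
for _ in range(300):
    for x in X:
        dae.train_step(x, rng)
after = dae.loss(X)

# Low-dimensional representations, one per document.
codes = [dae.encode(x) for x in X]
```

Here `codes` holds the 3-dimensional representations that a downstream feedforward classifier would take as input; that classifier stage is omitted for brevity, and a realistic setup would use a much larger vocabulary, more hidden units, and a numerical library rather than nested lists.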