Multi-Label Bengali Article Classification Using the ML-KNN Algorithm and a Neural Network

Wahiduzzaman Akanda, A. Uddin
Published in: 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)
Publication date: 2021-02-27
DOI: https://doi.org/10.1109/ICICT4SD50815.2021.9396882
Citations: 3

Abstract

Multi-label classification is a complex and important task in natural language processing and text mining, and Bengali has limited resources to work with. The goal of this research is to overcome these constraints and provide a standard solution for multi-label classification of Bengali text. Bengali newspaper portals can use the output of this research to improve their recommendation systems and to reduce the manual labor of document tagging. We use a large dataset of 416,289 news articles with 4,302 unique labels, collected from Prothom Alo, one of the most popular Bengali newspapers in Bangladesh. The articles span seven years (2013 to 2019) and fall into six categories: Sports, Technology, Economy, Entertainment, International, and State. With this dataset we build a supervised model using the ML-KNN algorithm and a neural network. For the word embedding feature, we use Count Vectorizer. We also briefly discuss how parameters such as words per document and labels per category affect the results.
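To make the combination of count features and ML-KNN concrete, here is a minimal pure-Python sketch. It is not the authors' code: the helper names (`count_vector`, `nearest`, `MLkNN`), the values `k=2` and `s=1.0`, and the toy English tokens standing in for Bengali words are all assumptions made for illustration. The classifier follows the usual ML-KNN formulation: for each label, a smoothed prior is combined with the likelihood of seeing `j` of the `k` nearest neighbours carrying that label, and the label is assigned when the "present" score exceeds the "absent" score.

```python
import math
from collections import Counter


def count_vector(text, vocab):
    """Bag-of-words counts for whitespace tokens; words outside vocab are ignored."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]


def nearest(x, X, k, skip=None):
    """Indices of the k training vectors closest to x (Euclidean distance)."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, X[i])))
    candidates = [i for i in range(len(X)) if i != skip]
    return sorted(candidates, key=dist)[:k]


class MLkNN:
    """Illustrative ML-KNN with smoothing parameter s (hypothetical defaults)."""

    def __init__(self, k=2, s=1.0):
        self.k, self.s = k, s

    def fit(self, X, Y):
        k, s, m, q = self.k, self.s, len(X), len(Y[0])
        self.X, self.Y = X, Y
        # Smoothed prior probability that each label is present.
        self.prior = [(s + sum(y[l] for y in Y)) / (2 * s + m) for l in range(q)]
        # has[l][j] / lacks[l][j]: training documents with / without label l
        # for which exactly j of their k neighbours carry label l.
        has = [[0] * (k + 1) for _ in range(q)]
        lacks = [[0] * (k + 1) for _ in range(q)]
        for i in range(m):
            nn = nearest(X[i], X, k, skip=i)
            for l in range(q):
                j = sum(Y[n][l] for n in nn)
                (has if Y[i][l] else lacks)[l][j] += 1
        norm = s * (k + 1)
        self.like1 = [[(s + c) / (norm + sum(row)) for c in row] for row in has]
        self.like0 = [[(s + c) / (norm + sum(row)) for c in row] for row in lacks]
        return self

    def predict(self, x):
        """Return a 0/1 list with one entry per label."""
        nn = nearest(x, self.X, self.k)
        labels = []
        for l, p1 in enumerate(self.prior):
            j = sum(self.Y[n][l] for n in nn)
            present = p1 * self.like1[l][j]
            absent = (1 - p1) * self.like0[l][j]
            labels.append(1 if present > absent else 0)
        return labels


# Toy corpus (invented for this sketch); label order is [sports, economy].
docs = [
    "cricket match win",
    "cricket team match",
    "cricket match score",
    "bank budget tax",
    "bank tax rate",
    "budget tax bank rate",
    "cricket match budget tax",
]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 1]]

vocab = sorted({w for d in docs for w in d.split()})
X = [count_vector(d, vocab) for d in docs]
model = MLkNN(k=2).fit(X, Y)

print(model.predict(count_vector("cricket match today", vocab)))
```

This sketch scans every training document for each query, which would not scale to a corpus of 416,289 articles; a practical setup would more likely rely on a library vectorizer and an indexed or approximate neighbour search.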