Multi-Label Bengali article classification using ML-KNN algorithm and Neural Network

2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD) · Pub Date: 2021-02-27 · DOI: 10.1109/ICICT4SD50815.2021.9396882

Wahiduzzaman Akanda, A. Uddin

Citations: 3

Abstract

Multi-label classification is a complex and important task in Natural Language Processing and Text Mining, and Bengali has limited resources to work with. The goal of this research is to overcome these constraints and provide a standard solution to this problem for Bengali text. The output of this research can be used by Bengali newspaper portals to improve their recommendation systems and to reduce the manual labor of document tagging. In this work, we use a large dataset that contains 416,289 news articles and 4,302 unique labels. The articles were collected from Prothom Alo, one of the most popular Bengali newspapers in Bangladesh, and span seven years (2013 to 2019). They are grouped into six categories: Sports, Technology, Economy, Entertainment, International, and State. This dataset lets us build supervised models using the ML-KNN algorithm and a Neural Network. To turn the text into features, we use Count Vectorizer. We also briefly discuss how parameters such as words per document and labels per category affect the results.
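The abstract names the pipeline (count-based text features fed to ML-KNN) but gives no implementation details. The following is a minimal sketch of that idea, not the authors' code: it follows the standard ML-KNN formulation (Bayesian label decisions from neighbour label counts, with Laplace smoothing s) using only the Python standard library. The whitespace tokenizer, the Euclidean distance, the values k=2 and s=1.0, the toy English documents, the label names, and every function and class name here are assumptions made for illustration.

```python
from collections import Counter
import math


def count_vectorize(text):
    """Bag-of-words term counts for one whitespace-tokenised document."""
    return Counter(text.lower().split())


def distance(a, b):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))


class MLKNN:
    """Sketch of ML-KNN: one Bayesian decision per label from neighbour counts."""

    def __init__(self, k=2, s=1.0):
        self.k, self.s = k, s  # k neighbours, smoothing constant s

    def _neighbours(self, x, skip=None):
        # Indices of the k closest training documents (optionally excluding one).
        order = sorted((i for i in range(len(self.X)) if i != skip),
                       key=lambda i: distance(x, self.X[i]))
        return order[:self.k]

    def fit(self, X, Y):
        # X: list of count vectors, Y: list of label sets (one per document).
        self.X, self.Y = X, Y
        self.labels = sorted(set().union(*Y))
        n, k, s = len(X), self.k, self.s
        nbrs = [self._neighbours(X[i], skip=i) for i in range(n)]
        self.prior, self.like1, self.like0 = {}, {}, {}
        for l in self.labels:
            # Smoothed prior probability that a document carries label l.
            self.prior[l] = (s + sum(l in y for y in Y)) / (2 * s + n)
            # c1[j] / c0[j]: documents with / without l that have exactly
            # j neighbours carrying l.
            c1, c0 = [0] * (k + 1), [0] * (k + 1)
            for i in range(n):
                delta = sum(l in Y[j] for j in nbrs[i])
                (c1 if l in Y[i] else c0)[delta] += 1
            self.like1[l] = [(s + c) / (s * (k + 1) + sum(c1)) for c in c1]
            self.like0[l] = [(s + c) / (s * (k + 1) + sum(c0)) for c in c0]
        return self

    def predict(self, x):
        # Assign label l when its posterior for "has l" beats "does not have l".
        nbrs = self._neighbours(x)
        out = set()
        for l in self.labels:
            delta = sum(l in self.Y[j] for j in nbrs)
            p1 = self.prior[l] * self.like1[l][delta]
            p0 = (1 - self.prior[l]) * self.like0[l][delta]
            if p1 > p0:
                out.add(l)
        return out


# Toy corpus (hypothetical English stand-ins for tagged news articles).
docs = [
    ("cricket match win team", {"sports"}),
    ("cricket team score match", {"sports"}),
    ("team win cricket score", {"sports"}),
    ("bank budget tax rate", {"economy"}),
    ("budget tax bank loan", {"economy"}),
    ("bank loan rate budget", {"economy"}),
    ("cricket team budget tax", {"sports", "economy"}),
    ("cricket match bank loan", {"sports", "economy"}),
]
model = MLKNN(k=2).fit([count_vectorize(t) for t, _ in docs],
                       [labels for _, labels in docs])

for query in ("cricket team win", "bank loan tax", "cricket team budget tax"):
    print(query, "->", sorted(model.predict(count_vectorize(query))))
```

A brute-force neighbour search like this is only workable for a toy corpus; at the scale described in the abstract (hundreds of thousands of articles, thousands of labels), a sparse-matrix vectorizer and an indexed or approximate neighbour search would be needed.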
