Fahim Quadery, A. Maruf, Tamjid Ahmed, Md Saiful Islam
2016 3rd International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), published 2016-09-01. DOI: 10.1109/CEEICT.2016.7873040
Semi-supervised keyword-based Bengali document categorization
Document categorization has been an important research area over the last couple of decades. The basic task is to classify a given document into one of a set of predefined classes. Bengali is among the ten most spoken languages in the world, with more than 200 million speakers, yet Bengali document categorization has received little research attention. In the first phase of this work, we design a model that extracts keywords from a Bengali document. We crawled over 35,000 news documents from popular Bengali newspapers and journals. The documents are stemmed, and less significant words are removed using a stemmer and a part-of-speech (POS) tagger. A statistical approach is then used to extract keywords from the documents. Finally, a probabilistic distribution and semi-supervised learning with the Naïve Bayes algorithm are used to approximate the category of a given Bengali document. Results and statistical data show the effectiveness of this model.
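
The abstract does not say which semi-supervised scheme is paired with Naïve Bayes, so the following is only a minimal sketch under stated assumptions: it uses self-training (one common choice, not necessarily the authors'), a multinomial Naïve Bayes model with add-one smoothing, and documents that have already been reduced to keyword lists. The function names, the confidence threshold, and the transliterated toy keywords are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on keyword lists."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_proba(model, words):
    """Return {label: posterior probability} with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            if w in vocab:  # keywords never seen in training are ignored
                count = word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {l: v / z for l, v in exp_scores.items()}


def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.7, max_rounds=5):
    """Assumed scheme: pseudo-label confident unlabeled documents, then retrain."""
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = train_nb(docs, labs)
    for _ in range(max_rounds):
        remaining = []
        for words in pool:
            probs = predict_proba(model, words)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                docs.append(words)
                labs.append(best)
            else:
                remaining.append(words)
        if len(remaining) == len(pool):  # nothing new was labeled
            break
        pool = remaining
        model = train_nb(docs, labs)
    return model


if __name__ == "__main__":
    # Toy, transliterated keywords; real input would be stemmed Bengali tokens.
    labeled = [["cricket", "match", "run"], ["election", "vote", "minister"]]
    labels = ["sports", "politics"]
    unlabeled = [["cricket", "wicket", "match"], ["vote", "parliament", "election"]]
    model = self_train(labeled, labels, unlabeled)
    print(predict_proba(model, ["wicket"]))
```

In this sketch the keyword "wicket" appears only in an unlabeled document, so the model can associate it with a category only after that document has been pseudo-labeled. That is the benefit semi-supervised training is meant to provide when labeled data is scarce.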