New Public Dataset for Classification of Inappropriate Comments in Slovak language

Authors: Ján Mojžiš, M. Kvassay
Published in: 2022 20th International Conference on Emerging eLearning Technologies and Applications (ICETA), 2022-10-20
DOI: 10.1109/ICETA57911.2022.9974852

Discussion sections under online news articles let discussants exchange opinions and information, and they also inform visitors and readers. Discussion moderators are tasked with maintaining the quality of discussions and preventing the publication of harmful content (vulgarity, hate speech, abuse, etc.). Moderation should ideally be predictable and based on clear, predetermined rules of conduct, so that participants perceive neither rigid censorship nor passive toleration of inappropriate content. Manual moderation is time-consuming and unpopular among newspaper staff, especially when the number of comments is large. For these reasons, we propose attributes that help detect inappropriate content automatically and apply machine learning to discussion comments. Our aims are, first, to guarantee a degree of predictability and uniformity in the classification of inappropriate comments and, second, to offer a list of machine learning models suited to the task. For evaluation, we collected an entirely new dataset of 2,283 inappropriate comments (marked as such by human moderators) and 10,000 ordinary comments in the Slovak language, drawn from past discussions on a major Slovak online news portal. We offer this dataset freely to other researchers for developing and testing their own algorithms. As an illustrative example, we classified inappropriate comments with a macro F-measure of 67.3% using various statistical attributes, which we consider an encouraging result given that Slovak is a highly inflected synthetic language.
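
To put the reported macro F-measure in context, the sketch below computes it on the class sizes given in the abstract. It assumes the common definition (the unweighted mean of per-class F1 scores); the abstract does not state the exact formula the authors used. The function name `macro_f1` and the "always ordinary" baseline are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Class sizes from the abstract: 2,283 inappropriate, 10,000 ordinary comments.
y_true = ["inappropriate"] * 2283 + ["ordinary"] * 10000

# Hypothetical baseline that labels every comment as ordinary.
y_pred = ["ordinary"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline_macro_f1 = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={baseline_macro_f1:.3f}")
```

Because the classes are imbalanced, this trivial baseline reaches high accuracy while its macro F-measure stays below 0.5, since the inappropriate class contributes an F1 of zero. This is one reason a macro-averaged score is a more informative yardstick than accuracy for this dataset.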