New Public Dataset for Classification of Inappropriate Comments in Slovak Language

2022 20th International Conference on Emerging eLearning Technologies and Applications (ICETA). Pub Date: 2022-10-20. DOI: 10.1109/ICETA57911.2022.9974852

Ján Mojžiš, M. Kvassay


Citations: 0

Abstract


Discussion sections under online news articles enable the exchange of opinions and information between discussants and the provision of information to potential visitors and readers. Discussion moderators have the task of guarding the level of discussions and preventing the publication of harmful content (vulgarity, hate speech, abuse, etc.). The process of moderating discussions should ideally be predictable and based on clear, predetermined rules of conduct. Discussion participants should not have the feeling of either rigid censorship or passive toleration of inappropriate content. Manual moderation is time-consuming and not particularly popular among newspaper workers, especially when there is a large number of comments. For the above reasons, we decided to propose attributes that would help to detect inappropriate content automatically and to apply machine learning to discussion comments. Our aims are, first, to guarantee a certain degree of predictability and uniformity in the classification of inappropriate comments and, second, to offer a list of machine learning models that are appropriate for the task. For evaluation purposes, we collected an entirely new dataset with 2,283 inappropriate comments (marked as such by human moderators) and 10,000 ordinary comments in the Slovak language from past discussions on a major Slovak online news portal. We offer this dataset freely to other researchers to develop and test their own algorithms. As an illustrative example, we were able to classify inappropriate comments with a macro F-measure of 67.3% on the basis of various statistical attributes, which we consider an encouraging result given that Slovak is a highly inflected synthetic language.
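The abstract reports a macro F-measure rather than accuracy, which matters because the two classes are unbalanced (2,283 inappropriate versus 10,000 ordinary comments). The following is a minimal sketch of how a macro F-measure can be computed, assuming it denotes the unweighted mean of per-class F1 scores; the abstract does not spell out the exact definition the authors used. The function name `macro_f1`, the label strings, and the toy predictions are hypothetical and are not taken from the paper or its dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed definition of macro F-measure)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy example: 2 inappropriate and 8 ordinary comments.
toy_true = ["inappropriate"] * 2 + ["ordinary"] * 8
toy_pred = ["inappropriate", "ordinary"] + ["inappropriate"] + ["ordinary"] * 7
toy_score = macro_f1(toy_true, toy_pred)  # (0.5 + 0.875) / 2 = 0.6875

# Majority-class baseline on the class sizes stated in the abstract:
# labelling every comment "ordinary" gives high accuracy but a low macro F1.
base_true = ["inappropriate"] * 2283 + ["ordinary"] * 10000
base_pred = ["ordinary"] * len(base_true)
baseline_accuracy = sum(t == p for t, p in zip(base_true, base_pred)) / len(base_true)
baseline_macro_f1 = macro_f1(base_true, base_pred)
```

Under this assumed definition, a classifier that labels every comment as ordinary reaches roughly 81% accuracy on the stated class sizes but a macro F1 of only about 0.45, because the inappropriate class contributes an F1 of zero. This baseline is an illustration derived from the class counts alone, not a result reported by the authors.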
