Application of Machine Learning Algorithms for Processing Comments from the YouTube Video Hosting under Training Videos

Science and Transport Progress. Bulletin of Dnipropetrovsk National University of Railway Transport. Pub Date: 2021-04-08. DOI: 10.15802/stp2020/225264

L. Koriashkina, H. V. Symonets


Citations: 1

Abstract


APPLICATION OF MACHINE LEARNING ALGORITHMS FOR PROCESSING COMMENTS FROM THE YOUTUBE VIDEO HOSTING UNDER TRAINING VIDEOS

Purpose. To detect toxic comments under training videos on the YouTube video hosting service by classifying unstructured text with a combination of machine learning methods.

Methodology. The text data were cleaned, normalized, and converted into a form suitable for machine processing. To classify comments as “toxic”, we used a logistic regression classifier, a linear support vector classifier trained both without and with stochastic gradient descent, a random forest classifier, and a gradient boosting classifier. The classifiers were assessed using the confusion matrix, accuracy, recall, and F-measure; cross-validation was used for a more general assessment. The work was implemented in the Python programming language.

Findings. Based on the assessment metrics, the best-performing method was the linear support vector machine (Linear SVM), trained both without and with stochastic gradient descent. The described techniques can be used to analyze text comments under any training videos to detect toxic reviews. The approach can also be useful for identifying unwanted or even aggressive content on social networks or on services where reviews are provided.

Originality. The work combines methods for preprocessing a specific type of text, taking into account features such as timecodes, emoji, and links, and adapts machine learning classification methods to the analysis of Russian-language comments.

Practical value. The approach simplifies the comment analysis process. The need for such processing stems from the growing volume of text data, especially in education, given quarantine conditions and the transition to distance learning. The volume of educational Internet content already calls for automated processing and analysis of feedback, and this need will only grow over time.
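The abstract mentions cleaning comments that may contain timecodes, emoji, and links, but does not give the exact rules. The following is a minimal sketch of what such a step might look like, using only Python's standard library. The function name `clean_comment`, the regular expressions, and the order of the steps are assumptions made for illustration and are not taken from the paper; the emoji ranges in particular are not exhaustive.

```python
import re

# Hypothetical patterns: the paper does not publish its cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Timecodes such as 3:45, 12:34 or 1:02:03.
TIMECODE_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
# A few common emoji code-point ranges plus joiner/variation characters.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF"
    "\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# Anything that is neither a word character nor whitespace.
# \w is Unicode-aware for str patterns, so Cyrillic letters are kept.
NON_WORD_RE = re.compile(r"[^\w\s]")


def clean_comment(text: str) -> str:
    """Lowercase a comment and strip links, timecodes, emoji and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove links first, while ':' and '/' are intact
    text = TIMECODE_RE.sub(" ", text)   # then timecodes, before punctuation is dropped
    text = EMOJI_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split())       # collapse runs of whitespace


if __name__ == "__main__":
    print(clean_comment("Смотри 12:34 https://youtu.be/abc \U0001F600 Отлично!!!"))
```

Links and timecodes are removed before punctuation so that the colons and slashes they rely on are still present when the patterns run. The cleaned strings could then be passed to a vectorizer and to any of the classifiers listed in the Methodology.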


Source journal

Science and Transport Progress. Bulletin of Dnipropetrovsk National University of Railway Transport

Self-citation rate

0.00%

Articles published