YouTube Universe of Comments: A Machine Learning approach for systematic classification of YouTube Comments on custom prepared dataset

2023 World Conference on Communication & Computing (WCONF). Pub Date: 2023-07-14. DOI: 10.1109/WCONF58270.2023.10235049

Sankalp Naik, Ashay Katre

Citations: 0

Abstract

YouTube can now be regarded as a cloud service, given the volume of data it adds every second and the enormous amount it stores in its data farms. It does not delete old content and relies on redundant storage. The platform could be more sustainable and cost-efficient if it discarded redundant data, a major portion of which consists of spam comments and offensive or abusive comments. In this paper, we use several machine learning models to reduce such comments and, eventually, to move towards a more efficient storage model. We first address dataset preparation by designing a comprehensive annotation scheme that considers dimensions such as sentiment, topic, toxicity, and engagement. Leveraging this annotated dataset, we develop a robust machine learning framework that combines state-of-the-art natural language processing techniques with advanced classification algorithms. Our methodology involves several stages, including preprocessing, feature extraction, and model training. We employ sentiment analysis and toxicity detection to capture the sentiment and the abusive nature of comments, respectively. We also introduce the notion of gravity for comments, which acts as a reward mechanism for them. To evaluate our approach, we conduct extensive experiments on a large-scale YouTube comments dataset. We compare how accurately various classification algorithms, including support vector machines, random forests, and deep learning models, categorize comments under our predefined annotation scheme. Additionally, we assess the generalizability of our model through cross-domain experiments on different genres of YouTube videos. Overall, our work contributes to the understanding and management of the YouTube comment ecosystem and shows how machine learning techniques can systematically classify and analyze comments on this popular platform.
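
The three stages named in the abstract (preprocessing, feature extraction, model training) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the dataset, the label set, or the model configurations, so the toy comments, the `ok`/`spam` labels, and all function names below are hypothetical. A small multinomial Naive Bayes classifier stands in for the support vector machines, random forests, and deep learning models that the paper compares, only to keep the example dependency-free.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy data; the paper's dataset and labels are not in the abstract.
TRAIN = [
    ("Great tutorial, thanks for explaining this so clearly", "ok"),
    ("Loved the editing and the soundtrack in this video", "ok"),
    ("Really helpful video, thanks a lot", "ok"),
    ("Click here to win a free phone now", "spam"),
    ("Subscribe to my channel for free gift cards", "spam"),
    ("Win free money now, click the link", "spam"),
]


def preprocess(text):
    """Stage 1: lowercase, drop URLs, and split into alphabetic tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def extract_features(tokens):
    """Stage 2: bag-of-words features (token -> count)."""
    return Counter(tokens)


def train(examples):
    """Stage 3: fit a multinomial Naive Bayes model (stand-in classifier)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        feats = extract_features(preprocess(text))
        label_counts[label] += 1
        word_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for a comment."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    feats = extract_features(preprocess(text))
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word, count in feats.items():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "Click now to win free gift cards"))
print(predict(model, "Thanks, this tutorial was really helpful"))
```

On this toy data the first comment is classified as `spam` and the second as `ok`. A realistic setup would replace the bag-of-words counts and the stand-in classifier with the feature extractors and models the paper evaluates, and would add the other annotation dimensions (sentiment, topic, toxicity, engagement) as separate targets.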
