Extracting and Visualizing User Engagement on Wikipedia Talk Pages

Proceedings of the 17th International Symposium on Open Collaboration, Pub Date: 2021-09-15, DOI: 10.1145/3479986.3479995

Carlin MacKenzie, J. R. Hott


Citations: 0

Abstract

As Wikipedia has grown in popularity, it is important to investigate its diverse user community and collaborative editorial base. Although all user data, from traffic to user edits, are available for download under a free and open license, the scale of this data makes it difficult to work with. In this paper, we demonstrate how consumer hardware can be used to create a local database of Wikipedia’s full edit history from its public XML data dumps. Using this database, we create and present the first visualizations of how editing on talk pages differs between user groups. Our visualizations demonstrate that low-quality edits are primarily performed by IP users, rather than blocked users, and that overall engagement with talk pages has plateaued over the last 10 years across all user groups. Finally, we investigate the feasibility of classifying blocked users using this dataset as an example of future research directions. However, we demonstrate the difficulty of this task and find that additional data or a more advanced model would be needed to classify them, as our approach did not provide sufficient information to do this. We anticipate that our visualizations and data extraction process are of interest to the community and will provide researchers with the tools needed to use Wikipedia’s valuable data when resources are limited.
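The extraction step described in the abstract (public XML dump to local database) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name a database engine or parser, so SQLite and Python's streaming `iterparse` are assumptions chosen for illustration, and the table name `talk_edits`, the function names, and the tiny inline sample are hypothetical. The sketch relies only on the MediaWiki export layout, in which talk pages carry namespace 1 and an anonymous contributor is recorded with an `<ip>` element instead of a `<username>`.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature dump in the MediaWiki export layout (real dumps are far larger).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Talk:Example</title><ns>1</ns><id>10</id>
    <revision>
      <id>100</id><timestamp>2020-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
    </revision>
    <revision>
      <id>101</id><timestamp>2020-01-02T00:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
  <page>
    <title>Example</title><ns>0</ns><id>11</id>
    <revision>
      <id>102</id><timestamp>2020-01-03T00:00:00Z</timestamp>
      <contributor><username>Bob</username><id>8</id></contributor>
    </revision>
  </page>
</mediawiki>"""

TALK_NS = "1"  # namespace 1 holds article talk pages


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_revision(title, rev):
    """Turn one <revision> element into a row for the talk_edits table."""
    rev_id = timestamp = user = None
    is_ip = 0
    for child in rev:
        name = local(child.tag)
        if name == "id":
            rev_id = int(child.text)
        elif name == "timestamp":
            timestamp = child.text
        elif name == "contributor":
            for part in child:
                if local(part.tag) == "ip":
                    user, is_ip = part.text, 1  # anonymous (IP) editor
                elif local(part.tag) == "username":
                    user = part.text
    return (title, rev_id, timestamp, user, is_ip)


def iter_talk_revisions(stream):
    """Stream the dump page by page, yielding rows for talk-page revisions only."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = None
        revisions = []
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "revision":
                revisions.append(child)
        if ns == TALK_NS:
            for rev in revisions:
                yield parse_revision(title, rev)
        elem.clear()  # release the finished page's children to limit memory use


def build_db(stream):
    """Load talk-page revisions into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE talk_edits "
        "(page TEXT, rev_id INTEGER PRIMARY KEY, ts TEXT, user TEXT, is_ip INTEGER)"
    )
    conn.executemany(
        "INSERT INTO talk_edits VALUES (?, ?, ?, ?, ?)", iter_talk_revisions(stream)
    )
    conn.commit()
    return conn


conn = build_db(io.BytesIO(SAMPLE.encode("utf-8")))
# Edits per user group: key 0 = registered account, key 1 = IP user.
counts = dict(conn.execute("SELECT is_ip, COUNT(*) FROM talk_edits GROUP BY is_ip"))
```

Grouping by `is_ip` is the simplest version of the per-user-group comparison the abstract mentions. Identifying blocked users would need additional data that is not present in the revision records shown here.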
