A DATA-DRIVEN STUDY OF CITIZEN SCIENCE DATA QUALITY ASSESSMENT PROFILE

Proceedings of the International Conferences on WWW/Internet 2021 and Applied Computing 2021. Pub Date: 2021-10-13. DOI: 10.33965/icwi_ac2021_202109l012

J. N. Leocadio, A. Saraiva



Abstract

In Citizen Science (CS) projects, data quality (DQ) has been a major concern, and there has been much discussion of how to evaluate and ensure the quality of what volunteers produce. However, few studies have assessed how volunteers get involved and how their behavior affects data quality. This study aimed to develop a data-driven CS profile for data quality assessment. We analyzed citizen science data extracted from iNaturalist, a platform for recording species observations. We used 58,488 observations recorded in São Paulo, Brazil, and Manchester, England, to train Random Forest machine learning models and to create a DQ profile that classifies data according to its quality. Our approach first identifies information elements (IE) and quality dimensions that describe the data and users’ behavior. The data were then cleaned, pre-processed, and transformed. Three models were created: a complete model (with all features), a reduced model (with dimension reduction), and a model with only the characteristics that describe users’ behavior. The precision scores of the models were 0.931, 0.932, and 0.774, respectively. The results showed that data quality can be described with few features and that user behavior is very important for understanding the quality of what volunteers produce.
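
The three-model comparison described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: it assumes scikit-learn, uses synthetic data in place of the iNaturalist observations, treats quality as a binary label, and uses PCA as a stand-in for the dimension-reduction step, which the abstract does not specify. The function name `compare_feature_sets` and the `behavior_cols` parameter are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split


def compare_feature_sets(X, y, behavior_cols, n_components=3, seed=0):
    """Train one Random Forest per feature set and return test-set precision."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Assumption: PCA stands in for the unspecified "dimension reduction" step.
    pca = PCA(n_components=n_components, random_state=seed).fit(X_train)
    feature_sets = {
        "complete": (X_train, X_test),
        "reduced": (pca.transform(X_train), pca.transform(X_test)),
        "behavior_only": (X_train[:, behavior_cols], X_test[:, behavior_cols]),
    }
    scores = {}
    for name, (train, test) in feature_sets.items():
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(train, y_train)
        predictions = model.predict(test)
        scores[name] = precision_score(y_test, predictions, zero_division=0)
    return scores


# Synthetic stand-in data (assumption): 8 features, of which columns 0 and 1
# play the role of "user behavior" features and drive the quality label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

scores = compare_feature_sets(X, y, behavior_cols=[0, 1])
for name, value in scores.items():
    print(f"{name}: precision = {value:.3f}")
```

The scores printed on synthetic data say nothing about the reported values of 0.931, 0.932, and 0.774; the sketch only shows how one classifier type can be evaluated on three feature sets with the same train/test split.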


