A DATA-DRIVEN STUDY OF CITIZEN SCIENCE DATA QUALITY ASSESSMENT PROFILE

J. N. Leocadio, A. Saraiva
Proceedings of the International Conferences on WWW/Internet 2021 and Applied Computing 2021
Published: 2021-10-13
DOI: https://doi.org/10.33965/icwi_ac2021_202109l012

Abstract

In Citizen Science (CS) projects, data quality (DQ) has been a major concern, and there has been much discussion of how to evaluate and ensure the quality of what volunteers produce. However, few studies have assessed how volunteers get involved and how their behavior affects data quality. This study aimed to develop a data-driven CS profile for data quality assessment. We analyzed citizen science data extracted from iNaturalist, a platform for recording species observations. We used 58,488 observations recorded in São Paulo, Brazil, and Manchester, England, to train machine learning models with Random Forest and to create a DQ profile that classifies data according to its quality. Our approach first identifies information elements (IE) and quality dimensions that describe the data and users’ behavior. The data were then cleaned, pre-processed, and transformed. Three models were created: a complete model (with all features), a reduced model (with dimension reduction), and a model with only the features that describe users’ behavior. The precision scores of the models were 0.931, 0.932, and 0.774, respectively. The results showed that data quality can be described with few features and that user behavior is very important for understanding the quality of what volunteers produce.
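The precision scores reported above can be read as the share of observations assigned to a quality class that truly belong to it. The following is a minimal sketch of that calculation for a single class. The function name, the "high"/"low" labels, and the toy data are assumptions made for illustration and are not taken from the study. The abstract also does not say whether the reported scores are per-class or averaged across classes.

```python
def precision(y_true, y_pred, positive="high"):
    """Return the fraction of items predicted `positive` that truly are `positive`."""
    # Count predictions of the positive class that match the true label.
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    # Count all predictions of the positive class, correct or not.
    pred_pos = sum(1 for p in y_pred if p == positive)
    if pred_pos == 0:
        # No positive predictions: report 0.0 rather than divide by zero.
        return 0.0
    return true_pos / pred_pos


# Toy labels invented for illustration only (not data from the study).
y_true = ["high", "high", "low", "high", "low", "low"]
y_pred = ["high", "high", "high", "low", "low", "high"]

score = precision(y_true, y_pred)
print(score)
```

In this toy case, four observations are predicted "high" and two of them are truly "high", so the precision is 0.5. A score such as 0.931 would mean that roughly 93 of every 100 observations assigned to a class actually belong to it.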