A Data-Driven Study of Citizen Science Data Quality Assessment Profile
J. N. Leocadio, A. Saraiva
Proceedings of the International Conferences on WWW/Internet 2021 and Applied Computing 2021, published 2021-10-13
DOI: https://doi.org/10.33965/icwi_ac2021_202109l012
Abstract
In Citizen Science (CS) projects, data quality (DQ) is a major concern, and discussions have been held on how to evaluate and ensure the quality of what volunteers produce. Few studies, however, have assessed how volunteers get involved and how their behavior affects data quality. This study aimed to develop a data-driven profile for assessing CS data quality. We analyzed citizen science data extracted from iNaturalist, a platform for recording species observations. We used 58,488 observations recorded in São Paulo, Brazil, and Manchester, England, to train Random Forest machine learning models and to create a DQ profile that classifies data according to its quality. Our approach first identifies information elements (IE) and quality dimensions that describe the data and users’ behavior. The data were then cleaned, pre-processed, and transformed. Three models were created: a complete model (with all features), a reduced model (with dimension reduction), and a model with only the characteristics that describe users’ behavior. The precision scores for the models were 0.931, 0.932, and 0.774, respectively. The results showed that data quality can be described with few features and that user behavior is very important for understanding the quality of what volunteers produce.
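To make the reported metric concrete, the sketch below shows how a precision score can be computed from true and predicted quality labels. It is an illustration only, not the authors' code: the label names ("high", "low"), the toy data, and the function names are hypothetical, and the abstract does not say whether precision was computed for a single class or averaged across classes, so the macro average here is an assumption.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Return {label: precision}, where precision = TP / (TP + FP).

    Only labels that were predicted at least once are scored, so the
    denominator is never zero.
    """
    predicted = Counter(y_pred)  # TP + FP for each label
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)  # TP
    return {label: correct[label] / predicted[label] for label in predicted}


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precision scores (an assumed choice)."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical quality labels for eight observations (not data from the study).
y_true = ["high", "high", "low", "low", "high", "low", "high", "low"]
y_pred = ["high", "high", "low", "high", "high", "low", "low", "low"]

print(per_class_precision(y_true, y_pred))
print(macro_precision(y_true, y_pred))
```

Read this way, a precision of 0.931 would mean that roughly 93% of the observations a model assigned to a quality class actually belonged to that class.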