O. Starunova, S. Rudnev, A. Ivanova, V. G. Semenova, V. Starodubov
Mathematical Biology and Bioinformatics, published 2022-11-05. DOI: 10.17537/2022.17.230 (https://doi.org/10.17537/2022.17.230)
Application of Benford's Law for Quality Assessment of Preventive Screening Data
Benford's law, an empirical law describing the probability with which particular first significant digits appear in many real-life distributions, is used to identify anomalies in various kinds of data. Our aim was to test whether Benford's law can be used to assess the quality of mass preventive screening data, taking bioelectrical impedance analysis (BIA) data from Moscow health centers as an example. As shown earlier, such data are heavily contaminated by artificially generated and falsified records. The compiled 2010–2019 database of BIA measurements contained 1361019 measurement records for persons aged 5 to 96 years. An expert quality assessment algorithm, used as the reference for evaluating the effectiveness of the Benford analysis, revealed a high percentage of incorrect data (66.5%), most of which were falsified. To characterize how well the data comply with Benford's law, the mean absolute deviations of the frequency distributions of the first and first two significant digits from their expected values and the chi-squared statistics were computed for the tenth powers of the standardized resistance, reactance, and resistance index values at each health center. A significant correlation was observed between the deviation of the data from Benford's law and the percentage of incorrect data identified by the expert quality assessment algorithm (ρmax = 0.66 and 0.62 for the mean absolute deviations and χ² statistics, respectively, based on the resistance value and the first significant digit). This suggests that deviation of BIA data from Benford's law is a sufficient, but not a necessary, condition for their contamination. For health centers in which most of the incorrect data were multiple measurements of the same person passed off as measurements of different people, the data agreed well with Benford's law. 
Where the incorrect data were dominated by measurements of the calibration block, software emulations of BIA measurements, and outliers, Benford's law made it possible to effectively rank health centers by the level of data authenticity.
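
The two compliance measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard Benford probabilities, P(d) = log10(1 + 1/d), for the first digit (d = 1..9) or the first two digits (d = 10..99). The function names (`benford_expected`, `leading_digits`, `benford_scores`) are hypothetical, and the input is synthetic (powers of 2). The standardization and tenth-power transform applied to the BIA values are not reproduced, because the abstract does not specify them. This is not the authors' implementation.

```python
import math
from collections import Counter


def benford_expected(k=1):
    """Benford probabilities for the leading k digits (k=1: 1..9, k=2: 10..99)."""
    return {d: math.log10(1 + 1 / d) for d in range(10 ** (k - 1), 10 ** k)}


def leading_digits(x, k=1):
    """Return the first k significant digits of a nonzero finite number."""
    # Scientific notation puts the first significant digit before the point,
    # e.g. 0.00123 -> '1.230000000000000e-03'.
    mantissa = f"{abs(x):.15e}".split("e")[0]
    return int(mantissa.replace(".", "")[:k])


def benford_scores(values, k=1):
    """Return (mean absolute deviation, chi-squared statistic) against Benford's law."""
    digits = [leading_digits(v, k) for v in values if v != 0 and math.isfinite(v)]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero finite values to analyze")
    counts = Counter(digits)  # missing digits count as 0
    expected = benford_expected(k)
    # Mean absolute deviation between observed and expected digit frequencies.
    mad = sum(abs(counts[d] / n - p) for d, p in expected.items()) / len(expected)
    # Pearson chi-squared statistic on observed versus expected digit counts.
    chi2 = sum((counts[d] - n * p) ** 2 / (n * p) for d, p in expected.items())
    return mad, chi2


if __name__ == "__main__":
    # Synthetic illustration only: powers of 2 are known to follow Benford's law closely.
    sample = [2 ** n for n in range(100)]
    mad, chi2 = benford_scores(sample, k=1)
    print(f"MAD = {mad:.4f}, chi-squared = {chi2:.3f}")
```

Computing these scores separately for each health center's records and correlating them with the share of records flagged as incorrect would mirror the comparison described in the abstract, although the exact preprocessing used there may differ.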