Exploring an AI Data Dredging Birthday Paradox
Marco Pollanen
2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), pp. 1-5
Published: 2024-02-01
DOI: 10.1109/ACDSA59508.2024.10467565 (https://doi.org/10.1109/ACDSA59508.2024.10467565)
Abstract
In the era of AI and data science, large databases are used extensively for purposes that include criminal investigations, medical studies, and general population profiling. This increasingly raises the possibility of random database matches driven merely by coincidence, akin to the famous birthday paradox. We show in this paper that, as databases swell in size and complexity, the likelihood of coincidental matches between seemingly unrelated entries can, under some circumstances, increase dramatically. These extraneous matches can mislead investigators and analysts, ultimately resulting in incorrect source attributions. Applying the mathematics of generalized birthday problems, this paper takes an expository approach to the intricacies of data dredging across diverse data sets, emphasizing the need for caution when interpreting results obtained through post hoc analysis. We explore the potential consequences of relying on after-the-fact, data-driven storytelling, highlighting the dangers of attributing meaning even to matches that occur against seemingly extraordinary odds.
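The coincidence effect described in the abstract can be illustrated with the classical birthday calculation. The sketch below is not taken from the paper: it assumes that each of n database entries independently takes one of d equally likely values (a simplification real data rarely satisfies), and the function name `prob_any_match` is hypothetical. It computes the probability that at least two entries coincide.

```python
import math


def prob_any_match(n: int, d: int) -> float:
    """Probability that at least two of n entries share a value.

    Assumption (illustrative only): every entry is drawn independently
    and uniformly from d possible values, as in the classical birthday
    problem.
    """
    if n < 0 or d <= 0:
        raise ValueError("n must be non-negative and d must be positive")
    if n > d:
        # Pigeonhole principle: more entries than values forces a match.
        return 1.0
    # P(no match) = prod_{k=0}^{n-1} (1 - k/d); sum logs for numerical stability.
    log_no_match = sum(math.log1p(-k / d) for k in range(n))
    # 1 - exp(x), computed accurately when x is close to zero.
    return -math.expm1(log_no_match)


if __name__ == "__main__":
    # Classical case: 23 people, 365 possible birthdays.
    print(f"n=23, d=365: {prob_any_match(23, 365):.4f}")
    # A "one in a million" profile compared across databases of growing size.
    for n in (100, 1_000, 2_000, 5_000):
        print(f"n={n}, d=1,000,000: {prob_any_match(n, 1_000_000):.4f}")
```

Under these assumptions, a profile with nominal odds of one in a million produces a coincidental pair with probability of about 0.39 once the database holds 1,000 entries, because the number of pairs being compared grows roughly as n²/2 while the per-pair odds stay fixed.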