K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data.

Advances in drug and alcohol research. Pub Date: 2025-01-28. eCollection Date: 2024-01-01. DOI: 10.3389/adar.2024.13449
Ayesha Sania, Nicolò Pini, Morgan E Nelson, Michael M Myers, Lauren C Shuffrey, Maristella Lucchini, Amy J Elliott, Hein J Odendaal, William P Fifer
Advances in drug and alcohol research, volume 4, article 13449. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11811783/pdf/

Abstract

Aims: This study illustrates the application of a machine learning algorithm, K-nearest neighbor (k-NN), to impute missing alcohol data in a prospective study of pregnant women.

Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last reported drinking day and 30 days prior was recorded using the Timeline Followback method, which generated a variable amount of missing data per participant. Of the 3.2 million person-days of observation, data were missing for 0.36 million (11.4%). With k-NN, imputed values were weighted by distance and matched on day of the week. Because participants with no missing days were not comparable to those with missing data, segments of non-missing data from all participants were included as the reference set. Validation was performed after randomly deleting 5-15 consecutive days of data from the first trimester.
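
As a rough illustration of the distance-weighted k-NN step described in the Methods, the sketch below fills the missing days of one segment from fully observed reference segments. It is not the authors' implementation: the function name `knn_impute_segment`, the Euclidean distance, and the inverse-distance weights are assumptions, and day-of-week matching is assumed to have been done beforehand by aligning every reference segment so that each column falls on the same weekday as the target.

```python
import numpy as np

def knn_impute_segment(target, references, k=5):
    """Fill NaN days in `target` from the k closest reference segments.

    target:     1-D array of daily drinks, NaN where the day is missing.
    references: 2-D array (n_segments, n_days) with no missing values,
                assumed to be weekday-aligned with `target` (hypothetical
                preprocessing, not shown here).
    """
    target = np.asarray(target, dtype=float)
    references = np.asarray(references, dtype=float)
    observed = ~np.isnan(target)

    # Euclidean distance computed only over the days observed in the target.
    diffs = references[:, observed] - target[observed]
    dists = np.sqrt((diffs ** 2).sum(axis=1))

    # Indices of the k nearest reference segments.
    nearest = np.argsort(dists, kind="stable")[:k]

    # Inverse-distance weights; the small constant avoids division by zero.
    weights = 1.0 / (dists[nearest] + 1e-9)
    weights /= weights.sum()

    # Weighted average of the neighbors' values on the missing days.
    imputed = target.copy()
    imputed[~observed] = weights @ references[nearest][:, ~observed]
    return imputed

# Toy usage: one missing day, two close neighbors and one distant one.
target = np.array([0.0, 2.0, np.nan, 0.0, 1.0])
references = np.array([
    [0.0, 2.0, 3.0, 0.0, 1.0],
    [0.0, 2.0, 1.0, 0.0, 1.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
])
filled = knn_impute_segment(target, references, k=2)
```

In the toy example the two nearest segments match the target exactly on the observed days, so they receive equal weight and the missing day is filled with the average of their values.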

Results: We found that data from the 5 nearest neighbors (i.e., K = 5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from the first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual values. Imputation accuracy varied by study site because of differences in the magnitude of drinking and the proportion of missing data.
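
The validation idea (delete a run of 5-15 consecutive days from complete data, impute, then compare with the true values) might be sketched as follows. The helper names `mask_random_run` and `summarize_error` are hypothetical, and the abstract does not state how agreement was scored per segment; comparing rounded per-day values is an assumption made here for illustration only.

```python
import numpy as np

def mask_random_run(segment, rng, min_len=5, max_len=15):
    """Return a copy of `segment` with one random consecutive run set to NaN.

    The segment must be at least `max_len` days long; otherwise the
    start-position draw below raises a ValueError.
    """
    segment = np.asarray(segment, dtype=float)
    run_len = int(rng.integers(min_len, max_len + 1))  # upper bound is exclusive
    start = int(rng.integers(0, len(segment) - run_len + 1))
    masked = segment.copy()
    masked[start:start + run_len] = np.nan
    return masked, slice(start, start + run_len)

def summarize_error(actual, predicted):
    """Share of days where the rounded prediction is exact or within 1 drink."""
    diff = np.abs(np.round(predicted) - np.asarray(actual, dtype=float))
    return {
        "exact": float(np.mean(diff == 0)),
        "within_one": float(np.mean(diff <= 1)),
    }

# Toy usage on a synthetic 55-day segment (values are placeholders, not study data).
rng = np.random.default_rng(0)
segment = np.arange(55.0)
masked, run = mask_random_run(segment, rng)

# Scoring example with made-up predictions for four deleted days.
scores = summarize_error(np.array([0, 1, 2, 3]), np.array([0.2, 1.0, 3.4, 5.0]))
```

In a full validation loop, the masked segment would be passed to the imputation routine and `summarize_error` applied to the deleted days only, using the returned slice.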

Conclusion: k-NN can be used to impute missing data from longitudinal studies of alcohol during pregnancy with high accuracy.
