Predicting diabetes in imbalanced datasets using neural networks
Authors: H. Guan, Chonghao Zhang
Published in: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2022-08-07
DOI: https://doi.org/10.1145/3535508.3545540
Diabetes is a chronic disease caused by high blood sugar over a long period of time, and one in every ten Americans has it. Neural networks have gained attention in large-scale genetic research because of their ability to model non-linear relationships. However, the data imbalance problem, caused by the disproportion between the number of disease samples and the number of healthy samples, decreases prediction accuracy. In this project, we tackle the data imbalance problem when predicting diabetes with genotype SNP data and phenotype data provided by UK BioBank. The dataset is highly skewed toward healthy samples, which outnumber disease samples by a ratio of 20. We build a phenotype neural network and a genotype neural network, and counter the data imbalance with two sampling techniques and a data augmentation method based on a generative adversarial network (GAN) before feeding the data to the neural networks. We find that the phenotype neural network outperforms the genotype neural network and achieves 90% accuracy. We conclude that undersampling performs better than both oversampling and the GAN, and that phenotype data is better than genotype data for predicting diabetes. We also identify key phenotype and genotype features that contribute to the effectiveness of the prediction.
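To make the two sampling ideas concrete, the following is a minimal sketch of random undersampling and random oversampling on a toy label set with the 20-to-1 skew described above. The abstract does not name the specific sampling techniques the authors used, so the random variants, the function names `random_undersample` and `random_oversample`, and the toy data are all assumptions chosen for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Group samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    pairs = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        # Draw duplicates with replacement to fill the gap to the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical toy data: 2000 healthy (label 0) and 100 diabetic (label 1) samples.
labels = [0] * 2000 + [1] * 100
samples = list(range(len(labels)))  # stand-ins for feature vectors

x_under, y_under = random_undersample(samples, labels)
x_over, y_over = random_oversample(samples, labels)

print("original:    ", Counter(labels))
print("undersampled:", Counter(y_under))
print("oversampled: ", Counter(y_over))
```

Undersampling discards most of the majority class, while oversampling keeps all samples and repeats minority ones. In practice, resampling of this kind is usually applied to the training split only, so that evaluation still reflects the original class distribution.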