Selection of SNP Subsets for Severity of Beta-thalassaemia Classification Problem

Proceedings of the 9th International Conference on Computational Systems-Biology and Bioinformatics. Pub Date: 2018-12-10. DOI: 10.1145/3291757.3291770

Ek Thamwiwatthana, Kitsuchart Pasupa, S. Tongsima


Citations: 3

Abstract


Single-nucleotide polymorphisms (SNPs) are important genetic variables that are widely used in genome-wide association studies, particularly in studies of genetic disorders. A distinctive trait of SNPs is their sheer number, since they are variables that originate from many positions in a DNA sequence. Unfortunately, the number of samples investigated is usually far smaller than the number of SNPs, so over-fitting often occurs when constructing a predictive model that classifies a sample as a case or a control. This study investigated a dataset on beta-thalassemia, a common genetic disorder widely found in the Thai population. The samples in the dataset are divided into two groups: severe and mild. The aims of the study were to develop and evaluate methods for screening and ranking SNPs related to this disorder. The screening methods tested were the chi-squared test (χ²), Information Gain, and Gradient Boosting (GB). The SNPs selected by screening were then used to construct a predictive model that classifies a sample as either a severe or a mild case. The model construction methods tested were Support Vector Machine (SVM), GB, and Naïve Bayes. Several combinations of a screening method and a model construction method were evaluated, and the results show that the best combination was χ²-SVM with 10 selected SNPs.
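The χ² screening step described in the abstract can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 and binary severity labels (0 = mild, 1 = severe), and it uses synthetic data. The function names `chi2_statistic` and `select_top_snps` are hypothetical, and the abstract does not state the authors' genotype encoding or implementation, so this shows only the general technique (a Pearson χ² test on a genotype-by-class contingency table), not the paper's code.

```python
import random
from collections import Counter


def chi2_statistic(genotypes, labels):
    """Pearson chi-squared statistic for one SNP against the class labels."""
    n = len(labels)
    observed = Counter(zip(genotypes, labels))  # contingency-table cell counts
    genotype_totals = Counter(genotypes)        # row totals
    label_totals = Counter(labels)              # column totals
    stat = 0.0
    for g, g_n in genotype_totals.items():
        for y, y_n in label_totals.items():
            expected = g_n * y_n / n            # expected count under independence
            stat += (observed[(g, y)] - expected) ** 2 / expected
    return stat


def select_top_snps(X, y, k=10):
    """Rank SNP columns by chi-squared statistic and return the top-k indices."""
    n_snps = len(X[0])
    scores = [chi2_statistic([row[j] for row in X], y) for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Synthetic data (assumption): 60 samples, 200 SNPs, far more SNPs than samples.
random.seed(0)
y = [i % 2 for i in range(60)]
X = [[random.choice([0, 1, 2]) for _ in range(200)] for _ in y]
for row, label in zip(X, y):
    row[0] = 2 * label  # make SNP 0 perfectly associated with severity

top_snps, scores = select_top_snps(X, y, k=10)
print("Selected SNP indices:", top_snps)
```

The selected column indices would then be used to subset the genotype matrix before training a classifier such as an SVM. To avoid optimistic accuracy estimates, the screening should be fitted on the training folds only.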
