Tissue Classification Using RNA-Seq Transcriptomics with Distribution Analysis and SVM Models*

Dominick DeCanio, Minah Kim, Samuel Haddox, G. Guadagni
Published in: 2023 Systems and Information Engineering Design Symposium (SIEDS), 2023-04-27
DOI: https://doi.org/10.1109/SIEDS58326.2023.10137900
Citations: 0

Abstract

The human body generates more proteins than it has protein-coding genes. This diversity stems from the alternative ways in which RNA can be spliced and reassembled. Each alternative version of an RNA produces a different protein, which lets the body produce a wide range of proteins from a single gene. Some alternative RNA transcripts, however, contain splicing errors and produce faulty proteins involved in genetic diseases. Understanding splicing patterns and profiles therefore has wide implications for our understanding of healthy and diseased tissue states. Currently, little is known about the splicing profiles of healthy tissue, which vary across individuals and, within an individual, by tissue type. This project therefore explored the use of RNA splicing data from the first chromosome to predict the tissue type of non-cancerous samples using distribution analysis and supervised learning methods. The Kolmogorov-Smirnov test was used to classify the samples based on their empirical cumulative distribution functions, and it was not able to reliably distinguish between tissue types. Our SVM model was run on two data sets, one using the quantity of splice junctions observed and one using only their presence, and had a high prediction accuracy on both. The performance of the two SVM models did not differ significantly. Overall, the findings suggest the utility of splice junction data in biological classification and set the foundation for future work on mapping splicing patterns to phenotype.
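The abstract does not say how the Kolmogorov-Smirnov comparison was turned into a classifier, or how the count and presence data sets were encoded. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a query sample is assigned to the tissue whose reference sample has the smallest two-sample KS statistic, and that "presence" means a junction count greater than zero. All function names, tissue labels, and numbers are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(x) = fraction of values in `sample` that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two ECDFs.

    Both ECDFs are step functions that only change at observed values,
    so checking every observed value is enough to find the maximum.
    """
    Fa, Fb = ecdf(a), ecdf(b)
    return max(abs(Fa(x) - Fb(x)) for x in set(a) | set(b))


def classify_by_ks(sample, references):
    """Assumed rule: pick the label whose reference is closest in KS distance."""
    return min(references, key=lambda label: ks_statistic(sample, references[label]))


def to_presence(counts):
    """Assumed encoding: 1 if the junction was observed at least once, else 0."""
    return [1 if c > 0 else 0 for c in counts]


# Hypothetical splice-junction read counts, for illustration only.
references = {
    "liver": [0, 0, 1, 2, 5, 9],
    "lung": [3, 4, 6, 8, 12, 20],
}
query = [0, 1, 1, 3, 6, 10]

predicted = classify_by_ks(query, references)  # "liver" for this toy data
presence_features = to_presence(query)         # same junctions, binary-encoded
```

The count vector and its binary presence encoding are the two kinds of feature input the abstract describes for the SVM; either could be passed to an SVM implementation such as a library classifier, which is not shown here.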