Optimization of computational ancestry inference for use in cancer cell lines.

IF 1.3, JCR Q3, Biochemical Research Methods

Biology Methods and Protocols. Pub Date: 2025-06-02. eCollection Date: 2025-01-01. DOI: 10.1093/biomethods/bpaf043

Matthew S Chang, Katherine A Martinez, Chayil C Lattimore, Christina M Gobin, Kimberly J Newsom, Kristianna M Fredenburg


Citations: 0

Abstract


Optimization of computational ancestry inference for use in cancer cell lines.

Cancer cell lines have provided invaluable preclinical mechanistic data for cancer health disparities research. Although there are several studies that detail ancestry inference methods using microarray data, there are none that provide investigators with documentation of ancestry inference methods using sequencing data. Here, we describe our computational workflow for inferring genetic ancestry using either whole genome sequencing (WGS) or RNA-sequencing (RNA-seq) data from cancer cell lines. RNA-seq and WGS datasets were generated from four head and neck cancer cell lines with self-identified race/ethnicity (SIRE) as either White or Black. Our workflow included variant calling and genotype imputation via Illumina DRAGEN pipelines, merging genotyping datasets with the 1000 Genomes Project (1KGP), single nucleotide polymorphism (SNP) filtering via PLINK, and ancestry inference with ADMIXTURE. We encountered challenges in workflow development with SNP filtering and clustering of 1KGP superpopulations. Adjusting stringency of filtering parameters to a window size of 100 kb and an r² threshold of 0.8 resulted in 312,821 SNPs remaining for the RNA-seq dataset and 1,569,578 SNPs remaining for the WGS dataset. Clustering with 1KGP improved with a panel of 291 ancestry informative markers. To estimate proportions of genetic ancestry, we used all filtered SNPs. For the WGS dataset, both clustering and genetic ancestry proportions for each cancer cell line showed concurrence with SIRE. In conclusion, our optimized workflow offers investigators a robust approach for transforming cancer cell line sequencing data to infer genetic ancestry and suggests that WGS datasets are superior to RNA-seq datasets in clustering superpopulations and more accurately estimating genetic ancestry.
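
To make the SNP-filtering step concrete, the following is a minimal Python sketch of windowed linkage-disequilibrium (LD) pruning using the two parameters reported in the abstract (a 100 kb window and an r² threshold of 0.8). It is a simplified illustration under stated assumptions, not the authors' pipeline: the abstract says only that filtering was done via PLINK and does not name the command (PLINK's LD-pruning option is `--indep-pairwise`; whether it was used here is an assumption). The function names `r_squared` and `ld_prune`, the greedy keep-first rule, and the toy genotypes are all hypothetical.

```python
from typing import List, Sequence, Tuple

# One SNP on a single chromosome: (id, base-pair position, genotypes as 0/1/2 allele counts).
Snp = Tuple[str, int, Sequence[int]]


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as unlinked.
        return 0.0
    return cov * cov / (var_x * var_y)


def ld_prune(snps: List[Snp], window_bp: int = 100_000,
             r2_threshold: float = 0.8) -> List[str]:
    """Greedy pruning (a simplification, not PLINK's exact procedure):
    walk SNPs in position order and keep a SNP unless it exceeds the r^2
    threshold with an already-kept SNP lying within the window."""
    kept: List[Snp] = []
    for snp in sorted(snps, key=lambda s: s[1]):
        _, pos, geno = snp
        in_ld = any(
            pos - kept_pos <= window_bp
            and r_squared(geno, kept_geno) > r2_threshold
            for _, kept_pos, kept_geno in kept
        )
        if not in_ld:
            kept.append(snp)
    return [snp_id for snp_id, _, _ in kept]


# Hypothetical toy data: six samples, four SNPs on one chromosome.
toy = [
    ("rs_a", 10_000, [0, 1, 2, 0, 1, 2]),
    ("rs_b", 50_000, [0, 1, 2, 0, 1, 2]),   # same genotypes as rs_a, 40 kb away
    ("rs_c", 90_000, [2, 0, 1, 1, 0, 2]),   # uncorrelated with rs_a in this sample
    ("rs_d", 400_000, [0, 1, 2, 0, 1, 2]),  # same genotypes as rs_a, outside the window
]
pruned = ld_prune(toy)
print(pruned)
```

On the toy data, `rs_b` is dropped because it duplicates `rs_a` within 100 kb, while `rs_d` carries the same genotypes but lies outside the window and is retained. Raising the r² threshold or shrinking the window makes the filter less stringent, so more SNPs survive pruning.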


Source journal

Biology Methods and Protocols Agricultural and Biological Sciences-Agricultural and Biological Sciences (all)

CiteScore

3.80

Self-citation rate

2.80%

Articles published

Review time

19 weeks