Matthew S Chang, Katherine A Martinez, Chayil C Lattimore, Christina M Gobin, Kimberly J Newsom, Kristianna M Fredenburg
Biology Methods and Protocols, volume 10, issue 1, bpaf043. Published 2025-06-02. doi: 10.1093/biomethods/bpaf043 (https://doi.org/10.1093/biomethods/bpaf043)
Optimization of computational ancestry inference for use in cancer cell lines.
Cancer cell lines have provided invaluable preclinical mechanistic data for cancer health disparities research. Several studies detail ancestry inference methods using microarray data, but none provide investigators with documentation of ancestry inference methods using sequencing data. Here, we describe our computational workflow for inferring genetic ancestry from either whole genome sequencing (WGS) or RNA-sequencing (RNA-seq) data from cancer cell lines. RNA-seq and WGS datasets were generated from four head and neck cancer cell lines with self-identified race/ethnicity (SIRE) of either White or Black. Our workflow included variant calling and genotype imputation via Illumina DRAGEN pipelines, merging genotyping datasets with the 1000 Genomes Project (1KGP), single nucleotide polymorphism (SNP) filtering via PLINK, and ancestry inference with ADMIXTURE. During workflow development, we encountered challenges with SNP filtering and with clustering of 1KGP superpopulations. Adjusting the stringency of the filtering parameters to a window size of 100 kb and an r² threshold of 0.8 left 312,821 SNPs for the RNA-seq dataset and 1,569,578 SNPs for the WGS dataset. Clustering with 1KGP improved with a panel of 291 ancestry informative markers. To estimate proportions of genetic ancestry, we used all filtered SNPs. For the WGS dataset, both clustering and genetic ancestry proportions for each cancer cell line showed concurrence with SIRE. In conclusion, our optimized workflow offers investigators a robust approach for transforming cancer cell line sequencing data to infer genetic ancestry and suggests that WGS datasets are superior to RNA-seq datasets both for clustering superpopulations and for estimating genetic ancestry accurately.
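The SNP filtering step reported above (a 100 kb window and an r² threshold of 0.8) can be illustrated with a minimal sketch of window-based linkage disequilibrium pruning. This is not PLINK's algorithm and not the authors' code: the function names `r_squared` and `ld_prune`, the greedy keep-first rule, and the toy genotypes are all assumptions made for illustration. In PLINK this kind of filter is commonly expressed with the `--indep-pairwise` option, but the abstract does not state which option or step size was used.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as unlinked.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def ld_prune(positions, genotypes, window_bp=100_000, r2_max=0.8):
    """Greedy pruning sketch: keep a SNP unless it exceeds r2_max with an
    already-kept SNP that lies within window_bp upstream of it.

    positions: base-pair coordinates on one chromosome, sorted ascending
    genotypes: one dosage list per SNP, all of equal length (one entry per sample)
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for i, pos in enumerate(positions):
        linked = False
        # Walk kept SNPs from nearest to farthest; stop once outside the window.
        for j in reversed(kept):
            if pos - positions[j] > window_bp:
                break
            if r_squared(genotypes[i], genotypes[j]) > r2_max:
                linked = True
                break
        if not linked:
            kept.append(i)
    return kept


# Hypothetical toy data: 4 SNPs genotyped in 6 samples.
positions = [1_000, 5_000, 50_000, 250_000]
genotypes = [
    [0, 1, 2, 1, 0, 2],  # SNP 0
    [0, 1, 2, 1, 0, 2],  # SNP 1: identical to SNP 0 and 4 kb away
    [2, 0, 1, 1, 0, 2],  # SNP 2: weakly correlated with SNP 0
    [0, 1, 2, 1, 0, 2],  # SNP 3: identical to SNP 0 but outside any 100 kb window
]

kept_indices = ld_prune(positions, genotypes)
print(kept_indices)  # [0, 2, 3]
```

In this toy example SNP 1 is dropped because it is perfectly correlated with SNP 0 inside the window, while SNP 3 survives despite carrying the same genotypes, because no kept SNP lies within 100 kb of it. Raising the r² threshold or narrowing the window makes the filter less stringent, so more SNPs are retained.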