Discovering Genetic Signatures Associated with Alzheimer's Disease in Tiled Whole Genome Sequence Data: Results from the Artificial Intelligence for Alzheimer's Disease (AI4AD) Consortium

medRxiv - Genetic and Genomic Medicine. Pub Date: 2024-08-03. DOI: 10.1101/2024.08.01.24311329

Sarah W Zaranek, Alexander Wait Zaranek, Peter Amstutz, Jingxuan Bao, Jiong Chen, Tom Clegg, Hannah Craft, Taeho Jo, Brian Lee, Kwangsik Nho, Sophia I Thomopoulos, Christos Davatzikos, Li Shen, Heng Huang, Paul M Thompson, Andrew J Saykin, The Alzheimer's Disease Neuroimaging Initiative as a consortium author for the AI4AD Initiative



Abstract

Analysis of large-scale whole genome sequence (WGS) data is currently limited by both the size of the data and the inability of many existing tools to scale. To address this challenge, we use data "tiling" to efficiently partition whole genome sequences into smaller segments, resulting in a simple numeric matrix of small integers. This lossless representation is particularly suitable for machine learning (ML) models. As an example of the benefits of tiling, we showcase results from tiled data as part of the Artificial Intelligence for Alzheimer's Disease (AI4AD) consortium. AI4AD is a coordinated initiative to develop transformative AI approaches for high-throughput analysis of next-generation sequencing and related imaging, AD biomarker, and cognitive data. The collective effort integrates imaging, genomic, biomarker, and cognitive data to address fundamental barriers in AD prevention and drug discovery. One of the project's initial aims is to discover new genetic signatures in WGS data that can be used to understand AD risk and progression in conjunction with imaging, biomarker, and cognitive data. We tiled and analyzed 15,000+ genomes from the Alzheimer's Disease Sequencing Project (ADSP) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). We tile 11,762 genomes, a subset of the release that does not include family-based datasets (AD cases: 4,983; age range: 50-90 years; mean age: 73.8 years). We illustrate the use of tiled data in ML classification methods to predict phenotypes. Specifically, we identify and prioritize tile variants/genetic variants that are possible genetic signatures for AD. The model shows added predictive value from variants of genes previously found to be associated with AD risk, age of onset, neurofibrillary tangle measurements, and other AD-related traits, including the APOE variant (rs429358).
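
The tiling idea described in the abstract can be illustrated with a toy sketch. This is not the consortium's pipeline: the fixed-width windows, the function name `tile_matrix`, and the toy sequences are all assumptions made for illustration, and the abstract does not say how tile boundaries are actually chosen. The sketch only shows how per-position tile variants might map to small integers, and why keeping the variant lookup tables makes the encoding reversible.

```python
from typing import Dict, List, Tuple


def tile_matrix(
    genomes: List[str], tile_len: int
) -> Tuple[List[List[int]], List[Dict[str, int]]]:
    """Encode equal-length sequences as a matrix of small integers.

    Rows are genomes, columns are tile positions, and each value is the
    ID of the tile variant seen at that position (0 = first one seen).
    Fixed-width tiles are a simplifying assumption for this sketch.
    """
    n_tiles = (len(genomes[0]) + tile_len - 1) // tile_len
    # One lookup table per tile position: tile sequence -> integer ID.
    variants: List[Dict[str, int]] = [{} for _ in range(n_tiles)]
    matrix: List[List[int]] = []
    for seq in genomes:
        row = []
        for pos in range(n_tiles):
            tile = seq[pos * tile_len:(pos + 1) * tile_len]
            ids = variants[pos]
            # A new variant receives the next unused ID at this position.
            row.append(ids.setdefault(tile, len(ids)))
        matrix.append(row)
    return matrix, variants


def reconstruct(row: List[int], variants: List[Dict[str, int]]) -> str:
    """Invert the encoding for one genome using the lookup tables."""
    parts = []
    for pos, variant_id in enumerate(row):
        by_id = {v: k for k, v in variants[pos].items()}
        parts.append(by_id[variant_id])
    return "".join(parts)


# Hypothetical toy sequences, not real genomic data.
toy_genomes = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
toy_matrix, toy_variants = tile_matrix(toy_genomes, tile_len=4)
toy_roundtrip = [reconstruct(r, toy_variants) for r in toy_matrix]
```

In this sketch each row of `toy_matrix` could serve as a feature vector for a classifier, and `toy_roundtrip` recovers the input sequences, which is the sense in which such an encoding is lossless as long as the lookup tables are retained.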

