GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

arXiv - QuanBio - Genomics. Pub Date: 2024-06-21. DOI: arxiv-2406.15341

Haoyang Liu, Haohan Wang


Citations: 0

Abstract




Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.
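To make the "statistical analysis" task in the abstract concrete, the sketch below shows one simple way a gene identification step might look: ranking genes by how strongly their expression differs between trait-positive and trait-negative samples. This is an illustration only, not the GenoTEX pipeline or the GenoAgents method, neither of which the abstract specifies. Welch's t-statistic is swapped in as a common, simple association score, and the gene names, values, and function names (`welch_t`, `rank_genes`) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch's t-statistic for two independent samples with unequal variances."""
    standard_error = sqrt(
        variance(cases) / len(cases) + variance(controls) / len(controls)
    )
    return (mean(cases) - mean(controls)) / standard_error


def rank_genes(expression, trait):
    """Rank genes by the absolute value of their t-statistic.

    expression: dict mapping gene name -> list of expression values, one per sample
    trait: list of 0/1 labels, aligned with the sample order in `expression`
    """
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, label in zip(values, trait) if label == 1]
        controls = [v for v, label in zip(values, trait) if label == 0]
        scores[gene] = welch_t(cases, controls)
    # Strongest association (largest |t|) first.
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Toy data (hypothetical): three samples with the trait, three without.
trait = [1, 1, 1, 0, 0, 0]
expression = {
    "GENE_A": [5.1, 4.9, 5.3, 2.0, 2.2, 1.9],  # clearly higher in cases
    "GENE_B": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # no difference in means
    "GENE_C": [1.0, 1.4, 0.8, 2.1, 1.9, 2.4],  # moderately lower in cases
}

ranking = rank_genes(expression, trait)
print(ranking)
```

A real analysis would also need the steps the abstract lists before this one (dataset selection and preprocessing), plus multiple-testing correction and confounder handling, all of which this sketch omits.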

