Mohammad Hosseini, Spencer Hong, Kristi Holmes, Kris Wetterstrand, Christopher Donohue, Luis A Nunes Amaral, Thomas Stoeger
Ethical considerations in utilizing artificial intelligence for analyzing the NHGRI's History of Genomics and Human Genome Project archives.
Journal of eScience Librarianship, vol. 13, no. 1. Published 2024-03-01 (Epub 2024-03-05).
DOI: https://doi.org/10.7191/jeslib.811
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11566842/pdf/
Abstract
Understanding "how to optimize the production of scientific knowledge" is paramount to those who support scientific research (funders as well as research institutions), to the communities served, and to researchers. Structured archives can help everyone involved learn which decisions and processes help or hinder the production of new knowledge. Using artificial intelligence (AI) and large language models (LLMs), we recently created the first structured digital representation of the historic archives of the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health. This work yielded a digital knowledge base of entities, topics, and documents that can be used to probe the inner workings of the Human Genome Project, a massive international public-private effort to sequence the human genome, and several of its offshoots, such as The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE). The resulting knowledge base will be instrumental in understanding not only how the Human Genome Project and genomics research developed collaboratively, but also how scientific goals are formulated and evolve. Given the diverse and rich data used in this project, we evaluated the ethical implications of employing AI and LLMs to process and analyze this valuable archive. As the first computational investigation of the internal archives of a massive collaborative project with multiple funders and institutions, this study will inform future efforts to conduct similar investigations while considering and minimizing ethical challenges. Our methodology and risk-mitigating measures could also inform future initiatives to develop standards for project planning and policymaking, enhance transparency, and ensure the ethical use of AI technologies and LLMs in archive exploration.