Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo

IF 5.5 | CAS Zone 1 (Chemistry) | JCR Q2 (Chemistry, Physical)

Journal of Chemical Theory and Computation | Pub Date: 2024-10-08 | DOI: 10.1021/acs.jctc.4c00805

Nakul Rampal, Kaiyu Wang, Matthew Burigana, Lingxiang Hou, Juri Al-Johani, Anna Sackmann, Hanan S. Murayshid, Walaa A. AlSumari, Arwa M. AlAbdulkarim, Nahla E. Alhazmi, Majed O. Alawad, Christian Borgs, Jennifer T. Chayes, Omar M. Yaghi


Citations: 0

Abstract

The rapid advancement in artificial intelligence and natural language processing has led to the development of large-scale datasets aimed at benchmarking the performance of machine learning models. Herein, we introduce “RetChemQA”, a comprehensive benchmark dataset designed to evaluate the capabilities of such models in the domain of reticular chemistry. This dataset includes both single-hop and multi-hop question-answer pairs, encompassing approximately 45,000 questions and answers (Q&As) for each type. The questions have been extracted from an extensive corpus of literature containing about 2,530 research papers from publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group, among others. The dataset has been generated using OpenAI’s GPT-4 Turbo, a cutting-edge model known for its exceptional language understanding and generation capabilities. In addition to the Q&A dataset, we also release a dataset of synthesis conditions extracted from the corpus of literature used in this study. The aim of RetChemQA is to provide a robust platform for the development and evaluation of advanced machine learning algorithms, particularly for the reticular chemistry community. The dataset is structured to reflect the complexities and nuances of real-world scientific discourse, thereby enabling nuanced performance assessments across a variety of tasks.




Source journal

Journal of Chemical Theory and Computation (Chemistry; Physics: Atomic, Molecular and Chemical)

CiteScore: 9.90

Self-citation rate: 16.40%

Articles published: 568

Review time: 1 month

About the journal: The Journal of Chemical Theory and Computation invites new and original contributions with the understanding that, if accepted, they will not be published elsewhere. Papers reporting new theories, methodology, and/or important applications in quantum electronic structure, molecular dynamics, and statistical mechanics are appropriate for submission to this Journal. Specific topics include advances in or applications of ab initio quantum mechanics, density functional theory, design and properties of new materials, surface science, Monte Carlo simulations, solvation models, QM/MM calculations, biomolecular structure prediction, and molecular dynamics in the broadest sense including gas-phase dynamics, ab initio dynamics, biomolecular dynamics, and protein folding. The Journal does not consider papers that are straightforward applications of known methods including DFT and molecular dynamics. The Journal favors submissions that include advances in theory or methodology with applications to compelling problems.