An instruction dataset for extracting quantum cascade laser properties from scientific text

Impact factor: 1.0 | JCR: Q3 | Category: Multidisciplinary Sciences

Data in Brief | Pub Date: 2025-02-01 | DOI: 10.1016/j.dib.2024.111255

Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor

Data in Brief, volume 58, Article 111255. Publication type: Journal Article.
Full text: https://www.sciencedirect.com/science/article/pii/S2352340924012174
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11742563/pdf/

Citations: 0

Abstract

Quantum Cascade Lasers (QCLs) are promising semiconductor lasers: compact and powerful, but complex to design. Structured data on QCL properties can support data mining activities that seek to understand the relationships between these properties, for instance between design and performance features. The main open source of QCL data is scientific text, which is usually unstructured. One way to extract and organize this data is to use Information Extraction techniques, which can accelerate the curation of QCL property data from scientific articles for further analysis. A main challenge in developing machine learning algorithms that extract QCL properties from text is the lack of quality training data. Large Language Models (LLMs) have demonstrated great capabilities in extracting materials properties from text. However, they struggle with domain-specific properties, for instance the heterostructure and design types in the QCL domain, and therefore need to be adapted. In this paper, we present an original instruction dataset for training and evaluating LLMs for QCL property extraction from text. The data is generated by augmenting sample sentences from scientific articles with GPT-3.5 Instruct using a few-shot strategy. The dataset is then manually annotated with the help of QCL experts and comprises 1300 rows of training examples, each consisting of an Instruction, an Input Text and an Output.
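To make the row layout concrete, here is a minimal Python sketch of how one Instruction / Input Text / Output row might be represented and rendered as a fine-tuning prompt. The abstract only names the three fields; the class name, the field identifiers, the prompt template and the example row below are all assumptions made for illustration and are not taken from the published dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    """One dataset row holding the three fields named in the abstract.

    The attribute names are hypothetical; the real column names may differ.
    """
    instruction: str
    input_text: str
    output: str


def to_prompt(example: InstructionExample) -> str:
    """Render a row as a prompt string using an assumed template."""
    return (
        f"### Instruction:\n{example.instruction}\n\n"
        f"### Input:\n{example.input_text}\n\n"
        f"### Output:\n"
    )


def is_valid(example: InstructionExample) -> bool:
    """Treat a row as usable only if all three fields are non-empty."""
    fields = (example.instruction, example.input_text, example.output)
    return all(field.strip() for field in fields)


# Made-up row for illustration only; it is NOT a sample from the dataset.
row = InstructionExample(
    instruction="Extract the working temperature from the text.",
    input_text="The device operated up to a maximum temperature of 200 K.",
    output="working temperature: 200 K",
)

if is_valid(row):
    # During training the expected output is appended to the prompt.
    print(to_prompt(row) + row.output)
```

Keeping the prompt template separate from the row type means the same rows could serve both training (prompt plus output) and evaluation (prompt only, with the output held back for comparison).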

Source journal

Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days

About the journal: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.