An instruction dataset for extracting quantum cascade laser properties from scientific text

Impact factor: 1.0, JCR Q3, Multidisciplinary Sciences
Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor
Data in Brief, vol. 58, Article 111255. Journal article, published 2025-02-01. DOI: 10.1016/j.dib.2024.111255
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11742563/pdf/
Publisher page: https://www.sciencedirect.com/science/article/pii/S2352340924012174
Citations: 0

Abstract

Quantum Cascade Lasers (QCLs) are promising semiconductor lasers: compact and powerful, but complex to design. Structured data on QCL properties can support data mining activities that seek to understand the relationships between these properties, for instance between design and performance features. The main open source of QCL data is scientific text, which is usually unstructured. One way to extract and organize this data is to use Information Extraction techniques, which can accelerate the curation of QCL property data from scientific articles for further analysis. One of the main challenges in developing machine learning algorithms for extracting QCL properties from text is the lack of quality training data. Large Language Models (LLMs) have demonstrated strong capabilities in extracting materials properties from text. However, they struggle with domain-specific properties, for instance the heterostructure and design types in the QCL domain, hence the need for adaptation. In this paper, we present an original instruction dataset for training and evaluating LLMs on the extraction of QCL properties from text. The data is generated by augmenting sample sentences from scientific articles with GPT-3.5 Instruct using a few-shot strategy. The dataset is then manually annotated with the help of QCL experts and comprises 1300 rows of training examples, each consisting of an Instruction, an Input Text and an Output.
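To make the row structure concrete, the sketch below shows one way an Instruction / Input Text / Output row could be represented and rendered as a fine-tuning prompt. The class name `InstructionExample`, the field names, the Alpaca-style prompt layout and the sample row are all assumptions made for illustration; none of them are taken from the dataset or from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """One dataset row; field names are hypothetical, mirroring the abstract."""
    instruction: str  # what the model is asked to extract
    input_text: str   # sentence drawn or augmented from a scientific article
    output: str       # annotated answer


def to_prompt(example: InstructionExample) -> str:
    """Render a row as a single training string (Alpaca-style layout, an assumption)."""
    return (
        f"### Instruction:\n{example.instruction}\n\n"
        f"### Input:\n{example.input_text}\n\n"
        f"### Response:\n{example.output}"
    )


# Placeholder row invented for illustration; it is NOT taken from the dataset.
row = InstructionExample(
    instruction="Extract the emission wavelength of the laser from the text.",
    input_text="The device emits at a wavelength of 4.6 um at room temperature.",
    output="4.6 um",
)

prompt = to_prompt(row)
print(prompt)
```

Keeping the three fields separate until the final formatting step means the same rows can be reused for evaluation: the prompt can be built without the response, and the model's answer compared against `output`.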
Source journal
Data in Brief (Multidisciplinary Sciences)
CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days
Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.