Quantifying Domain Knowledge in Large Language Models

Sudhashree Sayenju, Ramazan S. Aygun, Bill Franks, Sereres Johnston, George Lee, Hansook Choi, Girish Modgil
{"title":"Quantifying Domain Knowledge in Large Language Models","authors":"Sudhashree Sayenju, Ramazan S. Aygun, Bill Franks, Sereres Johnston, George Lee, Hansook Choi, Girish Modgil","doi":"10.1109/CAI54212.2023.00091","DOIUrl":null,"url":null,"abstract":"Transformer based Large language models such as BERT, have demonstrated the ability to derive contextual information from the words surrounding it. However, when these models are applied in specific domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia, it may not be as effective in inferring the appropriate context compared to domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying the effectiveness of these domain-specific models in acquiring the necessary domain knowledge. To address this issue, we explore hidden layer embeddings and introduce domain_gain, a measure to quantify the ability of a model to infer the correct context. In this paper, we show how our measure could be utilized to determine whether words with multiple meanings are more likely to be associated with domain-related meanings rather than their colloquial meanings.","PeriodicalId":129324,"journal":{"name":"2023 IEEE Conference on Artificial Intelligence (CAI)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Conference on Artificial Intelligence (CAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CAI54212.2023.00091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based large language models such as BERT have demonstrated the ability to derive contextual information for a word from the words surrounding it. However, when these models are applied in specific domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia may not be as effective at inferring the appropriate context as domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying how effectively these domain-specific models acquire the necessary domain knowledge. To address this issue, we explore hidden-layer embeddings and introduce domain_gain, a measure that quantifies a model's ability to infer the correct context. In this paper, we show how our measure can be used to determine whether words with multiple meanings are more likely to be associated with their domain-related meanings rather than their colloquial meanings.
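The abstract does not give the formal definition of domain_gain; the full paper does. As a rough illustration of the underlying idea of probing hidden-layer embeddings for ambiguous words, the following is a minimal sketch, not the authors' implementation: the model name (bert-base-uncased), the example word "premium", the example sentences, and the cosine-similarity comparison are all assumptions made for demonstration.

```python
# Minimal sketch (assumed setup, not the paper's method): extract a word's
# hidden-layer embedding with BERT and compare an ambiguous test usage
# against a domain-sense context and a colloquial-sense context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_embedding(sentence: str, word: str, layer: int = -1) -> torch.Tensor:
    """Return the hidden-state vector of `word` at the given layer,
    averaged over its WordPiece tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Locate the word's WordPiece span inside the encoded sentence.
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    raise ValueError(f"'{word}' not found in: {sentence}")

# "premium" is ambiguous: an insurance charge vs. "high quality" colloquially.
target = "premium"
domain_ctx = "The policyholder pays a monthly premium for the coverage."
colloquial_ctx = "The store only sells premium coffee beans."
test_ctx = "The premium increased after the claim was filed."

cos = torch.nn.functional.cosine_similarity
e_test = word_embedding(test_ctx, target)
sim_domain = cos(e_test, word_embedding(domain_ctx, target), dim=0).item()
sim_colloquial = cos(e_test, word_embedding(colloquial_ctx, target), dim=0).item()
print(f"similarity to domain sense:     {sim_domain:.3f}")
print(f"similarity to colloquial sense: {sim_colloquial:.3f}")
```

If a domain-tuned model infers the insurance sense, the test embedding should sit closer to the domain context than to the colloquial one; a difference or ratio of such similarities is one way a measure like domain_gain could plausibly be operationalized, though the paper's actual definition should be taken from the full text.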