Quantifying Domain Knowledge in Large Language Models

Sudhashree Sayenju, Ramazan S. Aygun, Bill Franks, Sereres Johnston, George Lee, Hansook Choi, Girish Modgil
{"title":"Quantifying Domain Knowledge in Large Language Models","authors":"Sudhashree Sayenju, Ramazan S. Aygun, Bill Franks, Sereres Johnston, George Lee, Hansook Choi, Girish Modgil","doi":"10.1109/CAI54212.2023.00091","DOIUrl":null,"url":null,"abstract":"Transformer based Large language models such as BERT, have demonstrated the ability to derive contextual information from the words surrounding it. However, when these models are applied in specific domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia, it may not be as effective in inferring the appropriate context compared to domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying the effectiveness of these domain-specific models in acquiring the necessary domain knowledge. To address this issue, we explore hidden layer embeddings and introduce domain_gain, a measure to quantify the ability of a model to infer the correct context. In this paper, we show how our measure could be utilized to determine whether words with multiple meanings are more likely to be associated with domain-related meanings rather than their colloquial meanings.","PeriodicalId":129324,"journal":{"name":"2023 IEEE Conference on Artificial Intelligence (CAI)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Conference on Artificial Intelligence (CAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CAI54212.2023.00091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based large language models such as BERT have demonstrated the ability to derive contextual information for a word from the words surrounding it. However, when these models are applied in specific domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia may not be as effective at inferring the appropriate context as domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying how effectively these domain-specific models acquire the necessary domain knowledge. To address this issue, we explore hidden-layer embeddings and introduce domain_gain, a measure that quantifies a model's ability to infer the correct context. In this paper, we show how our measure can be used to determine whether words with multiple meanings are more likely to be associated with their domain-related meanings rather than their colloquial meanings.
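The abstract does not give the formal definition of domain_gain; the full paper does. As a rough illustration of the underlying idea of probing hidden-layer embeddings for ambiguous words, the following is a minimal sketch, not the authors' implementation: the model name (bert-base-uncased), the example word "premium", the example sentences, and the cosine-similarity comparison are all assumptions made for demonstration.

```python
# Minimal sketch (assumed setup, not the paper's method): extract a word's
# hidden-layer embedding with BERT and compare an ambiguous test usage
# against a domain-sense context and a colloquial-sense context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_embedding(sentence: str, word: str, layer: int = -1) -> torch.Tensor:
    """Return the hidden-state vector of `word` at the given layer,
    averaged over its WordPiece tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Locate the word's WordPiece span inside the encoded sentence.
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    raise ValueError(f"'{word}' not found in: {sentence}")

# "premium" is ambiguous: an insurance charge vs. "high quality" colloquially.
target = "premium"
domain_ctx = "The policyholder pays a monthly premium for the coverage."
colloquial_ctx = "The store only sells premium coffee beans."
test_ctx = "The premium increased after the claim was filed."

cos = torch.nn.functional.cosine_similarity
e_test = word_embedding(test_ctx, target)
sim_domain = cos(e_test, word_embedding(domain_ctx, target), dim=0).item()
sim_colloquial = cos(e_test, word_embedding(colloquial_ctx, target), dim=0).item()
print(f"similarity to domain sense:     {sim_domain:.3f}")
print(f"similarity to colloquial sense: {sim_colloquial:.3f}")
```

If a domain-tuned model infers the insurance sense, the test embedding should sit closer to the domain context than to the colloquial one; a difference or ratio of such similarities is one way a measure like domain_gain could plausibly be operationalized, though the paper's actual definition should be taken from the full text.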