Quantifying Domain Knowledge in Large Language Models
Sudhashree Sayenju, Ramazan S. Aygun, Bill Franks, Sereres Johnston, George Lee, Hansook Choi, Girish Modgil
2023 IEEE Conference on Artificial Intelligence (CAI), June 2023. DOI: 10.1109/CAI54212.2023.00091
Transformer-based large language models such as BERT have demonstrated the ability to derive contextual information for a word from the words surrounding it. However, when these models are applied in specialized domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia may be less effective at inferring the appropriate context than domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying how effectively these domain-specific models acquire the necessary domain knowledge. To address this issue, we explore hidden-layer embeddings and introduce domain_gain, a measure that quantifies a model's ability to infer the correct context. In this paper, we show how our measure can be used to determine whether words with multiple meanings are more likely to be associated with their domain-related meanings than with their colloquial meanings.
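As a rough illustration of the idea behind domain_gain, the sketch below extracts hidden-layer embeddings from a BERT model and checks whether a polysemous word ("premium", in an assumed insurance example) sits closer to a domain sense than to a colloquial sense. The model name, example sentences, and the final gain-style score are illustrative assumptions; the paper's actual domain_gain formula is not reproduced here.

```python
# A minimal sketch, not the paper's implementation: the model name, the example
# sentences, and the gain-style score below are illustrative assumptions. The
# idea is to read a polysemous word's hidden-layer embedding in context and ask
# whether it sits closer to a domain sense or to a colloquial sense.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_embedding(sentence: str, word: str, layer: int = -1) -> torch.Tensor:
    """Hidden-state vector of `word` at `layer` (assumes it stays one WordPiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    position = tokens.index(word)  # fails if the word splits into subwords
    with torch.no_grad():
        # hidden_states: embedding layer plus one tensor per transformer layer
        hidden_states = model(**inputs).hidden_states
    return hidden_states[layer][0, position]

# "premium" is polysemous: an insurance payment vs. "high quality" colloquially.
target = word_embedding("the premium is due at the start of the policy term", "premium")
domain_ref = word_embedding("the insurer raised the annual premium", "premium")
colloquial_ref = word_embedding("they only serve premium coffee", "premium")

cos = torch.nn.functional.cosine_similarity
domain_sim = cos(target, domain_ref, dim=0).item()
colloquial_sim = cos(target, colloquial_ref, dim=0).item()
# A stand-in gain-style score, not the paper's domain_gain formula:
# positive values mean the word leans toward its domain-related meaning.
print(f"domain={domain_sim:.3f} colloquial={colloquial_sim:.3f} "
      f"gain={domain_sim - colloquial_sim:.3f}")
```

Under this reading, a model fine-tuned on an insurance corpus would be expected to pull the target embedding toward the domain reference, yielding a higher score than a general-purpose model.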