{"title":"On the relation between K–L divergence and transfer learning performance on causality extraction tasks","authors":"Seethalakshmi Gopalakrishnan , Victor Zitian Chen , Wenwen Dou , Wlodek Zadrozny","doi":"10.1016/j.nlp.2024.100055","DOIUrl":null,"url":null,"abstract":"<div><p>The problem of extracting causal relations from text remains a challenging task, even in the age of Large Language Models (LLMs). A key factor that impedes the progress of this research is the availability of the annotated data and the lack of common labeling methods. We investigate the applicability of transfer learning (domain adaptation) to address these impediments in experiments with three publicly available datasets: FinCausal, SCITE, and Organizational. We perform pairwise transfer experiments between the datasets using DistilBERT, BERT, and SpanBERT (variants of BERT) and measure the performance of the resulting models. To understand the relationship between datasets and performance, we measure the differences between vocabulary distributions in the datasets using four methods: Kullback–Leibler (K–L) divergence, Wasserstein metric, Maximum Mean Discrepancy, and Kolmogorov–Smirnov test. We also estimate the predictive capability of each method using linear regression. We record the predictive values of each measure. Our results show that K–L divergence between the distribution of the vocabularies in the data predicts the performance of the transfer learning with R2 = 0.0746. Surprisingly, the Wasserstein distance predictive value is low (R2=0.52912), and the same for the Kolmogorov–Smirnov test (R2 =0.40025979). This is confirmed in a series of experiments. For example, with variants of BERT, we observe an almost a 29% to 32% increase in the macro-average F1-score, when the gap between the training and test distributions is small, according to the K–L divergence — the best-performing predictor on this task. We also discuss these results in the context of the sub-par performance of some large language models on causality extraction tasks. Finally, we report the results of transfer learning informed by K–L divergence; namely, we show that there is a 12 to 63% increase in the performance when a small portion of the test data is added to the training data. This shows that corpus expansion and n-shot learning benefit, when the process of choosing examples maximizes their information content, according to the K–L divergence.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100055"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000037/pdfft?md5=b947d57bb804a1d8d27703e9d2e10448&pid=1-s2.0-S2949719124000037-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The problem of extracting causal relations from text remains a challenging task, even in the age of Large Language Models (LLMs). A key factor that impedes progress in this research is the limited availability of annotated data and the lack of common labeling methods. We investigate the applicability of transfer learning (domain adaptation) to address these impediments in experiments with three publicly available datasets: FinCausal, SCITE, and Organizational. We perform pairwise transfer experiments between the datasets using DistilBERT, BERT, and SpanBERT (variants of BERT) and measure the performance of the resulting models. To understand the relationship between datasets and performance, we measure the differences between the vocabulary distributions of the datasets using four methods: Kullback–Leibler (K–L) divergence, Wasserstein metric, Maximum Mean Discrepancy, and Kolmogorov–Smirnov test. We also estimate the predictive capability of each method using linear regression and record the resulting predictive values. Our results show that the K–L divergence between the vocabulary distributions of the datasets predicts the transfer learning performance with R2 = 0.746. Surprisingly, the predictive value of the Wasserstein distance is low (R2 = 0.52912), as is that of the Kolmogorov–Smirnov test (R2 = 0.40025979). This is confirmed in a series of experiments. For example, with variants of BERT, we observe an almost 29% to 32% increase in the macro-average F1-score when the gap between the training and test distributions is small according to the K–L divergence, the best-performing predictor on this task. We also discuss these results in the context of the sub-par performance of some large language models on causality extraction tasks. Finally, we report the results of transfer learning informed by K–L divergence; namely, we show that there is a 12% to 63% increase in performance when a small portion of the test data is added to the training data. This shows that corpus expansion and n-shot learning benefit when the examples are chosen to maximize their information content according to the K–L divergence.
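
To make the measurement concrete, below is a minimal sketch (not the authors' released code) of the kind of computation the abstract describes: estimating the K–L divergence between the unigram vocabulary distributions of a source and a target corpus, then regressing transfer F1 scores on the divergence to obtain an R2 value. The toy corpora, the add-one smoothing, and the (divergence, F1) pairs are illustrative assumptions, not data from the paper.

```python
# Sketch: K-L divergence between vocabulary distributions of two corpora,
# followed by a linear fit of transfer F1 on the divergence.
from collections import Counter
import numpy as np


def vocab_distribution(texts, vocab):
    """Smoothed unigram probability distribution over a shared vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    freqs = np.array([counts[w] for w in vocab], dtype=float) + 1.0  # add-one smoothing
    return freqs / freqs.sum()


def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes p and q are strictly positive and sum to 1."""
    return float(np.sum(p * np.log(p / q)))


# Toy source/target sentences standing in for, e.g., FinCausal vs. SCITE text.
source_texts = ["the merger caused a sharp drop in revenue",
                "losses were caused by the currency devaluation"]
target_texts = ["smoking causes lung cancer in many patients",
                "the mutation leads to abnormal cell growth"]

shared_vocab = sorted(set(tok for t in source_texts + target_texts
                          for tok in t.lower().split()))
p = vocab_distribution(source_texts, shared_vocab)
q = vocab_distribution(target_texts, shared_vocab)
print(f"K-L divergence (source || target): {kl_divergence(p, q):.4f}")

# Ordinary least-squares fit of macro-F1 on divergence across transfer pairs,
# mirroring the abstract's use of linear regression to estimate predictive power.
divergences = np.array([0.15, 0.42, 0.60, 0.88])   # hypothetical pairwise K-L values
f1_scores = np.array([0.71, 0.58, 0.49, 0.37])     # hypothetical transfer F1 scores
slope, intercept = np.polyfit(divergences, f1_scores, 1)
predicted = slope * divergences + intercept
ss_res = np.sum((f1_scores - predicted) ** 2)
ss_tot = np.sum((f1_scores - f1_scores.mean()) ** 2)
print(f"R^2 of the linear fit: {1 - ss_res / ss_tot:.3f}")
```

The smoothing step matters because K–L divergence is undefined when the target distribution assigns zero probability to a word that appears in the source corpus; any smoothing scheme (add-one is used here only for simplicity) avoids that degenerate case.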