Learning conditional dependence graph for concepts via matrix normal graphical model

IF 0.7 · Zone 4 (Mathematics) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Statistics and Its Interface · Pub Date: 2024-02-01 · DOI: 10.4310/23-sii784

Jizheng Lai, Jianxin Yin


Citations: 0

Abstract

Conditional dependence relationships among random vectors have been studied extensively and applied broadly. It is less clear, however, how to construct a dependence graph for unstructured data such as concept words or phrases in a text corpus, where the variables (concepts) are not jointly observed as i.i.d. samples. Using a global embedding method such as GloVe, we obtain ‘structured’ representation vectors for the concepts. We then assume that all the concept vectors jointly follow a matrix normal distribution with sparse precision matrices. Given the observed word-word co-occurrence matrix and the GloVe construction procedure, this assumption can be tested empirically, and the asymptotic distribution of the test statistics is derived. A further advantage of the matrix normal assumption is that the linearly additive property seen in word analogy tasks arises naturally. Unlike knowledge graph methods, the conditional dependence graph describes the dependence structure between concepts given all other concepts, so concepts (nodes) linked by an edge cannot be separated by the remaining concepts. Such an edge represents an essential semantic relationship. There is no need to enumerate all related pairs as the head and tail elements of triplets, as in the knowledge graph setting, and the only relation type in the graph is conditional dependence between concepts. A penalized matrix normal graphical model (MNGM) is then used to learn the conditional dependence graph for both the concepts and the embedding ‘dimensions’. Because the concept words are the nodes of the graph and their number is very large, we use the MDMC optimization method to speed up the graphical lasso (glasso) algorithm. The algorithm also adapts to the incremental accumulation of new concepts in the text corpus. In addition, we propose a sentence-granularity bootstrap that yields ‘independent’ replicates of the sample to enhance the penalized MNGM algorithm. We call the proposed method Matrix-GloVe.
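
The sentence-granularity bootstrap mentioned above can be illustrated with a minimal sketch. It assumes that a replicate is formed by resampling whole sentences with replacement and recounting word-word co-occurrences within each resampled sentence; the abstract does not spell out these details. The function names, the toy corpus, the window size, and the use of unweighted counts are all hypothetical choices made for illustration, and the GloVe fitting step that would turn each replicate into an embedding matrix is omitted.

```python
import random
from collections import Counter


def bootstrap_sentences(sentences, rng):
    """Draw len(sentences) sentences with replacement (one bootstrap replicate)."""
    return rng.choices(sentences, k=len(sentences))


def cooccurrence(sentences, window=2):
    """Symmetric word-word co-occurrence counts inside a fixed window.

    Counts never cross a sentence boundary, so the sentence is the
    resampling unit. Plain counts are used here (an assumption); GloVe
    itself typically down-weights distant pairs.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


# Toy corpus (hypothetical), already tokenized into sentences.
corpus = [
    ["matrix", "normal", "distribution"],
    ["sparse", "precision", "matrix"],
    ["graphical", "lasso", "estimates", "precision", "matrix"],
    ["glove", "uses", "cooccurrence", "counts"],
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [cooccurrence(bootstrap_sentences(corpus, rng)) for _ in range(5)]

# Each replicate would then be passed to an embedding method such as GloVe
# to obtain one embedding matrix per replicate (not shown here).
for b, counts in enumerate(replicates):
    print(b, sum(counts.values()) // 2, "co-occurring pairs")
```

Resampling at the sentence level keeps within-sentence word order intact, which is why the co-occurrence counts are recomputed per replicate rather than resampled directly.
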
In simulation studies, we verify that the graph learned by Matrix-GloVe is better suited to Graph Convolutional Networks (GCN) than a correlation graph, i.e. a graph determined by the k-NN method. We apply the proposed method to real data in two scenarios. The first is concept graph learning for concepts in a textbook corpus, where two tasks are studied. One compares the vectors produced by GloVe with those produced by the word2vec methods CBOW and Skip-Gram when each set of vectors is passed to the penalized MNGM; the other is link prediction among the concepts. Matrix-GloVe performs better on both tasks. In the second scenario, Matrix-GloVe is applied to a downstream method, namely GCN. For node classification on the BBC and BBCSport datasets, both GCN with Matrix-GloVe and GCN with Matrix-GloVe plus DeepWalk outperform GCN with k-NN.
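
To see why a conditional dependence graph can differ from a correlation graph, the following minimal sketch uses a four-variable Gaussian chain. It is a textbook illustration and not the authors' simulation design: the chain covariance, the correlation threshold of 0.3 (standing in for a similarity-based graph such as k-NN), and the helper names are assumptions. Edges of the conditional dependence graph are read from the nonzero off-diagonal entries of the precision (inverse covariance) matrix.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Chain X0 - X1 - X2 - X3: correlation decays as rho^|i-j| (assumed toy model).
rho, n = 0.6, 4
cov = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]
prec = invert(cov)


def partial_corr(i, j):
    """Partial correlation of X_i and X_j given all other variables."""
    return -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])


pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
# Correlation graph: link pairs whose marginal correlation exceeds a threshold.
corr_edges = {(i, j) for i, j in pairs if abs(cov[i][j]) > 0.3}
# Conditional dependence graph: link pairs with nonzero partial correlation.
cond_edges = {(i, j) for i, j in pairs if abs(partial_corr(i, j)) > 1e-8}

print("correlation graph:", sorted(corr_edges))
print("conditional dependence graph:", sorted(cond_edges))
```

In this toy model the correlation graph links X0 and X2 because they are marginally correlated, while the conditional dependence graph keeps only the chain edges, since X1 separates X0 from X2.
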


Source journal

Statistics and Its Interface (MATHEMATICAL & COMPUTATIONAL BIOLOGY; MATHEMATICS, INTERDISCIPLINARY APPLICATIONS)

CiteScore

0.90

Self-citation rate

12.50%

Articles published

Review time

6 months

Journal description: Exploring the interface between the field of statistics and other disciplines, including but not limited to: biomedical sciences, geosciences, computer sciences, engineering, and social and behavioral sciences. Publishes high-quality articles in broad areas of statistical science, emphasizing substantive problems, sound statistical models and methods, clear and efficient computational algorithms, and insightful discussions of the motivating problems.