Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application to IR

Proceedings of the 2015 International Conference on The Theory of Information Retrieval. Pub Date: 2015-07-29. DOI: 10.1145/2808194.2809458

Casper Petersen, C. Lioma, J. Simonsen, Birger Larsen


Citations: 14



Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application to IR

We present two novel models of document coherence and their application to information retrieval (IR). Both models approximate document coherence using discourse entities, e.g. the subject or object of a sentence. Our first model views text as a Markov process generating sequences of discourse entities (entity n-grams); we use the entropy of these entity n-grams to approximate the rate at which new information appears in text, reasoning that as more new words appear, the topic increasingly drifts and text coherence decreases. Our second model extends the work of Guinaudeau & Strube [28] that represents text as a graph of discourse entities, linked by different relations, such as their distance or adjacency in text. We use several graph topology metrics to approximate different aspects of the discourse flow that can indicate coherence, such as the average clustering or betweenness of discourse entities in text. Experiments with several instantiations of these models show that: (i) our models perform on a par with two other well-known models of text coherence even without any parameter tuning, and (ii) reranking retrieval results according to their coherence scores gives notable performance gains, confirming a relation between document coherence and relevance. This work contributes two novel models of document coherence, the application of which to IR complements recent work in the integration of document cohesiveness or comprehensibility to ranking [5, 56].
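The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes discourse entities have already been extracted as plain strings, it uses the Shannon entropy of the empirical entity n-gram distribution as a stand-in for the entropy-based model, and it links two entities whenever they occur in the same sentence as a stand-in for the graph-based model. The function names `entity_ngram_entropy` and `average_clustering` and the same-sentence edge rule are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def entity_ngram_entropy(entities, n=2):
    """Shannon entropy (in bits) of the empirical entity n-gram distribution.

    `entities` is the document's discourse entities in reading order.
    Assumption: a plain empirical estimate, not the paper's exact formulation.
    """
    ngrams = [tuple(entities[i:i + n]) for i in range(len(entities) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def average_clustering(sentences):
    """Average local clustering coefficient of an entity co-occurrence graph.

    `sentences` is a list of sentences, each a list of entity strings.
    Assumption: two entities are linked if they share a sentence; nodes with
    fewer than two neighbours contribute a coefficient of 0.
    """
    neighbours = {}
    for sentence in sentences:
        entities = sorted(set(sentence))
        for entity in entities:
            neighbours.setdefault(entity, set())
        for u, v in combinations(entities, 2):
            neighbours[u].add(v)
            neighbours[v].add(u)
    if not neighbours:
        return 0.0
    total = 0.0
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist among this node's neighbours.
        links = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in neighbours[u]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(neighbours)
```

Under these assumptions, a document that keeps returning to the same entity sequences yields a lower n-gram entropy than one that keeps introducing new ones. Scores of this kind could then be combined with a retrieval score for reranking, but how that combination is done is left open here.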

