Comparative Document Analysis for Large Text Corpora

Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han
Published in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining
DOI: 10.1145/3018661.3018690 (https://doi.org/10.1145/3018661.3018690)
Publication date (as listed): 2015-10-25
Citation count: 20

Abstract

This paper presents a novel research problem, Comparative Document Analysis (CDA): the joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases that summarize the commonalities of both documents and highlight the distinctions of each with respect to the other, informatively and concisely. Our solution uses a general graph-based framework to derive novel measures of phrase semantic commonality and pairwise distinction, where the background corpus is used to compute phrase-document semantic relevance. We use these measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of the phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework for comparing documents. Analysis of a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly with corpus size. A case study comparing news articles published on different dates shows the power of the proposed method for comparing sets of documents.
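
To make the task concrete, the following is a minimal sketch of what CDA produces: one set of phrases common to both documents and one set of distinct phrases per document. It is not the paper's method. The graph-based measures and the jointly learned phrase-document relevance are replaced by assumed stand-ins: relevance scores are taken as given inputs, commonality is approximated by the smaller of the two relevance scores, and distinction by their difference. The function name `compare_documents`, the parameter `k`, and the example scores are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def compare_documents(
    rel_a: Dict[str, float],
    rel_b: Dict[str, float],
    k: int = 2,
) -> Tuple[List[str], List[str], List[str]]:
    """Toy stand-in for CDA phrase selection (assumed heuristics, not the paper's measures).

    rel_a / rel_b map each candidate phrase to an assumed, precomputed
    relevance score for document A / document B.
    Returns (common phrases, phrases distinct to A, phrases distinct to B).
    """
    phrases = sorted(set(rel_a) | set(rel_b))

    def top(score: Callable[[str], float]) -> List[str]:
        # Rank by descending score, break ties alphabetically,
        # and keep only phrases with a strictly positive score.
        ranked = sorted(phrases, key=lambda p: (-score(p), p))
        return [p for p in ranked if score(p) > 0][:k]

    # Assumption: a phrase is "common" only if it is relevant to both documents.
    common = top(lambda p: min(rel_a.get(p, 0.0), rel_b.get(p, 0.0)))
    # Assumption: a phrase is "distinct" to a document if it is more relevant there.
    distinct_a = top(lambda p: rel_a.get(p, 0.0) - rel_b.get(p, 0.0))
    distinct_b = top(lambda p: rel_b.get(p, 0.0) - rel_a.get(p, 0.0))
    return common, distinct_a, distinct_b


# Hypothetical relevance scores for two documents.
doc_a = {"text mining": 0.9, "phrase mining": 0.8, "graph model": 0.5}
doc_b = {"text mining": 0.8, "topic model": 0.9, "graph model": 0.5}

common, only_a, only_b = compare_documents(doc_a, doc_b, k=2)
print("common:", common)
print("distinct to A:", only_a)
print("distinct to B:", only_b)
```

In this sketch the relevance scores are fixed inputs, whereas the abstract describes learning them from the background corpus together with the phrase selection, so the example shows only the shape of the output.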