文本文档的两部分分割

Proceedings of the 21st ACM international conference on Information and knowledge management Pub Date : 2012-10-29 DOI:10.1145/2396761.2396862

Deepak P, Karthik Venkat Ramanan, N. Wiratunga, Sadiq Sani

{"title":"文本文档的两部分分割","authors":"Deepak P, Karthik Venkat Ramanan, N. Wiratunga, Sadiq Sani","doi":"10.1145/2396761.2396862","DOIUrl":null,"url":null,"abstract":"We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the component two parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises of words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Two-part segmentation of text documents\",\"authors\":\"Deepak P, Karthik Venkat Ramanan, N. Wiratunga, Sadiq Sani\",\"doi\":\"10.1145/2396761.2396862\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the component two parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises of words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.\",\"PeriodicalId\":313414,\"journal\":{\"name\":\"Proceedings of the 21st ACM international conference on Information and knowledge management\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-10-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st ACM international conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2396761.2396862\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM international conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2396761.2396862","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

我们考虑分割具有两部分结构的文本文档的问题，例如问题部分和解决方案部分。这种类型的文档包括事件报告，通常包括与问题相关的事件描述，然后是与所尝试的解决方案相关的描述。将这些文档分割成组件的两个部分将使它们在知识重用框架(如基于案例的推理)中可用。由于词法上的相关性，这一切分问题给传统的文本切分提出了一个难题。我们开发了一种两部分分割技术，该技术可以利用相似文档的语料库分别使用语言模型和翻译模型来建模两个部分的行为及其相互关系。特别是，我们为问题和解决方案片段类型使用单独的语言模型，而片段类型之间的相互关系是使用IBM Model 1翻译模型建模的。我们将文档建模为从问题部分(包括从问题语言模型中采样的单词)开始生成的文档，然后是解决方案部分(其单词从解决方案语言模型或以问题部分中已经选择的单词为条件的翻译模型中采样)。我们通过对真实世界数据的大量实验表明，我们的方法在分割的准确性方面优于最先进的文本分割算法，并且这种提高的准确性可以很好地转化为基于案例的推理系统中可用性的提高。我们还分析了我们的技术对不同数量和类型的噪声的鲁棒性，并通过经验说明我们的技术具有相当的噪声容忍性，并且随着噪声的增加而优雅地退化。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Two-part segmentation of text documents

We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the component two parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises of words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 21st ACM international conference on Information and knowledge management

自引率

0.00%

发文量