OntoSeg: A Novel Approach to Text Segmentation Using Ontological Similarity

2015 IEEE International Conference on Data Mining Workshop (ICDMW). Pub Date: 2015-11-14. DOI: 10.1109/ICDMW.2015.6

Mostafa Bayomi, Killian Levacher, M. R. Ghorab, S. Lawless


Citations: 14

Abstract

Text segmentation (TS) aims to divide long text into coherent segments that reflect the subtopic structure of the text. It benefits many natural language processing (NLP) tasks, such as Information Retrieval (IR) and document summarisation. Current approaches to text segmentation are similar in that they all use word-frequency metrics to measure the similarity between two regions of text, so that a document is segmented based on the lexical cohesion between its words. Various NLP tasks, such as ontology-based IR systems, are now moving towards the semantic web and ontologies to capture the conceptualizations associated with user needs and contents. Text segmentation based on lexical cohesion between words is hence no longer sufficient for such tasks. This paper proposes OntoSeg, a novel approach to text segmentation based on the ontological similarity between text blocks. The proposed method uses ontological similarity to explore conceptual relations between text segments, and a Hierarchical Agglomerative Clustering (HAC) algorithm to represent the text as a conceptually structured, tree-like hierarchy. The rich structure of the resulting tree further allows the text to be segmented linearly at various levels of granularity. The proposed method was evaluated on a well-known dataset, and the results show that using ontological similarity in text segmentation is very promising. We also enhance the proposed method by combining ontological similarity with lexical similarity, and the results show an improvement in segmentation quality.
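
The general idea of clustering text blocks into a tree and then reading off linear segmentations at different granularities can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the clustering details, so the sketch assumes that each block is already reduced to a set of concept labels, uses Jaccard overlap as a stand-in for an ontological similarity measure, and assumes that only adjacent segments may be merged so that every level of the hierarchy is a linear segmentation. The names `jaccard`, `adjacent_hac`, and `segment` are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Block = Set[str]            # a text block reduced to a set of concept labels (assumption)
Span = Tuple[int, int]      # half-open range of block indices [start, end)


def jaccard(a: Block, b: Block) -> float:
    """Stand-in similarity; a real system would plug in an ontology-based measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adjacent_hac(
    blocks: List[Block], sim: Callable[[Block, Block], float] = jaccard
) -> List[List[Span]]:
    """Agglomerative clustering restricted to adjacent segments.

    Returns every level of the hierarchy, from one segment per block
    (finest) down to a single segment covering the whole text (coarsest).
    """

    def concepts(span: Span) -> Block:
        # Union of the concept sets of all blocks in the span.
        merged: Block = set()
        for block in blocks[span[0]:span[1]]:
            merged |= block
        return merged

    segments: List[Span] = [(i, i + 1) for i in range(len(blocks))]
    levels = [list(segments)]
    while len(segments) > 1:
        # Pick the most similar pair of neighbouring segments (first one on ties).
        best = max(
            range(len(segments) - 1),
            key=lambda i: sim(concepts(segments[i]), concepts(segments[i + 1])),
        )
        merged_span = (segments[best][0], segments[best + 1][1])
        segments = segments[:best] + [merged_span] + segments[best + 2:]
        levels.append(list(segments))
    return levels


def segment(
    blocks: List[Block], k: int, sim: Callable[[Block, Block], float] = jaccard
) -> List[Span]:
    """Cut the hierarchy at the level that contains exactly k segments."""
    if not 1 <= k <= len(blocks):
        raise ValueError("k must be between 1 and the number of blocks")
    return adjacent_hac(blocks, sim)[len(blocks) - k]


# Toy input: two blocks about animals followed by two about vehicles.
blocks = [
    {"dog", "animal"},
    {"cat", "animal"},
    {"car", "vehicle"},
    {"truck", "vehicle"},
]
two_segments = segment(blocks, 2)
print(two_segments)
```

Asking for a different `k` reuses the same hierarchy, which is the sense in which one tree yields segmentations at several granularities. Swapping `jaccard` for another function with the same signature is where an ontology-based or combined ontological and lexical measure would go.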

