Byunghwa Yoo, Kyung-Joong Kim
Natural Language Processing Journal, Vol. 6, Article 100061 (published 2024-02-20)
DOI: 10.1016/j.nlp.2024.100061
Open access: https://www.sciencedirect.com/science/article/pii/S2949719124000098
Improving paragraph segmentation using BERT with additional information from probability density function modeling of segmentation distances
Paragraphs play a key role in writing and reading texts. Studies on dividing texts into appropriate paragraphs, known as paragraph segmentation, have therefore attracted academic attention for a long time. Recent pre-trained language models have achieved state-of-the-art performance in various natural language processing tasks, including paragraph segmentation. However, paragraph segmentation methods based on pre-trained language models cannot take statistical metadata into account, such as how far apart consecutive segmentation points should be. We therefore focus on combining paragraph segmentation distances with a pre-trained language model, so that statistical metadata and state-of-the-art representation ability are considered at the same time. We propose a novel model that modifies BERT, a state-of-the-art pre-trained language model, by adding segmentation distance information through probability density function modeling. Our model was trained and tested on the novel domain and outperformed both baseline BERT and a previous study, achieving a mean F1-score of 0.8877 and a mean MCC of 0.8708. Furthermore, its performance was robust across novels by different authors.
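The abstract does not specify which density family is used or how the distance information is injected into BERT; as a rough, hypothetical sketch (all function names invented here, not the paper's API), one could fit a Gaussian density to the distances observed between consecutive paragraph breaks in the training novels and blend that prior with a classifier's per-sentence break probabilities:

```python
import math

def fit_gaussian_pdf(distances):
    # Fit a Gaussian density to observed inter-break distances
    # (number of sentences between consecutive paragraph breaks).
    n = len(distances)
    mu = sum(distances) / n
    var = sum((d - mu) ** 2 for d in distances) / n
    sigma = math.sqrt(var)

    def pdf(x):
        return math.exp(-((x - mu) ** 2) / (2 * var)) / (sigma * math.sqrt(2 * math.pi))

    return pdf

def distance_adjusted_scores(break_probs, pdf, weight=0.7):
    # break_probs[k] = a classifier's P(break) for the sentence k+1 positions
    # after the most recent paragraph break; blend each probability with the
    # density of that distance, so breaks at typical distances are favored.
    return [weight * p + (1 - weight) * pdf(k + 1)
            for k, p in enumerate(break_probs)]
```

This post-hoc interpolation is only illustrative; the paper's model incorporates the distance information into BERT itself rather than rescoring its outputs.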