A Domain-Independent Text Segmentation Method for Educational Course Content

Yuwei Tu, Ying Xiong, Weiyu Chen, Christopher G. Brinton
Published in: 2018 IEEE International Conference on Data Mining Workshops (ICDMW)
Publication date: 2018-11-01
DOI: 10.1109/ICDMW.2018.00053 (https://doi.org/10.1109/ICDMW.2018.00053)
Citations: 4

Abstract

In this study, we propose a domain-independent text segmentation algorithm that is particularly useful for online educational courses. Text segmentation has been shown to improve the readability of large corpora of documents, which is essential in education scenarios. While existing domain-dependent text segmentation methods perform much better than domain-independent methods in most cases, only domain-independent methods are applicable to the sparse training content found in education scenarios. Unlike domain-dependent methods, our method does not require heavy training on prior documents; it needs to be trained only on the current corpus of documents, using topic distributions and word vector representations. The method identifies boundaries between small text units in three steps. First, we compute input text features from topic distributions (latent Dirichlet allocation) and word embeddings (GloVe). Second, we calculate similarity values between these textual features and detect distribution changes among the similarities. Third, we cluster the similarities and detect sub-topic boundaries from cluster differences. We test our method on two datasets: one from an online education course and one public benchmark, the widely used Choi dataset. The results demonstrate that our method outperforms other state-of-the-art domain-independent text segmentation approaches while achieving performance comparable to a few domain-dependent algorithms.
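The second and third steps can be illustrated with a small sketch. It is not the authors' implementation: it assumes each text unit already has a feature vector (for example, an LDA topic distribution concatenated with an averaged GloVe embedding), uses cosine similarity between adjacent units, and substitutes a simple one-dimensional 2-means clustering for the paper's clustering step, whose details the abstract does not give. All function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacent_similarities(features):
    """Similarity between each text unit and the unit that follows it."""
    return [cosine(features[i], features[i + 1]) for i in range(len(features) - 1)]


def two_means_1d(values, iters=50):
    """Assumed stand-in for the clustering step: 1-D 2-means.

    Returns the centers of the low-similarity and high-similarity clusters.
    """
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def detect_boundaries(features):
    """Return indices i such that a segment boundary falls before unit i."""
    sims = adjacent_similarities(features)
    if not sims:
        return []
    lo, hi = two_means_1d(sims)
    if lo == hi:  # all similarities identical: no evidence of a topic change
        return []
    # A gap whose similarity belongs to the low cluster is treated as a boundary.
    return [i + 1 for i, s in enumerate(sims) if abs(s - lo) <= abs(s - hi)]


# Toy feature vectors (hypothetical): three units on one topic, then three on another.
toy_features = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.15, 0.00],
    [0.10, 0.10, 0.80],
    [0.00, 0.20, 0.80],
    [0.05, 0.10, 0.85],
]

print(detect_boundaries(toy_features))
```

On the toy data, only the gap between the third and fourth units falls in the low-similarity cluster, so the sketch reports a single boundary before the fourth unit. A real pipeline would replace `toy_features` with features computed from the corpus and would likely need a more robust clustering choice than two fixed clusters.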