USE OF THE STATISTICAL MODEL OF COHERENCE OF CONNECTED TEXT AS AN ADDITIONAL TOOL OF QUANTITATIVE CONTENT ANALYSIS

I. Shevchenko, Pavlo Andreiev, N. Khairova, Maiia Dernova
Journal: Transactions of Kremenchuk Mykhailo Ostrohradskyi National University
Publication date: 2021-10-27
Publication type: Journal Article
DOI: 10.30929/1995-0519.2021.5.62-67 (https://doi.org/10.30929/1995-0519.2021.5.62-67)
Citations: 1

Abstract

Purpose. We treat the language system as a set of subsystems organized into a semiotic hierarchy, in which the content of higher-level units is not fully reducible to the content of lower-level units. Consequently, the meaning of a higher-level unit cannot always be «calculated» from the meanings of lower-level units and the relationships between them. At the same time, the structural model of the language system relies on thematic or semantic features of connectivity between units at the same level of the hierarchy. This opens up certain possibilities for quantitative content analysis.

Methodology. Reviewing known works, we found that none of them analyzes paragraphs as independent structural units of the text. A paragraph usually reveals one micro-theme, which develops the theme of the whole text. We hypothesize that if the text under study is coherent, with a certain topic acting as a leitmotif, then the frequencies of certain words should change from one paragraph to the next according to certain patterns. The aim of this work is to study whether the coherence of the frequency characteristics of paragraphs can be used to identify keywords and the satellite words that surround them (context sets).

Results. To achieve this aim, the following tasks were solved: development of a text model suited to paragraph-by-paragraph analysis of the dynamics of relative frequencies; development of a method of paragraph-by-paragraph text analysis; and testing of the developed method on a collection of documents.

Originality. A text representation model has been developed that differs from existing ones in that it includes a set of the most common words, a set of keywords, a set of satellite words, and the intersection of the sets of paragraphs, keywords, and satellite words. This provides a formal basis for a method that analyzes the dynamics of the relative frequencies of the most common words in the text and identifies keywords and context sets. A method of text analysis has been developed that differs from existing ones in that it is based on detecting positive correlations between the relative frequencies with which a subset of the most frequent words occurs in paragraphs. This makes it possible to identify keywords and context subsets both in texts that have some coherence and in individual paragraphs of text that have weak coherence.

Practical value. A set of Ukrainian-language, Russian-language, and English-language scientific and technical texts was assembled to test the effectiveness of the text analysis method. The set includes scientific and technical articles on various topics and fragments of textbooks. For the articles, the keywords detected by machine analysis were compared with the authors' own keyword sets. For the textbook fragments, experts were engaged to determine the keyword sets. Comparing the authors' and experts' keyword sets with the sets produced by the proposed method demonstrated its effectiveness. The match ranged from 50 % to 90 %, allowing for the fact that the authors' sets contained phrases, whereas the machine sets listed the elements of these phrases separately. The developed method can be used as an auxiliary tool for content analysis of connected texts.

References: 15.
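The core idea of the method, detecting positive correlations between the per-paragraph relative frequencies of the most frequent words, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tokenization, the number of frequent words considered, the correlation measure, or any threshold. The sketch assumes paragraphs separated by blank lines, a Pearson correlation coefficient, and hypothetical parameters `top_n`, `stopwords`, and `threshold`; the function names are likewise illustrative.

```python
import math
import re
from collections import Counter


def paragraph_frequencies(text, top_n=10, stopwords=frozenset()):
    """Return (words, matrix), where matrix[i][j] is the relative
    frequency of words[j] in paragraph i.

    Assumptions (not from the paper): paragraphs are separated by blank
    lines, tokens are lowercase runs of word characters, and `words` are
    the `top_n` most frequent non-stopword tokens of the whole text.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    tokenized = [
        [t for t in re.findall(r"\w+", p.lower()) if t not in stopwords]
        for p in paragraphs
    ]
    totals = Counter(tok for toks in tokenized for tok in toks)
    words = [w for w, _ in totals.most_common(top_n)]
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        size = len(toks) or 1  # guard against a paragraph with no tokens left
        matrix.append([counts[w] / size for w in words])
    return words, matrix


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def positively_correlated_pairs(words, matrix, threshold=0.5):
    """List (word_a, word_b, r) for word pairs whose relative frequencies
    rise and fall together across paragraphs, strongest first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    columns = list(zip(*matrix))  # one frequency series per word
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            r = pearson(columns[i], columns[j])
            if r >= threshold:
                pairs.append((words[i], words[j], r))
    return sorted(pairs, key=lambda p: -p[2])
```

Under these assumptions, words that appear in many strongly correlated pairs would be candidates for keywords, and their partners candidates for the surrounding context set. How the method actually separates keywords from satellite words is not stated in the abstract, so that step is left out of the sketch.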