Automatic Indexing and Creating Semantic Networks for Agricultural Science Papers in the Polish Language

P. Wrzeciono, W. Karwowski
2013 IEEE 37th Annual Computer Software and Applications Conference Workshops, published 2013-07-22
DOI: 10.1109/COMPSACW.2013.63 (https://doi.org/10.1109/COMPSACW.2013.63)
Citations: 10

Abstract

This paper presents an automatic indexing system based on text analysis that groups words and reduces them to their dictionary forms. The system, developed with the help of an inflection dictionary of the Polish language, is designed to store and retrieve scientific papers on agriculture. During the analysis, auxiliary words such as pronouns and conjunctions were omitted. Words not present in the inflection dictionary were used to create a dictionary of new terms. The words stored in this dictionary were then used to extract agricultural terms, which could be located in the AGROVOC thesaurus. For each analyzed paper, a set of concepts with assigned weights was created. For each stored paper, an "artificial sentence" was also generated on the basis of the frequency of the dictionary forms of the words appearing in the text and each word's grammatical category. The "artificial sentence" and the sets of terms were used to find relationships between the papers stored in the system. These relationships are used by the algorithm that searches for articles matching a query. The number of correct results was observed to depend on the number of words in the paper. If a work consisted of at least a thousand words, the probability of misdiagnosing its content was no higher than 5%. For short texts, such as abstracts, the probability of misdiagnosis was much higher, approximately 23%. The results obtained with the presented system are more accurate than those obtained by standard search engines. The method can also be applied to other natural languages with extensive inflection systems. The presented solution is a continuation of work carried out under a grant [N N310 038538].
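The indexing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny `INFLECTION` table, the `STOP_WORDS` set, the relative-frequency weights, the rule for building the "artificial sentence" (the most frequent lemma of each grammatical category, in a fixed order), and the Jaccard measure in `similarity` are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter

# Hypothetical toy inflection dictionary: surface form -> (lemma, category).
# A real system would load a full Polish inflection dictionary instead.
INFLECTION = {
    "gleba": ("gleba", "noun"),
    "gleby": ("gleba", "noun"),
    "glebie": ("gleba", "noun"),
    "nawożenie": ("nawożenie", "noun"),
    "wpływa": ("wpływać", "verb"),
    "żyzna": ("żyzny", "adj"),
    "żyznej": ("żyzny", "adj"),
}
# Assumed list of auxiliary words (conjunctions, prepositions, pronouns).
STOP_WORDS = {"i", "na", "w", "to", "się", "oraz"}


def index_paper(text):
    """Return (lemma weights, new terms, artificial sentence) for one text."""
    lemma_counts = Counter()
    categories = {}
    new_terms = Counter()
    for raw in text.lower().split():
        token = raw.strip(".,;:()\"")
        if not token or token in STOP_WORDS:
            continue  # auxiliary words are omitted from the analysis
        if token in INFLECTION:
            lemma, category = INFLECTION[token]
            lemma_counts[lemma] += 1
            categories[lemma] = category
        else:
            new_terms[token] += 1  # unknown word: candidate new term
    total = sum(lemma_counts.values()) or 1
    # Assumed weighting: relative frequency of each lemma.
    weights = {lemma: n / total for lemma, n in lemma_counts.items()}
    # Assumed rule: most frequent lemma per category, ties broken alphabetically.
    sentence = []
    for category in ("adj", "noun", "verb"):
        candidates = [(-n, l) for l, n in lemma_counts.items()
                      if categories[l] == category]
        if candidates:
            sentence.append(min(candidates)[1])
    return weights, new_terms, sentence


def similarity(weights_a, weights_b):
    """Jaccard overlap of two lemma sets (an assumed relatedness measure)."""
    a, b = set(weights_a), set(weights_b)
    return len(a & b) / len(a | b) if a | b else 0.0


weights, new_terms, sentence = index_paper(
    "Nawożenie wpływa na gleby. Żyzna gleba i azot.")
print(sentence)         # ['żyzny', 'gleba', 'wpływać']
print(dict(new_terms))  # {'azot': 1}
```

In this sketch the word "azot" is absent from the toy dictionary, so it lands in the new-terms collection; in the described system such words are the candidates that are then looked up in the AGROVOC thesaurus.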