Extract of Japanese Text Characteristics of Simplified Corpora using Non-negative Matrix Factorization

J. Data Intell. Pub Date : 2020-03-01 DOI:10.26421/JDI1.1-5
Koji Wajima, Kei Koqure, Toshihiro Furukawa, T. Satoh

Abstract

The ways of disseminating information through different media (Verbreitungsmedien) have changed rapidly owing to technological progress, especially in information and communication technologies. Reflecting this progress, communication methods and abilities have changed as well. On the Internet, content of differing difficulty is mixed together even when it covers almost the same subject matter. A user searching for new or unfamiliar topics may become confused and spend a lot of time selecting content they can understand, because there is a large amount of similar content at different levels of difficulty. Hence, the characteristics of the relevant simplified corpora matter to everyone. In this research, we propose a method that compares two types of documents of different difficulty and selects, from the various characteristics of a text, those related to simplicity of expression. In the proposed method, thousands of text characteristics are compressed and transformed by Non-negative Matrix Factorization (NMF), and a basis that characterizes the simplified documents is selected. The method combines the characteristics used most often in prior research, covering 32 types and 2,196 dimensions. We evaluated the text characteristics in the resulting NMF bases using a classifier. Applying the proposed method to two kinds of environmental white papers showed that an effective basis can be selected. In addition, we present an estimate of the causal relationships and an optimization of the parameter. Furthermore, we show that the method is flexible enough to apply to other media.
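The compression and basis-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy document-by-feature matrix `X` (the paper uses 2,196 dimensions), a rank of `k=2`, the standard multiplicative-update rule for NMF, and a simple gap in mean activation between the two document groups as a stand-in for the classifier-based evaluation. All function names and data below are hypothetical, and the code is kept dependency-free only for self-containment; in practice a library routine such as scikit-learn's `NMF` would normally be used.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative X (docs x features) into W (docs x k) and H (k x features)
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

def most_discriminative_basis(w, labels):
    """Return the basis index whose mean activation differs most between
    simplified (label 1) and original (label 0) documents, plus all gaps.
    Assumption: this gap replaces the classifier used in the paper."""
    def mean(col, lab):
        vals = [row[col] for row, l in zip(w, labels) if l == lab]
        return sum(vals) / len(vals)
    gaps = [abs(mean(j, 1) - mean(j, 0)) for j in range(len(w[0]))]
    return max(range(len(gaps)), key=gaps.__getitem__), gaps

# Hypothetical toy data: 6 documents x 4 text features (non-negative counts).
X = [
    [5, 1, 0, 0],
    [4, 1, 0, 1],
    [5, 0, 1, 0],
    [0, 1, 4, 5],
    [1, 0, 5, 4],
    [0, 1, 5, 5],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = simplified document, 0 = original document

w, h = nmf(X, k=2)
best, gaps = most_discriminative_basis(w, labels)
print("selected basis:", best)
print("feature weights of that basis:", [round(v, 3) for v in h[best]])
```

The row of `H` for the selected basis shows which original text features it loads on, which is what makes a selected basis interpretable as a characteristic of simplified text.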