Extract of Japanese Text Characteristics of Simplified Corpora using Non-negative Matrix Factorization

J. Data Intell. Pub Date: 2020-03-01 DOI: 10.26421/JDI1.1-5

Koji Wajima, Kei Koqure, Toshihiro Furukawa, T. Satoh


Abstract

The ways in which information is disseminated through different media (Verbreitungsmedien) have changed rapidly owing to technological progress, especially in the field of information and communication technologies. Reflecting this progress, communication methods and abilities have also changed. On the Internet, content that covers almost the same material is mixed together at different levels of difficulty. A user searching for new or unfamiliar topics may become confused and spend a lot of time selecting content they can understand, because there are large amounts of similar content at different difficulty levels. The characteristics of the relevant simplified corpora are therefore critical for everyone. In this research, we propose a method that compares two types of documents of different difficulty and selects, from the various characteristics of the text, a characteristic related to simplicity of expression. In the proposed method, thousands of text characteristics are compressed and transformed by Non-negative Matrix Factorization (NMF), and a basis that characterizes the simplified documents is selected. The proposed method combines the characteristics used in most prior research, covering 32 types and 2,196 dimensions. We evaluated the text characteristics in the resulting NMF bases using a classifier. Applying the proposed method to two kinds of environment white papers made it clear that an effective basis can be selected. In addition, we present an estimation of the causal relationships and an optimization of the parameters. Furthermore, we show that the method is flexible enough to apply to other media.
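
The pipeline described in the abstract (factorize a documents-by-characteristics matrix with NMF, then pick the basis that best separates simplified from original documents) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the four toy characteristics stand in for the 2,196 dimensions, the values are invented, the factorization uses the standard multiplicative update rules for the Frobenius objective, and the paper's classifier-based evaluation is replaced by a simple gap between group means. All function names are hypothetical.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize non-negative V (docs x features) as W (docs x k) times H (k x features).

    Uses multiplicative updates, which keep W and H non-negative.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - WH."""
    R = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, R) for v, r in zip(rv, rr)) ** 0.5


def select_basis(W, labels):
    """Pick the basis whose mean activation differs most between the two groups.

    Assumption: this mean-gap score is a stand-in for the classifier used in the paper.
    """
    def mean(values):
        return sum(values) / len(values)

    gaps = []
    for j in range(len(W[0])):
        simplified = [row[j] for row, y in zip(W, labels) if y == 1]
        original = [row[j] for row, y in zip(W, labels) if y == 0]
        gaps.append(abs(mean(simplified) - mean(original)))
    return max(range(len(gaps)), key=gaps.__getitem__), gaps


# Toy data (invented): rows are documents, columns are four hypothetical,
# already-normalized text characteristics.
V = [
    [0.9, 0.8, 0.1, 0.2],  # original documents
    [0.8, 0.9, 0.2, 0.1],
    [1.0, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.9, 0.8],  # simplified documents
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.1, 1.0, 0.7],
]
labels = [0, 0, 0, 1, 1, 1]  # 1 = simplified, 0 = original

W, H = nmf(V, k=2)
best, gaps = select_basis(W, labels)
print("reconstruction error:", round(frobenius_error(V, W, H), 4))
print("selected basis:", best, "feature weights:", [round(h, 3) for h in H[best]])
```

The row of `H` for the selected basis shows which characteristics carry the most weight in that basis, which is the kind of interpretable output the abstract attributes to basis selection. For real feature counts, an optimized library implementation of NMF would be a more practical choice than these plain-Python loops.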
